Question for the extra conv layers in Encoder #24

Open
LokiXun opened this issue Oct 6, 2023 · 0 comments

Comments

LokiXun commented Oct 6, 2023

Hi, I have a question about why extra conv layers are added in the Encoder. This encoder structure is also used in later SOTA methods such as E2FGVI and ProPainter. The FuseFormer paper says:

Other network structures including the CNN encoder, decoder and discriminator are the same as STTN [43], except that we insert several convolutional layers between encoder and the first Transformer block to compensate for aggressive channel reduction in patch tokenization.
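The "several convolutional layers" quoted above can be sketched as follows. This is a hedged sketch only: the layer count, 128-channel width, 3×3 kernels, and LeakyReLU activations are illustrative assumptions, not the authors' exact code.

```python
import torch
import torch.nn as nn

# Hedged sketch of conv layers inserted between the CNN encoder output
# and the first Transformer block. They keep the spatial size and channel
# count while mixing features before patch tokenization.
extra_convs = nn.Sequential(
    nn.Conv2d(128, 128, kernel_size=3, padding=1),
    nn.LeakyReLU(0.2),
    nn.Conv2d(128, 128, kernel_size=3, padding=1),
    nn.LeakyReLU(0.2),
)

x = torch.randn(2, 128, 60, 128)  # [b*t, c, h/4, w/4], shapes from the question
print(extra_convs(x).shape)       # torch.Size([2, 128, 60, 128])
```

Note that in this sketch the convs change neither the spatial size nor the channel count, which is consistent with the question below about where the reduction happens.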

I did not understand this "compensate for aggressive channel reduction in patch tokenization"; it seems there is no channel reduction at all. Here is my understanding:
Before tokenization, the frames' feature shape is [b, t, c=128, h/4=60, w/4=128]. After patching with a (k, k) kernel and tokenizing, the feature becomes [b, t, (n1, n2), (c, k, k)], where n1, n2 denote the number of patches along height and width.
If we reshape a patch token back to an image patch, it becomes [b, t, c, k, k]; the channel count is still c, so I am wondering where the channel reduction is. 🤔
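The shape bookkeeping above can be checked with `torch.nn.Unfold`. One possible reading of the paper's phrase is that the reduction happens not in the unfold itself but in the linear patch embedding that follows it, which maps each c·k·k token to a much smaller hidden dim. In the sketch below, k=7, stride=3, and hidden dim 512 are assumed illustrative values, not taken from the repo:

```python
import torch
import torch.nn as nn

b, t, c, h, w = 1, 2, 128, 60, 128   # shapes quoted in the question
k, stride = 7, 3                      # assumed patch kernel / stride

feat = torch.randn(b * t, c, h, w)
unfold = nn.Unfold(kernel_size=(k, k), stride=(stride, stride))
tokens = unfold(feat)                 # [b*t, c*k*k, n1*n2]

n1 = (h - k) // stride + 1            # 18 patches along height
n2 = (w - k) // stride + 1            # 41 patches along width
print(tokens.shape)                   # torch.Size([2, 6272, 738])

# A STTN/FuseFormer-style patch embedding then projects each c*k*k = 6272
# token down to a much smaller hidden dim (512 assumed here) -- a possible
# reading of the "aggressive channel reduction":
embed = nn.Linear(c * k * k, 512)
print(embed(tokens.transpose(1, 2)).shape)  # torch.Size([2, 738, 512])
```

Under this reading, reshaping a token back to [c, k, k] is lossless only before the linear projection; after it, each patch has been squeezed from 6272 values to 512.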

Here is the Encoder structure I redraw by referencing the code:
[image: redrawn Encoder structure]
