Hi, I have a question about why extra conv layers are added in the Encoder. This encoder structure is also used in later SOTA methods such as E2FGVI and ProPainter. The FuseFormer paper says:
Other network structures including the CNN encoder, decoder and discriminator are the same as STTN [43], except that we insert several convolutional layers between encoder and the first Transformer block to compensate for aggressive channel reduction in patch tokenization.
I do not understand this "compensate for aggressive channel reduction in patch tokenization"; it seems to me that there is no channel reduction at all. Below is my understanding:
Before tokenization, the frames' feature shape is [b, t, c=128, h/4=60, w/4=128]. After patching with a (k, k) kernel and tokenizing, the feature becomes [b, t, (n1, n2), (c, k, k)], where n1, n2 denote the number of patches along height and width.
If we reshape a patch token back to image form, it becomes [b, t, c, k, k], and the channel count is still c, so I am wondering where the channel reduction happens. 🤔
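To make the shape reasoning above concrete, here is a minimal bookkeeping sketch. The patch size, stride, and Transformer hidden size d below are illustrative assumptions (not values taken from the paper), and patches are assumed non-overlapping:

```python
# Shape bookkeeping for the patch tokenization described above.
b, t, c = 1, 2, 128          # batch, frames, encoder output channels
h, w = 60, 108               # feature-map spatial size (assumed h/4, w/4 of the input)
k, stride = 6, 6             # hypothetical non-overlapping (k, k) patches

n1 = (h - k) // stride + 1   # number of patches along height
n2 = (w - k) // stride + 1   # number of patches along width
token_dim = c * k * k        # each token keeps all c*k*k values -> no channel loss here

print((b, t, n1 * n2, token_dim))  # token grid: [b, t, n1*n2, c*k*k]

# If each token is then linearly embedded to the Transformer hidden size d,
# that projection (c*k*k -> d) is one place a large dimensionality drop could
# occur; d = 512 is an assumed value for illustration only.
d = 512
print(token_dim, "->", d)
```

This only mirrors the shapes I described; whether the "aggressive channel reduction" refers to such an embedding step is exactly what I am asking.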
Here is the Encoder structure I redrew by referencing the code: