Hi, I have a question about why extra conv layers are added in the Encoder. This encoder structure is also used in later SOTA methods such as E2FGVI and ProPainter. The FuseFormer paper says:
Other network structures including the CNN encoder, decoder and discriminator are the same as STTN [43], except that we insert several convolutional layers between encoder and the first Transformer block to compensate for aggressive channel reduction in patch tokenization.
I do not understand this "compensate for aggressive channel reduction in patch tokenization"; it seems to me that there is no channel reduction at all. Below is my understanding:
Before tokenization, the frames' feature shape is [b, t, c=128, h/4=60, w/4=128]. After patching with a (k, k) kernel and tokenizing, the feature becomes [b, t, (n1, n2), (c, k, k)], where n1, n2 denote the number of patches along height and width.
If we reshape a patch token back to image form, it becomes [b, t, c, k, k], and the channel count is still c, so I am wondering where the channel reduction happens. 🤔
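To make the shape reasoning above concrete, here is a minimal bookkeeping sketch. The patch size, stride, and Transformer hidden size d below are illustrative assumptions (not values taken from the paper), and patches are assumed non-overlapping:

```python
# Shape bookkeeping for the patch tokenization described above.
b, t, c = 1, 2, 128          # batch, frames, encoder output channels
h, w = 60, 108               # feature-map spatial size (assumed h/4, w/4 of the input)
k, stride = 6, 6             # hypothetical non-overlapping (k, k) patches

n1 = (h - k) // stride + 1   # number of patches along height
n2 = (w - k) // stride + 1   # number of patches along width
token_dim = c * k * k        # each token keeps all c*k*k values -> no channel loss here

print((b, t, n1 * n2, token_dim))  # token grid: [b, t, n1*n2, c*k*k]

# If each token is then linearly embedded to the Transformer hidden size d,
# that projection (c*k*k -> d) is one place a large dimensionality drop could
# occur; d = 512 is an assumed value for illustration only.
d = 512
print(token_dim, "->", d)
```

This only mirrors the shapes I described; whether the "aggressive channel reduction" refers to such an embedding step is exactly what I am asking.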
Here is the Encoder structure I redrew by referencing the code: