Architecture of ALBEF #144
Comments
I think you can refer to the reviews of this paper for this architecture question. Reference: Reviewer XRzR here.
@phphuc612 Yes, I have read those reviews, and the one you are referring to says the authors wanted the text encoder and the multimodal encoder to share a similar architecture. I understand that, but I thought there might be a more specific reason for the choice. Why exactly 6 layers of the same BERT-base? They could have used all 12 layers of BERT-base as the text encoder and another 12 layers (from the same BERT-base or some other BERT) as the multimodal encoder. Reusing the same BERT-base for both roles might make the model more prone to overfitting, in my view, whereas using two different models, e.g. BERT-base as the text encoder and DistilBERT as the multimodal encoder, could be worth considering. Also, the multimodal encoder should arguably have more layers, since its task, understanding the complex interactions between image and text, is harder than the text encoder's. What are your thoughts on these points? I would be happy to hear from you. :)
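For a rough sense of scale between the two designs being compared, here is a small sketch (my own back-of-the-envelope check, not anything from the ALBEF paper or repo) that counts parameters for (a) a single BERT-base split 6/6 between text and multimodal encoders versus (b) a full 12-layer BERT-base text encoder plus a separate DistilBERT-sized multimodal encoder. It ignores the extra cross-attention weights that any fusion module would add on top of either design.

```python
# Hedged parameter-count comparison of the two designs discussed above.
# (a) one shared bert-base split 6/6; (b) bert-base + distilbert as two models.
from transformers import BertModel, DistilBertModel

bert = BertModel.from_pretrained("bert-base-uncased")
distil = DistilBertModel.from_pretrained("distilbert-base-uncased")

split_design = bert.num_parameters()                                 # design (a)
two_model_design = bert.num_parameters() + distil.num_parameters()   # design (b)

print(f"(a) single bert-base, split 6/6: {split_design / 1e6:.1f}M params")
print(f"(b) bert-base + distilbert:      {two_model_design / 1e6:.1f}M params")
```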
Thanks for sharing a new insight, @Asaad-Pak. From my point of view, the authors' contribution lies more in the design of the learning paradigm than in the architecture: (1) "momentum distillation", which maintains a momentum encoder for momentum contrast and mines hard negative samples for image-text matching, and (2) aligning the image and text latent spaces through contrastive learning as a base pretext task before fusing them for further tasks such as masked language modeling and image-text matching. If I were the authors, I would probably do what most people do on a first attempt: pick an architecture (BERT) that already provides a text-encoding module, add a fusion module (e.g. cross-attention), and make only a slight modification (splitting the layers in half rather than tuning the split) so it quickly adapts to my purpose. That sidesteps the architecture-design problem and leaves more room to discuss the two points above. Indeed, they devote an entire section to momentum distillation and another to the "Mutual Information Maximization Perspective", and their ablation study likewise focuses on how the learning strategy affects model performance. Those are my thoughts on why they chose that architecture. Coming back to architecture design, I think your idea is worth investigating, because the text keeps flowing through the same architecture while the image acts more like a "conditional guide" for the text via cross-attention. However, note that the fusion stage is not shaped only by the BERT layers; it is also indirectly influenced by the contrastive learning, so keep this in mind if you compare architectures in follow-up research.
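To make the "split BERT-base in half, condition the second half on the image via cross-attention" idea concrete, here is a minimal sketch. It is not ALBEF's actual implementation: the class name `ALBEFStyleFusion`, the single residual cross-attention block per fusion layer, and the random `image_embeds` standing in for a ViT output are all my own simplifications for illustration.

```python
# Minimal sketch (not ALBEF's code): split bert-base-uncased into a 6-layer
# text encoder and a 6-layer multimodal encoder, adding cross-attention so the
# second half can attend to image features.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class ALBEFStyleFusion(nn.Module):
    def __init__(self, bert_name="bert-base-uncased", hidden=768, heads=12):
        super().__init__()
        bert = BertModel.from_pretrained(bert_name)
        self.embeddings = bert.embeddings
        # First 6 transformer layers act as the unimodal text encoder.
        self.text_layers = bert.encoder.layer[:6]
        # Last 6 layers are reused as the multimodal encoder; image features are
        # injected via cross-attention (a simplification of ALBEF's per-layer design).
        self.fusion_layers = bert.encoder.layer[6:]
        self.cross_attn = nn.ModuleList(
            nn.MultiheadAttention(hidden, heads, batch_first=True)
            for _ in self.fusion_layers
        )

    def forward(self, input_ids, image_embeds):
        x = self.embeddings(input_ids=input_ids)
        for layer in self.text_layers:           # unimodal text encoding
            x = layer(x)[0]
        text_feat = x                            # would feed the image-text contrastive loss
        for layer, xattn in zip(self.fusion_layers, self.cross_attn):
            x = layer(x)[0]                      # self-attention over text
            attn_out, _ = xattn(x, image_embeds, image_embeds)
            x = x + attn_out                     # condition the text on image features
        return text_feat, x                      # unimodal and fused representations

# Toy usage: random "patch" embeddings stand in for a ViT-B/16 image encoder output.
tok = BertTokenizer.from_pretrained("bert-base-uncased")
batch = tok(["a dog on a skateboard"], return_tensors="pt")
image_embeds = torch.randn(1, 197, 768)
model = ALBEFStyleFusion()
text_feat, fused = model(batch["input_ids"], image_embeds)
print(text_feat.shape, fused.shape)
```

The sketch also shows why the 6/6 split is convenient in practice: both halves inherit pretrained BERT weights, and only the cross-attention modules start from scratch.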
Hello, I would like to run some experiments with the ALBEF model. I have reviewed your paper, but I do not understand why the first six layers of BERT-base are used as the text encoder and the last six layers as the multimodal encoder. Why wasn't the entire 12-layer BERT-base used for both the text encoder and the multimodal encoder? Your help in this regard would be greatly appreciated. @LiJunnan1992 @svc-scm @chenxwh