[ICML 2024] CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers.
framework
transformer
image-captioning
visual-reasoning
multimodal-learning
visual-question-answering
model-acceleration
efficient-deep-learning
vision-language-transformer
image-text-retrieval
text-image-retrieval
token-ensemble
token-matching
Updated Oct 4, 2023