Question on multi-image input #24

auhowielau · 2024-03-18T19:26:00Z

Some models (e.g., LLaVA 1.5) cannot take more than three images as input because of their limited input length (e.g., 2048 tokens). However, Evaluation Dimensions 17-24 of SEED-Bench-2 may require inputs of up to 8 images. How do you handle such cases? Thanks!

Bohao-Lee · 2024-03-19T04:37:29Z

In our code, we concatenate the images to handle such cases, as we do for other models. In our experiments, the LLaVA model produces reasonable loss values.

auhowielau · 2024-03-19T08:56:31Z

For the LLaVA 1.5 model, does the concatenation turn N input images into N×576 visual tokens? If so, would an 8-frame input be truncated, since 576×8 = 4608 far exceeds the input length limit of 2048? Thanks!
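
To make the arithmetic behind this question concrete, here is a minimal sketch of the token budget. It assumes the two figures quoted in this thread (576 visual tokens per image, a 2048-token input limit) and that each image is encoded separately, which is exactly what the question asks about and is not confirmed here. The function names and the `text_tokens` parameter are hypothetical and are not taken from the SEED-Bench code.

```python
# Figures quoted in this thread; whether they apply depends on how the
# evaluation code actually concatenates images (unconfirmed assumption).
TOKENS_PER_IMAGE = 576   # visual tokens per image for LLaVA 1.5
CONTEXT_LIMIT = 2048     # input length limit in tokens


def visual_token_count(num_images: int,
                       tokens_per_image: int = TOKENS_PER_IMAGE) -> int:
    """Total visual tokens if every image is encoded separately."""
    return num_images * tokens_per_image


def max_images_that_fit(context_limit: int = CONTEXT_LIMIT,
                        tokens_per_image: int = TOKENS_PER_IMAGE,
                        text_tokens: int = 0) -> int:
    """Largest image count whose visual tokens, plus text_tokens
    (hypothetical prompt length), stay within the context limit."""
    remaining = context_limit - text_tokens
    return max(0, remaining // tokens_per_image)


if __name__ == "__main__":
    for n in (1, 3, 4, 8):
        total = visual_token_count(n)
        status = "fits" if total <= CONTEXT_LIMIT else "exceeds limit"
        print(f"{n} image(s): {total} visual tokens ({status})")
    print("max images with no text:", max_images_that_fit())
```

Under these assumptions, at most three images fit before any text tokens are counted, which matches the ">3 images" concern in the original question.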
