I find it a bit odd to use different training datasets depending on the benchmark. For example, with Videochat2, the model was trained on all instruction datasets together and then evaluated on the various benchmarks (as far as I know). For ST-LLM, however, different instruction datasets are used for training depending on the benchmark, and each setting is evaluated separately. Doesn't this seem unfair? I'm curious about the rationale behind splitting the data this way.
Hello, thank you for raising this point. The reason we did this is that instruction data in the form of multiple-choice questions from datasets such as K400, SSV2, and CLEVRER, while beneficial for MVBench, severely degrades the model's dialogue performance and leads to significant hallucinations. In the end, our approach used less data while achieving better results.