ValueError: 151859 is not in list #5

Open
Inistlwq opened this issue Dec 21, 2023 · 4 comments

Comments

@Inistlwq

image = image[ : image.index(self.config.visual['image_start_id'] + 2)]
ValueError: 151859 is not in list
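For context, `list.index` raises this `ValueError` whenever the requested value is absent from the list. A minimal sketch of a more descriptive variant follows; `truncate_before` is a hypothetical helper, not part of this repository, and the sample ids are made up for illustration.

```python
import math


def truncate_before(ids, marker):
    """Return the ids before the first occurrence of marker.

    Hypothetical helper: same slicing as the failing line, but the error
    reports how many values are floats or non-finite.
    """
    try:
        return ids[: ids.index(marker)]
    except ValueError:
        # Floats or inf values in a list of token ids hint at an
        # unwanted dtype change somewhere upstream.
        floats = sum(isinstance(x, float) for x in ids)
        non_finite = sum(
            isinstance(x, float) and not math.isfinite(x) for x in ids
        )
        raise ValueError(
            f"{marker} is not in list "
            f"({floats} float value(s), {non_finite} non-finite)"
        ) from None


marker = 151859  # the value from the traceback above
print(truncate_before([101, 7, marker, 9], marker))
```

A check like this does not fix the problem; it only makes it easier to see whether the ids were already corrupted before the slice.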


I'm using the original data and code and keep hitting this error. Could you take a look?

@TobiasLee
Collaborator

Hi, it looks like your input_ids are float? They should be long. Could something be going wrong in the processing somewhere?

@Inistlwq
Author

Inistlwq commented Dec 23, 2023

> Hi, it looks like your input_ids are float? They should be long. Could something be going wrong in the processing somewhere?

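I haven't added any processing; the data and code are essentially unchanged. The input does start out as long. The float is probably the result of an inf changing the dtype of the whole tensor. I can see inf values, and I suspect fp16 is the cause (my training machine doesn't support bf16). Have you tried training with fp16?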

@TobiasLee
Collaborator

It may be an fp16 precision issue. Could you try force-casting input_ids to Long? We train with bf16 on A100s, so we haven't tried fp16.

@Inistlwq
Author

Inistlwq commented Dec 23, 2023

> It may be an fp16 precision issue. Could you try force-casting input_ids to Long? We train with bf16 on A100s, so we haven't tried fp16.

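Casting definitely won't work: inf means the precision has already been lost, so the original values can't be recovered. That said, I tried bf16 on a 3090 yesterday and it did work, but it ran out of GPU memory.
I previously ran SFT on Qwen-VL with fp16 on a V100 without problems, but DPO with fp16 in this code has issues. I'll look further into the implementation differences; I think fp16 should be supportable as well.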
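To illustrate why a cast back to Long cannot repair the ids, here is a small sketch that uses Python's `struct` half-precision format (`"e"`) as a stand-in for an fp16 tensor cast. The helpers `to_fp16` and `survives_fp16` are hypothetical, and the thread does not confirm where in the training code such a conversion actually happens.

```python
import math
import struct


def to_fp16(value):
    """Round-trip a Python float through IEEE 754 half precision."""
    try:
        return struct.unpack("<e", struct.pack("<e", value))[0]
    except OverflowError:
        # struct refuses to pack out-of-range values; here we mimic a
        # cast that overflows to infinity instead (an assumption about
        # how the tensor cast behaves).
        return math.copysign(math.inf, value)


def survives_fp16(token_id):
    """True if casting to fp16 and back to int recovers token_id exactly."""
    half = to_fp16(float(token_id))
    return math.isfinite(half) and int(half) == token_id


print(survives_fp16(1234))    # small ids are exactly representable
print(survives_fp16(40001))   # rounded to a neighbouring representable value
print(survives_fp16(151859))  # above the fp16 maximum of 65504
```

Under these assumptions, any id above 65504 is lost once it passes through fp16, and many ids above 2048 are rounded to a different value, so the ids would have to stay in an integer dtype for the whole pipeline.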
