
Fixed training interrupt bug #123

Open (wants to merge 1 commit into main)


bobo0810

Before the fix:

```
TypeError: Caught TypeError in DataLoader worker process 6.
  File "/video_llama/datasets/datasets/webvid_datasets.py", line 70, in __getitem__
    video_path = self._get_video_path(sample_dict)
  File "/video_llama/datasets/datasets/webvid_datasets.py", line 50, in _get_video_path
    rel_video_fp = os.path.join(sample['page_dir'], str(sample['videoid']) + '.mp4')
  File "/opt/conda/lib/python3.10/posixpath.py", line 76, in join
    a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not float
```
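The crash can be reproduced outside the dataset. `os.path.join` first passes its argument to `os.fspath`, which only accepts `str`, `bytes`, or `os.PathLike`. A missing `page_dir` shows up as NaN, which is a `float`, so the join raises before any file is opened. (NaN is what pandas typically produces for an empty CSV cell; that the annotations are loaded this way is an assumption here.) The `sample` dict and `get_video_path` helper below are hypothetical stand-ins that mirror the failing line.

```python
import os

# Hypothetical annotation row whose page_dir is missing; NaN is a float.
sample = {"page_dir": float("nan"), "videoid": 24205120}


def get_video_path(sample):
    # Mirrors the failing line: page_dir is passed to os.path.join unchanged.
    return os.path.join(sample["page_dir"], str(sample["videoid"]) + ".mp4")


error_message = None
try:
    get_video_path(sample)
except TypeError as err:
    error_message = str(err)

print(error_message)
# prints: expected str, bytes or os.PathLike object, not float
```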

After the fix:

```
Train: data epoch: [1]  [ 150/2500]  eta: 0:10:20  lr: 0.000098  loss: 2.7766  time: 0.2573  data: 0.0000  max mem: 53623
[15:49:40]ERROR opening: /alluxio/multi-data/webvid/val_file/nan/24205120.mp4, No such file or directory
Failed to load examples with video: /alluxio/multi-data/webvid/val_file/nan/24205120.mp4. Will randomly sample an example as a replacement.
Train: data epoch: [1]  [ 200/2500]  eta: 0:10:04  lr: 0.000098  loss: 2.3127  time: 0.2587  data: 0.0000  max mem: 53623
```
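In the log above, training continues because the unreadable video is replaced with a randomly sampled example instead of crashing the worker. The sketch below shows that replace-on-failure pattern in isolation. It is not the repository's code: `FallbackVideoDataset`, `load_video` (which only checks that the file exists), and the `max_retries` cap are all hypothetical.

```python
import logging
import os
import random


class FallbackVideoDataset:
    """Hypothetical map-style dataset that swaps in a random sample on load failure."""

    def __init__(self, video_paths, max_retries=10, seed=None):
        self.video_paths = list(video_paths)
        # Hypothetical cap so a dataset full of bad rows cannot loop forever.
        self.max_retries = max_retries
        self.rng = random.Random(seed)

    def __len__(self):
        return len(self.video_paths)

    def load_video(self, path):
        # Stand-in for real video decoding: only checks that the file exists.
        if not os.path.exists(path):
            raise FileNotFoundError(path)
        return path

    def __getitem__(self, index):
        for _ in range(self.max_retries):
            path = self.video_paths[index]
            try:
                return self.load_video(path)
            except FileNotFoundError:
                logging.warning(
                    "Failed to load examples with video: %s. "
                    "Will randomly sample an example as a replacement.", path)
                # Pick a new random index and try again.
                index = self.rng.randrange(len(self))
        raise RuntimeError(f"No loadable video after {self.max_retries} attempts")
```

Capping the retries is a deliberate choice in this sketch: without it, a dataset in which every path is broken would hang the DataLoader worker instead of failing loudly.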

@bobo0810
Author

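Fixed a bug that interrupted training when `page_dir` was NaN. Because NaN is a float, `os.path.join` raised a `TypeError` inside a DataLoader worker, which stopped training. With the fix, such a sample is treated as an unreadable video and replaced with a randomly sampled example, as the log above shows.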
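The post-fix log path ends in `nan/24205120.mp4`. That is consistent with `page_dir` being converted with `str()` before the join, so NaN becomes the literal directory name `"nan"` and the bad row turns into a missing-file error rather than a crash. The actual diff is not reproduced here. The sketch below assumes that approach, and its `video_root` parameter is hypothetical.

```python
import os


def get_video_path(sample, video_root):
    # Cast both parts to str so a NaN page_dir becomes the literal "nan"
    # instead of raising TypeError inside os.path.join.
    rel_video_fp = os.path.join(str(sample["page_dir"]), str(sample["videoid"]) + ".mp4")
    return os.path.join(video_root, rel_video_fp)


path = get_video_path({"page_dir": float("nan"), "videoid": 24205120}, "/data/webvid")
print(path)
# prints (on POSIX): /data/webvid/nan/24205120.mp4
```

Casting keeps the change to one line, but each bad row still costs a failed file open at training time. If the annotations are loaded with pandas, an alternative is to drop such rows up front, for example with `DataFrame.dropna(subset=["page_dir"])`.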
