
several breakages due to recent datasets #542

Closed
stas00 opened this issue Jan 29, 2024 · 17 comments · Fixed by #578
Comments

@stas00
Contributor

stas00 commented Jan 29, 2024

It seems that datasets==2.16.0 and higher break evaluate:

$ cat test-evaluate.py
from evaluate import load
import os
import torch.distributed as dist

dist.init_process_group("nccl")

rank = int(os.environ.get("LOCAL_RANK", 0))
world_size = dist.get_world_size()

metric = load("accuracy",
              experiment_id="test4",
              num_process=world_size,
              process_id=rank)
metric.add_batch(predictions=[], references=[])

Problem 1. umask isn't being respected when creating lock files

As we work in a shared group setting, we use umask 000.

But this script creates lock files with missing group/other permissions:

-rw-r--r-- 1 [...]/metrics/accuracy/default/test4-2-rdv.lock

which is invalid, since umask 000 should have led to:

-rw-rw-rw- 1 [...]/metrics/accuracy/default/test4-2-rdv.lock

The problem applies to all the other lock files created during such a run (there are a few more .lock files in that directory).

This is the same issue that was reported and dealt with multiple times in datasets.

If I downgrade to datasets==2.15.0, the files are created correctly with:

-rw-rw-rw- 
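For context, a umask can only clear permission bits from the mode a program requests; it can never add them. The following is a minimal stdlib sketch (Linux/macOS; temporary paths) illustrating why an explicit restrictive mode such as 0o644 — which recent filelock versions appear to request for lock files — cannot be widened back to 0o666 by umask 000:

```python
import os
import stat
import tempfile

os.umask(0o000)  # the shared-group setting from the report: mask nothing

d = tempfile.mkdtemp()

def create_with_mode(name, mode):
    """Create a file requesting `mode`, return the mode it actually got."""
    path = os.path.join(d, name)
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, mode)
    os.close(fd)
    return stat.S_IMODE(os.stat(path).st_mode)

# Requesting 0o666 under umask 000 gives the expected -rw-rw-rw-:
print(oct(create_with_mode("a.lock", 0o666)))  # 0o666

# But a umask can only *clear* bits, never add them: if a library asks
# for 0o644, umask 000 cannot restore the group/other write bits, and
# the lock file ends up -rw-r--r-- as seen above:
print(oct(create_with_mode("b.lock", 0o644)))  # 0o644
```

So if the library creating the lock passes 0o644 explicitly, no umask setting on the caller's side can produce 0o666.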

Problem 2. Expected to find locked file /data/huggingface/metrics/accuracy/default/test4-2-0.arrow.lock from process 1 but it doesn't exist.

$ python -u -m torch.distributed.run --nproc_per_node=2 --rdzv_endpoint localhost:6000  --rdzv_backend c10d test-evaluate.py
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Using the latest cached version of the module from /data/huggingface/modules/evaluate_modules/metrics/evaluate-metric--accuracy/f887c0aab52c2d38e1f8a215681126379eca617f96c447638f751434e8e65b14 (last modified on Mon Jan 29 18:42:31 2024) since it couldn't be found locally at evaluate-metric--accuracy, or remotely on the Hugging Face Hub.
Using the latest cached version of the module from /data/huggingface/modules/evaluate_modules/metrics/evaluate-metric--accuracy/f887c0aab52c2d38e1f8a215681126379eca617f96c447638f751434e8e65b14 (last modified on Mon Jan 29 18:42:31 2024) since it couldn't be found locally at evaluate-metric--accuracy, or remotely on the Hugging Face Hub.
Traceback (most recent call last):
  File "/home/stas/test/test-evaluate.py", line 14, in <module>
    metric.add_batch(predictions=[], references=[])
  File "/env/lib/conda/evaluate-test/lib/python3.9/site-packages/evaluate/module.py", line 510, in add_batch
    self._init_writer()
  File "/env/lib/conda/evaluate-test/lib/python3.9/site-packages/evaluate/module.py", line 656, in _init_writer
    self._check_all_processes_locks()  # wait for everyone to be ready
  File "/env/lib/conda/evaluate-test/lib/python3.9/site-packages/evaluate/module.py", line 350, in _check_all_processes_locks
    raise ValueError(
ValueError: Expected to find locked file /data/huggingface/metrics/accuracy/default/test4-2-0.arrow.lock from process 0 but it doesn't exist.
Traceback (most recent call last):
  File "/home/stas/test/test-evaluate.py", line 14, in <module>
    metric.add_batch(predictions=[], references=[])
  File "/env/lib/conda/evaluate-test/lib/python3.9/site-packages/evaluate/module.py", line 510, in add_batch
    self._init_writer()
  File "/env/lib/conda/evaluate-test/lib/python3.9/site-packages/evaluate/module.py", line 659, in _init_writer
    self._check_rendez_vous()  # wait for master to be ready and to let everyone go
  File "/env/lib/conda/evaluate-test/lib/python3.9/site-packages/evaluate/module.py", line 362, in _check_rendez_vous
    raise ValueError(
ValueError: Expected to find locked file /data/huggingface/metrics/accuracy/default/test4-2-0.arrow.lock from process 1 but it doesn't exist.

The files are there:

-rw-rw-rw- 1 stas stas 0 Jan 29 22:14 /data/huggingface/metrics/accuracy/default/test4-2-0.arrow
-rw-r--r-- 1 stas stas 0 Jan 29 22:15 /data/huggingface/metrics/accuracy/default/test4-2-0.arrow.lock
-rw-rw-rw- 1 stas stas 0 Jan 29 22:14 /data/huggingface/metrics/accuracy/default/test4-2-1.arrow
-rw-r--r-- 1 stas stas 0 Jan 29 22:14 /data/huggingface/metrics/accuracy/default/test4-2-1.arrow.lock
-rw-r--r-- 1 stas stas 0 Jan 29 22:14 /data/huggingface/metrics/accuracy/default/test4-2-rdv.lock
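For reference, evaluate's multi-process rendezvous is built on these lock files: each rank takes an exclusive lock on its own {experiment_id}-{world_size}-{rank}.arrow.lock, and the readiness check treats a rank as present when its lock *cannot* be acquired by the checker. A simplified single-machine sketch of that scheme using fcntl.flock (Linux-only; the names are illustrative, not evaluate's internals):

```python
import fcntl
import os
import tempfile

def try_acquire(path):
    """Return an fd holding an exclusive non-blocking lock on `path`,
    or None if the lock is already held elsewhere."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o666)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return fd
    except OSError:
        os.close(fd)
        return None

lock_path = os.path.join(tempfile.mkdtemp(), "test4-2-1.arrow.lock")

# Rank 1 announces readiness by holding its lock:
rank1_fd = try_acquire(lock_path)

# Rank 0 verifies rank 1 is present: if it *can* grab the lock,
# rank 1 isn't there yet -- that's when the ValueError in the
# traceback above would fire.
probe = try_acquire(lock_path)
if probe is None:
    print("rank 1 is ready")
else:
    os.close(probe)
    print("Expected to find locked file but it doesn't exist")
```

This also shows why the check is sensitive to the lock library's behavior: if a new filelock version changes how or when the lock file is created or released, the "is this rank's lock held?" probe can fail even though the file itself exists on disk.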

If I downgrade to datasets==2.15.0, the above code starts to work.

To summarize: datasets<2.16 works, datasets>=2.16 breaks.

Using evaluate==0.4.1
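Since the triage in this thread hinges on exact package versions, a tiny stdlib helper (the package names are the PyPI ones discussed here) can print them for a report:

```python
from importlib.metadata import PackageNotFoundError, version

def report(packages=("evaluate", "datasets", "filelock")):
    """Return 'pkg==x.y.z' lines for installed packages,
    or a 'not installed' note for missing ones."""
    lines = []
    for pkg in packages:
        try:
            lines.append(f"{pkg}=={version(pkg)}")
        except PackageNotFoundError:
            lines.append(f"{pkg}: not installed")
    return lines

print("\n".join(report()))
```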

Thank you!

@lhoestq

@williamberrios who reported this

@stas00 stas00 closed this as completed Jan 29, 2024
@stas00 stas00 changed the title umask isn't being respected when creating lock files several breakges due to recent datasets Jan 29, 2024
@stas00 stas00 reopened this Jan 29, 2024
@stas00 stas00 changed the title several breakges due to recent datasets several breakages due to recent datasets Jan 29, 2024
@stas00
Contributor Author

stas00 commented Jan 29, 2024

@lhoestq, I updated the OP and was able to bisect which package and version led to the breakage.

@lhoestq
Member

lhoestq commented Jan 30, 2024

It seems to be an issue with recent versions of filelock? I was able to reproduce it using the latest version, 3.13.1.

Can you try using an older version? E.g., I use 3.9.0, which seems to work fine:

pip install "filelock==3.9.0"

@lhoestq
Member

lhoestq commented Jan 30, 2024

I just opened huggingface/datasets#6631 in datasets to fix this.

Can you try it out? Once I have your green light, I can make a new release.

@stas00
Contributor Author

stas00 commented Jan 31, 2024

Thanks a lot, @lhoestq!

@williamberrios - could you please test this ASAP, and if everything works again they can make a new release - thank you!

@williamberrios

Hi @lhoestq, filelock==3.9.0 fixed my issue with distributed evaluation. Thanks a lot ❤️

@stas00
Contributor Author

stas00 commented Feb 2, 2024

Thank you for confirming it solved your problem, William!

@jxmorris12

Problem 2 is affecting me too. Downgrading fixed it but it frustrates me that I have to downgrade filelock on every machine I want to use multi-node evaluate on; is there another workaround? Can we get this fixed @stas00?

@stas00
Contributor Author

stas00 commented Mar 7, 2024

Not sure why you've tagged me, Jack ;) I have just reported the problem on behalf of my colleague.

@jxmorris12

sorry :)

@stas00
Contributor Author

stas00 commented Apr 29, 2024

@lhoestq, is it possible to make a new release now that this issue has been fixed? Thank you!

@lhoestq
Member

lhoestq commented Apr 30, 2024

just released 0.4.2 :)

@stas00
Contributor Author

stas00 commented Apr 30, 2024

Thank you very much, Quentin!

@raghavm1

raghavm1 commented Jul 9, 2024

Unfortunately, I'm facing the same error with the latest versions of evaluate (0.4.2), datasets (2.20.2), and filelock (3.15.4). Downgrading datasets/filelock also doesn't seem to fix the issue for me, in spite of the lock files being present in the cache_dir.
Any suggestions to troubleshoot this error?

@yaraaa7

yaraaa7 commented Aug 25, 2024

Hi,
Did it end up working for you? I'm facing this issue now

@raghavm1

raghavm1 commented Aug 28, 2024

Hi, Did it end up working for you? I'm facing this issue now

Unfortunately not.
I'm also noticing that evaluate is no longer being actively maintained. I'm not sure, but it might be useful to raise this issue in the accelerate repo to learn what the next steps on this bug could be.
For now, I'm sticking to a single-node setup where this issue doesn't occur.

@lhoestq
Member

lhoestq commented Aug 29, 2024

I could take a look if you can provide a google colab or script that reproduces the issue :)

@ffrancesco94

+1 from me. In my case, it is complete_nlp_example.py from the accelerate repo that fails if I run multi-GPU, multi-node. The script needs to be slightly tweaked (the version in the repo doesn't set the number of processes and the process id when evaluate.load() is called), but after that it dies in the metric evaluation part with this error.
