
several breakages due to recent datasets #542

Closed
stas00 opened this issue Jan 29, 2024 · 17 comments · Fixed by #578
Comments

@stas00
Contributor

stas00 commented Jan 29, 2024

It seems that datasets==2.16.0 and higher break evaluate:

$ cat test-evaluate.py
from evaluate import load
import os
import torch.distributed as dist

dist.init_process_group("nccl")

rank = int(os.environ.get("LOCAL_RANK", 0))
world_size = dist.get_world_size()

metric = load("accuracy",
              experiment_id="test4",
              num_process=world_size,
              process_id=rank)
metric.add_batch(predictions=[], references=[])

Problem 1. umask isn't being respected when creating lock files

As we work in a shared group setting, we use umask 000.

But this script creates lock files with missing group/other permissions:

-rw-r--r-- 1 [...]/metrics/accuracy/default/test4-2-rdv.lock

which is invalid, since umask 000 should have led to:

-rw-rw-rw- 1 [...]/metrics/accuracy/default/test4-2-rdv.lock

The problem applies to all the other lock files created during such a run (there are a few more .lock files in that directory).

This is the same issue that was reported and dealt with multiple times in datasets.

If I downgrade to datasets==2.15.0, the files are created correctly with:

-rw-rw-rw- 
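For context, a umask can only clear permission bits from the mode a program requests; it can never add them. The following is a minimal stdlib sketch (Linux/macOS; temporary paths) illustrating why an explicit restrictive mode such as 0o644 — which recent filelock versions appear to request for lock files — cannot be widened back to 0o666 by umask 000:

```python
import os
import stat
import tempfile

os.umask(0o000)  # the shared-group setting from the report: mask nothing

d = tempfile.mkdtemp()

def create_with_mode(name, mode):
    """Create a file requesting `mode`, return the mode it actually got."""
    path = os.path.join(d, name)
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, mode)
    os.close(fd)
    return stat.S_IMODE(os.stat(path).st_mode)

# Requesting 0o666 under umask 000 gives the expected -rw-rw-rw-:
print(oct(create_with_mode("a.lock", 0o666)))  # 0o666

# But a umask can only *clear* bits, never add them: if a library asks
# for 0o644, umask 000 cannot restore the group/other write bits, and
# the lock file ends up -rw-r--r-- as seen above:
print(oct(create_with_mode("b.lock", 0o644)))  # 0o644
```

So if the library creating the lock passes 0o644 explicitly, no umask setting on the caller's side can produce 0o666.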

Problem 2. Expected to find locked file /data/huggingface/metrics/accuracy/default/test4-2-0.arrow.lock from process 1 but it doesn't exist.

$ python -u -m torch.distributed.run --nproc_per_node=2 --rdzv_endpoint localhost:6000  --rdzv_backend c10d test-evaluate.py
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Using the latest cached version of the module from /data/huggingface/modules/evaluate_modules/metrics/evaluate-metric--accuracy/f887c0aab52c2d38e1f8a215681126379eca617f96c447638f751434e8e65b14 (last modified on Mon Jan 29 18:42:31 2024) since it couldn't be found locally at evaluate-metric--accuracy, or remotely on the Hugging Face Hub.
Using the latest cached version of the module from /data/huggingface/modules/evaluate_modules/metrics/evaluate-metric--accuracy/f887c0aab52c2d38e1f8a215681126379eca617f96c447638f751434e8e65b14 (last modified on Mon Jan 29 18:42:31 2024) since it couldn't be found locally at evaluate-metric--accuracy, or remotely on the Hugging Face Hub.
Traceback (most recent call last):
  File "/home/stas/test/test-evaluate.py", line 14, in <module>
    metric.add_batch(predictions=[], references=[])
  File "/env/lib/conda/evaluate-test/lib/python3.9/site-packages/evaluate/module.py", line 510, in add_batch
    self._init_writer()
  File "/env/lib/conda/evaluate-test/lib/python3.9/site-packages/evaluate/module.py", line 656, in _init_writer
    self._check_all_processes_locks()  # wait for everyone to be ready
  File "/env/lib/conda/evaluate-test/lib/python3.9/site-packages/evaluate/module.py", line 350, in _check_all_processes_locks
    raise ValueError(
ValueError: Expected to find locked file /data/huggingface/metrics/accuracy/default/test4-2-0.arrow.lock from process 0 but it doesn't exist.
Traceback (most recent call last):
  File "/home/stas/test/test-evaluate.py", line 14, in <module>
    metric.add_batch(predictions=[], references=[])
  File "/env/lib/conda/evaluate-test/lib/python3.9/site-packages/evaluate/module.py", line 510, in add_batch
    self._init_writer()
  File "/env/lib/conda/evaluate-test/lib/python3.9/site-packages/evaluate/module.py", line 659, in _init_writer
    self._check_rendez_vous()  # wait for master to be ready and to let everyone go
  File "/env/lib/conda/evaluate-test/lib/python3.9/site-packages/evaluate/module.py", line 362, in _check_rendez_vous
    raise ValueError(
ValueError: Expected to find locked file /data/huggingface/metrics/accuracy/default/test4-2-0.arrow.lock from process 1 but it doesn't exist.

The files are there:

-rw-rw-rw- 1 stas stas 0 Jan 29 22:14 /data/huggingface/metrics/accuracy/default/test4-2-0.arrow
-rw-r--r-- 1 stas stas 0 Jan 29 22:15 /data/huggingface/metrics/accuracy/default/test4-2-0.arrow.lock
-rw-rw-rw- 1 stas stas 0 Jan 29 22:14 /data/huggingface/metrics/accuracy/default/test4-2-1.arrow
-rw-r--r-- 1 stas stas 0 Jan 29 22:14 /data/huggingface/metrics/accuracy/default/test4-2-1.arrow.lock
-rw-r--r-- 1 stas stas 0 Jan 29 22:14 /data/huggingface/metrics/accuracy/default/test4-2-rdv.lock
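For reference, evaluate's multi-process rendezvous is built on these lock files: each rank takes an exclusive lock on its own {experiment_id}-{world_size}-{rank}.arrow.lock, and the readiness check treats a rank as present when its lock *cannot* be acquired by the checker. A simplified single-machine sketch of that scheme using fcntl.flock (Linux-only; the names are illustrative, not evaluate's internals):

```python
import fcntl
import os
import tempfile

def try_acquire(path):
    """Return an fd holding an exclusive non-blocking lock on `path`,
    or None if the lock is already held elsewhere."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o666)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return fd
    except OSError:
        os.close(fd)
        return None

lock_path = os.path.join(tempfile.mkdtemp(), "test4-2-1.arrow.lock")

# Rank 1 announces readiness by holding its lock:
rank1_fd = try_acquire(lock_path)

# Rank 0 verifies rank 1 is present: if it *can* grab the lock,
# rank 1 isn't there yet -- that's when the ValueError in the
# traceback above would fire.
probe = try_acquire(lock_path)
if probe is None:
    print("rank 1 is ready")
else:
    os.close(probe)
    print("Expected to find locked file but it doesn't exist")
```

This also shows why the check is sensitive to the lock library's behavior: if a new filelock version changes how or when the lock file is created or released, the "is this rank's lock held?" probe can fail even though the file itself exists on disk.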

If I downgrade to datasets==2.15.0, the above code starts to work.

To summarize: datasets<2.16 works, datasets>=2.16 breaks.

Using evaluate==0.4.1
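Since the triage in this thread hinges on exact package versions, a tiny stdlib helper (the package names are the PyPI ones discussed here) can print them for a report:

```python
from importlib.metadata import PackageNotFoundError, version

def report(packages=("evaluate", "datasets", "filelock")):
    """Return 'pkg==x.y.z' lines for installed packages,
    or a 'not installed' note for missing ones."""
    lines = []
    for pkg in packages:
        try:
            lines.append(f"{pkg}=={version(pkg)}")
        except PackageNotFoundError:
            lines.append(f"{pkg}: not installed")
    return lines

print("\n".join(report()))
```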

Thank you!

@lhoestq

@williamberrios who reported this

@stas00 stas00 closed this as completed Jan 29, 2024
@stas00 stas00 changed the title umask isn't being respected when creating lock files several breakges due to recent datasets Jan 29, 2024
@stas00 stas00 reopened this Jan 29, 2024
@stas00 stas00 changed the title several breakges due to recent datasets several breakages due to recent datasets Jan 29, 2024
@stas00
Contributor Author

stas00 commented Jan 29, 2024

@lhoestq, I updated the OP and was able to bisect which package and version led to the breakage.

@lhoestq
Member

lhoestq commented Jan 30, 2024

It seems to be an issue with recent versions of filelock? I was able to reproduce it using the latest version, 3.13.1.

Can you try using an older version? E.g., I use 3.9.0, which seems to work fine:

pip install "filelock==3.9.0"

@lhoestq
Member

lhoestq commented Jan 30, 2024

I just opened huggingface/datasets#6631 in datasets to fix this.

Can you try it out? Once I have your green light, I can make a new release.

@stas00
Contributor Author

stas00 commented Jan 31, 2024

Thanks a lot, @lhoestq!

@williamberrios - could you please test this ASAP, and if everything works again they can make a new release - thank you!

@williamberrios

Hi @lhoestq, filelock==3.9.0 fixed my issue with distributed evaluation. Thanks a lot ❤️

@stas00
Contributor Author

stas00 commented Feb 2, 2024

Thank you for confirming it solved your problem, William!

@jxmorris12

Problem 2 is affecting me too. Downgrading fixed it but it frustrates me that I have to downgrade filelock on every machine I want to use multi-node evaluate on; is there another workaround? Can we get this fixed @stas00?

@stas00
Contributor Author

stas00 commented Mar 7, 2024

Not sure why you've tagged me, Jack ;) I have just reported the problem on behalf of my colleague.

@jxmorris12

sorry :)

@stas00
Contributor Author

stas00 commented Apr 29, 2024

@lhoestq, is it possible to make a new release now that this issue has been fixed? Thank you!

@lhoestq
Member

lhoestq commented Apr 30, 2024

just released 0.4.2 :)

@stas00
Contributor Author

stas00 commented Apr 30, 2024

Thank you very much, Quentin!

@raghavm1

raghavm1 commented Jul 9, 2024

Unfortunately, I'm facing the same error with the latest versions of evaluate (0.4.2), datasets (2.20.2), and filelock (3.15.4). Downgrading datasets/filelock also doesn't seem to fix the issue for me, in spite of the lock files being present in the cache_dir.
Any suggestions to troubleshoot this error?

@yaraaa7

yaraaa7 commented Aug 25, 2024

Hi,
Did it end up working for you? I'm facing this issue now

@raghavm1

raghavm1 commented Aug 28, 2024

Hi, Did it end up working for you? I'm facing this issue now

Unfortunately not.
I'm also noticing that evaluate is no longer being actively maintained. I'm not sure, but it might be useful to raise this issue in the accelerate repo to learn what the next steps on this bug could be.
For now, I'm sticking to a single-node setup where this issue doesn't occur.

@lhoestq
Member

lhoestq commented Aug 29, 2024

I could take a look if you can provide a google colab or script that reproduces the issue :)

@ffrancesco94

+1 from me. In my case, it is complete_nlp_example.py from the accelerate repo that fails if I run multi-GPU, multi-node. The script needs to be slightly tweaked (the version in the repo doesn't set the number of processes and the process id when evaluate.load() is called), but after that it dies in the metric evaluation part with this error.
