Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix PawsX eval splits #316

Merged
merged 1 commit into from
Apr 5, 2024
Merged

Fix PawsX eval splits #316

merged 1 commit into from
Apr 5, 2024

Conversation

imenelydiaker
Copy link
Contributor

@imenelydiaker imenelydiaker commented Apr 4, 2024

Fix for issue #309. Eval splits didn't have the correct name.

@imenelydiaker imenelydiaker linked an issue Apr 4, 2024 that may be closed by this pull request
Copy link
Contributor

@KennethEnevoldsen KennethEnevoldsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you run one of the smaller models on this just to check that it works without issue?

@MartinBernstorff
Copy link
Contributor

Looks good! Agree with Kenneth.

As a meta-question, maybe it's possible to check if the splits exist in the dataset without downloading it? Might be worth it if we get more errors like this.

@imenelydiaker
Copy link
Contributor Author

imenelydiaker commented Apr 4, 2024

Looks good! Agree with Kenneth.

As a meta-question, maybe it's possible to check if the splits exist in the dataset without downloading it? Might be worth it if we get more errors like this.

Yep was thinking about adding it to the test that checks if a dataset + revision id exists

@KennethEnevoldsen
Copy link
Contributor

As a meta-question, maybe it's possible to check if the splits exist in the dataset without downloading it? Might be worth it if we get more errors like this.

As I understand there is no API for this (at least documented), but you could probably do something like:

>>> ds = load_dataset("paws-x", "en", revision="8a04d940a42cd40658986fdd8e3da561533a3646", streaming=True)
>>> ds.keys()
dict_keys(['train', 'test', 'validation'])

It only download metadata I believe:

Downloading builder script: 100%|██████████| 6.54k/6.54k [00:00<00:00, 9.68MB/s]
Downloading metadata: 100%|████████████████| 17.6k/17.6k [00:00<00:00, 9.76MB/s]
Downloading readme: 100%|██████████████████| 11.8k/11.8k [00:00<00:00, 7.18MB/s]

Might be a way to generalize the existing tests.

@KennethEnevoldsen
Copy link
Contributor

Yep was thinking about adding it to the test that check if a dataset + revision id exists

@imenelydiaker this test already exists (based on a recent pr)

@imenelydiaker
Copy link
Contributor Author

Can you run one of the smaller models on this just to check that it works without issue?

Yep I used average_word_embeddings_komninos with mteb CLI and it worked 🙂

@imenelydiaker
Copy link
Contributor Author

imenelydiaker commented Apr 4, 2024

As a meta-question, maybe it's possible to check if the splits exist in the dataset without downloading it? Might be worth it if we get more errors like this.

As I understand there is no API for this (at least documented), but you could probably do something like:

>>> ds = load_dataset("paws-x", "en", revision="8a04d940a42cd40658986fdd8e3da561533a3646", streaming=True)
>>> ds.keys()
dict_keys(['train', 'test', 'validation'])

It only download metadata I believe:

Downloading builder script: 100%|██████████| 6.54k/6.54k [00:00<00:00, 9.68MB/s]
Downloading metadata: 100%|████████████████| 17.6k/17.6k [00:00<00:00, 9.76MB/s]
Downloading readme: 100%|██████████████████| 11.8k/11.8k [00:00<00:00, 7.18MB/s]

Might be a way to generalize the existing tests.

So you just need to download the dataset in streaming mode to not store it locally and check the keys of the dataset object? Looks like a good alternative 🤔

@imenelydiaker
Copy link
Contributor Author

Should I merge the fix and we'll open another PR for the test?

@imenelydiaker imenelydiaker merged commit 15285d4 into main Apr 5, 2024
5 checks passed
@imenelydiaker imenelydiaker deleted the fix/309-keyerror-testfull branch April 5, 2024 08:20
@MartinBernstorff
Copy link
Contributor

Yeah, I'll give it a stab today 👍

MartinBernstorff pushed a commit that referenced this pull request Apr 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

KeyError: 'test.full'
3 participants