[load_from_hf_hub] Add dataset_length, set_index (#339)
This PR adds two things to the `load_from_hf_hub` reusable component:
- A `dataset_length` argument, which is required when the user specifies `n_rows_to_load`. I added it because I hit an issue when `n_rows_to_load` was larger than the partition size: the current code loads only the first partition, so even though I set `n_rows_to_load` to 150k, I only got 69,000 rows. To solve this, I calculate the size of a single partition, then return approximately the requested `n_rows_to_load`.
- A monotonically increasing index, as suggested by this Stack Overflow post, to solve the issue of duplicate indices caused by every partition's index starting at 0.
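The two changes described above can be sketched in plain Python. This is a rough illustration of the arithmetic only, not the component's actual code: the function names `partitions_to_load` and `global_index`, the `npartitions` parameter, and the assumption that rows are spread roughly evenly across partitions are all hypothetical.

```python
import math
from itertools import accumulate


def partitions_to_load(n_rows_to_load, dataset_length, npartitions):
    """Estimate how many leading partitions cover n_rows_to_load rows."""
    # Assumption: rows are spread roughly evenly across partitions.
    partition_length = max(1, dataset_length // npartitions)
    needed = math.ceil(n_rows_to_load / partition_length)
    # Never ask for more partitions than the dataset has.
    return min(needed, npartitions)


def global_index(partition_lengths):
    """Give every row a unique, increasing index across all partitions."""
    # Each partition's offset is the total row count of the partitions before it.
    offsets = [0, *accumulate(partition_lengths)][:-1]
    return [list(range(off, off + n)) for off, n in zip(offsets, partition_lengths)]
```

With partitions of 69,000 rows each (for example, a hypothetical dataset of 690,000 rows in 10 partitions), `partitions_to_load(150_000, 690_000, 10)` asks for 3 partitions instead of 1. Because whole partitions are loaded, the result only approximates the requested row count unless it is trimmed afterwards.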
1 parent 74aca21, commit dfdcee3
Showing 2 changed files with 18 additions and 15 deletions.