Streaming datasets V2 #2

Merged
merged 37 commits into from
Aug 10, 2022

@knighton knighton commented Aug 4, 2022

Formats: MDS, JSON, XSV (CSV, TSV)
Sample sizes moved to shards
V2 small JSON index format (faster startup time at scale)
Sample fields now have types instead of having to manually provide decoders
Prefetch factor
Fast process pool Dataset.download()
Dataset supports random access of samples, lazily loading their shards
Offsets live on disk instead of memory for scalability (two seeks instead of one for sample access)
Various compression algorithms
Various hash/checksum algorithms on shards (TODO: on index)
Datasets: CIFAR10, ImageNet (TODO: more)
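The "two seeks instead of one" trade-off from the list above can be sketched as follows. This is a minimal illustration under an assumed shard layout (a uint32 sample count, then count + 1 uint32 offsets, then the raw sample bytes); the layout and the helper names `write_shard` and `read_sample` are hypothetical and are not taken from this PR's code.

```python
import io
import struct


def write_shard(samples):
    """Pack samples using the assumed layout: count, offset table, payload."""
    count = len(samples)
    header_size = 4 + 4 * (count + 1)  # count field plus (count + 1) offsets
    offsets = [header_size]
    for sample in samples:
        offsets.append(offsets[-1] + len(sample))
    # Little-endian: one uint32 count followed by (count + 1) uint32 offsets.
    header = struct.pack(f'<I{count + 1}I', count, *offsets)
    return header + b''.join(samples)


def read_sample(fp, index):
    """Fetch one sample without loading the offset table into memory."""
    fp.seek(4 + 4 * index)  # seek 1: this sample's entry in the offset table
    begin, end = struct.unpack('<II', fp.read(8))
    fp.seek(begin)  # seek 2: the sample bytes themselves
    return fp.read(end - begin)


samples = [b'cat', b'', b'giraffe']
shard = io.BytesIO(write_shard(samples))
assert [read_sample(shard, i) for i in range(len(samples))] == samples
```

Keeping the table on disk costs one extra seek per access, but memory use no longer grows with the number of samples, which is the scalability point the description makes.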

@knighton knighton mentioned this pull request Aug 5, 2022

hanlint commented Aug 9, 2022

Before merging, could you set up:

This would at minimum auto-format your code into your yahp standard to reduce diff noise / conflicts in the future.

@hanlint hanlint left a comment

Didn't review for correctness, but some suggestions from API perspective.

setup.py Outdated
streaming/base/compression/compression.py Outdated
streaming/base/compression/compression.py Outdated
streaming/base/compression/compression.py Outdated
streaming/base/compression/compression.py Outdated
streaming/base/download.py
streaming/base/format/base/writer.py Outdated
streaming/base/format/base/reader.py Outdated
streaming/base/format/base/writer.py Outdated
test.py Outdated
@hanlint hanlint requested a review from dblalock August 9, 2022 17:20

@hanlint hanlint left a comment

yolo

setup.py Outdated
@knighton knighton merged commit 0d0a730 into main Aug 10, 2022
@knighton knighton deleted the dev2 branch August 10, 2022 19:39
@knighton knighton mentioned this pull request May 23, 2023
knighton added a commit that referenced this pull request Jan 22, 2024
3 participants