Modify Workflow to Allow IterableDataset Inputs #8263

Open
wants to merge 5 commits into
base: dev
from

Conversation

ericspod
Member

Description

This modifies the behaviour of Workflow so that IterableDataset can be used correctly. The check against the epoch_length value is removed so that the value can be None, and a test is added to verify this. The length of a data loader is not defined when using iterable datasets, so a try/except is added to allow the length to be queried safely. This is related to my work on streaming support: in my prototype gist I had to provide a bogus epoch length value and then change it to None once the evaluator object was created. This PR removes the need for that hack.
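The length-query behaviour described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `resolve_epoch_length` and `stream` are hypothetical names, and a plain generator stands in for a data loader over an iterable dataset.

```python
from typing import Any, Iterable, Optional


def resolve_epoch_length(data_loader: Iterable[Any], epoch_length: Optional[int] = None) -> Optional[int]:
    """Hypothetical helper: prefer an explicit epoch_length, else len(data_loader), else None."""
    if epoch_length is not None:
        return epoch_length
    try:
        return len(data_loader)  # type: ignore[arg-type]
    except TypeError:
        # No usable length, e.g. a generator or a loader over an iterable dataset.
        return None


def stream():
    """Stand-in for a streaming source with no defined length."""
    yield from range(3)


print(resolve_epoch_length([1, 2, 3]))                  # sized input: length is used
print(resolve_epoch_length(stream()))                   # unsized input: falls back to None
print(resolve_epoch_length(stream(), epoch_length=10))  # explicit value is kept
```

Returning None for the unsized case is the assumption here; it mirrors the description's point that epoch_length is allowed to stay None instead of being set to a placeholder.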

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Integration tests passed locally by running ./runtests.sh -f -u --net --coverage.
  • Quick tests passed locally by running ./runtests.sh --quick --unittests --disttests.
  • In-line docstrings updated.
  • Documentation updated, tested make html command in the docs/ folder.

Signed-off-by: Eric Kerfoot <eric.kerfoot@kcl.ac.uk>
pre-commit-ci bot and others added 4 commits December 13, 2024 20:56
Signed-off-by: Eric Kerfoot <eric.kerfoot@kcl.ac.uk>
…gine_stream_fix

@KumoLiu
Contributor

KumoLiu commented Dec 16, 2024

/build

Contributor

@KumoLiu KumoLiu left a comment

Thanks for the PR, looks good to me. One minor comment inline.

raise ValueError("If data_loader is not PyTorch DataLoader, must specify the epoch_length.")
try:
    epoch_length = len(data_loader)
except TypeError:  # raised when data_loader is given an iterable dataset which has no length
Contributor

Do you think we should add a warning here or validate the dataset instance in the data_loader?

isinstance(data_loader.dataset, Iterable)

Member Author

The data_loader argument itself isn't always a DataLoader instance, so it may not have a dataset member; it can be an Iterable itself, which might have a length. I think there's a use case where you'd pass in just a list of items you already have and not use a DataLoader at all, so this is the more robust way of attempting to get a length.
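A small sketch of that point, using hypothetical stand-ins rather than MONAI or PyTorch classes: `StreamingLoader` imitates a loader whose length is undefined, and `length_or_none` is an illustrative helper name.

```python
class StreamingLoader:
    """Hypothetical stand-in for a loader wrapping an iterable dataset."""

    def __init__(self, dataset):
        self.dataset = dataset

    def __iter__(self):
        return iter(self.dataset)

    def __len__(self):
        # Mimics a loader whose underlying dataset defines no length.
        raise TypeError("dataset has no length")


def length_or_none(data_loader):
    """Try len() directly; works whether or not a .dataset attribute exists."""
    try:
        return len(data_loader)
    except TypeError:
        return None


items = [{"image": 0}, {"image": 1}]

# A plain list has no .dataset member, so a check on data_loader.dataset
# would raise AttributeError before any length could be found.
print(hasattr(items, "dataset"))

print(length_or_none(items))                         # the list reports its own length
print(length_or_none(StreamingLoader(iter(items))))  # no length available
```

The try/except route only asks whether the object can report a length, so it makes no assumption about what type data_loader is.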

Contributor

So it looks like the "if epoch_length is None:" check should be moved to the outer level, right?
