Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update padding of ragged features to enable dataloader change #647

Merged

Conversation

oliverholworthy
Copy link
Member

Goals ⚽

Implementation Details 🚧

  • Implements the equivalent of the padding currently in the dataloader
    • Uses ragged -> sparse -> dense conversion which appears to be faster than an alternative approach assigning values to a tensor constructed with zeros

Testing Details 🔍

  • Existing tests should cover existing usage of the Merlin Dataloader
  • Adds unit tests for padding function to check pad batch

@oliverholworthy oliverholworthy added the chore Maintenance for the repository label Mar 14, 2023
@oliverholworthy oliverholworthy self-assigned this Mar 14, 2023
@oliverholworthy oliverholworthy added this to the Merlin 23.03 milestone Mar 14, 2023
@github-actions
Copy link

conts=None,
labels=None,
):
schema = schema.select_by_name(conts + cats + labels)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should remove the =None for cats, conts & labels?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's clearer without the default now. it wouldn't have worked as None before this change either.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought, I've put the None back and added support for the None to the method by setting to an empty list if not provided.

It doesn't appear to be captured by any tests, but this would enable training a model with only continuous or only categorical features. Currently you need to have at least one of each for the current version of this MerlinDataloader to work.


batch_padded = {}
for k, values in batch.items():
if k.endswith("__values"):
Copy link
Contributor

@marcromeyn marcromeyn Mar 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be better to put __values and __offsets as constants somewhere in merlin-core? Or some util-method, like is_value or something.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps TensorTable would be useful here since that seems to handle both the values/offsets dictionary and tuple variants for us.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated in 43d535d

return pad_fn

@staticmethod
def _augment_schema(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved this from schema_utiils to here so that it's closer to the only place it's called in the codebase.

@karlhigley karlhigley merged commit 8c13823 into NVIDIA-Merlin:main Mar 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
chore Maintenance for the repository
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants