Update padding of ragged features to enable dataloader change #647

oliverholworthy · 2023-03-14T16:43:19Z

Goals ⚽

Enable a change to the Merlin dataloader (Remove sparse tensor output type for list features, dataloader#103) that removes the sparse output type and the padding based on the value count property.
Implement this as a backwards-compatible change that works with the current dataloader version as well as the future version (Update dataloader to provide new output structure, dataloader#101).

Implementation Details 🚧

Implements the equivalent of the padding currently done in the dataloader
- Uses a ragged -> sparse -> dense conversion, which appears to be faster than the alternative of assigning values into a tensor constructed with zeros
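
For readers unfamiliar with the conversion named above, here is a minimal sketch of a ragged -> sparse -> dense padding step in PyTorch. It is an illustration under stated assumptions, not the code added in this PR: the function name `pad_ragged` and its `max_length` parameter are hypothetical, and the input is assumed to be a 1-d values tensor plus an offsets tensor with one more entry than there are rows.

```python
import torch


def pad_ragged(values, offsets, max_length=None):
    """Pad a ragged (values, offsets) pair to a dense 2-d tensor.

    Hypothetical helper for illustration; not the function from this PR.
    """
    offsets = offsets.long()
    row_lengths = offsets[1:] - offsets[:-1]
    num_rows = row_lengths.shape[0]
    if max_length is None:
        max_length = int(row_lengths.max()) if num_rows > 0 else 0

    device = values.device
    # Row index of every value, e.g. lengths [2, 1, 3] -> [0, 0, 1, 2, 2, 2]
    row_ids = torch.repeat_interleave(
        torch.arange(num_rows, device=device), row_lengths
    )
    # Position of every value inside its own row
    col_ids = torch.arange(values.shape[0], device=device) - offsets[:-1][row_ids]

    # Drop values that fall beyond max_length (truncates long rows)
    keep = col_ids < max_length
    indices = torch.stack([row_ids[keep], col_ids[keep]])

    # ragged -> sparse -> dense; missing positions are filled with zeros
    sparse = torch.sparse_coo_tensor(
        indices, values[keep], size=(num_rows, max_length)
    )
    return sparse.to_dense()


values = torch.tensor([1, 2, 3, 4, 5, 6])
offsets = torch.tensor([0, 2, 3, 6])
padded = pad_ragged(values, offsets)
# rows [1, 2], [3], [4, 5, 6] -> [[1, 2, 0], [3, 0, 0], [4, 5, 6]]
```

The alternative mentioned above would allocate `torch.zeros(num_rows, max_length)` and assign each row's slice into it, which typically needs a Python-level loop over rows.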

Testing Details 🔍

Existing tests should cover the existing usage of the Merlin dataloader
Adds unit tests for the padding function to check batch padding

github-actions · 2023-03-14T17:11:39Z

Documentation preview

https://nvidia-merlin.github.io/Transformers4Rec/review/pr-647

marcromeyn · 2023-03-15T12:40:46Z

transformers4rec/torch/utils/data_utils.py

+        conts=None,
+        labels=None,
+    ):
+        schema = schema.select_by_name(conts + cats + labels)


Maybe we should remove the =None for cats, conts & labels?

That's clearer without the default now. It wouldn't have worked as None before this change either.

On second thought, I've put the None defaults back and added support for them in the method by falling back to an empty list when an argument is not provided.

This doesn't appear to be covered by any tests, but it would enable training a model with only continuous or only categorical features. At the moment you need at least one of each for the current version of MerlinDataLoader to work.
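
A minimal sketch of the None handling described above, assuming plain lists of column names. The function name `select_column_names` is hypothetical and stands in for the `conts + cats + labels` expression in the diff; it is not the method from this PR.

```python
from typing import List, Optional


def select_column_names(
    cats: Optional[List[str]] = None,
    conts: Optional[List[str]] = None,
    labels: Optional[List[str]] = None,
) -> List[str]:
    # Treat a missing group as empty so list concatenation never sees None.
    cats = cats or []
    conts = conts or []
    labels = labels or []
    return conts + cats + labels


# Only continuous features are provided; cats and labels fall back to [].
names = select_column_names(conts=["price"])
```

Without the fallback, `conts + cats + labels` raises a `TypeError` as soon as one of the three arguments is left as None.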

marcromeyn · 2023-03-15T12:41:48Z

transformers4rec/torch/utils/padding.py

+
+    batch_padded = {}
+    for k, values in batch.items():
+        if k.endswith("__values"):


Wouldn't it be better to put __values and __offsets as constants somewhere in merlin-core? Or some util-method, like is_value or something.

Perhaps TensorTable would be useful here since that seems to handle both the values/offsets dictionary and tuple variants for us.

updated in 43d535d
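
For illustration, the constants and helper suggested in this thread could look like the sketch below. All names here (`VALUES_SUFFIX`, `OFFSETS_SUFFIX`, `split_ragged_columns`) are hypothetical and are not taken from merlin-core or from commit 43d535d, which uses TensorTable instead.

```python
# Hypothetical constants; the suffixes themselves come from the diff above.
VALUES_SUFFIX = "__values"
OFFSETS_SUFFIX = "__offsets"


def split_ragged_columns(batch):
    """Split a flat batch dict into ragged (values, offsets) pairs and dense columns."""
    ragged, dense = {}, {}
    for key, tensor in batch.items():
        if key.endswith(VALUES_SUFFIX):
            name = key[: -len(VALUES_SUFFIX)]
            # Pair the values with the offsets stored under the matching key.
            ragged[name] = (tensor, batch[name + OFFSETS_SUFFIX])
        elif key.endswith(OFFSETS_SUFFIX):
            # Already consumed together with its values entry.
            continue
        else:
            dense[key] = tensor
    return ragged, dense


batch = {
    "item_id__values": [1, 2, 3],
    "item_id__offsets": [0, 2, 3],
    "price": [0.5, 1.0],
}
ragged, dense = split_ragged_columns(batch)
```

Keeping the suffix strings in one place means the padding code only deals with column names, not with the key-naming convention.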

oliverholworthy · 2023-03-15T13:06:31Z

transformers4rec/torch/utils/data_utils.py

+        return pad_fn
+
+    @staticmethod
+    def _augment_schema(


Moved this from schema_utils to here so that it's closer to the only place it's called in the codebase.


oliverholworthy added 5 commits March 14, 2023 09:18

Add Padding transform to enable removing sparse output from dataloader

f8e5830

Handle padding of 1-d and 2-d values/offsets

ba923e2

Move padding function to padding module

86bfddb

Move _augment_schema to a method of MerlinDataLoader

ced153f

Move get_pad_fn to staticmethod on MerlinDataLoader

f4b91c3

oliverholworthy added the chore (Maintenance for the repository) label Mar 14, 2023

oliverholworthy self-assigned this Mar 14, 2023

oliverholworthy added this to the Merlin 23.03 milestone Mar 14, 2023

oliverholworthy added 2 commits March 14, 2023 16:44

Merge branch 'main' into dataloader-dense-padding

1431f5d

Correct name of pad method in MerlinDataLoader

86662c6

Temporary change to run-on to check if 2PGU is working

eab5492

marcromeyn reviewed Mar 15, 2023


oliverholworthy added 2 commits March 15, 2023 12:57

Use TensorTable to simplify padding implementation

43d535d

Remove None defaults in _augment_schema

1a9b768

oliverholworthy commented Mar 15, 2023


oliverholworthy and others added 3 commits March 15, 2023 13:11

Restore None default in _augment_schema and set default to empty list

f65dd66

Revert "Temporary change to run-on to check if 2PGU is working"

25564ad

This reverts commit eab5492.

Merge branch 'main' into dataloader-dense-padding

363ab92

karlhigley approved these changes Mar 15, 2023


karlhigley merged commit 8c13823 into NVIDIA-Merlin:main Mar 15, 2023
