Change comparison in partitions to include equals #587

Merged
merged 4 commits into from
Feb 12, 2024
7 changes: 6 additions & 1 deletion streaming/base/partition/orig.py
@@ -81,7 +81,12 @@ def get_partitions_orig(num_samples: int,
padding = node_ratio - overflow
padded_samples_per_canonical_node = samples_per_canonical_node + padding

- if num_samples > num_canonical_nodes:
+ # For samples to be properly split across canonical nodes, there must be more samples than nodes.
+ # The edge case is when the number of samples is equal to the number of canonical nodes, but this only works when
+ # there is an equal or greater number of canonical nodes than physical nodes.
+ # If these conditions are not met, an alternative sampling approach is used that leads to many repeats.
+ if num_samples > num_canonical_nodes or (num_samples == num_canonical_nodes and
+ num_canonical_nodes >= num_physical_nodes):
# Create the initial sample ID matrix.
#
# ids: (canonical nodes, padded samples per canonical node).
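To make the effect of the changed comparison concrete, here is a minimal standalone sketch that compares the old and new predicates. The helper names `old_condition` and `new_condition` are hypothetical and exist only for illustration; they are not part of `streaming/base/partition/orig.py`, and the sketch only reproduces the boolean check shown in the diff, not the partitioning itself.

```python
def old_condition(num_samples: int, num_canonical_nodes: int) -> bool:
    """Predicate before this change (hypothetical helper): strictly more samples than canonical nodes."""
    return num_samples > num_canonical_nodes


def new_condition(num_samples: int, num_canonical_nodes: int, num_physical_nodes: int) -> bool:
    """Predicate after this change (hypothetical helper).

    Also accepts the equality case, but only when there are at least as many
    canonical nodes as physical nodes.
    """
    return num_samples > num_canonical_nodes or (
        num_samples == num_canonical_nodes and num_canonical_nodes >= num_physical_nodes
    )


# (num_samples, num_canonical_nodes, num_physical_nodes)
cases = [
    (9, 8, 4),   # more samples than canonical nodes
    (8, 8, 4),   # equal, canonical nodes >= physical nodes
    (8, 8, 16),  # equal, but fewer canonical nodes than physical nodes
    (4, 8, 2),   # fewer samples than canonical nodes
]

results = [
    (old_condition(s, c), new_condition(s, c, p))
    for s, c, p in cases
]

for (s, c, p), (old, new) in zip(cases, results):
    print(f"samples={s} canonical={c} physical={p} old={old} new={new}")
```

Only the second case changes outcome: with equal sample and canonical node counts and no more physical nodes than canonical nodes, the new check takes the sample ID matrix branch instead of falling through to the alternative approach that the diff's comment says leads to many repeats.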