Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add standalone partition assignment operation #2556

Merged

Conversation

westonpace
Copy link
Contributor

This also fixes up the tqdm progress bar for partition assignment so that it has a definite end. This also makes one small tweak to the indices builder, moving the column argument into the builder constructor, since this argument will be shared by all methods in the class.


dataset: LanceDataset
the dataset containing the data
column: str
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This IndicesBuilder is vector only?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, can probably rename in future PR.

row_id: uint64
partition: uint32

Note: There is no advantage to separately computing the partition assignment
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it is also useful for checkpoint purpose.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm...perhaps. Though I think a better way to checkpoint the "transform vectors" stage will be to break up the work by fragments.

def assign_ivf_partitions(
self,
ivf_model: IvfModel,
dst_dataset_uri: Optional[str],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dst_dataset_uri => output_uri? we use less abbreviation in python SDKs.

Btw, should this be a dict arg, i.e., output_uri = None if we expect that user won't set it for most of the time?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can change to output_uri. I could set it to output_uri = None (which would use a random tmp path?). I think that would be confusing though since the other methods are going to need paths. E.g. if a user is going down the "piece by piece" standalone route then they want to know and be aware of all the temp files because that is what they are trying to do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I came around. I changed to output_uri and default to None.

@westonpace westonpace force-pushed the feat/standalone-partition-assignment branch from f446aa8 to 81759a6 Compare July 11, 2024 23:55
@westonpace westonpace merged commit 5a8fb8c into lancedb:main Jul 12, 2024
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants