Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Word ngram based minhashes #15055

Open
ayushdg opened this issue Feb 14, 2024 · 0 comments
Open

[FEA] Word ngram based minhashes #15055

ayushdg opened this issue Feb 14, 2024 · 0 comments
Assignees
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python)

Comments

@ayushdg
Copy link
Member

ayushdg commented Feb 14, 2024

Is your feature request related to a problem? Please describe.
The current minhashes functionality/API #12961 computes minhashes by taking a fixed character width sliding window and computes the minhashes for a document using character ngrams.

Another alternative approach to minhashing documents is to instead use word based ngrams instead of char based and research seems to suggest it leads to better quality results (lower false positives during Locality sensitive hashing).

Describe the solution you'd like
Add a word ngram based minhash support that's accelerated on GPUs.

Describe alternatives you've considered
Currently most CPU libraries achieve this by first tokenizing the string into word ngrams, and then looping over these tokens computing the minhash.
It is possible to somewhat mimic that with a str.word_tokenize + str.hash_values + groupby + min in cuDF but requires a lot more intermediate memory and reduces the batch size of documents processed. It is also slower than a custom minhash implementation.

Another challenge is that the definition of a "word" can be different for different languages and there's often different approaches to create word_ngrams based on language.

Additional context
Add any other context, code examples, or references to existing implementations about the feature request here.

@ayushdg ayushdg added the feature request New feature or request label Feb 14, 2024
@davidwendt davidwendt self-assigned this Feb 14, 2024
@GregoryKimball GregoryKimball added libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) labels Feb 15, 2024
rapids-bot bot pushed a commit that referenced this issue Sep 17, 2024
Experimental implementation for #15055 
The input is a lists column of strings where each string in each row is expected as a word to be hashed. The minimum hash for that row is returned in a lists column where each row contains a minhash per input hash seed.
Here the caller is expected to produce the words to be hashed.

```
std::unique_ptr<cudf::column> word_minhash(
  cudf::lists_column_view const& input,
  cudf::device_span<uint32_t const> seeds,
  rmm::cuda_stream_view stream,
  rmm::device_async_resource_ref mr);
```

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Nghia Truong (https://github.com/ttnghia)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #15368
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python)
Projects
Status: In progress
Development

No branches or pull requests

3 participants