Document similarity workflow #247

Open
saggu opened this issue Aug 3, 2018 · 2 comments

@saggu
Member

saggu commented Aug 3, 2018

[Image: 20180803_104442]

Pipeline should work as follows:

Process each incoming document: create sentence vector indices.

Store the indices so that they can be recreated if the process dies.

For each query: compute its vector, find the k nearest matches irrespective of any threshold, and return the ranked result, which is a list of document ids with similarity scores.

Fetch the documents from ES and return them to the DIG UI.

If the user chooses a facet, apply it as a filter to the list of documents for the query, re-rank the results, and return them to the DIG UI. So if we originally had k documents, adding a facet will always return <= k documents: the facets act as a filter.
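
The query and facet steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's implementation: brute-force cosine similarity in plain Python stands in for the real vector index, a document's score is assumed to be the best score among its sentences, and the names `rank_documents`, `apply_facet`, the toy vectors, and the `facets` dictionary are all hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def rank_documents(query_vec: Vector,
                   sentence_vecs: List[Tuple[str, Vector]],
                   k: int) -> List[Tuple[str, float]]:
    """Return up to k (doc_id, score) pairs, best first, with no score threshold.

    Assumption: a document's score is the best score among its sentences.
    """
    best: Dict[str, float] = {}
    for doc_id, vec in sentence_vecs:
        score = cosine(query_vec, vec)
        if doc_id not in best or score > best[doc_id]:
            best[doc_id] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


def apply_facet(ranked: List[Tuple[str, float]],
                facets: Dict[str, Dict[str, str]],
                field: str,
                value: str) -> List[Tuple[str, float]]:
    """Keep only documents whose facet field equals value, then re-sort by score."""
    kept = [(doc_id, score) for doc_id, score in ranked
            if facets.get(doc_id, {}).get(field) == value]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical toy data: (doc_id, sentence vector) pairs and per-document facets.
sentences = [
    ("doc1", [1.0, 0.0]),
    ("doc1", [0.0, 1.0]),
    ("doc2", [1.0, 1.0]),
    ("doc3", [-1.0, 0.0]),
]
facets = {
    "doc1": {"type": "report"},
    "doc2": {"type": "email"},
    "doc3": {"type": "report"},
}

ranked = rank_documents([1.0, 0.0], sentences, k=3)
filtered = apply_facet(ranked, facets, "type", "report")
```

Because the facet only removes entries from the k results already found, `filtered` can never be longer than `ranked`. Note that `doc3` is still returned despite its negative score, since no threshold is applied.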

@saggu
Member Author

saggu commented Aug 6, 2018

Updated Pipeline
[Image: 20180806_140425]

@saggu
Member Author

saggu commented Aug 28, 2018

The following tasks are done:

  1. Vectorize each sentence using TensorFlow
  2. Index the vectors in a FAISS index, and store the link in HBase
  3. Query with a string and get the k most similar docs back
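
The issue does not say what the stored "link" contains. One plausible reading is a mapping from each index row back to the document and sentence it came from, since a vector index typically returns integer row positions rather than document ids. The sketch below illustrates that reading only: a plain dict and a JSON file stand in for HBase, and `LinkTable` and its methods are hypothetical names, not part of this project or of FAISS.

```python
import json
import os
import tempfile
from typing import Dict, List, Tuple


class LinkTable:
    """Maps index row ids to (doc_id, sentence_no). Stand-in for the HBase store."""

    def __init__(self) -> None:
        self.rows: Dict[int, Tuple[str, int]] = {}

    def add_document(self, doc_id: str, n_sentences: int) -> List[int]:
        """Assign one row id per sentence, in the order vectors are indexed."""
        start = len(self.rows)
        ids = list(range(start, start + n_sentences))
        for sentence_no, row_id in enumerate(ids):
            self.rows[row_id] = (doc_id, sentence_no)
        return ids

    def doc_ids(self, row_ids: List[int]) -> List[str]:
        """Translate row ids to doc ids, dropping repeats and keeping order."""
        seen: List[str] = []
        for row_id in row_ids:
            doc_id = self.rows[row_id][0]
            if doc_id not in seen:
                seen.append(doc_id)
        return seen

    def save(self, path: str) -> None:
        """Persist the table so it survives a process restart."""
        with open(path, "w") as fh:
            json.dump({str(k): list(v) for k, v in self.rows.items()}, fh)

    @classmethod
    def load(cls, path: str) -> "LinkTable":
        """Rebuild the table from a file written by save()."""
        table = cls()
        with open(path) as fh:
            table.rows = {int(k): (v[0], v[1]) for k, v in json.load(fh).items()}
        return table


table = LinkTable()
table.add_document("doc1", 2)  # rows 0 and 1
table.add_document("doc2", 3)  # rows 2, 3 and 4

path = os.path.join(tempfile.mkdtemp(), "links.json")
table.save(path)
restored = LinkTable.load(path)

# Row ids as a nearest-neighbour search might return them, best match first.
hits = restored.doc_ids([3, 0, 4, 1])
```

Keeping the mapping outside the index process is what lets it be reloaded after a crash, which matches the requirement earlier in this thread that the indices be recreatable if the process dies.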
