Added Cleanlab <> Pinecone RAG workflow #19
Conversation
"metadata": {},
"source": [
"# How to build a reliable, curated, and accurate RAG system using Cleanlab and Pinecone"
]
Curious: how come there is no mention of document chunking? I would've thought the steps are:
- chunk documents and embed/ingest chunks into pinecone DB, so there's already a pre-existing DB before Cleanlab enters the picture.
- run Studio on every chunk as a separate example (with the text for the chunk coming from a file exported out of pinecone DB)
- show how to map these results back to the pinecone DB (for instance, delete PII, non-English, and toxic chunks, plus extra exact-duplicate copies, from the DB).
Not sure there's any need to run studio on entire documents at all in this tutorial, WDYT?
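The chunking/ingestion step described above could be sketched roughly as follows (a hypothetical helper, assuming simple fixed-size character splitting with overlap; real tutorials would likely use a tokenizer-aware splitter):

```python
def chunk_document(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split a document into overlapping fixed-size character chunks.

    Each chunk would then be embedded and upserted into the Pinecone index
    as a separate record, so Cleanlab Studio can later score every chunk
    as its own example.
    """
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```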
I guess you've framed this as a one-time cleanup of documents prior to entering them into the RAG DB. While that's a valid use case, I imagine it's less common than wanting to clean up an existing RAG DB.
Gotcha, yeah. I thought it would simplify things to do it as a one-time cleanup; then, in a more sophisticated version of the notebook (tutorial), we can do it by cleaning up an existing RAG DB. But for the purpose of sharing a RAG example use case with Pinecone (and using it internally for RAG demos), the one-time version makes sense to me.
We will do metadata in part 2 of this notebook.
Make sure to include a bunch of bad document chunks in the dataset.
workflow is:
pinecone DB -> Cleanlab -> better version of pinecone DB
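That loop might look roughly like this (an in-memory dict stands in for the Pinecone index, and the toy flagging rules stand in for the issue columns Cleanlab Studio would actually export; with a real index, the last step would be a call like index.delete(ids=...)):

```python
import re

# In-memory stand-in for a Pinecone index: chunk id -> chunk text.
index = {
    "doc1#0": "Normal sales chunk about SAP Business One.",
    "doc1#1": "Normal sales chunk about SAP Business One.",  # exact duplicate
    "doc2#0": "Call me at 555-123-4567",                     # contains PII
}

def flag_bad_chunks(index: dict) -> set:
    """Toy flagging: exact-duplicate copies (keep the first) and phone-number PII.

    In the real workflow these flags would come from Cleanlab Studio's
    issue columns (PII, non-English, toxic, duplicate, ...).
    """
    flagged, seen = set(), set()
    for chunk_id, text in index.items():
        if text in seen:
            flagged.add(chunk_id)  # duplicate copy
        seen.add(text)
        if re.search(r"\d{3}-\d{3}-\d{4}", text):
            flagged.add(chunk_id)  # PII
    return flagged

bad_ids = flag_bad_chunks(index)
for chunk_id in bad_ids:  # with a real index: index.delete(ids=list(bad_ids))
    del index[chunk_id]
```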
"# Simple query regarding our documents\n",
"question = \"Tell me about the sales principles at SAP Business One\"\n",
"\n",
"top_doc = rag_pipeline.search(question, top_k=1, filter_query={\"topic\": {\"$eq\": \"sales\"}})\n",
For this part, you should write some text before the code cell to justify where the metadata filter is coming from. E.g.:
Suppose our application asks the user which topic their question is about, from a pre-defined list. Alternatively, we could train a classifier to predict the topic from the question (automatically, using Cleanlab Studio AutoML).
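For instance, the filter_query passed to the search above could be built from the user-selected topic like this (a hypothetical helper; the topic list is an assumption for illustration):

```python
VALID_TOPICS = {"sales", "finance", "hr"}  # pre-defined list shown to the user (assumed)

def build_topic_filter(topic: str) -> dict:
    """Construct a Pinecone metadata filter restricting results to one topic."""
    if topic not in VALID_TOPICS:
        raise ValueError(f"unknown topic: {topic!r}")
    return {"topic": {"$eq": topic}}
```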
"Documents: {top_docs} \\n\\\n",
"Question: {question}\"\n",
"\n",
"tlm = studio.TLM(quality_preset=\"low\",)\n",
You should explain why you're setting quality_preset="low" here. You should also explain that you can try setting quality_preset="best" to get even more accurate LLM answers along with the trustworthiness scores.
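As a sketch, the prompt string sent to TLM is just the retrieved documents and the user question filled into the template from the diff above (the helper name here is hypothetical):

```python
def build_rag_prompt(top_docs: str, question: str) -> str:
    """Combine retrieved context and the user question into a single TLM prompt."""
    return f"Documents: {top_docs} \nQuestion: {question}"

prompt = build_rag_prompt(
    "Sales principles chunk retrieved from Pinecone...",
    "Tell me about the sales principles at SAP Business One",
)
```

With the real API, this string would then be passed to something like tlm.prompt(prompt), which returns the LLM response along with a trustworthiness score (see Cleanlab's TLM documentation for the exact return shape).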
…for the index upsert/deletion on chunks and how to use Cleanlab Studio and then update the index. Also working TLM examples and added more markdown to explain workflow. Also updated README
Almost all nit comments that you can ignore.
I really like Jonas' comments about showing a more continuous integration of TLM, as opposed to a one-and-done.
"outputs": [],
"source": [
"class PineconeRAGPipeline:\n",
"    def __init__(self, model_name: str = 'paraphrase-MiniLM-L6-v2', index_name: str = 'document-index', cloud: str = 'aws', region: str = 'us-east-1'):\n",
Nit, but I think this is longer than black allows, so I'd line-break it (or whatever the linter suggests).
…anged the handling of the PINECONE_API_KEY env var to prompt the user to provide it if it's not found.
Minor changes to Cleanlab_Pinecone_RAG notebook
…that are higher quality for this workflow
…rating how to set up a reliable RAG system using TLM
…de examples of using TLM for extraction more robust to changing document chunks in new runs of this workflow
Co-authored-by: Jonas Mueller <1390638+jwmueller@users.noreply.github.com>
…e and also how bad chunks and PII chunks are obtained
"metadata": {},
"outputs": [],
"source": [
"%pip install -r requirements.txt\n",
Always install in a different cell than the imports. Also, install cleanlab-studio right after that, before any imports.
"outputs": [],
"source": [
"%pip install -r requirements.txt\n",
"\n",
Double-check that you actually need each of these imports; I'm not sure why there are so many, and so many packages in requirements.txt.
"id": "69f026d6-fc2d-4b9a-ac5d-eb227ec9462b",
"metadata": {},
"source": [
"## Analyze Documents Data\n",
What is the point of this whole length-analysis section? It seems unnecessary, so I recommend cutting it.
"id": "47c191e8",
"metadata": {},
"source": [
"Let's now use Cleanlab's TLM to do zero-shot classification and classify the text (tag) into different topics. We will make use of code from [Cleanlab's TLM Zero-Shot Classification Tutorial](https://help.cleanlab.ai/tutorials/zero_shot_classification/) to do this. This includes the two helper functions `parse_category()` and `classify()` that can be found below."
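A minimal sketch of what a parse_category() helper might do (this is a hypothetical implementation for illustration; the tutorial's actual helper may differ):

```python
def parse_category(response: str, categories: list[str], default: str = "other") -> str:
    """Extract the first known category mentioned in an LLM response (case-insensitive).

    Zero-shot classification via an LLM returns free text, so the raw
    response must be mapped back onto the fixed label set.
    """
    lowered = response.lower()
    for category in categories:
        if category.lower() in lowered:
            return category
    return default
```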
You don't need to write "zero-shot classification" all of the time; you can just write "classification".
Go ahead and merge as soon as you've addressed my comments, so we can get a stable link ASAP. Ask Chris to test-run the merged notebook.
…xample more robust
A notebook showing how to use Pinecone + Cleanlab's TLM to build an accurate, reliable, and trustworthy RAG system. The goal is to have this live in Pinecone's Knowledge Center (or other documentation that is most suitable)!