Added Cleanlab <> Pinecone RAG workflow #19

Merged
merged 39 commits into from
Sep 27, 2024

mturk24
Contributor

@mturk24 mturk24 commented Aug 9, 2024

A notebook showing how to use Pinecone + Cleanlab's TLM to build an accurate, reliable, and trustworthy RAG System. The goal is to have this live in Pinecone's Knowledge Center (or other documentation that is most suitable)!

"metadata": {},
"source": [
"# How to build a reliable, curated, and accurate RAG system using Cleanlab and Pinecone"
]
Member

@jwmueller jwmueller Aug 9, 2024

Curious: how come there is no mention of document chunking? I would've thought the steps are:

  1. chunk documents and embed/ingest chunks into pinecone DB, so there's already a pre-existing DB before Cleanlab enters the picture.
  2. run Studio on every chunk as a separate example (with the text for the chunk coming from a file exported out of pinecone DB)
  3. show how to map these results back to pinecone DB (for instance, delete: PII, non-English, toxic, and extra exact duplicate copies from the DB).

Not sure there's any need to run studio on entire documents at all in this tutorial, WDYT?
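
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the helper names (`chunk_text`, `build_chunk_records`, `ids_to_delete`), the fixed-size character chunking, and the `doc_id#chunkN` ID scheme are all assumptions, and the set of flagged IDs stands in for whatever Cleanlab Studio reports for each chunk.

```python
import hashlib
from typing import Dict, List, Set


def chunk_text(text: str, chunk_size: int = 200) -> List[str]:
    """Split text into fixed-size character chunks (a deliberately simple strategy)."""
    return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]


def build_chunk_records(docs: Dict[str, str], chunk_size: int = 200) -> List[dict]:
    """Give every chunk a stable ID so flagged rows can be mapped back to the DB."""
    records = []
    for doc_id, text in docs.items():
        for n, chunk in enumerate(chunk_text(text, chunk_size)):
            records.append({"id": f"{doc_id}#chunk{n}", "text": chunk})
    return records


def ids_to_delete(records: List[dict], flagged_ids: Set[str]) -> List[str]:
    """Collect flagged chunk IDs plus the extra copies of exact duplicates."""
    seen_hashes = set()
    to_delete = []
    for rec in records:
        digest = hashlib.sha256(rec["text"].encode("utf-8")).hexdigest()
        if rec["id"] in flagged_ids or digest in seen_hashes:
            to_delete.append(rec["id"])
        else:
            # First clean copy of this text: keep it and remember its hash.
            seen_hashes.add(digest)
    return to_delete


# Hypothetical data: "b" duplicates "a", and "c" was flagged (e.g. PII or toxic).
docs = {"a": "hello world", "b": "hello world", "c": "other text"}
records = build_chunk_records(docs, chunk_size=50)
stale = ids_to_delete(records, flagged_ids={"c#chunk0"})
```

With a Pinecone index handle available, the final step would be a call along the lines of `index.delete(ids=stale)`; it is left out here so the sketch runs without credentials.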

Member

I guess you've framed this as a one-time cleanup of documents prior to entering them into the RAG DB. While that's a valid use case, I imagine it's less common than wanting to clean up the existing RAG DB.

Contributor Author

Gotcha, yeah, I thought it would simplify things to do it as a one-time cleanup; a more sophisticated version of the notebook (tutorial) can then clean up an existing RAG DB. But for the purpose of sharing a RAG example use case with Pinecone (and using it internally for RAG demos), the one-time version makes sense to me.

Member

we will do metadata in pt 2 of this notebook

Member

make sure to have a bunch of bad document chunks in the dataset

Member

workflow is:

pinecone DB -> Cleanlab -> better version of pinecone DB

"# Simple query regarding our documents\n",
"question = \"Tell me about the sales principles at SAP Business One\"\n",
"\n",
"top_doc = rag_pipeline.search(question, top_k=1, filter_query={\"topic\": {\"$eq\": \"sales\"}})\n",
Member

for this part, you should write some text before the code cell to justify where the metadata filter is coming from. E.g.:

Suppose our application asks the user which topic their question is about, chosen from a predefined list. Alternatively, we could train a classifier to predict the topic from the question (automatically using Cleanlab Studio AutoML).
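
A small sketch of the first option, where the topic comes from a predefined list and is turned into the metadata filter passed as `filter_query`. The `{"topic": {"$eq": ...}}` shape is taken from the notebook excerpt above; the `ALLOWED_TOPICS` list and the `build_topic_filter` helper are hypothetical names introduced only for illustration.

```python
from typing import Optional

# Hypothetical list of topics offered to the user; the excerpt only shows "sales".
ALLOWED_TOPICS = ("sales", "finance", "hr")


def build_topic_filter(topic: Optional[str]) -> Optional[dict]:
    """Return a metadata filter for the chosen topic, or None to search unfiltered."""
    if topic is None:
        return None
    normalized = topic.strip().lower()
    if normalized not in ALLOWED_TOPICS:
        raise ValueError(f"Unknown topic: {topic!r}")
    return {"topic": {"$eq": normalized}}


# The user picked "Sales" from the list shown by the application.
filter_query = build_topic_filter("Sales")
```

Validating against the list before building the filter avoids silently returning zero results when a topic is misspelled or not present in the index metadata.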

"Documents: {top_docs} \\n\\\n",
"Question: {question}\"\n",
"\n",
"tlm = studio.TLM(quality_preset=\"low\",)\n",
Member

should explain why you're setting low here.

should also explain that:

You can try setting the quality_preset="best" to get even more accurate LLM answers along with the trustworthiness scores.
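
To make the point concrete, here is a hedged sketch of how the answer and its trustworthiness score might be read back. It assumes the TLM client exposes a `prompt()` method returning a dict with `response` and `trustworthiness_score` keys; the prompt template only approximates the fragment visible in the diff, and `answer_with_trust` and `FakeTLM` are hypothetical names. The stub exists so the sketch runs without an API key; in the notebook the client would come from `studio.TLM(quality_preset="low")` or `studio.TLM(quality_preset="best")`.

```python
from typing import Any, Dict, List

# Approximation of the prompt tail shown in the diff; the full prompt is not visible.
PROMPT_TEMPLATE = "Documents: {top_docs} \nQuestion: {question}"


def answer_with_trust(tlm: Any, question: str, top_docs: List[str]) -> Dict[str, Any]:
    """Prompt a TLM-like client and return its answer with the trustworthiness score."""
    prompt = PROMPT_TEMPLATE.format(top_docs="\n".join(top_docs), question=question)
    result = tlm.prompt(prompt)  # assumed to return a dict with the two keys below
    return {"answer": result["response"], "trust": result["trustworthiness_score"]}


class FakeTLM:
    """Offline stand-in for the real client; records the prompt it receives."""

    def __init__(self) -> None:
        self.last_prompt = ""

    def prompt(self, prompt: str) -> Dict[str, Any]:
        self.last_prompt = prompt
        return {"response": "stub answer", "trustworthiness_score": 0.5}


fake = FakeTLM()
out = answer_with_trust(fake, "What are the sales principles?", ["doc one", "doc two"])
```

Because the client is passed in as an argument, switching presets only changes how the client is constructed, not the calling code.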

…for the index upsert/deletion on chunks and how to use Cleanlab Studio and then update the index. Also working TLM examples and added more markdown to explain workflow. Also updated README
@mturk24 mturk24 changed the title Added Cleanlab <> Pinecone RAG notebook so far Added Cleanlab <> Pinecone RAG workflow Aug 15, 2024
Contributor

@nelsonauner nelsonauner left a comment

Almost all nit comments that you can ignore.

I really like Jonas' comments about showing a more continuous integration of TLM as opposed to a one-and-done.

cleanlab_pinecone_rag_workflow/Cleanlab_Pinecone_RAG.ipynb (outdated)
"outputs": [],
"source": [
"class PineconeRAGPipeline:\n",
" def __init__(self, model_name: str = 'paraphrase-MiniLM-L6-v2', index_name: str = 'document-index', cloud: str = 'aws', region: str = 'us-east-1'):\n",
Contributor

nit but I think this is longer than black allows so I'd line break it (or whatever the linter suggests)
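
For reference, Black's default line length is 88 characters, and it would wrap the signature one parameter per line with a trailing comma, roughly as below. The constructor body is not shown in the review excerpt, so storing the arguments as attributes is an assumption added only so the sketch runs.

```python
class PineconeRAGPipeline:
    def __init__(
        self,
        model_name: str = "paraphrase-MiniLM-L6-v2",
        index_name: str = "document-index",
        cloud: str = "aws",
        region: str = "us-east-1",
    ):
        # Assumed body: the real constructor is not visible in this excerpt.
        self.model_name = model_name
        self.index_name = index_name
        self.cloud = cloud
        self.region = region


pipeline = PineconeRAGPipeline()
```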

cwaddingham and others added 6 commits August 21, 2024 14:45
…anged the handling of the PINECONE_API_KEY env var to prompt the user to provide it if it's not found.
Minor changes to Cleanlab_Pinecone_RAG notebook
…rating how to set up a reliable RAG system using TLM
"metadata": {},
"outputs": [],
"source": [
"%pip install -r requirements.txt\n",
Member

always install in a different cell than the imports. Also install cleanlab studio right after that before any imports

"outputs": [],
"source": [
"%pip install -r requirements.txt\n",
"\n",
Member

double check you actually need each of these imports, not sure why there are so many and so many packages in requirements.txt

"id": "69f026d6-fc2d-4b9a-ac5d-eb227ec9462b",
"metadata": {},
"source": [
"## Analyze Documents Data\n",
Member

what is the point of this whole length-analysis section? It seems unnecessary, so I recommend cutting it

"id": "47c191e8",
"metadata": {},
"source": [
"Let's now use Cleanlab's TLM to do zero-shot classification and classify the text (tag) into different topics. We will make use of code from [Cleanlab's TLM Zero-Shot Classification Tutorial](https://help.cleanlab.ai/tutorials/zero_shot_classification/) to do this. This includes the two helper functions `parse_category()` and `classify()` that can be found below."
Member

you don't need to write zero-shot classification all of the time; you can just write classification.

@jwmueller jwmueller self-requested a review September 25, 2024 17:08
Member

@jwmueller jwmueller left a comment

go ahead and merge as soon as you've addressed my comments, so we can get a stable link asap. Ask Chris to test-run the merged notebook

mturk24 and others added 21 commits September 25, 2024 13:55
Co-authored-by: Jonas Mueller <1390638+jwmueller@users.noreply.github.com>
@mturk24 mturk24 merged commit 3f510f8 into main Sep 27, 2024
@mturk24 mturk24 deleted the cleanlab-pinecone-RAG-workflow branch September 27, 2024 22:41
4 participants