Added Cleanlab <> Pinecone RAG workflow #19
Conversation
"metadata": {},
"source": [
"# How to build a reliable, curated, and accurate RAG system using Cleanlab and Pinecone"
]
Curious: how come there is no mention of document chunking? I would've thought the steps are:
- chunk documents and embed/ingest chunks into pinecone DB, so there's already a pre-existing DB before Cleanlab enters the picture.
- run Studio on every chunk as a separate example (with the text for the chunk coming from a file exported out of pinecone DB)
- show how to map these results back to the pinecone DB (for instance, delete PII, non-English, and toxic chunks, plus extra exact-duplicate copies, from the DB).
Not sure there's any need to run studio on entire documents at all in this tutorial, WDYT?
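The chunking/ingestion step described above could be sketched roughly as follows (a hypothetical helper, assuming simple fixed-size character splitting with overlap; real tutorials would likely use a tokenizer-aware splitter):

```python
def chunk_document(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split a document into overlapping fixed-size character chunks.

    Each chunk would then be embedded and upserted into the Pinecone index
    as a separate record, so Cleanlab Studio can later score every chunk
    as its own example.
    """
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```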
I guess you've framed this as a one-time cleanup of documents prior to entering them into the RAG DB. While that's a valid use case, I imagine it's less common than wanting to clean up an existing RAG DB.
Gotcha, yeah. I thought it would simplify things to do it as a one-time cleanup; then, in a more sophisticated version of the notebook (tutorial), we can do it by cleaning up an existing RAG DB. But for the purpose of sharing a RAG example use case with Pinecone (and using it internally for RAG demos), the one-time version makes sense to me.
We will do metadata in part 2 of this notebook.
Make sure to include a bunch of bad document chunks in the dataset.
workflow is:
pinecone DB -> Cleanlab -> better version of pinecone DB
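That loop might look roughly like this (an in-memory dict stands in for the Pinecone index, and the toy flagging rules stand in for the issue columns Cleanlab Studio would actually export; with a real index, the last step would be a call like index.delete(ids=...)):

```python
import re

# In-memory stand-in for a Pinecone index: chunk id -> chunk text.
index = {
    "doc1#0": "Normal sales chunk about SAP Business One.",
    "doc1#1": "Normal sales chunk about SAP Business One.",  # exact duplicate
    "doc2#0": "Call me at 555-123-4567",                     # contains PII
}

def flag_bad_chunks(index: dict) -> set:
    """Toy flagging: exact-duplicate copies (keep the first) and phone-number PII.

    In the real workflow these flags would come from Cleanlab Studio's
    issue columns (PII, non-English, toxic, duplicate, ...).
    """
    flagged, seen = set(), set()
    for chunk_id, text in index.items():
        if text in seen:
            flagged.add(chunk_id)  # duplicate copy
        seen.add(text)
        if re.search(r"\d{3}-\d{3}-\d{4}", text):
            flagged.add(chunk_id)  # PII
    return flagged

bad_ids = flag_bad_chunks(index)
for chunk_id in bad_ids:  # with a real index: index.delete(ids=list(bad_ids))
    del index[chunk_id]
```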
"# Simple query regarding our documents\n",
"question = \"Tell me about the sales principles at SAP Business One\"\n",
"\n",
"top_doc = rag_pipeline.search(question, top_k=1, filter_query={\"topic\": {\"$eq\": \"sales\"}})\n",
For this part, you should write some text before the code cell to justify where the metadata filter is coming from. E.g.:
Suppose our application asks the user which topic their question is about, from a pre-defined list. Alternatively, we could train a classifier to predict the topic from the question (automatically, using Cleanlab Studio AutoML).
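For instance, the filter_query passed to the search above could be built from the user-selected topic like this (a hypothetical helper; the topic list is an assumption for illustration):

```python
VALID_TOPICS = {"sales", "finance", "hr"}  # pre-defined list shown to the user (assumed)

def build_topic_filter(topic: str) -> dict:
    """Construct a Pinecone metadata filter restricting results to one topic."""
    if topic not in VALID_TOPICS:
        raise ValueError(f"unknown topic: {topic!r}")
    return {"topic": {"$eq": topic}}
```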
"Documents: {top_docs} \\n\\\n",
"Question: {question}\"\n",
"\n",
"tlm = studio.TLM(quality_preset=\"low\",)\n",
You should explain why you're setting quality_preset="low" here. You should also explain that you can try setting quality_preset="best" to get even more accurate LLM answers along with the trustworthiness scores.
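As a sketch, the prompt string sent to TLM is just the retrieved documents and the user question filled into the template from the diff above (the helper name here is hypothetical):

```python
def build_rag_prompt(top_docs: str, question: str) -> str:
    """Combine retrieved context and the user question into a single TLM prompt."""
    return f"Documents: {top_docs} \nQuestion: {question}"

prompt = build_rag_prompt(
    "Sales principles chunk retrieved from Pinecone...",
    "Tell me about the sales principles at SAP Business One",
)
```

With the real API, this string would then be passed to something like tlm.prompt(prompt), which returns the LLM response along with a trustworthiness score (see Cleanlab's TLM documentation for the exact return shape).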
…for the index upsert/deletion on chunks and how to use Cleanlab Studio and then update the index. Also working TLM examples and added more markdown to explain workflow. Also updated README
Almost all nit comments that you can ignore.
I really like Jonas' comments about showing a more continuous integration of TLM, as opposed to a one-and-done.
"outputs": [],
"source": [
"class PineconeRAGPipeline:\n",
"    def __init__(self, model_name: str = 'paraphrase-MiniLM-L6-v2', index_name: str = 'document-index', cloud: str = 'aws', region: str = 'us-east-1'):\n",
Nit, but I think this is longer than black allows, so I'd line-break it (or whatever the linter suggests).
…anged the handling of the PINECONE_API_KEY env var to prompt the user to provide it if it's not found.
Minor changes to Cleanlab_Pinecone_RAG notebook
…that are higher quality for this workflow
…rating how to set up a reliable RAG system using TLM
…de examples of using TLM for extraction more robust to changing document chunks in new runs of this workflow
Co-authored-by: Jonas Mueller <1390638+jwmueller@users.noreply.github.com>
…e and also how bad chunks and PII chunks are obtained
"metadata": {},
"outputs": [],
"source": [
"%pip install -r requirements.txt\n",
Always install in a different cell than the imports. Also, install cleanlab-studio right after that, before any imports.
"outputs": [],
"source": [
"%pip install -r requirements.txt\n",
"\n",
Double-check that you actually need each of these imports; I'm not sure why there are so many, and so many packages in requirements.txt.
"id": "69f026d6-fc2d-4b9a-ac5d-eb227ec9462b",
"metadata": {},
"source": [
"## Analyze Documents Data\n",
What is the point of this whole length-analysis section? It seems unnecessary, so I recommend cutting it.
"id": "47c191e8",
"metadata": {},
"source": [
"Let's now use Cleanlab's TLM to do zero-shot classification and classify the text (tag) into different topics. We will make use of code from [Cleanlab's TLM Zero-Shot Classification Tutorial](https://help.cleanlab.ai/tutorials/zero_shot_classification/) to do this. This includes the two helper functions `parse_category()` and `classify()` that can be found below."
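A minimal sketch of what a parse_category() helper might do (this is a hypothetical implementation for illustration; the tutorial's actual helper may differ):

```python
def parse_category(response: str, categories: list[str], default: str = "other") -> str:
    """Extract the first known category mentioned in an LLM response (case-insensitive).

    Zero-shot classification via an LLM returns free text, so the raw
    response must be mapped back onto the fixed label set.
    """
    lowered = response.lower()
    for category in categories:
        if category.lower() in lowered:
            return category
    return default
```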
You don't need to write "zero-shot classification" all of the time; you can just write "classification".
Go ahead and merge as soon as you've addressed my comments, so we can get a stable link ASAP. Ask Chris to test-run the merged notebook.
…xample more robust
A notebook showing how to use Pinecone + Cleanlab's TLM to build an accurate, reliable, and trustworthy RAG system. The goal is to have this live in Pinecone's Knowledge Center (or other documentation that is most suitable)!