This project currently works best with English documents.
This project:
- uses the Pinecone vector database (VDB) and OpenAI's embedding model to turn text into vectors.
- works with any .md file, so it works perfectly with Notion and Obsidian (though for Notion you have to export the pages to .md manually first).
- is the author's practice of the Feynman technique.
- is probably a weaker duplicate of privateGPT and llama_index. If you want a beautifully crafted document query program, you should use llama_index instead of this toy.
- Each markdown file in the target directory is cut into many small chunks using langchain.text_splitter.
- Each chunk is turned into a vector via OpenAI's embedding model (langchain.embeddings.OpenAIEmbeddings).
- The vectors are then uploaded to the Pinecone vector database.
- Queries are also converted to vectors using the same embedding model and sent to Pinecone.
- To retrieve search results, Pinecone compares the query vector with the vectors stored in the database (by cosine similarity).
- The 3 closest results are retrieved and fed into GPT-3 along with the question, and GPT-3 generates an answer in natural language.
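The retrieval steps above can be sketched in plain Python. This is only an illustration, not the code in main.py: `toy_embed` is a bag-of-words stand-in for OpenAI's embedding model, an in-memory list stands in for Pinecone, and every function name here is hypothetical. It shows chunking, cosine similarity, picking the 3 closest chunks, and assembling a prompt for the language model.

```python
import math
import re
from collections import Counter


def split_into_chunks(text, chunk_size=200, overlap=50):
    """Cut text into overlapping character windows (stand-in for a langchain text splitter)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    windows = (text[i:i + chunk_size] for i in range(0, len(text), step))
    return [w for w in windows if w.strip()]


def toy_embed(text):
    """Word counts as a sparse vector. ASSUMPTION: a toy replacement for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either one is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, chunks, k=3):
    """Return the k (score, chunk) pairs most similar to the query, best first."""
    query_vec = toy_embed(query)
    scored = [(cosine_similarity(query_vec, toy_embed(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_prompt(question, retrieved):
    """Combine the retrieved chunks and the question into one prompt string."""
    context = "\n---\n".join(chunk for _, chunk in retrieved)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# Small demonstration with made-up notes.
notes = [
    "The Strange Situation is a procedure devised by Mary Ainsworth.",
    "Pinecone is a vector database.",
    "Cats sleep most of the day.",
    "Attachment theory studies the bond between child and caregiver.",
]
question = "What is the strange situation?"
print(build_prompt(question, top_k(question, notes, k=3)))
```

In the real pipeline the embedding and the similarity search are done by OpenAI and Pinecone; the sketch only makes the data flow visible.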
- add a --help option
- deploy to Streamlit
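For the --help item, one possible approach is Python's standard argparse module, which generates a -h/--help message automatically. The sketch below is hypothetical and not part of the repo: the argument names `folder` and `question` are assumptions that mirror the two positional arguments main.py is called with.

```python
import argparse


def build_parser():
    """Build a parser for the two positional arguments; argparse adds -h/--help itself."""
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Embed the markdown files in a folder and answer a question about them.",
    )
    # ASSUMPTION: argument names are illustrative, not taken from the repo.
    parser.add_argument("folder", help="path of the folder that contains the .md files")
    parser.add_argument("question", help="question to ask about the documents")
    return parser


# Parse an explicit argument list here; a real script would call parse_args() with no arguments.
args = build_parser().parse_args(["markdown_database", "what's the strange situation"])
print(args.folder, "|", args.question)
```

With a parser like this, running the script with --help prints the usage text and exits without touching Pinecone or OpenAI.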
- Prepare the Pinecone and OpenAI API keys:
- To export the Pinecone and OpenAI API keys to the system environment, use
export PINECONE_API_KEY="your_pinecone_api_key"
export OPENAI_API_KEY="your_openai_api_key"
- To check that they have been exported to the system environment, in Python use
import os
os.environ["PINECONE_API_KEY"]
os.environ["OPENAI_API_KEY"]
- If you get a KeyError, restart the terminal (and your IDE if you are using one) once the export is complete, then check again.
- Clone this repo to your local machine
git clone https://github.com/madeyexz/markdown-file-query.git
- Install the dependencies
pip install pinecone langchain tqdm
- Prepare the markdown file(s) and put them in a
FOLDER
(or any name you like, but you have to change the code accordingly). Note that this folder should be in the same directory as main.py.
- If this is your first time querying a certain document, run the main.py program:
python3 main.py "PATH_OF_FOLDER" "QUESTION"
- The query results and the reference GPT used to generate the answer will be saved in answer.txt and contents.txt respectively.
- If you want to query the same batch of documents again, run query_only.py to avoid re-embedding the documents:
python3 query_only.py "QUESTION"
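For the API key step above, a slightly friendlier check than reading os.environ directly (which raises KeyError on the first missing key) is to list every key that is missing. This is a minimal sketch; the helper name `missing_api_keys` is hypothetical and not part of the repo.

```python
import os

# The two environment variables named in the setup steps.
REQUIRED_KEYS = ("PINECONE_API_KEY", "OPENAI_API_KEY")


def missing_api_keys(environ=None):
    """Return the names of required keys that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_KEYS if not env.get(name)]


missing = missing_api_keys()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("Both API keys are set.")
```

If a key is reported missing right after exporting it, restarting the terminal (and the IDE) as described above is the first thing to try.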
- I have a folder called markdown_database which contains a bunch of .md files, and I want to query this database with the question "What's the strange situation":
❯ python3 main.py "markdown_database" "what's the strange situation"
initiating pinecone index...
digesting docs...
uploading datas to pinecone...
 92%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ | 60/65 [00:29<00:02, 1.87it/s]
let's wait for 60 seconds to avoid RateLimitError... (since im not a paid user)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 60/60 [01:00<00:00, 1.00s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 65/65 [01:32<00:00, 1.42s/it]
querying pinecone...
querying gpt...
writing results to answer.txt and contents.txt
done!
the answer to 'what's the strange situation' is: ' The Strange Situation is a standardized procedure devised by Mary Ainsworth in the 1970s to observe attachment security in children within the context of caregiver relationships. It applies to infants between the age of nine and 18 months and involves a series of eight episodes lasting approximately 3 minutes each, whereby a mother, child and stranger are introduced, separated and reunited. The procedure is used to observe the quality of a young child’s attachment to his or her mother, and can also be applied to other attachment figures, such as God, through the use of Emotionally Focused Therapy (EFT) and religious beliefs, such as the saying “there are no atheists in foxholes”.'
- If I want to query the same database again, I can use query_only.py to avoid re-embedding the documents:
❯ python3 query_only.py "Who is Mary Ainsworth?"
connecting to pinecone index...
getting docs
querying pinecone...
querying gpt...
done!
the answer to 'Who is Mary Ainsworth?' is: ' Mary Ainsworth was a developmental psychologist who devised the Strange Situation in the 1970s to observe attachment security in children within the context of caregiver relationships. The Strange Situation involves a series of eight episodes lasting approximately 3 minutes each, whereby a mother, child and stranger are introduced, separated and reunited. Ainsworth is also known for her observation that if you want to see the quality of a young child’s attachment to his or her mother, watch what the child does, not when Mother leaves, but when she returns. She is also known for her research on anxious babies and their inability to use their mothers as a secure base.'
- If you use Pinecone, then whenever you want to query a new document (i.e. create a new database), you should probably create a new Pinecone index (since you don't want answers from the old document), or delete the old index. This is because Pinecone does not support updating the index (yet).
To delete the old index:
python3 delete_pinecone_index.py NAME_OF_INDEX
Huge shout-out to the open-source community for providing straightforward examples and comprehensive tutorials!
- openai-cookbook: using vector database for embeddings search
- Build a Personal Search Engine Web App using Open AI Text Embeddings - Avra
- This project is heavily inspired by hwchase17/notion-qa
- Langchain, a Python library for manipulating LLMs elegantly.