This project provides tools for searching and querying documents from websites and PDF files. Using natural language processing techniques, the tools extract relevant information, summarize content, and answer user queries. The project includes:
- website_search_tools.py: Tool for extracting and querying information from websites.
- pdf_search_tool.py: Tool for extracting and querying information from PDF documents.
- law_tasks.py: Defines tasks for website, web, and PDF searches.
- law_agents.py: Defines agents specialized in legal document searches.
- law_crew.py: Integrates agents and tasks to handle user queries through a streamlined process.
File: website_search_tools.py
The Website Search Tool scrapes website content, stores it, and makes it searchable through natural language queries. Key features include:
- Sitemap Parsing: Fetches and parses the sitemap XML of a website to extract URLs.
- Document Loading: Loads web documents from the extracted URLs.
- Data Storage: Stores website data in a CSV file for persistent storage and future use.
- Document Processing: Converts loaded documents into a format suitable for embedding and searching.
- Text Splitting: Splits documents into smaller chunks for efficient processing.
- Embeddings and Vector Search: Utilizes OpenAI embeddings and FAISS vector search to find relevant documents based on user queries.
- Question Answering: Implements a question-answering chain to provide precise answers from the relevant documents.
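The sitemap parsing step above can be sketched with the Python standard library alone. This is a minimal illustration, not the code in website_search_tools.py: the function name extract_urls and the sample sitemap are assumptions made for the example, and fetching the XML over HTTP is left out.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files (sitemaps.org protocol).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_urls(sitemap_xml: str) -> list[str]:
    """Return every <loc> URL found in a sitemap XML string.

    Hypothetical helper for illustration; the real tool may parse differently.
    """
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Made-up sample sitemap, used only to show the expected input shape.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

urls = extract_urls(SAMPLE_SITEMAP)
print(urls)
```

The extracted URLs would then be passed to the document loading step.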
File: pdf_search_tool.py
The PDF Search Tool searches and queries information from PDF documents. Key features include:
- PDF Reading: Reads and extracts text from PDF files in specified directories.
- Text Extraction: Extracts text from each page of the PDFs.
- Text Splitting: Splits extracted text into manageable chunks for efficient processing.
- Embeddings and Vector Search: Uses OpenAI embeddings and FAISS vector search to find relevant documents based on user queries.
- Question Answering: Implements a question-answering chain to provide precise answers from the relevant documents.
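The text splitting step can be illustrated with a fixed-size splitter that overlaps consecutive chunks, so that a sentence cut at a chunk boundary still appears whole in one of them. This is a sketch under assumptions: the function name split_text, the character-based sizes, and the default values are hypothetical, and pdf_search_tool.py may rely on a library splitter instead.

```python
def split_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters. Hypothetical helper;
    sizes are measured in characters, not tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text,
        # so no trailing chunk made only of overlap is emitted.
        if start + chunk_size >= len(text):
            break
    return chunks


print(split_text("abcdefghij", chunk_size=4, overlap=2))
```

Each chunk would then be embedded and added to the vector index for retrieval.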
File: law_tasks.py
This file defines various tasks related to legal document searches. Key features include:
- Website Search Task: Creates tasks for searching information on websites.
- Web Search Task: Creates tasks for conducting general web searches.
- PDF Search Task: Creates tasks for searching information within PDF documents.
File: law_agents.py
This file defines agents specialized in legal document searches. Key features include:
- PDF Searcher: An agent specialized in scrutinizing legal documents in PDF format.
- Website Searcher: An agent specialized in scouring legal websites and online databases for relevant information.
- Web Searcher: An agent specialized in conducting web searches to retrieve legal information.
File: law_crew.py
This file integrates agents and tasks to handle user queries through a streamlined process. Key features include:
- Initialization: Sets up agents and tasks based on user queries.
- Query Processing: Processes user queries using the integrated agents and tasks.
Clone the Repository:
git clone https://github.com/your-repo/document-search-tools.git
cd document-search-tools
Set Up API Keys: Ensure you have your OpenAI API key set up in your environment:
export OPENAI_API_KEY='your-openai-api-key'
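To fail early with a clear message when the key is missing, a small check like the following could run before any tool is initialized. It is a sketch only: the helper name require_api_key is an assumption and is not part of the project's code.

```python
import os
from collections.abc import Mapping


def require_api_key(
    env: Mapping[str, str] = os.environ,
    name: str = "OPENAI_API_KEY",
) -> str:
    """Return the API key from the environment or raise a clear error.

    Hypothetical helper for illustration; `env` is injectable for testing.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set. Export it before starting the app."
        )
    return key
```

Calling require_api_key() at the start of the app surfaces a missing key immediately, rather than at the first embedding request.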
- Website URLs: Update the url variable in website_search_tools.py with the sitemap URL of the website you want to scrape.
- PDF Directory: Update the directory path in pdf_search_tool.py to point to the location of your PDF files.
- Running the Script: Launch the Streamlit app with the command below. The app uses the CrewAI framework to process user queries.
streamlit run app.py
Sansita (https://github.com/sansistrying)
Palak (https://github.com/palak180)
These tools streamline extracting, storing, and querying information from websites and PDF documents, so that users can find relevant information through natural language queries. The project aims to support productivity and decision-making within the organization.
This project is released under the MIT License.