This tool provides investment insights for angel investors. It uses web scraping and natural language processing to extract key information from startup websites about their offerings and founders.
- Install required packages:
```bash
sudo apt update
sudo apt install python3 python3-pip python3-venv
```
- Install pipenv:
```bash
pip3 install pipenv
```
- Clone the repository:
```bash
git clone https://github.com/ibnbayo/startup-info
cd startup-info
```
- Install dependencies with pipenv:
```bash
pipenv install
```
- Activate the virtual environment:
```bash
pipenv shell
```
- Obtain an OpenAI API key and replace `YOUR_OPENAI_API_KEY` on line 10 of `main.py` with it:
```python
api_key = "sk-..."
```
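If you prefer not to hardcode the key, a common alternative (not how the script is written) is to read it from an environment variable:

```python
import os

# Hypothetical alternative: take the key from the OPENAI_API_KEY
# environment variable instead of editing main.py.
api_key = os.environ["OPENAI_API_KEY"]
```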
- Install requirements:
```bash
pip install -r requirements.txt
```
- Run the script:
```bash
python main.py
```
Requirements:
- Python 3.6+
- OpenAI API key
Features:
- Scrapes homepages and about/team pages to extract raw content
- Leverages OpenAI's GPT-3.5 Turbo model for natural language processing
- Produces structured output containing the company offering, founders, and custom ratings/metrics
- Detailed logging and error handling for robustness
- Easily extensible to scrape additional pages and information
The script accepts a list of domain names to scrape. It visits each domain's home page and about page (if found) and extracts the text content from all `<p>` tags. The extracted text is then sent to the LLM API with a prompt asking for the company's offerings and founder names, and the LLM's response is printed for each domain.
The script runs the following steps:
- Accepts a list of company domains
- Scrapes the home page and about page (if found)
- Extracts text content from `<p>` tags
- Sends the extracted text to the LLM API
- Prints the LLM response for each domain
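A minimal sketch of that loop; `domains`, `scrape_domain()`, and `send_message()` are the names used in the script, but the exact signatures here are assumptions:

```python
# Illustrative driver loop, not the script's exact code.
for domain in domains:
    text = scrape_domain(domain)           # raw <p> text from home/about pages
    if text:
        print(domain, send_message(text))  # LLM response with company info
```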
Domains are hardcoded in the script. Edit the `domains` list to scrape different sites.
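For example (the domain values are placeholders):

```python
# Hardcoded list of sites to scrape; edit to suit.
domains = ["example.com", "example.org"]
```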
The OpenAI API key should be set in the script.
The output will contain the LLM's response for each domain with extracted company info in JSON format.
Any errors during scraping will be logged.
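For illustration only, a response for one domain might look like this (field names and values are assumptions, not the script's exact schema):

```json
{
  "offering": "AI-powered scheduling assistant for small clinics",
  "founders": ["Jane Doe", "John Smith"],
  "ratings": {"team": 4, "market": 3}
}
```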
The prompt sent to the LLM can be customized by editing the `message_log` payload in `send_message()`.
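A sketch of what `send_message()` might look like, assuming the pre-1.0 `openai` Python package; the prompt wording is illustrative, not the script's original:

```python
import openai

openai.api_key = api_key

def send_message(text):
    # message_log is the payload to customize; this prompt
    # wording is an assumption, not the script's original.
    message_log = [
        {"role": "system", "content": "You extract startup info as JSON."},
        {"role": "user", "content": f"List the company's offerings and founder names:\n{text}"},
    ]
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=message_log,
    )
    return response.choices[0].message.content
```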
Scraping logic can be adapted by changing the `scrape_domain()` function.
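A sketch of `scrape_domain()` under the same assumptions (requests plus BeautifulSoup; the `/about` path and error handling are illustrative):

```python
import requests
from bs4 import BeautifulSoup

def scrape_domain(domain):
    # Illustrative version: fetch the home page and an /about page,
    # returning the concatenated text of all <p> tags.
    text = []
    for path in ("", "/about"):
        try:
            resp = requests.get(f"https://{domain}{path}", timeout=10)
            resp.raise_for_status()
        except requests.RequestException:
            continue  # errors would be logged in the real script
        soup = BeautifulSoup(resp.text, "html.parser")
        text.extend(p.get_text(strip=True) for p in soup.find_all("p"))
    return "\n".join(text)
```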
Additional parsing steps can be added to extract and structure data before sending to ChatGPT.
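For instance, a simple pre-processing step (hypothetical, not in the script) could trim long pages so the prompt stays within the model's context window:

```python
def truncate(text, max_chars=8000):
    # Crude guard against oversized prompts; the character limit
    # is an assumed value, not taken from the script.
    return text[:max_chars]
```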
Potential improvements:
- Batch requests to the OpenAI API to avoid getting rate-limited due to free-tier limitations (a minimal sketch appears at the end of this list).
- Scrape dynamic content using a headless browser or JavaScript engine to execute scripts on the page and render HTML content before parsing with BeautifulSoup.
- Extract data from other pages or similar pages with varying names.
- Incorporate data from Crunchbase to get more structured founder, funding, and category data
- Use Google search results to find additional pages and sources about the company and founders
- Access Alexa or SimilarWeb to get traffic and engagement metrics for each site
- Use more advanced NLP techniques like named entity recognition to identify founders, products, etc.
- Build a categorization model to classify companies into sectors like health, edtech, etc.
- Use sentiment analysis to gauge positive/negative language on site as a proxy for company reception
- Expand criteria analysis to include more categories and finer granularity
- Include scoring system that ranks companies across multiple categories and metrics
- Visualize company ratings and extracted information in a dashboard
- Write unit tests for individual modules
- Implement integration testing framework across full data pipeline
- Build repeatable processes to catch site changes and maintain scraper
- Combine internet data with information from parsed pitch decks in a relational database for structured scouting
- Create visual representations (graphs, charts) based on data obtained, aiding investors in quick comprehension
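As noted in the first improvement above, a minimal rate-limiting sketch might reuse `send_message()` from the earlier sketch; this helper is hypothetical, and the delay and retry values are assumptions:

```python
import time

import openai

def send_with_backoff(text, retries=3, delay=20):
    # Naive linear backoff: wait and retry when the API rate-limits us.
    # Retry count and delay are assumed values, not from the script.
    for attempt in range(retries):
        try:
            return send_message(text)
        except openai.error.RateLimitError:
            time.sleep(delay * (attempt + 1))
    raise RuntimeError("Rate limit retries exhausted")
```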