Analyze long-term trends from weekly news publications.
To replicate the environment and support Jupyter notebooks, follow these steps:
```bash
# Install pipenv
pip install pipenv

# Enter the virtual environment
pipenv shell

# Install required packages
pipenv install ipykernel notebook jupyterlab python-dotenv openai feedparser pandas pyarrow tqdm
```
- Create a `.env` file in the project root with your OpenAI API key:

  ```
  OPENAI_API_KEY=your_api_key_here
  ```
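  Inside a notebook, the key can then be picked up with `python-dotenv`. A minimal sketch, assuming the v1 `openai` client (the `client` variable name is illustrative):

  ```python
  import os
  from dotenv import load_dotenv
  from openai import OpenAI

  load_dotenv()  # reads OPENAI_API_KEY from .env into the environment
  client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
  ```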
- To run a notebook:

  ```bash
  pipenv shell
  pipenv run jupyter notebook
  ```
The project includes several scripts for data extraction and processing:
Feed data collection:

- Fetches the initial RSS feed data
- Stores the raw feed data in JSON format for further processing
- Preserves metadata such as title, link, description, and language
- Handles enclosures and optional fields gracefully (see the sketch after this list)
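A minimal sketch of this fetching step, assuming `feedparser` and the feed URL listed under the data sources below; the output path is illustrative:

```python
import json
import feedparser

FEED_URL = "https://pythoninvest.com/rss-feed-612566707351.xml"

feed = feedparser.parse(FEED_URL)
entries = []
for entry in feed.entries:
    entries.append({
        # .get() keeps missing optional fields from raising KeyError
        "title": entry.get("title"),
        "link": entry.get("link"),
        "description": entry.get("description"),
        "language": feed.feed.get("language"),
        "enclosures": entry.get("enclosures", []),
    })

# Illustrative output path for the raw feed dump
with open("data/news_feed_raw.json", "w") as f:
    json.dump(entries, f, indent=2)
```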
Content extraction (`scripts/02_get_content_data_flattened.py`):

- Processes RSS feed entries using the gpt-4o-mini model
- Implements a retry mechanism (3 attempts with 5-second delays) for robust API calls
- Tracks processing time for performance monitoring
- Extracts two types of content:
  - Individual News:
    - Start and end dates
    - Ticker symbol
    - News count
    - Growth percentage
    - News text
  - Market News (1-day and 1-week summaries):
    - Model name
    - Time period
    - News count
    - Market summary text
- Adds a source link to each entry
- Saves the data in a flattened Parquet format with Brotli compression for storage efficiency (see the retry-and-save sketch after this list)
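A rough sketch of the retry and save logic, assuming the v1 `openai` client; the prompt handling and the flattening of the model output into records are elided, and the names are illustrative:

```python
import time
import pandas as pd
from openai import OpenAI

client = OpenAI()

def call_model(prompt, retries=3, delay=5):
    """Call gpt-4o-mini, retrying up to `retries` times with `delay`-second pauses."""
    for attempt in range(1, retries + 1):
        try:
            start = time.time()
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": prompt}],
            )
            print(f"processed in {time.time() - start:.1f}s")  # processing-time tracking
            return response.choices[0].message.content
        except Exception as exc:
            if attempt == retries:
                raise
            print(f"attempt {attempt} failed ({exc}); retrying in {delay}s")
            time.sleep(delay)

records = []  # placeholder: flattened news entries built from the model output
df = pd.DataFrame(records)
df.to_parquet("data/news_feed_flattened.parquet", compression="brotli")
```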
To run the content extraction:

```bash
python scripts/02_get_content_data_flattened.py
```
The project implements text search capabilities using `minsearch`, allowing efficient search across all data fields:
- `type`: News entry type (individual/market)
- `start_date` & `end_date`: Time period of the news
- `ticker`: Company/stock ticker symbol
- `count`: Number of news items
- `growth`: Growth percentage
- `text`: Main news content
- `model`: Model name for market summaries
- Full-text search across all fields
- Field boosting (prioritizes matches in important fields):
  - text (3x boost)
  - type and ticker (2x boost)
  - growth and model (1.5x boost)
  - other fields (1x boost)
- Link-based filtering for source tracking (see the sketch after this list)
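A sketch of how `search_news` could be wired together, assuming the `minsearch` `Index` API (`fit`, then `search` with `boost_dict` and `filter_dict`); the defaults mirror the boosts listed above:

```python
import pandas as pd
from minsearch import Index

# Load the flattened news and cast to strings so every field is full-text searchable
documents = (
    pd.read_parquet("data/news_feed_flattened.parquet")
    .astype(str)
    .to_dict(orient="records")
)

index = Index(
    text_fields=["type", "start_date", "end_date", "ticker",
                 "count", "growth", "text", "model"],
    keyword_fields=["link"],  # exact-match field used for link-based filtering
)
index.fit(documents)

DEFAULT_BOOST = {
    "text": 3,
    "type": 2, "ticker": 2,
    "growth": 1.5, "model": 1.5,
    # remaining fields fall back to a 1x weight
}

def search_news(query, link=None, boost_dict=None, num_results=10):
    filter_dict = {"link": link} if link else {}
    return index.search(
        query,
        filter_dict=filter_dict,
        boost_dict=boost_dict or DEFAULT_BOOST,
        num_results=num_results,
    )
```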
Example usage in notebooks:

```python
# Basic search
results = search_news("technology growth")

# Search with link filtering
results = search_news("market analysis", link="specific_url")

# Custom field boosting
custom_boost = {
    "ticker": 3,
    "text": 2,
    "type": 1,
}
results = search_news("AAPL earnings", boost_dict=custom_boost)
```
RSS feed with news (mostly weekly, though some weeks are missing), covering around 46 weeks (roughly one year) of data:
- RSS Feed URL: https://pythoninvest.com/rss-feed-612566707351.xml
- This represents the weekly financial news feed section of the website: https://pythoninvest.com/#weekly-fin-news-feed
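To take a quick look at the feed before running the pipeline, e.g.:

```python
import feedparser

feed = feedparser.parse("https://pythoninvest.com/rss-feed-612566707351.xml")
print(len(feed.entries))  # roughly one entry per week, so ~46 expected
```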
The processed data is saved in Parquet format with Brotli compression for efficient storage and fast read performance. The data structure is as follows:
- Individual News Entries:

  ```
  {
    "type": "individual",
    "start_date": "date",
    "end_date": "date",
    "ticker": "symbol",
    "count": number,
    "growth": percentage,
    "text": "news content",
    "link": "source_url"
  }
  ```
- Market News Entries:

  ```
  {
    "type": "market_[period]",  # period can be "1day" or "1week"
    "end_date": "date",
    "start_date": "date",
    "ticker": "multiple_tickers",
    "count": number,
    "model": "model_name",
    "text": "market summary",
    "link": "source_url"
  }
  ```
The data is saved to `data/news_feed_flattened.parquet`. Brotli is used for its superior compression ratio combined with good decompression speed, which makes it a good fit for this kind of textual data.
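Reading the file back with pandas is straightforward (`pyarrow` decompresses Brotli transparently); the filter values below follow the schema above:

```python
import pandas as pd

df = pd.read_parquet("data/news_feed_flattened.parquet")

# Split by entry type, following the schema above
individual = df[df["type"] == "individual"]
weekly_market = df[df["type"] == "market_1week"]
```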