Long-Term News LLM RAG

Analyze long-term trends from weekly news publications.

Replicate

To replicate the environment and support Jupyter notebooks, follow these steps:

# Install pipenv
pip install pipenv

# Enter the virtual environment
pipenv shell

# Install required packages
pipenv install ipykernel notebook jupyterlab python-dotenv openai feedparser pandas pyarrow tqdm

Environment Setup

Create a .env file in the project root with your OpenAI API key:

OPENAI_API_KEY=your_api_key_here
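A notebook or script can then read the key at startup. The following is a minimal sketch, assuming the code runs from the project root and that python-dotenv (part of the install list above) is used to load the file; `get_openai_api_key` is a hypothetical helper name, not something the project defines.

```python
import os

# python-dotenv's load_dotenv() copies variables from a .env file into
# os.environ. The import is guarded so the helper still works when the key
# is exported in the shell instead of stored in .env.
try:
    from dotenv import load_dotenv

    load_dotenv(".env")  # path relative to the current working directory
except ImportError:
    pass


def get_openai_api_key(env=None):
    """Return the OpenAI key, or fail early with a readable message.

    Hypothetical helper for illustration; not part of the project.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key or key == "your_api_key_here":
        raise RuntimeError("OPENAI_API_KEY is not set; add it to the .env file")
    return key
```

Failing early here avoids discovering a missing key only after the first API call.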

To run a notebook:

pipenv shell
pipenv run jupyter notebook

Scripts

The project includes several scripts for data extraction and processing:

1. RSS Feed Data Collection (`scripts/01_get_rss_data.py`)

- Fetches the initial RSS feed data
- Stores the raw feed data in JSON format for further processing
- Preserves metadata such as title, link, description, and language
- Handles enclosures and optional fields gracefully
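The collection step can be sketched roughly as follows. This is an illustration only: it parses a made-up inline feed with the standard library's `xml.etree.ElementTree` so that it runs without network access, whereas the actual script presumably relies on feedparser (it is in the install list). The function names and the output layout are assumptions.

```python
import json
import xml.etree.ElementTree as ET

# Made-up feed used only for this sketch.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>https://example.com/</link>
    <description>Sample channel</description>
    <language>en</language>
    <item>
      <title>Week 1</title>
      <link>https://example.com/week-1</link>
      <description>First weekly summary</description>
      <enclosure url="https://example.com/week-1.png" type="image/png" length="0"/>
    </item>
    <item>
      <title>Week 2</title>
      <link>https://example.com/week-2</link>
    </item>
  </channel>
</rss>"""


def text_or_none(parent, tag):
    """Return the text of a child element, or None when the field is absent."""
    node = parent.find(tag)
    return node.text if node is not None else None


def parse_feed(xml_text):
    """Collect channel metadata and entries into a JSON-serializable dict."""
    channel = ET.fromstring(xml_text).find("channel")
    feed = {
        name: text_or_none(channel, name)
        for name in ("title", "link", "description", "language")
    }
    feed["entries"] = []
    for item in channel.findall("item"):
        entry = {
            name: text_or_none(item, name)
            for name in ("title", "link", "description")
        }
        # Enclosures are optional; an item without any gets an empty list.
        entry["enclosures"] = [dict(e.attrib) for e in item.findall("enclosure")]
        feed["entries"].append(entry)
    return feed


feed = parse_feed(SAMPLE_RSS)
raw_json = json.dumps(feed, indent=2)  # what would be written to disk
```

Missing optional fields become `None` instead of raising, which is one way to read "handles optional fields gracefully".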

2. Content Data Extraction (`scripts/02_get_content_data_flattened.py`)

- Processes RSS feed entries using the gpt-4o-mini model
- Implements a retry mechanism (3 attempts with 5-second delays) for robust API calls
- Tracks processing time for performance monitoring
- Extracts two types of content:
  1. Individual News:
     - Start and end dates
     - Ticker symbol
     - News count
     - Growth percentage
     - News text
  2. Market News (1-day and 1-week summaries):
     - Model name
     - Time period
     - News count
     - Market summary text
- Adds the source link to each entry
- Saves the data in a flattened Parquet format with Brotli compression for compact storage
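The retry and timing behaviour described above can be sketched like this. It is a minimal illustration, assuming any exception counts as a retryable failure; `call_with_retry` and its parameters are hypothetical names, and the script's real implementation may differ.

```python
import time


def call_with_retry(func, *args, attempts=3, delay_seconds=5,
                    sleep=time.sleep, **kwargs):
    """Call func, retrying on any exception.

    Returns (result, elapsed_seconds). The last failure is re-raised once
    all attempts are used up. `sleep` is injectable so tests can skip the wait.
    """
    started = time.perf_counter()
    for attempt in range(1, attempts + 1):
        try:
            result = func(*args, **kwargs)
            return result, time.perf_counter() - started
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            sleep(delay_seconds)
```

In the extraction script, `func` would be whatever function sends one feed entry to the model; the elapsed time returned alongside the result covers all attempts, including the waits.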

To run the content extraction:

python scripts/02_get_content_data_flattened.py

Search Functionality

The project implements text search capabilities using minsearch, allowing efficient search across all data fields:

Searchable Fields

- `type`: News entry type (individual/market)
- `start_date` and `end_date`: Time period of the news
- `ticker`: Company/stock ticker symbols
- `count`: Number of news items
- `growth`: Growth percentage
- `text`: Main news content
- `model`: Model name for market summaries

Search Features

- Full-text search across all fields
- Field boosting (prioritizes matches in important fields):
  - text (3x boost)
  - type and ticker (2x boost)
  - growth and model (1.5x boost)
  - other fields (1x boost)
- Link-based filtering for source tracking

Example usage in notebooks:

# Basic search
results = search_news("technology growth")

# Search with link filtering
results = search_news("market analysis", link="specific_url")

# Custom field boosting
custom_boost = {
    "ticker": 3,
    "text": 2,
    "type": 1
}
results = search_news("AAPL earnings", boost_dict=custom_boost)
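To make the effect of a boost dictionary concrete, here is a toy scorer. It is not minsearch's actual scoring and not the project's `search_news`; it counts query-term matches per field and multiplies each field's count by its boost. The documents and the `boosted_score` name are made up for this illustration.

```python
def boosted_score(query, doc, boost_dict=None, default_boost=1.0):
    """Sum per-field term matches, each weighted by that field's boost."""
    boost_dict = boost_dict or {}
    terms = query.lower().split()
    score = 0.0
    for field, value in doc.items():
        words = str(value).lower().split()
        hits = sum(words.count(term) for term in terms)
        score += boost_dict.get(field, default_boost) * hits
    return score


# Made-up documents for illustration.
docs = [
    {"ticker": "AAPL", "type": "individual",
     "text": "quarterly earnings beat expectations"},
    {"ticker": "MSFT", "type": "individual",
     "text": "AAPL and MSFT earnings compared"},
]
custom_boost = {"ticker": 3, "text": 2, "type": 1}
scores = [boosted_score("AAPL earnings", d, custom_boost) for d in docs]
```

Without boosts both documents match two terms and tie; with the ticker boosted to 3, the document whose `ticker` field is AAPL ranks first.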

Data

Input Data

An RSS feed of news, published mostly weekly with some weeks missing, covering around 46 weeks (roughly one year) of data:

RSS Feed URL: https://pythoninvest.com/rss-feed-612566707351.xml
This represents the weekly financial news feed section of the website: https://pythoninvest.com/#weekly-fin-news-feed
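Because some weeks are missing, it can help to list the gaps before analyzing trends. The sketch below assumes each feed entry has already been reduced to a `datetime.date`; `missing_weeks` is a hypothetical helper and the dates are made up.

```python
from datetime import date, timedelta


def missing_weeks(dates):
    """Return the Mondays of weeks with no entry between the first and last date."""
    if not dates:
        return []
    # Normalize every date to the Monday of its week.
    seen = {d - timedelta(days=d.weekday()) for d in dates}
    current, last = min(seen), max(seen)
    gaps = []
    while current < last:
        current += timedelta(weeks=1)
        if current not in seen:
            gaps.append(current)
    return gaps


# Made-up publication dates: two weeks in between have no entry.
published = [date(2024, 1, 1), date(2024, 1, 10), date(2024, 1, 29)]
gaps = missing_weeks(published)
```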

Output Data

The processed data is saved in Parquet format with Brotli compression for efficient storage and fast read performance. The data structure is as follows:

Individual News Entries:

{
    "type": "individual",
    "start_date": "date",
    "end_date": "date",
    "ticker": "symbol",
    "count": number,
    "growth": percentage,
    "text": "news content",
    "link": "source_url"
}

Market News Entries:

{
    "type": "market_[period]",  # period can be "1day" or "1week"
    "end_date": "date",
    "start_date": "date",
    "ticker": "multiple_tickers",
    "count": number,
    "model": "model_name",
    "text": "market summary",
    "link": "source_url"
}

The data is saved to data/news_feed_flattened.parquet. Brotli is used because it generally achieves a high compression ratio on text-heavy data while keeping decompression reasonably fast.
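As a rough sketch of the flattening step, the function below turns one extracted entry into rows shaped like the two record types above. The nested input layout (`individual_news`, `market_news` keyed by period) is an assumption made for this example, since the model's raw output format is not documented here. `save_rows` uses pandas' `DataFrame.to_parquet` with `compression="brotli"`, which requires pyarrow.

```python
def flatten_entry(extracted, link):
    """Turn one extracted feed entry into flat rows with a source link.

    `extracted` is a hypothetical nested dict; the real structure may differ.
    """
    rows = []
    for news in extracted.get("individual_news", []):
        rows.append({"type": "individual", **news, "link": link})
    for period, summary in extracted.get("market_news", {}).items():
        rows.append({"type": f"market_{period}", **summary, "link": link})
    return rows


def save_rows(rows, path="data/news_feed_flattened.parquet"):
    """Write rows to Parquet with Brotli compression (needs pandas and pyarrow)."""
    import pandas as pd  # imported lazily so flatten_entry has no dependencies

    pd.DataFrame(rows).to_parquet(path, compression="brotli")


# Made-up values for illustration.
extracted = {
    "individual_news": [
        {"start_date": "2024-01-01", "end_date": "2024-01-07", "ticker": "XYZ",
         "count": 3, "growth": 1.5, "text": "sample news"},
    ],
    "market_news": {
        "1day": {"start_date": "2024-01-07", "end_date": "2024-01-07",
                 "ticker": "multiple_tickers", "count": 10,
                 "model": "model_name", "text": "sample daily summary"},
        "1week": {"start_date": "2024-01-01", "end_date": "2024-01-07",
                  "ticker": "multiple_tickers", "count": 40,
                  "model": "model_name", "text": "sample weekly summary"},
    },
}
rows = flatten_entry(extracted, link="https://example.com/week-1")
```

The saved file can be read back with `pandas.read_parquet("data/news_feed_flattened.parquet")`; columns that exist only on one record type (`growth`, `model`) are empty for rows of the other type.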
