Always the same boring English datasets.
Out of curiosity, and as an avid reader of Le Monde, here is a dataset
collected from my favorite newspaper: one year of coverage of the Ukraine invasion (Feb 24, 2022 -> 2023), along with the tools used to build it.
You might want to check the subsequent analysis I made of this data in the sibling repo, or access this rendered version.
Important: the data is collected and shared by me for educational & research purposes only; premium articles (subscriber-only) have been truncated to their first 2,500 characters.
Download /dataset (compressed Parquet, 40 MB)
236k comments and their associated articles (2k unique), with title, content (truncated if premium), desc & date.
- Articles are truly about the Ukraine war, not a simple mention, thanks to a prior filter on article tags.
- Live and Blog article types are not collected; all other types are (Edito, etc.).
- Article authors (journalists) are purposely not collected.
- No distinction is made between comments and replies to comments.
- No comment timestamps; only the associated article's (last) publication date.
Custom API (lmd_ukr/api.py)
- To be seen as a good, but not top-tier (i.e., scalable, etc.), "one-shot-project" API, shared "as is".
- Le Monde does not offer a public API, unlike the New York Times ;)
- Personal (subscriber) credentials are required, because comments are subscriber-only.
- Built using httpx for requests & selectolax for parsing.
- API & usage examples with caching are available in lmd_ukr/examples; some in-code documentation is included (rate limits, etc.).
- Check out lmd_ukr/build_sqlite_dataset.py and build_parquet_dataset.ipynb.
- Parsed data is populated into an SQLite db with two tables, articles and comments, sharing the key article_id.
- This was optional, but I wanted to refresh my SQL skills, and it allows removing duplicates when building the db.
- Formatting / cleaning using Polars; I wanted to benchmark it vs. Pandas (cf. notebook).
- The final file is a joined articles-comments (tidy) Parquet file.
I created and shared this dataset for educational purposes only. I just wanted a French dataset, if possible from my favorite newspaper, on a topic I'm following daily, instead of exploring the same boring English datasets we're used to. It could be used for various natural language processing tasks:
- Topic modeling
- Troll detection (though in my opinion there are not enough fields for this)
- Generate summaries or headlines for articles (and compare to "desc" for instance)
- Trend analysis & various generative tasks of your choice
- (...)