Welcome to the repository containing the Jupyter notebooks and findings from our research paper on Ukrainian news classification. Dive in to discover our methodology, key findings, and comparisons of several pretrained models for the Ukrainian language. The experiments were conducted by Stepan Tytarenko, whose XLM-R-based solution won the in-class competition.
In the vast expanse of natural language processing, languages like Ukrainian face a pressing issue: the scarcity of labeled datasets. This paper presents a low-overhead approach to dataset creation, setting the stage for Ukrainian news classification.
- ukr-RoBERTa, ukr-ELECTRA, and XLM-R are the crème de la crème of the models tested.
- XLM-R is the go-to for short texts once the training set is large (top F1 of 0.909).
- ukr-RoBERTa is a beacon for shorter sequences on the small set and takes the top spot on the large full-text setting.
- NB-SVM baseline? A dark horse: on the large full-text setting it scores 0.910, within a few points of the transformers (see the sketch below).
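For readers unfamiliar with the baseline, here is a minimal NB-SVM sketch in the spirit of Wang & Manning (2012): features are scaled by the naive Bayes log-count ratio before fitting a linear SVM per class. The TF-IDF bigram features, hyperparameters, and one-vs-rest formulation are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def log_count_ratio(X, y_bin, alpha=1.0):
    # Smoothed per-feature sums for the positive and negative class.
    p = alpha + np.asarray(X[y_bin == 1].sum(axis=0)).ravel()
    q = alpha + np.asarray(X[y_bin == 0].sum(axis=0)).ravel()
    # Log-ratio of the normalized count vectors (Wang & Manning, 2012).
    return np.log((p / p.sum()) / (q / q.sum()))

class NBSVM:
    """One-vs-rest NB-SVM: scale features by the NB log-count ratio,
    then fit a linear SVM per class on the scaled features."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.models_ = []
        for c in self.classes_:
            y_bin = (y == c).astype(int)
            r = log_count_ratio(X, y_bin)
            svm = LinearSVC(C=1.0).fit(X.multiply(r).tocsr(), y_bin)
            self.models_.append((r, svm))
        return self

    def predict(self, X):
        # Pick the class whose scaled-feature SVM is most confident.
        scores = np.column_stack([
            svm.decision_function(X.multiply(r).tocsr())
            for r, svm in self.models_
        ])
        return self.classes_[scores.argmax(axis=1)]

# Usage (illustrative): `texts` and `labels` would come from the corpus.
# vec = TfidfVectorizer(ngram_range=(1, 2), min_df=3)
# X = vec.fit_transform(texts)
# clf = NBSVM().fit(X, np.array(labels))
```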
Our experiments span four settings:
- Small training set, titles only 📃
- Small training set, full texts 📜
- Large training set, titles only 📃📃
- Large training set, full texts 📜📜
Models were put to the test under the same budget: a 24-hour window on a single P100 GPU. A minimal sketch of such a fine-tuning setup follows.
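The sketch below uses Hugging Face `transformers`; the checkpoint name, label count, column names, and hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "xlm-roberta-base"  # swap in ukr-RoBERTa, ukr-ELECTRA, etc.

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=5)  # num_labels: the corpus's topic count (assumed)

def tokenize(batch):
    # Truncate full articles to the model's window; titles fit easily.
    return tokenizer(batch["text"], truncation=True, max_length=512)

args = TrainingArguments(
    output_dir="ukr-news-clf",
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    num_train_epochs=3,
)

# `train_ds` / `eval_ds` would be `datasets.Dataset` objects built from the
# corpus, with "text" and "label" columns (column names assumed).
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds.map(tokenize, batched=True),
#                   eval_dataset=eval_ds.map(tokenize, batched=True))
# trainer.train()
```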
All values are F1 scores.

| Model | Short texts / small training set | Long texts / small training set | Short texts / large training set | Long texts / large training set |
| --- | --- | --- | --- | --- |
| NB-SVM baseline | 0.533 | 0.900 | 0.708 | 0.910 |
| mBERT | 0.790 | 0.910 | 0.675 | 0.907 |
| Slavic BERT | 0.636 | 0.907 | 0.620 | 0.940 |
| ukr-RoBERTa | 0.853 | 0.948 | 0.903 | 0.950 |
| ukr-ELECTRA | 0.685 | 0.950 | 0.745 | 0.948 |
| XLM-R | 0.840 | 0.915 | 0.909 | 0.915 |
🏆 Note: ukr-RoBERTa takes the gold with an F1 score of 0.950 on the large full-text setting; ukr-ELECTRA matches that score on the small full-text setting.
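For context, a table cell can be reproduced as below; the toy topic labels are made up, and whether the paper averages F1 per class (macro) or by support (weighted) is an assumption here.

```python
from sklearn.metrics import f1_score

# Toy labels standing in for gold / predicted topics on one split.
y_true = ["politics", "sport", "sport", "economy", "politics"]
y_pred = ["politics", "sport", "economy", "economy", "politics"]
print(f"F1: {f1_score(y_true, y_pred, average='macro'):.3f}")
```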
- mBERT & Slavic BERT: not the stars of the show when it comes to F1 scores.
- ukr-RoBERTa: climbs the ranks everywhere, especially on short-text terrain with the small training set.
- ukr-ELECTRA: shines on full texts but stumbles on short ones.
- XLM-R: reigns supreme on short texts with the large training set, but trails the monolingual Ukrainian models on full texts.
Want the dataset? Fetch it on Kaggle.
If our work aids your research, show some love with a citation:
D. Panchenko et al. (2021). Ukrainian News Corpus As Text Classification Benchmark.