🖋 Authors: Eshwaran Venkat, under the guidance of Jennifer Zhu
Supervised Learning meets Media Analysis: Simple Topic Classification to Explore Bias in News Coverage. Final project for UC Berkeley MIDS 266 (Natural Language Processing with Deep Learning). See Course Repository
Our study introduces an approach to analyzing news content, centered on a topic classifier built from the extensive All The News v2 dataset. Our methodology progresses from baseline classifiers to more advanced models, culminating in a fine-tuned BERT classifier that categorizes news articles into distinct topics such as 'Sports' and 'Finance', based on textual features and news metadata.
This classifier is augmented with sentiment analysis and other indicators for a supplemental exploration of media bias, aiming to delineate its various manifestations. The core of our research is robust topic classification, with media bias analysis providing additional insights. We have made the code, notebooks, models, and the newly generated topic-classified dataset publicly available. The new dataset is listed as All The News v2.1 on Kaggle, and a fine-tuned BERT classifier for it is also available online.
- Project Report: Download PDF
- Presentation: Download PDF
- GitHub: cricksmaidiene/snowplough
- Kaggle: Coming Soon
- Hugging Face: Coming Soon
AllTheNews is a popular dataset of news articles, available in two versions (1 and 2).
- Version 2.0 has 2.7 million articles from a number of sources.
- It is a published dataset that is readily downloadable.
- The date range of articles is from January 1, 2016 to April 2, 2020.
- The only metadata available is the article `title`, `publication`, `section`, `author`, `date`, and `content`. We use a subset of these as labels for our classifiers.
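
To make the field list above concrete, here is a minimal sketch of reading article rows and pairing each title with a label field. It assumes a CSV export whose header uses the six field names listed above; the sample rows, the publication name, and the `read_labeled_rows` helper are made up for illustration and are not taken from the project notebooks.

```python
import csv
import io

# Field names as listed above. The sample rows below are invented placeholders.
FIELDS = ["title", "publication", "section", "author", "date", "content"]

SAMPLE_CSV = """title,publication,section,author,date,content
Markets rally on rate cut,Example Daily,Business News,Jane Doe,2019-08-01,Stocks rose sharply on Thursday.
Local team wins final,Example Daily,,John Roe,2018-05-12,The home side won in extra time.
"""


def read_labeled_rows(handle, label_field="section"):
    """Yield (title, label) pairs, skipping rows whose label is empty."""
    reader = csv.DictReader(handle)
    missing = set(FIELDS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    for row in reader:
        label = (row[label_field] or "").strip()
        if label:  # unlabeled rows cannot serve as supervised examples
            yield row["title"], label


pairs = list(read_labeled_rows(io.StringIO(SAMPLE_CSV)))
print(pairs)
```

Rows with an empty label are dropped here because they cannot be used as supervised examples; passing `label_field="author"` would select a different label from the same metadata.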
NB Order Number | Notebook | Section | Description |
---|---|---|---|
01 | Ingest Dataset | Ingestion | Ingests the All The News v2 dataset into a Delta Lake table. |
02 | Exploratory Data Analysis | Analysis | Performs exploratory data analysis on the All The News v2 dataset for Summary Statistics. |
03 | Word Counts & Sentiments Processor | Engineering | Transformation layer that adds word count fields and sentiment score fields per article |
04 | Sentiment Analysis | Analysis | Looks at descriptive statistics on sentiment scores across articles, publications and authors to find signals for bias |
05 | News Section Analysis | Analysis | Explores newspaper sections for topic-level coalescing and assignment |
06 | Topic Processor | Engineering | Transformation layer that adds topic fields per article using a topic lexicon, and performs additional processing |
07 | Topic & Author Analysis | Analysis | Explores the newly created topic labels and how they interact with author distribution and slants |
08 | Standard Classification Models | Machine Learning | Comprehensive set of non-neural network models for Topic & Optional Author classification - Random Forests, Logistic Regression, & Naive Bayes |
09 | Neural Network Classifiers | Machine Learning | Bi-Directional LSTM and CNN networks are trained for classification of news topics from news titles |
10 | BERT Simple Classifier | Machine Learning | A model that minimally fine-tunes a pre-trained BERT Model to classify news topics |
11 | BERT Complex Classifier | Machine Learning | A model that adds LSTM and CNN layers on top of a pre-trained BERT model to train the classifier |
12 | Bias Analysis | Analysis | Systematically performs a simple bias analysis on newly labeled topics and sentiments on the news data |
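
As an illustration of the lexicon-based labeling step described for notebook 06, the sketch below maps a free-text section name to a coarse topic. The lexicon contents, the `assign_topic` helper, and the `"Other"` fallback are assumptions made for this example; the actual lexicon and processing live in the project notebooks and may differ.

```python
import re

# Hypothetical lexicon: topic -> whole words expected in section names.
TOPIC_LEXICON = {
    "Sports": {"sports", "football", "nba"},
    "Finance": {"business", "markets", "finance", "money"},
}


def assign_topic(section, lexicon=TOPIC_LEXICON, default="Other"):
    """Return the first topic whose keywords share a word with the section name."""
    # Tokenize into lowercase words so that, for example, "transportation"
    # does not match a keyword through a shared substring.
    tokens = set(re.findall(r"[a-z]+", (section or "").lower()))
    for topic, keywords in lexicon.items():
        if tokens & keywords:
            return topic
    return default


print(assign_topic("Business News"))
print(assign_topic("Transportation"))
```

Matching on whole words instead of substrings is a deliberate choice in this sketch: it trades recall (plural or inflected forms must be listed explicitly) for fewer accidental matches.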