Snowplough 🏂

🖋 Author: Eshwaran Venkat, under the guidance of Jennifer Zhu

Supervised Learning meets Media Analysis: Simple Topic Classification to Explore Bias in News Coverage. Final project for UC Berkeley MIDS 266 (Natural Language Processing with Deep Learning). See Course Repository

About 📰

Our study introduces an approach to analyzing news content, centered on a topic classifier built on the extensive All The News v2 dataset. Our methodology progresses from baseline classifiers to more advanced models, culminating in a fine-tuned BERT classifier that categorizes news articles into distinct topics such as 'Sports' and 'Finance', based on textual features and news metadata.

This classifier is augmented with sentiment analysis and other indicators for a supplemental exploration of media bias, aiming to delineate its various manifestations. The core of our research is the topic classification; the media bias analysis provides additional insights. We’ve made the code, notebooks, models, and newly generated (topic-classified) dataset publicly available. The new dataset is listed as All The News v2.1 on Kaggle, and the fine-tuned BERT classifier trained on it is also available online.


Data 📇

AllTheNews

AllTheNews is a popular dataset of news articles, available in two versions: 1 and 2.

  • Version 2.0 has 2.7 million articles from a number of sources.
  • It is a published dataset that is readily downloadable.
  • The date range of articles is from January 1, 2016 to April 2, 2020.
  • The only metadata available is the article title, publication, section, author, date, and content. We use a subset of these as labels for our classifiers.
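As a rough illustration of the metadata and date range described above, the following sketch reads article rows from a CSV and keeps those dated within the stated range. The column names simply mirror the field names listed above and are an assumption; the real file's header may differ. The sample rows are invented for the example.

```python
import csv
import io
from datetime import date, datetime

# Assumed column names, taken from the metadata fields listed above.
# The actual CSV header of the dataset may use different names.
FIELDS = ["title", "publication", "section", "author", "date", "content"]

# Date range stated for Version 2.0 of the dataset.
START, END = date(2016, 1, 1), date(2020, 4, 2)


def load_articles(handle):
    """Yield article rows whose date falls within the stated range."""
    for row in csv.DictReader(handle):
        try:
            published = datetime.strptime(row["date"][:10], "%Y-%m-%d").date()
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with a missing or malformed date
        if START <= published <= END:
            yield {field: row.get(field) or "" for field in FIELDS}


# Invented sample data, standing in for the real CSV file.
sample = io.StringIO(
    "title,publication,section,author,date,content\n"
    "Markets rally,Example Times,Business,A. Writer,2019-03-01,Stocks rose.\n"
    "Old story,Example Times,World,B. Writer,2015-12-31,Out of range.\n"
    "No date,Example Post,,C. Writer,,Missing date.\n"
)
articles = list(load_articles(sample))
print(articles[0]["title"], "|", articles[0]["section"])
```

With the real dataset, `sample` would be replaced by an open file handle; streaming rows this way avoids holding all 2.7 million articles in memory at once.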

Notebooks 📙

| NB Order Number | Notebook | Section | Description |
| --- | --- | --- | --- |
| 01 | Ingest Dataset | Ingestion | Ingests the All The News v2 dataset into a Delta Lake table. |
| 02 | Exploratory Data Analysis | Analysis | Performs exploratory data analysis on the All The News v2 dataset to produce summary statistics. |
| 03 | Word Counts & Sentiments Processor | Engineering | Transformation layer that adds word count and sentiment score fields per article. |
| 04 | Sentiment Analysis | Analysis | Looks at descriptive statistics on sentiment scores across articles, publications, and authors to find signals of bias. |
| 05 | News Section Analysis | Analysis | Explores newspaper sections for topic-level coalescing and assignment. |
| 06 | Topic Processor | Engineering | Transformation layer that adds topic fields per article using a topic lexicon, and performs additional processing. |
| 07 | Topic & Author Analysis | Analysis | Explores the newly created topic labels and how they interact with author distribution and slants. |
| 08 | Standard Classification Models | Machine Learning | Comprehensive set of non-neural-network models for topic and optional author classification: Random Forests, Logistic Regression, and Naive Bayes. |
| 09 | Neural Network Classifiers | Machine Learning | Trains bidirectional LSTM and CNN networks to classify news topics from news titles. |
| 10 | BERT Simple Classifier | Machine Learning | A model that minimally fine-tunes a pre-trained BERT model to classify news topics. |
| 11 | BERT Complex Classifier | Machine Learning | A model that adds LSTM and CNN layers on top of a pre-trained BERT model to train the classifier. |
| 12 | Bias Analysis | Analysis | Systematically performs a simple bias analysis on the newly labeled topics and sentiments in the news data. |
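To give a feel for the lexicon-based labeling described for notebooks 05 and 06, here is a minimal sketch that maps a newspaper section name to a topic by keyword overlap. The lexicon contents, the `assign_topic` helper, and the `"Other"` fallback are hypothetical and are not taken from the project's notebooks; only the topic names 'Sports' and 'Finance' come from the text above.

```python
import re

# Hypothetical lexicon: topic name -> keywords that may appear in a section name.
# The project's actual lexicon is not reproduced here.
TOPIC_LEXICON = {
    "Sports": {"sports", "football", "tennis"},
    "Finance": {"finance", "business", "markets", "economy"},
}


def assign_topic(section, lexicon=TOPIC_LEXICON, default="Other"):
    """Return the topic whose keywords overlap most with the section name."""
    tokens = set(re.findall(r"[a-z]+", (section or "").lower()))
    best_topic, best_hits = default, 0
    for topic, keywords in lexicon.items():
        hits = len(tokens & keywords)
        if hits > best_hits:  # ties keep the earlier topic
            best_topic, best_hits = topic, hits
    return best_topic


for section in ["Business News", "Sports - Football", "Lifestyle", None]:
    print(repr(section), "->", assign_topic(section))
```

Sections that match no keyword fall back to the default label, so unlabeled articles can be inspected or excluded before training a classifier.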