This project classifies pairs of news articles as 'agreed', 'disagreed', or 'unrelated', with the aim of identifying fake news. The repository contains the code for training and evaluating several machine learning and deep learning models on a dataset of news article pairs.
The dataset is a collection of news article pairs. Each pair is labeled 'agreed', 'disagreed', or 'unrelated' according to how the content of the two articles relates.
We employ several models for this task:
- Logistic Regression: A baseline model for classification.
- Random Forest: An ensemble learning method for classification.
- Neural Network with GRU: A deep learning approach using Gated Recurrent Units (GRU).
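
As a rough illustration of the GRU model, the sketch below builds a small Keras classifier with three output classes. The vocabulary size, sequence length, layer widths, and the choice to feed each article pair as a single sequence of integer token ids are all assumptions made for this example; the repository's actual architecture may differ.

```python
import numpy as np
import tensorflow as tf

# Hypothetical hyperparameters, chosen only for illustration.
VOCAB_SIZE = 10_000   # number of distinct token ids
MAX_LEN = 100         # tokens per (padded) article pair
NUM_CLASSES = 3       # agreed, disagreed, unrelated


def build_gru_classifier(vocab_size=VOCAB_SIZE, max_len=MAX_LEN,
                         num_classes=NUM_CLASSES):
    """Embedding -> GRU -> softmax over the three labels."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(max_len,), dtype="int32"),
        tf.keras.layers.Embedding(vocab_size, 64),
        tf.keras.layers.GRU(64),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    # Labels are assumed to be integer-encoded (0, 1, 2).
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model


model = build_gru_classifier()

# Random token ids, used only to check the input and output shapes.
dummy_batch = np.random.randint(0, VOCAB_SIZE, size=(2, MAX_LEN))
probabilities = model.predict(dummy_batch, verbose=0)
print(probabilities.shape)  # (2, 3): one probability per class
```

Training would then call `model.fit` on tokenized, padded article pairs with integer labels.
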
The project requires the following:
- Python 3.x
- Pandas
- NumPy
- Scikit-Learn
- NLTK
- TensorFlow
- Matplotlib
- Seaborn
To install the required packages, run the following command:
pip install pandas numpy scikit-learn nltk tensorflow matplotlib seaborn
- Data Preprocessing:
  - Load the dataset.
  - Clean and preprocess the text data.
- Feature Extraction:
  - Convert text data into numerical form using TF-IDF Vectorization.
- Model Training:
  - Train the Logistic Regression and Random Forest models.
  - Construct and train the Neural Network with GRU.
- Evaluation:
  - Evaluate the models on a test set.
  - Generate classification reports.
- Prediction on Unseen Data:
  - Use the trained models to predict labels on new data.
  - Output the results to a CSV file.
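
The steps above can be sketched end to end for the two classical models. This is a minimal example under stated assumptions, not the repository's actual scripts: the column names `text1`, `text2`, and `label`, the `clean` and `combine` helpers, and the handful of made-up rows standing in for train.csv and test.csv are all hypothetical.

```python
import re

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline


def clean(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def combine(df: pd.DataFrame) -> pd.Series:
    """Join both articles of each pair into one cleaned string."""
    return (df["text1"] + " " + df["text2"]).map(clean)


# Made-up rows standing in for train.csv (column names are assumptions).
train = pd.DataFrame({
    "text1": ["Mayor opens new bridge", "City budget rises",
              "Vaccine is safe, study says", "Team wins the final",
              "Rain expected on Friday", "Stocks fall sharply"],
    "text2": ["New bridge opened by mayor", "Budget for the city goes up",
              "Study says vaccine is not safe", "Team loses the final",
              "Recipe for apple pie", "Actor releases new film"],
    "label": ["agreed", "agreed", "disagreed", "disagreed",
              "unrelated", "unrelated"],
})
# Made-up rows standing in for test.csv.
test = pd.DataFrame({
    "text1": ["Mayor opens new park", "Stocks rise sharply"],
    "text2": ["New park opened by mayor", "Recipe for lemon cake"],
    "label": ["agreed", "unrelated"],
})

# Each pipeline applies TF-IDF vectorization, then a classifier.
models = {
    "logistic_regression": make_pipeline(
        TfidfVectorizer(), LogisticRegression(max_iter=1000)),
    "random_forest": make_pipeline(
        TfidfVectorizer(),
        RandomForestClassifier(n_estimators=100, random_state=0)),
}

for name, model in models.items():
    model.fit(combine(train), train["label"])
    predicted = model.predict(combine(test))
    print(name)
    print(classification_report(test["label"], predicted, zero_division=0))

# Write one model's predictions to a CSV file.
results = test.assign(
    predicted=models["logistic_regression"].predict(combine(test)))
results.to_csv("results.csv", index=False)
```

Wrapping the vectorizer and classifier in one pipeline means the TF-IDF vocabulary is fitted on the training text only and then reused unchanged for the test set.
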
- `train.csv` and `test.csv`: The training and testing datasets.
- `train_data` and `test_data`: Python scripts for training and testing the models.
- `results.csv`: The output file with predictions on the test set.
The models are evaluated based on their accuracy and loss. The results are visualized using Matplotlib.
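
One possible way to draw these curves is sketched below. The helper `plot_history` and the output file name are hypothetical; the function assumes a dictionary shaped like the `history` attribute of the object returned by Keras `model.fit`, with keys such as `accuracy`, `val_accuracy`, `loss`, and `val_loss`.

```python
import matplotlib.pyplot as plt


def plot_history(history: dict, path: str = "training_curves.png") -> str:
    """Plot accuracy and loss per epoch side by side and save the figure."""
    fig, (ax_acc, ax_loss) = plt.subplots(1, 2, figsize=(10, 4))

    # Plot whichever of the training/validation series are present.
    for key, label in (("accuracy", "train"), ("val_accuracy", "validation")):
        if key in history:
            ax_acc.plot(history[key], label=label)
    for key, label in (("loss", "train"), ("val_loss", "validation")):
        if key in history:
            ax_loss.plot(history[key], label=label)

    ax_acc.set(title="Accuracy", xlabel="Epoch", ylabel="Accuracy")
    ax_loss.set(title="Loss", xlabel="Epoch", ylabel="Loss")
    ax_acc.legend()
    ax_loss.legend()

    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)
    return path
```

With a Keras model, this could be called as `plot_history(model.fit(...).history)`.
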