Due to the rapid development of the internet in recent years, news from all over the world is transmitted ever faster through news websites, social media, blogs, etc. However, this also makes it easier for fake news to spread, which can sometimes trigger mass hysteria and have a negative impact on society.
In this paper, we research and build machine learning models that can distinguish fake news from real news. First, we introduce two datasets, one of English and one of Vietnamese articles. Second, we use different algorithms to train a model on each dataset. Finally, we compare the results and draw conclusions. Specifically, we found that the English test using TF-IDF vectorization with a Passive-Aggressive Classifier produced the highest accuracy, 96.12%.
Read our report here
This experiment is split into an English test and a Vietnamese test.
For the English test, we use a Kaggle dataset together with other datasets built from large news websites. The Kaggle dataset, from 2017, contains 45,000 articles evenly divided between real and fake: the real news comes from the Reuters website, while the fake news comes from websites deemed unreliable by Wikipedia. The other datasets come from sources such as CNN, BBC, FOX, etc., but are similar in date range and in the distribution of real and fake articles. Together with the Kaggle dataset, this collection contains nearly 90,000 articles.
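As an illustration, a balanced dataset like this can be loaded and labeled with pandas. The file names Fake.csv and True.csv and the text column follow the common layout of the 2017 Kaggle dataset and are assumptions here, not the exact repo code:

```python
import pandas as pd

# assumed layout: one CSV of fake articles and one of real articles
fake = pd.read_csv('Fake.csv')
real = pd.read_csv('True.csv')
fake['label'] = 'FAKE'
real['label'] = 'REAL'

# combine into a single labeled dataframe
df = pd.concat([fake, real], ignore_index=True)
texts, labels = df['text'], df['label']
```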
For the Vietnamese test, we use the Vietnamese Fake News Dataset (VFND). It contains over 200 news articles collected between 2017 and 2019, cross-referenced against several sources and classified by the community.
All data are included in this repo.
For inference, run main.py.
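For reference, here is a minimal sketch of the inference flow with the saved English model; the helper function predict_news and the sample input are illustrative, not part of the repo:

```python
from joblib import load

# load the trained TF-IDF vectorizer and classifier saved by the training scripts
tfidf_vectorizer = load('models/tfidf.joblib')
model = load('models/model.joblib')

def predict_news(text):
    # vectorize the raw text with the fitted vectorizer, then classify it
    features = tfidf_vectorizer.transform([text])
    return model.predict(features)[0]

print(predict_news("Some article text to check"))
```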
All other files are for training. main.py loads the trained models and vectorizers:

- In main.py:

```python
model_viet = load('models/model_viet.joblib')
model = load('models/model.joblib')
tfidf_vectorizer_viet = load('models/tfidf_viet.joblib')
tfidf_vectorizer = load('models/tfidf.joblib')
```

To retrain the models, uncomment the save lines in each training script:

- In english_test.py:

```python
# save vectorizers
dump(tfidf_vectorizer, 'models/tfidf.joblib')
dump(count_vectorizer, 'models/count.joblib')
# save models
dump(model, 'models/model.joblib')
dump(model, 'models/modelc.joblib')
```

- In vietnamese_test.py:

```python
# save vectorizer
dump(tfidf_vectorizer, 'models/tfidf_viet.joblib')
# save model
dump(model, 'models/model_viet.joblib')
```

- In english_Word2vec_PAC.py:

```python
w2v_model = Word2Vec(sentences=x, vector_size=1, window=5, min_count=1)
# save Word2Vec model
w2v_model.save('models/w2v.model')
# save classifier
dump(model, 'models/model_w2v.joblib')
```
english_test.py: uses CountVectorizer and TfidfVectorizer, and trains a PassiveAggressiveClassifier (see the sketch below)
vietnamese_test.py: similar to english_test.py, but applies the Underthesea tokenizer (for Vietnamese) before vectorization
english_Word2vec_PAC.py: similar to english_test.py, but uses Word2Vec features instead of TF-IDF
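For illustration, here is a minimal sketch of the TF-IDF + PassiveAggressiveClassifier pipeline used in english_test.py; the variable names, split ratio, and hyperparameters are assumptions rather than the exact repo code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# texts: list of article bodies, labels: 'REAL'/'FAKE' (hypothetical inputs)
x_train, x_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2)

tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)
tfidf_train = tfidf_vectorizer.fit_transform(x_train)  # fit on training text only
tfidf_test = tfidf_vectorizer.transform(x_test)

model = PassiveAggressiveClassifier(max_iter=50)
model.fit(tfidf_train, y_train)
print(accuracy_score(y_test, model.predict(tfidf_test)))
```

For the Vietnamese pipeline, the only extra step is tokenizing with Underthesea before vectorization, for example:

```python
from underthesea import word_tokenize

# format='text' joins tokens with spaces (compound words get underscores),
# so the output can be fed straight into a vectorizer
tokenized = [word_tokenize(doc, format='text') for doc in viet_texts]
```

For the Word2Vec variant, one common way to get a fixed-size document vector is to average the word vectors before training the classifier; this aggregation step is an assumption, and a typical vector_size of 100 is shown here instead of the repo's vector_size=1:

```python
import numpy as np
from gensim.models import Word2Vec

sentences = [doc.split() for doc in x_train]  # whitespace-tokenized documents
w2v_model = Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=1)

def doc_vector(tokens):
    # average the vectors of in-vocabulary words; zero vector if none are known
    vecs = [w2v_model.wv[t] for t in tokens if t in w2v_model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v_model.vector_size)

features = np.vstack([doc_vector(tokens) for tokens in sentences])
```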