Fake-News

Fake News analysis in R Script: EDA & Prediction (Naive Bayes, Random Forest, SVM, NNET).

Code:

https://github.com/trajceskijovan/Fake-News/blob/main/Fake%20News.R

Presentation:

https://github.com/trajceskijovan/Fake-News/blob/main/Presentation.pdf

EDA:

  1. Top words in the "Title" field (only words longer than 5 characters are included):

  2. Merge the fake and true news datasets and plot article counts over time:

  • Fake news is more frequent in 2016, 2017, and the first half of 2018.
  • In Q4 2018, fake and true news are balanced.

  3. Are the datasets balanced?

The merged dataset is balanced, which will make prediction easier.

  4. There are 631 missing values in the "Text" column. They are removed.

  5. News count by subject, and subject-by-category plots (a minimal EDA sketch follows this list):
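
A minimal sketch of these EDA checks, assuming the fake and true sets are loaded from CSV files with title, text, and subject columns (the file names and column names here are assumptions, not taken from the repository):

```r
# EDA sketch: merge the two sets, check class balance, drop empty text,
# and plot article counts by subject and category.
library(dplyr)
library(ggplot2)

fake <- read.csv("Fake.csv", stringsAsFactors = FALSE)
true <- read.csv("True.csv", stringsAsFactors = FALSE)

# Merge the two sets and label the category
news <- bind_rows(
  mutate(fake, category = "fake"),
  mutate(true, category = "true")
)

# Is the merged dataset balanced?
table(news$category)

# Drop rows with an empty "text" field (the source reports 631 such rows)
news <- filter(news, nchar(trimws(text)) > 0)

# News count by subject and category
news %>%
  count(subject, category) %>%
  ggplot(aes(x = reorder(subject, n), y = n, fill = category)) +
  geom_col(position = "dodge") +
  coord_flip() +
  labs(x = "Subject", y = "Number of articles")
```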

Pre-processing & data cleanup steps are outlined below:

Create a corpus (the object type expected by the "tm" library)

Convert text to lower case

Remove numbers

Remove punctuation

Remove stopwords

Remove specific words (for example, the newspaper name "Reuters", which appears in every article)

Remove whitespace

I decided not to stem words, since stemming truncates and alters words, and they can lose their meaning

Remove remaining punctuation issues (for example, via the "[[:punct:]]" pattern)

Lemmatization

Create a Document-Term Matrix with a control list (for example, wordLengths = c(5, 20))

I enforced lower and upper limits on the length of the included words (between 5 and 20 characters) to speed up processing and eliminate noise

After that, I removed all terms whose sparsity exceeds the threshold (sparse = 0.85). Sparsity dropped from 100% to 77%, and the maximum term length dropped from 20 to 14 characters

Convert the DTM to a matrix and then to a data frame
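
A hedged sketch of this cleaning and DTM pipeline, using the tm and textstem packages; the input object (the news data frame from the EDA sketch above) and the exact ordering of steps are assumptions rather than a copy of the repository's Fake News.R:

```r
# Cleaning pipeline sketch with tm + textstem
library(tm)
library(textstem)

corpus <- VCorpus(VectorSource(news$text))

corpus <- tm_map(corpus, content_transformer(tolower))        # lower case
corpus <- tm_map(corpus, removeNumbers)                       # remove numbers
corpus <- tm_map(corpus, removePunctuation)                   # remove punctuation
corpus <- tm_map(corpus, removeWords, stopwords("english"))   # remove stopwords
corpus <- tm_map(corpus, removeWords, c("reuters"))           # remove the newspaper name
corpus <- tm_map(corpus, stripWhitespace)                     # remove extra whitespace
corpus <- tm_map(corpus, content_transformer(lemmatize_strings))  # lemmatize instead of stemming

# Document-Term Matrix keeping only words between 5 and 20 characters
dtm <- DocumentTermMatrix(corpus, control = list(wordLengths = c(5, 20)))

# Drop very sparse terms, then convert to a data frame for modeling
dtm.clean <- removeSparseTerms(dtm, sparse = 0.85)
news.df <- as.data.frame(as.matrix(dtm.clean))
```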

Post-processing analysis:

Return all terms that occur at least 20,000 times in the entire corpus:

findFreqTerms(dtm.clean,lowfreq=20000)

[1] "american" "campaign" "clinton" "country" "donald" "election" "government" "house"
[9] "include" "obama" "official" "party" "people" "president" "report" "republican" [17] "right" "state" "trump" "unite" "white"

Correlation limit inspection and associations among "Trump", "Obama", "Russia", "State":
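
A minimal sketch of that association check with tm's findAssocs, assuming the lowercase terms stored in dtm.clean; the 0.25 correlation limit is an illustrative value, not necessarily the one used in the script:

```r
# Terms correlated with the four inspected words, above the chosen limit
findAssocs(dtm.clean, terms = c("trump", "obama", "russia", "state"), corlimit = 0.25)
```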

WordClouds:
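
As an illustration, the word clouds could be generated from the cleaned DTM along these lines with the wordcloud package (the color palette, seed, and word limit are arbitrary choices, not taken from the repository):

```r
# Word cloud sketch from overall term frequencies in dtm.clean
library(wordcloud)

freq <- sort(colSums(as.matrix(dtm.clean)), decreasing = TRUE)
set.seed(123)
wordcloud(words = names(freq), freq = freq, max.words = 100,
          colors = brewer.pal(8, "Dark2"))
```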

Modeling, Prediction, Performance:

  1. Naive Bayes Model (a minimal modeling and evaluation sketch follows this list)

  2. Logistic Regression Model

  3. Random Forest Model

  4. SVM

  5. NNET

  6. Evaluation: ROC

  7. Evaluation: Confusion Matrix

  8. Evaluation: Summary Table
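
As a rough illustration of the modeling and evaluation workflow for one of the models above (Naive Bayes), a hedged sketch using e1071, caret, and pROC is shown below; the split ratio, seed, and positive-class label are assumptions, not the repository's exact settings.

```r
# Modeling/evaluation sketch: Naive Bayes on the document-term data frame
library(e1071)
library(caret)
library(pROC)

# Attach the class label (assumes news.df rows align with the cleaned news data)
news.df$category <- factor(news$category)

set.seed(123)
train.idx <- createDataPartition(news.df$category, p = 0.8, list = FALSE)
train <- news.df[train.idx, ]
test  <- news.df[-train.idx, ]

# Fit and predict
nb.model <- naiveBayes(category ~ ., data = train)
nb.pred  <- predict(nb.model, newdata = test)

# Confusion matrix, accuracy, and F1 score
confusionMatrix(nb.pred, test$category, positive = "fake", mode = "everything")

# ROC curve and AUC from posterior probabilities of the "fake" class
nb.prob <- predict(nb.model, newdata = test, type = "raw")[, "fake"]
roc.obj <- roc(response = test$category, predictor = nb.prob, levels = c("true", "fake"))
plot(roc.obj)
auc(roc.obj)
```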