A news classifier in the fight against disinformation by HUANG, Sheng & LI, Yik Wai. 🤝
- Group project for COMP3359 Artificial Intelligence Applications @ HKU 🏫
🏷️ "If you tell a lie big enough and keep repeating it, people will eventually come to believe it. The lie can be maintained only for such time as the State can shield the people from the political, economic and/or military consequences of the lie. It thus becomes vitally important for the State to use all of its powers to repress dissent, for the truth is the mortal enemy of the lie, and thus by extension, the truth is the greatest enemy of the State."
Joseph Goebbels, Reich Minister of Propaganda, Nazi Germany
What is our relationship with the truth (or, with reality)? 😕 That is a philosophical question. 🤓
Great minds struggle with this question.
Young Sheldon Cooper struggled with this question and wanted to switch his major to philosophy. But in the end, he returned to science when he realized that physics theories could explain more patterns in nature.
As fake news spreads wildly today, we find ourselves in a similar crisis. 😱 With so much information out there, how can you tell what is real and what is not? 😨 Especially when the cost of reading a news article is so low and the cost of verifying its facts so high, how can anyone judge the authenticity of news content?
Indeed we CAN find linguistic patterns in news articles. 😃 These patterns may well correlate with the realness or fakeness of these articles. This may be related to the fact that a lot of fake news originates from authoritarian governments and malign forces that don't usually give their writers complete journalism training. But it's not that simple. 🙃 Remember that news media have biases. For example, in the United States, conservative media such as Fox News and liberal media such as MSNBC and CNN have different styles when reporting news 🥶, but that doesn't automatically mean one style equals fake news reporting. (Check out Media Bias Ratings) To avoid such biases when training our model, we recognize news articles that cannot be easily categorized as real or fake, such as pieces that are heavy on opinion. (Check out the types we have here 👈)
We concede that this approach is still flawed, as discussed in Limitations. 😑 However, it can give us a reference point when judging the authenticity of an article, as we have given explicit descriptions of how we categorize the articles. 🙂 Still, users should keep in mind that we are not the arbiter of truth and that the model cannot replace the work of a professional fact checker - it cannot visit the places where events happened; it cannot interview the people involved in the stories; it cannot know the intention of the publishers when they put out the story... 🙃
THERE IS NO ALGORITHM FOR TRUTH.
It's highly 🔝 recommended to run the app on a Unix-like system (GNU/Linux, macOS, ...).
git clone https://github.com/vicw0ng-hk/fake-real-news.git
Or, clone through SSH for better security. 🔐
git clone git@github.com:vicw0ng-hk/fake-real-news.git
Or, clone with GitHub CLI
gh repo clone vicw0ng-hk/fake-real-news
Due to its large size, our model is stored with Git LFS. Because of GitHub's bandwidth limit 🚧, please use this link 👈 to download app/model/model.pkl and replace the file in the cloned directory.
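Once downloaded, you can sanity-check that the real model file (and not a leftover Git LFS pointer stub) is in place. A minimal sketch, assuming the model is a standard Python pickle - the path constant and helper name below are ours, not part of the app:

```python
import pickle
from pathlib import Path

MODEL_PATH = Path("app/model/model.pkl")  # location in the cloned repo

def load_model(path: Path = MODEL_PATH):
    """Unpickle the classifier; a Git LFS pointer stub will fail here,
    since a pointer file is plain text, not a valid pickle stream."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

If this raises `pickle.UnpicklingError`, the file is likely still the LFS pointer and needs to be replaced with the downloaded model.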
This may be different depending on the virtualization technology you are using 🤷, but generally do
cd app/
pip3 install -r requirements.txt
python3 app.py
Check out the Methodology document.
Check out the Functionalities document.
- One major limitation comes from the categorization of our dataset. 1️⃣ Our dataset is single-label, which does not accord with reality. For example, many conspiracies are highly political, so a lot of the articles with the `conspiracy` tag could also fit the `political` tag. Because of this feature of the dataset, training accuracy has not been very high for some of the test cases, and the model is susceptible to overfitting if we train too hard for higher accuracy, which is why we chose to present the predictions in the app as probabilities. (Check out Functionalities)
- Another limitation is our development time and resources. 2️⃣ We have a very large dataset (Check out Methodology), but we cannot make full use of it because our time is limited and the resources allocated by GPU Farm are relatively restrictive compared to the size of our dataset. Hence, we trained our model on only a portion of the total data.
- There is also the limitation of the capabilities of machines. 3️⃣ We can only use the content (plus its URL, Title and Authors) to decide the categorization of new articles. Humans could easily tell the nature and authenticity of some articles based on common sense and general knowledge, but the model cannot think that way, so evidence that is easy for a human to recognize can be difficult for the model to find.
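The probability-based presentation mentioned in the first limitation can be sketched with a toy text classifier. This is illustrative only - the texts, tags, and pipeline below are stand-ins, not our actual dataset or model:

```python
# Illustrative only: a classifier reporting per-tag probabilities
# instead of forcing a single hard label onto each article.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "officials confirmed the budget figures on the record",
    "the city council approved the measure after a public hearing",
    "a secret cabal is hiding the truth about the election",
    "shadow elites secretly control every government on earth",
]
labels = ["reliable", "reliable", "conspiracy", "conspiracy"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

# predict_proba exposes one probability per class, which suits
# borderline articles better than a single hard prediction
probs = clf.predict_proba(["leaked plot secretly controls the election"])[0]
for tag, p in zip(clf.classes_, probs):
    print(f"{tag}: {p:.2f}")
```

Reporting the full distribution lets users see when an article sits between tags, which is exactly the situation a single-label dataset hides.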
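The model input described in the third point - article content plus its URL, Title and Authors - can be assembled along these lines. This helper is hypothetical; the app's real preprocessing may weight or tokenize these fields differently:

```python
def build_model_input(url: str, title: str, authors: list[str], content: str) -> str:
    """Concatenate metadata fields and the article body into one string.

    Hypothetical sketch: the actual app may treat each field
    separately rather than joining them into a single text.
    """
    return " ".join([url, title, ", ".join(authors), content])

sample = build_model_input(
    "https://example.com/story",
    "Example Headline",
    ["A. Writer"],
    "Body of the article...",
)
print(sample)
```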
In addition to the restrictions of GNU Affero General Public License v3.0 of this repo, you also agree to the following terms and conditions:
YOUR USE OF THIS WEB APP CONSTITUTES YOUR AGREEMENT
TO BE BOUND BY THESE TERMS AND CONDITIONS OF USE.
1. The classification of the text you submit to this
web app is in no way legal recognition. The web app
and/or its authors bear no legal responsibilities for
its result. If you choose to publish the result, the
web app and/or its authors shall not bear any legal
consequences relating to this action.
2. You shall bear the legal responsibility for
the copyright of the text you submit to this web
app. You must obtain the right to copy the text
before you submit it to the web app.
3. This web app shall not be used by any political
organization and/or any entity, partially or entirely,
directly or indirectly, funded and/or controlled by a
political organization in any jurisdiction.
4. In case of any discrepancy with any other licenses,
terms or conditions associated with this web app
and/or its repository, this agreement shall prevail.