Every day there are thousands of notable events around the globe: protests, market dips, terrorist attacks, and more.
The question is: do global events influence how Wikipedia articles are edited?
UNBIASED is a tool that helps moderators and researchers leverage open data to understand, and further research, patterns in Wikipedia edit contributions.
| Source | Size | Update Frequency | Location |
|---|---|---|---|
| GDELT (Global Database of Events, Language, and Tone) | 6+ TB | 15 minutes | Public S3 |
| Wikipedia Metadata | ~500 GB | Varies | Private S3 |
GDELT:
The GDELT Project monitors the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, themes, sources, emotions, counts, quotes, images, and events driving our global society every second of every day, creating a free open platform for computing on the entire world.

Wikipedia Metadata:
Historical and current dumps of English Wikipedia metadata, including the edits, commit messages, user IDs, and timestamp of every edit to each Wikipedia article.
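For orientation, here is a minimal sketch of pulling one GDELT export into Spark. The bucket key and options are illustrative assumptions, not the project's configuration; the real ingestion lives in src/dataingestion and src/processor.

```python
# Sketch: load a single GDELT export from the public S3 bucket.
# The key below is a hypothetical example; GDELT publishes updates every
# 15 minutes, and the project's scraper builds the real URL list.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gdelt-ingest-sketch").getOrCreate()

# GDELT exports are tab-separated with no header row.
events = (
    spark.read
    .option("sep", "\t")
    .option("header", "false")
    .csv("s3a://gdelt-open-data/events/20190601.export.csv")  # hypothetical key
)

print(events.count())
```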
| Entity | Purpose | Instance Type |
|---|---|---|
| AWS S3 | Raw Data Storage | - |
| AWS EC2 | Spark Cluster, Decompressor | Master: 1 x m5a.large; Workers: 5 x m5a.large |
| AWS EC2 | TimescaleDB | 1 x m5.xlarge |
| AWS EC2 | Web App | 1 x t3.large |
| AWS EC2 | Airflow Scheduler | 1 x m5.large |
- Splitting, keyword generation, and binning.
- Fuzzy pattern matching (see the sketch after this list).
- Data modeling.
- Query processing optimization.
- Database parameter optimization.
- PySpark tuning.
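To make the fuzzy pattern matching step concrete, here is a minimal sketch using Python's standard difflib to score GDELT actor names against Wikipedia article titles. The sample names and the 0.85 cutoff are hypothetical, not the project's actual matcher.

```python
# Sketch: fuzzy-match GDELT actor names to Wikipedia article titles.
# Sample data and the similarity threshold are illustrative only.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two case-folded strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

actors = ["BARACK OBAMA", "EUROPEAN UNION"]
titles = ["Barack Obama", "European Union", "Barack, Saint-Lo"]

for actor in actors:
    best = max(titles, key=lambda t: similarity(actor, t))
    if similarity(actor, best) >= 0.85:  # hypothetical cutoff
        print(f"{actor} -> {best}")
```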
/
│
├── assets
│ ├── logo.png
│ ├── pipeline.png
│ └── dataingestion
│
├── src
│ │
│ ├── dataingestion
│ │ ├── scraper.py
│ │ ├── scraperModules
│ │ │ ├── __init__.py
│ │ │ ├── linkGenerator.py
│ │ │ └── fileWriter.py
│ │ ├── lists
│ │ │ ├── current_urls.txt
│ │ │ └── historic_urls.txt
│ │ └── runScrapper.sh
│ │
│ ├── decompressor
│ │ └── decompressor.sh
│ │
│ ├── processor
│ │ ├── dbWriter.py
│ │ ├── wikiScraper.py
│ │ ├── gdeltProc.py
│ │ ├── gdeltModules
│ │ │ ├── __init__.py
│ │ │ ├── eventsProcessor.py
│ │ │ ├── geographiesProcessor.py
│ │ │ ├── mentionsProcessor.py
│ │ │ └── typeCaster.py
│ │ ├── wikiModules
│ │ │ ├── __init__.py
│ │ │ ├── metaProcessor.py
│ │ │ └── tableProcessor.py
│ │ ├── gdelt_run.sh
│ │ └── wiki_run.sh
│ │
│ ├── frontend
│ │ ├── __init__.py
│ │ ├── application.py
│ │ ├── appModules
│ │ │ ├── __init__.py
│ │ │ ├── dbConnection.py
│ │ │ └── dataFetch.py
│ │ ├── requirements.txt
│ │ ├── queries
│ │ │ ├── articleQuery.sql
│ │ │ └── scoreQuery.sql
│ │ └── assets
│ │ ├── layout.css
│ │ ├── main.css
│ │ └── logo.png
│ │
│ └── airflow
│ └── dag.py
│
├── License.md
├── README.md
├── config.ini
└── .gitignore
- Setup AWS Cluster
  Follow the instructions below, link by link, to set up a cluster and spin up the instances listed above under Architectural Components:
  a. https://blog.insightdatascience.com/simply-install-spark-cluster-mode-341843a52b88
  b. https://blog.insightdatascience.com/how-to-access-s3-data-from-spark-74e40e0b2231
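  Once the cluster is up, Spark needs credentials to reach S3. Below is a minimal sketch, assuming the hadoop-aws (s3a) connector is on the classpath and credentials live in environment variables; both are assumptions, not recorded project settings.

```python
# Sketch: point a SparkSession at S3 via the s3a connector.
# Assumes hadoop-aws is installed and AWS credentials are exported
# as environment variables; the project's setup may differ.
import os
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3-access-sketch")
    .config("spark.hadoop.fs.s3a.access.key", os.environ["AWS_ACCESS_KEY_ID"])
    .config("spark.hadoop.fs.s3a.secret.key", os.environ["AWS_SECRET_ACCESS_KEY"])
    .getOrCreate()
)

df = spark.read.text("s3a://some-bucket/some-key")  # hypothetical path
```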
- Setup TimescaleDB
  Follow the instructions from the official TimescaleDB blog:
  https://blog.timescale.com/tutorials/tutorial-installing-timescaledb-on-aws-c8602b767a98/
  Follow this video to set up the connection to the cluster:
  https://www.youtube.com/watch?v=5dYeYIWaXjc&feature=youtu.be
  Use this website to optimize database parameters:
  https://pgtune.leopard.in.ua/#/
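  For reference, turning a plain Postgres table into a TimescaleDB hypertable is one SQL call. The sketch below does it from Python with psycopg2; the table, columns, and connection settings are hypothetical, not the project's schema.

```python
# Sketch: create a TimescaleDB hypertable from Python via psycopg2.
# Table/column names and connection settings are hypothetical.
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="unbiased", user="postgres")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS events (
            event_time TIMESTAMPTZ NOT NULL,
            article    TEXT,
            score      DOUBLE PRECISION
        );
    """)
    # Partition on the time column: this is TimescaleDB's core primitive.
    cur.execute(
        "SELECT create_hypertable('events', 'event_time', if_not_exists => TRUE);"
    )
conn.close()
```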
- Setup frontend framework
  Follow this guide from DigitalOcean:
  a. https://www.digitalocean.com/community/tutorials/how-to-install-nginx-on-ubuntu-18-04
     i. Be sure to run sudo ufw allow for SSH as well when you get to that step, or you will not be able to SSH into your instance!
     ii. If ufw status is listed as inactive, running sudo ufw enable fixes it.
  b. https://www.digitalocean.com/community/tutorials/how-to-serve-flask-applications-with-gunicorn-and-nginx-on-ubuntu-18-04
     i. The normal port for Dash is 8080, not 5000.
     ii. These instructions apply to the Flask app underlying Dash. To expose it, put server = app.server at the top of your main Dash script, then substitute server for app in the instructions; otherwise you will get errors saying the app is not callable.
     iii. When you deploy the app, if you made a file/symlink for your domain in /etc/nginx/sites-available/ and in /etc/nginx/sites-enabled/, it may conflict with the new files you made. Remove the original files.
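  A minimal sketch of the server = app.server pattern mentioned above; the layout is a placeholder, not the project's application.py.

```python
# Sketch: expose the Flask server inside a Dash app so gunicorn can serve it.
# The layout is a placeholder; see src/frontend/application.py for the real app.
import dash
import dash_html_components as html

app = dash.Dash(__name__)
server = app.server  # gunicorn targets this WSGI object: gunicorn application:server

app.layout = html.Div("UNBIASED dashboard placeholder")

if __name__ == "__main__":
    app.run_server(host="0.0.0.0", port=8080)
```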
- Setup Airflow
  Set up Airflow as per the instructions in this Medium blog:
  https://blog.insightdatascience.com/scheduling-spark-jobs-with-airflow-4c66f3144660
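  For orientation, a minimal DAG along the lines of src/airflow/dag.py might look like the sketch below; the schedule, task names, and script paths are assumptions, not the project's actual DAG.

```python
# Sketch: an Airflow 1.x DAG that runs the GDELT and Wikipedia processors
# in order. Schedule, default_args, and paths are hypothetical; the real
# DAG lives in src/airflow/dag.py.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "unbiased",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="unbiased_pipeline_sketch",
    default_args=default_args,
    start_date=datetime(2020, 1, 1),
    schedule_interval="*/15 * * * *",  # GDELT publishes every 15 minutes
    catchup=False,
) as dag:
    # Trailing space stops Airflow from treating the .sh path as a Jinja template.
    gdelt = BashOperator(task_id="gdelt_run",
                         bash_command="sh /path/to/src/processor/gdelt_run.sh ")
    wiki = BashOperator(task_id="wiki_run",
                        bash_command="sh /path/to/src/processor/wiki_run.sh ")
    gdelt >> wiki
```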
- Scraping
  cd src/dataingestion
  sh runScrapper.sh
- Decompression
  cd src/decompressor
  sh decompressor.sh
- Processor
  cd src/processor
  sh gdelt_run.sh
  sh wiki_run.sh
- Dashboard
  cd src/frontend
  python application.py
- Unpigz
- Data Modeling
- Query optimization
- Database parameters
- Serializing
- Oversubscription
- Partitioning
- Spark-Submit (see the tuning sketch below)
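To make the Spark-side items above (Serializing, Oversubscription, Partitioning, Spark-Submit) concrete, here is a hedged sketch of the kind of knobs involved; the values are illustrative, not the project's tuned numbers.

```python
# Sketch of PySpark tuning knobs touched on above. Every value here is
# illustrative; the project's actual tuned settings are not recorded here.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tuning-sketch")
    # Serializing: Kryo is typically faster and more compact than Java serialization.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # Oversubscription: allotting more cores per executor than strictly available
    # can hide I/O waits on network-bound jobs.
    .config("spark.executor.cores", "3")
    # Partitioning: size shuffle partitions to the cluster (5 workers here).
    .config("spark.sql.shuffle.partitions", "40")
    .getOrCreate()
)

# Repartitioning a skewed DataFrame before a wide operation is another common step:
# df = df.repartition(40, "event_date")
```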
This project is licensed under the AGPL-3.0 License - see the License.md file for details.