Every day there are thousands of notable events around the globe: protests, market dips, terrorist attacks, and more.
The question is: do global events influence how Wikipedia articles are edited?
UNBIASED is a tool that helps moderators and researchers leverage open data to understand, and further research, patterns in Wikipedia edit contributions.
| Source | Size | Update Frequency | Location |
|---|---|---|---|
| GDELT (Global Database of Events, Language, and Tone) | 6+ TB | 15 minutes | Public S3 |
| Wikipedia Metadata | ~500 GB | Varies | Private S3 |
GDELT:
The GDELT Project monitors the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, themes, sources, emotions, counts, quotes, images, and events driving our global society every second of every day, creating a free open platform for computing on the entire world.

Wikipedia Metadata:
Historical and current dumps of English Wikipedia metadata, including the edits, commit messages, user IDs, and timestamp of every edit to each Wikipedia article.
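For orientation, here is a minimal sketch of pulling one GDELT export into Spark. The bucket key and options are illustrative assumptions, not the project's configuration; the real ingestion lives in src/dataingestion and src/processor.

```python
# Sketch: load a single GDELT export from the public S3 bucket.
# The key below is a hypothetical example; GDELT publishes updates every
# 15 minutes, and the project's scraper builds the real URL list.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gdelt-ingest-sketch").getOrCreate()

# GDELT exports are tab-separated with no header row.
events = (
    spark.read
    .option("sep", "\t")
    .option("header", "false")
    .csv("s3a://gdelt-open-data/events/20190601.export.csv")  # hypothetical key
)

print(events.count())
```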
| Entity | Purpose | Instance Type |
|---|---|---|
| AWS S3 | Raw Data Storage | - |
| AWS EC2 | Spark Cluster, Decompressor | Master: 1 x m5a.large; Workers: 5 x m5a.large |
| AWS EC2 | TimescaleDB | 1 x m5.xlarge |
| AWS EC2 | Web App | 1 x t3.large |
| AWS EC2 | Airflow Scheduler | 1 x m5.large |
- Splitting, keyword generation, and binning.
- Fuzzy pattern matching (see the sketch after this list).
- Data modeling.
- Query processing optimization.
- Database parameter optimization.
- PySpark tuning.
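To make the fuzzy pattern matching step concrete, here is a minimal sketch using Python's standard difflib to score GDELT actor names against Wikipedia article titles. The sample names and the 0.85 cutoff are hypothetical, not the project's actual matcher.

```python
# Sketch: fuzzy-match GDELT actor names to Wikipedia article titles.
# Sample data and the similarity threshold are illustrative only.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two case-folded strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

actors = ["BARACK OBAMA", "EUROPEAN UNION"]
titles = ["Barack Obama", "European Union", "Barack, Saint-Lo"]

for actor in actors:
    best = max(titles, key=lambda t: similarity(actor, t))
    if similarity(actor, best) >= 0.85:  # hypothetical cutoff
        print(f"{actor} -> {best}")
```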
/
│
├── assets
│ ├── logo.png
│ ├── pipeline.png
│ └── dataingestion
│
├── src
│ │
│ ├── dataingestion
│ │ ├── scraper.py
│ │ ├── scraperModules
│ │ │ ├── __init__.py
│ │ │ ├── linkGenerator.py
│ │ │ └── fileWriter.py
│ │ ├── lists
│ │ │ ├── current_urls.txt
│ │ │ └── historic_urls.txt
│ │ └── runScrapper.sh
│ │
│ ├── decompressor
│ │ └── decompressor.sh
│ │
│ ├── processor
│ │ ├── dbWriter.py
│ │ ├── wikiScraper.py
│ │ ├── gdeltProc.py
│ │ ├── gdeltModules
│ │ │ ├── __init__.py
│ │ │ ├── eventsProcessor.py
│ │ │ ├── geographiesProcessor.py
│ │ │ ├── mentionsProcessor.py
│ │ │ └── typeCaster.py
│ │ ├── wikiModules
│ │ │ ├── __init__.py
│ │ │ ├── metaProcessor.py
│ │ │ └── tableProcessor.py
│ │ ├── gdelt_run.sh
│ │ └── wiki_run.sh
│ │
│ ├── frontend
│ │ ├── __init__.py
│ │ ├── application.py
│ │ ├── appModules
│ │ │ ├── __init__.py
│ │ │ ├── dbConnection.py
│ │ │ └── dataFetch.py
│ │ ├── requirements.txt
│ │ ├── queries
│ │ │ ├── articleQuery.sql
│ │ │ └── scoreQuery.sql
│ │ └── assets
│ │ ├── layout.css
│ │ ├── main.css
│ │ └── logo.png
│ │
│ └── airflow
│ └── dag.py
│
├── License.md
├── README.md
├── config.ini
└── .gitignore
- Setup AWS Cluster
  Follow the instructions below, link by link, to set up a cluster and spin up the instances listed above under Architectural Components:
  a. https://blog.insightdatascience.com/simply-install-spark-cluster-mode-341843a52b88
  b. https://blog.insightdatascience.com/how-to-access-s3-data-from-spark-74e40e0b2231
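  Once the cluster is up, Spark needs credentials to reach S3. Below is a minimal sketch, assuming the hadoop-aws (s3a) connector is on the classpath and credentials live in environment variables; both are assumptions, not recorded project settings.

```python
# Sketch: point a SparkSession at S3 via the s3a connector.
# Assumes hadoop-aws is installed and AWS credentials are exported
# as environment variables; the project's setup may differ.
import os
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3-access-sketch")
    .config("spark.hadoop.fs.s3a.access.key", os.environ["AWS_ACCESS_KEY_ID"])
    .config("spark.hadoop.fs.s3a.secret.key", os.environ["AWS_SECRET_ACCESS_KEY"])
    .getOrCreate()
)

df = spark.read.text("s3a://some-bucket/some-key")  # hypothetical path
```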
- Setup TimescaleDB
  Follow the instructions from the official TimescaleDB blog:
  https://blog.timescale.com/tutorials/tutorial-installing-timescaledb-on-aws-c8602b767a98/
  Follow this video to set up the connection to the cluster:
  https://www.youtube.com/watch?v=5dYeYIWaXjc&feature=youtu.be
  Use this website to optimize database parameters:
  https://pgtune.leopard.in.ua/#/
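  For reference, turning a plain Postgres table into a TimescaleDB hypertable is one SQL call. The sketch below does it from Python with psycopg2; the table, columns, and connection settings are hypothetical, not the project's schema.

```python
# Sketch: create a TimescaleDB hypertable from Python via psycopg2.
# Table/column names and connection settings are hypothetical.
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="unbiased", user="postgres")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS events (
            event_time TIMESTAMPTZ NOT NULL,
            article    TEXT,
            score      DOUBLE PRECISION
        );
    """)
    # Partition on the time column: this is TimescaleDB's core primitive.
    cur.execute(
        "SELECT create_hypertable('events', 'event_time', if_not_exists => TRUE);"
    )
conn.close()
```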
- Setup frontend framework
  Follow this guide from DigitalOcean:
  a. https://www.digitalocean.com/community/tutorials/how-to-install-nginx-on-ubuntu-18-04
     i. Be sure to run sudo ufw allow for SSH as well when you get to that step, or you will not be able to SSH into your instance!
     ii. If ufw status is listed as inactive, running sudo ufw enable fixes it.
  b. https://www.digitalocean.com/community/tutorials/how-to-serve-flask-applications-with-gunicorn-and-nginx-on-ubuntu-18-04
     i. The normal port for Dash is 8080, not 5000.
     ii. These instructions apply to the Flask app underlying Dash. To expose it, put server = app.server at the top of your main Dash script, then substitute server for app in the instructions; otherwise you will get errors saying the app is not callable.
     iii. When you deploy the app, if you made a file/symlink for your domain in /etc/nginx/sites-available/ and in /etc/nginx/sites-enabled/, it may conflict with the new files you made. Remove the original files.
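  A minimal sketch of the server = app.server pattern mentioned above; the layout is a placeholder, not the project's application.py.

```python
# Sketch: expose the Flask server inside a Dash app so gunicorn can serve it.
# The layout is a placeholder; see src/frontend/application.py for the real app.
import dash
import dash_html_components as html

app = dash.Dash(__name__)
server = app.server  # gunicorn targets this WSGI object: gunicorn application:server

app.layout = html.Div("UNBIASED dashboard placeholder")

if __name__ == "__main__":
    app.run_server(host="0.0.0.0", port=8080)
```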
- Setup Airflow
  Set up Airflow as per the instructions in this Medium blog:
  https://blog.insightdatascience.com/scheduling-spark-jobs-with-airflow-4c66f3144660
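  For orientation, a minimal DAG along the lines of src/airflow/dag.py might look like the sketch below; the schedule, task names, and script paths are assumptions, not the project's actual DAG.

```python
# Sketch: an Airflow 1.x DAG that runs the GDELT and Wikipedia processors
# in order. Schedule, default_args, and paths are hypothetical; the real
# DAG lives in src/airflow/dag.py.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "unbiased",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="unbiased_pipeline_sketch",
    default_args=default_args,
    start_date=datetime(2020, 1, 1),
    schedule_interval="*/15 * * * *",  # GDELT publishes every 15 minutes
    catchup=False,
) as dag:
    # Trailing space stops Airflow from treating the .sh path as a Jinja template.
    gdelt = BashOperator(task_id="gdelt_run",
                         bash_command="sh /path/to/src/processor/gdelt_run.sh ")
    wiki = BashOperator(task_id="wiki_run",
                        bash_command="sh /path/to/src/processor/wiki_run.sh ")
    gdelt >> wiki
```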
- Scraping
  cd src/dataingestion
  sh runScrapper.sh
- Decompression
  cd src/decompressor
  sh decompressor.sh
- Processor
  cd src/processor
  sh gdelt_run.sh
  sh wiki_run.sh
- Dashboard
  cd src/frontend
  python application.py
- Unpigz
- Data Modeling
- Query optimization
- Database parameters
- Serializing
- Oversubscription
- Partitioning
- Spark-Submit (see the tuning sketch below)
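To make the Spark-side items above (Serializing, Oversubscription, Partitioning, Spark-Submit) concrete, here is a hedged sketch of the kind of knobs involved; the values are illustrative, not the project's tuned numbers.

```python
# Sketch of PySpark tuning knobs touched on above. Every value here is
# illustrative; the project's actual tuned settings are not recorded here.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tuning-sketch")
    # Serializing: Kryo is typically faster and more compact than Java serialization.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # Oversubscription: allotting more cores per executor than strictly available
    # can hide I/O waits on network-bound jobs.
    .config("spark.executor.cores", "3")
    # Partitioning: size shuffle partitions to the cluster (5 workers here).
    .config("spark.sql.shuffle.partitions", "40")
    .getOrCreate()
)

# Repartitioning a skewed DataFrame before a wide operation is another common step:
# df = df.repartition(40, "event_date")
```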
This project is licensed under the AGPL-3.0 License - see the License.md file for details.