Wikipedia regularly releases clickstream datasets that capture aggregated page-to-page user visits to Wikipedia articles. These datasets are very large, and while standard statistical methods yield traffic volumes and lists of top-visited articles, they leave out the insights contained in the interconnections between the articles.
In this project, we use network analysis to derive insights from the connections in the data. We model the clickstream data as a graph/network, describe the resulting graph and its most influential nodes, apply community detection and natural language processing to identify themes/topics within the clickstream data, and use network shell decomposition to investigate obscure browsing on Wikipedia.
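The released clickstream files have rows of the form (prev, curr, type, n), where n is the number of transitions between the two pages; modeling each article-to-article row as a weighted directed edge can be sketched as below. The rows here are toy data, and filtering on `type == "link"` is one reasonable choice for keeping only page-to-page traffic:

```python
import networkx as nx

# Toy rows in the clickstream schema: (prev, curr, type, n).
# In the real dataset these come from a tab-separated dump file.
rows = [
    ("London", "United_Kingdom", "link", 12000),
    ("other-search", "London", "external", 90000),
    ("United_Kingdom", "London", "link", 8000),
]

G = nx.DiGraph()
for prev, curr, type_, n in rows:
    # Keep only article-to-article transitions; "external" sources
    # (search engines etc.) are traffic volume, not page-to-page links.
    if type_ == "link":
        G.add_edge(prev, curr, weight=n)

print(G.number_of_nodes(), G.number_of_edges())
```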
The initial findings are described in this blog post, along with the following visualizations.
- An interactive viz of Wikipedia traffic breakdown by type
- A network/graph viz of Wikipedia articles interconnected by the clickstream traffic between them
- An interactive viz of the article communities graph
- An interactive viz of the article communities graph with highlighting by topic terms
- Jupyter notebook on NBViewer: data_quality_analysis.ipynb
- Key takeaway from this notebook: data cleaning steps and notes on further processing
- can be run on a local machine
- data: the English Wikipedia clickstream dataset for December 2018
- Jupyter notebook on NBViewer: English_Wikipedia_EDA.ipynb
- can be run on a local machine
- Visualization demo: Wikipedia clickstream traffic breakdown by type
- data: the English Wikipedia clickstream dataset for December 2018
- Jupyter notebook on NBViewer: English_Wikipedia_graph_modeling_AWS.ipynb
- run on AWS EC2 (the data may be too large for a local machine)
- data: the English Wikipedia clickstream dataset for December 2018
- Jupyter notebook on NBViewer: English_Wikipedia_network_analysis_AWS.ipynb
- run on AWS EC2 (the data may be too large for a local machine)
- data: the English Wikipedia clickstream dataset for December 2018
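The network analysis notebook's exact methods aren't reproduced here; as one illustration of the community detection step, modularity-based clustering on an undirected projection of the clickstream graph can be run with networkx (the graph below is a made-up toy example):

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy undirected projection of a clickstream graph: two densely
# connected clusters of articles joined by a single weak bridge.
G = nx.Graph()
G.add_edges_from([
    ("A", "B"), ("B", "C"), ("A", "C"),   # cluster 1
    ("X", "Y"), ("Y", "Z"), ("X", "Z"),   # cluster 2
    ("C", "X"),                           # weak bridge
])

communities = greedy_modularity_communities(G)
print([sorted(c) for c in communities])
```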
- Jupyter notebook on NBViewer: English_Wikipedia_NLP_AWS.ipynb
- run on AWS EC2 (the data may be too large for a local machine)
- data: the English Wikipedia clickstream dataset for December 2018
- Jupyter notebook on NBViewer: English_Wikipedia_deepWiki.ipynb
- was run on AWS EC2, but after the Neo4j data pull the rest of the code runs fine on a local machine
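Shell decomposition assigns each node a core number: the largest k such that the node survives in the k-core (the subgraph where every node has degree at least k). Low-shell nodes form the periphery of the network, which is where obscure browsing paths live. A minimal networkx sketch on a toy graph (not the notebook's Neo4j data):

```python
import networkx as nx

# Toy graph: a dense core (4-clique) with a chain of peripheral nodes.
G = nx.complete_graph(4)              # nodes 0-3 form the core
G.add_edges_from([(3, 4), (4, 5)])    # nodes 4 and 5 hang off the core

core = nx.core_number(G)
print(core)   # clique nodes get core number 3, peripheral nodes get 1
```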