Wikipedia regularly releases clickstream datasets that capture aggregated page-to-page user visits to Wikipedia articles. These datasets are very large, and while standard statistical methods yield traffic volumes and lists of top-visited articles, they leave out the insights contained in the interconnections between the articles.
In this project, we use network analysis to derive insights from the connections in the data. We model the clickstream data as a graph/network, describe the resulting graph and its most influential nodes, apply community detection and natural language processing to identify themes/topics within the clickstream data, and use network shell decomposition to investigate obscure browsing on Wikipedia.
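The released clickstream files have rows of the form (prev, curr, type, n), where n is the number of transitions between the two pages; modeling each article-to-article row as a weighted directed edge can be sketched as below. The rows here are toy data, and filtering on `type == "link"` is one reasonable choice for keeping only page-to-page traffic:

```python
import networkx as nx

# Toy rows in the clickstream schema: (prev, curr, type, n).
# In the real dataset these come from a tab-separated dump file.
rows = [
    ("London", "United_Kingdom", "link", 12000),
    ("other-search", "London", "external", 90000),
    ("United_Kingdom", "London", "link", 8000),
]

G = nx.DiGraph()
for prev, curr, type_, n in rows:
    # Keep only article-to-article transitions; "external" sources
    # (search engines etc.) are traffic volume, not page-to-page links.
    if type_ == "link":
        G.add_edge(prev, curr, weight=n)

print(G.number_of_nodes(), G.number_of_edges())
```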
The initial findings are described in this blog post, along with the following visualizations.
- An interactive viz of Wikipedia traffic breakdown by type
- A network/graph viz of Wikipedia articles interconnected by the clickstream traffic between them
- An interactive viz of the article communities graph
- An interactive viz of the article communities graph with highlighting by topic terms
- Jupyter notebook on NBViewer: data_quality_analysis.ipynb
- Key takeaway from this notebook: data cleaning steps and notes on further processing
- can be run on a local machine
- data: the English Wikipedia clickstream dataset for December 2018
- Jupyter notebook on NBViewer: English_Wikipedia_EDA.ipynb
- can be run on a local machine
- Visualization demo: Wikipedia clickstream traffic breakdown by type
- data: the English Wikipedia clickstream dataset for December 2018
- Jupyter notebook on NBViewer: English_Wikipedia_graph_modeling_AWS.ipynb
- run on AWS EC2 (the data may be too large for a local machine)
- data: the English Wikipedia clickstream dataset for December 2018
- Jupyter notebook on NBViewer: English_Wikipedia_network_analysis_AWS.ipynb
- run on AWS EC2 (the data may be too large for a local machine)
- data: the English Wikipedia clickstream dataset for December 2018
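The network analysis notebook's exact methods aren't reproduced here; as one illustration of the community detection step, modularity-based clustering on an undirected projection of the clickstream graph can be run with networkx (the graph below is a made-up toy example):

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy undirected projection of a clickstream graph: two densely
# connected clusters of articles joined by a single weak bridge.
G = nx.Graph()
G.add_edges_from([
    ("A", "B"), ("B", "C"), ("A", "C"),   # cluster 1
    ("X", "Y"), ("Y", "Z"), ("X", "Z"),   # cluster 2
    ("C", "X"),                           # weak bridge
])

communities = greedy_modularity_communities(G)
print([sorted(c) for c in communities])
```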
- Jupyter notebook on NBViewer: English_Wikipedia_NLP_AWS.ipynb
- run on AWS EC2 (the data may be too large for a local machine)
- data: the English Wikipedia clickstream dataset for December 2018
- Jupyter notebook on NBViewer: English_Wikipedia_deepWiki.ipynb
- was run on AWS EC2, but after the Neo4j data pull the rest of the code runs fine on a local machine
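Shell decomposition assigns each node a core number: the largest k such that the node survives in the k-core (the subgraph where every node has degree at least k). Low-shell nodes form the periphery of the network, which is where obscure browsing paths live. A minimal networkx sketch on a toy graph (not the notebook's Neo4j data):

```python
import networkx as nx

# Toy graph: a dense core (4-clique) with a chain of peripheral nodes.
G = nx.complete_graph(4)              # nodes 0-3 form the core
G.add_edges_from([(3, 4), (4, 5)])    # nodes 4 and 5 hang off the core

core = nx.core_number(G)
print(core)   # clique nodes get core number 3, peripheral nodes get 1
```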