33eyes/wiki-clickstream-graph

Wikipedia clickstream data exploration using network analysis
Exploring Wikipedia clickstream data

Description

Wikipedia regularly releases clickstream datasets that capture aggregated page-to-page user visits to Wikipedia articles. These datasets are very large, and while standard statistical methods can yield traffic-volume statistics and lists of the most visited articles, they miss the insights contained in the interconnections between articles.
In this project, we use network analysis to derive insights from those connections. We model the clickstream data as a graph/network, describe the resulting graph and its most influential nodes, apply community detection and natural language processing to identify themes and topics within the clickstream data, and use network shell decomposition to investigate obscure browsing on Wikipedia.
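As a rough sketch of the graph-modeling idea: the clickstream dumps are tab-separated files with `prev`, `curr`, `type`, and `n` (aggregated click count) columns, which can be read into a weighted directed graph. The sample rows below are hypothetical, not real clickstream values.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in the clickstream dump format:
# prev \t curr \t type \t n  (n = aggregated click count)
sample = """other-search\tLondon\texternal\t120
London\tRiver_Thames\tlink\t45
London\tParis\tlink\t30
Paris\tRiver_Seine\tlink\t25
"""

# Build a weighted directed graph as nested adjacency dicts,
# keeping only internal article-to-article links
edges = defaultdict(dict)
for prev, curr, type_, n in csv.reader(io.StringIO(sample), delimiter="\t"):
    if type_ == "link":
        edges[prev][curr] = int(n)

print(edges["London"])  # {'River_Thames': 45, 'Paris': 30}
```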

Results

The initial findings are described in this blog post, along with the visualizations.

Visualizations

For the December 2018 clickstream data.

Analysis steps

1. Data quality analysis of available datasets

2. Exploratory data analysis of the English Wikipedia clickstream dataset for December 2018
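One basic pass in an exploratory analysis like step 2 is ranking articles by total incoming clicks; a minimal sketch with hypothetical rows in the clickstream `(prev, curr, type, n)` format:

```python
from collections import Counter

# Hypothetical (prev, curr, type, n) rows in clickstream format
rows = [
    ("other-search", "London", "external", 150),
    ("London", "River_Thames", "link", 45),
    ("other-search", "Paris", "external", 90),
    ("London", "Paris", "link", 30),
]

# Total clicks received per article, across all referrer types
incoming = Counter()
for prev, curr, type_, n in rows:
    incoming[curr] += n

print(incoming.most_common(2))  # [('London', 150), ('Paris', 120)]
```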

3. Graph modeling and data import to neo4j
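The neo4j import in step 3 could be expressed as a Cypher `LOAD CSV` statement along these lines; the node label, relationship type, and file path below are assumptions for illustration, not taken from the repo's actual import script.

```python
# Hypothetical Cypher import for a clickstream TSV: merge one
# Article node per title and one weighted LINKS_TO relationship
# per internal link row.
import_query = """
LOAD CSV FROM 'file:///clickstream.tsv' AS row FIELDTERMINATOR '\\t'
WITH row WHERE row[2] = 'link'
MERGE (a:Article {title: row[0]})
MERGE (b:Article {title: row[1]})
MERGE (a)-[r:LINKS_TO]->(b)
SET r.clicks = toInteger(row[3]);
"""
```

A statement like this could then be executed through the official neo4j Python driver or the cypher-shell CLI against a running database.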

4. Network analysis of the English Wikipedia clickstream
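Finding the most influential nodes, as in step 4, is commonly done with centrality measures such as PageRank; a minimal stdlib sketch on a weighted directed click graph (the toy edge weights are hypothetical):

```python
def pagerank(edges, damping=0.85, iters=50):
    """Weighted PageRank; edges maps src -> {dst: weight}."""
    nodes = set(edges)
    for targets in edges.values():
        nodes.update(targets)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for src, targets in edges.items():
            total = sum(targets.values())
            for dst, w in targets.items():
                new[dst] += damping * rank[src] * w / total
        # redistribute mass from dangling nodes (no out-links)
        dangling = sum(rank[n] for n in nodes if n not in edges)
        for n in nodes:
            new[n] += damping * dangling / len(nodes)
        rank = new
    return rank

edges = {
    "London": {"Paris": 30, "River_Thames": 45},
    "Paris": {"London": 25},
}
ranks = pagerank(edges)
print(max(ranks, key=ranks.get))  # London
```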

5. Defining community topics with NLP

  • data: the English Wikipedia clickstream dataset for December 2018
  • Jupyter notebook on NBViewer: English_Wikipedia_NLP_AWS.ipynb
    • run on AWS EC2 (the data may be too large to process locally)
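The notebook's exact NLP pipeline isn't shown here, but a minimal sketch of one way to label a detected community by its most frequent title terms (the community members are hypothetical):

```python
import re
from collections import Counter

# Hypothetical community of article titles, as produced by
# community detection on the click graph
community = ["River_Thames", "River_Seine", "Thames_Barrier", "List_of_rivers"]

# Tokenize titles on non-letter characters and count terms,
# dropping a few common stop words
stop = {"of", "list", "the"}
terms = Counter(
    tok
    for title in community
    for tok in re.split(r"[^A-Za-z]+", title.lower())
    if tok and tok not in stop
)
print(terms.most_common(2))  # [('river', 2), ('thames', 2)]
```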

6. Exploring obscure browsing on Wikipedia

  • data: the English Wikipedia clickstream dataset for December 2018
  • Jupyter notebook on NBViewer: English_Wikipedia_deepWiki.ipynb
    • run on AWS EC2; after the neo4j data pull, the rest of the code runs fine locally
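Step 6's shell decomposition assigns each node a k-core (shell) number by repeatedly peeling minimum-degree nodes; a minimal stdlib sketch on an undirected toy graph (edges hypothetical):

```python
def core_numbers(adj):
    """Compute k-core (shell) numbers by peeling minimum-degree
    nodes; adj maps node -> set of neighbours (undirected)."""
    degree = {n: len(nbrs) for n, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        node = min(remaining, key=degree.get)
        k = max(k, degree[node])  # shell number never decreases
        core[node] = k
        remaining.remove(node)
        for nbr in adj[node]:
            if nbr in remaining:
                degree[nbr] -= 1
    return core

# Toy graph: a triangle (2-core) with one pendant node (1-shell)
adj = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
print(core_numbers(adj))
```

Low-shell nodes correspond to the periphery of the click graph, which is where obscure browsing paths would live.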