In this project, the Hadoop MapReduce framework is used to calculate the PageRank of Wikipedia pages and produce an ordered list of page titles along with their ranks. The project also handles dead-ends and spider-traps by incorporating the teleport factor β (set here to 0.2) into the transition-matrix calculation.
After three iterations, I selected page ID = 6 as an example (the black point in the plot), which shows the PageRank value of the selected page relative to the other pages.
- transition.txt has the following format:
from1: to11 to12 to13 ...
from2: to21 to22 to23 ...
...
where from1 is an integer ID of a page containing outgoing links, and to11, to12, ... are integer IDs of the pages it links to; the fields are separated by '\t'.
- pr.txt has the following format:
page_id prob
where page_id is the ID of a page and prob is that page's current PageRank probability, separated by '\t'.
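For illustration, here is a hypothetical 4-page graph in these formats (all IDs and probabilities are made up; each page starts with probability 1/4, and a tab separates the page ID from the rest of the line):

```
# transition.txt
1	2 4
2	1
3	1 2 4
4	3

# pr.txt
1	0.25
2	0.25
3	0.25
4	0.25
```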
PageRank is a calculation that evaluates the quality and quantity of links to a webpage in order to determine a relative score of that page's importance and authority. This project proposes a generalization of the PageRank algorithm based on both out-links and in-links, using an iterative algorithm as an alternative interpretation of the matrix-based techniques. The PageRank algorithm can also specify, at any step, the probability that a person will continue clicking outgoing links.
- A group of pages is a spider-trap if there are no links from within the group to outside the group. Once the random surfer enters such a group it can never leave, so over repeated iterations the trap absorbs all of the PageRank.
- Pages with no outlinks are dead-ends for the random surfer. If dead-ends are allowed, the transition matrix of the Web is no longer stochastic, since some of its columns sum to 0 rather than 1.
To avoid both problems, modify the calculation of PageRank by mixing a teleport term into the transition-matrix update:

PR(N) = (1 - β) * PR(N-1) * TransitionMatrix + β * PR(N-1)

β is the teleport parameter, usually in the range 0.1 to 0.2.
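For concreteness, here is a minimal sketch of this update step in plain Java (illustrative only; the actual project carries out the multiplication with the MapReduce jobs described below):

```java
import java.util.Arrays;

// One PageRank update step with teleporting, following
// PR(N) = (1 - β) * PR(N-1) * M + β * PR(N-1).
// pr is a row vector; m[i][j] is the probability of moving from page i to page j.
public class TeleportStep {

    static double[] step(double[] pr, double[][] m, double beta) {
        int n = pr.length;
        double[] next = new double[n];
        for (int j = 0; j < n; j++) {
            double dot = 0;
            for (int i = 0; i < n; i++) {
                dot += pr[i] * m[i][j]; // (PR(N-1) * M)[j]
            }
            next[j] = (1 - beta) * dot + beta * pr[j]; // mix in the teleport term
        }
        return next;
    }

    public static void main(String[] args) {
        double[][] m = {{0.0, 1.0}, {1.0, 0.0}}; // two pages linking only to each other
        double[] pr = {0.5, 0.5};
        System.out.println(Arrays.toString(step(pr, m, 0.2)));
    }
}
```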
- UnitMultiplication.java: This class contains the mapper logic. It takes an input file holding the graph data (each node with its adjacency list) together with the current PageRank values, and finally emits key-value pairs of (page-id, prob). A sketch of the mappers follows below.
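Here is a minimal sketch of what the mappers for this job might look like, assuming the file formats described above (the class names, delimiters, and two-mapper split are assumptions, not the project's exact source):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class UnitMultiplication {

    // Parses transition.txt lines of the form "from\tto1 to2 to3 ..."
    public static class TransitionMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fromTo = value.toString().trim().split("\t");
            if (fromTo.length < 2) {
                return; // dead-end page: no outgoing links to emit
            }
            String[] tos = fromTo[1].split("\\s+");
            for (String to : tos) {
                // each outgoing link carries probability 1/outDegree
                context.write(new Text(fromTo[0]), new Text(to + "=" + 1.0 / tos.length));
            }
        }
    }

    // Parses pr.txt lines of the form "page_id\tprob"
    public static class PRMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] pr = value.toString().trim().split("\t");
            context.write(new Text(pr[0]), new Text(pr[1]));
        }
    }
}
```

The job's reducer would then join the two streams on page-id, multiply each transition cell by the page's current probability, and emit (to-page-id, partial prob) pairs for UnitSum to consume.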
- UnitSum.java: This class is the reducer class. It receives a key (a page-id) and a list of values (the partial probabilities contributed to that page), sums the probabilities, and writes the page's updated PageRank value for this iteration. A sketch follows below.
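Here is a minimal sketch of the summing reducer under the same assumptions (the teleport term from the formula above is omitted for brevity):

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class UnitSum {

    // Receives a page-id and the partial probabilities contributed by its in-links.
    public static class SumReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        public void reduce(Text key, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            double sum = 0;
            for (DoubleWritable value : values) {
                sum += value.get();
            }
            // the summed value becomes the page's PageRank for the next iteration
            context.write(key, new DoubleWritable(sum));
        }
    }
}
```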
- Driver.java: Chains the UnitMultiplication and UnitSum jobs for the requested number of iterations. Build and run as follows:
hadoop com.sun.tools.javac.Main *.java
jar cf pr.jar *.class
hadoop jar pr.jar Driver /transition /pagerank /output 3 #3: iterate three times
#args0: dir of transition.txt
#args1: dir of PageRank.txt
#args2: dir of unitMultiplication result
#args3: number of iterations (make sure the code runs successfully when args3=1, then test with args3=40)
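For reference, here is one possible shape of Driver.java chaining the two jobs per iteration (the main(String[]) entry points on UnitMultiplication and UnitSum and the directory-suffix convention are assumptions):

```java
public class Driver {
    public static void main(String[] args) throws Exception {
        String transitionDir = args[0];   // dir of transition.txt
        String prDir = args[1];           // dir of the initial PageRank file
        String subPRDir = args[2];        // dir for the unitMultiplication results
        int iterations = Integer.parseInt(args[3]);

        for (int i = 0; i < iterations; i++) {
            // the multiplication job reads the matrix and the PageRank values from iteration i
            String[] multiplicationArgs = {transitionDir, prDir + i, subPRDir + i};
            UnitMultiplication.main(multiplicationArgs);
            // the sum job folds the partial results into the PageRank file for iteration i+1
            String[] sumArgs = {subPRDir + i, prDir + (i + 1)};
            UnitSum.main(sumArgs);
        }
    }
}
```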