
Converts WARC RDD into a GraphX object, performs PageRank and converts into GraphML object #228

Closed. hardiksahi wants to merge 22 commits.

Conversation

hardiksahi


This PR converts an RDD of WARC records into a GraphX object, performs PageRank (dynamic or static), and creates a GraphML file.



What does this Pull Request do?

Creates a new file, ExtractGraphXSLS.scala, that is to be called from an external script. PageRank is implemented, and the resulting scores are written into the GraphML file.

How should this be tested?

Run the following script: TestScript.txt
Then check the GraphML file created at the path specified in the script.

Additional Notes:

Example:

  • This change requires updating the documentation.

Thanks in advance for your help with the Archives Unleashed Toolkit!

import org.apache.spark.rdd.RDD

/** Extracts a site link structure using Spark's GraphX utility. */
object ExtractGraphXSLS {
Member:

What does XSLS stand for?
Let's write this using the same command-line framework that @TitusAn created.

Author:

I used it as an abbreviation for SiteLinkStructure.
I will write this using the same command-line framework. Is there any documentation for it?

Contributor:

The relevant code is in https://github.com/archivesunleashed/aut/blob/master/src/main/scala/io/archivesunleashed/app/CommandLineAppRunner.scala. Handlers can be added to extractors, which maps strings denoting operations to functions that process RDDs.
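As a rough illustration of that handler pattern (all names below are invented for the sketch and are not the actual CommandLineAppRunner API), a string-keyed registry mapping operation names to processing functions might look like:

```scala
// Toy model of an extractor registry: operation names map to functions
// that process records. A Seq stands in for an RDD so the sketch stays
// self-contained and runnable without a SparkContext.
object ExtractorRegistrySketch {
  type Extractor = Seq[String] => Seq[String]

  val extractors = scala.collection.mutable.Map[String, Extractor]()

  // Registering a hypothetical GraphX handler under an operation name.
  extractors("ExtractGraphX") = records => records.map(r => s"vertex:$r")

  // Dispatch: look up the operation by name and apply it to the input.
  def run(op: String, input: Seq[String]): Seq[String] = extractors(op)(input)
}
```

Calling `ExtractorRegistrySketch.run("ExtractGraphX", Seq("http://example.com"))` dispatches to the registered function; the real runner does the same with RDDs.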

Contributor:

Going to rewrite this as ExtractGraph, replacing the existing instance.

*/
def pageHash(url: String): VertexId = {
  url.hashCode.toLong
}
Member:

If hash codes collide, the vertex ids won't be unique, right?
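It can happen: `String.hashCode` on the JVM is not injective, so two different URLs can receive the same VertexId and GraphX would silently merge them. A minimal demonstration using a well-known JVM hash-collision pair:

```scala
// "Aa" and "BB" are distinct strings with identical JVM hash codes,
// so pageHash as written would assign both the same vertex id.
val idA: Long = "Aa".hashCode.toLong
val idB: Long = "BB".hashCode.toLong
assert(idA == idB)  // both hash to 2112, yet the strings differ
```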

Author:

Yes, that is true. Can you suggest any other way to uniquely assign long values to vertices as an identifier?

Contributor:

The hash situation is going to be a problem, especially for very large graphs.

My research says the solution is some implementation of .zipWithIndex() or .zipWithUniqueId(). I think the latter can cause some problems if the partitions are not the same size, but it does not trigger a Spark job. http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.zipWithIndex

I'm not going to try and fix this here, however.
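For reference, a sketch of the zipWithUniqueId approach (assuming an RDD of extracted URLs named `urls`; this is illustrative, not a fix applied in this PR):

```scala
import org.apache.spark.graphx.VertexId
import org.apache.spark.rdd.RDD

// Collision-free vertex ids: instead of hashing the URL, pair each
// distinct URL with a unique Long from zipWithUniqueId. Unlike
// zipWithIndex, this does not trigger a Spark job, at the cost of
// ids that are unique but not consecutive across partitions.
def uniqueVertexIds(urls: RDD[String]): RDD[(VertexId, String)] = {
  urls
    .distinct()
    .zipWithUniqueId()                    // (url, uniqueLongId)
    .map { case (url, id) => (id, url) }  // VertexId is an alias for Long
}
```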

val vertices: RDD[(VertexId, VertexData)] = extractedLinks
  .flatMap(r => List(r._1, r._2))
  .distinct
  .map(r => (pageHash(r), VertexData(r)))
Member:

danger here, see above.

Author:

Same as above.

val edges: RDD[Edge[EdgeData]] = extractedLinks
  .map(r => Edge(pageHash(r._1), pageHash(r._2), EdgeData(1)))


Member:

kill extra lines.

Author:

Is there a specific code formatter you use for Scala that can be used in Eclipse?

Member:

Two-space indenting, and keeping lines around 80 characters long, is the general principle.



val graph = Graph(vertices, edges).partitionBy(PartitionStrategy.RandomVertexCut).groupEdges((e1,e2) => EdgeData(e1.edgeCount+e2.edgeCount))

Member:

overly-long line, wrap.
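One possible wrapping of the flagged line, with each transformation on its own line (same behavior, just reflowed):

```scala
// Same expression as above, wrapped to respect the ~80-character limit.
val graph = Graph(vertices, edges)
  .partitionBy(PartitionStrategy.RandomVertexCut)
  .groupEdges((e1, e2) => EdgeData(e1.edgeCount + e2.edgeCount))
```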

}

}
else{
Member:

fix code formatting.

}

def runPageRankAlgorithm(graph: Graph[VertexData, EdgeData], dynamic: Boolean = false,
tolerance: Double = 0.005, numIter: Int = 20, resetProb: Double = 0.15): Graph[VertexDataPR, EdgeData] ={
Member:

four space indent.
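For context, the dynamic/static branch that runPageRankAlgorithm's signature suggests can be sketched with GraphX's built-in operators (the generic vertex/edge types here are a simplification of the PR's VertexData/EdgeData, and the return type differs from the PR's VertexDataPR):

```scala
import org.apache.spark.graphx.Graph

// Sketch: dynamic PageRank iterates until ranks move less than
// `tolerance`; static PageRank runs a fixed number of iterations.
// Both built-in operators return a Graph[Double, Double] of ranks.
def pageRankSketch[VD, ED](graph: Graph[VD, ED], dynamic: Boolean,
    tolerance: Double = 0.005, numIter: Int = 20,
    resetProb: Double = 0.15): Graph[Double, Double] = {
  if (dynamic) {
    graph.pageRank(tolerance, resetProb)        // convergence-based
  } else {
    graph.staticPageRank(numIter, resetProb)    // fixed iterations
  }
}
```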

*/
object WriteGraphXML {

//case class EdgeData(edgeCount: Int)
Member:

remove dead code.

* @param graphmlPath output file
* @return true on successful run.
*/
def makeFile (graph: Graph[VertexDataPR, EdgeData], graphmlPath: String): Boolean = {
Member:

Give the method a better name, something like writeToFile?

@@ -0,0 +1,83 @@
/*
Member:

I don't understand how this file is part of this PR?
It's already checked in? https://github.com/archivesunleashed/aut/blob/master/src/main/scala/io/archivesunleashed/app/WriteGraphML.scala

Author:

This file is different because it creates a GraphML file from a GraphX object rather than from a plain RDD. It also has an extra field for PageRank; other fields will be added as the code progresses.
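A rough sketch of what writing GraphML from a GraphX graph with a PageRank attribute can look like (element layout simplified, GraphML `<key>` declarations omitted, and collecting to the driver assumed viable only for modest graphs; this is not the PR's exact code):

```scala
import java.io.PrintWriter
import org.apache.spark.graphx.Graph

// Emit a minimal GraphML document from a GraphX graph whose vertex
// attribute is (url, pageRank). Collecting vertices and edges to the
// driver is fine for small graphs but does not scale.
def writeGraphML(graph: Graph[(String, Double), Int], path: String): Boolean = {
  val out = new PrintWriter(path)
  out.println("""<graphml xmlns="http://graphml.graphdrawing.org/xmlns">""")
  out.println("""<graph id="G" edgedefault="directed">""")
  graph.vertices.collect.foreach { case (id, (url, rank)) =>
    out.println(s"""<node id="$id"><data key="url">$url</data>""" +
      s"""<data key="pageRank">$rank</data></node>""")
  }
  graph.edges.collect.foreach { e =>
    out.println(s"""<edge source="${e.srcId}" target="${e.dstId}"/>""")
  }
  out.println("</graph></graphml>")
  out.close()
  true
}
```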


codecov bot commented May 17, 2018

Codecov Report

Merging #228 into master will decrease coverage by 2.99%.
The diff coverage is 0%.


@@           Coverage Diff           @@
##           master     #228   +/-   ##
=======================================
- Coverage   58.68%   55.68%   -3%     
=======================================
  Files          38       40    +2     
  Lines         743      783   +40     
  Branches      137      144    +7     
=======================================
  Hits          436      436           
- Misses        266      303   +37     
- Partials       41       44    +3
Impacted Files Coverage Δ
...la/io/archivesunleashed/app/ExtractGraphXSLS.scala 0% <0%> (ø)
...scala/io/archivesunleashed/app/WriteGraphXML.scala 0% <0%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2bdc740...7077035.

ruebot (Member) commented May 30, 2018

@hardiksahi can you pull in master to update this PR? Once that is done, I'd like to see what CodeCov reports again.

ruebot (Member) commented Jul 25, 2018

Closing this PR. @greebie will continue the work on the issue-203 branch.

ruebot closed this on Jul 25, 2018.