Converts WARC RDD into a GraphX object, performs PageRank and converts into GraphML object #228
Conversation
import org.apache.spark.rdd.RDD

/** Extracts a site link structure using Spark's GraphX utility. */
object ExtractGraphXSLS {
What does XSLS stand for?
Let's write this using the same command-line framework that @TitusAn created.
I used it as an abbreviation for SiteLinkStructure.
I will write this using the same command-line framework. Is there any documentation for it?
The relevant code is in https://github.com/archivesunleashed/aut/blob/master/src/main/scala/io/archivesunleashed/app/CommandLineAppRunner.scala. Handlers can be added to extractors, a map from strings denoting operations to functions that process RDDs.
Going to re-write this as ExtractGraph, replacing the existing instance.
*/
def pageHash(url: String): VertexId = {
  url.hashCode.toLong
}
If hash codes collide, the vertex ids won't be unique, right?
Yes, that is true. Can you suggest any other way to uniquely assign long values to vertices as an identifier?
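One way to see the problem (a self-contained illustration, not code from this PR): the JVM's String.hashCode is only 32 bits, so distinct URLs can map to the same VertexId:

```scala
object HashCollisionDemo extends App {
  // "Aa" and "BB" are a classic String.hashCode collision on the JVM:
  // 'A' * 31 + 'a' == 'B' * 31 + 'B' == 2112
  val a = "Aa".hashCode.toLong
  val b = "BB".hashCode.toLong
  println(s"$a == $b: ${a == b}")  // prints: 2112 == 2112: true
}
```

With a graph of millions of crawled URLs, the birthday bound makes at least one such collision very likely.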
The hash situation is going to be a problem, especially for very large graphs.
My research says the solution is some implementation of .zipWithIndex() or .zipWithUniqueId(). I think the latter can cause some problems if the partitions are not the same size, but it does not trigger a Spark job. http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.zipWithIndex
I'm not going to try to fix this here, however.
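For later reference, a sketch of what id assignment via zipWithUniqueId might look like (this is not code from the PR; it assumes the extractedLinks and VertexData names from the diff below):

```scala
import org.apache.spark.graphx.VertexId
import org.apache.spark.rdd.RDD

// Sketch: derive collision-free vertex ids from zipWithUniqueId instead
// of hashing. Assumes `extractedLinks: RDD[(String, String)]` and the
// `VertexData` case class from this PR.
val urlsWithIds: RDD[(String, VertexId)] = extractedLinks
  .flatMap(r => List(r._1, r._2))
  .distinct
  .zipWithUniqueId()  // one unique Long per URL; does not trigger a Spark job

val vertices: RDD[(VertexId, VertexData)] =
  urlsWithIds.map { case (url, id) => (id, VertexData(url)) }
```

Note the trade-off: building the edge RDD would then require joining extractedLinks against urlsWithIds to look up each endpoint's id, extra shuffle work that the hashing approach avoids.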
val vertices: RDD[(VertexId, VertexData)] = extractedLinks
  .flatMap(r => List(r._1, r._2))
  .distinct
  .map(r => (pageHash(r), VertexData(r)))
danger here, see above.
Same as above
val edges: RDD[Edge[EdgeData]] = extractedLinks
  .map(r => Edge(pageHash(r._1), pageHash(r._2), EdgeData(1)))

kill extra lines.
Is there any specific code formatter you use for Scala that can be used in Eclipse?
The general principle is 2-space indentation and keeping lines around 80 characters long.
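No Eclipse-specific formatter is mentioned in this thread, but if the project wanted tooling to enforce that principle, a minimal scalafmt configuration along these lines would do it (hypothetical, not part of the repo):

```
# .scalafmt.conf (hypothetical)
maxColumn = 80   # wrap lines at ~80 characters; 2-space indentation is scalafmt's default
```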
val graph = Graph(vertices, edges).partitionBy(PartitionStrategy.RandomVertexCut).groupEdges((e1,e2) => EdgeData(e1.edgeCount+e2.edgeCount))
overly-long line, wrap.
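For example, the same expression wrapped at each method call:

```scala
val graph = Graph(vertices, edges)
  .partitionBy(PartitionStrategy.RandomVertexCut)
  .groupEdges((e1, e2) => EdgeData(e1.edgeCount + e2.edgeCount))
```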
}

}
else{
fix code formatting.
}

def runPageRankAlgorithm(graph: Graph[VertexData, EdgeData], dynamic: Boolean = false,
    tolerance: Double = 0.005, numIter: Int = 20, resetProb: Double = 0.15): Graph[VertexDataPR, EdgeData] ={
four space indent.
*/
object WriteGraphXML {

//case class EdgeData(edgeCount: Int)
remove dead code.
* @param graphmlPath output file
* @return true on successful run.
*/
def makeFile (graph: Graph[VertexDataPR, EdgeData], graphmlPath: String): Boolean = {
Give the method a better name, something like writeToFile?
@@ -0,0 +1,83 @@
/*
I don't understand how this file is part of this PR?
It's already checked in? https://github.com/archivesunleashed/aut/blob/master/src/main/scala/io/archivesunleashed/app/WriteGraphML.scala
This file is different because it creates a GraphML file from a GraphX object, not from a plain RDD. It also has an extra field for PageRank. Other fields will be added as the code progresses.
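For context, the extra PageRank field would appear in the GraphML output as a per-node attribute, roughly like this (the key name and values are illustrative, not taken from the PR):

```xml
<key id="pageRank" for="node" attr.name="pageRank" attr.type="double"/>
<node id="42">
  <data key="pageRank">0.15</data>
</node>
```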
Codecov Report
@@ Coverage Diff @@
## master #228 +/- ##
=======================================
- Coverage 58.68% 55.68% -3%
=======================================
Files 38 40 +2
Lines 743 783 +40
Branches 137 144 +7
=======================================
Hits 436 436
- Misses 266 303 +37
- Partials 41 44 +3
Continue to review full report at Codecov.
@hardiksahi can you pull in master to update this PR? Once that is done, I'd like to see what CodeCov reports again.
The title of this pull-request should be a brief description of what the pull-request fixes/improves/changes. Ideally 50 characters or less.
This PR converts an RDD of WARC records into a GraphX object, performs PageRank (dynamic or static), and creates a GraphML file.
GitHub issue(s):
If you are responding to an issue, please mention their numbers below.
What does this Pull Request do?
Creates a new file, ExtractGraphXSLS.scala, that is to be called from an external script. PageRank is implemented and its scores are added to the GraphML file.
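The dynamic/static choice maps onto GraphX's built-in PageRank operators; a sketch using the parameters from the runPageRankAlgorithm signature quoted above (not the PR's exact implementation):

```scala
// Sketch: GraphX ships both PageRank variants. `dynamic`, `tolerance`,
// `numIter`, and `resetProb` are the runPageRankAlgorithm parameters
// from this PR; `graph` is the Graph[VertexData, EdgeData] built earlier.
val ranks: org.apache.spark.graphx.Graph[Double, Double] =
  if (dynamic) graph.pageRank(tolerance, resetProb)  // iterate until rank changes fall below tolerance
  else graph.staticPageRank(numIter, resetProb)      // run a fixed number of iterations
```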
How should this be tested?
Run the following script:
TestScript.txt
Check the GraphML file created at the specified path in script.
Additional Notes:
Example:
Thanks in advance for your help with the Archives Unleashed Toolkit!