Converts WARC RDD into a GraphX object, performs PageRank and converts into GraphML object #228
Conversation
import org.apache.spark.rdd.RDD

/** Extracts a site link structure using Spark's GraphX utility. */
object ExtractGraphXSLS {
What does XSLS stand for?
Let's write this using the same command-line framework that @TitusAn created.
I used it as an abbreviation for SiteLinkStructure.
I will write this using the same command-line framework. Is there any documentation for it?
The relevant code is in https://github.com/archivesunleashed/aut/blob/master/src/main/scala/io/archivesunleashed/app/CommandLineAppRunner.scala. Handlers can be added to extractors, a map from strings denoting operations to functions that process RDDs.
Going to re-write this as ExtractGraph, replacing the existing instance.
*/
def pageHash(url: String): VertexId = {
  url.hashCode.toLong
}
If hash codes collide, the vertex ids won't be unique, right?
Yes, that is true. Can you suggest any other way to uniquely assign long values to vertices as an identifier?
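One way to see the problem (a self-contained illustration, not code from this PR): the JVM's String.hashCode is only 32 bits, so distinct URLs can map to the same VertexId:

```scala
object HashCollisionDemo extends App {
  // "Aa" and "BB" are a classic String.hashCode collision on the JVM:
  // 'A' * 31 + 'a' == 'B' * 31 + 'B' == 2112
  val a = "Aa".hashCode.toLong
  val b = "BB".hashCode.toLong
  println(s"$a == $b: ${a == b}")  // prints: 2112 == 2112: true
}
```

With a graph of millions of crawled URLs, the birthday bound makes at least one such collision very likely.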
The hash situation is going to be a problem, especially for very large graphs.
My research says the solution is some implementation of .zipWithIndex() or .zipWithUniqueId(). I think the latter can cause some problems if the partitions are not the same size, but it does not trigger a Spark job. http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.zipWithIndex
I'm not going to try to fix this here, however.
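For later reference, a sketch of what id assignment via zipWithUniqueId might look like (this is not code from the PR; it assumes the extractedLinks and VertexData names from the diff below):

```scala
import org.apache.spark.graphx.VertexId
import org.apache.spark.rdd.RDD

// Sketch: derive collision-free vertex ids from zipWithUniqueId instead
// of hashing. Assumes `extractedLinks: RDD[(String, String)]` and the
// `VertexData` case class from this PR.
val urlsWithIds: RDD[(String, VertexId)] = extractedLinks
  .flatMap(r => List(r._1, r._2))
  .distinct
  .zipWithUniqueId()  // one unique Long per URL; does not trigger a Spark job

val vertices: RDD[(VertexId, VertexData)] =
  urlsWithIds.map { case (url, id) => (id, VertexData(url)) }
```

Note the trade-off: building the edge RDD would then require joining extractedLinks against urlsWithIds to look up each endpoint's id, extra shuffle work that the hashing approach avoids.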
val vertices: RDD[(VertexId, VertexData)] = extractedLinks
  .flatMap(r => List(r._1, r._2))
  .distinct
  .map(r => (pageHash(r), VertexData(r)))
danger here, see above.
Same as above
val edges: RDD[Edge[EdgeData]] = extractedLinks
  .map(r => Edge(pageHash(r._1), pageHash(r._2), EdgeData(1)))

kill extra lines.
Is there any specific code formatter you use for Scala that can be used in Eclipse?
The general principle is 2-space indentation and keeping lines around 80 characters long.
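No Eclipse-specific formatter is mentioned in this thread, but if the project wanted tooling to enforce that principle, a minimal scalafmt configuration along these lines would do it (hypothetical, not part of the repo):

```
# .scalafmt.conf (hypothetical)
maxColumn = 80   # wrap lines at ~80 characters; 2-space indentation is scalafmt's default
```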
val graph = Graph(vertices, edges).partitionBy(PartitionStrategy.RandomVertexCut).groupEdges((e1,e2) => EdgeData(e1.edgeCount+e2.edgeCount))
overly-long line, wrap.
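For example, the same expression wrapped at each method call:

```scala
val graph = Graph(vertices, edges)
  .partitionBy(PartitionStrategy.RandomVertexCut)
  .groupEdges((e1, e2) => EdgeData(e1.edgeCount + e2.edgeCount))
```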
}

}
else{
fix code formatting.
}

def runPageRankAlgorithm(graph: Graph[VertexData, EdgeData], dynamic: Boolean = false,
    tolerance: Double = 0.005, numIter: Int = 20, resetProb: Double = 0.15): Graph[VertexDataPR, EdgeData] ={
four space indent.
*/
object WriteGraphXML {

//case class EdgeData(edgeCount: Int)
remove dead code.
* @param graphmlPath output file
* @return true on successful run.
*/
def makeFile (graph: Graph[VertexDataPR, EdgeData], graphmlPath: String): Boolean = {
Give the method a better name, something like writeToFile?
@@ -0,0 +1,83 @@
/*
I don't understand how this file is part of this PR?
It's already checked in? https://github.com/archivesunleashed/aut/blob/master/src/main/scala/io/archivesunleashed/app/WriteGraphML.scala
This file is different because it creates a GraphML file from a GraphX object, not from a plain RDD. It also has an extra field for PageRank. Other fields will be added as the code progresses.
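For context, the extra PageRank field would appear in the GraphML output as a per-node attribute, roughly like this (the key name and values are illustrative, not taken from the PR):

```xml
<key id="pageRank" for="node" attr.name="pageRank" attr.type="double"/>
<node id="42">
  <data key="pageRank">0.15</data>
</node>
```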
Codecov Report
@@ Coverage Diff @@
## master #228 +/- ##
=======================================
- Coverage 58.68% 55.68% -3%
=======================================
Files 38 40 +2
Lines 743 783 +40
Branches 137 144 +7
=======================================
Hits 436 436
- Misses 266 303 +37
- Partials 41 44 +3
Continue to review full report at Codecov.
@hardiksahi can you pull in master to update this PR? Once that is done, I'd like to see what CodeCov reports again.
The title of this pull-request should be a brief description of what the pull-request fixes/improves/changes. Ideally 50 characters or less.
This PR converts an RDD of WARC records into a GraphX object, performs PageRank (dynamic or static), and creates a GraphML file.
GitHub issue(s):
If you are responding to an issue, please mention their numbers below.
What does this Pull Request do?
Creates a new file, ExtractGraphXSLS.scala, that is to be called from an external script. PageRank is implemented and its scores are added to the GraphML file.
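The dynamic/static choice maps onto GraphX's built-in PageRank operators; a sketch using the parameters from the runPageRankAlgorithm signature quoted above (not the PR's exact implementation):

```scala
// Sketch: GraphX ships both PageRank variants. `dynamic`, `tolerance`,
// `numIter`, and `resetProb` are the runPageRankAlgorithm parameters
// from this PR; `graph` is the Graph[VertexData, EdgeData] built earlier.
val ranks: org.apache.spark.graphx.Graph[Double, Double] =
  if (dynamic) graph.pageRank(tolerance, resetProb)  // iterate until rank changes fall below tolerance
  else graph.staticPageRank(numIter, resetProb)      // run a fixed number of iterations
```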
How should this be tested?
Run the following script:
TestScript.txt
Check the GraphML file created at the specified path in script.
Additional Notes:
Example:
Thanks in advance for your help with the Archives Unleashed Toolkit!