
Add ExtractGraphX including algorithms for PageRank and Components. Issue 203 #245

Merged
34 commits merged, Jul 29, 2018

Conversation

greebie
Contributor

@greebie greebie commented Jul 26, 2018

GitHub issue(s):

#203

What does this Pull Request do?

Adds the ExtractGraphX algorithm and GraphML output to go with GraphX output.
Adds features to calculate PageRank and weak and strong connected components.
Provides some lint fixes for other files (usually removing magic numbers).
Deprecates ExtractGraph as outdated (although there is some discussion that comes with that).

How should this be tested?

import io.archivesunleashed._
import io.archivesunleashed.matchbox._
import io.archivesunleashed.app._
import io.archivesunleashed.util._
import org.apache.spark.graphx._

val graph = ExtractGraphX.extractGraphX(RecordLoader.loadArchives("/Users/USERNAME/WARCFOLDER/", sc)
  .keepValidPages()
  .flatMap(r => ExtractLinks(r.getUrl, r.getContentString))
  .map(r => (ExtractDomain(r._1).removePrefixWWW(), ExtractDomain(r._2).removePrefixWWW()))
  .filter(r => r._1 != "" && r._2 != ""))
  .subgraph(epred = eTriplet => eTriplet.attr.edgeCount > 5)

val pRank = ExtractGraphX.runPageRankAlgorithm(graph)

WriteGraphXML(pRank, "graphML-path.graphml/")

Additional Notes:

The main feature is strong and weak connected components, which can be used in GraphPass to reduce the size of a network graph if it is > 50k nodes.
Strong connected components are also interesting as a potential measure of who is driving the overall conversation, or of whether factions exist in a community.
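For intuition, weak components ignore edge direction entirely. A minimal union-find sketch over a plain edge list (illustrative only; the toolkit computes this at scale with GraphX's connectedComponents() rather than anything like this):

```scala
// Illustrative sketch, not AUT code: what "weak components" means on a
// directed edge list, computed with a simple union-find that ignores
// edge direction (which is exactly what makes the components "weak").
object WeakComponentsSketch {
  def components(edges: Seq[(String, String)]): Map[String, Set[String]] = {
    val nodes = edges.flatMap(e => Seq(e._1, e._2)).distinct
    val parent = scala.collection.mutable.Map(nodes.map(n => n -> n): _*)
    // Find the root of a node, with path compression.
    def find(n: String): String =
      if (parent(n) == n) n else { val r = find(parent(n)); parent(n) = r; r }
    // Union the endpoints of every edge, direction ignored.
    edges.foreach { case (a, b) => parent(find(a)) = find(b) }
    // Group nodes by their final root.
    nodes.groupBy(find).map { case (root, ns) => root -> ns.toSet }
  }
}
```

So `components(Seq(("a","b"), ("b","c"), ("x","y")))` yields two weak components, {a, b, c} and {x, y}, regardless of which way the links point.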

Interested parties

Tag (@ mention) interested parties.

Thanks in advance for your help with the Archives Unleashed Toolkit!

@greebie
Contributor Author

greebie commented Jul 26, 2018

ExtractGraph does provide a rudimentary JSON output that is not included in ExtractGraphX. I think it makes sense to create a more generic CreateJSONGraph, or even a more detailed output class that will produce whatever format we want. Either way, I'm not sure the JSON creator is used very much.
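A generic CreateJSONGraph could be little more than a writer over an edge list. A minimal sketch (the object name comes from the comment above, but the signature and the nodes/links output shape are my assumptions, not anything in AUT):

```scala
// Hypothetical sketch of a generic JSON graph writer. The name
// CreateJSONGraph is borrowed from the discussion; the (src, dst, weight)
// input and the d3-style nodes/links output are illustrative choices.
object CreateJSONGraph {
  def apply(edges: Seq[(String, String, Int)]): String = {
    // Collect the distinct node ids appearing in any edge.
    val nodes = edges.flatMap(e => Seq(e._1, e._2)).distinct
    val nodeJson = nodes.map(n => s"""{"id": "$n"}""").mkString(", ")
    val edgeJson = edges
      .map { case (s, d, w) => s"""{"source": "$s", "target": "$d", "weight": $w}""" }
      .mkString(", ")
    s"""{"nodes": [$nodeJson], "links": [$edgeJson]}"""
  }
}
```

A real version would escape strings properly and stream to disk rather than build one string, but the shape of the problem is this small.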

@codecov

codecov bot commented Jul 26, 2018

Codecov Report

Merging #245 into master will increase coverage by 1.85%.
The diff coverage is 91.25%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #245      +/-   ##
==========================================
+ Coverage   68.71%   70.57%   +1.85%     
==========================================
  Files          39       41       +2     
  Lines         911      982      +71     
  Branches      168      179      +11     
==========================================
+ Hits          626      693      +67     
- Misses        231      232       +1     
- Partials       54       57       +3
Impacted Files Coverage Δ
...chivesunleashed/app/DomainFrequencyExtractor.scala 100% <ø> (ø) ⬆️
...c/main/scala/io/archivesunleashed/df/package.scala 90.47% <ø> (+3.51%) ⬆️
...o/archivesunleashed/app/DomainGraphExtractor.scala 100% <ø> (ø) ⬆️
.../scala/io/archivesunleashed/app/ExtractGraph.scala 0% <0%> (ø) ⬆️
src/main/scala/io/archivesunleashed/package.scala 84.11% <0%> (ø) ⬆️
...scala/io/archivesunleashed/app/WriteGraphXML.scala 100% <100%> (ø)
...ain/scala/io/archivesunleashed/ArchiveRecord.scala 83.33% <100%> (ø) ⬆️
.../scala/io/archivesunleashed/app/WriteGraphML.scala 100% <100%> (ø) ⬆️
...o/archivesunleashed/app/ExtractPopularImages.scala 100% <100%> (ø) ⬆️
...scala/io/archivesunleashed/app/ExtractGraphX.scala 92.1% <92.1%> (ø)
... and 2 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 290b6aa...d1d1603. Read the comment docs.

@ianmilligan1
Member

Thanks for this Ryan!

For the test script, what should we expect to see as a result?

@greebie
Contributor Author

greebie commented Jul 26, 2018

It should produce a network graph at "graphML-path.graphml/" in GraphML format, with PageRank and other metadata.

@ianmilligan1
Member

OK great, thanks @greebie - I will test it, probably tomorrow morning!

@ruebot
Member

ruebot commented Jul 27, 2018

What is the roadmap for this functionality in auk post the next release of aut? Is it a drop-in replacement, or will the Spark background job need significant updates? How does this chain with GraphPass?

@ianmilligan1
Member

Successfully generated the file with:

import io.archivesunleashed._
import io.archivesunleashed.matchbox._
import io.archivesunleashed.app._
import io.archivesunleashed.util._
import org.apache.spark.graphx._

val graph = ExtractGraphX.extractGraphX(RecordLoader.loadArchives("/mnt/vol1/data_sets/cpp/cpp_warcs_accession_01/partner.archive-it.org/cgi-bin/getarcs.pl/*200912*", sc)
.keepValidPages()
.flatMap(r => ExtractLinks(r.getUrl, r.getContentString))
.map(r => (ExtractDomain(r._1).removePrefixWWW(), ExtractDomain(r._2).removePrefixWWW()))
.filter(r => r._1 != "" && r._2 != "")).subgraph(epred = eTriplet => eTriplet.attr.edgeCount>5)

val pRank = ExtractGraphX.runPageRankAlgorithm(graph)

WriteGraphXML(pRank, "/mnt/vol1/derivative_data/test/graphML-test.graphml")

However, when I attempted to open the ensuing graphML-test.graphml file in Gephi, I received the following error message:

javax.xml.stream.XMLStreamException: ParseError at [row,col]:[10,51]
Message: Element type "key" must be followed by either attribute specifications, ">" or "/>".
	at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:604)
	at org.gephi.io.importer.plugin.file.ImporterGraphML.execute(ImporterGraphML.java:158)
Caused: java.lang.RuntimeException
	at org.gephi.io.importer.plugin.file.ImporterGraphML.execute(ImporterGraphML.java:181)
	at org.gephi.io.importer.impl.ImportControllerImpl.importFile(ImportControllerImpl.java:199)
	at org.gephi.io.importer.impl.ImportControllerImpl.importFile(ImportControllerImpl.java:169)
	at org.gephi.desktop.importer.DesktopImportControllerUI$4.run(DesktopImportControllerUI.java:341)
Caused: java.lang.RuntimeException
	at org.gephi.desktop.importer.DesktopImportControllerUI$4.run(DesktopImportControllerUI.java:349)
[catch] at org.gephi.utils.longtask.api.LongTaskExecutor$RunningLongTask.run(LongTaskExecutor.java:274)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Member

@ianmilligan1 ianmilligan1 left a comment


Probably something weird happening in WriteGraphXML?

@greebie
Contributor Author

greebie commented Jul 27, 2018

Re: auk.

The first thing is that this update offers some good features for network analysis, so it is worthwhile adding, even if it does not work for AUK.

The AUK Spark script would need revision, which is mostly wrapping the current flatMap script in the GraphX graph object and using .runPageRankAlgorithm(). An additional advantage is that the raw Gephi graphs could also be slightly more attractive, because they would use PageRank instead of degree for the default node sizing.

I'll definitely do some testing before we move forward on changing the algorithm. The new approach will generally take longer for all graphs, but much, much less time than GraphPass takes for huge ones.

Also, Hardik and I were looking at a paper that examined the differences in PageRank calculations between Spark and igraph.

The "quickrun" feature in GraphPass would include something like this pseudocode:

if NODESIZE(graph) > 50000:
 - check for "Weak" and/or "Strong" attribute
 - create the subgraph from the largest weak component if the attribute exists, and check whether the new graph has < 50000 nodes
 - otherwise, create the strong subgraph if the attribute exists and check whether the new graph has < 50000 nodes
 - otherwise, fail GraphPass

- if we have a good new graph, then run the usual quickpass stuff
- maybe send a _WEAKPASS or _STRONGPASS file to tell AUK we didn't use the whole graph
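The branching in that pseudocode can be sketched in plain Scala. This is a hedged illustration of the size-check logic only; all names, the 50000 threshold placement, and the Either-based result are illustrative, not GraphPass's actual C API:

```scala
// Sketch of the quickrun size check described above (illustrative only).
object QuickrunSketch {
  val MaxNodes = 50000 // node limit from the discussion above

  // Decide which graph (if any) to hand to the usual quickpass step.
  // weakSize/strongSize are node counts of the largest weak/strong
  // component subgraphs; None means the attribute is absent.
  def chooseSubgraph(totalNodes: Int,
                     weakSize: Option[Int],
                     strongSize: Option[Int]): Either[String, String] = {
    if (totalNodes <= MaxNodes) Right("full graph")
    else weakSize match {
      case Some(w) if w <= MaxNodes =>
        Right("weak-component subgraph (_WEAKPASS)")
      case _ => strongSize match {
        case Some(s) if s <= MaxNodes =>
          Right("strong-component subgraph (_STRONGPASS)")
        case _ => Left("fail graphpass: no subgraph under the node limit")
      }
    }
  }
}
```

For example, a graph of 80,000 nodes with a 30,000-node largest weak component would fall through to the _WEAKPASS branch.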

The downside is longer Spark calls, possibly for all graphs, if we do not add some conditional reasoning behind choosing to run the PageRank algorithm.

The upside is that this would provide visualizations for very large graphs in a way that makes theoretical sense. Since web archives are not going to get smaller in the long term, it is important that we have a solution, even if it's not optimal.

Further, we have a way forward if SNAP is simply beyond my capabilities or does not have the ability to do what we need it to.

GraphPass would not have to eat up resources for huge graphs.

@greebie
Contributor Author

greebie commented Jul 27, 2018

@ianmilligan1 Yup. That looks like I accidentally added or removed a "<" somewhere. Will revise.

@greebie
Contributor Author

greebie commented Jul 27, 2018

@ianmilligan1 The latest update should work in Gephi this time. I tested it with a pretty good set of WARCs. The weak component works nicely!

@ianmilligan1
Member

Works now, thanks @greebie!

[Screenshot, 2018-07-27: the generated GraphML file rendered in Gephi]

greebie added 2 commits July 27, 2018 17:35
It was challenging to test the XML output - id & component generation is a little wonky.
@greebie
Contributor Author

greebie commented Jul 27, 2018

Hey 70% Codecov! That's a milestone! :)

@@ -98,14 +98,14 @@ class ArchiveRecordImpl(r: SerializableWritable[ArchiveRecordWritable]) extends
new String(getContentBytes)
}

val getMimeType = {
val getMimeType: String = {
Member


Just a quick question - what's this change for?

Contributor Author


There are a number of lint change requests from scalastyle that need to be addressed.

I decided to pick some of these off to reduce the errors that show up in the build.

Part of that is requiring explicit types for any public method. I'm pretty sure MIME types are always strings.
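For anyone unfamiliar with that scalastyle rule: it asks that public members carry an explicit type annotation rather than an inferred one, which is exactly what the diff above does. A tiny stand-alone sketch (the class and value here are made up for illustration, not AUT code):

```scala
// Illustrative only: scalastyle's public-members-have-explicit-type rule.
class RecordExample {
  // Before (type inferred, flagged by scalastyle):
  //   val getMimeType = { "text/html" }

  // After (type stated explicitly):
  val getMimeType: String = {
    "text/html"
  }
}
```

The behavior is identical; the annotation just makes the public API explicit and keeps inference changes from silently altering it.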
