Align RDD and DF output for DomainGraphExtractor. #437

Merged 7 commits on Apr 8, 2020

@ruebot ruebot commented Apr 8, 2020

GitHub issue(s): #436

What does this Pull Request do?

Align RDD and DF output for DomainGraphExtractor.

- Resolves #436
- Fix WWW prefix removal for RDD, which was double escaping
- Update DF output so it matches RDD output (it wasn't even close before :facepalm:)
- Update tests so they're basically testing the same thing
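
As an illustration of the kind of double-escaping problem the WWW-prefix bullet above describes, here is a minimal Python sketch. It is not the project's Scala code: the two patterns and the helper name `remove_prefix_www` are assumptions made up for this example. The point is only that an over-escaped pattern asks for a literal backslash, so it never matches a real domain and the prefix survives.

```python
import re

# Hypothetical patterns for illustration only; not copied from aut.
CORRECT = r"^\s*www\."          # optional whitespace, then "www."
OVER_ESCAPED = r"^\\s*www\\."   # literal backslash, "s"*, "www", literal backslash, any char


def remove_prefix_www(domain: str, pattern: str = CORRECT) -> str:
    """Strip a leading 'www.' from a domain using the given regex."""
    return re.sub(pattern, "", domain)


# The correct pattern strips the prefix.
print(remove_prefix_www("www.example.com"))
# The over-escaped pattern never matches, so the input comes back unchanged.
print(remove_prefix_www("www.example.com", OVER_ESCAPED))
```

With the correct pattern the first call prints `example.com`; with the over-escaped one the second call prints `www.example.com`, which is how two code paths that look equivalent can produce different domain lists.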

How should this be tested?

TravisCI, plus some version of this:

bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.50.1-SNAPSHOT-fatjar.jar --extractor DomainGraphExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/app-output/issue-436-rdd --output-format TEXT --partition 1
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.50.1-SNAPSHOT-fatjar.jar --extractor DomainGraphExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/app-output/issue-436-df --output-format TEXT --df --partition 1

The output files from these two runs should have the same line count:

[nruest@wombat:app-output]$ wc -l issue-436-rdd/part-00000 issue-436-df/part-00000-10a96d3c-7f35-4bba-9239-8fb23997612c-c000.csv
  4874 issue-436-rdd/part-00000
  4874 issue-436-df/part-00000-10a96d3c-7f35-4bba-9239-8fb23997612c-c000.csv
  9748 total
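
The same check can be scripted so it does not depend on the generated part-file names. The following is a minimal sketch, assuming each output directory holds one or more `part-*` files; the function names `count_lines` and `same_line_count` are hypothetical helpers, not part of aut.

```python
from pathlib import Path


def count_lines(output_dir: str, pattern: str = "part-*") -> int:
    """Total number of lines across all part files in a Spark output directory."""
    total = 0
    for part in sorted(Path(output_dir).glob(pattern)):
        with part.open("rb") as handle:
            total += sum(1 for _ in handle)
    return total


def same_line_count(rdd_dir: str, df_dir: str) -> bool:
    """True when the RDD and DF outputs contain the same number of lines."""
    return count_lines(rdd_dir) == count_lines(df_dir)


# Example call, using the directory names from the commands above:
# same_line_count("issue-436-rdd", "issue-436-df")
```

Matching line counts are a necessary condition, not proof that the two outputs are identical; comparing sorted contents would be a stricter follow-up.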

Additional Notes

This should unblock #435.

It's also worth noting this:

https://github.com/archivesunleashed/aut/blob/issue-436/src/main/scala/io/archivesunleashed/app/DomainGraphExtractor.scala#L42 vs https://github.com/archivesunleashed/aut/blob/issue-436/src/main/scala/io/archivesunleashed/package.scala#L184

They should be doing the same thing, but on the DataFrame side, we still get empty src or dest values. That's why https://github.com/archivesunleashed/aut/blob/issue-436/src/main/scala/io/archivesunleashed/app/DomainGraphExtractor.scala#L62-L63 is there.

@ruebot ruebot requested review from lintool and ianmilligan1 April 8, 2020 21:56
codecov bot commented Apr 8, 2020

Codecov Report

Merging #437 into master will increase coverage by 0.05%.
The diff coverage is 97.36%.

@@            Coverage Diff             @@
##           master     #437      +/-   ##
==========================================
+ Coverage   77.99%   78.04%   +0.05%     
==========================================
  Files          43       43              
  Lines        1554     1558       +4     
  Branches      286      286              
==========================================
+ Hits         1212     1216       +4     
  Misses        217      217              
  Partials      125      125              

@ianmilligan1 ianmilligan1 left a comment

Looks good - tried it out.

One proviso - the output of this has node IDs like:

<node id="2343ec78a04c6ea9d80806345d31fd78" label="facebook.com" />
<node id="9cce24c55aee4eb39845fde935cca3da" label="web.net" />
<node id="5399465c5b23df17b16c2377e865a0b2" label="PetitionOnline.com" />
<node id="1fbfb6126d36fd25c16de2b0142700d8" label="traduku.net" />
<node id="d1063af181fe606e55ed93dd5b867169" label="en.wikipedia.org" />
<node id="0412791bbc450bbeb5b7d35eaed7e4f2" label="calendarix.com" />
<node id="fb1c73ca981330da55c56e07be521842" label="goodsforgreens.myshopify.com" />

Whereas if we were to run a script like this one in aut-docs, we get:

<node id="76" label="liberalpartyofcanada-mb.ca" />
<node id="80" label="lpco.ca" />
<node id="84" label="snapdesign.ca" />
<node id="88" label="PetitionOnline.com" />
<node id="92" label="egale.ca" />
<node id="96" label="liberal.nf.net" />
<node id="100" label="policyalternatives.ca" />
<node id="1" label="collectionscanada.ca" />

The behaviour of DomainGraphExtractor is preferable to the WriteGraph(links, "links-for-gephi.gexf") output.
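
To make the difference between the two ID schemes concrete, here is a small sketch. It assumes the 32-hex-character IDs are an MD5 digest of the label, which fits their shape but is not confirmed in this thread; `hashed_id` and `sequential_ids` are hypothetical names used only for illustration.

```python
import hashlib
from itertools import count


def hashed_id(label: str) -> str:
    """ID derived from the label itself (assumption: MD5 hex digest)."""
    return hashlib.md5(label.encode("utf-8")).hexdigest()


def sequential_ids(labels):
    """IDs handed out by a counter in the order labels are first seen."""
    counter = count(1)
    ids = {}
    for label in labels:
        if label not in ids:
            ids[label] = str(next(counter))
    return ids


run_a = ["facebook.com", "web.net"]
run_b = ["web.net", "facebook.com"]

# A label-derived ID is the same no matter when the label is encountered.
print(hashed_id("web.net") == hashed_id("web.net"))  # True
# A counter-based ID depends on encounter order.
print(sequential_ids(run_a)["web.net"], sequential_ids(run_b)["web.net"])  # 2 1
```

Label-derived IDs stay stable across runs and across the RDD and DF paths, which makes graphs from different jobs easier to merge or compare.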

@ianmilligan1 ianmilligan1 merged commit 96899f4 into master Apr 8, 2020
@ianmilligan1 ianmilligan1 deleted the issue-436 branch April 8, 2020 22:33

ruebot commented Apr 8, 2020

@ianmilligan1 can you open up an issue for that? That's a good catch. Those should all be aligned.

@ianmilligan1

@ruebot Will do tomorrow!
