Align RDD and DF output for DomainGraphExtractor. #437
- Resolves #436
- Remove WWW prefix for RDD was double escaping
- Update DF so it matches RDD output (it wasn't even close before :facepalm:)
- Update tests so they're basically testing the same thing
Codecov Report
@@ Coverage Diff @@
## master #437 +/- ##
==========================================
+ Coverage 77.99% 78.04% +0.05%
==========================================
Files 43 43
Lines 1554 1558 +4
Branches 286 286
==========================================
+ Hits 1212 1216 +4
Misses 217 217
Partials 125 125
Looks good - tried it out.
One proviso - the output of this has node IDs like:
<node id="2343ec78a04c6ea9d80806345d31fd78" label="facebook.com" />
<node id="9cce24c55aee4eb39845fde935cca3da" label="web.net" />
<node id="5399465c5b23df17b16c2377e865a0b2" label="PetitionOnline.com" />
<node id="1fbfb6126d36fd25c16de2b0142700d8" label="traduku.net" />
<node id="d1063af181fe606e55ed93dd5b867169" label="en.wikipedia.org" />
<node id="0412791bbc450bbeb5b7d35eaed7e4f2" label="calendarix.com" />
<node id="fb1c73ca981330da55c56e07be521842" label="goodsforgreens.myshopify.com" />
Whereas if we were to run a script like this one in aut-docs, we get:
<node id="76" label="liberalpartyofcanada-mb.ca" />
<node id="80" label="lpco.ca" />
<node id="84" label="snapdesign.ca" />
<node id="88" label="PetitionOnline.com" />
<node id="92" label="egale.ca" />
<node id="96" label="liberal.nf.net" />
<node id="100" label="policyalternatives.ca" />
<node id="1" label="collectionscanada.ca" />
The behaviour of DomainGraphExtractor is preferable to the WriteGraph(links, "links-for-gephi.gexf") output.
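To illustrate the difference between the two ID styles above, here is a minimal Python sketch. It is not aut's code: `hashed_node_id` and `sequential_node_ids` are hypothetical helpers, and the use of MD5 over the label is an assumption based only on the 32-character hexadecimal IDs shown in the first sample.

```python
import hashlib
from itertools import count


def hashed_node_id(label: str) -> str:
    """Derive a node ID from the label itself (assumed here: MD5 hex digest).

    The same label always maps to the same 32-character ID,
    regardless of the order in which nodes are seen.
    """
    return hashlib.md5(label.encode("utf-8")).hexdigest()


def sequential_node_ids(labels):
    """Assign IDs from a running counter (hypothetical scheme).

    The ID a label receives depends on where it first appears in the input.
    """
    counter = count(1)
    ids = {}
    for label in labels:
        if label not in ids:
            ids[label] = next(counter)
    return ids
```

One reason to prefer label-derived IDs is stability: two graphs produced from different runs or different input orderings give the same ID to the same domain, which makes them easier to compare or merge.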
@ianmilligan1 can you open up an issue for that? That's a good catch. Those should all be aligned.
@ruebot Will do tomorrow!
GitHub issue(s): #436
What does this Pull Request do?
How should this be tested?
TravisCI + Some version of this:
The output of these two files should have the same line count:
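The original commands are not reproduced on this page. As a stand-in, a minimal Python sketch of the line-count comparison could look like the following; `count_lines` and `same_line_count` are hypothetical helper names, and the paths passed to them would be whatever the RDD and DF runs wrote out.

```python
def count_lines(path):
    """Return the number of lines in a text file."""
    with open(path, "r", encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def same_line_count(first_path, second_path):
    """True when both output files contain the same number of lines."""
    return count_lines(first_path) == count_lines(second_path)
```

Matching line counts are only a rough check that both paths emit the same number of nodes and edges; they do not prove the contents are identical.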
Additional Notes
This should unblock #435.
It's also worth noting this:
https://github.com/archivesunleashed/aut/blob/issue-436/src/main/scala/io/archivesunleashed/app/DomainGraphExtractor.scala#L42 vs https://github.com/archivesunleashed/aut/blob/issue-436/src/main/scala/io/archivesunleashed/package.scala#L184
They should be doing the same thing, but on the DataFrame side, we still get empty src or dest values. That's why https://github.com/archivesunleashed/aut/blob/issue-436/src/main/scala/io/archivesunleashed/app/DomainGraphExtractor.scala#L62-L63 is there.
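The extra guard described above amounts to dropping any edge whose source or destination is empty. A minimal Python sketch of that idea follows. It is not the Scala code at the linked lines: `drop_incomplete_edges` is a hypothetical helper, and representing edges as dictionaries with `src` and `dest` keys is an assumption made for illustration.

```python
def drop_incomplete_edges(edges):
    """Keep only edges that have both a non-empty source and destination.

    Each edge is assumed to be a dict with "src" and "dest" keys;
    missing keys, None, and empty strings are all treated as empty.
    """
    return [edge for edge in edges if edge.get("src") and edge.get("dest")]


links = [
    {"src": "facebook.com", "dest": "web.net"},
    {"src": "", "dest": "web.net"},          # empty source: dropped
    {"src": "en.wikipedia.org", "dest": None},  # empty destination: dropped
]
complete = drop_incomplete_edges(links)
```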