Improve ExtractDomain to Better Isolate Domains #269

Closed
ianmilligan1 opened this issue Sep 13, 2018 · 10 comments

@ianmilligan1
Member

Describe the bug
ExtractDomain should be producing domains like:

www.archive.org
www.liberal.ca

etc.

At times, however, we see domains like this:

seetorontonow.canada-booknow.com\booking_results.php

This is probably due to the URL having a backslash rather than the expected forward slash.

Expected behavior
In the above example, we should probably have:

seetorontonow.canada-booknow.com

This also affects the GEXF files generated downstream.

What should we do?
Improve the ExtractDomain UDF so that it captures the domain correctly when a backslash is used as well.
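
As a rough sketch only (not the actual aut implementation; the function name is made up): normalizing backslashes to forward slashes before handing the string to java.net.URL would make getHost stop at the path even when the crawl recorded a backslash:

import java.net.URL

def extractDomainSketch(url: String): String = {
  // treat "\" as a path separator so the host ends where the path begins
  val normalized = url.replace('\\', '/')
  try {
    new URL(normalized).getHost
  } catch {
    case _: Exception => ""
  }
}

// extractDomainSketch("http://seetorontonow.canada-booknow.com\\booking_results.php")
// returns "seetorontonow.canada-booknow.com"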

@ianmilligan1
Member Author

FWIW, in ExtractDomain we use the Java URL class to extract the host for us.

We could potentially just add an extra line that splits the string at the backslash and takes the first part, but (a) I don't know Java at all, and (b) that might be a terrible idea.
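
For illustration, that "split at the backslash, take the first part" idea would be a one-liner in Scala (the language aut's UDFs are written in); the variable names here are made up:

// the host as currently returned, with the stray backslash and path attached
val host = "seetorontonow.canada-booknow.com\\booking_results.php"
// keep only what comes before the first backslash
val cleaned = host.split('\\').head  // "seetorontonow.canada-booknow.com"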

@borislin
Collaborator

borislin commented Oct 8, 2018

@ruebot Quick question: in ExtractDomain, why do we check source first and then url? I think source should only be used when url doesn't contain a valid domain host.
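
For clarity, a sketch of the ordering being suggested here (this is not the current aut code; the helper names are made up): try url first, and only fall back to source when url does not yield a host:

import java.net.URL

// returns Some(host) when the string parses to a URL with a non-empty host
def hostOf(s: String): Option[String] =
  try Option(new URL(s).getHost).filter(_.nonEmpty)
  catch { case _: Exception => None }

def extractDomainSuggested(url: String, source: String = ""): String =
  hostOf(url).orElse(hostOf(source)).getOrElse("")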

@lintool
Member

lintool commented Oct 9, 2018

@borislin According to git blame, that's @greebie's code.

Either way, we'll need more test cases and better coverage here...

@greebie
Contributor

greebie commented Oct 9, 2018

Thanks for finding this @borislin. It appears I broke this functionality when I converted it to an Option-based approach.

Adding a test case for this would help prevent the error in the future.
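
A sketch of what that extra test case might look like, using ScalaTest's FunSuite; the suite name is made up, and the aut import for ExtractDomain is assumed to be in scope:

import org.scalatest.FunSuite
// assumes aut's ExtractDomain object is on the classpath and imported

class ExtractDomainBackslashTest extends FunSuite {
  test("a backslash after the host should not leak into the domain") {
    val url = "http://seetorontonow.canada-booknow.com\\booking_results.php"
    assert(ExtractDomain(url) === "seetorontonow.canada-booknow.com")
  }
}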

@borislin
Collaborator

@ianmilligan1 Do you have an example archive file that contains a backslash in the URL so I can test?

@ianmilligan1
Member Author

I don't – I know there's one somewhere in collection 5421 though. It's 50GB. Do you want me to start a wget job and park it somewhere on tuna?

@borislin
Collaborator

@ianmilligan1 I'll try to find another way to fake one for testing. But yes, we'll still need a real-life example for a final test to make sure my fix works. Please move it to tuna and let me know when it's done, along with the path to the collection.

@ianmilligan1
Member Author

Sorry for the delay (I'm in a European timezone today) – the collection is at /tuna1/scratch/i2milligan/warcs.archive-it.org/cgi-bin/getarcs.pl

@borislin
Collaborator

@ianmilligan1 Can you give me the script you ran that produces this URL issue?

I'm not able to reproduce this issue with this collection. The script I'm using is /tuna1/scratch/aut-issue-269/spark_jobs/269.scala and the output files are in /tuna1/scratch/aut-issue-269/derivatives/all-domains/. The combined output file is /tuna1/scratch/aut-issue-269/derivatives/all-domains.txt.

Could you please provide me with all the steps you've taken to reproduce this issue?

@ianmilligan1
Member Author

It was appearing when running the link generator, i.e. the standard AUK job:

val links = RecordLoader.loadArchives("#{collection_warcs}", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1,
    ExtractDomain(f._1).replaceAll("^\\\\s*www\\\\.", ""),
    ExtractDomain(f._2).replaceAll("^\\\\s*www\\\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)
WriteGraphML(links, "#{collection_derivatives}/gephi/#{c.collection_id}-gephi.graphml")

Try running the GraphML generator on the collection? Thanks @borislin!
