Improve ExtractDomain to Better Isolate Domains #269

Closed
ianmilligan1 opened this issue Sep 13, 2018 · 10 comments

@ianmilligan1
Member

Describe the bug
ExtractDomain should be producing domains like:

www.archive.org
www.liberal.ca

etc.

At times, however, we see domains like this:

seetorontonow.canada-booknow.com\booking_results.php

This is probably due to the URL having a backslash rather than the expected forward slash.

Expected behavior
In the above example, we should probably have:

seetorontonow.canada-booknow.com

This also affects the GEXF files generated downstream.

What should we do?
Improve the ExtractDomain UDF so that it captures the domain correctly when a backslash is used as well.
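
As a rough sketch only (not the actual aut implementation; the function name is made up): normalizing backslashes to forward slashes before handing the string to java.net.URL would make getHost stop at the path even when the crawl recorded a backslash:

import java.net.URL

def extractDomainSketch(url: String): String = {
  // treat "\" as a path separator so the host ends where the path begins
  val normalized = url.replace('\\', '/')
  try {
    new URL(normalized).getHost
  } catch {
    case _: Exception => ""
  }
}

// extractDomainSketch("http://seetorontonow.canada-booknow.com\\booking_results.php")
// returns "seetorontonow.canada-booknow.com"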

@ianmilligan1
Member Author

FWIW, in ExtractDomain we use the Java URL class to extract the host for us.

We could potentially just add an extra line that splits the string at the backslash and takes the first part, but (a) I don't know Java at all, and (b) that might be a terrible idea.
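
For illustration, that "split at the backslash, take the first part" idea would be a one-liner in Scala (the language aut's UDFs are written in); the variable names here are made up:

// the host as currently returned, with the stray backslash and path attached
val host = "seetorontonow.canada-booknow.com\\booking_results.php"
// keep only what comes before the first backslash
val cleaned = host.split('\\').head  // "seetorontonow.canada-booknow.com"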

@borislin
Collaborator

borislin commented Oct 8, 2018

@ruebot Quick question: in ExtractDomain, why do we check source first and then url? I think source should only be used when url doesn't contain a valid domain host.
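
For clarity, a sketch of the ordering being suggested here (this is not the current aut code; the helper names are made up): try url first, and only fall back to source when url does not yield a host:

import java.net.URL

// returns Some(host) when the string parses to a URL with a non-empty host
def hostOf(s: String): Option[String] =
  try Option(new URL(s).getHost).filter(_.nonEmpty)
  catch { case _: Exception => None }

def extractDomainSuggested(url: String, source: String = ""): String =
  hostOf(url).orElse(hostOf(source)).getOrElse("")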

@lintool
Member

lintool commented Oct 9, 2018

@borislin According to git blame, that's @greebie's code.

Either way, we'll need more test cases and better coverage here...

@greebie
Contributor

greebie commented Oct 9, 2018

Thanks for finding this @borislin. It appears I broke this functionality when I converted it to an Option-based approach.

Adding a test case for this would help prevent the error in the future.
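
A sketch of what that extra test case might look like, using ScalaTest's FunSuite; the suite name is made up, and the aut import for ExtractDomain is assumed to be in scope:

import org.scalatest.FunSuite
// assumes aut's ExtractDomain object is on the classpath and imported

class ExtractDomainBackslashTest extends FunSuite {
  test("a backslash after the host should not leak into the domain") {
    val url = "http://seetorontonow.canada-booknow.com\\booking_results.php"
    assert(ExtractDomain(url) === "seetorontonow.canada-booknow.com")
  }
}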

@borislin
Collaborator

@ianmilligan1 Do you have an example archive file that contains a backslash in the URL so I can test?

@ianmilligan1
Member Author

I don't – I know there's one somewhere in collection 5421 though. It's 50GB. Do you want me to start a wget job and park it somewhere on tuna?

@borislin
Collaborator

@ianmilligan1 I'll try to find another way to fake one for testing. But yes, we'll still need a real-life example for a final test to make sure my fix works. Please move it to tuna and let me know when it's done, along with the path to the collection.

@ianmilligan1
Member Author

Sorry for the delay (I'm in a European timezone today) – the collection is at /tuna1/scratch/i2milligan/warcs.archive-it.org/cgi-bin/getarcs.pl

@borislin
Collaborator

@ianmilligan1 Can you give me the script you ran that produces this URL issue?

I'm not able to reproduce this issue with this collection. The script I'm using is /tuna1/scratch/aut-issue-269/spark_jobs/269.scala and the output files are in /tuna1/scratch/aut-issue-269/derivatives/all-domains/. The combined output file is /tuna1/scratch/aut-issue-269/derivatives/all-domains.txt.

Could you please provide me with all the steps you've taken to reproduce this issue?

@ianmilligan1
Member Author

It was appearing when running the link generator, i.e. the standard AUK job:

val links = RecordLoader.loadArchives("#{collection_warcs}", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1,
    ExtractDomain(f._1).replaceAll("^\\\\s*www\\\\.", ""),
    ExtractDomain(f._2).replaceAll("^\\\\s*www\\\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)
WriteGraphML(links, "#{collection_derivatives}/gephi/#{c.collection_id}-gephi.graphml")

Try running the GraphML generator on the collection? Thanks @borislin!
