Improve ExtractDomain to Better Isolate Domains #269
Comments
FWIW, in ExtractDomain we use the Java URL class to extract the host for us. We could potentially just add an extra line to split the string at the backslash and take the first part, maybe, but (a) I don't know Java at all, and (b) that might be a terrible idea.
@ruebot Quick question: in
Thanks for finding this, @borislin. It appears that I broke this functionality when I tried to convert it to an Option-based approach. Having the additional test case would help prevent this error in the future.
@ianmilligan1 Do you have an example archive file that contains a backslash in the URL so I can test?
I don't, though I know there's one somewhere in collection 5421. It's 50GB. Do you want me to start a
@ianmilligan1 I'll try to find other ways to fake one and do the testing. But sure, we still need a real-life example for a final test to make sure my fix works. Please help move it to
Sorry for the delay (am in a European timezone today). The collection is @
@ianmilligan1 Can you give me the script you ran that produces this URL issue? I'm not able to reproduce the issue with this collection. The script I'm using is
Could you please provide all the steps you took to reproduce this issue?
It was appearing when running the link generator, i.e. the standard AUK job:
Try running the GraphML generator on the collection? Thanks, @borislin!
Describe the bug
ExtractDomain should be producing domains like:
etc.
At times, however, we see domains like this:
This is probably due to the URL having a backslash rather than the expected forward slash.
Expected behavior
In the above example, we should probably have:
This affects the resulting GEXF files.
What should we do?
Improve the ExtractDomain UDF so that it also isolates the domain correctly when the URL contains a backslash.
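For illustration only, here is a minimal Java sketch of the idea floated in the comments: cut the URL at the first backslash, then let the Java URL class extract the host. The class name DomainSketch and the method extractDomain are hypothetical and are not the toolkit's actual ExtractDomain code. Returning an empty string on failure is also an assumption.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical sketch; not the toolkit's actual ExtractDomain implementation.
class DomainSketch {

    // Returns the host of the given URL, ignoring anything from the first
    // backslash onward. Returns "" (an assumed convention) if no host is found.
    static String extractDomain(String url) {
        if (url == null) {
            return "";
        }
        // Keep only the part before the first backslash, if there is one.
        int backslash = url.indexOf('\\');
        String cleaned = backslash >= 0 ? url.substring(0, backslash) : url;
        try {
            String host = new URL(cleaned).getHost();
            return host == null ? "" : host;
        } catch (MalformedURLException e) {
            // Unparseable input: fall back to the empty string.
            return "";
        }
    }

    public static void main(String[] args) {
        // Backslash where a forward slash was expected.
        System.out.println(extractDomain("http://www.example.com\\page/index.html")); // www.example.com
        // Ordinary URL, unaffected by the extra step.
        System.out.println(extractDomain("http://www.example.com/page/index.html"));  // www.example.com
    }
}
```

Truncating before parsing, rather than after calling getHost(), keeps the sketch independent of how a particular JDK treats a backslash inside the authority part of a URL. A test case along these lines, with one URL using a backslash and one using a forward slash, is the kind of regression check mentioned in the comments.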