-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix relative links extraction #504
Conversation
Codecov Report
@@ Coverage Diff @@
## main #504 +/- ##
============================================
- Coverage 88.85% 88.83% -0.03%
Complexity 57 57
============================================
Files 43 43
Lines 1014 1012 -2
Branches 86 85 -1
============================================
- Hits 901 899 -2
Misses 74 74
Partials 39 39 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for documenting the issue, and submitting a fix @yxzhu16. This is great!
I have a PR against your PR which fixes the test you were having issues with. Once you merge that, this PR should automatically get updated, and we should be good to go.
@@ -83,14 +83,6 @@ class ArcTest extends FunSuite with BeforeAndAfter { | |||
assert(discardMatches.count == 284L) | |||
} | |||
|
|||
test("Count links RDD") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need a better solution here than just deleting the test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here you go: yxzhu16#1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ruebot !
- Remove unnecessary test results comment
Fix "Count links RDD" test.
Set baseUri to be `src` instead of `base` when extracting links, and deleted `base` parameter. The issue occurred because relative links cannot be extracted by ` link.attr("abs:href")` when baseUri is not set. As I look through the code, param `base` is never provided anywhere when `ExtractLinks` is called, so default value `""` is always used, and baseUri is never set. However, `baseUri` is required to be able to extract relative links. * resolves #501 * update tests * remove unnecessary test results comment Co-authored-by: Kai Zhong <kaizhchn@hotmail.com> Co-authored-by: nruest <ruestn@gmail.com>
GitHub issue(s): #501
What does this Pull Request do?
src
instead ofbase
when extracting links, and deletedbase
parameterThe issue occurred because relative links cannot be extracted by
link.attr("abs:href")
when baseUri is not set.As I look through the code, param
base
is never provided anywhere whenExtractLinks
is called, so default value "" is always used, and baseUri is never set. However, baseUri is required to be able to extract relative links.Instead of adapting the Python UDF to include the base parameter, I think it makes sense to set the baseUri to be
src
. Similar asaut/src/main/scala/io/archivesunleashed/matchbox/ExtractImageLinks.scala
Line 42 in 00e8166
I tested on Wayback Machine bbc pidgin pages and it works.
One concern is that after modification the number of relative links is really really large compared to before, eg. in one of the test cases the number of links extracted 600 -> 30k. And in ArcTest "Count links RDD" I encountered
java.lang.StackOverflowError
because of the large number of links.How should this be tested?