Revert "make ArchiveRecord a trait (#175)" #181
This reverts commit cd0d7b0.
Sounds good, @ruebot. I’m planning to set up auk locally on my workstation on campus, so I’ll download collections and run jobs overnight and while I’m teaching tonight.
@ianmilligan1 thanks!
Codecov Report
@@            Coverage Diff             @@
##           master     #181      +/-   ##
==========================================
+ Coverage   67.39%   67.44%   +0.05%
==========================================
  Files          33       33
  Lines         641      639       -2
  Branches      125      125
==========================================
- Hits          432      431       -1
+ Misses        168      167       -1
  Partials       41       41
Agreed about the unit testing. I can take a look at that this week. Looks like a problem with making a DateFormat on an empty string.
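For reference, one reading of "a DateFormat on an empty string" is parsing an empty date field from a record. The sketch below is only an illustration of that failure mode and a defensive wrapper around it; `EmptyDateSketch`, `parseDate`, and the `yyyyMMddHHmmss` pattern are hypothetical names and values, not the project's actual code.

```scala
import java.text.SimpleDateFormat
import java.util.Date
import scala.util.Try

object EmptyDateSketch {
  // Assumed pattern for illustration only; the project's real pattern may differ.
  private val pattern = "yyyyMMddHHmmss"

  // SimpleDateFormat.parse throws ParseException on "" (and NPE on null),
  // so wrap the call and return None instead of failing the whole job.
  def parseDate(raw: String): Option[Date] =
    Try(new SimpleDateFormat(pattern).parse(raw)).toOption

  def main(args: Array[String]): Unit = {
    println(parseDate(""))                         // prints None
    println(parseDate("20180205123000").isDefined) // prints true
  }
}
```

Creating a new formatter per call is deliberately simple here; it trades some allocation cost for not sharing mutable state.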
I've now tested successfully on "Artist-run centres," "Planning in theory and practice," "Halifax Explosion," and "Nova Scotia Theater companies" - all working quite nicely. I can keep testing but I'm pretty sure we're up and running again.
I am really sorry about the problems caused by this, but I cannot see how my changes could possibly break anything or cause this error in the attached logs: |
I don't know how many spare cycles we have over the next few weeks to intensively test this, @helgeho, although we were able to reproduce this on many WARCs. My sense of it is that with pristine, perfect WARCs this new approach works fine, but it breaks on the sort of missing fields and other such things that we find in many records.
The thing is that there is no new approach. I did not change any logic; I just refactored this one class with a trait. That's a common practice and should not affect anything, so I do not understand what's happening here at all. Could you please share a WARC file that stopped working after my change? Then I'll try to figure this out...
@helgeho it blew up on a bunch of our WALK partner collections on cloud.archivesunleashed.org over the weekend after I put 0.13.0 into production. The files in the linked gist are these two:
Since we've signed research agreements with these universities, we can't just hand over the files. But I'll reach out to Dalhousie and see if it is OK that I share these two in an effort to figure out what's going on here.
From some quick testing, I wonder if it has to do with multiple inputs. Specifying a |
Looking at it with 20-20 hindsight, moving everything but ISO8601 to a trait somehow meant that we have no DATEFORMAT for more than one WARC. Eventually, we will have a test case for this.
so are you saying that keeping ISO8601 as part of the class ( |
It might, but I'm not 100% sure. Looking at the errors, though, ISO8601 seems to be the culprit. The way to test is to call RecordLoader.loadArchives (*.warc.gz, {etc}) on a folder with more than one WARC.
Since I had the cloned repo on my system, I checked. ISO8601 in object ArchiveRecord breaks on ingesting two *.warc.gz files. Moving it to ArchiveRecordImpl works. If you send on the new PR, I'd like @ianmilligan1 or @ruebot to test it this round, because they have the most knowledge of the use cases. Unfortunately, they may be indisposed for the next bit, so it might take a couple of days or so. As discussed, having this break turns out to be a good thing, because it exposes a potentially general problematic use case, which we can test for as per #182.
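To make the two placements concrete, here is a rough structural sketch. It does not reproduce the failure and it is not the project's code: `RecordSketch`, `RecordSketchImpl`, `crawlDate`, and the date pattern are hypothetical stand-ins for `ArchiveRecord`, `ArchiveRecordImpl`, and their fields.

```scala
import java.text.SimpleDateFormat
import java.util.{Date, TimeZone}

// Hypothetical stand-in for the trait; only the shape matters here.
trait RecordSketch {
  def crawlDate: String
}

object RecordSketch {
  // Placement A: a single formatter held by the companion object,
  // shared by every record created in the same JVM.
  val sharedIso8601: SimpleDateFormat = newFormatter()

  // Assumed ISO 8601 pattern, fixed to UTC so output is deterministic.
  def newFormatter(): SimpleDateFormat = {
    val fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'")
    fmt.setTimeZone(TimeZone.getTimeZone("UTC"))
    fmt
  }
}

class RecordSketchImpl(epochMillis: Long) extends RecordSketch {
  // Placement B: each implementation instance owns its formatter,
  // so no formatter state is shared between records.
  private val iso8601: SimpleDateFormat = RecordSketch.newFormatter()

  val crawlDate: String = iso8601.format(new Date(epochMillis))
}
```

Placement B costs one formatter per record, which may matter on large collections; it is shown only because it mirrors the arrangement reported to work above.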
I suspect this happened because Neither of the two implementations is strictly thread-safe at present, AFAICT. You may want to consider instantiating in-line or wrapping it as a |
This is normally nothing we would need to worry about in Spark as every record is only processed by a single thread. However, @anjackson might be right with the |
Thanks @anjackson. It may be worth seeing what we can do to make the SimpleDateFormat more thread-safe. I'll add the issue.
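As a starting point for that issue, here is a minimal sketch of two common ways to avoid sharing a mutable SimpleDateFormat across threads: a per-thread instance, or the immutable `java.time.format.DateTimeFormatter` (Java 8+). Whether thread safety is actually the cause of the breakage is not confirmed in this thread. `DateFormatSketch`, `formatLegacy`, `formatModern`, and the pattern string are assumptions for illustration, not existing project names.

```scala
import java.text.SimpleDateFormat
import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter
import java.util.{Date, TimeZone}

object DateFormatSketch {
  // Assumed ISO 8601 pattern; the project's own pattern may differ.
  private val Iso8601Pattern = "yyyy-MM-dd'T'HH:mm:ss'Z'"

  // Option 1: each thread lazily gets its own SimpleDateFormat,
  // so no two threads ever touch the same mutable instance.
  private val perThread = new ThreadLocal[SimpleDateFormat] {
    override def initialValue(): SimpleDateFormat = {
      val fmt = new SimpleDateFormat(Iso8601Pattern)
      fmt.setTimeZone(TimeZone.getTimeZone("UTC"))
      fmt
    }
  }

  // Option 2: DateTimeFormatter is immutable, so one instance can be shared.
  private val shared: DateTimeFormatter =
    DateTimeFormatter.ofPattern(Iso8601Pattern).withZone(ZoneOffset.UTC)

  def formatLegacy(epochMillis: Long): String =
    perThread.get().format(new Date(epochMillis))

  def formatModern(epochMillis: Long): String =
    shared.format(Instant.ofEpochMilli(epochMillis))
}
```

Both `DateFormatSketch.formatLegacy(0L)` and `DateFormatSketch.formatModern(0L)` give `1970-01-01T00:00:00Z`. Option 2 is usually the simpler choice where Java 8 is available, since there is no per-thread state to manage.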
No worries @greebie - it may be that it is thread-safe in the current usage, but my Scala-foo is too weak for me to be sure (I'd expect a |
We may be looking at some code refactoring in the summer, so identifying these little details is very useful. Thanks to both of you.
What does this Pull Request do?
Reverts @helgeho's trait work.
I was under the impression that @greebie had tested #175 in a way that, it turns out, it had not been tested before I merged it. Unfortunately, this breaks in a number of ways that can be seen here.
How should this be tested?
This is just rolling back to basically 0.12.2, and it should be working fine. @ianmilligan1 would you mind testing this on one of our small WALK datasets? I was using Ransom Myers, so anything you have lying around should be great. No drop-everything super rush here. Sometime this week if you have time.
Additional Notes:
I'll be putting a note on the 0.13.0 release mentioning that it is broken, and not to use it. We can't remove releases from Maven Central, so I'll just have to cut a new release once this is merged.
Moving forward, I think we need to come up with a set of tests, or extend the existing tests, that can catch how we failed here. I think it would be better to have our unit tests capture the potential traps that happened here rather than a battery of testing on various datasets.
@helgeho moving forward, let me know how you want to proceed with the work done in #175. I think it is a good idea, and really great that our projects can be connected. So, hopefully we can get something going again 😄