Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revert "make ArchiveRecord a trait (#175)" #181

Merged
merged 1 commit into from
Mar 19, 2018
Merged

Revert "make ArchiveRecord a trait (#175)" #181

merged 1 commit into from
Mar 19, 2018

Conversation

ruebot
Copy link
Member

@ruebot ruebot commented Mar 19, 2018

This reverts commit cd0d7b0.

What does this Pull Request do?

Reverts @helgeho's trait work.

I was under the impression that #175 was tested in a way that it was not tested by @greebie before I merged it. Unfortunately this breaks in a number of ways that can be seen here.

How should this be tested?

This is just rolling back to basically 0.12.2, and it should be working fine. @ianmilligan1 would you mind testing this on one of our small WALK datasets? I was using Ransom Myers. So, anything you have laying around should be great. No drop everything super rush here. Sometime this week if you have time.

Additional Notes:

I'll be putting a note on the 0.13.0 release mentioning that is broken, and not to use it. We can't remove releases from Maven Central, so I'll just have to cut a new release once this is merged.

Moving forward, I think we need to come up with a set of tests, or extend the existing tests that that can catch how we failed here. I think it would be better to have our unit tests capture potential traps that happened here rather than a battery of testing on various datasets.

@helgeho moving forward, let me know how you want to proceed with the work done in #175. I think it is a good idea, and really great that our projects can be connected. So, hopefully we can get something going again 😄

@ianmilligan1
Copy link
Member

Sounds good, @ruebot. I’m planning to set auk locally up on my workstation on campus, so I’ll download collections and run jobs overnight/while I’m teaching tonight.

@ruebot
Copy link
Member Author

ruebot commented Mar 19, 2018

@ianmilligan1 thanks!

@codecov
Copy link

codecov bot commented Mar 19, 2018

Codecov Report

Merging #181 into master will increase coverage by 0.05%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #181      +/-   ##
==========================================
+ Coverage   67.39%   67.44%   +0.05%     
==========================================
  Files          33       33              
  Lines         641      639       -2     
  Branches      125      125              
==========================================
- Hits          432      431       -1     
+ Misses        168      167       -1     
  Partials       41       41
Impacted Files Coverage Δ
...rchivesunleashed/spark/matchbox/RecordLoader.scala 90% <100%> (ø) ⬆️
...ivesunleashed/spark/archive/io/ArchiveRecord.scala 80.55% <100%> (+1.6%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 50bcd27...a9014ec. Read the comment docs.

@greebie
Copy link
Contributor

greebie commented Mar 19, 2018

Agreed about the unit testing. I can take a look at that this week. Looks like a problem with making a DateFormat on an empty string.

@ianmilligan1
Copy link
Member

I've now tested successfully on "Artist-run centres," "Planning in theory and practice," "Halifax Explosion," and "Nova Scotia Theater companies" - all working quite nicely.

I can keep testing but I'm pretty sure we're up and running again.

Copy link
Member

@ianmilligan1 ianmilligan1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - reverts cd0d7b0 and functionality is restored on test collections. Thanks @ruebot.

@ianmilligan1 ianmilligan1 merged commit 592f27a into archivesunleashed:master Mar 19, 2018
@ruebot ruebot deleted the revert-cd0d7b03854db80dc67f8076aff44f1b4c2364b2 branch March 20, 2018 04:08
@helgeho
Copy link
Contributor

helgeho commented Mar 20, 2018

I am really sorry about the problems caused by this, but I cannot see how my changes could possibly break anything or cause this error in the attached logs: NumberFormatException: For input string: "". That does not sound very related, but I see that reverting my change seems to fix it, which is really strange. Could anyone of you please investigate this? I would really love to have this a flexible trait again as soon as possible, and that should actually be possible without any issues.

@ianmilligan1
Copy link
Member

I don't know how many spare cycles we have over the next few weeks to intensively test this, @helgeho, although we were able to reproduce this on many WARCs.

My sense of it is that with pristine, perfect WARCs this new approach works fine, but it breaks on the sort of missing fields and other such things that we find in many records.

@helgeho
Copy link
Contributor

helgeho commented Mar 20, 2018

The thing is that there is no new approach, I did not change any logic, I just refactored this one class with a trait, that's a common practice and should not affect anything, so I do not understand what's happening here at all. Could you please share some WARC file that did not work anymore after my change, then I'll try to figure this out...

@ruebot
Copy link
Member Author

ruebot commented Mar 20, 2018

@helgeho it blew up on a bunch of our WALK partner collections on cloud.archivesunleashed.org over the weekend after I put 0.13.0 into production.

The files in the linked gist are these two:

Since we've signed research agreements with these universities we can't just hand over the files. But, I'll reach out to Dalhousie and see if it is ok that I share these two in an effort to figure out what's going on here.

@ianmilligan1
Copy link
Member

From some quick testing, I wonder if it has to do with multiple inputs. Specifying a *.gz when there are multiple WARCs in the directory seems to bring it down immediately, whereas if I pass WARCs one at a time they seem to work.

@greebie
Copy link
Contributor

greebie commented Mar 20, 2018

Looking at it with 20-20 hindsight, moving everything but ISO8601 to a trait somehow meant that we have no DATEFORMAT for more than one WARC. Eventually, we will have a test-case for this.

@helgeho
Copy link
Contributor

helgeho commented Mar 20, 2018

so are you saying that keeping ISO8601 as part of the class (ArchiveRecordImpl) would fix it? @greebie

@greebie
Copy link
Contributor

greebie commented Mar 20, 2018

It might, but I'm not 100% sure. But looking at the errors, ISO8601 seems to be the culprit.

The way to test is to call call RecordLoader.loadArchives (*.warc.gz, {etc}) on a folder with more than one warc.

@greebie
Copy link
Contributor

greebie commented Mar 20, 2018

Since I had the cloned repo on my system I checked. ISO8601 in object ArchiveRecord breaks on ingesting two *.warc.gz. Moving it to ArchiveRecordImpl works.

If you send on the new PR, I'd like @ianmilligan1 or @ruebot to test it this round, because they have the most knowledge of the use cases.

Unfortunately, they may be predisposed for the next bit, so it might take a couple of days or so.

As discussed, having this break turns out to be a good thing, because it's a potential general problematic use case here, which we can test for as per #182.

@anjackson
Copy link
Contributor

I suspect this happened because SimpleDateFormat isn't thread-safe. Not quite sure of the mechanism, but I suspect changing how it's instantiated has changed the context over which the instance is being shared.

Neither of the two implementations is strictly thread-safe at present, AFAICT. You may want to consider instantiating in-line or wrapping it as a ThreadLocal.

@helgeho
Copy link
Contributor

helgeho commented Mar 21, 2018

This is normally nothing we would need to worry about in Spark as every record is only processed by a single thread. However, @anjackson might be right with the SimpleDateFormat, because as long as it is stored in a separate (static) object and in case Spark is configured to run on multiple threads per executor, the same format instance will be accessed from multiple records and this indeed might be an issue. Moving it back to the class as @greebie pointed out should fix this though, as it would then only be used by a one record and one thread respectively. I'll do this later and send a new PR.

@greebie
Copy link
Contributor

greebie commented Mar 21, 2018

Thanks @anjackson. It may be worth seeing what we can do to make the SimpleDateFormat more thread-safe. I'll add the issue.

@anjackson
Copy link
Contributor

No worries @greebie - it may be that it is thread-safe in the current usage, but my Scala-foo is too weak for me to be sure (I'd expect a val to behave the same in both places but I'm aware Scala doesn't always behave as I expect).

@greebie
Copy link
Contributor

greebie commented Mar 21, 2018

We may be looking at some code refactoring in the summer, so identifying these little details is very useful. Thanks to both of you.

This was referenced Mar 27, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants