Docket content overwritten by a more recent docket #2472
@mlissner I've checked that RSS data started to be stored on Jul 23, 2020, and docket uploads on Sep 27, 2017, according to git logs. To make sure we get examples where we can trace docket creation and the subsequent updates, I think it's best to look for examples after Jul 23, 2020. This query can help us retrieve some examples of overwritten dockets after that date:
It works by looking for opinions where their |
Well that took a long time to run, but it only returned a few items, thankfully:
|
Thank you, I'll look at them! |
I've checked the potentially mixed items returned by the script. These are my findings: Most of them don't seem to have a problem. They were flagged by the script because the case name differs between the opinion and the docket in PACER, for instance: https://www.courtlistener.com/admin/search/opinioncluster/7322944/change/ https://www.courtlistener.com/admin/search/opinioncluster/7334818/change/ So cases like these were flagged because the names are not exactly the same, even though the case is the same.
Then, I detected a couple of possible issues in other cases that seem to be actual errors: Borusan Mannesmann Boru Sanayi Ve Ticaret A.S. v. United States. The previous opinion has the following docket linked: Is a different case from the original. We have uploads to confirm the original docket matched the opinion: Like this one uploaded on Dec. 20, 2022, 1:04 p.m. PST The same day and almost at the same time we received 6 more uploads to the same The last upload to this
I found some more scrambled dockets related to this issue; however, it seems this problem was only happening on appellate dockets. The most recent case I could find related to this issue is the following one, which received its last upload on Jan. 5, 2023, 12:02 p.m. PST: https://www.courtlistener.com/docket/64983976/rkw-klerks-inc-v-united-states/ This issue seems related to the intermixed docket uploads bug #305 that @ERosendo solved. As I wrote above, it seems we haven't received more mixed uploads like these recently (after Jan 5), but it might be worth checking whether the problem is solved in the extension.
- There is another problem related to overwritten dockets, and it seems to be the main source of this kind of error right now. It's related to the Harvard opinions importer. Current docket name: Torres v. The Blackstone Group When the Harvard importer imports a new opinion, a new docket is also created based on the opinion data, e.g: The
In this example, it returns: Then when a user uploads a docket or added/updated through the RSS scraper like in this case, the docket is overwritten due to the The
And then the docket is looked at by Since these dockets don't have a So that the docket is overwritten with new data, including a new In brief, the problem here is the method Another example from an overwritten docket:
Some ideas to fix this issue:
The bad news is that none of the problems described above seems to explain the old overwritten dockets from the first comment of this issue: My theory about these is that these dockets were added with a wrong @mlissner let me know what you think. |
Great analysis, thanks. So I think you identified these issues:
Finally, when all of the above is done, I'm still left wondering how we fix the existing problem. How many items are screwed up and how do we fix them? @flooie you'll want to be aware of this issue before you do any more imports. |
Thanks, some comments about:
Yeah, it seems this is still happening. The main problem is that the opinion name from the original source differs from the docket name on PACER, e.g., if the opinion came from Harvard: The opinion name is taken from the If you look for that docket on PACER the docket name is:
Yeah, I think it's worth it to make Documentation says:
Does that mean the
What changes could we introduce to make them different but continue complying with IDB? Could we add letters? I think before doing the docket number conversion we should clean the docket number from opinions, since right now the conversion is based on the raw "Nos. 212-213, Dockets 27264, 27265" This one also seems to be an invalid format: "Nos. C 123-80-123-82" There are other formats found in opinions that seem problematic for choosing the right docket number: "CIVIL ACTION NO. 7:17\u2013CV\u201300426" Do you know which formats we should look at to be sure we only convert valid docket numbers? |
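The cleaning step suggested above could start with something as small as dash normalization, since the last example contains en dashes (\u2013) where a hyphen is expected. This is only a sketch: `clean_docket_number` and the list of dash characters are assumptions for illustration, not existing project code.

```python
import re

# Unicode dash look-alikes to fold into "-" (an assumed, non-exhaustive list).
DASHES = "\u2010\u2011\u2012\u2013\u2014\u2212"
_DASH_TABLE = str.maketrans({ch: "-" for ch in DASHES})


def clean_docket_number(raw: str) -> str:
    """Normalize dashes and collapse whitespace in a raw docket number."""
    cleaned = raw.translate(_DASH_TABLE)
    return re.sub(r"\s+", " ", cleaned).strip()


print(clean_docket_number("CIVIL ACTION NO. 7:17\u2013CV\u201300426"))
# prints: CIVIL ACTION NO. 7:17-CV-00426
```

This only makes the string easier to match afterwards; deciding whether it is a valid single docket number is a separate step.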
OK, so the stuff coming from the Harvard importer is pretty messy because it goes back centuries, before the docket numbers were normalized. All of these kinds of things just shouldn't get a docket_number_core:
We should only be fixing modern bankruptcy and district court entries. That looks like what we're doing more or less, but we just need to tune our regexes to ignore things like the above. This one could get a docket_number_core generated:
About case names differing between opinions and dockets, I'm not sure we can fix it. Let's punt that for another day. |
Sorry, I meant to ask, is this enough for you to run with, Alberto? |
Thanks, so now we should only get a docket_number_core when the opinion docket_number has a known format for a district or bankruptcy docket, and ignore those that don't match a known format, right? A couple of questions:
|
Right.
I think we're not doing this now, right, so let's not start doing it if that's the case.
We should definitely add the docket, but without a docket_number_core. As for cleaning, I'm slightly in favor of cleaning off things like
No, we just add it with multiple docket numbers. Often courts will issue one opinion for multiple cases, and you'll get this. You still need to have the data, even though it doesn't really fit neatly in our model. |
Thanks, working on this. About appellate dockets:
Yeah, we're not generating a When the docket is saved the Also, the
|
I'll get to your latest questions, Alberto, but @flooie and I were talking about this just now and came up with a few more things to think about. First, he pointed out that the messed up dockets should have the source of Harvard (16) + RECAP (1), which would be 17. So we queried all dockets with a
Looking at an example case, it has a source value of 16 (Harvard), even though it clearly has RECAP data. So that's a bug we need to fix:
The source field is effectively a bitmask (converted to integers), but one of its values is wrong, according to this bug: #2473. So we should fix that bug before fixing the issue above. Since we weren't able to look up these cases by the
Honestly, I hope that's all of the ones that we've messed up, but I'm extremely suspicious that there are more (for example, the query I ran up above brought back ~75 results for a single year). We'll need to nail down how many cases are affected by this. If the query above isn't right, what is? |
Got it, I'll look at the results returned above and check if all of them are messed up so that we can determine if the query is correct in order to fix them. I'll check the |
And, just to be clear, there is a mistake in the model: it labels 17 as Harvard and Scraper. That should presumably be fixed before we start making 17s appear |
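To make the arithmetic above concrete, here is a minimal sketch of how a bitmask-style source value combines. Only the values 1 (RECAP) and 16 (Harvard) come from this thread; the `Source` enum, its member names, and `has_source` are hypothetical and not the project's model code.

```python
from enum import IntFlag


class Source(IntFlag):
    # Values taken from the thread; names are illustrative only.
    RECAP = 1
    HARVARD = 16


def has_source(value: int, flag: Source) -> bool:
    """Return True if the given bit is set in a stored source value."""
    return (value & int(flag)) != 0


combined = Source.RECAP | Source.HARVARD
print(int(combined))  # 17

# A docket stored as 16 does not advertise RECAP data, which is why
# querying by source 17 misses dockets like the example above.
print(has_source(16, Source.RECAP))  # False
print(has_source(17, Source.RECAP))  # True
```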
I checked the dockets above in order to find more conflicting docket numbers to test against the new regex to generate the The good news is that not all the results from your query are messed up, just some of them. I've worked on modifying the regex to ignore things like: "Nos. 212-213, Dockets 27264, 27265" So we'll only generate a docket_number_core if it matches the exact format for district or bankruptcy courts and if there is only one docket number in the string. But checking the examples above, I found more cases where I think the above won't be enough to solve the issue. There are messed up cases like these: https://www.courtlistener.com/opinion/7335339/wis-province-jesus-v-cassem/ Docket number on opinion: Both of them can be found on District Court, D. Connecticut PACER
|
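The rule described above (generate a docket_number_core only when the whole string is exactly one docket number in a known format) could be sketched like this. The pattern, the `make_core` name, and the assumed core layout (two-digit year followed by a five-digit zero-padded sequence number) are illustrative assumptions, not the regex actually used in the project, and bankruptcy formats are left out.

```python
import re

# Hypothetical pattern for ONE district-court docket number, such as
# "1:20-cv-01234" or "20-cv-1234". Anchored at both ends so that strings
# with prefixes or several numbers do not match.
DISTRICT_RE = re.compile(
    r"^(?:\d:)?(?P<year>\d{2})-[a-z]{1,5}-(?P<seq>\d{1,5})$",
    re.IGNORECASE,
)


def make_core(docket_number: str) -> str:
    """Return a digits-only core, or "" if the input is not one clean number."""
    match = DISTRICT_RE.match(docket_number.strip())
    if match is None:
        return ""
    # Assumed layout: two-digit year + zero-padded five-digit sequence.
    return match.group("year") + match.group("seq").zfill(5)


for raw in [
    "1:20-cv-01234",
    "Nos. 212-213, Dockets 27264, 27265",
    "Nos. C 123-80-123-82",
    "1:20-cv-01234, 1:20-cv-01235",
]:
    print(repr(raw), "->", repr(make_core(raw)))
```

Being strict here means some valid numbers with prefixes get no core at all, which matches the preference in this thread for adding the docket without a docket_number_core over generating a wrong one.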
We shouldn't have a docket_number_core for dockets that have more than one number in their docket number. This is tricky:
You ask:
And the answer is, "Not really." I think we should keep letters out of the docket_number_core to be consistent with the IDB data. Would it be possible to introduce something that'd prevent these two from blocking, even if their docket_number_core values were the same?
We're not doing this now, right? If that's the case, let's not start. |
Thanks, some comments below:
Got it, I was thinking of an alternative to solve this when
So, should we stop generating |
According to our talk, here is the example I found of an appellate docket that was added from an opinion and is messed up due to duplicated https://www.courtlistener.com/opinion/7020971/in-re-s-o-s-sheet-metal-co/ https://ia803106.us.archive.org/9/items/law.free.cap.f2d.297/32.226451.json The opinion has the following docket_number: We'll avoid generating a However, it seems that This format is pretty similar to the bankruptcy regex: |
I'm sorry, Alberto, I didn't keep notes from our call. Can you lay out the plan of attack as you understand it? I feel disorganized here, but I get the sense that you don't. |
Yes, of course, the plan here is:
|
Yes, that sounds right. Everything else sounds exactly right too. Thank you. |
Ok, #2511 is in the merge queue. Now we need to clean up mis-matched dockets and re-do some docket_number_core values, right? |
There are some opinions whose linked docket seems to have been overwritten by a different docket, since the opinion content doesn't match the docket content, e.g.:
https://www.courtlistener.com/opinion/14405/united-states-v-evans/
https://www.courtlistener.com/docket/7771/jackson-womens-health-v-dobbs/
https://www.courtlistener.com/opinion/1867835/go/
https://www.courtlistener.com/docket/1481313/kimberly-carroll-lamb/
Another hint that the docket was overwritten is that the new docket date_filed is more recent than the date the docket object was created. These dockets were created around 2014, so we don't have the original source of the docket stored and can't trace the problem from when the docket was created to when it was overwritten.
As a next step, we'll try to find a more recent example of this problem where we can trace and identify the cause.
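The date hint described above can be written as a small check. This is a rough sketch, not project code: `DocketRow`, the `date_created` field name, and the sample rows are made up for illustration, and a hit is only a hint that the docket deserves a closer look.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class DocketRow:
    # Hypothetical stand-in for a docket record.
    docket_id: int
    date_created: date            # when the docket object was created (assumed name)
    date_filed: Optional[date]    # filing date currently stored on the docket


def looks_overwritten(docket: DocketRow) -> bool:
    """Flag dockets whose filing date is later than their own creation date."""
    return docket.date_filed is not None and docket.date_filed > docket.date_created


# Made-up rows: the first one is filed years after the object was created.
rows = [
    DocketRow(1, date(2014, 5, 1), date(2018, 3, 19)),
    DocketRow(2, date(2014, 5, 1), date(2009, 11, 2)),
    DocketRow(3, date(2014, 5, 1), None),
]
print([r.docket_id for r in rows if looks_overwritten(r)])  # [1]
```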