Docket content overwritten by a more recent docket #2472

Closed
albertisfu opened this issue Jan 17, 2023 · 22 comments

@albertisfu
Contributor

Some opinions appear to be linked to a docket that has been overwritten by a different docket, because the opinion content doesn't match the docket content, e.g.:

https://www.courtlistener.com/opinion/14405/united-states-v-evans/
https://www.courtlistener.com/docket/7771/jackson-womens-health-v-dobbs/

https://www.courtlistener.com/opinion/1867835/go/
https://www.courtlistener.com/docket/1481313/kimberly-carroll-lamb/

Another hint that the docket was overwritten is that the new docket date_filed is more recent than the date the docket object was created.
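
As a rough illustration of that hint, here is a minimal sketch in plain Python. The function name and the sample values are hypothetical; only the comparison between date_filed and date_created comes from the description above.

```python
from datetime import date, datetime


def filed_after_created(date_filed, date_created):
    """Return True when the filing date is later than the day the row was created.

    A docket row cannot normally exist before its case was filed, so a True
    result suggests the row was later overwritten with another case's data.
    """
    if date_filed is None or date_created is None:
        return False
    return date_filed > date_created.date()


# Hypothetical values: a row created in 2014 that now shows a 2018 filing date.
print(filed_after_created(date(2018, 3, 19), datetime(2014, 10, 30, 12, 0)))
```

This is only a hint, not proof: a True result flags a docket for manual review.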

These dockets were created around 2014, so we don't have the original source of the docket stored and can't trace what happened between the time the docket was created and the time it was overwritten.

As a next step, we'll try to find a more recent example of this problem that we can trace to identify the cause.

@albertisfu albertisfu self-assigned this Jan 17, 2023
@albertisfu
Contributor Author

@mlissner I've checked that RSS data started to be stored on Jul 23, 2020, and Docket uploads on Sep 27, 2017, according to git logs.

So, to make sure we get examples where we can trace docket creation and subsequent updates, I think it's best to look for examples after Jul 23, 2020.

This is the query that can help us to retrieve some examples of overwritten dockets after that date:

from cl.search.models import Docket, OpinionCluster

opinions_to_look = OpinionCluster.objects.filter(date_created__gt="2020-7-23").only("case_name", "docket").select_related("docket")

objects_to_check=[]
for opinion in opinions_to_look:
	if opinion.case_name != opinion.docket.case_name:
		objects_to_check.append(opinion.pk)

print(f"Objects to check: {objects_to_check}")

It works by looking for opinions whose case_name doesn't match the case_name of their linked docket. It returns the pks of the opinions we should look at.

@mlissner
Member

Well that took a long time to run, but it only returned a few items, thankfully:

[4773258, 4834272, 4877997, 4878744, 4889176, 4899595, 5297715, 5301108, 5303630, 5303669, 5311404, 5343680, 6244923, 6458092, 6467467, 6471427, 6471840, 6481440, 6621034, 7020971, 7316711, 7322944, 7323001, 7325959, 7328125, 7329188, 7330351, 7334601, 7334781, 7334818, 7334860, 7334861, 7335339, 7335505, 7335514, 7336088, 7336279, 7336539, 7441142, 7859250, 7859672, 7860589, 7861005, 7891065, 7891066, 7914441, 8243537, 8245363, 8344418, 8439012, 8440650, 8440918, 8442035, 8442264, 8442953, 8443072, 8443266, 8515502, 8515868, 8523653, 8629333, 8689105, 8695437, 8698459, 8698603, 8722127, 9328615, 9353745, 9354581, 9356955, 9357356, 9357825, 9358085, 9358368, 9368256, 9368741]

@albertisfu
Contributor Author

Thank you, I'll look at them!

@albertisfu
Contributor Author

I've checked the potentially mixed-up items returned by the script.

These are my findings:

Most of them don't seem to have a problem. The script flagged them because the case name differs slightly between the opinion and the docket in PACER, for instance:

https://www.courtlistener.com/admin/search/opinioncluster/7322944/change/
Cluster name: In re Johnson & Johnson Talcum Powder Products Marketing, Sales Practices & Products Liability Litigation
Docket name: IN RE: Johnson & Johnson Talcum Powder Products Marketing, Sales Practices and Products Liability Litigation

https://www.courtlistener.com/admin/search/opinioncluster/7334818/change/
Cluster name: In re 3M Combat Arms Earplug Prods. Liab. Litig.
Docket name: IN RE: 3M Combat Arms Earplug Products Liability Litigation

So cases like these were flagged by the script because the names are not exactly the same, even though the case is the same.

  • Is this a problem that we should solve, e.g. by updating the opinion name when we get the docket name from PACER? Or is no action needed here?
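
One way to cut down on these false positives would be to compare normalized names instead of raw strings. The sketch below is an assumption for illustration only: normalize_case_name and same_case_name are hypothetical helpers, not existing CourtListener functions.

```python
import re


def normalize_case_name(name):
    """Lowercase, spell out '&', drop punctuation and collapse whitespace."""
    name = name.lower().replace("&", " and ")
    name = re.sub(r"[^a-z0-9\s]", " ", name)
    return " ".join(name.split())


def same_case_name(cluster_name, docket_name):
    """Compare two case names after normalization."""
    return normalize_case_name(cluster_name) == normalize_case_name(docket_name)


cluster = ("In re Johnson & Johnson Talcum Powder Products Marketing, "
           "Sales Practices & Products Liability Litigation")
docket = ("IN RE: Johnson & Johnson Talcum Powder Products Marketing, "
          "Sales Practices and Products Liability Litigation")
print(same_case_name(cluster, docket))
```

This handles case, punctuation and "&" vs. "and", so the Johnson & Johnson pair compares equal. It does not handle abbreviations such as "Prods. Liab. Litig.", so the 3M pair would still be flagged.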

Then I found a couple of issues in other cases that seem to be actual errors:

Borusan Mannesmann Boru Sanayi Ve Ticaret A.S. v. United States.
(pacer_case_id: 17544)
https://www.courtlistener.com/opinion/6621034/borusan-mannesmann-boru-sanayi-ve-ticaret-as-v-united-states/

The opinion above has the following docket linked:
Tau-Ken Temir LLP v. United States
https://www.courtlistener.com/docket/63566889/tau-ken-temir-llp-v-united-states/

This is a different case from the original.

We have uploads confirming that the original docket matched the opinion, like this one uploaded on Dec. 20, 2022, 1:04 p.m. PST:
[Screenshot: Screen Shot 2023-01-20 at 21 56 15]

The same day, at almost the same time, we received 6 more uploads for the same pacer_case_id 17544, but containing data for other dockets that don't belong to that pacer_case_id, e.g.:
[Screenshot: Screen Shot 2023-01-20 at 22 13 43]

The last upload to this pacer_case_id was on Dec. 30, 2022, 10:38 a.m. PST.
Again, the data doesn't match the original docket. This is the data the docket is currently filled with:
[Screenshot: Screen Shot 2023-01-20 at 22 14 43]

I found some more scrambled dockets related to this issue. However, it seems this problem was only happening on appellate dockets. The most recent case I could find is the following one, which received its last upload on Jan. 5, 2023, 12:02 p.m. PST:

https://www.courtlistener.com/docket/64983976/rkw-klerks-inc-v-united-states/
Original docket: California Steel Industries, Inc. v. United States, 23-1210
It also received uploads for pacer_case_id 17629, which is the correct one for the original case, but containing data from other dockets.

This issue seems related to the intermixed docket uploads bug #305 that @ERosendo solved. As I wrote above, it seems we haven't received more mixed uploads like these recently (after Jan 5), but it might be worth checking whether the problem is solved in the extension.

- There is another problem related to overwritten dockets, and it seems to be the main source of this kind of error right now. It's related to the Harvard opinions importer.
e.g: https://www.courtlistener.com/admin/search/opinioncluster/8722127/change/
Original opinion/docket name: United States v. Schurman

Current docket name: Torres v. The Blackstone Group
https://www.courtlistener.com/docket/65988049/torres-v-the-blackstone-group/

When the Harvard importer imports a new opinion, a new docket is also created based on the opinion data, e.g:
https://ia903104.us.archive.org/8/items/law.free.cap.f-supp.84/411.678854.json

The docket_number is stored as it appears in the opinion: Nos. C 123-80-123-82
But a docket_number_core is also generated using make_docket_number_core:

make_docket_number_core("Nos. C 123-80-123-82")
Out[2]: '2300123'

In this example, it returns: 2300123

Then, when a user uploads a docket, or a docket is added/updated through the RSS scraper as in this case, the existing docket is overwritten because of the matching docket_number_core.

The nysd RSS data on Jan 5 2023 contained:

<title>1:23-cv-00123 Torres v. The Blackstone Group et al</title>
<link>https://ecf.nysd.uscourts.gov/cgi-bin/DktRpt.pl?592021</link>
<description>[Complaint] (&#x3C;a href=&#x22;https://ecf.nysd.uscourts.gov/doc1/127132613310?caseid=592021&#x26;de_seq_num=10&#x22; &#x3E;2&#x3C;/a&#x3E;)</description>
<guid isPermaLink="true">https://ecf.nysd.uscourts.gov/cgi-bin/DktRpt.pl?592021&#x26;10</guid>
<pubDate>Fri, 06 Jan 2023 20:41:51 GMT</pubDate>
</item>

find_docket_object() first generates a docket_number_core:

make_docket_number_core("1:23-cv-00123")
Out[3]: '2300123'

Then the docket is looked up by pacer_case_id and docket_number_core, or by pacer_case_id alone, or by pacer_case_id: None and docket_number_core.

Since these dockets don't have a pacer_case_id, the docket is found by its docket_number_core because, as shown above, make_docket_number_core returns the same docket_number_core for two different cases.

As a result, the docket is overwritten with new data, including a new docket_number and pacer_case_id.
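
The lookup order described above can be sketched over plain dictionaries to show how the third fallback picks up the Harvard docket. This is a simplified stand-in, not the real find_docket_object(): the function name is hypothetical, the rows are dicts rather than Django objects, and any filtering by court is left out.

```python
def find_docket_sketch(dockets, pacer_case_id, docket_number_core):
    """Try the three lookups in order and return the first matching row."""
    lookups = [
        # 1. Both identifiers match.
        lambda d: d["pacer_case_id"] == pacer_case_id
        and d["docket_number_core"] == docket_number_core,
        # 2. The pacer_case_id alone matches.
        lambda d: d["pacer_case_id"] == pacer_case_id,
        # 3. The row has no pacer_case_id, but the core number matches.
        lambda d: d["pacer_case_id"] is None
        and d["docket_number_core"] == docket_number_core,
    ]
    for matches in lookups:
        found = [d for d in dockets if matches(d)]
        if found:
            return found[0]
    return None


# A Harvard-imported docket: no pacer_case_id, core number from "Nos. C 123-80-123-82".
rows = [{"case_name": "United States v. Schurman",
         "pacer_case_id": None,
         "docket_number_core": "2300123"}]

# The RSS item for 1:23-cv-00123 (caseid 592021) produces the same core number.
hit = find_docket_sketch(rows, pacer_case_id="592021", docket_number_core="2300123")
print(hit["case_name"])  # United States v. Schurman
```

The first two lookups miss because the Harvard row has no pacer_case_id, so the third one returns the unrelated docket, which is then updated with the Torres data.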

In brief, the problem here is that make_docket_number_core returns the same docket_number_core for different docket_number values.

Another example from an overwritten docket:

In [2]: make_docket_number_core("Case No. 2:18-CV-137")
Out[2]: '1800137'

In [3]: make_docket_number_core("3:18-mj-00137")
Out[3]: '1800137'
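
To make the collision easier to reason about, here is a simplified sketch of a loose conversion that reproduces the four outputs shown in this comment. The district pattern is an assumption about how a loose matcher could behave, not a copy of the real make_docket_number_core; the second pattern is the bankruptcy regex quoted later in this thread.

```python
import re


def sketch_docket_number_core(docket_number):
    """Distill a docket number to year + zero-padded serial (loose matching)."""
    # Assumed loose district pattern: optional office digit, 2-digit year,
    # ANY two characters in the middle, then the serial number.
    district = re.search(r"(?:\d:)?(\d\d)-..-(\d+)", docket_number)
    if district:
        return f"{district.group(1)}{int(district.group(2)):05d}"
    # Bankruptcy-style fallback: two digits, a dash, then digits.
    fallback = re.search(r"(\d\d)-(\d+)", docket_number)
    if fallback:
        return f"{fallback.group(1)}{int(fallback.group(2)):05d}"
    return ""


for number in ("Nos. C 123-80-123-82", "1:23-cv-00123",
               "Case No. 2:18-CV-137", "3:18-mj-00137"):
    print(number, "->", sketch_docket_number_core(number))
```

Because the middle of the pattern accepts any two characters, "23-80-123" inside "Nos. C 123-80-123-82" is read as year 23, serial 123, which is the same result as for 1:23-cv-00123. The case type (CV vs. mj) is also discarded, which explains the second collision.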

Some ideas to fix this issue:

  • I think it's better to clean the docket_number before storing it when adding a docket from an opinion. So instead of storing the raw string Case No. 2:18-CV-137, we'd store 2:18-CV-137. This could help in some cases where there is a pacer_case_id.
  • The main problem is that make_docket_number_core generates the same output for different inputs. Is it possible to improve it? Can a docket_number_core be found on PACER, or is it just a CL number? If it exists on PACER, I could check what the right output should be when we get a collision.
  • If we can't improve make_docket_number_core, we could change find_docket_object so it doesn't look up by docket_number_core alone when there is no pacer_case_id. However, I'm not sure whether this could affect other lookups.

The bad news is that none of the problems described above seems to explain the old overwritten dockets from the first comment of this issue:
https://www.courtlistener.com/opinion/14405/united-states-v-evans/
https://www.courtlistener.com/opinion/1867835/go/

My theory is that these dockets were added with a wrong pacer_case_id from their original source, or that a different bug was involved. The good news is that this no longer seems to be happening (except for the new bugs described above).

@mlissner let me know what you think.

@mlissner
Member

Great analysis, thanks. So I think you identified these issues:

  • Some dockets have different case names than opinion clusters. I think this is fine, but it might be worth figuring out if it's still happening and fixing it. It's not ideal.

  • Until January 5, the RECAP Extension seems to have uploaded metadata with HTML that didn't match. This affected some appellate cases. @ERosendo, can you see if we did something around January 5 that would have fixed this? If not, I suspect it's still happening and we just don't have an instance of it recently by sheer (bad/good?) luck.

  • make_docket_number_core is too aggressive and is causing matches that it shouldn't, particularly affecting the Harvard Importer (this makes sense since it's adding more data than anywhere else). I think we can probably fix this in a variety of ways:

    1. We should make the conversions less aggressive. Maybe they should only work on known formats.
    2. We should see if making find_docket less aggressive is possible when all we have is the core docket number. I'm not sure this is possible, but it's probably worth it.
    3. If a case has a pacer_docket_number, it probably shouldn't ever get a new one. I'm curious how this would work, but maybe this can serve as an opportunity to throw an error and catch some bugs like this in the future?
    4. We should clean up old core docket numbers after we've made it less aggressive, so that they're not in the DB anymore.

Finally, when all of the above is done, I'm still left wondering how we fix the existing problem. How many items are screwed up and how do we fix them?

@flooie you'll want to be aware of this issue before you do any more imports.

@albertisfu
Contributor Author

albertisfu commented Jan 23, 2023

Thanks, some comments about:

Some dockets have different case names than opinion clusters. I think this is fine, but it might be worth figuring out if it's still happening and fixing it. It's not ideal.

Yeah, it seems this is still happening. The main problem is that the opinion name from the original source differs from the docket name on PACER, e.g. if the opinion came from Harvard:

The opinion name is taken from the name_abbreviation field:
https://ia903002.us.archive.org/27/items/law.free.cap.f3d.772/388.5919699.json
"name_abbreviation": "DeBoer v. Snyder"

If you look for that docket on PACER the docket name is:
April DeBoer, et al v. Richard Snyder, et al

  • One possible solution is to update the opinion name whenever it differs after we receive a docket update from PACER. However, I'm not sure this is the best approach. Additionally, if we update the opinion name from the docket name, we should be sure to solve the problems below first, to avoid updating an opinion with the wrong docket name.

make_docket_number_core is too aggressive and is causing matches that it shouldn't, particularly affecting the Harvard Importer (this makes sense since it's adding more data than anywhere else). I think we can probably fix this in a variety of ways:

We should make the conversions less aggressive. Maybe they should only work on known formats.
We should see if making find_docket less aggressive is possible when all we have is the core docket number. I'm not sure this is possible, but it's probably worth it.
If a case has a pacer_docket_number, it probably shouldn't ever get a new one. I'm curious how this would work, but maybe this can serve as an opportunity to throw an error and catch some bugs like this in the future?
We should clean up old core docket numbers after we've made it less aggressive, so that they're not in the DB anymore.

Yeah, I think it's worth it to make make_docket_number_core less aggressive.
Some questions about it:

Documentation says:

For federal district court dockets, this is the most distilled docket number available. 
In this field, the docket number is stripped down to only the year and serial digits, 
eliminating the office at the beginning, letters in the middle, and the judge at the end. 
Thus, a docket number like 2:07-cv-34911-MJL becomes simply 0734911. This is the 
format that is provided by the IDB and is useful for de-duplication types of activities 
which otherwise get messy. We use a char field here to preserve leading zeros.

Does that mean the generated docket_number_core should match the one in the IDB? That would mean the IDB might also have duplicated docket_number_core values?

E.g., considering this example:
In [2]: make_docket_number_core("2:18-CV-137")
Out[2]: '1800137'

In [3]: make_docket_number_core("3:18-mj-00137")
Out[3]: '1800137'

What changes could we introduce to make them different but continue complying with IDB? Could we add letters?

I think that before doing the docket number conversion we should clean the docket number from opinions, since right now the conversion is based on the raw docket_number, e.g:

"Nos. 212-213, Dockets 27264, 27265"
In the case above, which would be the right docket number, 212-213? I tried to find this docket on PACER but couldn't.

This one also seems to be an invalid format: "Nos. C 123-80-123-82"

There are other formats found in opinions that make it hard to choose the right docket number:

"CIVIL ACTION NO. 7:17\u2013CV\u201300426"
"docket_number": "Nos. 14-13542, 14-13657, 15-10967, 15-11166"

Do you know which known formats we should look for, so we can be sure we only convert valid docket numbers?

@mlissner
Member

OK, so the stuff coming from the Harvard importer is pretty messy because it goes back centuries, before the docket numbers were normalized. All of these kinds of things just shouldn't get a docket_number_core:

  • "Nos. 212-213, Dockets 27264, 27265" (too weird, more than one number)
  • "Nos. C 123-80-123-82" (too weird)
  • Nos. 14-13542, 14-13657, 15-10967, 15-11166 (too many numbers)

We should only be fixing modern bankruptcy and district court entries. That looks like what we're doing more or less, but we just need to tune our regexes to ignore things like the above.

This one could get a docket_number_core generated:

  • "CIVIL ACTION NO. 7:17\u2013CV\u201300426" (\u2013 is an endash, which we normalize in normalize_dashes; with that done we get CIVIL ACTION NO. 7:17-CV-00426, which is just 1700426).
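
A stricter conversion along these lines could look like the sketch below. It is only an illustration under stated assumptions: the function name, the exact pattern, and the limit of two to four letters for the case type are all hypothetical, only the district format is covered for brevity, and normalize_dashes is approximated here with a small translation table.

```python
import re

# Map common Unicode dashes to a plain hyphen (stand-in for normalize_dashes).
DASHES = dict.fromkeys(map(ord, "\u2010\u2011\u2012\u2013\u2014\u2015"), "-")

# Assumed district format: optional office digit, 2-digit year, letters, serial.
DISTRICT = re.compile(
    r"(?<![\w-])(?:\d:)?(\d{2})-([A-Za-z]{2,4})-(\d{1,5})(?!\d|-\d)"
)


def strict_docket_number_core(docket_number):
    """Return a core number only for exactly one well-formed district number."""
    cleaned = docket_number.translate(DASHES)
    matches = DISTRICT.findall(cleaned)
    if len(matches) != 1:
        return ""  # nothing recognizable, or more than one docket number
    year, _case_type, serial = matches[0]
    return f"{year}{int(serial):05d}"


for number in ("Nos. 212-213, Dockets 27264, 27265",
               "Nos. C 123-80-123-82",
               "Nos. 14-13542, 14-13657, 15-10967, 15-11166",
               "CIVIL ACTION NO. 7:17\u2013CV\u201300426"):
    print(repr(strict_docket_number_core(number)))
```

With these assumptions the first three inputs produce an empty string and the last one produces 1700426. Requiring letters in the middle is what keeps "123-80-123" from being read as a district number.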

About case names differing between opinions and dockets, I'm not sure we can fix it. Let's punt that for another day.

@mlissner
Member

Sorry, I meant to ask, is this enough for you to run with, Alberto?

@albertisfu
Contributor Author

albertisfu commented Jan 25, 2023

Thanks, so now we should only generate a docket_number_core when the opinion docket_number has a known format for a district or bankruptcy docket, and ignore those that don't match a known format, right?

A couple of questions:

  • For opinions that belong to an appellate court, should we avoid generating a docket_number_core when adding the docket? Or is there a different format for appellate courts?

  • About weird docket numbers like Nos. C 123-80-123-82 (not a known format): we won't generate a docket_number_core, but should we still create the docket using this docket_number? If so, should we add it as is (Nos. C 123-80-123-82) or clean it first (removing the Nos. C) so the docket_number is 123-80-123-82?

  • When an opinion has more than one docket number, like in the examples above, we can't choose a number and there is no pacer_case_id either. Should we avoid adding the docket?

@mlissner
Member

Thanks, so now we should only generate a docket_number_core when the opinion docket_number has a known format for a district or bankruptcy docket, and ignore those that don't match a known format, right?

Right.

* For opinions that belong to an appellate court, should we avoid generating a `docket_number_core` when adding the docket? Or is there a different format for appellate courts?

I think we're not doing this now, right, so let's not start doing it if that's the case.

* About weird docket numbers like `Nos. C 123-80-123-82` (not a known format): we won't generate a `docket_number_core`, but should we still create the docket using this docket_number? If so, should we add it as is (`Nos. C 123-80-123-82`) or clean it first (removing the `Nos. C`) so the docket_number is `123-80-123-82`?

We should definitely add the docket, but without a docket_number_core. As for cleaning, I'm slightly in favor of cleaning off things like No. and Nos., but even that can go wrong, so I'm fine with leaving them. I would definitely not clean off things like C. Those mean something to somebody somewhere.

* When in an opinion there is more than one docket number, like in the examples above, since we can't choose a number and since there is no `pacer_case_id` either, should we avoid adding the docket?

No, we just add it with multiple docket numbers. Often courts will issue one opinion for multiple cases, and you'll get this. You still need to have the data, even though it doesn't really fit neatly in our model.

@albertisfu
Contributor Author

albertisfu commented Jan 27, 2023

Thanks, working on this.

About appellate dockets:

I think we're not doing this now, right, so let's not start doing it if that's the case.

Yeah, we're not generating a docket_number_core directly when adding opinions (for any court type).
But I found appellate dockets added from an opinion that have a docket_number_core, e.g.:
https://www.courtlistener.com/admin/search/docket/65964068/change/

When the docket is saved, the docket_number_core is generated and saved for all dockets, including appellate ones, because the Docket model has a custom save method that does it.

Also, the docket_number_core is generated when the docket is updated from other sources (like RSS or RECAP): the find_docket method generates a docket_number_core for all dockets, including appellate dockets, and this number is used in the lookup.

  • So we shouldn't use docket_number_core to look up appellate dockets, since we don't have a format for appellate courts and the conversion is not reliable, right?
  • And we should avoid generating and saving the docket_number_core for all appellate dockets, right?

@mlissner
Member

mlissner commented Jan 27, 2023

I'll get to your latest questions, Alberto, but @flooie and I were talking about this just now and came up with a few more things to think about. First, he pointed out that the messed up dockets should have the source of Harvard (16) + RECAP (1), which would be 17. So we queried all dockets with a source of 17. We got none:

In [2]: ds = Docket.objects.filter(source=17)

In [3]: ds.count()
Out[3]: 0

Looking at an example case, it has a source value of 16 (Harvard), even though it clearly has RECAP data. So that's a bug we need to fix:

  • Make sure that we add RECAP as a source for dockets via the add_recap_source method.

The source field is effectively a bitmask (converted to integers), but one of its values is wrong, according to this bug: #2473. So we should fix that bug before fixing the issue above.
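
To illustrate the bitmask idea, here is a minimal sketch using only the two values mentioned in this thread (RECAP = 1, Harvard = 16). The helper names are hypothetical and this is not the add_recap_source method itself.

```python
RECAP = 1     # value mentioned in this thread
HARVARD = 16  # value mentioned in this thread


def add_source(current, flag):
    """Set a source bit; unlike plain addition, applying it twice is harmless."""
    return current | flag


def has_source(current, flag):
    """Check whether a source bit is set."""
    return (current & flag) == flag


source = add_source(HARVARD, RECAP)
print(source)                     # 17
print(add_source(source, RECAP))  # still 17
print(has_source(HARVARD, RECAP)) # False
```

Using a bitwise OR rather than adding 1 means a docket that already has the RECAP bit keeps the same value when RECAP data arrives again.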

Since we weren't able to look up these cases by the source field alone, we tried another approach: looking them up by source + a non-null pacer_case_id field. That brought back 39 results:

In [4]: ds = Docket.objects.filter(source=16, pacer_case_id__isnull=False)

In [5]: ds.count()
Out[5]: 39

In [6]: for d in ds:
   ...:     print(f'https://www.courtlistener.com{d.get_absolute_url()}')
   ...: 
https://www.courtlistener.com/docket/64323826/united-states-v-application-for-order/
https://www.courtlistener.com/docket/64323992/in-re-attorney-admissions/
https://www.courtlistener.com/docket/65960899/western-heritage-insurance-com-v-dennis-montana/
https://www.courtlistener.com/docket/64324001/united-states-v-search-warrant/
https://www.courtlistener.com/docket/65659484/american-atheists-v-duncan/
https://www.courtlistener.com/docket/64314442/1-16-682-1-torres-hernandez-br-bfont-colorredcase-created-for-the/
https://www.courtlistener.com/docket/64323347/united-states-v-search-warrant/
https://www.courtlistener.com/docket/64305194/united-states-v-becerra/
https://www.courtlistener.com/docket/64323087/in-re-attorney-admissions/
https://www.courtlistener.com/docket/65785327/baab-steel-inc/
https://www.courtlistener.com/docket/65523353/united-states-v-search-warrant/
https://www.courtlistener.com/docket/64324767/united-states-v-tracking-warrant/
https://www.courtlistener.com/docket/64009291/united-states-v-oppedisano/
https://www.courtlistener.com/docket/65660895/april-deboer-v-richard-snyder/
https://www.courtlistener.com/docket/64324575/united-states-v-search-warrant/
https://www.courtlistener.com/docket/64311484/united-states-v-search-warrant/
https://www.courtlistener.com/docket/64323267/in-re-attorney-admissions/
https://www.courtlistener.com/docket/64317672/united-states-v-sealed-search-warrant/
https://www.courtlistener.com/docket/64304210/united-states-v-sealed/
https://www.courtlistener.com/docket/64180258/thompson-v-mdoc/
https://www.courtlistener.com/docket/65661742/sidney-reid-v-unilever-united-states-inc/
https://www.courtlistener.com/docket/64318835/in-re-attorney-admissions/
https://www.courtlistener.com/docket/64325029/in-re-attorney-admissions/
https://www.courtlistener.com/docket/65008271/michael-newdow-v-john-roberts-jr/
https://www.courtlistener.com/docket/65662052/sara-lowry-v-city-of-san-diego/
https://www.courtlistener.com/docket/65954557/zachary-v-southwest-airlines/
https://www.courtlistener.com/docket/65661187/united-states-v-mccarthy-ganias/
https://www.courtlistener.com/docket/66289095/guillermo-gomez-sanchez-v-jefferson-sessions-iii/
https://www.courtlistener.com/docket/65657044/planned-parenthood-v-american-coalition/
https://www.courtlistener.com/docket/65964068/fo2go-llc-v-pinterest-inc/
https://www.courtlistener.com/docket/64323304/in-re-3m-combat-arms-earplug-products-liability-litigation/
https://www.courtlistener.com/docket/65659196/john-balentine-v-william-stephens-director/
https://www.courtlistener.com/docket/65961908/siddiqua-v-new-york-state-department-of-h/
https://www.courtlistener.com/docket/65894702/stm-networks-v-clay-pacific-srl/
https://www.courtlistener.com/docket/65662493/united-states-v-in-the-matter-of-a-warrant-to/
https://www.courtlistener.com/docket/64323346/united-states-v-tracking-warrant/
https://www.courtlistener.com/docket/65988049/torres-v-the-blackstone-group/
https://www.courtlistener.com/docket/65963924/dongkuk-international-inc-v-doj/
https://www.courtlistener.com/docket/65662183/national-association-of-crimi-v-us-department-of-justice/

Honestly, I hope that's all of the ones that we've messed up, but I'm extremely suspicious that there are more (for example, the query I ran up above brought back ~75 results for a single year).

We'll need to nail down how many cases are affected by this. If the query above isn't right, what is?

@albertisfu
Contributor Author

Got it, I'll look at the results returned above and check if all of them are messed up so that we can determine if the query is correct in order to fix them.

I'll check the source bug and solve it, so that dockets added by Harvard and updated by RECAP have source 17.

@flooie
Contributor

flooie commented Jan 27, 2023

And, just to be clear, there is a mistake in the model: it labels 17 as Harvard and Scraper. That should presumably be fixed before we start making 17s appear.

@albertisfu
Contributor Author

I checked the dockets above to find more conflicting docket numbers to test against the new regex that generates the docket_number_core.

The good news is that not all the results from your query are messed up, just some of them.

I've worked on modifying the regex to ignore things like:

"Nos. 212-213, Dockets 27264, 27265"
"Nos. C 123-80-123-82"
"Nos. 14-13542, 14-13657, 15-10967, 15-11166"
"CIVIL ACTION NO. 7:17-CV-00426, 7:17-CV-00427"

So we'll only generate a docket_number_core if the string matches the exact format for district or bankruptcy courts and contains only one docket number.

But checking the examples above, I found more cases where I think this won't be enough to solve the issue.

There are messed up cases like these:

https://www.courtlistener.com/opinion/7335339/wis-province-jesus-v-cassem/
https://www.courtlistener.com/docket/64323826/united-states-v-application-for-order/

Docket number on opinion: No. 3:17-CV-01477 (VLB)
Docket number in the overwritten docket: 3:17-mj-01477

Both of them can be found on PACER for District Court, D. Connecticut.
The docket_number_core generated for both is: 1701477

  • In this case, to solve the problem, my question is: is it possible to include the letters in the docket_number_core?
    So that we can have something like: 170cv1477 or 170mj1477

    There are more cases like these:

    District Court, S.D. Texas
    Docket number on docket: 4:18-mc-00463
    Docket number from opinion: No. 5:18-CR-463
    Docket number core: 1800463

    District Court, S.D. Texas
    Docket number on docket: 1:16-mc-01008
    Docket number from opinion: CASE NO. 4:16-CV-1008
    Docket number core: 1601008

    District Court, S.D. Texas
    Docket number on docket: 4:13-cr-00599
    Docket number from opinion: Civil Action No. 4:13-CV-599
    Docket number core: 1300599

    District Court, S.D. Texas
    Docket number on docket: 4:18-mc-00059
    Docket number from opinion: CIVIL ACTION NO. 5:18-CV-59
    Docket number core: 1800059

    There is this one related to an appellate docket:
    https://www.courtlistener.com/opinion/8695437/western-heritage-insurance-v-montana/
    https://www.courtlistener.com/docket/65960899/western-heritage-insurance-com-v-dennis-montana/

    This one is not messed up, but it caught my attention because the docket number in the opinion is:
    "Nos. 14-13542, 14-13657, 15-10967, 15-11166" (all of these numbers are valid dockets on PACER).
    Currently, when an opinion is added, the docket number is stored as it appears in the opinion, with multiple docket numbers.

    But when the docket is saved, the docket_number_core is generated, and since the current method doesn't ignore multiple docket numbers, it generates the docket_number_core from the first match.

    The docket is then found via that core number by other sources, like the scraper or RECAP, and updated.

  • So my question here is: when adding a docket from an opinion with multiple docket numbers, with the new approach we'll avoid generating a docket_number_core for it, so this docket will never be updated or have docket entries. Is that correct?

  • About the query and the process to fix messed-up opinions and dockets: I think your query is correct for finding potentially overwritten dockets. I have some ideas to complete the script and fix them, but we would first need to solve the issue of colliding core numbers.

  • And finally, I just want to know whether appellate dockets should have a docket_number_core.

@mlissner
Member

mlissner commented Jan 30, 2023

So my question here is: when adding a docket from an opinion with multiple docket numbers, with the new approach we'll avoid generating a docket_number_core for it, so this docket will never be updated or have docket entries. Is that correct?

We shouldn't have a docket_number_core for dockets that have more than one number in their docket number.

This is tricky:

Docket number on opinion: No. 3:17-CV-01477 (VLB)
Docket number in the overwritten docket: 3:17-mj-01477

You ask:

is it possible to include the letters in the docket_number_core?

And the answer is, "Not really." I think we should keep letters out of the docket_number_core to be consistent with the IDB data. Would it be possible to introduce something that'd prevent these two from blocking, even if their docket_number_core values were the same?

[Should] appellate dockets have a docket_number_core?

We're not doing this now, right? If that's the case, let's not start.

@albertisfu
Contributor Author

Thanks, some comments below:

And the answer is, "Not really." I think we should keep letters out of the docket_number_core to be consistent with the IDB data. Would it be possible to introduce something that'd prevent these two from blocking, even if their docket_number_core values were the same?

Got it. I was thinking of an alternative to avoid overwriting dockets when the docket_number_core values are the same: if a docket doesn't have a pacer_case_id, instead of using the docket_number_core to look for the docket, we could use the docket_number after cleaning prefixes such as the CIVIL ACTION NO in CIVIL ACTION NO 7:17-CV-00426 (if we decide to keep storing things like these), so we can compare it to the actual docket_number.
Do you see any problem with this approach?

We're not doing this now, right? If that's the case, let's not start.
Well, it seems this is happening now: if you look at any docket from an appellate court, you can see it has a docket_number_core generated (e.g)

So, should we stop generating docket_number_core values for appellate dockets? And I think we should set the existing ones to blank using an UPDATE?

@albertisfu
Contributor Author

Following up on our call, here is the example I found of an appellate docket that was added from an opinion and is messed up due to a duplicated docket_number_core:

https://www.courtlistener.com/opinion/7020971/in-re-s-o-s-sheet-metal-co/
https://www.courtlistener.com/docket/64009291/united-states-v-oppedisano/

https://ia803106.us.archive.org/9/items/law.free.cap.f2d.297/32.226451.json

The opinion has the following docket_number: "Nos. 212-213, Dockets 27264, 27265"

We'll avoid generating a docket_number_core when there is more than one docket_number. However, in this case the numbers 27264 and 27265 won't match any of our regexes, so a docket_number_core would still be returned for 212-213: 12000213

However, it seems that 212-213 is not a valid appellate docket_number. When trying to find this docket in PACER, it says the format must be yy-nnnn or yy-nnnnn.

This format is pretty similar to the bankruptcy regex, (\d\d)-(\d+), which is currently the one matching appellate docket numbers. Maybe we could keep using it and just constrain the regex to match only these formats (always starting with two digits), so we can continue generating a docket_number_core for appellate dockets.
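
A constrained version of that regex might look like the following sketch. The pattern and function name are assumptions for illustration: it accepts only a standalone yy-nnnn or yy-nnnnn number and gives up when the string contains more than one of them.

```python
import re

# Assumed constraint: exactly two digits, a dash, then four or five digits,
# not glued to other digits, letters or dashes on either side.
APPELLATE = re.compile(r"(?<![\w-])(\d{2})-(\d{4,5})(?![\w-])")


def appellate_docket_number_core(docket_number):
    """Return a core number for a single yy-nnnn / yy-nnnnn docket number."""
    matches = APPELLATE.findall(docket_number)
    if len(matches) != 1:
        return ""
    year, serial = matches[0]
    return f"{year}{int(serial):05d}"


print(repr(appellate_docket_number_core("Nos. 212-213, Dockets 27264, 27265")))
print(repr(appellate_docket_number_core("23-1210")))
print(repr(appellate_docket_number_core("Nos. 14-13542, 14-13657, 15-10967, 15-11166")))
```

Here 212-213 is rejected because the year part is not a standalone two-digit group and the serial has only three digits, 23-1210 yields 2301210, and the multi-number string yields nothing.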

@mlissner
Member

mlissner commented Feb 1, 2023

I'm sorry, Alberto, I didn't keep notes from our call. Can you lay out the plan of attack as you understand it? I feel disorganized here, but I get the sense that you don't.

@albertisfu
Contributor Author

Yes, of course, the plan here is:

  • We tune the regexes that generate the docket_number_core so that we avoid generating it for weird numbers and when there is more than one docket_number in an opinion. This will solve the overwriting problem that occurs when a docket_number_core from a weird number matches a good one.

  • The point above won't solve the problem when good docket_numbers generate the same docket_number_core and both belong to the same court, like 3:17-cv-01477 and 3:17-mj-01477. To solve this, we talked about doing an additional check when find_docket returns a docket_number_core match without a pacer_case_id, to confirm the docket is the same. So I'm working on an approach that uses the docket_number (after cleaning it) to check that they match, so we avoid overwriting the docket if it's not the same.

  • We still had to decide whether to keep docket_number_core for appellate dockets. We currently use it to look up appellate dockets too, so it's useful and seems to work well. The only worry was whether two or more appellate docket_numbers could generate the same docket_number_core, so I shared the example above that might be a problem: if 212-213 were a valid appellate docket_number, it could generate the same core number as 12-213 (with the current regex). Appellate PACER says valid docket numbers are yy-nnnn or yy-nnnnn, so I think we'll be good if we just ignore things that don't match this format and continue generating core numbers for appellate dockets, right?
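
The additional check from the second point could be sketched as follows. This is an assumption about one possible shape of that check, not the code that was merged: the helper names are hypothetical, only district-style numbers are parsed, and the office digit is compared only when both numbers have one.

```python
import re

# Assumed district shape: optional office digit, year, case type, serial.
PARTS = re.compile(r"(?:(\d):)?(\d{2})-([A-Za-z]{2,4})-(\d{1,5})")


def parse_docket_number(docket_number):
    """Split a district docket number into (office, year, case type, serial)."""
    m = PARTS.search(docket_number.replace("\u2013", "-"))
    if m is None:
        return None
    office, year, case_type, serial = m.groups()
    return office, year, case_type.lower(), int(serial)


def same_docket(number_a, number_b):
    """Confirm a core-number match by comparing the cleaned docket numbers."""
    a, b = parse_docket_number(number_a), parse_docket_number(number_b)
    if a is None or b is None:
        return False  # can't confirm, so don't treat it as the same docket
    if a[0] and b[0] and a[0] != b[0]:
        return False  # both have an office digit and they differ
    return a[1:] == b[1:]


print(same_docket("No. 3:17-CV-01477 (VLB)", "3:17-cv-01477"))  # True
print(same_docket("No. 3:17-CV-01477 (VLB)", "3:17-mj-01477"))  # False
```

Returning False when either number can't be parsed is the conservative choice here: a new docket gets created instead of an existing one being overwritten.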

@mlissner
Member

mlissner commented Feb 1, 2023

I think we'll be good if we just ignore things that don't match this format and continue generating core numbers for appellate dockets, right?

Yes, that sounds right.

Everything else sounds exactly right too. Thank you.

@mlissner
Member

Ok, #2511 is in the merge queue. Now we need to clean up mis-matched dockets and re-do some docket_number_core values, right?
