Generate crosswalk and import cap pdf #4442

Merged

jtmst
Collaborator

@jtmst jtmst commented Sep 11, 2024

Import Harvard Case Law Access Project (CAP) PDFs to CourtListener

Issues Addressed:

Changes Implemented:

  1. Created a management command to generate crosswalk files between CAP and CL data.
  2. Developed a management command to import CAP PDFs to CL using the generated crosswalk.
  3. Implemented S3 storage integration for PDF storage (configurable for local storage in development).
  4. Added error handling and logging for better debugging and monitoring.
  5. Added settings for the CAP R2 environment variables.
  6. Added tests for both commands.
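
As a rough illustration of item 1, a crosswalk can be thought of as a list of records that pair a CAP case with a CL cluster, written to one JSON file per reporter volume. The sketch below is a stand-alone approximation, not the PR's code: the field names (`cap_case_id`, `cl_cluster_id`), the file naming scheme, and both helper functions are assumptions made for this example.

```python
import json
import re
from pathlib import Path


def crosswalk_filename(reporter: str, volume: int) -> str:
    """Build a filesystem-safe file name for one reporter volume (hypothetical scheme)."""
    # Collapse anything that is not a letter or digit into an underscore.
    slug = re.sub(r"[^A-Za-z0-9]+", "_", reporter).strip("_")
    return f"{slug}_{volume}.json"


def write_crosswalk(entries: list[dict], reporter: str, volume: int, out_dir: Path) -> Path:
    """Write the crosswalk entries for one volume as a JSON array and return the path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / crosswalk_filename(reporter, volume)
    path.write_text(json.dumps(entries, indent=2), encoding="utf-8")
    return path
```

Keeping one file per volume would make it possible to regenerate or inspect a single volume without touching the rest, which is one reason a layout like this might be chosen.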

Testing Instructions:

Prerequisites:

  • Ensure you have the necessary environment variables set for R2 and S3 access.
  • For local testing, configure storage to use the local file system instead of S3.
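
Before running either command, it can help to confirm that the environment is complete. The following is a minimal sketch of such a check. The variable names in `REQUIRED_R2_VARS` and the `DEVELOPMENT` flag are placeholders invented for this example, since the actual names are not listed in this conversation; substitute the ones defined in the project's settings.

```python
import os
from typing import Mapping

# Hypothetical variable names, chosen only for this example.
REQUIRED_R2_VARS = (
    "CAP_R2_ENDPOINT_URL",
    "CAP_R2_ACCESS_KEY_ID",
    "CAP_R2_SECRET_ACCESS_KEY",
    "CAP_R2_BUCKET_NAME",
)


def missing_r2_vars(env: Mapping[str, str]) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_R2_VARS if not env.get(name)]


def storage_backend(env: Mapping[str, str]) -> str:
    """Pick 'local' when the (hypothetical) DEVELOPMENT flag is truthy, else 's3'."""
    flag = env.get("DEVELOPMENT", "").strip().lower()
    return "local" if flag in {"1", "true", "yes"} else "s3"


if __name__ == "__main__":
    missing = missing_r2_vars(os.environ)
    if missing:
        print("Missing variables:", ", ".join(missing))
    print("Storage backend:", storage_backend(os.environ))
```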

Steps:

  1. Generate the crosswalk:

     docker exec -it cl-django python /opt/courtlistener/manage.py generate_cap_crosswalk
    

    This command will create crosswalk files in cl/search/crosswalks/.

  2. Import CAP PDFs:

     docker exec -it cl-django python /opt/courtlistener/manage.py import_harvard_pdfs
    

    This command will use the generated crosswalk to fetch and store PDFs.

  3. To test with specific parameters:

     docker exec -it cl-django python /opt/courtlistener/manage.py generate_cap_crosswalk --reporter "A.2d" --volume 100
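
Django management commands declare their options on an `argparse` parser, so the `--reporter` and `--volume` flags used above can be sketched with plain `argparse`. This is an illustration only: the help strings, the defaults, and the choice to parse `--volume` as an integer are assumptions, and the PR's command may define them differently.

```python
import argparse
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Declare the two optional filters shown in the testing steps."""
    parser = argparse.ArgumentParser(prog="generate_cap_crosswalk")
    parser.add_argument(
        "--reporter",
        type=str,
        default=None,
        help="Limit the run to one reporter, e.g. 'A.2d' (assumed semantics).",
    )
    parser.add_argument(
        "--volume",
        type=int,  # Assumption: volumes are treated as integers here.
        default=None,
        help="Limit the run to one volume of the reporter (assumed semantics).",
    )
    return parser


def parse_options(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse an explicit argument list, or sys.argv when argv is None."""
    return build_parser().parse_args(argv)


if __name__ == "__main__":
    options = parse_options()
    print(options.reporter, options.volume)
```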
    
    

Verification:

  • Check the cl/search/crosswalks/ directory for generated crosswalk files.
  • Verify that PDFs are stored either in S3 or local storage (based on configuration).
  • Examine the logs for any errors or warnings during the process.
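
The first verification step can be partly automated. The sketch below counts the entries in each generated file, assuming that every crosswalk file is a JSON array with a `.json` extension; that layout is an assumption of this example and should be checked against the actual output.

```python
import json
from pathlib import Path


def summarize_crosswalks(crosswalk_dir: Path) -> dict[str, int]:
    """Map each crosswalk file name to the number of entries it contains."""
    summary: dict[str, int] = {}
    if not crosswalk_dir.is_dir():
        return summary
    for path in sorted(crosswalk_dir.glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            data = json.load(handle)
        if not isinstance(data, list):
            # Assumption: a crosswalk file holds a JSON array of entries.
            raise ValueError(f"{path.name} does not contain a JSON array")
        summary[path.name] = len(data)
    return summary


if __name__ == "__main__":
    for name, count in summarize_crosswalks(Path("cl/search/crosswalks")).items():
        print(f"{name}: {count} entries")
```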

Screenshots:

CAP Crosswalk File (Sample Data):

Crosswalk File Sample

Imported PDFs (Local Storage):

Imported PDFs

@FRodriguez18 FRodriguez18 left a comment

Looks good! Didn't approve it because I think John proposed some changes, but this looks good to me 👍

cl/search/tests/test_generate_cap_crosswalk.py (outdated, resolved)
@jtmst jtmst requested review from FRodriguez18 and mlissner and removed request for FRodriguez18 September 16, 2024 16:34
@jtmst jtmst marked this pull request as ready for review September 16, 2024 16:35
@mlissner
Member

@flooie I just put this on your backlog to prioritize. The background is that we want to have the Harvard PDFs imported into CL and we want to do so regularly. Grab me when you have a sec, and I can give the details before you review or assign to somebody else to review.

Member

@quevon24 quevon24 left a comment

In general, the functions need type annotations, and their docstrings should be updated to follow the format used in CourtListener.

cl/search/management/commands/generate_cap_crosswalk.py (outdated, resolved)
cl/search/management/commands/generate_cap_crosswalk.py (outdated, resolved)
cl/search/management/commands/import_harvard_pdfs.py (outdated, resolved)
cl/search/tests/test_import_harvard_pdfs.py (outdated, resolved)
@jtmst jtmst requested a review from quevon24 September 23, 2024 20:39
Member

@quevon24 quevon24 left a comment

Everything looks good, and you've addressed the comments I left. One function is still missing its docstring, and my comment about the find_matching_case() function is still pending.

Once those two points are resolved, I think it will be ready.

f"Single match found: {matched_case.case_name} (CL ID: {matched_case.id})"
)
return matched_case
elif match_count > 1:
Member

What would happen if there were several cases on the same page (with the same citation)? Wouldn't the PDF always be added to the same cluster (the first one that matches the citation)?

Collaborator Author

In this case I suppose that would be the result, but I was under the impression that when this occurs the citation would be differentiated by page number as "100 A.2d 36a" or "100 A.2d 36b". Is this not the case?

If they were the exact same citation starting on the same page then there would be multiple matches and a logged warning with the citations.

If this is more common than I thought, I'll need to find a way to be more specific. How would CL differentiate between the two in this case? By docket number?
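
To make the three outcomes discussed in this thread concrete (no match, exactly one match, several clusters sharing a citation), here is a small in-memory sketch. It is not the PR's `find_matching_case()`: the `Cluster` dataclass, the function name, and the decision to return `None` for ambiguous citations are assumptions chosen for illustration.

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger(__name__)


@dataclass
class Cluster:
    """Stand-in for a CL opinion cluster (hypothetical, for this sketch only)."""
    id: int
    case_name: str
    citation: str


def find_matching_case_sketch(citation: str, clusters: list[Cluster]) -> Optional[Cluster]:
    """Return the single cluster with this citation, or None if absent or ambiguous."""
    matches = [cluster for cluster in clusters if cluster.citation == citation]
    if len(matches) == 1:
        matched_case = matches[0]
        logger.info(
            "Single match found: %s (CL ID: %s)", matched_case.case_name, matched_case.id
        )
        return matched_case
    if len(matches) > 1:
        # Several cases share the citation (for example, short opinions on one page).
        # This sketch refuses to guess instead of attaching the PDF to the first match.
        logger.warning(
            "Multiple matches for %s: CL IDs %s", citation, [c.id for c in matches]
        )
        return None
    logger.warning("No match found for %s", citation)
    return None
```

Returning `None` in the ambiguous case trades coverage for safety: fewer PDFs are attached automatically, but none are attached to the wrong cluster. Matching on a unique identifier instead of the citation avoids the ambiguity altogether.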

Collaborator Author

@jtmst jtmst Sep 30, 2024

Addressed in this comment below: #4442 (comment)

cl/search/management/commands/generate_cap_crosswalk.py (outdated, resolved)
Member

@quevon24 quevon24 left a comment

Everything looks very good. Using the CAP ID greatly simplified the matching. I have no more comments.

@mlissner
Member

mlissner commented Oct 1, 2024

I think we're just waiting for @flooie's final review here, but he's out sick today. Hopefully we'll get this merged shortly. Sorry for the delay.

Contributor

@flooie flooie left a comment

This looks good to me. Thanks for taking the time to rewrite the matching code.

Member

@mlissner mlissner left a comment

Sorry to chime in a bit late. I came in here to see if there were any DB migrations before merging, but I noticed a couple of last minute things I think we should tweak.

@flooie or @quevon24, do you think we could make these changes so Josh can consider his part of this done?

cl/search/crosswalks/example_crosswalk.json (outdated, resolved)
cl/search/management/commands/generate_cap_crosswalk.py (outdated, resolved)
cl/settings/third_party/r2.py (outdated, resolved)
@flooie
Contributor

flooie commented Oct 3, 2024

@mlissner Thanks for your suggestions. The failing test passed on a re-run and appears to be related to an error in the meeting I missed.

@mlissner mlissner merged commit ea77adb into freelawproject:main Oct 3, 2024
9 checks passed
@mlissner
Member

mlissner commented Oct 3, 2024

Woot! We're merged. @quevon24, do you want to file an issue to get this deployed and work with Ramiro on landing the changes?

@quevon24
Member

quevon24 commented Oct 3, 2024

> Woot! We're merged. @quevon24, do you want to file an issue to get this deployed and work with Ramiro on landing the changes?

Yes, I'll get to it. I think we only need the values of the environment variables that the command requires. Or are they already set?

@mlissner
Member

mlissner commented Oct 3, 2024

They're not, I don't think. I'll get them to Ramiro.

cweider added a commit to cweider/courtlistener that referenced this pull request Oct 25, 2024
Harvard's Caselaw Access Project has been sunset. For projects
which have existing references to CAP cases, there's a need to
identify a CAP case's corresponding CL opinion cluster.

An indexed `harvard_id` column is added to `OpinionCluster`. The
field is also added to the `fields` of `OpinionClusterFilter`.

For migration, this patch builds on work done in freelawproject#4284 and freelawproject#4442
and extends `import_harvard_pdfs` to populate the `harvard_id`
column using CAP crosswalk file.

Fixes: freelawproject#4313
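
The migration idea in the commit message above, populating `harvard_id` from the crosswalk, could look roughly like the following. This is a sketch under stated assumptions: the crosswalk field names (`cap_case_id`, `cl_cluster_id`) and the helper function are hypothetical, and the real patch works against Django models instead of plain dictionaries.

```python
def harvard_ids_by_cluster(entries: list[dict]) -> tuple[dict[int, int], list[int]]:
    """Build a {CL cluster ID: CAP case ID} mapping from crosswalk entries.

    Returns the mapping plus the cluster IDs that appeared with conflicting CAP IDs.
    """
    mapping: dict[int, int] = {}
    conflicts: list[int] = []
    for entry in entries:
        cluster_id = entry.get("cl_cluster_id")
        cap_id = entry.get("cap_case_id")
        if cluster_id is None or cap_id is None:
            # Skip incomplete entries instead of guessing a value.
            continue
        if cluster_id in mapping and mapping[cluster_id] != cap_id:
            # Keep the first CAP ID seen and report the conflict to the caller.
            conflicts.append(cluster_id)
            continue
        mapping[cluster_id] = cap_id
    return mapping, conflicts
```

Each pair in the resulting mapping would then correspond to one update of the `harvard_id` column on the matching `OpinionCluster` row.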
cweider added a commit to cweider/courtlistener that referenced this pull request Oct 26, 2024