Generate crosswalk and import cap pdf #4442
Looks good! Didn't approve it because I think John proposed some changes, but this looks good to me 👍
@flooie I just put this on your backlog to prioritize. The background is that we want to have the Harvard PDFs imported into CL and we want to do so regularly. Grab me when you have a sec, and I can give the details before you review or assign to somebody else to review.
In general, the functions need type hints, and their docstrings should be updated to follow the format used in CourtListener.
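As an illustration of what that request could look like, here is a minimal sketch of a typed function with a reST-style docstring. The function name, its parameters, and the `:param:`/`:return:` layout are assumptions made for the example and are not taken from this PR; the project's own contributor guidelines are the authority on the exact format.

```python
from typing import Optional


def find_cluster_by_citation(
    citation: str, known_cases: dict[str, int]
) -> Optional[int]:
    """Look up a cluster ID by its citation string.

    Hypothetical helper, used only to show type hints plus a docstring.

    :param citation: Citation to look up, e.g. "100 A.2d 36".
    :param known_cases: Mapping of citation strings to cluster IDs.
    :return: The matching cluster ID, or None if there is no match.
    """
    # Surrounding whitespace is stripped so that minor formatting
    # differences in the input do not prevent a match.
    return known_cases.get(citation.strip())
```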
Everything looks good, and you've addressed the comments I left. One function is still missing its docstring, and my comment about the find_matching_case() function is still pending.
Once those are resolved, I think it will be ready.
f"Single match found: {matched_case.case_name} (CL ID: {matched_case.id})"
)
return matched_case
elif match_count > 1:
What would happen if there were several cases on the same page (with the same citation)? Wouldn't the PDF always be added to the same cluster (the first one that matches the citation)?
In this case, I suppose that would be the result, but I was under the impression that when this occurs the citations are differentiated by page number, as in "100 A.2d 36a" or "100 A.2d 36b". Is this not the case?
If they were the exact same citation starting on the same page, then there would be multiple matches and a logged warning with the citations.
If this is a more common occurrence than I thought, then I'll need to find a way to be more specific. How would CL differentiate between the two in this case? By docket number?
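To make the ambiguity concrete, the branching under discussion can be sketched roughly as follows. This is not the PR's `find_matching_case()` implementation: the `Case` dataclass, the `pick_match` helper, and the choice to return nothing when several candidates share a citation are all assumptions made for the example.

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger(__name__)


@dataclass
class Case:
    """Hypothetical stand-in for a cluster record."""

    id: int
    case_name: str
    citation: str


def pick_match(citation: str, candidates: list[Case]) -> Optional[Case]:
    """Return the single case with this citation, or None if 0 or 2+ match.

    :param citation: Citation string to match exactly.
    :param candidates: Cases to search.
    :return: The unique match, or None when there is no unique match.
    """
    matches = [case for case in candidates if case.citation == citation]
    if len(matches) == 1:
        return matches[0]
    if len(matches) > 1:
        # Several cases share the citation (e.g. short opinions on the
        # same page), so log the candidates instead of guessing.
        logger.warning(
            "Multiple matches for %s: %s",
            citation,
            [case.id for case in matches],
        )
    return None
```

With this shape, a PDF is never attached to an arbitrary first match; ambiguous citations surface in the log and need another discriminator to be resolved.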
Addressed in this comment below: #4442 (comment)
Everything looks very good. Using the CAP ID greatly simplified the matching, and I have no more comments.
I think we're just waiting for @flooie's final review here, but he's out sick today. Hopefully we'll get this merged shortly. Sorry for the delay.
This looks good to me. Thanks for taking the time to rewrite the matching code.
move cap env variables to misc.py
@mlissner Thanks for your suggestions. The failing test passed on a re-run and appears to be related to an error in the meeting I missed.
Woot! We're merged. @quevon24, do you want to file an issue to get this deployed and work with Ramiro on landing the changes?
Yes, I'll get to it. I think we would only need the values of the environment variables required by the command. Or are they already set?
I don't think they are. I'll get them to Ramiro.
Harvard's Caselaw Access Project has been sunset. For projects which have existing references to CAP cases, there's a need to identify a CAP case's corresponding CL opinion cluster. An indexed `harvard_id` column is added to `OpinionCluster`. The field is also added to the `fields` of `OpinionClusterFilter`. For migration, this patch builds on work done in freelawproject#4284 and freelawproject#4442 and extends `import_harvard_pdfs` to populate the `harvard_id` column using the CAP crosswalk file. Fixes: freelawproject#4313
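As a rough sketch of the idea of populating `harvard_id` from a crosswalk, the snippet below builds a cluster-to-CAP-ID mapping from crosswalk entries. The JSON layout and the key names `cap_case_id` and `cl_cluster_id` are assumptions made for illustration, and `harvard_ids_by_cluster` is a hypothetical helper; none of them are taken from the patch, and the real command may read and store the data differently.

```python
import json


def harvard_ids_by_cluster(crosswalk_json: str) -> dict[int, int]:
    """Map CL cluster IDs to CAP case IDs from a crosswalk JSON string.

    The key names used here are hypothetical, not taken from the patch.

    :param crosswalk_json: JSON array of crosswalk entries.
    :return: Mapping of cluster ID to CAP case ID.
    """
    entries = json.loads(crosswalk_json)
    return {
        entry["cl_cluster_id"]: entry["cap_case_id"]
        for entry in entries
        # Skip entries that lack either side of the mapping.
        if entry.get("cl_cluster_id") is not None
        and entry.get("cap_case_id") is not None
    }
```

A mapping like this could then drive a bulk update of the new column, one cluster at a time or in batches.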
Import Harvard Case Law Access Project (CAP) PDFs to CourtListener
Issues Addressed:
Changes Implemented:
Testing Instructions:
Prerequisites:
Steps:
Generate the crosswalk:
This command will create crosswalk files in cl/search/crosswalks/.
Import CAP PDFs:
This command will use the generated crosswalk to fetch and store PDFs.
To test with specific parameters:
Verification:
Check the cl/search/crosswalks/ directory for generated crosswalk files.
Screenshots:
CAP Crosswalk File (Sample Data):
Imported PDFs (Local Storage):