Develop Extract for JSON format #4257
Comments
hkdctol
moved this from 📔 Product Backlog
to 📟 Sprint Backlog [7]
in data.gov team board
Mar 30, 2023
hkdctol
moved this from 📟 Sprint Backlog [7]
to 📔 Product Backlog
in data.gov team board
Mar 30, 2023
Replace current GitHub Action logic with Poetry action setup.
jbrown-xentity
added
H2.0/Extract
H2.0/Harvest-General
General Harvesting 2.0 Issues
and removed
H2.0/Harvest-General
General Harvesting 2.0 Issues
labels
May 3, 2023
Waiting for discussion on error handling, functional vs. OOP, env var formats, and docstrings.
nickumia-reisys
changed the title
Develop Extract for DCAT-US JSON format
Develop Extract for JSON format
May 30, 2023
6 tasks
refactored approach
additions
User Story
In order to harvest DCAT-US catalog JSON files, data.gov admins want:
Acceptance Criteria
[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]
GIVEN a JSON URL is supplied to the python code (such as https://data.doi.gov/data.json)
WHEN the JSON is downloaded
THEN it will pass a validation utility to confirm the file can be opened and extracted
GIVEN the JSON is valid
THEN a new UUID "identifier" will be generated for the record
GIVEN we have a valid JSON record with an identifier
THEN we save that JSON record to S3 with the filename identifier, with prefix as <feature>/<sourceId>/<jobId>/<recordId>
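To make the criteria above concrete, here is a rough sketch of the validate, identify, and key-building steps. It is only one possible shape, not the agreed design. The function names (`parse_record`, `extract`) are hypothetical, and it assumes the whole downloaded JSON document is "the record" and that the new UUID is stored in a top-level "identifier" field. The download and the S3 upload are left out so the sketch stays runnable offline.

```python
import json
import uuid


def parse_record(raw: bytes) -> dict:
    """Validate that the downloaded payload opens as JSON and holds an object."""
    record = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object at the top level")
    return record


def extract(raw: bytes, feature: str, source_id: str, job_id: str) -> tuple[str, bytes]:
    """Return the S3 object key and body for one downloaded JSON payload.

    Assumption: the record ID in the key is the newly generated identifier.
    """
    record = parse_record(raw)
    record_id = str(uuid.uuid4())  # new UUID "identifier" for the record
    record["identifier"] = record_id
    key = f"{feature}/{source_id}/{job_id}/{record_id}"
    return key, json.dumps(record).encode("utf-8")
```

The returned key and body could then be handed to whatever S3 client the pipeline settles on, for example boto3's `put_object(Bucket=..., Key=key, Body=body)`; the bucket name and client setup are not specified in this issue.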
Background
Old CKAN process.
This is the first functional step of the harvesting pipeline. See the Harvest Job Lifecycle for pre- and post- steps. Also see the Controller Outline for the inputs and output data types and data structures. There will be future work related to extracting different types of file formats.
It is key not to confuse the file formats with the metadata schemas (i.e. JSON vs. DCAT-US).
We would like this extract function definition to be as abstract as possible so that it can be easily extended and scaled with future capabilities.
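One way to keep the extract step abstract is to dispatch on file format through a small registry, so that adding a new format means registering one more parser. The sketch below is an illustration only: `PARSERS`, `register`, and `extract_payload` are hypothetical names, and the functional style is just one of the options still under discussion. Note that it keys on the file format (JSON) and knows nothing about the metadata schema (DCAT-US).

```python
import json
from typing import Any, Callable, Dict

Parser = Callable[[bytes], Any]

# Hypothetical registry: file format name -> parser for that format.
PARSERS: Dict[str, Parser] = {}


def register(file_format: str) -> Callable[[Parser], Parser]:
    """Decorator that registers a parser under a file format name."""
    def decorator(func: Parser) -> Parser:
        PARSERS[file_format] = func
        return func
    return decorator


@register("json")
def parse_json(raw: bytes) -> Any:
    """Parse a JSON payload; schema validation is a separate concern."""
    return json.loads(raw)


def extract_payload(raw: bytes, file_format: str) -> Any:
    """Parse raw bytes with the parser registered for file_format."""
    try:
        parser = PARSERS[file_format]
    except KeyError:
        raise ValueError(f"unsupported file format: {file_format}") from None
    return parser(raw)
```

A future format would only need its own `@register("...")` function; `extract_payload` itself would not change.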
FYI: In order to create a repeatable test, you can create a local test server like we did for catalog. See test files here and mock server here and here.
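For reference, a repeatable test of this kind can be sketched with the standard library alone. This is not the catalog mock server linked above, only a minimal stand-in; the `SAMPLE` payload, the `CatalogHandler` class, and the `/data.json` path are assumptions made for illustration.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

SAMPLE = {"dataset": []}  # hypothetical fixture standing in for a real data.json


class CatalogHandler(BaseHTTPRequestHandler):
    """Serve the sample JSON for every GET request."""

    def do_GET(self):
        body = json.dumps(SAMPLE).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep test output quiet


def fetch_from_mock_server() -> dict:
    """Start a local server on a free port, download the JSON, then shut down."""
    server = HTTPServer(("127.0.0.1", 0), CatalogHandler)  # port 0 = pick a free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = f"http://127.0.0.1:{server.server_address[1]}/data.json"
        # Bypass any configured proxy so the request stays on localhost.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(url, timeout=5) as response:
            return json.loads(response.read())
    finally:
        server.shutdown()
        server.server_close()
```

Because the server binds to port 0, tests do not collide on a fixed port and need no external network access.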
Security Considerations (required)
[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!]
Sketch