Develop Extract for JSON format #4257

Closed
3 of 8 tasks
jbrown-xentity opened this issue Mar 23, 2023 · 3 comments
@jbrown-xentity
Contributor

jbrown-xentity commented Mar 23, 2023

User Story

In order to harvest DCAT-US catalog JSON files, data.gov admins want:

  • an extract utility to load and validate a root JSON file
  • a function to then parse that file into individual dataset records identified by a UUID

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

  • GIVEN a JSON URL is supplied to the python code (such as https://data.doi.gov/data.json)
    WHEN the JSON is downloaded
    THEN it will pass a validation utility to confirm the file can be opened and extracted

  • GIVEN the JSON is valid
    THEN a new UUID "identifier" will be generated for the record

  • GIVEN we have a valid JSON record with an identifier
    THEN we save that JSON record to S3, using its identifier as the filename and <feature>/<sourceId>/<jobId>/<recordId> as the key layout
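
A minimal sketch of the first criterion (download, then confirm the body opens as JSON), assuming only the standard library. The names `fetch_bytes` and `load_json` are hypothetical and not part of any existing harvester code; the `fetch` argument exists only so the download can be swapped out in tests.

```python
import json
from typing import Any, Callable
from urllib.request import urlopen


def fetch_bytes(url: str, timeout: float = 30.0) -> bytes:
    """Default fetcher: download the raw body at `url` (hits the network)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def load_json(url: str, fetch: Callable[[str], bytes] = fetch_bytes) -> Any:
    """Download `url` and confirm the body can be opened as JSON.

    Raises ValueError when the body is not valid JSON.
    """
    raw = fetch(url)
    try:
        # json.loads accepts bytes and detects UTF-8/16/32 itself.
        return json.loads(raw)
    except (json.JSONDecodeError, UnicodeDecodeError) as exc:
        raise ValueError(f"{url} did not return valid JSON: {exc}") from exc
```

Calling `load_json("https://data.doi.gov/data.json")` would use the real network fetcher; passing a stub as `fetch` keeps a test offline.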

Background

Old CKAN process.

This is the first functional step of the harvesting pipeline. See the Harvest Job Lifecycle for pre- and post- steps. Also see the Controller Outline for the inputs and output data types and data structures. There will be future work related to extracting different types of file formats.

It is key not to confuse the file formats with the metadata schemas (i.e. JSON vs. DCAT-US).

We would like this extract function definition to be as abstract as possible so that it is easily extensible and scalable with future capabilities.
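
One possible shape for that abstraction, offered as a sketch rather than a decided design: a registry keyed by file format only, so metadata schema checks (DCAT-US, ISO, and so on) stay in a separate layer. `FORMAT_LOADERS` and `extract` are hypothetical names.

```python
import json
import xml.etree.ElementTree as ET
from typing import Any, Callable, Dict

# Maps a file format name to a function that parses raw bytes.
# Nothing here knows about metadata schemas.
FORMAT_LOADERS: Dict[str, Callable[[bytes], Any]] = {
    "json": json.loads,
    "xml": ET.fromstring,
}


def extract(raw: bytes, file_format: str) -> Any:
    """Parse `raw` with the loader registered for `file_format`."""
    try:
        loader = FORMAT_LOADERS[file_format]
    except KeyError:
        raise ValueError(f"unsupported file format: {file_format!r}") from None
    return loader(raw)
```

Supporting a new format would then mean adding one entry to the registry, without touching callers.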

FYI: In order to create a repeatable test, you can create a local test server like we did for catalog. See test files here and mock server here and here.
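
As an illustration of the local test server idea, here is a small standard-library sketch. It is not the catalog mock server referenced above; `SAMPLE_CATALOG`, `CatalogHandler`, `start_mock_server`, and `fetch_catalog` are made-up names, and the fixture content is invented.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener

SAMPLE_CATALOG = {"dataset": [{"title": "example"}]}  # invented fixture


class CatalogHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/data.json":
            body = json.dumps(SAMPLE_CATALOG).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep test output quiet


def start_mock_server() -> HTTPServer:
    """Serve the fixture on a free localhost port in a background thread."""
    server = HTTPServer(("127.0.0.1", 0), CatalogHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_catalog(server: HTTPServer):
    """Fetch /data.json from the mock server, ignoring any configured proxy."""
    port = server.server_address[1]
    opener = build_opener(ProxyHandler({}))
    with opener.open(f"http://127.0.0.1:{port}/data.json", timeout=5) as resp:
        return json.loads(resp.read())
```

A test would call `start_mock_server()`, run the extract code against the returned port, and finish with `server.shutdown()` and `server.server_close()`.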

Security Considerations (required)

[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!]

Sketch

  • Confirm file at URL is valid JSON format
  • Parse root JSON into individual records
    • This should be done outside the root JSON validation
  • Generate a new UUID for each potential record to store it in S3 and track it within the pipeline
  • Save the new record to S3, using its identifier as the filename and <feature>/<sourceId>/<jobId>/<recordId> as the key layout
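
The parse, UUID, and save steps above could be sketched roughly as follows. This assumes the root catalog keeps its records in a top-level "dataset" list, as DCAT-US catalogs do, and it uses a plain dict as a stand-in for the S3 bucket instead of a real S3 client. All function names are hypothetical.

```python
import json
import uuid
from typing import Any, Dict, List


def parse_records(catalog: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Split an already-validated root catalog into individual records."""
    datasets = catalog.get("dataset")
    if not isinstance(datasets, list):
        raise ValueError('catalog has no "dataset" list')
    return datasets


def record_key(feature: str, source_id: str, job_id: str, record_id: str) -> str:
    """Build the <feature>/<sourceId>/<jobId>/<recordId> key."""
    return f"{feature}/{source_id}/{job_id}/{record_id}"


def store_records(
    catalog: Dict[str, Any],
    store: Dict[str, bytes],  # stand-in for an S3 bucket (assumption)
    feature: str,
    source_id: str,
    job_id: str,
) -> List[str]:
    """Give each record a fresh UUID and save it under its key."""
    keys = []
    for record in parse_records(catalog):
        record_id = str(uuid.uuid4())
        key = record_key(feature, source_id, job_id, record_id)
        store[key] = json.dumps(record).encode("utf-8")
        keys.append(key)
    return keys
```

Keeping `parse_records` separate from the JSON validation matches the note that parsing should happen outside the root validation.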
@hkdctol hkdctol moved this to 📔 Product Backlog in data.gov team board Mar 30, 2023
@hkdctol hkdctol moved this from 📔 Product Backlog to 📟 Sprint Backlog [7] in data.gov team board Mar 30, 2023
@hkdctol hkdctol moved this from 📟 Sprint Backlog [7] to 📔 Product Backlog in data.gov team board Mar 30, 2023
@hkdctol hkdctol added the H2.0/Harvest-General General Harvesting 2.0 Issues label Apr 11, 2023
@rshewitt rshewitt self-assigned this Apr 16, 2023
@rshewitt rshewitt moved this from 📔 Product Backlog to 🏗 In Progress [8] in data.gov team board Apr 16, 2023
@jbrown-xentity
Contributor Author

@jbrown-xentity jbrown-xentity added H2.0/Extract H2.0/Harvest-General General Harvesting 2.0 Issues and removed H2.0/Harvest-General General Harvesting 2.0 Issues labels May 3, 2023
@rshewitt
Contributor

Waiting for discussion on error handling, functional vs OOP, env var formats, and docstrings.

@rshewitt rshewitt moved this from 🏗 In Progress [8] to 📡 Blocked in data.gov team board May 15, 2023
@hkdctol hkdctol moved this from 📡 Blocked to 🏗 In Progress [8] in data.gov team board May 19, 2023
@nickumia-reisys nickumia-reisys changed the title Develop Extract for DCAT-US JSON format Develop Extract for JSON format May 30, 2023
@rshewitt
Contributor

refactored approach

  • download file (xml, json, waf-html); this will always be the catalog
  • schema-specific check (e.g. is_dcatus (json) or iso (xml))
  • parse catalog (this is schema specific)
  • upload to s3

additions

  • add tests which address exceptions
    • 404, 403, and a response where .json() doesn't work
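
A rough sketch of how the exception cases listed above might be exercised without a network, assuming a download step built on `urllib` and `json.loads` instead of a `.json()` call on a response object. `ExtractError`, `download_catalog`, and the deliberately loose `is_dcatus` check are hypothetical; a real schema check would be much stricter.

```python
import json
from typing import Any, Callable
from urllib.error import HTTPError


class ExtractError(Exception):
    """Raised when a catalog cannot be downloaded or decoded."""


def is_dcatus(catalog: Any) -> bool:
    """Loose placeholder check: a dict carrying a "dataset" list."""
    return isinstance(catalog, dict) and isinstance(catalog.get("dataset"), list)


def download_catalog(url: str, fetch: Callable[[str], bytes]) -> Any:
    """Download and decode the catalog, mapping failures to ExtractError."""
    try:
        raw = fetch(url)
    except HTTPError as exc:
        raise ExtractError(f"{url} returned HTTP {exc.code}") from exc
    try:
        return json.loads(raw)
    except ValueError as exc:  # JSONDecodeError and UnicodeDecodeError
        raise ExtractError(f"{url} is not valid JSON") from exc


def failing_fetch(code: int, reason: str) -> Callable[[str], bytes]:
    """Build a fake fetcher that always raises the given HTTP error."""
    def fetch(url: str) -> bytes:
        raise HTTPError(url, code, reason, None, None)
    return fetch


def error_message(url: str, fetch: Callable[[str], bytes]) -> str:
    """Return the ExtractError message, or "" if no error was raised."""
    try:
        download_catalog(url, fetch)
    except ExtractError as exc:
        return str(exc)
    return ""
```

With these helpers, a test for each case reduces to checking `error_message(...)` against a fake fetcher for 404, 403, or a non-JSON body.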

@rshewitt rshewitt moved this from 🏗 In Progress [8] to ✔ Done in data.gov team board Jun 5, 2023
@hkdctol hkdctol closed this as completed Jun 8, 2023
@hkdctol hkdctol moved this from ✔ Done to 🗄 Closed in data.gov team board Jun 8, 2023
@btylerburton btylerburton removed the H2.0/Harvest-General General Harvesting 2.0 Issues label Dec 13, 2023
4 participants