Develop Extract for JSON format #4257

jbrown-xentity · 2023-03-23T21:23:01Z

User Story

In order to harvest DCAT-US catalog JSON files, data.gov admins want:

an extract utility to load and validate a root JSON file
a function to then parse that file into individual dataset records identified by a UUID

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

GIVEN a JSON URL is supplied to the Python code (such as https://data.doi.gov/data.json)
WHEN the JSON is downloaded
THEN it will pass a validation utility to confirm the file can be opened and extracted
GIVEN the JSON is valid
THEN a new UUID "identifier" will be generated for the record
GIVEN we have a valid JSON record with an identifier
THEN we save that JSON record to S3, using the identifier as the filename and a prefix of the form <feature>/<sourceId>/<jobId>/<recordId>
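
As a rough illustration of the first criterion, the sketch below downloads a URL and checks that the body can be opened and decoded as JSON. It uses only the standard library. The names `InvalidCatalogError`, `parse_root_json`, and `load_root_json` are hypothetical and not taken from the harvester code base, and treating the root as a JSON object is an assumption.

```python
import json
import urllib.request


class InvalidCatalogError(ValueError):
    """Raised when the downloaded payload is not usable JSON (hypothetical name)."""


def parse_root_json(raw: bytes) -> dict:
    """Confirm the payload can be decoded and parsed as a JSON object."""
    try:
        document = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        raise InvalidCatalogError(f"payload is not valid JSON: {exc}") from exc
    # Assumption: the root of a catalog file is a JSON object, not a list.
    if not isinstance(document, dict):
        raise InvalidCatalogError("expected a JSON object at the root")
    return document


def load_root_json(url: str, timeout: float = 30.0) -> dict:
    """Download the file at `url` and validate it as root JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_root_json(response.read())
```

Keeping the parsing step separate from the download makes the validation testable without network access.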

Background

Old CKAN process.

This is the first functional step of the harvesting pipeline. See the Harvest Job Lifecycle for the pre- and post-steps. Also see the Controller Outline for the input and output data types and data structures. There will be future work related to extracting other file formats.

It is key not to confuse the file formats with the metadata schemas (i.e. JSON vs. DCAT-US).

We would like this extract function definition to be as abstract as possible so that it is easily extensible and scalable with future capabilities.

FYI: In order to create a repeatable test, you can create a local test server like we did for catalog. See test files here and mock server here and here.
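
One possible shape for such a local test server, using only the standard library, is sketched below. This is not the catalog mock server referenced above; `SAMPLE_CATALOG`, `CatalogHandler`, and `start_mock_server` are made-up names, and the fixture content is invented for illustration.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Invented fixture; a real test would serve files from the test directory.
SAMPLE_CATALOG = {"dataset": [{"title": "example"}]}


class CatalogHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/data.json":
            body = json.dumps(SAMPLE_CATALOG).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            # Any other path answers 404, which is handy for error-path tests.
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep test output quiet


def start_mock_server() -> ThreadingHTTPServer:
    """Start the server on a free local port in a background thread."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), CatalogHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A test would call `start_mock_server()`, read the port from `server.server_address[1]`, point the extract code at `http://127.0.0.1:<port>/data.json`, and finish with `server.shutdown()`.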

Security Considerations (required)

[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!]

Sketch

Confirm file at URL is valid JSON format
Parse root JSON into individual records
- This should be done outside the root JSON validation
Generate a new UUID for each potential record to store it in S3 and track it within the pipeline
Save the new record to S3, using the identifier as the filename and a prefix of the form <feature>/<sourceId>/<jobId>/<recordId>
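
The parsing, UUID, and storage steps of this sketch could look roughly like the following. It assumes the records sit in a top-level `dataset` list (as in DCAT-US catalogs) and takes the storage call as a `put` parameter, so no S3 client is required to run it. All function names here are hypothetical.

```python
import json
import uuid
from typing import Callable, List


def split_records(catalog: dict) -> List[dict]:
    """Return the individual records of an already-validated root catalog.

    Assumption: records live in a top-level "dataset" list.
    """
    records = catalog.get("dataset")
    if not isinstance(records, list):
        raise ValueError('root JSON has no "dataset" list')
    return records


def record_key(feature: str, source_id: str, job_id: str, record_id: str) -> str:
    """Build the <feature>/<sourceId>/<jobId>/<recordId> key."""
    return f"{feature}/{source_id}/{job_id}/{record_id}"


def store_records(
    catalog: dict,
    feature: str,
    source_id: str,
    job_id: str,
    put: Callable[[str, bytes], None],
) -> List[str]:
    """Give each record a new UUID and hand it to `put` under its key."""
    keys = []
    for record in split_records(catalog):
        record_id = str(uuid.uuid4())
        key = record_key(feature, source_id, job_id, record_id)
        put(key, json.dumps(record).encode("utf-8"))
        keys.append(key)
    return keys
```

In a test, `put` can be a dictionary's `__setitem__`. Against real storage it could wrap a call such as boto3's `put_object(Bucket=..., Key=key, Body=body)`, with the bucket name supplied by configuration.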


jbrown-xentity · 2023-05-02T18:56:45Z

Replace the current GitHub Action logic with the Poetry action setup

rshewitt · 2023-05-15T17:22:42Z

Waiting for discussion on error handling, functional vs OOP, env var formats, and docstrings.

rshewitt · 2023-05-31T16:54:17Z

refactored approach

download file ( xml, json, waf-html ) THIS WILL ALWAYS BE THE CATALOG
schema_specific check ( e.g. is_dcatus ( json ) or iso ( xml ) )
parse catalog ( this is schema specific )
upload to s3
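
A minimal sketch of how the schema-specific steps might be kept separate from the file format, assuming a registry keyed by schema name. It starts from an already-downloaded catalog, and `SchemaHandler`, `HANDLERS`, `parse_dcatus`, and `extract` are illustrative names only. The `is_dcatus` check here is deliberately shallow; real validation would use the published schema.

```python
from typing import Callable, Dict, List, NamedTuple


class SchemaHandler(NamedTuple):
    matches: Callable[[object], bool]     # schema-specific check
    parse: Callable[[object], List[dict]]  # schema-specific parse


def is_dcatus(catalog: object) -> bool:
    # Placeholder check: a JSON object carrying a "dataset" list.
    return isinstance(catalog, dict) and isinstance(catalog.get("dataset"), list)


def parse_dcatus(catalog: dict) -> List[dict]:
    return list(catalog["dataset"])


HANDLERS: Dict[str, SchemaHandler] = {
    "dcatus": SchemaHandler(is_dcatus, parse_dcatus),
    # An "iso" entry for XML catalogs would be registered the same way.
}


def extract(catalog: object, schema: str, upload: Callable[[dict], None]) -> int:
    """Check, parse, and upload a downloaded catalog; return the record count."""
    handler = HANDLERS.get(schema)
    if handler is None:
        raise ValueError(f"unsupported schema: {schema}")
    if not handler.matches(catalog):
        raise ValueError(f"catalog does not look like {schema}")
    records = handler.parse(catalog)
    for record in records:
        upload(record)
    return len(records)
```

Supporting a new schema then means adding one registry entry rather than changing `extract` itself.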

additions

add tests which address exceptions
- 404, 403, .json() doesn't work
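
These exception cases could be tested along the following lines. The sketch assumes the HTTP call is injected as a `get` function returning an object with `status_code` and `.json()`, in the style of a `requests` response, and that a failed `.json()` raises a `ValueError` subclass. `ExtractError`, `fetch_catalog`, and `FakeResponse` are hypothetical names.

```python
import json


class ExtractError(Exception):
    """Hypothetical error type for a failed download or decode."""


def fetch_catalog(url, get):
    """Fetch `url` with the injected `get` and return the decoded JSON."""
    response = get(url)
    if response.status_code != 200:
        raise ExtractError(f"{url} returned HTTP {response.status_code}")
    try:
        return response.json()
    except ValueError as exc:
        raise ExtractError(f"{url} did not return valid JSON") from exc


class FakeResponse:
    """Stand-in for an HTTP response, so the tests need no network."""

    def __init__(self, status_code, text=""):
        self.status_code = status_code
        self.text = text

    def json(self):
        return json.loads(self.text)


URL = "https://example.test/data.json"  # placeholder, never contacted


def fails_with(status_code, text=""):
    """Return True if fetch_catalog raises ExtractError for this response."""
    try:
        fetch_catalog(URL, lambda url: FakeResponse(status_code, text))
    except ExtractError:
        return True
    return False


def test_http_errors():
    assert fails_with(404)
    assert fails_with(403)


def test_invalid_json_body():
    assert fails_with(200, "<html>not json</html>")


def test_valid_catalog():
    get = lambda url: FakeResponse(200, '{"dataset": []}')
    assert fetch_catalog(URL, get) == {"dataset": []}
```

The `test_` functions follow the pytest naming convention but are plain functions and can also be called directly.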

jbrown-xentity added this to data.gov team board Mar 23, 2023

hkdctol moved this to 📔 Product Backlog in data.gov team board Mar 30, 2023

hkdctol moved this from 📔 Product Backlog to 📟 Sprint Backlog [7] in data.gov team board Mar 30, 2023

hkdctol moved this from 📟 Sprint Backlog [7] to 📔 Product Backlog in data.gov team board Mar 30, 2023

hkdctol added the H2.0/Harvest-General General Harvesting 2.0 Issues label Apr 11, 2023

rshewitt self-assigned this Apr 16, 2023

rshewitt moved this from 📔 Product Backlog to 🏗 In Progress [8] in data.gov team board Apr 16, 2023

jbrown-xentity added H2.0/Extract H2.0/Harvest-General General Harvesting 2.0 Issues and removed H2.0/Harvest-General General Harvesting 2.0 Issues labels May 3, 2023

rshewitt moved this from 🏗 In Progress [8] to 📡 Blocked in data.gov team board May 15, 2023

hkdctol moved this from 📡 Blocked to 🏗 In Progress [8] in data.gov team board May 19, 2023

nickumia-reisys changed the title ~~Develop Extract for DCAT-US JSON format~~ Develop Extract for JSON format May 30, 2023

rshewitt mentioned this issue May 31, 2023

Configure S3 to store Harvester Records #4335

Open

6 tasks

rshewitt moved this from 🏗 In Progress [8] to ✔ Done in data.gov team board Jun 5, 2023

hkdctol closed this as completed Jun 8, 2023

hkdctol moved this from ✔ Done to 🗄 Closed in data.gov team board Jun 8, 2023

btylerburton removed the H2.0/Harvest-General General Harvesting 2.0 Issues label Dec 13, 2023
