
Ingest Process

Terry Brady edited this page Jun 24, 2021 · 12 revisions

Initial upload

In Merritt's primary ingest workflow, a curator logs in through the Apache load balancer to the Merritt UI, is authenticated via LDAP, and submits a file, container, or manifest. The payload is staged initially in local storage attached to the UI server, which then submits a job to the Ingest service.

Alternatively, a researcher may enter metadata into Dryad and either upload data files or enter a list of URLs. Dryad assigns the dataset a DOI via either DataCite or EZID and constructs either a ZIP container or a manifest, as appropriate. It then uses a role account to authenticate to the SWORD server via LDAP and submits the payload, giving the DOI as a local ID.

Note: For a high-level overview of the various services, see the Architecture page. For more details on the messages sent between the services, see the Dataflow page.

Unpacking and verification

In either case, the payload is transmitted to the Ingest service (5) and staged in its local storage. If the payload is a container, the Ingest service unpacks it. If the payload is a manifest rather than a container or individual file, the Ingest service reads the manifest and downloads the files listed there. The Ingest service then calculates a SHA-256 digest for each file in the payload and compares it with any digest supplied with the payload.
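The digest check described above can be sketched as follows. This is a minimal illustration in Python, not Merritt's actual code: the function names `sha256_of` and `verify_payload`, the shape of the `supplied` mapping (file name to hex digest), and the status strings are all assumptions made for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, block_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_payload(files, supplied):
    """Compare computed digests with any digests supplied with the payload.

    files: iterable of Path objects for the staged files.
    supplied: dict mapping file name -> expected hex digest (hypothetical shape).
    Returns a dict mapping file name -> "ok", "mismatch", or "unchecked".
    """
    results = {}
    for path in files:
        computed = sha256_of(path)
        expected = supplied.get(path.name)
        if expected is None:
            # No digest was supplied for this file, so there is nothing to compare.
            results[path.name] = "unchecked"
        elif expected.lower() == computed:
            results[path.name] = "ok"
        else:
            results[path.name] = "mismatch"
    return results
```

Reading in fixed-size blocks keeps memory use flat for large files. Files without a supplied digest are reported as unchecked rather than failed, matching the "any supplied" wording above.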

Local IDs

The Ingest service queries the Local ID service for any local IDs given to the payload. If the local ID is not found, Ingest requests a new ARK from EZID, and uses the Local ID service to map it to the local ID.
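The lookup-or-mint logic above might look like the following sketch. The in-memory dictionary standing in for the Local ID service, the `mint_ark` callable standing in for the EZID request, and the function name `resolve_ark` are all hypothetical; the real services are reached over the network and their interfaces are not shown on this page.

```python
from typing import Callable, Dict


def resolve_ark(
    local_id: str,
    local_id_map: Dict[str, str],
    mint_ark: Callable[[], str],
) -> str:
    """Return the ARK mapped to local_id, minting and recording one if absent.

    local_id_map: stand-in for the Local ID service (local ID -> ARK).
    mint_ark: stand-in for requesting a new ARK from EZID.
    """
    ark = local_id_map.get(local_id)
    if ark is None:
        # Local ID not found: request a new ARK and record the mapping.
        ark = mint_ark()
        local_id_map[local_id] = ark
    return ark
```

For example, with a placeholder minter, a second call with the same local ID reuses the recorded ARK instead of minting again:

```python
mapping = {}
first = resolve_ark("doi:10.5072/example", mapping, lambda: "ark:/99999/fk4example")
second = resolve_ark("doi:10.5072/example", mapping, lambda: "ark:/99999/fk4other")
# first and second are the same ARK; the identifiers here are placeholders.
```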

Storage and inventory

When the object is assembled, Ingest pushes a retrieval manifest to the Storage service. Based on the manifest, the Storage service retrieves the object files from the Ingest service and stages them in its own attached local storage. The files are characterized using Apache Tika, and a storage manifest is created. The content and storage manifest are then uploaded to the primary storage node and verified. Large files are split into 5 GB chunks, with each chunk stored as a separate bitstream.
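To illustrate the chunking of large files, the sketch below computes the byte ranges that a file of a given size would be split into. It assumes that "5 GB" means 5 × 2^30 bytes and that the final chunk simply holds the remainder; neither detail is stated on this page, and `chunk_ranges` is a hypothetical helper, not part of the Storage service.

```python
from typing import List, Tuple

# Assumption: "5 GB" is interpreted as 5 * 2**30 bytes.
CHUNK_SIZE = 5 * 2**30


def chunk_ranges(total_size: int, chunk_size: int = CHUNK_SIZE) -> List[Tuple[int, int]]:
    """Return (offset, length) pairs covering a file of total_size bytes.

    Every chunk is chunk_size bytes long except possibly the last one,
    which holds whatever remains.
    """
    if total_size < 0:
        raise ValueError("total_size must be non-negative")
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (offset, min(chunk_size, total_size - offset))
        for offset in range(0, total_size, chunk_size)
    ]
```

Under these assumptions, a 12 GiB file yields three ranges: two full 5 GiB chunks and one 2 GiB remainder, each of which would be stored as a separate bitstream.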

After the content has been successfully stored, the Ingest service adds a job to the Inventory service queue, giving the URL for the storage manifest. The Inventory service then retrieves the storage manifest and uses it to populate the inventory database.

Replication

The Replication service scans the inventory database for objects requiring replication and, as necessary, assigns each one a secondary storage node based on its collection profile. Content is downloaded from the primary storage node to the Replication server's local storage and then uploaded to the secondary node.
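The scan-and-assign step could be sketched as below. The `InventoryObject` record, the integer node numbers, and the `secondary_by_collection` mapping are all illustrative assumptions; the actual inventory schema and collection profile format are not described on this page.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set, Tuple


@dataclass
class InventoryObject:
    """Hypothetical view of one inventory record."""
    ark: str
    collection: str
    nodes: Set[int] = field(default_factory=set)  # nodes already holding a copy


def plan_replication(
    objects: Iterable[InventoryObject],
    secondary_by_collection: Dict[str, int],
) -> List[Tuple[str, int]]:
    """Return (ark, target_node) pairs for objects still missing a secondary copy."""
    plan = []
    for obj in objects:
        target = secondary_by_collection.get(obj.collection)
        # Skip collections with no secondary node configured, and objects
        # that already have a copy on the secondary node.
        if target is not None and target not in obj.nodes:
            plan.append((obj.ark, target))
    return plan
```

Each pair in the returned plan corresponds to one download from the primary node followed by one upload to the secondary node.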

Diagram

Merritt Ingest Component Diagram