
Export a repository object (node, media, files) #1096

Open
Natkeeran opened this issue Apr 18, 2019 · 11 comments
Labels
Type: use case — proposes a new feature or function for the software using user-first language.

Comments

@Natkeeran
Contributor

Natkeeran commented Apr 18, 2019

We need to be able to fully export a digital repository object for various use cases, including migration and preservation (AIP/Bags).

  • Ability to export metadata (repository item) and bitstreams (media/files)
  • Option to include versions
  • Option to include derivatives
  • Option to pull in external/redirected bitstreams
  • Some logic to handle conceptual objects (e.g., books, compound objects) so that exporting a book also exports all of its pages
    • Advanced: pull in the full graph of a repository item, i.e., if it has URIs to subjects, pull in those URIs as well

Additional Info:

We probably need a method to ingest the exported object as well.

@Natkeeran Natkeeran changed the title Export repository object (node, media, files) Export a repository object (node, media, files) Apr 18, 2019
@mjordan
Contributor

mjordan commented Apr 18, 2019 via email

@rangel35

What we do at UT-Austin in Islandora 7 is use the bagging feature so our users can request Bags for preservation purposes and offsite vaulting. We use bagging via the interface for Bags under 2 GB; Bags over 2 GB get queued for drush processing and are bagged overnight.

We provide the ability to bag ALL datastreams and metadata of the object, and for paged content it will also bag the "pages" and their datastreams and metadata.

Our users have also requested the ability to bag selected datastreams.
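The Bags described above follow the standard BagIt layout (payload under data/, a checksum manifest, and a bagit.txt declaration). A minimal stdlib-only sketch of that structure — illustrative only; the real tooling uses a proper BagIt library, and the function name and payload shape here are hypothetical:

```python
import hashlib
from pathlib import Path

def make_minimal_bag(bag_dir: str, payload: dict) -> None:
    """Create a minimal BagIt bag: data/ payload, sha256 manifest, bagit.txt.

    `payload` maps datastream-style filenames (e.g. "MODS.xml", "OBJ.tiff")
    to their bytes.
    """
    bag = Path(bag_dir)
    data = bag / "data"
    data.mkdir(parents=True, exist_ok=True)

    manifest_lines = []
    for name, content in payload.items():
        (data / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")

    # Manifest lists every payload file with its checksum, enabling
    # later validation of the Bag's fixity.
    (bag / "manifest-sha256.txt").write_text("\n".join(manifest_lines) + "\n")
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n"
    )
```

Usage would look like `make_minimal_bag("/tmp/bag_4", {"MODS.xml": b"<mods/>"})`; validating a Bag is then a matter of re-hashing each payload file against the manifest.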

@Natkeeran
Contributor Author

UTSC has a similar use case and workflow to the one noted by @rangel35. A MODS flag indicates which objects can be bagged, a report is generated with the PIDs, and the objects are exported via the command line using drush.

(We considered adding a PREMIS event on Bag creation, but did not implement it because it seemed to complicate the workflow.)

In Islandora 7.x we bag the full ATOM zip (including versions) with archive context. In one of the storage locations, we aim to do validation of the Bags as well.

In 7.x we run into problems exporting large objects or collections consistently, so the command line seems to work best.

Having the option to bag from the UI and the Islandora API would be nice, as we don't currently have a way to download a whole object.

Also, it would be ideal to have an option to ingest from a Bag or another export format.

@mjordan
Contributor

mjordan commented Apr 25, 2019

Some preliminary thoughts on a Bagging microservice:

  1. Take something like Islandora Bagger and put a REST interface on top of it.
  2. From within Islandora 8, a user chooses to Bag an object via the GUI, which POSTs a message containing the node ID to the microservice; the microservice then creates the Bag, as Islandora Bagger does now, by fetching the various files, metadata, etc. from Islandora via Islandora's REST interface. The module running in Drupal doesn't do any bagging; it just sends the request to create a Bag (and maybe exposes the results of the bagging process back to the user, see next point).
  3. On successful creation of the Bag, the microservice sends an email to the user containing the URL of the Bag to download (or some indication of where the bag can be found); or alternatively, the new Bag's URL is provided via the microservice's REST interface so it can show up in a Drupal View, etc.
  4. The microservice would retain its command-line UI so it can be incorporated into automation scripts, etc.

Having a microservice separate from Drupal do the bagging would allow the jobs to run as long as they need to, eliminating the risk of timing out in front of the user because the bagging is done asynchronously. We'd need to figure out how to allow for different Bag options, but those could possibly be sent as the body of the REST POST request.
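The receiving side of that flow can be very thin: accept the POST, record the node ID, and return immediately, leaving the actual bagging to a separate worker. A minimal stdlib sketch, assuming a hypothetical `Islandora-Node-ID` header and queue-file location (Islandora Bagger's real implementation is a Symfony app, not this):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path

QUEUE_FILE = Path("islandora_bagger.queue")  # hypothetical queue location

class BagRequestHandler(BaseHTTPRequestHandler):
    """Accept a Bag-creation request and enqueue it; a worker bags later."""

    def do_POST(self):
        node_id = self.headers.get("Islandora-Node-ID")
        if node_id is None:
            self.send_response(400)
            self.end_headers()
            return
        # Only the node ID is recorded here; the bagging itself happens
        # asynchronously, so this request returns immediately and can
        # never time out in front of the user.
        with QUEUE_FILE.open("a") as q:
            q.write(node_id + "\n")
        body = json.dumps({"queued": node_id}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

def serve(port=8001):
    """Run the enqueue endpoint (blocking)."""
    HTTPServer(("127.0.0.1", port), BagRequestHandler).serve_forever()
```

A request like `curl -X POST -H "Islandora-Node-ID: 4" http://127.0.0.1:8001/api/createbag` would then get `{"queued": "4"}` back while the Bag is built later.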

@Natkeeran with regard to ingest from a Bag, that is something users have been requesting for a while. But, with Islandora 8's nice REST interfaces, we can probably figure out how to map the contents of a Bagged object back to the originating components of the node+media fairly easily and push it into Islandora using something like https://github.com/mjordan/claw_rest_ingester. I think using URIs to define what taxonomy terms should be assigned to the reingested object would be useful here as well.

@dannylamb
Contributor

@mjordan @Natkeeran I would love to see Bags (or zipped Bags, really) become the new zip importer format. I don't know how feasible that is given how widely Bags can vary, but it makes sense to move away from a bespoke format toward a more widely adopted one.

@Natkeeran
Contributor Author

Natkeeran commented Apr 26, 2019

@mjordan @dannylamb

The feature set for the microservice looks good. We can extend it later on the Drupal side to have a flag and a queue/cron mechanism.

Ingest would be a neat addition, with use cases such as restore from backup, migration, and batch ingest from zip. Having ingest from zip can theoretically be seen as bootstrapping Drupal from Fedora as well.

Some points to consider:

  • Exporting and importing the full graph is the major challenge. For example, a person has relationships to other people. I don't know enough graph theory to determine how to find the full graph, and how to avoid circular loops!

  • The second, related challenge is persistent identifiers. What is our PID? If the Drupal nid or taxonomy ID is the PID, does Drupal allow us to set it? Do we want to support the use case where people install Islandora 8 in an existing instance of Drupal? Do we have a persistent ID in Fedora?

  • Does the Fedora or Drupal representation provide a logical representation of the full repository object similar to FOXML in 7.x? Maybe via the Portland Common Data Model? Though this adds a level of complexity, do we need such a representation (e.g., METS) for preservation (OAIS AIP compliance) purposes?

  • We should be clear about how we are handling conceptual entities (e.g., books, compound objects).
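On the first point, the standard answer from graph theory is a breadth- or depth-first traversal with a visited set: the set is exactly what prevents circular references from looping forever. A small sketch, where `get_linked_ids` is a hypothetical stand-in for whatever lookup returns the IDs an object references (members, pages, linked agents, subject URIs, ...):

```python
from collections import deque

def collect_export_set(start_id, get_linked_ids):
    """Breadth-first walk of a repository object's reference graph.

    Returns the set of all object IDs reachable from `start_id`.
    The `visited` set guarantees each object is expanded once, so
    cycles (A -> B -> A) terminate instead of looping.
    """
    visited = set()
    queue = deque([start_id])
    while queue:
        current = queue.popleft()
        if current in visited:
            continue
        visited.add(current)
        for linked in get_linked_ids(current):
            if linked not in visited:
                queue.append(linked)
    return visited
```

The harder policy question remains where to *stop* (e.g., follow a person's subject URI but not that person's other relationships); that is a cutoff rule layered on top of this traversal, not a traversal problem.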

@mjordan
Contributor

mjordan commented Apr 26, 2019

@Natkeeran yes, those are all significant issues, but I see them as out of scope for the BagIt functionality. They are more data modeling issues, aren't they?

@dannylamb couldn't agree more. Even if an institution hasn't adopted BagIt widely, the tooling is decent, and it is always easier to convert from a standard format than from a bespoke one, especially from a long-term preservation perspective (e.g., when the platform tied to the bespoke format hasn't been in use for 20 years).

@rangel35

Islandora 8 creates a UUID; couldn't we use that as the PID? Or are you thinking more along the lines of a standard namespace-type PID?

@mjordan
Contributor

mjordan commented Apr 30, 2019

In order for the creation of Bags to be truly decoupled from the Drupal module POSTing the request, we either need to issue the request using an asynchronous Guzzle call or an asynchronous JavaScript request, or do something on the microservice side that collects node IDs in a file and then runs as a batched cron job.

One advantage of the batch approach is that since the bagger would be running in a CLI environment, it wouldn't time out like it would if the bags were generated within an HTTP response.
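The batch approach described above amounts to draining a FIFO queue file from cron. A minimal sketch, where `bag_one` is a hypothetical stand-in for invoking the bagger's CLI for a single node:

```python
from pathlib import Path

def process_queue(queue_path, bag_one):
    """Drain a FIFO queue file of node IDs, bagging each in turn.

    `bag_one(node_id)` stands in for running the bag-creation command
    for one node. Because this runs from cron in a CLI context, each
    Bag can take as long as it needs without any HTTP timeout in play.
    """
    queue = Path(queue_path)
    if not queue.exists():
        return []
    node_ids = [line.strip() for line in queue.read_text().splitlines()
                if line.strip()]
    processed = []
    for node_id in node_ids:  # FIFO: oldest requests first
        bag_one(node_id)
        processed.append(node_id)
    queue.write_text("")  # truncate once everything is bagged
    return processed
```

One caveat a real implementation would handle: requests arriving while the queue is being processed, which the truncate-at-the-end step here would silently drop (file locking or a rename-then-process scheme avoids that).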

@mjordan
Contributor

mjordan commented May 6, 2019

Did some work on Islandora Bagger over the weekend. It now has a REST API that lets you add a node ID and settings file to a queue. It also has a simple FIFO queue manager and a console command to process the queue. The original CLI create_bag command still works as it used to.

The README explains how it works: as POST requests like this come in:

curl -v -X POST -H "Islandora-Node-ID: 4" --data-binary "@sample_config.yml" http://127.0.0.1:8001/api/createbag

each request's node ID is added to the queue, along with the path to the settings YAML file (the YAML is the body of the request). In a cronjob, you would run the following to process the queue:

./bin/console app:islandora_bagger:process_queue --queue var/islandora_bagger.queue

which loops through the queue and runs the create_bag CLI command (it does this using internal Symfony methods):

./bin/console app:islandora_bagger:create_bag --settings=sample_config.yml --node=112
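Wired into cron, the queue processor above might be scheduled like this (the install path and schedule are illustrative, not part of Islandora Bagger):

```shell
# m h dom mon dow  command — process the Bag queue nightly at 02:00
0 2 * * * cd /opt/islandora_bagger && ./bin/console app:islandora_bagger:process_queue --queue var/islandora_bagger.queue
```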

@rosiel
Member

rosiel commented May 9, 2020

The Robertson Library's RDM project uses Mark Jordan's Islandora Bagger and integration module.

We have a BagIt Ansible role which installs our forks of islandora_bagger and islandora_bagger_integration.


No branches or pull requests

6 participants