🔥
|
Neither CollectionSpace (the organization) nor LYRASIS offers support for This application is created and maintained for internal LYRASIS staff use only. Many of its design decisions are based on:
This means:
However, we have made this code available in the spirit of open-source and transparency, in case any of it might be informative for CS institutions/users who wish to build their own tooling for working with CS data at scale. |
See doc/decisions.adoc for more info/background on some of the decisions made
-
Ruby 3.1.0.
-
Docker and docker-compose.
-
Running the required docker-compose command will by default set up two instances of Redis on ports 6380 and 6381. The port numbers used are configurable as described below
-
-
For each CS instance this tool will be used to manipulate:
-
The user of this tool must be able to connect directly to the CollectionSpace database and perform read-only/select queries
-
There must be an AWS S3 bucket set up for ingesting into the instance, and the user of this tool must have permission to add and delete objects to that bucket, as well as list objects in the bucket.
-
There is an AWS Lambda set up to perform a Services API call for each object added to the S3 bucket
-
To avoid having to prepend thor
, rspec
and other commands with bundle exec
, add this repo’s ./bin
to your PATH.
If you do not do this step, and running a command you see in the documentation fails, try prepending `bundle exec ` to the command.
This should "just work" without you having to do anything, but you might want to change it if the ports being used for Redis conflict with something you use for other work.
If you want to change the Redis ports, you need to update them in two places:
-
the
./docker-compose.yml
file (which builds the Redis instances and makes them locally accessible via the given ports) -
the
./redis.yml
file (which tells the application which port/Redis instance to use for storing RefNames vs. CSIDs)
Nothing in ./redis.yml
is sensitive, as it’s all just on your local machine.
You will need to create an AWS profile that this tool will use to interact with S3 Fast Import Buckets (and eventually CloudWatchLogs) and S3 buckets set up to hold client media files for ingest.
-
You will need personal user credentials to the CollectionSpace AWS account. Mark (or whoever admins that account) will need to set that up and ensure your user has the "MigrationsRole"
-
If you do not already have an
~/.aws/credentials
file, create that file. -
In that file, add the following:
[csdefault] aws_access_key_id = {your user key id} aws_secret_access_key = {your user secret access key} [collectionspacemigrationtools] role_arn = arn:aws:iam::163800758780:role/MigrationsRole source_profile = csdefault region = us-west-2 [cs-media] role_arn = arn:aws:iam::005622933699:role/MigrationsRole source_profile = csdefault region = us-west-2
Your personal user credentials go in the [csdefault]
entry. The [collectionspacemigrationtools]
entry uses those credentials to access the account where the fast import buckets are created. The [cs-media]
uses those credentials to access the account where client media buckets are created.
You can change the profile names in brackets as you wish. Just note that the first profile name is the value of source_profile
in the second profile.
Copy ./sample_system_config.yml
to ./system_config.yml
You may wish to tweak the settings in system_config.yml
. These include:
- csv_chunk_size
-
The use/purpose of reading CSVs in chunks is explained in Faster Parsing CSV With Parallel Processing. Each chunk is sent to a parallel worker for processing. A chunk with more rows will take longer to process, but I have not investigated the tradeoff between queueing up/passing on more chunks vs. larger chunks.
- max_threads
-
Maximum number of threads that will be spun up for a given parallel process run in threads.
- max_processes
-
Maximum number of processes that will be spun up for a given parallel process run in processes.
- aws_profile
-
The name of the second AWS profile you created (the one with
role_arn
andsource_profile
specified.
ℹ️
|
The default settings seem to be working ok for not-gigantic migration projects on my DTS-issued Macbook Pro, but I have not yet done much testing to figure out optimal settings for these. I assume if things are running super slowly, try upping max_threads/max_processes. If your system is too strained, lower max_threads/max_processes. I confess I’m not entirely sure if it thread vs process makes a difference in terms of system resource usage, but it seemed like a good idea to separate them in case this mattered. |
💡
|
You can find what uses threads vs. processes by searching this codebase for CMT.config.system.max_threads and CMT.config.system.max_processes .
|
You will need to have:
-
admin-level credentials to the CS instance
-
login info for connecting to the database for the CS instance
-
the name of the S3 Fast Import bucket for the CS instance (currently we need to request that Mark set this bucket up)
See client config management documentation for more details.
Once you have done the one-time config and set up at least one instance, you can verify that your AWS access works by doing the following in this repo’s base directory:
bin/console
CMT::Build::S3Client.call
If you get Success(#<Aws::S3::Client>)
, good. If you get a Failure
, something is not right.
Ensure desired config is in place (See One-time setup and Per-instance setup sections above)
cd
into repository root
docker-compose up -d
(Starts Redis instances. The -d
puts docker-compose into the background, so you can use the terminal for other things)
thor list
(to see available commands)
Run available commands as necessary.
❗
|
Most of the commands for routine workflow usage are under thor batch and thor batches . See workflow overview documentation for details.
|
docker-compose down
(Stops and closes Redis containers. The Redis volumes are NOT removed, so your cached data should still be available next time you run docker-compose up -d
.)
You can also use the IRB console for direct access to all objects:
bin/console
💡
|
If you make changes to code while you are in the console, running CMT.reload! will reload the application without you needing to exit and restart console. This doesn’t always work to pick up all changes, but saves a lot of time anyway.
|
To test, run:
rspec
At least initially, a lot of the functionality around database connections, querying, and anything that relies on a database call is not covered in automated tests. This is mainly because I did not have time to figure out how to test that stuff in a meaningful way without exposing data that needs to be kept private.
-
Built by Kristina Spurgin with design/infrastructure input from Mark Cooper
-
Project scaffold built with Rubysmith.