This script is designed to run on a server with access to an ArchivesSpace installation. It runs a series of checks against the ArchivesSpace database and API, then exports EAD.xml files and evaluates them for content and syntax errors. Finally, it generates an Excel spreadsheet detailing the areas that need data cleanup. For more information about what data is checked, see Workflow.
- lxml - Used to parse downloaded XML files, both to detect XML syntax errors and to extract data from them
- mysql - Used to import mysql-connector
- mysql-connector-python - Used to connect to the ArchivesSpace MySQL database and detect any connection errors
- openpyxl - Used to create and write the Excel spreadsheet that documents the data audit report
- requests - Used to check URLs and get their status codes
- Download the repository by cloning it to your local IDE or by using GitHub's Code button and Download as ZIP
- Run
pip install -r requirements.txt
- Create a secrets.py file with the following information:
- An ArchivesSpace admin username (as_un = "") and password (as_pw = "")
- The URLs to your ArchivesSpace staging (as_api_stag = "") and production (as_api = "") API instances
- The ArchivesSpace data_auditor account username (as_auditor_un = "") and password (as_auditor_pw = "")
- Variables set to the email addresses used to send and receive the report:
- sendfrom_email = "<send_from_email>"
- sendto_emails = ["<send_to_email>", "<send_to_email>", "<send_to_email>"]
- senderror_emails = ["<send_to_email>", "<send_to_email>"]
- The email server from which you send your email report (email_server = "")
- Your ArchivesSpace's staging database credentials, including username (as_dbstag_un = ""), password (as_dbstag_pw = ""), hostname (as_dbstag_host = ""), database name (as_dbstag_database = ""), and port (as_dbstag_port = "")
- Run the script as
python3 ASpace_Data_Audit.py
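As a rough illustration of the secrets.py step above, the file might look like the sketch below. The variable names are the ones listed in the setup steps; every value is a placeholder to replace with your own, and the grouping and comments are only a suggestion.

```python
# secrets.py: placeholder values only; replace each one with your own.

# ArchivesSpace admin credentials
as_un = ""
as_pw = ""

# ArchivesSpace API URLs (staging and production)
as_api_stag = ""
as_api = ""

# ArchivesSpace data_auditor account credentials
as_auditor_un = ""
as_auditor_pw = ""

# Email settings for the audit report and for error notifications
sendfrom_email = "<send_from_email>"
sendto_emails = ["<send_to_email>", "<send_to_email>", "<send_to_email>"]
senderror_emails = ["<send_to_email>", "<send_to_email>"]
email_server = ""

# ArchivesSpace staging database credentials
as_dbstag_un = ""
as_dbstag_pw = ""
as_dbstag_host = ""
as_dbstag_database = ""
as_dbstag_port = ""
```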
Open the console of your choice and navigate to the project directory. Type python3 ASpace_Data_Audit.py to run the script. If you want to run the audit without emailing the results to users, add -t or --test, for example python3 ASpace_Data_Audit.py -t. The testing functionality is still being developed and may not function properly.
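For readers unfamiliar with how a flag like -t / --test is usually wired up, here is a minimal sketch using Python's standard argparse module. It is not taken from ASpace_Data_Audit.py: the function name parse_args and the should_email variable are hypothetical, and the real script may handle its arguments differently.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line arguments (hypothetical helper, not the script's own code)."""
    parser = argparse.ArgumentParser(
        description="Run the ArchivesSpace data audit."
    )
    # -t / --test is a boolean switch: present means True, absent means False.
    parser.add_argument(
        "-t",
        "--test",
        action="store_true",
        help="Run the audit without emailing users the result.",
    )
    return parser.parse_args(argv)


# Passing an explicit list stands in for what would be typed on the command line.
args = parse_args(["-t"])
should_email = not args.test  # assumed behavior: skip the email in test mode
print(f"test mode: {args.test}, send email: {should_email}")
```

Calling parse_args() with no argument reads the real command line, so python3 ASpace_Data_Audit.py -t and python3 ASpace_Data_Audit.py --test would both set args.test to True in this sketch.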
There is a series of unittests that check various functions in ASpace_Data_Audit.py. They are still being developed, and any test should be run with the -t or --test argument as listed in Script Arguments.
- Generate an Excel spreadsheet to use for our report
- Begin running the audit. The audit checks for the following:
- Any new controlled vocabulary terms in the following lists, highlighting the row in red:
- Subject_Term_Type
- Subject_Sources
- Finding_Aid_Status_Terms
- Name_Sources
- Instance_Types
- Extent_Types
- Digital_Object_Types
- Container_Types
- Accession_Resource_Types
- Any archival objects with component unique identifiers
- Any top containers without barcodes
- Any top containers without indicators
- A list of all current users
- Any archival objects with multiple top containers
- Any archival objects with multiple digital objects
- Any archival objects whose level of description is collection
- Any resources with EAD IDs
- Any duplicate subjects
- Any duplicate agent-persons
- Any resources without Creator agents
- Any XML syntax errors in exported EAD.xml files
- Any broken URLs in EAD.xml exports
- Any top containers not linked to any resources or archival objects
- Any archival objects with "otherlevel" and "unspecified" level of description
- Save the spreadsheet and send an email using email_users(). If an error is generated, send a message to the specified user
- Delete the spreadsheet and the exported EAD.xml folder and files from the server, and send an email if there is an error
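To make one of the checks above concrete, the sketch below shows one way the "XML syntax errors in exported EAD.xml files" step could work. It is an assumption-laden illustration, not the script's code: the function name find_xml_syntax_errors and the idea of a single export folder are hypothetical, and it uses the standard library's xml.etree.ElementTree so it runs without extra packages, whereas the script itself lists lxml for this job.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def find_xml_syntax_errors(export_dir):
    """Return (filename, error message) pairs for files that are not well-formed XML.

    Hypothetical helper: export_dir is assumed to be the folder that holds the
    exported EAD.xml files.
    """
    errors = []
    for xml_path in sorted(Path(export_dir).glob("*.xml")):
        try:
            # Parsing raises ParseError as soon as the file stops being well-formed.
            ET.parse(str(xml_path))
        except ET.ParseError as exc:
            errors.append((xml_path.name, str(exc)))
    return errors
```

Each returned pair could then become one row in the audit spreadsheet, with the filename in one column and the parser's message in another.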
- Corey Schmidt - Project Management Librarian/Archivist at the University of Georgia Libraries
- Kevin Cottrell - GALILEO/Library Infrastructure Systems Architect at the University of Georgia Libraries
- ArchivesSpace Community