Runs a series of checks against ArchivesSpace data for content cleanup

uga-libraries/aspace_data_audit

ArchivesSpace Data Auditor

Overview

This script is designed to run on a server with access to an ArchivesSpace installation. It runs a series of checks against ArchivesSpace data by querying the database, accessing records through the API, and exporting EAD.xml files and evaluating them for content and syntax errors. The script then generates an Excel spreadsheet detailing the areas that need data cleanup. For more information about what data is checked, see Workflow.

Getting Started

Dependencies

  • lxml - Used to parse XML files for evaluating any XML syntax errors and parsing data from downloaded XML files
  • mysql - Used to import mysql-connector
  • mysql-connector-python - Used to connect to the ArchivesSpace MySQL database and detect any connection errors
  • openpyxl - Used to create and write an Excel spreadsheet to document data audit report
  • requests - Used to check URLs and get their status codes

Installation

  1. Download the repository by cloning it to your local machine or by using GitHub's Code button and Download ZIP
  2. Run pip install -r requirements.txt
  3. Create a secrets.py file with the following information:
    1. An ArchivesSpace admin username (as_un = "") and password (as_pw = "")
    2. The URLs to your ArchivesSpace staging (as_api_stag = "") and production (as_api = "") API instances
    3. The ArchivesSpace data_auditor account username (as_auditor_un = "") and password (as_auditor_pw = "")
    4. Variables with their values set to user emails you want to send the report to
      1. sendfrom_email = "<send_from_email>"
      2. sendto_emails = ["<send_to_email>", "<send_to_email>", "<send_to_email>"]
      3. senderror_emails = ["<send_to_email>", "<send_to_email>"]
    5. The email server from which you send your email report (email_server = "")
    6. Your ArchivesSpace's staging database credentials, including username (as_dbstag_un = ""), password (as_dbstag_pw = ""), hostname (as_dbstag_host = ""), database name (as_dbstag_database = ""), and port (as_dbstag_port = "")
  4. Run the script as python3 ASpace_Data_Audit.py
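
As a starting point, the secrets.py described in step 3 could look like the template below. The variable names are the ones listed above; every value is an empty or placeholder string that you would replace with your own settings, and the grouping comments are only illustrative.

```python
# secrets.py - template only; every value below is a placeholder to replace.

# ArchivesSpace admin account
as_un = ""
as_pw = ""

# ArchivesSpace API URLs (staging and production)
as_api_stag = ""
as_api = ""

# ArchivesSpace data_auditor account
as_auditor_un = ""
as_auditor_pw = ""

# Email settings for the report and for error notifications
sendfrom_email = "<send_from_email>"
sendto_emails = ["<send_to_email>", "<send_to_email>", "<send_to_email>"]
senderror_emails = ["<send_to_email>", "<send_to_email>"]
email_server = ""

# ArchivesSpace staging database credentials
as_dbstag_un = ""
as_dbstag_pw = ""
as_dbstag_host = ""
as_dbstag_database = ""
as_dbstag_port = ""
```

Because this file holds credentials, it is usually best kept out of version control.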

Script Arguments

Open the console of your choice and navigate to the project directory. Type python3 ASpace_Data_Audit.py to run the script. To run the audit without emailing the results to users, add -t or --test, for example python3 ASpace_Data_Audit.py -t. The testing functionality is still being developed and may not function properly.
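
For readers unfamiliar with how a flag like -t / --test is typically wired up, here is a minimal sketch using Python's standard argparse module. It is not taken from ASpace_Data_Audit.py; the function name build_parser and the help text are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a parser with a single optional -t/--test switch (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Run the ArchivesSpace data audit.")
    parser.add_argument(
        "-t", "--test",
        action="store_true",  # False unless the flag is given on the command line
        help="run the audit without emailing users the result",
    )
    return parser


# Parse an explicit argument list here so the sketch runs anywhere;
# a real script would call parse_args() with no arguments to read sys.argv.
args = build_parser().parse_args(["--test"])
if args.test:
    print("Test mode: the report would not be emailed.")
```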

Testing

A series of unit tests checks various functions in ASpace_Data_Audit.py. They are still being developed, and any test should be run with the -t or --test argument described in Script Arguments.

Workflow

  1. Generate an Excel spreadsheet to use for our report
  2. Begin running the audit. The audit checks for the following:
    1. Any new controlled vocabulary terms in the following lists, highlighting the row in red:
      1. Subject_Term_Type
      2. Subject_Sources
      3. Finding_Aid_Status_Terms
      4. Name_Sources
      5. Instance_Types
      6. Extent_Types
      7. Digital_Object_Types
      8. Container_Types
      9. Accession_Resource_Types
    2. Any archival objects with component unique identifiers
    3. Any top containers without barcodes
    4. Any top containers without indicators
    5. A list of all current users
    6. Any archival objects with multiple top containers
    7. Any archival objects with multiple digital objects
    8. Any archival objects listed as level of description == collection
    9. Any resources with EAD IDs
    10. Any duplicate subjects
    11. Any duplicate agent-persons
    12. Any resources without Creator agents
    13. Any XML syntax errors in exported EAD.xml files
    14. Any broken URLs in EAD.xml exports
    15. Any top containers not linked to any resources or archival objects
    16. Any archival objects with "otherlevel" and "unspecified" level of description
  3. Save the spreadsheet and send an email using email_users(). If an error is generated, send a message to the specified users
  4. Delete the spreadsheet and the exported EAD.xml folder and files from the server, and send an email if there is an error
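
To illustrate the XML syntax check in step 2.13, the sketch below tests whether each exported document is well-formed and collects the parser's error message when it is not. It is not the script's own code: the script uses lxml, while this example swaps in the standard library's xml.etree.ElementTree so it runs without extra dependencies. The function name find_xml_syntax_errors and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET


def find_xml_syntax_errors(exports):
    """Return {filename: error message} for exports that are not well-formed XML.

    `exports` maps a filename to the XML text of that export (hypothetical input shape).
    """
    errors = {}
    for filename, xml_text in exports.items():
        try:
            ET.fromstring(xml_text)  # raises ParseError on malformed XML
        except ET.ParseError as exc:
            errors[filename] = str(exc)  # message includes line and column
    return errors


# Illustrative data only: one well-formed export and one with an unclosed element.
sample_exports = {
    "good_ead.xml": "<ead><eadheader/><archdesc level='collection'/></ead>",
    "bad_ead.xml": "<ead><archdesc></ead>",
}

problems = find_xml_syntax_errors(sample_exports)
for name, message in problems.items():
    print(f"{name}: {message}")
```

Each entry in the returned dictionary could then become one row in the report spreadsheet.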

Author

  • Corey Schmidt - Project Management Librarian/Archivist at the University of Georgia Libraries

Acknowledgements

  • Kevin Cottrell - GALILEO/Library Infrastructure Systems Architect at the University of Georgia Libraries
  • ArchivesSpace Community
