Problem: InnoDB deadlock/timeout on fpr_fprule table during normalization with multiple MCPClients #752
I have encountered similar problems with Archivematica 1.10 on CentOS running multiple MCP Clients. I must say that for me the threshold is lower than 5000 JPEG files, but I am running 6 clients divided over 2 hosts, so maybe that changes the observed behaviour a bit. Are there any ideas about the root cause of this problem, or where it might be found? A secondary question concerns the impact of this bug: processing of the transfer seems to continue, so is it merely my mailbox getting bombarded with failure reports, or are there more serious consequences?
Today I ran into this problem again: a 1205 timeout, and the normalization fails completely. This means that no normalized files end up in the saved AIP.
I'm still testing, but early indications are that the problem is due to the fpr_fprule counter updates in transcoder.py. In our case, the pairs of normalization rules for different format versions (e.g., PRONOM JPEG fmt/43 and fmt/44) are executed in different concurrent MCP client batches. Because the fpr_fprule row counters are updated within the batch transaction, the two transactions deadlock while each attempts to update the same pair of rows. Eventually one of the transactions times out, producing the 1205 error.
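To make the failure mode concrete, here is a minimal sketch of the read-modify-write pattern involved; the field names follow the FPR FPRule model as I understand it, but treat the code as illustrative rather than the actual transcoder.py source:

```python
def record_rule_outcome(fprule, exit_code):
    # Read-modify-write on an FPRule instance: the UPDATE's row lock is
    # held until the enclosing batch transaction commits. Two MCPClients
    # touching the same pair of rules (e.g. fmt/43 and fmt/44) in
    # opposite order can each wait on the other's lock, producing a
    # deadlock (1213) or a lock wait timeout (1205).
    fprule.count_attempts += 1
    if exit_code == 0:
        fprule.count_okay += 1
    else:
        fprule.count_not_okay += 1
    fprule.save()
```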
The relevant source lines in transcoder.py are here. During testing, I commented out the count updates and the fprule.save(), and that seems to have resolved the problem. I'm not sure what a proper, permanent fix would look like, though.
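One possible direction (a sketch under assumptions, not a confirmed fix) would be to replace the read-modify-write with a single UPDATE via Django's F() expressions, so each increment is one atomic statement that never depends on a stale in-memory value. Note that this alone does not shorten how long the row lock is held if the update still runs inside the batch-wide transaction.atomic block:

```python
from django.db.models import F

from fpr.models import FPRule  # assumed import path


def record_attempt(rule_uuid, success):
    # Single UPDATE statement; no SELECT-then-save race on the counters.
    outcome_field = "count_okay" if success else "count_not_okay"
    FPRule.objects.filter(uuid=rule_uuid).update(
        count_attempts=F("count_attempts") + 1,
        **{outcome_field: F(outcome_field) + 1},
    )
```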
@uofmsean thanks for the report, it's super useful. Another reason this might be happening is that we keep a database transaction open for the whole batch (see the use of transaction.atomic in normalize.py). That is far from ideal when multiple MCPClient instances are trying to write to the database. See a different manifestation here: #1111 (comment). In antivirus_clamscan, we worked around that by delaying writes. Related: #313
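For reference, the delayed-write idea could look roughly like this: accumulate counter deltas in memory while the batch runs, then flush them in one short transaction after the long transaction.atomic block has committed. This is a sketch of the general pattern only, not the actual antivirus_clamscan code:

```python
from collections import Counter

from django.db import transaction
from django.db.models import F

from fpr.models import FPRule  # assumed import path

# (rule_uuid, outcome_field) -> occurrences seen in this batch
pending = Counter()


def note_outcome(rule_uuid, success):
    field = "count_okay" if success else "count_not_okay"
    pending[(rule_uuid, field)] += 1


def flush_counts():
    # One short transaction, issued after the batch transaction has
    # committed, so fpr_fprule row locks are held only briefly.
    with transaction.atomic():
        for (rule_uuid, field), n in pending.items():
            FPRule.objects.filter(uuid=rule_uuid).update(
                count_attempts=F("count_attempts") + n,
                **{field: F(field) + n},
            )
    pending.clear()
```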
I see this problem when running two MCPClients on a single server with Archivematica 1.9.2 on Ubuntu 18.04 LTS. If I take the batch size down to 4 rather than 128, I see this even when a transfer has just a few tens of files.
It might be good to link @mjaddis's comment in #1161 here.
Expected behaviour
For CentOS/RHEL 7, a second MCPClient can be started by creating a new systemd unit file as described in Deploy Multiple MCPClients. With multiple clients, batches can be processed concurrently.
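A hedged example of such a unit file, adapted from the stock archivematica-mcp-client unit; the unit name, user, and paths here are assumptions and should be checked against your deployment:

```ini
# /etc/systemd/system/archivematica-mcp-client-2.service
# Hypothetical second-instance unit; verify paths against your install.
[Unit]
Description=Archivematica MCP Client (second instance)
After=network.target

[Service]
User=archivematica
Group=archivematica
ExecStart=/usr/share/archivematica/virtualenvs/archivematica-mcp-client/bin/python \
    /usr/lib/archivematica/MCPClient/archivematicaClient.py
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

After creating the file, run `systemctl daemon-reload` and then `systemctl enable --now archivematica-mcp-client-2`.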
Current behaviour
During the Normalize microservice, the Normalize for access and Normalize for preservation jobs can encounter an InnoDB transaction deadlock (error 1213) or lock wait timeout (error 1205) when the fpr_fprule table is updated. (See the InnoDB engine status examples of detected deadlocks below.) This causes the rest of the batch to fail, with the other tasks in the batch reporting the error "An error occurred in the current transaction. You can't execute queries until the end of the 'atomic' block." At the end, the transfer is reported as a failure, with JPEG failure counts that are a multiple of the batch size.
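The deadlock details can be captured on the database host shortly after a failure (presumably how the status dumps attached below were produced):

```sql
-- Prints the LATEST DETECTED DEADLOCK section, including the two
-- transactions involved and the fpr_fprule row locks they were
-- waiting on.
SHOW ENGINE INNODB STATUS\G
```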
normalize-for-access-innodb-deadlock-1.txt
normalize-for-preservation-innodb-deadlock-1.txt
Steps to reproduce
Deploy Archivematica 1.9.1 and Storage Service 0.14.1 on a single CentOS/RHEL 7 host
Start two MCPClient processes (see Deploy Multiple MCPClients)
Set the MCPServer batch size to 50
Use a MariaDB 5.5 or MySQL 5.7.x database on a separate host
Perform a Standard Transfer/Ingest with a large transfer source containing > 5000 JPEG files (a fixture sketch follows this list)
Normalize for preservation and access
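A quick way to build such a transfer source is to replicate a single seed JPEG; the directory below is a placeholder for your configured transfer source location:

```bash
# Hypothetical fixture generator: 5001 copies of one seed JPEG,
# enough to exceed the reported ~5000-file threshold.
mkdir -p /path/to/transfer-source/jpeg-stress
for i in $(seq 1 5001); do
    cp seed.jpg "/path/to/transfer-source/jpeg-stress/img-${i}.jpg"
done
```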
Your environment (version of Archivematica, OS version, etc.)
Red Hat Enterprise Linux Server release 7.6 (Maipo)
Archivematica 1.9.1
Storage Service 0.14.1
gearmand 1.1.12 (release 18.el7)
Elasticsearch 6.8.0
MariaDB 5.5 or MySQL 5.7.26 Enterprise
Two Storage Service Spaces: Local disk / (default) for processing and /local/am-storage on NFS for AIP/DIP storage
4 CPU cores
16GB memory