Problem: InnoDB deadlock/timeout on fpr_fprule table during normalization with multiple MCPClients #752

Open
uofmsean opened this issue Jun 18, 2019 · 7 comments
Labels
Type: bug A flaw in the code that causes the software to produce an incorrect or unexpected result.

Comments

@uofmsean

Expected behaviour
For CentOS/RHEL 7, starting a second MCPClient can be accomplished by creating a new systemd unit file as described in Deploy Multiple MCPClients. With multiple clients, batches can be processed concurrently.

Current behaviour
During the Normalize microservice, the Normalize for access or Normalize for preservation jobs can encounter an InnoDB transaction deadlock (error 1213) or lock wait timeout (error 1205) when the fpr_fprule table is updated. (See the attached InnoDB engine status examples of detected deadlocks below.) This causes the rest of the batch to fail, with the other tasks in the batch reporting the error "An error occurred in the current transaction. You can't execute queries until the end of the 'atomic' block." The transfer is then reported as failed, with the JPEG failure counts being a multiple of the batch size.
normalize-for-access-innodb-deadlock-1.txt
normalize-for-preservation-innodb-deadlock-1.txt

Steps to reproduce
Deploy Archivematica 1.9.1 and Storage Service 0.14.1 on a single CentOS/RHEL 7 host
Start two MCPClient processes (see Deploy Multiple MCPClients)
MCPServer batch size 50
MariaDB 5.5 or MySQL 5.7.x database on a separate host
Perform a Standard Transfer/Ingest with a large transfer source containing > 5000 JPEG files
Normalize for preservation and access

Your environment (version of Archivematica, OS version, etc)
Red Hat Enterprise Linux Server release 7.6 (Maipo)
Archivematica 1.9.1
Storage Service 0.14.1
gearmand 1.1.12 (release 18.el7)
Elasticsearch 6.8.0
MariaDB 5.5, or MySQL 5.7.26 Enterprise
Two Storage Service Spaces: Local disk / (default) for processing and /local/am-storage on NFS for AIP/DIP storage
4 CPU cores
16GB memory


For Artefactual use:
Please make sure these steps are taken before moving this issue from Review to Verified in Waffle:

  • All PRs related to this issue are properly linked 👍
  • All PRs related to this issue have been merged 👍
  • Test plan for this issue has been implemented and passed 👍
  • Documentation regarding this issue has been written and it has been added to the release notes, if needed 👍
@sromkey sromkey added the Type: bug A flaw in the code that causes the software to produce an incorrect or unexpected result. label Jun 18, 2019
@jorikvankemenade

I have encountered similar problems with Archivematica 1.10 on CentOS running multiple MCP Clients. I must say that for me the threshold is lower than 5000 JPEG files. But I am running 6 clients divided over 2 hosts, so maybe that changes the observed behaviour a bit.

Are there any ideas about the root cause of this problem, or where it can be found? A secondary question is the impact of this bug. It seems that processing of the transfer continues, so is it merely my mailbox getting bombarded with failure reports, or are there more serious problems?

@jorikvankemenade

Today I ran into this problem again, a 1205 timeout, and the normalization completely fails. This means that no normalized files end up in the saved AIP.

@uofmsean
Author

uofmsean commented Mar 2, 2020

I'm still testing, but early indications are that the problem is due to the fpr_fprule counter updates in transcoder.py. In our case, the pair of normalization rules for different format versions (e.g., PRONOM JPEG fmt/43 and fmt/44) is executed in different concurrent MCP client batches. Since the fpr_fprule row counters are updated within each batch's transaction, the two transactions deadlock when each tries to update the same pair of rows in a different order. Eventually one of the transactions times out, leading to the 1205 error.
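
To make the failure mode concrete, here is a minimal sketch of the counter-update pattern described above, assuming Django ORM access to the FPR models. It is not the actual transcoder.py code; the FPRule field names (count_attempts, count_okay) and the batch loop are illustrative assumptions.

```python
# Minimal sketch, assuming Django ORM access to the FPR models; the field
# names (count_attempts, count_okay) and this batch loop are illustrative
# assumptions, not a copy of transcoder.py.
from django.db import transaction
from fpr.models import FPRule


def run_batch(tasks):
    # MCPClient processes the whole batch inside one transaction, so every
    # row lock acquired below stays held until the final commit.
    with transaction.atomic():
        for task in tasks:
            rule = FPRule.objects.get(uuid=task.fprule_uuid)
            rule.count_attempts += 1
            rule.count_okay += 1  # incremented on success
            rule.save()           # UPDATE fpr_fprule ... -> takes a row lock
            # If client A locks the fmt/43 rule row and then waits for the
            # fmt/44 row while client B holds fmt/44 and waits for fmt/43,
            # InnoDB reports error 1213 (deadlock) or 1205 (lock wait
            # timeout), matching the engine status attached above.
```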

@uofmsean
Author

uofmsean commented Mar 6, 2020

The relevant source lines in transcoder.py are here. During testing, I commented out the count updates and the fprule.save() call, and that seems to have resolved the problem. I'm not sure what a proper, permanent fix would look like, though.

@sevein
Contributor

sevein commented Mar 12, 2020

@uofmsean thanks for the report, it's super useful. Another reason why this might be happening could be that we keep a db transaction open for the whole batch (see the use of transaction.atomic in normalize.py). Far from ideal when you have multiple MCPClient instances trying to write to the database. See a different manifestation here: #1111 (comment). In antivirus_clamscan, we worked around that by delaying writes.

Related: #313
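
For context, a rough sketch of the "delayed writes" idea mentioned above, modelled on the description rather than copied from antivirus_clamscan: accumulate the per-rule counter deltas while the batch runs, then apply them in one short transaction after the batch-wide atomic block has committed. The model and field names are assumptions for illustration.

```python
# Rough sketch of deferring the fpr_fprule counter writes until after the
# batch transaction commits; model/field names are assumptions for
# illustration, not the actual Archivematica implementation.
from collections import Counter

from django.db import transaction
from django.db.models import F
from fpr.models import FPRule


def flush_rule_counts(deltas):
    """deltas: Counter mapping an FPRule uuid to its successful-use count."""
    with transaction.atomic():
        for rule_uuid, n in deltas.items():
            # A single UPDATE per rule; the row lock is released as soon as
            # this short transaction commits, instead of being held for the
            # whole batch.
            FPRule.objects.filter(uuid=rule_uuid).update(
                count_attempts=F("count_attempts") + n,
                count_okay=F("count_okay") + n,
            )


# Usage: collect deltas inside the batch, flush them once the batch is done.
deltas = Counter()
# ... inside the batch loop: deltas[task.fprule_uuid] += 1
flush_rule_counts(deltas)
```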

@mjaddis

mjaddis commented Apr 2, 2020

I see this problem when running two MCPClients on a single server with Archivematica 1.9.2 on Ubuntu 18.04 LTS. If I take the batch size down to 4 from 128, I see it even when a transfer has just a few tens of files.

@jorikvankemenade

Maybe good to link @mjaddis' comment in #1161 here:

I commented out the code that updates the FPR stats: https://github.com/artefactual/archivematica/blob/b4dab1d01ea3978dd6c3919f251bab58e394ec77/src/MCPClient/lib/clientScripts/transcoder.py#L153-L159 and have had no issues since then, whether normalise.py is run in parallel by fork_runner or via multiple MCP Clients on the same machine. I did a test using various transfer sizes (100, 300, 1,000, 3,000, and 10,000 files) with a small batch size (16) and two MCP Clients. Normalisation was successful in all cases.
