Problem: InnoDB deadlock/timeout on fpr_fprule table during normalization with multiple MCPClients #752

Open
uofmsean opened this issue Jun 18, 2019 · 7 comments
Labels
Type: bug A flaw in the code that causes the software to produce an incorrect or unexpected result.

Comments

@uofmsean

Expected behaviour
For CentOS/RHEL 7, starting a second MCPClient can be accomplished by creating a new systemd unit file as described in Deploy Multiple MCPClients. With multiple clients, batches can be processed concurrently.

Current behaviour
During the Normalize microservice, the Normalize for access or Normalize for preservation jobs can encounter an InnoDB transaction deadlock (error 1213) or lock wait timeout (error 1205) when the fpr_fprule table is updated. (See the attached InnoDB engine status examples of detected deadlocks below.) This causes the rest of the batch to fail, with the other tasks in the batch reporting the error "An error occurred in the current transaction. You can't execute queries until the end of the 'atomic' block." The transfer is then reported as failed, with the JPEG failure counts being a multiple of the batch size.
normalize-for-access-innodb-deadlock-1.txt
normalize-for-preservation-innodb-deadlock-1.txt

Steps to reproduce
Deploy Archivematica 1.9.1 and Storage Service 0.14.1 on a single CentOS/RHEL 7 host
Start two MCPClient processes (see Deploy Multiple MCPClients)
MCPServer batch size 50
MariaDB 5.5 or MySQL 5.7.x database on a separate host
Perform a Standard Transfer/Ingest with a large transfer source containing > 5000 JPEG files
Normalize for preservation and access

Your environment (version of Archivematica, OS version, etc)
Red Hat Enterprise Linux Server release 7.6 (Maipo)
Archivematica 1.9.1
Storage Service 0.14.1
gearmand 1.1.12 (release 18.el7)
Elasticsearch 6.8.0
MariaDB 5.5, or MySQL 5.7.26 Enterprise
Two Storage Service Spaces: Local disk / (default) for processing and /local/am-storage on NFS for AIP/DIP storage
4 CPU cores
16GB memory


For Artefactual use:
Please make sure these steps are taken before moving this issue from Review to Verified in Waffle:

  • All PRs related to this issue are properly linked 👍
  • All PRs related to this issue have been merged 👍
  • Test plan for this issue has been implemented and passed 👍
  • Documentation regarding this issue has been written and it has been added to the release notes, if needed 👍
@sromkey sromkey added the Type: bug A flaw in the code that causes the software to produce an incorrect or unexpected result. label Jun 18, 2019
@jorikvankemenade

I have encountered similar problems with Archivematica 1.10 on CentOS running multiple MCP Clients. I must say that for me the threshold is lower than 5000 JPEG files. But I am running 6 clients divided over 2 hosts, so maybe that changes the observed behaviour a bit.

Are there any ideas about the root cause of this problem, or where it can be found? A secondary question is the impact of this bug. It seems that processing of the transfer continues, so is it merely my mailbox getting bombarded with failure reports, or are there more serious problems?

@jorikvankemenade

Today I ran into this problem again, a 1205 timeout, and the normalization completely fails. This means that no normalized files end up in the saved AIP.

@uofmsean
Author

uofmsean commented Mar 2, 2020

I'm still testing, but early indications are that the problem is due to the fpr_fprule counter updates in transcoder.py. In our case, the pair of normalization rules for different format versions (e.g., PRONOM JPEG fmt/43 and fmt/44) is executed in different concurrent MCP client batches. Since the fpr_fprule row counters are updated within each batch's transaction, the two transactions deadlock when each tries to update the same pair of rows in a different order. Eventually one of the transactions times out, leading to the 1205 error.
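
To make the failure mode concrete, here is a minimal sketch of the counter-update pattern described above, assuming Django ORM access to the FPR models. It is not the actual transcoder.py code; the FPRule field names (count_attempts, count_okay) and the batch loop are illustrative assumptions.

```python
# Minimal sketch, assuming Django ORM access to the FPR models; the field
# names (count_attempts, count_okay) and this batch loop are illustrative
# assumptions, not a copy of transcoder.py.
from django.db import transaction
from fpr.models import FPRule


def run_batch(tasks):
    # MCPClient processes the whole batch inside one transaction, so every
    # row lock acquired below stays held until the final commit.
    with transaction.atomic():
        for task in tasks:
            rule = FPRule.objects.get(uuid=task.fprule_uuid)
            rule.count_attempts += 1
            rule.count_okay += 1  # incremented on success
            rule.save()           # UPDATE fpr_fprule ... -> takes a row lock
            # If client A locks the fmt/43 rule row and then waits for the
            # fmt/44 row while client B holds fmt/44 and waits for fmt/43,
            # InnoDB reports error 1213 (deadlock) or 1205 (lock wait
            # timeout), matching the engine status attached above.
```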

@uofmsean
Author

uofmsean commented Mar 6, 2020

The relevant source lines in transcoder.py are here. During testing, I commented out the count updates and the fprule.save() call, and that seems to have resolved the problem. I'm not sure what a proper, permanent fix would look like, though.

@sevein
Contributor

sevein commented Mar 12, 2020

@uofmsean thanks for the report, it's super useful. Another reason why this might be happening could be that we keep a db transaction open for the whole batch (see the use of transaction.atomic in normalize.py). Far from ideal when you have multiple MCPClient instances trying to write to the database. See a different manifestation here: #1111 (comment). In antivirus_clamscan, we worked around that by delaying writes.

Related: #313
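
For context, a rough sketch of the "delayed writes" idea mentioned above, modelled on the description rather than copied from antivirus_clamscan: accumulate the per-rule counter deltas while the batch runs, then apply them in one short transaction after the batch-wide atomic block has committed. The model and field names are assumptions for illustration.

```python
# Rough sketch of deferring the fpr_fprule counter writes until after the
# batch transaction commits; model/field names are assumptions for
# illustration, not the actual Archivematica implementation.
from collections import Counter

from django.db import transaction
from django.db.models import F
from fpr.models import FPRule


def flush_rule_counts(deltas):
    """deltas: Counter mapping an FPRule uuid to its successful-use count."""
    with transaction.atomic():
        for rule_uuid, n in deltas.items():
            # A single UPDATE per rule; the row lock is released as soon as
            # this short transaction commits, instead of being held for the
            # whole batch.
            FPRule.objects.filter(uuid=rule_uuid).update(
                count_attempts=F("count_attempts") + n,
                count_okay=F("count_okay") + n,
            )


# Usage: collect deltas inside the batch, flush them once the batch is done.
deltas = Counter()
# ... inside the batch loop: deltas[task.fprule_uuid] += 1
flush_rule_counts(deltas)
```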

@mjaddis

mjaddis commented Apr 2, 2020

I see this problem when running two MCPClients on a single server with Archivematica 1.9.2 on Ubuntu 18.04 LTS. If I take the batch size down to 4 from 128, I see it even when a transfer has just a few tens of files.

@jorikvankemenade

Maybe good to link @mjaddis' comment in #1161 here:

I commented out the code that updates the FPR stats: https://github.com/artefactual/archivematica/blob/b4dab1d01ea3978dd6c3919f251bab58e394ec77/src/MCPClient/lib/clientScripts/transcoder.py#L153-L159 and have had no issues since then, whether normalise.py is run in parallel by fork_runner or via multiple MCP Clients on the same machine. I did a test using various transfer sizes (100, 300, 1,000, 3,000, and 10,000 files) with a small batch size (16) and two MCP Clients. Normalisation was successful in all cases.
