Below are some notes on investigating a user-facing deadlock on the WORKFLOW
table from 06/2017.
https://broadinstitute.atlassian.net/browse/GAWB-2243
- https://files.slack.com/files-pri/T0CMFS7GX-F5XE1MKJN/stack_trace.java
- 500 on submission create
-
Lock wait timeout exceeded; try restarting transaction
- https://sentry.io/broad-institute/firecloud-prod/issues/255118893/events/6011356841/
- Jun 23, 2017 12:36:54 PM UTC
- https://sentry.io/broad-institute/firecloud-prod/issues/255118893/events/6002193538/
- Jun 22, 2017 4:56:37 PM UTC
- https://sentry.io/broad-institute/firecloud-prod/issues/255118893/events/6002184900/
- Jun 22, 2017 4:55:36 PM UTC
- https://sentry.io/broad-institute/firecloud-prod/issues/255118893/events/5991501149/
- Jun 21, 2017 6:22:20 PM UTC
- https://sentry.io/broad-institute/firecloud-prod/issues/255118893/events/5967509154/
- Jun 19, 2017 6:58:51 PM UTC
mysql> show engine innodb status;
Scroll to
------------------------
LATEST DETECTED DEADLOCK
------------------------
There are two active locks being held:
RECORD LOCKS space id 5325 page no 8792 n bits 240 index
PRIMARY
of tablerawls
.WORKFLOW
trx id 203110648 lock_mode X locks rec but not gap waiting
RECORD LOCKS space id 5325 page no 7198 n bits 600 index
idx_workflow_status
of tablerawls
.WORKFLOW
trx id 203110647 lock_mode X locks rec but not gap waiting
Let's look at the actual queries.
Query 1:
update WORKFLOW set status = 'Aborted', status_last_changed = '2017-06-23 12:36:54.825', record_version = record_version + 1 where status = 'Queued' and submission_id = x'7B4889EE44A74A10B7CE11E53AE262E7'
Query 2:
update WORKFLOW set status = 'Launching', status_last_changed = '2017-06-23 12:36:54.825', record_version = record_version + 1 where (id, record_version) in ((478207, 0))
- Query 1 locked idx_workflow_status = 'Queued' and is waiting to lock the primary key index
- Query 2 locked the primary key index and is waiting to lock idx_workflow_status = 'Queued'
- It looks like 1 transaction is trying to abort the workflow and the other is trying to launch it.
- The lock on idx_workflow_status = 'Queued' is blocking the submission POST endpoint!
https://dev.mysql.com/doc/refman/5.7/en/innodb-deadlocks-handling.html
- At any time, issue the
SHOW ENGINE INNODB STATUS
command to determine the cause of the most recent deadlock. That can help you to tune your application to avoid deadlocks. - Enable
innodb_print_all_deadlocks flag
. This makes MySQL log all deadlocks instead of just the last detected one.- Ticket open to do this: https://broadinstitute.atlassian.net/browse/GAWB-2264
- This might not be possible due to the flags that CloudSQL allows
- Always be prepared to re-issue a transaction if it fails due to deadlock. Deadlocks are not dangerous. Just try again.
- Not good if it impacts user experience as was the case with creating submissions
- Keep transactions small and short in duration to make them less prone to collision.
- If you use locking reads (SELECT ... FOR UPDATE or SELECT ... LOCK IN SHARE MODE), try using a lower isolation level such as READ COMMITTED.
- This is not really applicable when doing updates
- When modifying multiple tables within a transaction, or different sets of rows in the same table, do those operations in a consistent order each time. Then transactions form well-defined queues and do not deadlock. For example, organize database operations into functions within your application, or call stored routines, rather than coding multiple similar sequences of INSERT, UPDATE, and DELETE statements in different places.
In this case, resolved in the following way:
- broadinstitute/rawls#718
WorkflowSubmissionActor
ignores workflows whose submission is Aborting, Aborted, Done. This removes the conflict betweenSubmissionMonitorActor
andWorkflowSubmissionActor
.
- Start WorkflowSubmissionActor with processInterval = 100 milliseconds
- Create a submission with 20000 workflows via the API
- this starts the SubmissionSupervisor/SubmissionMonitorActor for the submission. I reduced the poll interval from 1 minute to 1 second.
- note: I had to fix some strange issues in the mock Cromwell server (RemoteServicesMockServer) to allow creating this many workflows in a unit test environment
- Abort the submission via the API
- Goto 2
The deadlock usually appeared before 30 iterations. After the fix, the deadlock never appears. :success:
This one in GoogleGroupSyncMonitorSupervisor
appears in sentry pretty often:
https://broadinstitute.atlassian.net/browse/GAWB-2256
However it is not affecting user experience, so seems to be less critical.