Skip to content

Instantly share code, notes, and snippets.

@rtitle
Last active July 10, 2017 15:42
Show Gist options
  • Save rtitle/8c0086b3b153fb9ffa83639b22335474 to your computer and use it in GitHub Desktop.
Save rtitle/8c0086b3b153fb9ffa83639b22335474 to your computer and use it in GitHub Desktop.
Deadlock tech talk notes

Case Study: Debugging MySQL Deadlocks

Below are some notes on investigating a user-facing deadlock on the WORKFLOW table from 06/2017.

JIRA:

https://broadinstitute.atlassian.net/browse/GAWB-2243

User-facing error:

Sentry errors:

Database info:

mysql> show engine innodb status;

Scroll to

------------------------
LATEST DETECTED DEADLOCK
------------------------

There are two active locks being held:

RECORD LOCKS space id 5325 page no 8792 n bits 240 index PRIMARY of table rawls.WORKFLOW trx id 203110648 lock_mode X locks rec but not gap waiting

RECORD LOCKS space id 5325 page no 7198 n bits 600 index idx_workflow_status of table rawls.WORKFLOW trx id 203110647 lock_mode X locks rec but not gap waiting

Let's look at the actual queries.

Query 1:

update WORKFLOW set status = 'Aborted', status_last_changed = '2017-06-23 12:36:54.825', record_version = record_version + 1 where status = 'Queued' and submission_id = x'7B4889EE44A74A10B7CE11E53AE262E7'

Query 2:

update WORKFLOW set status = 'Launching', status_last_changed = '2017-06-23 12:36:54.825', record_version = record_version + 1 where (id, record_version) in ((478207, 0))

  • Query 1 locked idx_workflow_status = 'Queued' and is waiting to lock the primary key index
  • Query 2 locked the primary key index and is waiting to lock idx_workflow_status = 'Queued'
  • It looks like 1 transaction is trying to abort the workflow and the other is trying to launch it.
  • The lock on idx_workflow_status = 'Queued' is blocking the submission POST endpoint!

Deadlock handling advice:

https://dev.mysql.com/doc/refman/5.7/en/innodb-deadlocks-handling.html

  • At any time, issue the SHOW ENGINE INNODB STATUS command to determine the cause of the most recent deadlock. That can help you to tune your application to avoid deadlocks.
  • Enable innodb_print_all_deadlocks flag. This makes MySQL log all deadlocks instead of just the last detected one.
  • Always be prepared to re-issue a transaction if it fails due to deadlock. Deadlocks are not dangerous. Just try again.
    • Not good if it impacts user experience as was the case with creating submissions
  • Keep transactions small and short in duration to make them less prone to collision.
  • If you use locking reads (SELECT ... FOR UPDATE or SELECT ... LOCK IN SHARE MODE), try using a lower isolation level such as READ COMMITTED.
    • This is not really applicable when doing updates
  • When modifying multiple tables within a transaction, or different sets of rows in the same table, do those operations in a consistent order each time. Then transactions form well-defined queues and do not deadlock. For example, organize database operations into functions within your application, or call stored routines, rather than coding multiple similar sequences of INSERT, UPDATE, and DELETE statements in different places.

Resolution

In this case, resolved in the following way:

  • broadinstitute/rawls#718
  • WorkflowSubmissionActor ignores workflows whose submission is Aborting, Aborted, Done. This removes the conflict between SubmissionMonitorActor and WorkflowSubmissionActor.

Unit test reproduction:

  1. Start WorkflowSubmissionActor with processInterval = 100 milliseconds
  2. Create a submission with 20000 workflows via the API
    • this starts the SubmissionSupervisor/SubmissionMonitorActor for the submission. I reduced the poll interval from 1 minute to 1 second.
    • note: I had to fix some strange issues in the mock Cromwell server (RemoteServicesMockServer) to allow creating this many workflows in a unit test environment
  3. Abort the submission via the API
  4. Goto 2

The deadlock usually appeared before 30 iterations. After the fix, the deadlock never appears. :success:

Other deadlocks:

This one in GoogleGroupSyncMonitorSupervisor appears in sentry pretty often: https://broadinstitute.atlassian.net/browse/GAWB-2256

However it is not affecting user experience, so seems to be less critical.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment