@AlanCoding
Created May 13, 2022 15:34

Reports from ansible/awx#12176 and elsewhere show a traceback ending with:

django.db.utils.OperationalError: out of shared memory HINT: You might need to increase max_locks_per_transaction.

We are trying to reproduce that.

Steps to reproduce

We know that the postgres default for max_locks_per_transaction is 64.

Because of that, our attempts here create an inventory whose data references more than 64 partitions.
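
Note that postgres sizes a single shared lock table at max_locks_per_transaction * (max_connections + max_prepared_transactions) slots, so 64 is an average budget per transaction rather than a hard cap; one transaction grabbing locks on many partitions can exhaust the shared table. To confirm the setting on a given install, something like this works from any Django shell against the AWX database (e.g. awx-manage shell_plus):

from django.db import connection

# Ask the server for its configured per-transaction lock budget.
with connection.cursor() as cursor:
    cursor.execute("SHOW max_locks_per_transaction;")
    print(cursor.fetchone()[0])  # expect '64' on a stock install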

Modify AWX

Normally partitions are created in hourly bins. That presumably means users hit this by deleting inventory data (either via an import or the delete_inventory dispatcher task) when the inventory referenced more than 64 hourly partitions.

We do not have 64 hours to spend on a reproducer, so the diff below changes the partitioning logic from hourly bins to one-second bins.

diff --git a/awx/main/utils/common.py b/awx/main/utils/common.py
index 19394247b3..1ff7c6188a 100644
--- a/awx/main/utils/common.py
+++ b/awx/main/utils/common.py
@@ -1118,13 +1118,15 @@ def create_partition(tblname, start=None):
     if start is None:
         start = now()
 
-    start = start.replace(microsecond=0, second=0, minute=0)
-    end = start + timedelta(hours=1)
+    # start = start.replace(microsecond=0, second=0, minute=0)  # round down to whole hour
+    start = start.replace(microsecond=0)  # round down to whole second
+    # end = start + timedelta(hours=1)
+    end = start + timedelta(seconds=1)
 
     start_timestamp = str(start)
     end_timestamp = str(end)
 
-    partition_label = start.strftime('%Y%m%d_%H')
+    partition_label = start.strftime('%Y%m%d_%H%M%S')
 
     try:
         with transaction.atomic():
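
With this patch applied, each launched job should land in its own one-second partition. To sanity-check how many job event partitions exist (a rough sketch; this assumes partitions are named with a main_jobevent_ prefix followed by the partition_label above):

from django.db import connection

# Count per-second partitions of the job events table (assumed naming scheme).
with connection.cursor() as cursor:
    cursor.execute("SELECT count(*) FROM pg_tables WHERE tablename LIKE 'main_jobevent_%';")
    print(cursor.fetchone()[0])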

Set up data

This is done with tower-qa / awxkit scripting (an excerpt from a test class; time is imported at module scope):

    def test_alan(self, factories):
        inventory = factories.inventory()
        for _ in range(25):  # hosts
            inventory.add_host()
        jt = factories.job_template(inventory=inventory)
        for _ in range(90):  # jobs; 66 in the first try below, 90 in the second
            jt.launch()
            time.sleep(1)  # space out launches so each job gets its own one-second partition

We run that, and afterwards we have to wait for all the jobs in the system to finish.
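
Rather than watching the UI, one way to wait is from awx-manage shell_plus (a sketch; UnifiedJob in awx.main.models is the base model for all job types):

import time
from awx.main.models import UnifiedJob

# Block until nothing is pending, waiting, or running.
while UnifiedJob.objects.filter(status__in=('pending', 'waiting', 'running')).exists():
    time.sleep(5)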

Run

In a separate tab, watch the number of active locks:

docker exec -it tools_postgres_1 /bin/bash
su - postgres
while true; do psql -d awx -U awx -c "SELECT count(*) FROM pg_locks WHERE mode='AccessShareLock';"; done

This will make the screen constantly scroll with the number of locks used.

Now, delete the inventory created by the script via the UI.
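
Alternatively, if the awxkit session from the setup script is still around, the page object can issue the delete itself (a hedged sketch, reusing the inventory fixture from above):

# Same effect as the UI delete: a DELETE against /api/v2/inventories/<id>/.
inventory.delete()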

Experiment history

You can modify the exact numbers of things by adjusting the awxkit script above.

First try

inventory id = 43

In this try, I launched 66 jobs against the inventory. Unfortunately, we did not confirm the number of jobs / partitions before deleting it. The watcher observed a maximum of 61 active locks, just under the default budget of 64, which is consistent with the delete succeeding. The inventory was successfully deleted.

Second try

inventory id = 46

I increased the number of jobs to 90.

Go into shell_plus to confirm the number of jobs:

inv = Inventory.objects.order_by('-created').first()  # most recently created
h = inv.hosts.first()
h.job_events_as_primary_host.count()

I found that the host has 180 events linked to it. Every job produces two events per host, so 180 events implies 90 jobs, as intended.
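
Rather than inferring the job count from the two-events-per-host rule, the distinct jobs can be counted directly (a hypothetical follow-up to the shell_plus snippet above):

# Count distinct jobs among this host's events; expect 90.
h.job_events_as_primary_host.values('job_id').distinct().count()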

In this case the database crashed, but not for the expected reason; a server process was terminated by signal 2 (SIGINT) rather than by an out-of-shared-memory error:

tools_postgres_1 | 2022-05-13 15:28:35.350 UTC [1] LOG:  server process (PID 48603) was terminated by signal 2: Interrupt
tools_postgres_1 | 2022-05-13 15:28:35.350 UTC [1] LOG:  terminating any other active server processes
tools_postgres_1 | 2022-05-13 15:28:35.350 UTC [45636] WARNING:  terminating connection because of crash of another server process
tools_postgres_1 | 2022-05-13 15:28:35.350 UTC [45636] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
tools_postgres_1 | 2022-05-13 15:28:35.350 UTC [45636] HINT:  In a moment you should be able to reconnect to the database and repeat your command.

The inventory was ultimately deleted.

The watcher never reported a lock count other than 1, so it did not yield useful data.
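
One suspicion is that the once-per-loop psql watcher simply missed a brief spike. A tighter poll that keeps a running maximum might catch it next time (a sketch, assuming psycopg2 is available; the credentials and host are placeholders to adjust for your environment):

import time
import psycopg2

# Poll pg_locks rapidly and remember the highest count observed.
conn = psycopg2.connect(dbname='awx', user='awx', password='CHANGEME', host='localhost')  # hypothetical credentials
conn.autocommit = True
peak = 0
with conn.cursor() as cur:
    while True:
        cur.execute("SELECT count(*) FROM pg_locks WHERE mode = 'AccessShareLock';")
        peak = max(peak, cur.fetchone()[0])
        print(f'peak so far: {peak}', end='\r')
        time.sleep(0.05)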

@john-westcott-iv

We can change max_locks_per_transaction in the container with:

diff --git a/tools/docker-compose/ansible/roles/sources/templates/docker-compose.yml.j2 b/tools/docker-compose/ansible/roles/sources/templates/docker-compose.yml.j2
index eadfe0b976..037ec89473 100644
--- a/tools/docker-compose/ansible/roles/sources/templates/docker-compose.yml.j2
+++ b/tools/docker-compose/ansible/roles/sources/templates/docker-compose.yml.j2
@@ -181,6 +181,7 @@ services:
       POSTGRES_USER: {{ pg_username }}
       POSTGRES_DB: {{ pg_database }}
       POSTGRES_PASSWORD: {{ pg_password }}
+    command: postgres -c max_locks_per_transaction=10
     volumes:
       - "awx_db:/var/lib/postgresql/data"
 {% if execution_node_count|int > 0 %}

10 is the minimum value. If you go lower, the DB container will fail to start and the AWX container will throw errors that it cannot connect to the DB.

Once the container is running, you can validate that the setting has changed with:

docker exec -t tools_postgres_1 psql -d awx -U awx -c "SHOW max_locks_per_transaction;"
 max_locks_per_transaction 
---------------------------
 10
(1 row)
