Report from ansible/awx#12176 and elsewhere is traceback ending with
django.db.utils.OperationalError: out of shared memory HINT: You might need to increase max_locks_per_transaction.
We are trying to reproduce that.
We know that postgres default for max_locks_per_transaction
is 64.
Because of that, our attempts here try to create an inventory with references to over 64 partitions.
Normally partitions are done in hourly bins. That presumably means that users hit this by deleteing inventory data (either via an import or the delete_inventory
dispatcher task) when the inventory referenced >64 hourly partitions.
We do not have 64 hours to create our reproducer, so this diff changes the partitioning logic. It changes it from hourly bins to secondly bins.
diff --git a/awx/main/utils/common.py b/awx/main/utils/common.py
index 19394247b3..1ff7c6188a 100644
--- a/awx/main/utils/common.py
+++ b/awx/main/utils/common.py
@@ -1118,13 +1118,15 @@ def create_partition(tblname, start=None):
if start is None:
start = now()
- start = start.replace(microsecond=0, second=0, minute=0)
- end = start + timedelta(hours=1)
+ # start = start.replace(microsecond=0, second=0, minute=0) # round down to whole hour
+ start = start.replace(microsecond=0) # round down to whole second
+ # end = start + timedelta(hours=1)
+ end = start + timedelta(seconds=1)
start_timestamp = str(start)
end_timestamp = str(end)
- partition_label = start.strftime('%Y%m%d_%H')
+ partition_label = start.strftime('%Y%m%d_%H%M%S')
try:
with transaction.atomic():
This is done with tower-qa / awxkit scripting
def test_alan(self, factories):
inventory = factories.inventory()
for i in range(25): # 25
inventory.add_host()
jt = factories.job_template(inventory=inventory)
for i in range(90): # 66
jt.launch()
time.sleep(1)
We run that, and afterwards we have to wait for all the jobs in the system to finish.
In a separate tab, watch the number of active locks
docker exec -it tools_postgres_1 /bin/bash
su - postgres
while true; do psql -d awx -U awx -c "SELECT count(*) FROM pg_locks WHERE mode='AccessShareLock';"; done
This will make the screen constantly scroll with the number of locks used.
Now, delete the inventory created by the script in the UI.
You can modify the exact numbers of things by adjusting the awxkit script above.
inventory id = 43
In this try, I launched 66
jobs against the inventory. We did not confirm the number of jobs / partitions before deleting it, unfortunately. The watcher found a max number of 61
active transactions. The inventory was successfully deleted
inventory id = 46
I increased the number of jobs to 90
.
Go into shell_plus to confirm the number of jobs.
inv = Inventory.objects.order_by('-created').first() # most recently created
h = inv.hosts.first()
h.job_events_as_primary_host.count()
I found that a host has 180
events linked to it. Every job produces 2 events for each host, so this implies 90 jobs, as intended.
In this case the database crashed, but not due to the expected reason
tools_postgres_1 | 2022-05-13 15:28:35.350 UTC [1] LOG: server process (PID 48603) was terminated by signal 2: Interrupt
tools_postgres_1 | 2022-05-13 15:28:35.350 UTC [1] LOG: terminating any other active server processes
tools_postgres_1 | 2022-05-13 15:28:35.350 UTC [45636] WARNING: terminating connection because of crash of another server process
tools_postgres_1 | 2022-05-13 15:28:35.350 UTC [45636] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
tools_postgres_1 | 2022-05-13 15:28:35.350 UTC [45636] HINT: In a moment you should be able to reconnect to the database and repeat your command.
The inventory was ultimately deleted.
I failed to get a value for the locks other than 1, so it was not useful data.
We can change the
max_locks_per_transactions
in the container with:10 is the minimum number. If you go less the DB container will fail to start and the AWX container will throw errors that it can't connect to the DB.
Once the container is running you can validate the number of connection has changed with: