AlanCoding/Notification_trigger_fix.md

## Notification_trigger_fix.md

      
    Raw
  

              Notification_trigger_fix.md
            
          
    Description of the current system for triggering notifications

We have 3 types of notifications - started, successful, failed. We should always send the started notification and either the success / failed notification for all jobs (and all job types). The started notification is simple, but we have problems consistently sending the success / failed (the final) notification.
The final notifications are sent when a particular event from a job is processed by the callback receiver. That event is:

the playbook_on_stats event ➟ for jobs / project updates / ad hoc commands or
the EOF event ➟ for inventory updates / system jobs

I will refer to this event as the wrapup event for clarity.
As the wrapup event is processed, a dispatcher task is submitted (so it will run in a different process). That task polls the job's status field until it is no longer running.
If the status field never gets out of running after 5 seconds (or a configured value), then it gives up and doesn't send any notifications.
The immediate issue is that we want it to never "give up".
Problems of current system and solutions

For the immediate issue, we remove the extra task that polls the job's status.
We use the same 2 types of triggers for initiating notifications sending. Call these:

event A - job's final status is saved (successful / failed /error)

changes status field (obviously)


event B - the wrapup event is processed

changes host_status_counts field


In both of these 2 triggers, the code is modified to use a row lock to perform an update.
While holding the lock, it checks the field that the other process changes,
this check reveals whether the other event has finished.
The row lock avoids race conditions where this check returns a false negative.
If the code of event A is running, and event B has not ran, then notifications are not sent.
Likewise, if the code of event B is running, and event A has not ran, then notifications are not sent.
If either event happens, and finds that its counterpart has ran, then it sends notifications.
This is painfully technical logic to assure that the last of the 2 events sends notifications.
In the big picture, this is the entire point.

A number of other minor complaints are given in ansible/awx#8771
I will give some of those, and add a few more.
Non-overlapping criteria

We have a special criteria for sending notifications if the job had an "error" status.
This assumes that a job with "error" status did not send the wrapup event into redis,
and vice versa, that a job without an "error" status did send the wrapup event.
This correlation is usually true, but it's not always true.
This avoids the corner-cases by using a different kind of criteria.
We track whether the runner_callback object has dispatched the wrapup event.
If it never dispatched the event (to redis) then we know that notifications must be sent
by event A - the status change.
On_commit not used correctly

Before, if the wrapup event was playbook_on_stats, it sent notifications from _update_from_event_data.
Historically, there is a well-known trap where websockets can "outrun" other saves.
The UI will get the websocket, make a request, and get new info before the updated values make it to the database.
This could happen with the tasks that send notifications, for instance.
To avoid this, the task was ran on_commit, but due to flawed reasons.
If not in a transaction (the callback receiver does not run this in a transaction), the on_commit code runs right away.
But the greater risk is the way that the callback receiver buffers events - it saves them in a queue for up to .2 seconds commonly.
This code is ran when the event is put into the queue, not when the queue is flushed.
So this PR tries to consolidate the triggers for the wrapup events, partially to clean things up,
but also to assure that we consistently run after callback receiver flushes.