frsyuki/Digdag task scan improvements.md

## Digdag task scan improvements.md

      
## Performance degradation due to too many active tasks

When we ran a stress-testing tool against a Digdag server, the server stopped running tasks. The cause was that the `propagateAllPlannedToDone` and `propagateBlockedChildrenToReady` methods of the `io.digdag.core.workflow.WorkflowExecutor` class became too slow when there were too many active tasks.
Here is the scenario:


1. Many workflows submit many tasks.

2. Eventually, a large number of tasks are in the PLANNED or BLOCKED state.

3. `propagateAllPlannedToDone` takes all PLANNED tasks and checks whether all of their children are finished. If all of them are finished, the method changes the state of that parent task from PLANNED to a done state (ERROR, GROUP_ERROR, CANCELED, or SUCCESS). If some children are not finished yet, the method does nothing and scans the same number of tasks again later.

4. `propagateBlockedChildrenToReady` takes all parents of BLOCKED tasks and checks the status of the siblings that each BLOCKED task depends on. If all of those siblings are SUCCESS, the method changes the state of the BLOCKED task to READY. If some of those siblings are still running, the method does nothing and scans the same number of tasks again later.

5. If a workflow doesn't set `_parallel: true` (which is the majority of use cases), only one of its tasks is running at a time. Across all workflows, only a few percent of tasks are running. This means that `propagateAllPlannedToDone` and `propagateBlockedChildrenToReady` can change the state of only a few tasks at a time, even if there is a huge number of BLOCKED tasks.

6. Therefore, both methods scan a huge number of tasks on every pass, but the number of PLANNED or BLOCKED tasks doesn't decrease quickly.

7. The Digdag server spends a large amount of time just re-scanning task status.

8. The Digdag server needs to change task status from READY to RUNNING to start tasks. But because the two methods consume most of the time, this step doesn't run smoothly (note: the Digdag server has only one thread to check and change task status). As a result, tasks don't start promptly even though they are ready to start.

9. Thus, `propagateAllPlannedToDone` and `propagateBlockedChildrenToReady` keep scanning a huge number of tasks forever.
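The cost described in this scenario can be illustrated with a toy model. The sketch below is not Digdag code: the class `FullScanCostSketch`, its `simulate` method, and the simplified state set are all hypothetical, and the READY step is folded into RUNNING to keep it short. It assumes every workflow is a chain of sequential tasks and that each pass examines every BLOCKED task but can promote at most one task per workflow.

```java
import java.util.Arrays;

// Toy model (hypothetical, not Digdag code): each workflow is a chain of
// sequential tasks, and every pass re-examines all BLOCKED tasks.
final class FullScanCostSketch {
    enum State { BLOCKED, RUNNING, SUCCESS }

    /** Totals collected by the simulation. */
    static final class Result {
        final long passes;       // number of full-scan passes executed
        final long rowsScanned;  // BLOCKED tasks examined across all passes
        final long promoted;     // tasks that actually changed state

        Result(long passes, long rowsScanned, long promoted) {
            this.passes = passes;
            this.rowsScanned = rowsScanned;
            this.promoted = promoted;
        }
    }

    static Result simulate(int workflows, int tasksPerWorkflow) {
        if (workflows < 0 || tasksPerWorkflow < 1) {
            throw new IllegalArgumentException("need workflows >= 0 and tasksPerWorkflow >= 1");
        }
        State[][] tasks = new State[workflows][tasksPerWorkflow];
        for (State[] chain : tasks) {
            Arrays.fill(chain, State.BLOCKED);
            chain[0] = State.RUNNING; // only the first task of each chain runs
        }

        long passes = 0, rowsScanned = 0, promoted = 0;
        boolean anyBlocked = workflows > 0 && tasksPerWorkflow > 1;
        while (anyBlocked) {
            // Between two passes, whatever was running finishes.
            for (State[] chain : tasks) {
                for (int i = 0; i < chain.length; i++) {
                    if (chain[i] == State.RUNNING) chain[i] = State.SUCCESS;
                }
            }
            // One pass: look at every BLOCKED task, promote the few that can start.
            passes++;
            anyBlocked = false;
            for (State[] chain : tasks) {
                for (int i = 1; i < chain.length; i++) {
                    if (chain[i] != State.BLOCKED) continue;
                    rowsScanned++;
                    if (chain[i - 1] == State.SUCCESS) {
                        chain[i] = State.RUNNING;
                        promoted++;
                    } else {
                        anyBlocked = true;
                    }
                }
            }
        }
        return new Result(passes, rowsScanned, promoted);
    }
}
```

In this model, `simulate(100, 10)` examines 4,500 BLOCKED rows to promote 900 tasks, and the scanned count grows quadratically with the chain length while the promoted count grows only linearly.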


## Solutions

This problem happens because `propagateAllPlannedToDone` and `propagateBlockedChildrenToReady` scan all PLANNED and BLOCKED tasks even if they scanned the same tasks just before. If they scan tasks only when something has changed, the amount of scanning will drop drastically.
### False solution

One idea was to add an `updated_at` column so that each scan checks only the tasks affected by recently updated tasks. However, I found this approach fragile because it needs to manage two race conditions carefully:

- Check all statuses when a server restarts.
- Check potentially affected statuses since the last scan.

This needs duplicated implementations. Some code from this attempt still remains in the codebase (and should be removed), but I gave up on this approach.
### Status propagation queue

Another idea is to create a new table in PostgreSQL that records, transactionally, which tasks need status propagation.


1. When a task finishes (ERROR, GROUP_ERROR, CANCELED, or SUCCESS), it adds its parent task's id to the new table in the same transaction (ignored if the id already exists).

2. Check the rows in the new table periodically. For each id:

   1. Check its children.
      - If there are BLOCKED tasks that can start now, change their status to READY. This is similar to `propagateBlockedChildrenToReady` but more efficient.
      - If all children are finished, change the state of their parent task. This is similar to `propagateAllPlannedToDone` but more efficient.
   2. Delete the row regardless of the results. If there are new changes, the id will be added to the new table again.


This way, Digdag doesn't have to re-check the status of tasks when no task has changed its status.


Steps 1 and 2 must block each other using a transaction.
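The queue idea can be sketched in memory as follows. This is only an illustration under stated assumptions, not the Digdag implementation: the class `PropagationQueueSketch` and all of its members are hypothetical, a `LinkedHashSet` stands in for the new PostgreSQL table, `synchronized` stands in for the transaction that makes steps 1 and 2 block each other, and the choice of SUCCESS or GROUP_ERROR for a finished parent is a simplification of the real done states.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory model of the status propagation queue.
final class PropagationQueueSketch {
    enum State {
        BLOCKED, READY, PLANNED, SUCCESS, ERROR, GROUP_ERROR;
        boolean isDone() { return this == SUCCESS || this == ERROR || this == GROUP_ERROR; }
    }

    static final class Task {
        final long id;
        final Long parentId;           // null for a root task
        final List<Long> upstreamIds;  // siblings that must be SUCCESS first
        State state;

        Task(long id, Long parentId, State state, Long... upstreamIds) {
            this.id = id;
            this.parentId = parentId;
            this.state = state;
            this.upstreamIds = Arrays.asList(upstreamIds);
        }
    }

    private final Map<Long, Task> tasks = new HashMap<>();
    private final Map<Long, List<Task>> childrenOf = new HashMap<>();
    private final Set<Long> queue = new LinkedHashSet<>(); // stands in for the new table

    void add(Task t) {
        tasks.put(t.id, t);
        if (t.parentId != null) {
            childrenOf.computeIfAbsent(t.parentId, k -> new ArrayList<>()).add(t);
        }
    }

    State stateOf(long id) { return tasks.get(id).state; }
    int queueSize() { return queue.size(); }

    // Step 1: a task finishes and enqueues its parent id (a set ignores duplicates).
    synchronized void finish(long taskId, State doneState) {
        Task t = tasks.get(taskId);
        t.state = doneState;
        if (t.parentId != null) queue.add(t.parentId);
    }

    // Step 2: visit only the queued parents instead of scanning every task.
    synchronized void propagate() {
        List<Long> pending = new ArrayList<>(queue);
        queue.clear(); // delete the rows regardless of the results
        for (long parentId : pending) {
            List<Task> children = childrenOf.getOrDefault(parentId, Collections.emptyList());
            boolean allDone = true, allSuccess = true;
            for (Task c : children) {
                if (c.state == State.BLOCKED && upstreamSucceeded(c)) c.state = State.READY;
                allDone &= c.state.isDone();
                allSuccess &= c.state == State.SUCCESS;
            }
            Task parent = tasks.get(parentId);
            if (allDone && parent.state == State.PLANNED) {
                // Finishing the parent enqueues the grandparent for the next pass.
                finish(parentId, allSuccess ? State.SUCCESS : State.GROUP_ERROR);
            }
        }
    }

    private boolean upstreamSucceeded(Task t) {
        for (Long id : t.upstreamIds) {
            if (tasks.get(id).state != State.SUCCESS) return false;
        }
        return true;
    }
}
```

When no task has finished since the last pass, the queue is empty and `propagate()` touches no tasks at all, which is the property the full-scan methods lack.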