When we ran a stress testing tool against a Digdag server, the server stopped running tasks. The cause was that the `propagateAllPlannedToDone` and `propagateBlockedChildrenToReady` methods of the `io.digdag.core.workflow.WorkflowExecutor` class become too slow when there are too many active tasks.
Here is the scenario:
- Many workflows submit many tasks.
- Eventually, there are a lot of tasks in the PLANNED or BLOCKED state.