@Arkatufus
Last active August 4, 2020 05:56
Batching SQL persistence bug report

Hunches

  • BatchingSqlJournal expects a Terminated message to properly dispose of itself, but it never receives any.
    • Related to akkadotnet/Akka.Persistence.SqlServer#114 (comment) "the number of recovering actors are more that 3500!" (strong possible clue)
    • Possible fix: add Context.Watch(Self); to the constructor
    • Context.Watch(Self) is called in AddTagSubscriber and AddAllSubscriber; needs further investigation.
  • A combination of BatchingSqlJournal and sharding supervision is causing the system to recreate failing actors far too frequently, spamming the system with "CircuitBreaker is failing fast" debug messages.
    • Have not checked the sharding code yet, still a hypothesis.

Tested hypotheses

  • CircuitBreaker.AttemptReset() never got called.

    • Wrote a spec using plain PersistentTestkit that takes a SnapshotStore backed by a failing database through every step of a database failure.
    • Result: CircuitBreaker is working properly.
  • Something in Akka.Persistence.SqlServer.BatchingSqlServerJournal is faulty.

    • Made a "chaos monkey" test console app to manually drop and restart different parts of a SqlServer-backed persistent actor system at runtime.
    • Wasted too much time making sure that Docker.DotNet reports the proper Docker container and networking statuses (the monitoring part is glitchy as hell and prone to deadlocking).
    • Tried various failure scenarios to see how the persistent actor behaves.
    • Result: The actor seemed to work properly; found no combination of scenario conditions that causes a failure.
  • Something in BatchingSqlJournal base class isn't working properly.

    • BatchingSqlJournal does not inherit from AsyncWriteJournal, so it does not work with PersistentTestkit.
    • Tried to create a PersistentTestkit lookalike that works with BatchingSqlJournal, so that failure scenarios can be tested against the cheaper Sqlite database.
    • Work in progress, still not working as intended

ismaelhamed commented Aug 3, 2020

@Arkatufus What we've seen is that, from time to time, some persistent actors seem to stop (due to a persistence failure) but never start again, even though they are sharded entities or are supervised by a BackoffSupervisor. This doesn't seem to happen when we use the SqlJournal.

When I looked into it, the only thing that stood out was that the BatchingSqlJournal was producing neither the WriteMessageRejected nor the WriteMessageFailure message when an exception occurred in the ExecuteChunk method. WriteMessageFailure in particular is important, since it is what makes Eventsourced stop the persistent actor in case of a failure; see line #437.

So, I was thinking that maybe fixing this will actually fix the problem I just described, unless there's something else I'm missing.
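
A minimal sketch of the kind of reply described above, assuming the Akka.Persistence NuGet package is referenced. The class and method names (JournalFailureReplies, ReplyWithFailures) are hypothetical and are not part of Akka.NET or of BatchingSqlJournal; how such a call would be wired into ExecuteChunk is not shown here and would need to be checked against the actual source. It only illustrates sending one WriteMessageFailure per persistent message back to the persistent actor that issued the write.

```csharp
using System;
using System.Collections.Immutable;
using Akka.Actor;
using Akka.Persistence;

// Hypothetical helper for illustration only; not part of Akka.NET.
public static class JournalFailureReplies
{
    // Tells the requesting persistent actor that every message in the
    // given write request failed with the supplied exception.
    public static void ReplyWithFailures(WriteMessages request, Exception cause)
    {
        foreach (var envelope in request.Messages)
        {
            // Only AtomicWrite envelopes carry persistent representations.
            if (envelope is AtomicWrite write)
            {
                var payload = (IImmutableList<IPersistentRepresentation>)write.Payload;
                foreach (var persistent in payload)
                {
                    // One failure reply per message, tagged with the instance id
                    // of the persistent actor that sent the request.
                    request.PersistentActor.Tell(
                        new WriteMessageFailure(persistent, cause, request.ActorInstanceId),
                        persistent.Sender);
                }
            }
        }
    }
}
```

This sketch leaves out any batch-level notification and any distinction between rejected and failed writes (WriteMessageRejected versus WriteMessageFailure); a real fix would have to decide which of the two applies to each exception.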
