- `BatchingSqlJournal` expects a `Terminated` message to properly dispose of itself, but it never receives one.
  - Related to akkadotnet/Akka.Persistence.SqlServer#114 (comment): "the number of recovering actors are more that 3500!" (strong possible clue)
  - Possible fix: add `Context.Watch(Self);` to the constructor.
  - `Context.Watch(Self)` is called in `AddTagSubscriber` and `AddAllSubscriber`; needs further investigation.
- A combination of `BatchingSqlJournal` and sharding supervision is causing the system to recreate failing actors far too frequently, spamming the system with "CircuitBreaker is failing fast" debug messages.
  - Have not checked the sharding code yet; still a hypothesis.
- `CircuitBreaker.AttemptReset()` never got called.
  - Made a spec using pure `PersistentTestkit` to test a failing `SnapshotStore` database that goes through all the steps of a failing database.
  - Result: `CircuitBreaker` is working properly.
- Something in `Akka.Persistence.SqlServer.BatchingSqlServerJournal` is faulty.
  - Made a "chaos monkey" test console app to manually drop and restart different parts of a SqlServer-backed persistent actor system at run time.
  - Wasted too much time making sure that `Docker.DotNet` reports the proper Docker container and networking statuses (the monitoring part is glitchy as hell, prone to deadlocking).
  - Tried various failure scenarios to see how the persistent actor behaves.
  - Result: the actor seemed to work properly; found no variation of scenario conditions that causes a failure.
- Something in the `BatchingSqlJournal` base class isn't working properly.
  - `BatchingSqlJournal` does not inherit from `AsyncWriteJournal`, so it does not work with `PersistentTestkit`.
  - Tried to create a working `PersistentTestkit` look-alike that would work with `BatchingSqlJournal`, so I can start testing failure scenarios using the cheaper Sqlite database.
  - Work in progress, still not working as intended.
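
For reference, the breaker steps exercised by the `CircuitBreaker` spec in the list above (closed, open and failing fast, then a reset attempt) can be sketched roughly as below. This is a simplified, hypothetical `SketchBreaker`, not Akka.NET's `CircuitBreaker`: all names and parameters here are assumptions for illustration, and the reset is checked lazily on the next call instead of being scheduled, which may differ from how the real breaker triggers `AttemptReset()`.

```csharp
using System;

// Hypothetical, simplified breaker for illustration only; not Akka.NET's implementation.
public enum BreakerState { Closed, Open, HalfOpen }

public sealed class SketchBreaker
{
    private readonly int _maxFailures;
    private readonly TimeSpan _resetTimeout;
    private readonly Func<DateTime> _clock; // injected so the timeout can be driven by a test
    private int _failures;
    private DateTime _openedAt;

    public BreakerState State { get; private set; } = BreakerState.Closed;

    public SketchBreaker(int maxFailures, TimeSpan resetTimeout, Func<DateTime> clock)
    {
        _maxFailures = maxFailures;
        _resetTimeout = resetTimeout;
        _clock = clock;
    }

    public T Call<T>(Func<T> body)
    {
        // Open -> HalfOpen once the reset timeout has elapsed (the "attempt reset" step).
        if (State == BreakerState.Open && _clock() - _openedAt >= _resetTimeout)
            State = BreakerState.HalfOpen;

        // While still open, reject the call without touching the database.
        if (State == BreakerState.Open)
            throw new InvalidOperationException("Breaker is open, failing fast");

        try
        {
            var result = body();
            // Any successful call closes the breaker and clears the failure count.
            _failures = 0;
            State = BreakerState.Closed;
            return result;
        }
        catch (Exception)
        {
            _failures++;
            // A failed trial call in HalfOpen, or too many failures in Closed, opens the breaker.
            if (State == BreakerState.HalfOpen || _failures >= _maxFailures)
            {
                State = BreakerState.Open;
                _openedAt = _clock();
            }
            throw;
        }
    }
}
```

If the reset step never ran, a breaker like this would stay in `Open` forever and every call would fail fast, which is the symptom the hypothesis above was checking for.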

@Arkatufus What we've seen is that, from time to time, some persistent actors seem to stop (due to a persistence failure) but never start again, even though they are entities under sharding or are supervised by a `BackoffSupervisor`. This doesn't seem to happen when we use the `SqlJournal`.

When I looked into it, the only thing that popped up was that the `BatchingSqlJournal` was producing neither the `WriteMessageRejected` nor the `WriteMessageFailure` message when an exception occurred in the `ExecuteChunk` method. Specifically, the `WriteMessageFailure` is pretty important, since this is what makes the `Eventsourced` actor stop the persistent actor in case of a failure, see line #437.

So, I was thinking that maybe fixing this will actually fix the problem I just described, unless there's something else I'm missing.
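
To make the idea concrete, here is a minimal sketch of what "always reply, even when the chunk throws" could look like. It does not use Akka.NET at all: `SketchWriteSuccess`, `SketchWriteFailure`, `PendingWrite` and `ChunkExecutor` are hypothetical stand-ins invented for illustration (the real `WriteMessageFailure` and the real `ExecuteChunk` have different shapes), so treat it as an outline of the behavior, not as the actual fix.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for the journal replies; the real Akka.NET messages differ.
public sealed record SketchWriteSuccess(string PersistenceId, long SequenceNr);
public sealed record SketchWriteFailure(string PersistenceId, long SequenceNr, Exception Cause);

// One queued write plus a callback standing in for "tell the persistent actor".
public sealed record PendingWrite(string PersistenceId, long SequenceNr, Action<object> Reply);

public static class ChunkExecutor
{
    // Runs the database write for the whole chunk and then replies to every requester,
    // so no persistent actor is left waiting when the write throws.
    public static void Execute(
        IReadOnlyList<PendingWrite> chunk,
        Action<IReadOnlyList<PendingWrite>> writeChunk)
    {
        Exception? cause = null;
        try
        {
            writeChunk(chunk);
        }
        catch (Exception e)
        {
            // Remember the failure instead of letting it escape without any reply.
            cause = e;
        }

        foreach (var write in chunk)
        {
            object reply = cause is null
                ? new SketchWriteSuccess(write.PersistenceId, write.SequenceNr)
                : new SketchWriteFailure(write.PersistenceId, write.SequenceNr, cause);
            write.Reply(reply);
        }
    }
}
```

With a reply sent on both paths, the actor on the other end can react to the failure (stop, and then be restarted by its supervisor) instead of hanging indefinitely.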