Lengthy database calls in akka-persistence-r2dbc

I’m running an event-sourced system with ~300 million entities, and I’m persisting events using the R2DBC plugin. In the log, I’m getting frequent messages that indicate unexpectedly long DB operations, e.g.

insert [Variant|1_d91439d5261ba186ddc0ddf2e3351291_9502fb2b89cb53c3ef24aed51ca09994] - getConnection took [464194] µs
insert [Variant|1_cff35e4413aceb4de82fe113f755451c_b42d35106898db1d2ecd262649c69f23] - DB call completed [2023-08-07T22:02:15.866209Z] in [1213377] µs
select eventsByPersistenceId [Parent|1_35bd1a5323e49022a8cca07e8e21e1f2] - Selected [2] rows in [1310022] µs
select highest seqNr [Parent|1_b2b064e73a2b764b895b5718a93745d1] - Selected [1] rows in [780619] µs
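(For context: these lines come from the plugin's slow-call logging, which, if I'm reading the reference config right, is controlled by a threshold along these lines:)

```hocon
akka.persistence.r2dbc {
  # log DB calls that take longer than this (set to "off" to disable)
  log-db-calls-exceeding = 300 ms
}
```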

The high latency doesn’t seem to be related to the DB server: p99 latency there is below 1ms.

So this has to be client-side, right? What exactly is happening? Is it that the (already completed) Futures carrying the results from the DB aren't being scheduled onto a thread soon enough?
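That would also explain the numbers: if I understand the measurement right, the reported duration is taken from the caller's side, so any time the completion callback spends queued behind other work on a saturated dispatcher gets counted as "DB time". A minimal sketch of the effect (illustrative only, not my actual code):

```scala
import java.util.concurrent.Executors
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}

object SchedulingDelayDemo extends App {
  // A deliberately tiny pool, standing in for a saturated dispatcher.
  implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(1))

  // Occupy the only thread, the way a pegged CPU would.
  Future(Thread.sleep(500))

  val start = System.nanoTime()
  // The "DB call" is already complete, but the callback that observes the
  // result still has to wait for a free thread, so the measured duration
  // is almost all queueing time.
  val done = Future.successful(42).map { v =>
    println(s"result observed after [${(System.nanoTime() - start) / 1000}] µs")
    v
  }
  Await.ready(done, 5.seconds)
}
```

Here the "DB" answers instantly, but the observed latency is ~500,000 µs of pure queueing.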

I’m running the Starvation Detector on all of the relevant dispatchers, and I do get the occasional warning, e.g.

Exceedingly long scheduling time on ExecutionContext Dispatcher[command-hub.dispatchers.persistence]. Starvation detector task was only executed after 499 ms which is longer than the warning threshold of 100 milliseconds. This might be a sign of thread, CPU, or dispatcher starvation. This can be caused by blocking calls, CPU overload, GC, or overlong work units.
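(For reference, the detector is configured close to the defaults; from the akka-diagnostics reference config, if I remember the keys right:)

```hocon
akka.diagnostics.starvation-detector {
  check-interval = 1s
  initial-delay = 10s
  # the threshold mentioned in the warning above
  max-delay-warning-threshold = 100 ms
  warning-interval = 10s
}
```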

Among those four causes, this looks like CPU overload, because I've been pretty careful to avoid the other three. In situations like these, is the solution either to a) throw more compute resources at the problem, or b) decrease parallelism to a level the current compute resources can handle?
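For (b), I assume that would mean capping the dispatcher the plugin runs on, something like this (the dispatcher name is from my config; the sizes are purely illustrative):

```hocon
command-hub.dispatchers.persistence {
  type = Dispatcher
  executor = "thread-pool-executor"
  thread-pool-executor.fixed-pool-size = 8
  # hand the thread back after every message so no single actor hogs it
  throughput = 1
}
```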

I guess I’m using this post as a sounding board for my own conclusion. But also, maybe I’m missing something or am flat-out wrong. Any insight is appreciated.

Sounds like it could be starvation; look at CPU (and maybe network) metrics to verify whether you are saturating the machine.

Other things worth looking into are connection pool size and possibly also dispatcher sizing. And even if you think you have been careful not to block, give that an extra look; third-party libraries can block in surprising places.
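For the pool, the knobs are along these lines (these should be the defaults, if I remember them correctly):

```hocon
akka.persistence.r2dbc.connection-factory {
  initial-size = 5
  max-size = 20
}
```

If the pool is much smaller than the number of in-flight writes, getConnection will queue, which could show up exactly as in the first log line you posted.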

Hello Johan. CPU usage is pegged at 90+%, so I'm definitely saturating the machine. I guess I just wanted confirmation that those akka-persistence warnings can show up as the result of a saturated CPU.

Sounds likely that is the cause.