CassandraSession#selectAll never returns, causing blocked thread


(Andrew Ryno) #1

Hello, we use CassandraSession#selectAll to query our Cassandra cluster for some additional data separate from akka-persistence. Because of some application requirements we block on the result before continuing. We had a case in production where the CompletionStage (we’re using the Java DSL) never completed and caused the entire thread to block indefinitely due to lack of a timeout.

"application-cassandra-plugin-default-dispatcher-7889" #29169 prio=5 os_prio=0 tid=0x00007fae38042000 nid=0x7abe waiting on condition [0x00007fadd8e0f000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x0000000732b05e58> (a java.util.concurrent.CompletableFuture$Signaller)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
	at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
	at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
	at com.twilio.application.ItemDAOCassandraImpl.get(ItemDAOCassandraImpl.java:66)
	... snip ...

Where the code is just:

final Select select = QueryBuilder.select()
        .from(KEYSPACE, TABLE)
        .where(QueryBuilder.eq(ACCOUNT_ID, accountId.getValue()))
        .limit(1);

return cassandraSession
        .selectAll(select)
        .thenApply(rows -> convertToProtocol(rows))
        .toCompletableFuture()
        .get();

The .thenApply() is simply object conversion using GettableByNameData#getString.
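
For illustration, here is a minimal sketch of putting an upper bound on that wait. It uses only the JDK's `CompletableFuture#get(long, TimeUnit)`; the class name `BoundedGet` and the helper `getWithin` are hypothetical and not part of akka-persistence-cassandra. A timeout only frees the calling thread. It does not explain or cancel the underlying query.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: blocks on a CompletionStage, but never for longer than the given timeout.
public final class BoundedGet {
    private BoundedGet() {}

    static <T> T getWithin(CompletionStage<T> stage, long timeoutMillis)
            throws InterruptedException, ExecutionException, TimeoutException {
        // get(timeout, unit) throws TimeoutException instead of parking the thread forever.
        return stage.toCompletableFuture().get(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a selectAll(...) result that never completes.
        CompletableFuture<String> neverCompletes = new CompletableFuture<>();
        try {
            getWithin(neverCompletes, 200);
        } catch (TimeoutException e) {
            System.out.println("query did not complete in time");
        }
    }
}
```

In the snippet above, the trailing `.toCompletableFuture().get()` would become a call such as `getWithin(stage, timeoutMillis)`, with the timeout value chosen by the application.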

There’s no evidence that there was any problem with our Cassandra cluster (network, timeouts, errors, etc.). Looking at the SelectSource, I also see nothing obvious in akka-persistence-cassandra that could cause this, though I’m not the most well-versed in akka streams.

While we already have plenty of fixes to prevent this from happening again (as well as moving to CassandraSession#selectOne, as that fits our use-case better), I’m hoping to understand why selectAll may not have returned, so that we can reproduce it and ensure all of our fixes account for this failure mode.

Using:

  • akka 2.5.13
  • akka-persistence-cassandra 0.85
  • Cassandra 3.0.9

I can give more details if needed, not sure what would be relevant.


(Johan Andrén) #2

Sounds like it could maybe be a bug to me, I created https://github.com/akka/akka-persistence-cassandra/issues/392 to track it.


(Patrik Nordwall) #3

Using .get is an anti-pattern, but I guess you know.

It would be very strange for the future not to complete if the query itself completed.
Do you run many of these blocking .get calls at the same time?

Do you have a way to reproduce it?


(Andrew Ryno) #4

Yeah, the .get() is definitely something we’re going to be removing, but that aside, we still have the strange behavior of the future not completing at all. The .get() was originally there due to the synchronous interface of serializers, which caused other failures; that was solved by the introduction of the AsyncSerializer in 2.5.13.

After some more digging we determined there would have been around 40-50 reads across 3 application hosts (6 Cassandra hosts) to the same Cassandra table for the same row, all within a few milliseconds (no writes).

We also see no obvious evidence of the query completing, as we don’t have tracing enabled for these queries. I have been attempting to reproduce it but unfortunately haven’t been able to. That’s partially why I posted here instead of directly on GitHub, as we’re still attempting to reproduce it or find out whether the cause is in akka-persistence or Cassandra. Current reproduction steps have just been to run load tests against the endpoint and to introduce latency, packet loss, or anything else that might interrupt the stream, but so far akka has worked around those failures as expected. Also, the fact that we run 3 application hosts and 6 Cassandra hosts, spread evenly across 3 AZs, makes me think it’s less likely to be a networking or data integrity problem and more likely a deadlock within akka-persistence-cassandra or the DataStax driver (unsure which).


(Patrik Nordwall) #5

If you had many of these gets at the same time, each one blocking a thread, it could be a starvation problem. If all threads in the dispatcher thread pool are blocked by the gets, the other async parts of the queries have no thread to run on and cannot proceed, resulting in a deadlock.
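
The starvation scenario described above can be reproduced in isolation with plain JDK classes. The following is a sketch under assumptions, not a confirmed diagnosis of the original incident: it uses a fixed-size `ExecutorService` as a stand-in for the dispatcher, and `StarvationDemo` and `allComplete` are hypothetical names. Each "caller" task blocks on an inner task that needs a thread from the same pool, so once every pool thread is occupied by a blocked caller, no inner task can ever run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public final class StarvationDemo {
    private StarvationDemo() {}

    // Returns true if every caller finished within timeoutMillis, false if they were still blocked.
    static boolean allComplete(int poolSize, int callers, long timeoutMillis) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        // Wait until as many callers as can run concurrently have started,
        // so the outcome does not depend on submission timing.
        CountDownLatch started = new CountDownLatch(Math.min(poolSize, callers));
        try {
            List<CompletableFuture<String>> outers = new ArrayList<>();
            for (int i = 0; i < callers; i++) {
                outers.add(CompletableFuture.supplyAsync(() -> {
                    try {
                        started.countDown();
                        started.await();
                        // The inner "query" needs a thread from the same pool...
                        CompletableFuture<String> inner =
                                CompletableFuture.supplyAsync(() -> "row", pool);
                        // ...while this thread is held by a blocking get().
                        return inner.get();
                    } catch (InterruptedException | ExecutionException e) {
                        throw new IllegalStateException(e);
                    }
                }, pool));
            }
            CompletableFuture<Void> all =
                    CompletableFuture.allOf(outers.toArray(new CompletableFuture[0]));
            try {
                all.get(timeoutMillis, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                return false;
            }
        } finally {
            pool.shutdownNow(); // interrupts callers that are still blocked in get()
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("2 threads, 2 callers: " + allComplete(2, 2, 500));
        System.out.println("4 threads, 2 callers: " + allComplete(4, 2, 2000));
    }
}
```

With as many blocked callers as pool threads, the inner tasks stay queued and nothing completes. With spare threads, the same code finishes normally, which would also explain why the problem only shows up under a burst of concurrent reads.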