CassandraSession#selectAll never returns, causing blocked thread

aryno · August 28, 2018, 5:21pm

Hello, we use CassandraSession#selectAll to query our Cassandra cluster for some additional data separate from akka-persistence. Because of some application requirements we block on the result before continuing. We had a case in production where the CompletionStage (we’re using the Java DSL) never completed and caused the entire thread to block indefinitely due to lack of a timeout.

"application-cassandra-plugin-default-dispatcher-7889" #29169 prio=5 os_prio=0 tid=0x00007fae38042000 nid=0x7abe waiting on condition [0x00007fadd8e0f000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x0000000732b05e58> (a java.util.concurrent.CompletableFuture$Signaller)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
	at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
	at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
	at com.twilio.application.ItemDAOCassandraImpl.get(ItemDAOCassandraImpl.java:66)
	... snip ...

Where the code is just:

final Select select = QueryBuilder.select()
        .from(KEYSPACE, TABLE)
        .where(QueryBuilder.eq(ACCOUNT_ID, accountId.getValue()))
        .limit(1);

return cassandraSession
        .selectAll(select)
        .thenApply(rows -> convertToProtocol(rows))
        .toCompletableFuture()
        .get();

The .thenApply() is simply object conversion using GettableByNameData#getString.
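
For reference, a minimal sketch of bounding the wait so that a future that never completes cannot park the calling thread forever. It is not a diagnosis of why selectAll hung. The class name BoundedGet and the helper getWithin are hypothetical, and a bare CompletableFuture that is never completed stands in for the stage returned by selectAll.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class BoundedGet {
    // Hypothetical helper: waits for the result, but gives up after the
    // timeout instead of parking the thread indefinitely like a bare get().
    static <T> T getWithin(CompletableFuture<T> future, long timeout, TimeUnit unit) {
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            throw new IllegalStateException(
                    "query did not complete within " + timeout + " " + unit, e);
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can still observe it.
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("query failed", e.getCause());
        }
    }

    public static void main(String[] args) {
        // Assumption: this never-completed future plays the role of the
        // selectAll(...) result that hung in production.
        CompletableFuture<String> never = new CompletableFuture<>();
        try {
            getWithin(never, 200, TimeUnit.MILLISECONDS);
        } catch (IllegalStateException e) {
            System.out.println("gave up: " + e.getMessage());
        }
    }
}
```

In the DAO above this would replace the final .get() with .get(timeout, unit); the thread is then released even if the underlying cause is still unknown.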

There’s no evidence of any problem with our Cassandra cluster (network, timeouts, errors, etc.), and looking at SelectSource I see nothing obvious in akka-persistence-cassandra that could cause this, though I’m not the most well-versed in Akka Streams.

While we already have plenty of fixes to prevent this from happening again (as well as moving to CassandraSession#selectOne, as that fits our use case better), I’m hoping to understand why selectAll might not have returned, so that we can reproduce it and ensure all of our improvements account for this failure mode.

Using:

akka 2.5.13
akka-persistence-cassandra 0.85
Cassandra 3.0.9

I can give more details if needed, not sure what would be relevant.

johanandren · August 31, 2018, 11:21am

Sounds like it could be a bug to me. I created https://github.com/akka/akka-persistence-cassandra/issues/392 to track it.

patriknw · August 31, 2018, 11:27am

Using .get is an anti-pattern, but I guess you know.

It would be very strange for the future not to complete if the query completed.
Do you do many of these blocking .get calls at the same time?

Do you have a way to reproduce it?
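
A rough sketch of what avoiding the blocking .get could look like: the DAO returns the CompletionStage and lets the caller compose on it. Everything here is a stand-in. NonBlockingDao, its selectAll method and the String rows are hypothetical placeholders for the real CassandraSession call and row conversion.

```java
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;

class NonBlockingDao {
    // Hypothetical stand-in for cassandraSession.selectAll(select):
    // completes asynchronously with the matching rows.
    static CompletionStage<List<String>> selectAll(String accountId) {
        return CompletableFuture.supplyAsync(
                () -> Collections.singletonList("item-for-" + accountId));
    }

    // Returns the stage itself; no thread is parked waiting for the rows.
    static CompletionStage<Optional<String>> get(String accountId) {
        return selectAll(accountId).thenApply(rows -> rows.stream().findFirst());
    }

    public static void main(String[] args) {
        // The caller attaches a continuation; join() is used here only so the
        // demo does not exit before the asynchronous result is printed.
        get("AC123")
                .thenAccept(item -> System.out.println("found: " + item))
                .toCompletableFuture()
                .join();
    }
}
```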

aryno · August 31, 2018, 5:11pm

Yeah, the .get() is definitely something we’re going to remove, but that aside, we still have the strange behavior of the future not completing at all. The .get() was originally there because of the synchronous serializer interface, which caused other failures that were solved by the introduction of AsyncSerializer in 2.5.13.

After some more digging, we determined there would have been roughly 40-50 reads across 3 application hosts (6 Cassandra hosts) to the same Cassandra table for the same row, all within a few milliseconds (no writes).

We also see no obvious evidence of the query completing, as we don’t have tracing enabled for these queries. I have been attempting to reproduce it but unfortunately haven’t been able to. That’s partially why I posted here instead of directly on GitHub, as we’re still trying to reproduce it or find out whether the cause is in akka-persistence or Cassandra. The reproduction steps so far have been to run load tests against the endpoint and to introduce latency, packet loss, or anything else that might interrupt the stream, but so far akka has worked around those failures as expected. Also, the fact that we run 3 application hosts and 6 Cassandra hosts, spread evenly across 3 AZs, makes me think it’s less likely to be a networking or data integrity problem and more likely a deadlock within akka-persistence-cassandra or the DataStax driver (unsure which).

patriknw · September 1, 2018, 7:21am

If you had many of these gets at the same time, each one blocking a thread, it could be a starvation problem. If all threads in the dispatcher thread pool are blocked by the gets, the other async parts of the queries have no thread to proceed on, resulting in a deadlock.
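
The starvation scenario can be sketched in plain Java. This is a simplified model and not the actual dispatcher setup: it assumes a fixed-size thread pool (the stack trace above suggests a ForkJoinPool-based dispatcher, which behaves differently under managed blocking), and StarvationDemo and callersFinish are hypothetical names. Each caller blocks on a future whose work is scheduled on the same pool.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class StarvationDemo {
    // Returns true if every blocking caller finished within waitMillis,
    // false if the callers were still stuck when the wait expired.
    static boolean callersFinish(int poolSize, int callers, long waitMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        // Make the callers occupy their threads before any async work is queued.
        CountDownLatch allStarted = new CountDownLatch(Math.min(poolSize, callers));
        CountDownLatch done = new CountDownLatch(callers);
        try {
            for (int i = 0; i < callers; i++) {
                pool.submit(() -> {
                    allStarted.countDown();
                    allStarted.await();
                    // The "async part of the query" needs a thread from the same pool.
                    CompletableFuture<String> row =
                            CompletableFuture.supplyAsync(() -> "row", pool);
                    row.get(); // blocks this pool thread until the row arrives
                    done.countDown();
                    return null;
                });
            }
            return done.await(waitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            // Interrupts any callers still parked in get() so the JVM can exit.
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Every thread is held by a blocked caller: the async work never runs.
        System.out.println("pool=2 callers=2 finished: " + callersFinish(2, 2, 500));
        // Spare threads remain to run the async work: the callers complete.
        System.out.println("pool=4 callers=2 finished: " + callersFinish(4, 2, 500));
    }
}
```

In this model the callers only hang when they hold every thread in the pool, which would be consistent with a failure that shows up only under a burst of simultaneous reads.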
