Keeping denormalized tables in sync in the Cassandra read-side


(Andrey Ladniy) #1

I have not found a pattern for keeping denormalized tables in sync in Cassandra. We only have batch storing (with some restrictions on the batch statement list). Would it be useful to add a handler EventStreamElement[E] => Future[Unit] that is called after the setEventHandler handler has been processed? In that handler we could do anything, including updates to tables (in place, or by calling some external jobs).

(Ignasi Marimon-Clos) #2

Hi Andrey,

I’m not sure I understand the problem you describe or the solution. Would you be able to share some specific use case?


(Andrey Ladniy) #3

setEventHandler is more suitable for upserting an entity into a table and preparing some indexes (VideoAddedByUser => insert into videos (pk video_id), insert into videos_by_user (pk user_id)), but the user name is not upserted there and can be updated later. So after the user name is updated we can store it (UserNameUpdated => update table user (pk userId)) and afterwards start some job that updates the user name in the denormalized tables (videos_by_user).
This simple example can be implemented with the BatchStatement list, but the update may be more complex.
The callback could also be used to notify the GUI or other listeners that the updates were stored successfully.

(Andrey Ladniy) #4

In the general case, a callback after a successful write would be useful.
It could be used to keep other parts of the application that depend on the read-side projection in sync (in the same way that the read-side depends on the event journal). A message could be sent via a message broker, and the dependent parts would react to the updates.

(Tim Moore) #5

The API supports returning a future of a list of statements that are executed in the same batch as the offset update, so you can handle denormalization by updating multiple tables in that batch, or by performing other updates in your event handler and then chaining the returned future after those ones.

It wouldn’t make sense to have a callback after the offset is updated, because there is no way to ensure that it is ever handled successfully. If the offset update completes successfully, but then the node fails before the callback is processed, there would be no mechanism to retry it, so it could result in data inconsistency. The entire purpose of offset tracking is that the offset is only updated when the processing completely succeeds.
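
As a rough illustration of the batch approach described in this post, here is a minimal in-memory sketch. It is not Lagom code: `ReadSide`, `Statement`, `handle` and `process` are hypothetical names, and a "statement" is modelled as a pure function on an in-memory state instead of a Cassandra statement. The point it tries to show is that the handler only returns statements, and the offset write is applied together with them, so the denormalized rows and the offset cannot get out of step.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object DenormalizedBatchSketch {
  // Hypothetical events, named after the ones mentioned in this thread.
  sealed trait Event
  final case class VideoAddedByUser(videoId: String, userId: String, title: String) extends Event
  final case class UserNameUpdated(userId: String, name: String) extends Event

  // In-memory stand-in for the read-side tables plus the stored offset.
  final case class ReadSide(
      offset: Long = 0L,
      users: Map[String, String] = Map.empty,                 // user_id -> user name
      videos: Map[String, String] = Map.empty,                // video_id -> title
      videosByUser: Map[(String, String), String] = Map.empty // (user_id, video_id) -> denormalized user name
  )

  // Stand-in for a statement. In this model a statement can read the current
  // state; a real CQL statement cannot, so a real handler would have to look
  // the user name up before building the statement.
  type Statement = ReadSide => ReadSide

  // The event handler performs no writes itself; it only returns statements.
  def handle(event: Event): Future[List[Statement]] = event match {
    case VideoAddedByUser(videoId, userId, title) =>
      Future.successful(List[Statement](
        s => s.copy(videos = s.videos.updated(videoId, title)),
        s => s.copy(videosByUser =
          s.videosByUser.updated((userId, videoId), s.users.getOrElse(userId, "")))
      ))
    case UserNameUpdated(userId, name) =>
      Future.successful(List[Statement](
        s => s.copy(users = s.users.updated(userId, name)),
        // Keep the denormalized copy of the name in sync in the same batch.
        s => s.copy(videosByUser = s.videosByUser.map {
          case ((uid, vid), old) => (uid, vid) -> (if (uid == userId) name else old)
        })
      ))
  }

  // Apply the handler's statements and the offset update as one unit:
  // either the whole new state is returned, or nothing changes.
  def process(state: ReadSide, offset: Long, event: Event): Future[ReadSide] =
    handle(event).map { statements =>
      val storeOffset: Statement = _.copy(offset = offset)
      (statements :+ storeOffset).foldLeft(state)((s, statement) => statement(s))
    }

  def demo(): ReadSide = {
    val events = List(
      UserNameUpdated("u1", "Ann"),
      VideoAddedByUser("v1", "u1", "Intro"),
      UserNameUpdated("u1", "Anna")
    )
    val result = events.zipWithIndex.foldLeft(Future.successful(ReadSide())) {
      case (acc, (event, i)) => acc.flatMap(process(_, i + 1L, event))
    }
    Await.result(result, 5.seconds)
  }
}
```

Calling `DenormalizedBatchSketch.demo()` replays three events, after which the name stored in the `videosByUser` rows matches the latest name in `users`. If an update is too complex for a single batch, the alternative mentioned above still applies: perform the extra writes inside the handler and only complete the returned future once they have succeeded.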

(Andrey Ladniy) #6

I understand the problem now. I watched some courses on Cassandra: the main part of denormalization/duplication can be implemented with materialized views. The remainder is either not needed or can be implemented in the read-side handler with a repair read and a batch write.

I also found one more problem. There is no way to update counter tables with Lagom read-side processors…

(Tim Moore) #7

That’s true. Cassandra counters are kind of problematic. There is no idempotent way to update a counter. It seems that many Cassandra experts recommend avoiding them entirely. See for example.

The only option currently is to update the counter in a separate statement executed before the batch that updates the offset. This means that if the batch that updates the offset fails to complete for some reason, you could end up incrementing the counter more than once: first when the event is initially processed and then again on each retry until the offset update completes successfully.

A callback like you suggested would allow the counter to be updated after the offset, with the drawback that it would be possible for the counter update never to succeed. This is a fundamental choice you have to make: either overcounting or undercounting. There is no way to ensure the increment will happen exactly once.

If you have a need for counters in your read-side model, it might be preferable to use a different data store entirely.
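
The overcounting versus undercounting choice described in this post can be reproduced with a small simulation. This is only a sketch under stated assumptions: `Store`, `Crash`, `counterThenOffset`, `offsetThenCounter` and `run` are hypothetical names, the "database" is two mutable fields, and a node failure is modelled as an exception thrown between the two writes on the first attempt only.

```scala
object CounterOrderingSketch {
  // In-memory stand-in for the stored offset and a non-idempotent counter.
  final class Store {
    var offset = 0L
    var counter = 0L
  }

  // Simulated node failure between the two writes.
  final class Crash extends RuntimeException("simulated crash")

  // Ordering A: increment the counter first, then store the offset.
  def counterThenOffset(store: Store, eventOffset: Long, crashBetween: Boolean): Unit = {
    store.counter += 1
    if (crashBetween) throw new Crash
    store.offset = eventOffset
  }

  // Ordering B: store the offset first, then increment the counter
  // (the "callback after the offset update" variant).
  def offsetThenCounter(store: Store, eventOffset: Long, crashBetween: Boolean): Unit = {
    store.offset = eventOffset
    if (crashBetween) throw new Crash
    store.counter += 1
  }

  // At-least-once delivery of a single event with offset 1: the event is
  // retried for as long as the stored offset is still behind it.
  def run(step: (Store, Long, Boolean) => Unit): Store = {
    val store = new Store
    var firstAttempt = true
    while (store.offset < 1L) {
      try step(store, 1L, firstAttempt)
      catch { case _: Crash => () } // the processor restarts after the crash
      firstAttempt = false
    }
    store
  }
}
```

With one event and one crash, `run(CounterOrderingSketch.counterThenOffset)` ends with the counter at 2, because the retry increments it again, while `run(CounterOrderingSketch.offsetThenCounter)` ends with the counter at 0, because the offset was already stored and the event is never retried.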

(Andrey Ladniy) #8

I tried mixing a Postgres read-side via Slick and a Cassandra read-side in one service. No problem. I understand that a database has to be chosen according to how it will be used; if I need two databases, I must use two databases…