Read-side latency and responsiveness


(Marc-Antoine Nüssli) #1

Hello,

I’m dealing with the issue of read-side latency and UI responsiveness (already briefly discussed in a previous topic). To achieve the latter, we need a quickly updated read-side view, even if it is not fully consistent. However, this seems difficult to achieve in a CQRS architecture, as we cannot avoid a latency of several seconds between write commands and read-side updates. At least in Lagom, from what I have read and tried, it seems difficult to go below a 2-second latency even by tuning the parameters.

Possible solutions
In the context of Lagom, I see mainly two solutions:

  1. Use the PersistentEntity state as a read-side model
    … and, for example, put it into the command reply to provide the client with the updated view/state. It has several limitations (single-entity view and a single kind of view) and it somewhat breaks the CQRS pattern by using the write-model as a read-model.

  2. Implement some client-specific predictive view update
    For example, when receiving a command reply, use it (along with the sent command) to update the current client view, or alternatively propagate events in near real-time to clients through pubsub and use them to update the client view. This solution is quite general and efficient. However, it requires writing situation-specific client code to update the specific client views after each command. In effect, we need to duplicate part of the view-updating code in the client and take care to do it consistently on both sides. More generally, it requires having part of the read-side view management logic on the client side, while it should preferably be managed entirely on the server side (at least in my opinion).

I have used both solutions in different situations and, while they work well, I am clearly not satisfied with them because of the drawbacks mentioned above (lack of generality and anti-pattern for 1, view management code duplication for 2). So, if someone has other ideas about how to approach this problem, I would be very glad to hear about them.

Towards a more generic solution
On my side, I am thinking about how to implement a generic version of solution 2. The overall idea would be to have a read-side that listens not only to events but also to command replies in order to update its content. Hence, we would have some sort of (non-blocking) command proxy that forwards any valid command reply to the view so that it can be updated optimistically. Then, when the view receives the actual events, it can validate (or invalidate) the updates made from command replies.
In practice, one approach would be to maintain two copies of the view state (consistent and latest) as well as a list of applied (command, reply) pairs. Every time a new event arrives, it is applied to the consistent view (as in a usual read-side). Then, the event is compared with the first element of the applied-commands list. If they correspond, the element is simply removed from the list; if they don’t, the latest view is recomputed by re-applying the list of applied commands to the consistent view (after filtering out the one corresponding to the current event).
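
The bookkeeping described above can be sketched as follows. This is a minimal, language-neutral sketch in Python, not Lagom code: `OptimisticView`, `on_reply`, `on_event` and the `correlation_id` used to match an event with a (command, reply) pair are all hypothetical names, and the sketch assumes that a reply and its corresponding event produce the same view update.

```python
from dataclasses import dataclass, field
from typing import Callable, Generic, List, Tuple, TypeVar

S = TypeVar("S")  # view state
U = TypeVar("U")  # view update (derived from an event or a command reply)


@dataclass
class OptimisticView(Generic[S, U]):
    apply: Callable[[S, U], S]   # pure function: (state, update) -> new state
    consistent: S                # state built from events only
    latest: S                    # consistent state plus optimistic updates
    pending: List[Tuple[str, U]] = field(default_factory=list)

    def on_reply(self, correlation_id: str, update: U) -> None:
        """Optimistically apply an update derived from a (command, reply) pair."""
        self.pending.append((correlation_id, update))
        self.latest = self.apply(self.latest, update)

    def on_event(self, correlation_id: str, update: U) -> None:
        """Apply a confirmed event and reconcile the optimistic state."""
        self.consistent = self.apply(self.consistent, update)
        if self.pending and self.pending[0][0] == correlation_id:
            # The prediction was right and in order: nothing to recompute.
            self.pending.pop(0)
            return
        # Unexpected event: drop its pending entry (if any) and rebuild
        # the latest state from the consistent one.
        self.pending = [p for p in self.pending if p[0] != correlation_id]
        latest = self.consistent
        for _, pending_update in self.pending:
            latest = self.apply(latest, pending_update)
        self.latest = latest
```

For example, with an integer counter as the view and `apply=lambda s, u: s + u`, a reply for `"c1"` moves `latest` ahead of `consistent`, and the matching event later brings `consistent` up to the same value without any recomputation.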

In general terms, the goal is to have the most up-to-date view besides the eventually consistent view, taking care of not introducing an overhead on the command processing (which could potentially still be used without going through the “reactive view proxy”)

Basically, there would be two requirements:

  1. There should be some way to find the correspondence between a specific event and a specific (command, reply) pair. Ideally, the event offset would seem to be the best way to establish such a correspondence, but in Lagom it is not available when constructing the command reply (though maybe it could be possible in principle?). For example, we could add an identifier to every event and pass it in the command reply.

  2. The same read-side mutation must be defined both for an event and for a (command, reply) pair. Typically, I would imagine a third type ViewUpdate and some (implicit) functions to convert an Event to a ViewUpdate and to convert a (command, reply) pair to a ViewUpdate. The effective read-side updates are then done using ViewUpdate values.
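
The two requirements above could look roughly like this. It is only a sketch under assumptions: the shopping-cart domain, the `AddItem`, `ItemAccepted` and `ItemAdded` types, and the idea of carrying an `event_id` in the reply are all invented for illustration and are not part of Lagom.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class AddItem:            # command (hypothetical)
    item: str
    quantity: int


@dataclass(frozen=True)
class ItemAccepted:       # command reply carrying the id of the produced event
    event_id: str


@dataclass(frozen=True)
class ItemAdded:          # persisted event (hypothetical)
    event_id: str
    item: str
    quantity: int


@dataclass(frozen=True)
class ViewUpdate:         # the single mutation type the view understands
    update_id: str        # requirement 1: correlates event and (command, reply)
    item: str
    delta: int


def from_event(event: ItemAdded) -> ViewUpdate:
    return ViewUpdate(event.event_id, event.item, event.quantity)


def from_command_reply(command: AddItem, reply: ItemAccepted) -> ViewUpdate:
    return ViewUpdate(reply.event_id, command.item, command.quantity)


def apply_update(view: Dict[str, int], update: ViewUpdate) -> Dict[str, int]:
    """Requirement 2: one update function, whatever the origin of the update."""
    new_view = dict(view)
    new_view[update.item] = new_view.get(update.item, 0) + update.delta
    return new_view
```

Because both conversions yield the same `ViewUpdate` value, comparing `update_id` is enough to decide whether an incoming event confirms an optimistic update that has already been applied.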

From my initial thinking, there are several small issues that need to be addressed (handling cases with multiple events), but the main problem I see is how to handle the two view versions efficiently in a database. We would need two copies of the same database and regularly restore one copy (the latest version) to the state of the other (the consistent version). Alternatively, we should be able to roll back some of the latest changes. For SQL-based databases, I don’t think such features exist out of the box. However, this could be accomplished with some kind of automatic, general audit history based on triggers, so that we would be able to revert the last DB updates.

Sum up
Of course, such a solution would add constraints on the definitions of the corresponding commands, events and read view, as well as some overhead on read-side querying and memory usage. But what I like is that it would potentially provide a reactive read-side view without the client side having to care about optimistic updates of the view. More generally, this could considerably simplify application code (and make it more consistent).

Another solution I was briefly exploring would be to do that at the query level. Roughly, the idea would be to have specific “reactive query” on the read-side for which the clients get notified about the updates (through WS). But then there are questions on how to accomplish this in a generic fashion and without too much overhead.

I would be happy to hear any comment on this proposed solution. In particular,

  • do you think such a pattern is coherent with the general CQRS approach ?
  • do you think it could work in practice (and thus significantly reduce the read-side latency) ?
  • do you see any potential problem that could arise with such a solution ?
  • do you have any other idea to avoid the client having to manage the view updates ?
  • any other comment…

Thanks for taking the time to read!

Best regards,
Marc-Antoine Nüssli


(Omid) #2

Thank you for writing this. I think the delay in the read-side is a huge issue and I have been struggling with it for a while now. I propose to have a benchmark on the responsiveness and the delay/throughput of the read-side processor. Numbers matter: if we can get this delay under 0.5 seconds, then there is no need for any other implementation. Also, we should really understand why this number is so large; 2 seconds (in your case, after optimization) is just not an acceptable latency for any application.


(Renato) #3

Hi @datalchemist,

I’m not sure I fully understand your strategy for option 2, but before we start discussing it I would like to make a point about option 1.

On the topic of being “illegal” or not to use the write-model as a read-model, this is more about how CQRS compliant you want to be.

There are many things that get mixed when people discuss ES/CQRS. You can have perfectly well-crafted event sourced systems without being fully compliant with CQRS. It’s true that option 1 would violate the principles of CQRS, but those principles are less important if you need strong consistency.

Much more important than a true separation between write and read-side models is the correct definition of your consistency boundary. The consistency boundary defined by a PersistentEntity (aggregate) is the real added value, not the Command Query Model segregation. That said, if you can’t cope with eventual consistency, simply don’t apply segregation and keep only the event sourced / consistency boundary aspect of it.

The impact of this decision in terms of code is not that big. If we consider that A is your Aggregate, E is Event and V is the View, you will need the following functions.

A => E => A // classical event handler
A => V // from write-side to read-side
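
As a concrete reading of the two signatures above, here is a small sketch in Python. The `User`, `EmailChanged` and `UserView` types are hypothetical stand-ins for A, E and V; nothing here is Lagom API.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class User:               # A: the aggregate state
    user_id: str
    name: str
    email: str


@dataclass(frozen=True)
class EmailChanged:       # E: an event
    new_email: str


@dataclass(frozen=True)
class UserView:           # V: what is returned to the client
    user_id: str
    label: str


def apply_event(user: User, event: EmailChanged) -> User:   # A => E => A
    """Classical event handler: returns the next aggregate state."""
    return replace(user, email=event.new_email)


def to_view(user: User) -> UserView:                         # A => V
    """Derives the view directly from the aggregate state."""
    return UserView(user.user_id, f"{user.name} <{user.email}>")
```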

The problem that arises is that you also need to build an index for your views. You can’t make a search on the write-side, find the right A and convert it to V (you can only find A by id). So, you still need to listen to events on a read-side processor and update an index so you can perform a search on fields other than the id.

In addition to that, you want to deliver the same V after performing a search on the index. The most common way of doing that is to have a function V => E => V, but then you get the next issue. You can’t guarantee that a V produced by A => V will be the same as the one produced on the read-side. Even if you consider that there are no new events and both sides are consistent, you can still have different views of it because they are produced by different functions and you get that feeling that you are repeating yourself.

You can mitigate it by generating a read-side that contains the columns you want to search on and two blob columns, one for A and one for V. Yes, exactly that: you can also have A on the read-side (totally not CQRS compliant).

Whenever an event arrives on the read-side processor, you read the A blob and apply the event to it, reusing the same event handler as on the write-side: A => E => A. Once you have the new A, you generate a new V with A => V and save it as a blob, together with the new A, in the read-side table.

You end up with the same two functions you were using on the strongly consistent side. You use them to generate exactly the same V on the eventually consistent side, together with the index info.
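
A rough sketch of such a read-side table, using an in-memory SQLite database and JSON blobs. This is only an illustration under assumptions: the table layout, the `user_view` name, the dictionary-based aggregate and the helper functions are all invented here, and a real read-side processor would also track offsets.

```python
import json
import sqlite3


def apply_event(aggregate: dict, event: dict) -> dict:      # A => E => A
    if event["type"] == "EmailChanged":
        return {**aggregate, "email": event["email"]}
    return aggregate


def to_view(aggregate: dict) -> dict:                       # A => V
    return {"id": aggregate["id"],
            "label": f'{aggregate["name"]} <{aggregate["email"]}>'}


def create_read_side() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE user_view ("
        " id TEXT PRIMARY KEY,"
        " email TEXT,"           # searchable column (the index)
        " aggregate_blob TEXT,"  # serialized A
        " view_blob TEXT)"       # serialized V
    )
    return conn


def save(conn: sqlite3.Connection, aggregate: dict) -> None:
    conn.execute(
        "INSERT OR REPLACE INTO user_view VALUES (?, ?, ?, ?)",
        (aggregate["id"], aggregate["email"],
         json.dumps(aggregate), json.dumps(to_view(aggregate))),
    )


def on_event(conn: sqlite3.Connection, entity_id: str, event: dict) -> None:
    """Read A, apply the event with the write-side handler, store A and V."""
    row = conn.execute(
        "SELECT aggregate_blob FROM user_view WHERE id = ?", (entity_id,)
    ).fetchone()
    if row is None:
        raise KeyError(f"no read-side row for {entity_id}")
    save(conn, apply_event(json.loads(row[0]), event))


def search_by_email_domain(conn: sqlite3.Connection, domain: str) -> list:
    """Search on the indexed column and return the stored V blobs as-is."""
    rows = conn.execute(
        "SELECT view_blob FROM user_view WHERE email LIKE ? ORDER BY id",
        ("%@" + domain,),
    )
    return [json.loads(r[0]) for r in rows]
```

The point of the design is that `search_by_email_domain` never recomputes V with a read-side-specific function: it returns what `to_view` produced, so a V found through the index has the same shape as a V obtained by id from the entity.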

The net result is that you work around the eventual consistency issue whenever you can pick a V by id. You simply take A (by id) and call A => V. Your index will lag behind because read-sides are eventually consistent, but that’s usually not a big issue.

Note that this solution is totally not CQRS compliant. In that case, it’s preferable to speak about the strongly consistent and the eventually consistent sides. The strongly consistent side gives you the consistency guarantees (consistency boundary) and the correct ordering of events in the journal (event sourcing). The eventually consistent side gives you search capabilities. And you reuse the same functions everywhere.

This approach only works if A => V can be defined. If V needs more info than available in A, you can’t do it. That means that you have a 1:1 mapping between the Aggregate (the consistency guardian) and the View (its representation).

The question that remains is: if eventual consistency is a problem, isn’t it better to leave aside some CQRS principles and adopt a new set of guidelines? I believe that the complexity of option 2 can’t be justified if the only reason is to preserve some model segregation purity.

Maybe we should simply not see it as model segregation (not two models), but as two sides of the same model: the strongly consistent command side and the eventually consistent query side.


(Renato) #4

@omidb, I understand that it is not always easy to embrace eventual consistency. So, I would like to add some more context…

Also, we should really understand why this number is so large; 2 seconds (in your case, after optimization) is just not an acceptable latency for any application.

The current delay relates to how Cassandra’s materialized views work. The previous version of the plugin (and Lagom) was impacted by it. The new version (v.0.8.0) doesn’t suffer from it. You can use it in Lagom, but not without migrating your data first. New Lagom applications can use it directly.

You can always try to optimize things, but eventual consistency is inherent to the architectural style of event sourced applications.

Imagine the following situation. You have many active entities in memory producing new events and filling the journal. You can’t have as many processes reading from the journal and producing views. You would have too many queries hitting the DB.

That’s why read-side processors work with tags. You may have sharded tags and have, let’s say, 10 processors in memory, each reading a subset of the journal. Still, for a write-intensive application, those 10 processors can’t be as efficient as the write-side. The write-side only appends to the journal, while the read-side needs to read and update the views each time.
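
To make the idea of sharded tags concrete, here is a minimal sketch of how events might be spread over a fixed number of tags so that each processor reads only its own subset. It is an illustration, not Lagom's implementation: the `shard_tag` and `group_by_tag` helpers, the CRC32 hash and the tag naming are assumptions.

```python
import zlib
from collections import defaultdict
from typing import Dict, List

NUM_SHARDS = 10  # assumed: one tag per read-side processor


def shard_tag(base_tag: str, entity_id: str, num_shards: int = NUM_SHARDS) -> str:
    """Derive a stable tag from the entity id.

    All events of one entity get the same tag, so a single processor
    sees them in order.
    """
    shard = zlib.crc32(entity_id.encode("utf-8")) % num_shards
    return f"{base_tag}{shard}"


def group_by_tag(base_tag: str, entity_ids: List[str]) -> Dict[str, List[str]]:
    """Show which entities each tagged processor would be responsible for."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for entity_id in entity_ids:
        groups[shard_tag(base_tag, entity_id)].append(entity_id)
    return dict(groups)
```

With 10 shards there are at most 10 concurrent readers of the journal, however many entities are writing to it, which is why the read-side can fall behind under write-heavy load.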

That said, you can’t simply neutralize eventual consistency. An application that can’t tolerate latency on the query side is not suited for an architectural pattern that introduces eventual consistency. Even if you overcome all the technical limitations of a given plugin, eventual consistency will hit you on a system under load. So, the application must be designed to tolerate latency on the query side.


(Omid) #5

I understand the eventual consistency limits and why this happens. I have used the newest plugin and it’s better but the latency is real.

What I am saying is that the latency should be documented better, and that’s why I said numbers matter. The latency can be 1 hour or 10 ms! I propose to have a table that shows the eventual consistency latency with columns like number of writes per second, number of read-side processors and … These numbers can help people decide whether they can tolerate this latency. It’s also good for developers; maybe they can come up with read-side models that are much faster.
I simply cannot accept that, because eventual consistency exists, it’s okay to have latencies like that.
It would also be interesting to see what this number is for other libraries in the .NET world or even in Spring Framework CQRS tools.

I found this paper with some numbers about the performance of their model: https://www.icsr.agh.edu.pl/~malawski/DebskiSzczepanik-CQRS-IEEE-Software.pdf

From the paper:

Recently, Akka authors have rolled out Lagom [10],
a full-fledged microservice framework based on the CQRS+ES
architecture. It handles persistence in a similar way to our
prototype application, but in contrast to our toolkit approach
it enforces the entire application structure, up to the definition
of REST endpoints. Despite the high frequency of publishing
new tools influenced by CQRS+ES architecture, there have
been no in-depth performance studies of this approach we are
aware of.


(Marc-Antoine Nüssli) #6

Hello @renato,

Thanks for your detailed answer, which was a good source of new reflections on this topic.
Here are some of my thoughts, first about the write-side as read-side solution and second about the general question of coping with eventual consistency.

About the write-side as a view solution
Actually, at the beginning, I did essentially what you describe: using the aggregate as the source for the main view and query it by ID through read-only commands. Then, as you said, the read-side is used essentially as an index to find out the aggregate IDs from other fields.
However, I am not sure I correctly grasp what you mean when you say that

…and the A&V blobs solution that you suggest in response to that.
In my case, I used the view as an index and then used the retrieved IDs to query the write-side to get the actual view. If I understand you correctly, what you propose is to store V (and A, to update V with E) in the read-side, so that we can return V when doing an index search, without needing to query the write-side for each retrieved ID. Is this what you meant?
But then, when you return V from an index search, you return an eventually consistent version of it, while if you only use the read-side for retrieving IDs (and then query the write-side to get each V), you get strongly consistent Vs (but of course you still query an eventually consistent index).

In general, I agree that this solution works well and I understand that it is not so important whether it is CQRS-compliant or not. But I am not very satisfied with it, for different reasons. One is, as you say, that it only works for simple cases where you have a one-to-one aggregate-view relation; when it comes to more complex views involving several entities, you can no longer rely on such a solution. Also, in terms of data model, it is a bit unclear what is what, and it is not obvious how to decide what to put in the aggregate, in the event and in the view. In my case, I had to modify my aggregate several times because its contents did not match the requirements of my view. Also, I sometimes ended up putting too much in the event just to pass the required data to the view. More generally, I would like to have a pattern that scales when system complexity increases.
Finally, this approach somehow feels like bending an event-sourced CQRS architecture into an event-sourced non-CQRS system… And I wonder whether it makes much sense to use the Lagom persistent entity and read-side infrastructure to do that, although I totally understand and agree that it works.

About coping with eventual consistency

I’m not sure I understand what “coping with eventual consistency” means in practice. As I picture things, when it comes to UI responsiveness, there is always a point at which you cannot deal with eventual consistency. So, except when there is very little user interaction and it is not significant, I don’t see how an application can play well with eventual consistency. What I suppose, but I may be missing something, is that when you have a fully CQRS-compliant system with some significant UI, you almost always end up doing some optimistic updates on the client side to avoid eventual consistency issues that may affect responsiveness. Or did I miss something? Are there cases where eventual consistency is absolutely not a problem, even on the UI side? Or are there other means (than optimistic updates or not being CQRS-compliant) to cope with it?

Does eventual consistency necessarily mean temporal lag between write-model and read-model?
As I understand eventual consistency, it essentially means that your view data will be consistent (i.e. represent the actual correct state of the data) only eventually and thus, at some moment, it may not represent the actual state of the data (but an inconsistent version of it). In Lagom (and I guess also in other CQRS architectures), the inconsistency is reflected in the fact that the view may represent the actual state as it was some time ago (and not as it is now). In other words, there is a temporal lag between the current strongly consistent state (on the write-side) and the eventually consistent state (on the read-side). But isn’t it possible to have a different kind of eventual consistency? More specifically, can’t we imagine an eventually consistent view that represents the best guess of the actual current state? That is, there is no (or at least a much smaller) temporal lag between the view state and the actual state, but the view may contain an incorrectly predicted state (that will eventually be corrected).

That is, in more abstract terms, what I have imagined in my drafted solution. The idea is to have a faster, although potentially incorrect, transmission of the events (or at least of the data required to update the view) to the view, so that it can be optimistically updated before it receives the consistently ordered events. Of course, that would certainly never be as efficient as doing optimistic updates on the client side, but it would be much more generic and would also offer a clear separation of the view management from the client.


(Marc-Antoine Nüssli) #7

Hello @omidb,

I agree with you that these numbers should be better documented (which of course requires benchmarking them) in order to help people decide what is acceptable before implementing.

But at the same time, I have the impression, as I explain in my response to renato, that for almost any application there is necessarily a point (in particular when it comes to UI interaction) where this latency will be a problem. And thus, it would be great to have generic patterns to solve these issues instead of constantly implementing ad-hoc solutions (which I suspect to be essentially situation-specific client-side optimistic updates).
As I say, maybe I do not completely grasp the CQRS architecture, but I currently have great difficulty imagining a case in which eventual consistency latency is absolutely never an issue.


(Renato) #8

Hi @datalchemist,

Sorry for the long delay in responding. I had this message on my mailbox waiting for me to answer, but I lost sight of it.

My point when I said “…you want to deliver the same V after performing a search…” was about the case where you do a search and retrieve more than one result. For instance, you search for all users with a Gmail address, you retrieve many of them, and you want to build a list to display. In that case, it’s not practical to retrieve all the IDs and hit the entity to get the final state.

The price you pay is that this view is only eventually consistent with the view you generate directly from the entity state. And we should not forget that the latter is not watertight either, because while you’re looking at it, it may already have been changed by another process.

Eventual consistency happens more often than we think, but it is more disturbing when you get stale data just after performing an update.

The solution I proposed in my previous message is certainly not generic enough, for the reasons we already discussed. Unfortunately, I don’t think there is a generic solution for this. The only real solution is to completely remove eventual consistency, but in that case you pay the price of contention on the write side.

Yes, I think so, but that’s also not generic, and it is very application dependent. And I see some issues with that as well. How can we predict what may be a correct state? Is it by having the view in memory and trying to apply the events we have directly to it? If so, how can we be sure that we are not missing events?

I think we should not lose sight of the difference between “temporarily inconsistent” and “incorrect”. If I see a data representation that is stale, it may not be the latest state, but it is not an incorrect state. If we try to predict or shortcut the flow, we risk producing an incorrect state.

There are also other techniques you can use to mitigate the eventual consistency. For instance, you can have a version number that is propagated through the events and also returned after applying a command. With the version number, you can poll the view and only return if it has that same (or greater) version number. Again, it has its limitations as well.
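
The version-number technique could be sketched like this. It is a minimal illustration under assumptions: `wait_for_version`, the `read_view` callback and the `(version, data)` shape of a view are hypothetical, and the version to wait for is the one returned by the command.

```python
import time
from typing import Callable, Optional, Tuple

View = Tuple[int, dict]  # (version, data), an assumed shape for the example


def wait_for_version(
    read_view: Callable[[str], Optional[View]],
    entity_id: str,
    min_version: int,
    timeout_s: float = 2.0,
    interval_s: float = 0.05,
) -> Optional[View]:
    """Poll the read-side until it has caught up with `min_version`.

    Returns the view once its version is at least `min_version`,
    or None if the read-side is still behind when the timeout expires.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        view = read_view(entity_id)
        if view is not None and view[0] >= min_version:
            return view
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval_s)
```

The limitation mentioned above shows up in the `None` branch: under load the read-side may not catch up in time, and the caller still has to decide what to show to the user.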