How to set up Event Sourcing multi-region across k8s clusters?

Hi all, we’re spinning up a microservice using Akka Persistence and Event Sourcing. We have a requirement for the service to be highly available across multiple regions and so we have an EKS cluster in each region and can deploy the service across clusters with global load balancing via Fastly.

However, my understanding of Akka Persistence, and event sourced behaviours in particular, is that actor state is held in memory while, under the covers, the events persisted to the journal are written to a database. As such, an actor in one region's cluster would be unaware of changes made in another region: even with a global database, the database isn't in charge of co-ordinating updates across regions. To deal with that, you'd need a single Akka cluster spanning all of the regions, so that events for a given persistence id are persisted in only one place that the whole system is aware of.
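To illustrate that mental model, here's a toy sketch in plain Scala (no Akka dependency; all the names are invented for illustration): state lives in memory and is only ever derived from the events persisted to the journal.

```scala
// Toy sketch of the event-sourcing model described above.
final case class Deposited(amount: Int)

final case class Account(balance: Int = 0) {
  // The event handler is the only way state ever changes.
  def applyEvent(e: Deposited): Account = copy(balance = balance + e.amount)
}

// Stands in for the database-backed journal; in Akka this would be a
// journal plugin writing to Cassandra, JDBC, etc.
final class Journal {
  private var events = Vector.empty[Deposited]

  def persist(e: Deposited): Unit = events :+= e

  // Recovery: replay every persisted event to rebuild the in-memory state.
  def recover(): Account =
    events.foldLeft(Account())((state, e) => state.applyEvent(e))
}

object Demo {
  def main(args: Array[String]): Unit = {
    val journal = new Journal
    journal.persist(Deposited(100))
    journal.persist(Deposited(50))
    println(s"Recovered state: ${journal.recover()}")
  }
}
```

The point of the sketch is that a second copy of `Account` in another region, replaying a *different* journal, would happily diverge — which is exactly the co-ordination problem described above.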

Given that the separate regions are separate Kubernetes clusters, is there any guidance on how to have an Akka cluster span multiple k8s clusters? My understanding was that that's discouraged, though, so is there a recommended solution to this use case?

Start with this:

https://doc.akka.io/docs/akka/current/typed/replicated-eventsourcing.html

Thanks, I had taken a look at Replicated Event Sourcing and it's very likely the route we want to go down, but it still has the same pitfall when it comes to multi-region. Within a single Akka cluster, the actors can be replicated across all nodes with no problem. To keep that behaviour across regions, you'd need your Akka cluster to span all of those regions, which here means multiple k8s clusters, as I mentioned above.

There is an issue that discusses addressing this (Replicated event sourcing across two Akka Clusters · Issue #29575 · akka/akka · GitHub), but we need a solution in the interim.

Please correct me if my understanding is incorrect.

Sorry about the short post. I had a meeting but wanted to at least get the link in there. I think part of the confusion stems from the various overlapping meanings of "cluster".

There are several challenges here:

  1. That you have been told not to run an Akka Cluster across regions. This is good advice, unless you use the multi-DC feature of Akka. But the issue with multi-DC is that a multi-DC cluster basically functions as n separate clusters, which means that things like Persistence aren't going to work with it. The main problem here is the overloading of the word "cluster".
  2. That Replicated Event Sourcing uses "Direct Replication of Events" as a performance optimization. Like Chris in the issue you link, I suspect that things will be fine if you turn this feature off and run two separate clusters. I had thought that this had been tested, but apparently not, according to that issue.
  3. That you will need at least some intercommunication between your K8S clusters, notably the communication to the databases in other clusters. This shouldn't be too big of a problem to configure, but I can't promise that I've thought about every edge case.
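For context on point 1, the multi-DC feature is configured per node along these lines. Treat this as a sketch from memory of the Akka Cluster Multi-DC settings; the values are placeholders, so double-check the keys against the reference config for your Akka version:

```hocon
akka.cluster.multi-data-center {
  # Which data center this node belongs to; nodes sharing a value
  # form one "team" within the larger cluster.
  self-data-center = "eu-west"

  # Limits how many nodes per data center gossip across DC boundaries.
  cross-data-center-connections = 5
}
```

That per-DC partitioning is exactly why a multi-DC cluster behaves like n separate clusters from Persistence's point of view.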

If you want multi-region high availability, this is really the only path, for the reasons documented in the Replicated Event Sourcing docs regarding consistency/availability trade-offs. The only alternative would essentially be to build your own version of Replicated Event Sourcing (and you'd face the same challenges).

Also, as a bit of a commercial plug: if you are building a microservice with availability requirements this high, and using things like Replicated Event Sourcing on the bleeding edge of functionality, you really should be talking to Lightbend for support. The way that issue is going to get prioritized is if a customer needs it.

Thanks a lot for the detailed explanation, David. That certainly clears things up. I was missing some knowledge around direct replication to fully understand Chris's suggestion, but I've seen the section on it in the documentation now, so thanks for highlighting that in particular. If we share a global Cassandra database across k8s clusters, we should be able to deal with point 3 easily, so I can see how things could work now. We'll go ahead and try out that approach to confirm.
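For what it's worth, the shared-database part might look roughly like this with akka-persistence-cassandra 1.x (which delegates connection settings to the DataStax Java driver). This is a sketch under those assumptions; the hostnames are placeholders and the exact keys depend on your plugin version:

```hocon
akka.persistence {
  journal.plugin        = "akka.persistence.cassandra.journal"
  snapshot-store.plugin = "akka.persistence.cassandra.snapshot"
}

# Every region's deployment points at the same globally replicated ring,
# preferring its local data center for reads and writes.
datastax-java-driver {
  basic.contact-points = ["cassandra-eu-west.internal:9042", "cassandra-us-east.internal:9042"]
  basic.load-balancing-policy.local-datacenter = "eu-west"
}
```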

Unfortunately, we don't have the Lightbend subscription at the moment. Typical company politics are delaying the renewal for now, but it's certainly something we'll be considering again now that we're diving deeper into the tech.