DData migration

We are planning to move our Lagom-backed service to DData for Akka Cluster. Per the Lagom migration guide (as well as the Akka documentation), this requires downtime, because nodes that use DData cannot connect to nodes that do not without corrupting the journal.
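For reference, the switch itself is a single configuration change. This is a sketch using the standard Akka Cluster Sharding setting (the default in the versions we run is `persistence`):

```
# Switch the shard coordinator's state store from the persistence
# journal to Distributed Data (DData). All nodes in a cluster must
# agree on this setting, hence the required downtime.
akka.cluster.sharding.state-store-mode = ddata
```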

Obviously, one way to achieve this is to shut down our servers and start them up again with the DData implementation. However, we are considering an alternative mechanism to reduce the downtime and keep our capacity high.

We are considering booting up a second environment with nearly the same configuration as our existing one, except for the seed nodes, and with the read-side processors disabled. This second environment would use its own nodes as seeds, effectively creating a split brain. We would make sure no traffic reaches it when we start it up, so with no writers running (entities, read-side processors, or the shard coordinator writer, since this cluster uses DData) we would not expect any corrupt data to be written. Then we would cut traffic to the old environment and repoint our DNS entries to the new one. Finally, we would shut down the old environment and turn on the read-side processors in the new one.
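Concretely, the second environment's configuration would differ only along these lines. Hostnames are placeholders, and we are assuming Lagom's role-based setting is the right way to keep read-side processors switched off (restricting them to a role that no node carries):

```
# New environment seeds itself from its own nodes only,
# deliberately forming a separate cluster from the old one.
akka.cluster.seed-nodes = [
  "akka.tcp://app@new-node-1:2552",
  "akka.tcp://app@new-node-2:2552"
]

# Keep read-side processors off until the old cluster is gone
# (assumption: confining them to a role no node has disables them).
lagom.persistence.read-side.run-on-role = "read-side-disabled"
```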

We’d have downtime between when we cut traffic to the old environment and when our clients refresh their DNS caches after the update, but we believe that is the only window. This should ensure that only one set of nodes is writing to any given data at any point in time, so no data should be corrupted.

Are we missing anything? Does this seem like a valid strategy?

Hi @zmarois,

There is one thing you are missing. Your old cluster may have entities in memory with a few commands in their mailboxes. When you cut the cable, you only prevent new commands from reaching the mailbox; commands already enqueued may still be processed, mutate the entity, and produce events in your journal.
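One way to reduce this window (a sketch, assuming Akka 2.5+ where CoordinatedShutdown is available) is to make sure the old nodes leave the cluster gracefully, so shard regions hand off their entities before the process dies rather than racing the new cluster:

```
# Assumption: Akka 2.5+. With this enabled (the default), a JVM
# shutdown hook runs the full CoordinatedShutdown sequence: the node
# leaves the cluster and shard regions hand off / passivate their
# entities before the process exits.
akka.coordinated-shutdown.run-by-jvm-shutdown-hook = on
```

This does not remove the need to fully stop the old cluster before opening the new one; it only makes that stop cleaner.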

If you open access to the new cluster at the same time, the same entity may be loaded in both clusters. The split brain becomes a nightmare in that case.

You may have two clusters running, but you will need to shut down the old one entirely before opening access to the new one. That said, I believe this is quite a complicated scenario with little gain in terms of downtime.


Thanks @octonato. Yes, the complexity is what we are weighing against the potential downtime. Thanks for identifying that extra case. We definitely missed that.