Deployment strategy for clustered Lagom services on Kubernetes

akka-cluster
(Mazen Ataya) #1

What is the recommended deployment/upgrade strategy for clustered Lagom services on Kubernetes?

Currently, every time we deploy to k8s, we start up new nodes and allow them to form a new cluster. Once the new cluster is formed, the k8s readiness checks pass, so k8s considers those nodes ready to receive traffic and starts terminating the old nodes. Can anyone see an issue with this strategy? Are we exposing ourselves to split-brain risk?

K8s seems to kill the old cluster instantly once the new cluster is formed, but I worry that this might not always be the case.

The alternative strategy would be to add new nodes to the existing cluster in a rolling upgrade until all nodes have been replaced. We didn’t go with this strategy because of concerns about ser/deser issues that might arise when new nodes communicate with old ones.

Thoughts and suggestions are much appreciated.

(Alan Klikic) #2

Hi,

Forming two separate clusters for one service is a split-brain state. It must be avoided.

See the "Deployment considerations" section of the Akka Cluster Bootstrap documentation for details on initial deployment and rolling update guidelines.

Which ser/deser issues are you concerned about?

Br,
Alan

(Mazen Ataya) #3

Which ser/deser issues are you concerned about?

For example, a persistent entity on one of the new nodes writes a new event to the event stream. Later that persistent entity gets migrated to an old node that doesn’t know how to ser/deser the new event because it doesn’t have that class.

(Alan Klikic) #4

I do not think this situation can be avoided entirely.
That said, the probability of it happening is rather low: during the update, a newly added node would have to persist a new event and crash, and an old node would then have to pick the entity up. In the worst case, access to the affected entity instance(s) would fail until the entity lands on a new node again, or until the update is finished.

(Tim Moore) #5

The best way to handle this kind of situation is to perform the deployment in multiple steps:

  1. First, deploy a version that is able to deserialize the new event class, but doesn’t write it
  2. Wait for that deployment to roll across the entire cluster
  3. Then, it’s safe to deploy a second version that writes the new event class

Most other types of data format changes can be done in a similar manner, where you deploy a version that can read it first, then deploy the writers in a second phase.
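
The "read first, write later" steps above can be sketched as follows. This is a minimal, framework-free illustration and not Lagom's serialization API: the `EventCodec` class, the `writeNewFormat` flag, the `ItemAdded`/`ItemAddedV2` type tags, and the `type|payload` wire format are all hypothetical names chosen for the example. The first deployment would ship with `writeNewFormat = false` (it can read both formats but only writes the old one), and the second deployment would flip the flag once every node runs the first version.

```java
// Hypothetical sketch of a two-phase rollout for a new event format.
// None of these names come from Lagom; they only illustrate the idea.
final class EventCodec {
    static final String OLD_TYPE = "ItemAdded";
    static final String NEW_TYPE = "ItemAddedV2";

    // Phase 1 deploys with false (read-only support for the new format).
    // Phase 2 deploys with true, after phase 1 has reached every node.
    private final boolean writeNewFormat;

    EventCodec(boolean writeNewFormat) {
        this.writeNewFormat = writeNewFormat;
    }

    // Serializes to a toy "type|payload" wire format (an assumption of this sketch).
    String serialize(String itemId, int quantity) {
        return writeNewFormat
            ? NEW_TYPE + "|" + itemId + "," + quantity
            : OLD_TYPE + "|" + itemId;
    }

    // Reads BOTH formats regardless of the flag; returns {itemId, quantity}.
    String[] deserialize(String wire) {
        String[] parts = wire.split("\\|", 2);
        if (parts.length != 2) {
            throw new IllegalArgumentException("Malformed event: " + wire);
        }
        switch (parts[0]) {
            case OLD_TYPE:
                // Old events carry no quantity; this sketch assumes a default of 1.
                return new String[] { parts[1], "1" };
            case NEW_TYPE:
                String[] fields = parts[1].split(",", 2);
                return new String[] { fields[0], fields[1] };
            default:
                // This is what a node without the new class would hit.
                throw new IllegalArgumentException("Unknown event type: " + parts[0]);
        }
    }
}
```

Because every node running the first version already understands `ItemAddedV2`, an entity that moves between nodes during the second rollout can still be read wherever it lands.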
