Cluster dimension

Hi team,
I am building a new type of deep neural network.
Is there any relevant information on how many nodes / Akka actors a cluster can have?
Thanks

How many actors?

For all intents and purposes, the only real limit to the number of actors is memory.
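To make "memory is the limit" a bit more concrete, here is a back-of-the-envelope sketch. The per-actor overhead of roughly 400 bytes is an assumption (it corresponds to the often-quoted ballpark of about 2.5 million actors per GB of heap), and the class name, method name, and heap size are made up for illustration. Your actors' own state will usually dominate, so measure before trusting any of this.

```java
// Back-of-the-envelope estimate; every figure here is an assumption, not a measurement.
class ActorCapacitySketch {
    // Assumed bare per-actor overhead (~400 bytes, i.e. roughly 2.5 million actors per GB).
    static final long ASSUMED_OVERHEAD_BYTES = 400L;

    /** Rough upper bound on the number of actors that fit in the given heap. */
    static long maxActors(long heapBytes, long stateBytesPerActor) {
        return heapBytes / (ASSUMED_OVERHEAD_BYTES + stateBytesPerActor);
    }

    public static void main(String[] args) {
        long heap = 4_000_000_000L; // hypothetical 4 GB heap
        System.out.println(maxActors(heap, 0));     // bare actors with no state
        System.out.println(maxActors(heap, 1600));  // actors holding ~1.6 KB of state each
    }
}
```

For a neural network where each actor holds weights or activations, the state term is the one to watch: it shrinks the count far faster than the framework overhead does.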

How many nodes?

A much more interesting question, to which there is no definitive answer because “it depends”. The biggest factors are “how stable is your network” and “how often does your topology change”.

I’ve played around with big clusters, although I’d love to hear from people who have operated big clusters for long periods of time. The core challenge is that the entire cluster has to maintain consistency about certain facts, most notably consistency about the cluster membership. The way the cluster does that is via a gossip protocol. This is the most scalable way to maintain consistency in distributed state, but it does have limitations as consistency in distributed systems is inherently challenging.
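As a rough illustration of why convergence takes longer as the cluster grows, here is a toy simulation of push-style gossip. It is not Akka's actual protocol (which tracks membership state and who has seen it, and handles failure detection); it only assumes that in each round every node that has heard the news tells one randomly chosen node. The class and method names are made up for this sketch.

```java
import java.util.Random;

// Toy model of epidemic (push) gossip; NOT Akka's real membership protocol.
class GossipSketch {
    /** Rounds until every node has heard a piece of news that starts at node 0. */
    static int roundsToConverge(int nodes, long seed) {
        Random rnd = new Random(seed);
        boolean[] informed = new boolean[nodes];
        informed[0] = true;
        int informedCount = 1;
        int rounds = 0;
        while (informedCount < nodes) {
            // Nodes informed during this round only start gossiping in the next one.
            boolean[] next = informed.clone();
            for (int i = 0; i < nodes; i++) {
                if (!informed[i]) continue;
                int peer = rnd.nextInt(nodes); // may pick itself or an already informed node
                if (!next[peer]) {
                    next[peer] = true;
                    informedCount++;
                }
            }
            informed = next;
            rounds++;
        }
        return rounds;
    }

    public static void main(String[] args) {
        for (int n : new int[] {10, 50, 400, 1000, 2400}) {
            System.out.println(n + " nodes: " + roundsToConverge(n, 42L) + " rounds");
        }
    }
}
```

In this toy model the number of rounds grows slowly with cluster size, but every membership change restarts the process, which is why frequent topology changes hurt more on large clusters.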

Let me give you some of my personal rules of thumb (again, I’d love to hear from someone with more hands-on production experience with large clusters, but I figure this is a strawman others can give feedback on):

  • 1-12 nodes: Trivial. The common case: just enough nodes that losing one is easy to recover from.
  • 13-50 nodes: Still “just works”, because Akka’s failure detection and gossip protocol are smart. But it’s now non-trivial and it’s easier for people to shoot themselves in the foot. At this size I want devs to understand the implications of what they are doing and consider whether this many nodes is really necessary.
  • 50-400 nodes: Officially a “big” cluster. Getting consistency now takes a measurable amount of time. (Which means changes to topology need to be more deliberate.) Having a reliable/low latency network becomes important. Rapid changes to topology are probably a bad idea. Well within the design specs, but you aren’t a “typical user” anymore and many things others take for granted will be tricky. (For example, distributed data would be difficult to use effectively.)
  • 400-1000 nodes: The cluster config starts to become critical. For example, the docs mention that some config settings are automatically tuned at this size to help avoid stragglers in consensus making. I’ve played around with clusters of this size in POCs. At this point you probably need to be very deliberate, do lots of testing, and have considerable expertise. Clearly Lightbend has done some thinking about clusters of this size, but I suspect production clusters this big are quite rare. (Anyone care to share?)
  • 1000+ nodes: Lightbend did some testing of a 2400-node cluster a long time ago. That was basically the upper limit they could reach. It was a long time ago, so I bet things have improved since then. And I think that if you really were willing to invest some time you could probably push quite a bit higher. But at this scale (1000+) I think it’s safe to say you are pushing the limits, and you should be prepared to do a lot of tuning and work hand in hand with Lightbend and/or experts in Akka Cluster.

Be sure to read the “Cluster Specification” page of the Akka documentation, as that gives you some more details on the “why” of my rules of thumb.

Hi David. Thanks a lot for your complete answer. My approach should work.
I am just afraid of the price of the licensing, as I am launching my business in the US and with a big cluster the price would be just too high.
Let’s see.
Thanks

I don’t work for Lightbend, so I obviously can’t answer your licensing question. But I will say:

  1. If you are a new company, the BSL almost certainly doesn’t apply to you. There is a free license available for companies under $25mm in revenue. Sure, you want to think about the future where your revenue is greater than $25mm, but if you are making that kind of money based off the framework it makes sense to license Akka anyway.

  2. If you are in a situation where you will be pushing the limits of the framework (as you implied) you will want a license anyway for support reasons.

  3. If you are worried, talk to Lightbend. I’ve observed them to be very flexible. If most of the CPU is going to the AI and not the framework itself, then Lightbend might be very flexible about terms.

I’m not trying to minimize the importance of the BSL license change. I do think it’s a big deal. But separate the technical and business questions because they are very different questions. And you aren’t likely to get business questions answered here: you’ll have to talk to Lightbend.

$25mm means $25,000,000 in revenue, right?