i am trying to understand the way Kafka TopicProducer and Sharding work together in Lagom.
I can understand why Sharding and PersistentEntity work well together.
I don’t understand why Kafka TopicProducer and Sharding should work together and why it is useful.
The following questions are:
1- What is the goal behind Kafka TopicProducer and Sharding?
2- Is there a way to configure the Shard number for Kafka TopicProducer?
3- How much Kafka TopicProducers do i have pro service?
The TopicProducer is designed to read the event journal and push events into Kafka.
When you tag your events, you usually use a shard number. In fact, when doing so you create sharded tags. So, if you choose 10 as the number of shards for your tags you will have AccountEvent0…AccountEvent9 (assuming AccountEvent is the orignal tag).
This allows Lagom to create distributed projections in an Akka Cluster. Each projection will read one sharded tag and do its work. In other words, instead of having one thread reading the whole journal, you have 10 threads distributed over your nodes each reading a subset of the journal.
TopicProducers and Read-Side Processors are distributed projections and therefore in that same way.
So, to answer you questions one by one:
- is what I explained above
- not really, you define the shards for your tags and that affect any kind of projections
- The num of TopicProducers is defined by the num of shards you choose for your tag.
Note, you can’t change that number on a system already in production. If you need to change it you need to re-shard the tags.
thanks a lot for the explanation, it helps a lot understand how distributed projections work.
I have to more questions:
1- how many projection regsitry do i have in cluster? Is it per node?
2- How can i re-rehard the tags?
What do you mean by projection registry?
We do have such a thing internally, but it’s not external API so I’m not sure you are referring to it.
For the internal registry, we use a CRDT to keep track of all projections. So, it’s not really one per cluster, but a distributed data structure.
In order to re-shard the tags, you need to write a problem that reads each event from the journal and re-write it back using a new shard number. There is no tool for that yet.
You need to do shutdown the cluster and remove all entries in the read-side-offset table. All projections will be replayed and new offsets persisted. This is especially annoying for TopicProducer because you will publish the events again in Kafka.
thanks for the explanation regarding re-shard. It seems to be really cumbersome.
I mean the projection registry in the internal API. The projection registry is backed by an actor. I just wonder if that actor is also distributed in a the cluster.
The registry is not distributed. It’s per node, but the data it holds is a CRDT.
thanks a lot