Clarifying GroupBy Documentation

Hi there, I am trying to understand the docs for groupBy, and it seems they need to be rewritten a bit. In the meantime, I have the following questions about the documentation:

emits an element for which the grouping function returns a group that has not yet been created. Emits the new group there is an element pending for a group whose substream backpressures completes when upstream completes (Until the end of stream it is not possible to know whether new substreams will be needed or not)

In particular, what does this mean:

Emits the new group there is an element pending for a group whose substream backpressures
???

When we talk about completion, are we talking about completing individual sub-streams, or completing the whole set of sub-streams?

I hope you can help clarify that.

Hi,

when it comes to completion, individual substreams are completed. What happens next depends on what is being done with those substreams. If they are merged back into a single stream, that stream is completed as a result. You can think of completion propagation as similar to element propagation, except that it can overtake elements in the stream.
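
To illustrate the merge case, here is a minimal sketch in Scala. It assumes Akka 2.6 or later, where an implicit ActorSystem also provides the materializer. The object name GroupByCompletion and the odd/even grouping are made up for the example. Each substream completes once the upstream has finished, and the merged stream completes after all substreams have completed, which is the point where the Sink.seq future resolves.

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.Await
import scala.concurrent.duration._

object GroupByCompletion {
  def run(): Seq[Int] = {
    // Assumption: Akka 2.6+, so the system supplies the materializer implicitly
    implicit val system: ActorSystem = ActorSystem("groupby-completion")
    try {
      val merged = Source(1 to 10)
        .groupBy(2, _ % 2)   // at most two substreams: odd and even
        .map(_ * 10)         // runs inside each substream
        .mergeSubstreams     // back to a single stream
        .runWith(Sink.seq)   // future resolves when the merged stream completes
      Await.result(merged, 5.seconds)
    } finally system.terminate()
  }
}
```

The order of elements across substreams after mergeSubstreams is not guaranteed, so sort the result before comparing it to anything.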

Hi,

I also have questions regarding groupBy after consulting the documentation. Forgive me if this is already documented. My questions are:

  1. When and how does groupBy back-pressure if there is back-pressure in 1 or more or all of the groups (by this I mean back-pressure within 1 or more or all “sub-streams”)?

  2. Does groupBy ever back-pressure from the perspective of the up-stream, and if so, when and how?

If an element arrives at groupBy for a substream that is applying backpressure, groupBy will backpressure until the substream stops applying backpressure and the element can be sent down the substream.

If you have individual downstreams that are likely to backpressure and you want to avoid this, it can make sense to start each substream with a buffer, so that it is less likely that backpressure “reaches” groupBy.

Okay. So in other words, if there is back-pressure in any of the groups (i.e. substreams) and this back-pressure reaches groupBy, the upstream in front of the groupBy will also see back-pressure (i.e. groupBy will have less demand).

In my particular case, I am grouping elements with groupBy, where the number of distinct element keys, and thus groups, is determined dynamically from a database source. The upstream offers the elements continuously and indefinitely, driven by a tick, in order to keep the relevant groups up even if they complete successfully (within each group there is tailored supervision as well). So, taking your suggestion a bit further for my case, I am in the comfortable position that directly behind the groupBy I can have a buffer, even with OverflowStrategy.dropNew, which means my groupBy won’t ever seriously backpressure, if at all.

Not quite right: if any of the substreams backpressures and there is an element for that substream, that element will cause groupBy to backpressure.

As long as there are no elements for the backpressuring substream(s), groupBy will keep pushing elements down the other substreams.

@johanandren
If I have a source emitting messages and I do a groupBy on the message key, with each substream (one per key) having its own destination with its own rate limit, how can I ensure that the substreams are not affected by each other?

For example,
Source emits messages at 1000/sec
Substream 1 accepts messages at 100/sec
Substream 2 accepts messages at 500/sec
Substream 3 accepts messages at 750/sec

Since the source is emitting messages faster than its substreams can accept them, I want to ensure that each of the above substreams can maintain its own rate, i.e. Substream 1 (100/sec) should not slow down Substream 2 and Substream 3.

In other words, the backpressure should not propagate to the source. How can I achieve this in Akka Streams?

groupBy has a one-element buffer. If the substream selected for an element is backpressuring, groupBy will apply backpressure until that element can go down the selected substream.

That means that if, within the 1000/sec source, there are never more than 100/sec for substream 1, never more than 500/sec for substream 2 and never more than 750/sec for substream 3, groupBy will never need to apply backpressure.

If the upstream is not guaranteed to have such a distribution, it is also not possible to guarantee that groupBy will never backpressure the source. At the core of the problem is a producer that produces elements faster than you can consume them. If there is a constant difference and you cannot slow down the producer (through backpressure), it’s pretty much an unsolvable problem. The only other option is to buffer elements, but since the consumer will always be slower, the buffer will grow indefinitely over time and at some point you will run out of memory.

If it is about spikes rather than a constant difference, a bounded buffer at the beginning of each substream will allow the source to continue emitting elements until that buffer fills up for a given downstream before it backpressures, and it also gives the substream a chance to catch up.
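
A minimal sketch of that bounded-buffer setup in Scala, assuming Akka 2.6 or later. The Message class, the "slow"/"fast" keys, the send function with its artificial delays and the buffer size of 32 are all hypothetical stand-ins for real destinations and tuning. The buffer sits directly after groupBy, so a burst for the slow key is absorbed there instead of immediately backpressuring groupBy. Once a buffer is full, backpressure still reaches the source.

```scala
import akka.actor.ActorSystem
import akka.pattern.after
import akka.stream.OverflowStrategy
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

object BufferedSubstreams {
  // Hypothetical message type, keyed by destination
  final case class Message(key: String, payload: Int)

  def run(): Int = {
    // Assumption: Akka 2.6+, so the system supplies the materializer implicitly
    implicit val system: ActorSystem = ActorSystem("buffered-substreams")
    import system.dispatcher

    // Stand-in for a rate-limited destination: the "slow" key takes longer per message
    def send(m: Message): Future[Message] = {
      val delay = if (m.key == "slow") 20.millis else 1.milli
      after(delay, system.scheduler)(Future.successful(m))
    }

    // Every third message goes to the slow destination
    val messages = (1 to 60).map(i => Message(if (i % 3 == 0) "slow" else "fast", i))

    try {
      val delivered = Source(messages)
        .groupBy(2, _.key)                          // one substream per key
        .buffer(32, OverflowStrategy.backpressure)  // bounded buffer at the start of each substream
        .mapAsync(1)(send)                          // one in-flight message per destination
        .mergeSubstreams
        .runWith(Sink.seq)
      Await.result(delivered, 10.seconds).size
    } finally system.terminate()
  }
}
```

With OverflowStrategy.backpressure nothing is lost, so all 60 messages arrive. The buffer only delays the moment at which the slow substream starts holding back groupBy.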