We have multiple Akka Streams flows deployed, and in our model each stage in the stream uses mapAsyncUnordered and .async. This is by design for now in our higher-level framework, which allows users to define their stream. I wanted to provide a way to detect backpressure within such a defined stream.
I wrote the following piece of code to create an InstrumentedStage that I attach to every stage in the stream using the via operator, so that it runs on the same actor as the stage underneath.
In this instrumentation, I am trying to calculate the upstream and downstream latencies for a stage and then signal backpressure whenever the downstream latency is greater than the upstream latency. I am using Micrometer timers so that I can monitor the heatmap in Grafana before I start to use this.
For now I am unable to see backpressure; mostly, the upstream latencies are much higher than the downstream ones in the heatmap.
Given that we are using mapAsyncUnordered and async, I wanted to know whether this is the right logic, or whether I have to record the ID of every event flowing through onPush and onPull and then calculate the latencies according to those. I appreciate any help and insights into this.
Are there examples of custom stages that use a combination of mapAsyncUnordered and async that I can look at as well?
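To make the idea concrete, here is a minimal sketch of the kind of pull/push bookkeeping I mean. This is illustrative only: the actual GraphStage and Micrometer wiring are omitted, all names are hypothetical, and the clock is injected so the logic can be exercised deterministically.

```scala
// Illustrative bookkeeping only -- not the real stage. `clock` would be
// System.nanoTime in practice; it is a parameter here for testability.
final class StageLatencyRecorder(clock: () => Long) {
  private var lastPullToUpstream: Long = -1L   // when we forwarded a pull upstream
  private var lastPushToDownstream: Long = -1L // when we forwarded a push downstream
  var upstreamLatencies: Vector[Long] = Vector.empty   // pull sent -> push received
  var downstreamLatencies: Vector[Long] = Vector.empty // push sent -> next pull received

  // Called from onPull: downstream asked for an element, so the gap since our
  // last push to it is the "downstream latency"; then we pull upstream.
  def onDownstreamPull(): Unit = {
    if (lastPushToDownstream >= 0L)
      downstreamLatencies = downstreamLatencies :+ (clock() - lastPushToDownstream)
    lastPullToUpstream = clock()
  }

  // Called from onPush: upstream delivered an element, so the gap since our
  // pull is the "upstream latency"; then we push it downstream.
  def onUpstreamPush(): Unit = {
    if (lastPullToUpstream >= 0L)
      upstreamLatencies = upstreamLatencies :+ (clock() - lastPullToUpstream)
    lastPushToDownstream = clock()
  }
}
```

In the real stage, each recorded duration would go into one of two Micrometer timers instead of a Vector; the Vectors are just here to keep the sketch self-contained.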
I didn’t test, or even look that carefully at, your logic, but I don’t think this would work at all:
- Because of the unordered parallelism. I suppose you could try to work around this with IDs, but I think the complexity/overhead would be high.
- Because I’m not convinced that downstream latency being longer than upstream latency really gives you any reliable insight into whether backpressure is happening.
- Backpressure isn’t a bad thing. You seem to think that backpressure is a yes/no thing based on whether upstream is faster than downstream. But with all of this async processing there could be lots of little momentary micro-backpressures. You are essentially adding a set of actors with each async task. These will cause momentary backpressure all of the time as the actors get scheduled.
- The overhead on this is going to be really high. You are adding a lot of actors and processing.
If all you want to know is whether there is backpressure, there is a one-liner that will tell you: isAvailable(out). But if what you really want is telemetry about upstream/downstream latency, I’d advise just using the built-in telemetry. It will get you the metrics you are looking for with a lot less overhead, a lot more detail, and a lot more features: Akka Stream extended telemetry • Lightbend Telemetry
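To illustrate what a check like isAvailable(out) observes, here is a toy model of the pull/push handshake in plain Scala. To be clear, this is not Akka’s API and none of these names exist in Akka; it only demonstrates the underlying question: was there outstanding downstream demand at the moment an element arrived?

```scala
// Toy model of the pull/push handshake -- NOT Akka's API, just an
// illustration of the semantics behind checking isAvailable(out) in onPush.
final class DemandTracker {
  private var pendingPulls = 0  // outstanding downstream demand
  var totalPushes = 0
  var backpressuredPushes = 0   // pushes that arrived with no demand waiting

  def downstreamPulled(): Unit = pendingPulls += 1

  def upstreamPushed(): Unit = {
    totalPushes += 1
    if (pendingPulls == 0) backpressuredPushes += 1 // momentary backpressure
    else pendingPulls -= 1
  }

  def backpressureRatio: Double =
    if (totalPushes == 0) 0.0 else backpressuredPushes.toDouble / totalPushes
}
```

A counter like backpressuredPushes is about as cheap as backpressure observability gets; in a real stage you would bump a Micrometer counter instead of a field.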
Thanks for getting back to me.
I also understand backpressure is not a bad thing; my aim is to provide some observability into it. Yes, there will be momentary backpressures, and hence the idea of trying to plot them on a graph to visualize them. I saw the built-in telemetry, but I cannot pay for the license on this.
I wanted to understand your first point about using IDs because of unordered parallelism. For backpressure, does it matter that we track element with ID X being consumed and then a corresponding onPull signal for it coming in, versus just counting the onPush and onPull signals as I do now? My assumption was that, as long as the downstream flow was asking for more elements, we could get by without having to track individual elements.
I went with latency as a second option. I first tried just having a counter in onPush and onPull, but that didn’t prove reliable…
Appreciate you taking the time to read the post, thank you.
My concern with async/unordered is that, well, it’s unordered. Depending on how the underlying actors are scheduled, you are going to get a lot of variance in when onPull and onPush are called. One element might have a huge latency because the actor for the next stage doesn’t get scheduled for a while. Another element might have a tiny latency, even though it’s being backpressured, just because a buffer is being filled in the next actor. All of which will be exacerbated by the extra actor layers you are adding for monitoring.
I absolutely could be wrong, but I just suspect you are going to end up with something that isn’t very accurate or useful and also adds an enormous amount of overhead. That’s just a hunch, and perhaps a cynical one. So if you have a bunch of Akka Streams geniuses on your team, feel free to do some POCs to prove me wrong. But I have worked with some customers on Streams performance. And, in general, you want to fuse as many stages as possible (which you are preventing with this solution) and have monitoring be as lightweight as possible (sometimes even doing things like sampling). It obviously depends on the use case, and how much processing you are doing, but those are my gut reactions.
And again, apologies for being a touch cynical here, but I think you have three and a half fundamental options.
- Option 1. Use Telemetry. You say you can’t afford the license for this, but with the change in the Akka license, you will soon have to pay for a license regardless (security patches for the Apache-licensed 2.6.x version end September 23). Or maybe you qualify for the free license terms, in which case you’d get Telemetry for free. This option will get you exactly what you need, and you won’t have to spend any engineering effort.
- Option 2. Just manually gather end-to-end metrics. If all you are really trying to do is detect when your upstream is going faster than your downstream, you could add a timestamp to an envelope and manually capture (and send to Micrometer) the end-to-end time. Because of the unordered processing there will be variation, but all you probably care about is trends anyway. This lets you delay the licensing question 9 months, but you get less functionality and more overhead. Capturing counts (just at stream start and stream end) in Micrometer will also give you a very useful metric that won’t add too much overhead.
- Option 2.5. You could try to do the above, but actually keep a Vector of timestamps and attempt to capture metrics for every stage. I think you will introduce more overhead than you want, but it would be less overhead than what you are proposing.
- Option 3. Don’t use any monitoring at all. Perhaps this is showing my cynical side but pretty much all observability systems cost money. (As opposed to dev frameworks, which are largely free.) If your dev team can’t afford any observability, then just deliver it without observability and let someone who is on the hook for keeping the system running (or the business sponsor) find the budget to do proper monitoring.
(I don’t list Kamon because my understanding is that it doesn’t really have much Streams support. So not any better than option 2. If I’m wrong, someone can correct me.)
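For what it’s worth, the envelope idea from Options 2 and 2.5 can be sketched without any Akka dependency at all. All names here are hypothetical, and the actual flow and Micrometer wiring are up to you; this just shows the data carried alongside each element.

```scala
// Hypothetical envelope for Options 2/2.5: the element plus a trail of
// (stage name, timestamp) marks added as it moves through the stream.
final case class Envelope[A](payload: A, marks: Vector[(String, Long)]) {
  // Option 2.5: record a timestamp at each instrumented stage.
  def mark(stage: String, now: Long): Envelope[A] =
    copy(marks = marks :+ (stage -> now))

  // Option 2: end-to-end latency from the first mark to the last.
  def endToEndLatency: Long = marks.last._2 - marks.head._2

  // Per-stage latencies between consecutive marks.
  def perStageLatencies: Vector[(String, Long)] =
    marks.sliding(2).collect { case Vector((_, t0), (name, t1)) =>
      name -> (t1 - t0)
    }.toVector
}

object Envelope {
  // Wrap a payload at the stream entrance, stamping the start time.
  def start[A](payload: A, now: Long): Envelope[A] =
    Envelope(payload, Vector("source" -> now))
}
```

For plain Option 2 you would keep only the first and last marks (or a single timestamp) and report endToEndLatency to a Micrometer timer at the sink, which is where most of the overhead savings over per-stage instrumentation comes from.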
But, knowing the thousands of hours that went into Telemetry to make it performant, and having worked for Lightbend when some of these stream monitoring features were introduced (and how challenging it was to do so), I have some skepticism that a home-grown latency monitoring system is going to work very well. At least not unless you have some real experts on your team.