Server side management of SSE connection closures

Hi again,

This is a follow-up question to my previous post regarding scaling up SSE streams on Play 2.7.x (with Akka HTTP 10.1 under the hood).

Our max-connections issues got resolved, but we are now looking at how connections are closed on the server side when the client either shuts down or invokes the close() method on the JS EventSource object.

It is perhaps important to mention here that we have an F5 load balancer between our clients and the Play server, which basically means that it is the F5 that actually talks to the Play server.

Given the way we have constructed the event stream, detection of a client-side closure relies on the attempts to send keep-alive SSE messages. When such an attempt fails, the Akka stream completes, the source actor is terminated, and that termination is captured by our session actor, which unsubscribes from the various cluster topics and stops itself (context.stop(self)).
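To illustrate the detection mechanism described above outside of Play, here is a minimal sketch with two plain TCP sockets on loopback. It is not our Play / Akka code; the listener and the keep-alive payload are stand-ins chosen for illustration. It only shows that, once the peer has fully closed its socket, a server that keeps writing eventually gets a connection error from the operating system.

```python
import socket
import time

# Loopback listener standing in for the SSE server (not Play itself).
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)

client = socket.create_connection(listener.getsockname())
server_side, _ = listener.accept()

# The client goes away completely, like curl after Ctrl-C.
client.close()

keep_alive = b"event: keep-alive\ndata:\n\n"
failed_attempt = None
error_name = None

# Keep writing keep-alives until the operating system reports the dead peer.
for attempt in range(1, 51):
    try:
        server_side.sendall(keep_alive)
    except ConnectionError as exc:
        failed_attempt = attempt
        error_name = type(exc).__name__
        break
    time.sleep(0.05)  # give the peer's reset time to arrive

print("write failed on attempt", failed_attempt, "with", error_name)

server_side.close()
listener.close()
```

The first write after the close often still succeeds, and only a later one fails. Whether the error surfaces as a broken pipe or a connection reset depends on the platform and on timing.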

We have been doing tests in 3 different ways to validate the proper closure of connections and the termination of the supporting actors.

  1. Using a curl command that directly opens the event stream on the server
  2. As part of the load tests we are performing with the Gatling SSE functionality
  3. Using our regular web client for our solution: closing here means that we just close the browser tab. We recently also added a window beforeunload event handler in our React app that invokes the close() method on the JS EventSource.

What are we observing?

  1. For the curl use case, after pressing Ctrl-C to interrupt the connection, we see that the connection on the server side is terminated as expected, with a “connection reset by peer” in our logs just before the cleanup starts. The cleanup starts with the next attempt to deliver a keep-alive SSE message.
  2. For the Gatling test, we have a strange observation: for the majority of the connections everything works as expected (the only difference with test case 1 is that this time a “broken pipe” is the trigger to start the cleanup). A small subset, however, does not show the same behavior. What we notice is that on the server side the keep-alives keep being sent for these connections. This means we have a resource leak that might grow over time. It looks as if our F5 still keeps the connection going and our server thinks there is still a peer to receive the messages, so the stream never completes and the cleanup we hope to see never happens.
  3. When using our regular web client, we never see anything being cleaned up. Closing the browser tab, or calling close() on the JS EventSource object, seems to have no effect. The connection appears to stay open, the keep-alive SSE messages continue to be generated (our logging shows this), and the Akka stream is not terminated. We therefore have a resource leak: over time, a growing number of actors keep running without a purpose.

We have some suspicions about our F5 (load balancer), although we could not find any evidence for this and our F5 administrator has yet to be involved in the troubleshooting. We cannot exclude that the error is somewhere else.

Our environment:
Play 2.7.4
Akka HTTP 10.1.12
SSE stream using Akka Streams (prematerialized Source actor, which is watched by a session actor that subscribes to the internal event publisher of our system)

I am interested to know whether:

  • there are known issues with this in the version we are using and whether an upgrade is required
  • there are other people that have seen issues when load balancers are used

The trace file lines that prove the successful closure

For test case 1

11-01-2021T11:07:46.266+0000 [11:07:46.266UTC] DEBUG [TcpIncomingConnection] - Closing connection due to IO error Connection reset by peer
11-01-2021T11:07:46.266+0000 [11:07:46.266UTC] DEBUG [Materializer] - [sse-events] Downstream finished.
11-01-2021T11:07:46.266+0000 [11:07:46.266UTC] DEBUG [NCFSessionEventListener] - Source actor akka://NCFNode/system/StreamSupervisor-0/$$l-actorRefSource TERMINATED. Stopping SessionEventListener.

Test case 2: the proof of successful closure for the majority of connections during Gatling injection.

11-01-2021T11:13:31.430+0000 [11:13:31.429UTC] DEBUG [TcpIncomingConnection] - Closing connection due to IO error Broken pipe
11-01-2021T11:13:31.430+0000 [11:13:31.430UTC] DEBUG [Materializer] - [sse-events] Downstream finished.
11-01-2021T11:13:31.430+0000 [11:13:31.430UTC] DEBUG [NCFSessionEventListener] - Source actor akka://NCFNode/system/StreamSupervisor-0/$$n-actorRefSource TERMINATED. Stopping SessionEventListener.
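To compare the number of clean closures with the number of connections that were opened, the closure lines above can be tallied by reason. This is a minimal sketch that assumes the trace format shown in this post; the sample lines are copied from the excerpts above, and `count_closures` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches the line logged when a connection is dropped due to an IO error.
CLOSE_RE = re.compile(r"Closing connection due to IO error (.+)$")
TERMINATED_MARKER = "TERMINATED. Stopping SessionEventListener."


def count_closures(lines):
    """Return (closure reasons, number of terminated source actors)."""
    reasons = Counter()
    terminated = 0
    for line in lines:
        match = CLOSE_RE.search(line.strip())
        if match:
            reasons[match.group(1)] += 1
        elif TERMINATED_MARKER in line:
            terminated += 1
    return reasons, terminated


sample = [
    "11-01-2021T11:07:46.266+0000 [11:07:46.266UTC] DEBUG [TcpIncomingConnection] - Closing connection due to IO error Connection reset by peer",
    "11-01-2021T11:07:46.266+0000 [11:07:46.266UTC] DEBUG [NCFSessionEventListener] - Source actor akka://NCFNode/system/StreamSupervisor-0/$$l-actorRefSource TERMINATED. Stopping SessionEventListener.",
    "11-01-2021T11:13:31.430+0000 [11:13:31.429UTC] DEBUG [TcpIncomingConnection] - Closing connection due to IO error Broken pipe",
    "11-01-2021T11:13:31.430+0000 [11:13:31.430UTC] DEBUG [NCFSessionEventListener] - Source actor akka://NCFNode/system/StreamSupervisor-0/$$n-actorRefSource TERMINATED. Stopping SessionEventListener.",
]

reasons, terminated = count_closures(sample)
print(dict(reasons), terminated)
```

Run against a full trace file, a terminated count lower than the number of injected connections would point at the leaked sessions.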

For test case 3, nothing can be shown so far, as nothing gets triggered by closing our regular web client.

For the sessions that are not properly closed, our trace file keeps on showing submission of Keep Alive SSE messages (as if the client is still listening).

11-01-2021T11:13:52.209+0000 [11:13:52.209UTC] DEBUG [Materializer] - [sse-events] Element: Event("",Some(5d0c0b5b-bc5f-41e4-a971-63f2e0299db3),Some(keep-alive))
11-01-2021T11:13:52.529+0000 [11:13:52.529UTC] DEBUG [Materializer] - [sse-events] Element: Event("",Some(a3c2a99c-bd7a-4130-a385-4b3128e66946),Some(keep-alive))
11-01-2021T11:13:58.119+0000 [11:13:58.119UTC] DEBUG [Materializer] - [sse-events] Element: Event("",Some(17d306d3-02f8-4140-beac-e07050dbf151),Some(keep-alive))
11-01-2021T11:13:59.699+0000 [11:13:59.699UTC] DEBUG [Materializer] - [sse-events] Element: Event("",Some(5d4f4caf-f650-4fe2-91b7-3e69a7e82989),Some(keep-alive))
11-01-2021T11:14:02.209+0000 [11:14:02.209UTC] DEBUG [Materializer] - [sse-events] Element: Event("",Some(57524bef-0ea0-4901-ab58-2f2d15a61367),Some(keep-alive))
11-01-2021T11:14:02.529+0000 [11:14:02.529UTC] DEBUG [Materializer] - [sse-events] Element: Event("",Some(69a56c62-e04d-44b3-aa39-6241a341dd87),Some(keep-alive))
11-01-2021T11:14:08.129+0000 [11:14:08.128UTC] DEBUG [Materializer] - [sse-events] Element: Event("",Some(b42ef360-de71-4a59-a38a-9ea32e0872e9),Some(keep-alive))
11-01-2021T11:14:09.689+0000 [11:14:09.689UTC] DEBUG [Materializer] - [sse-events] Element: Event("",Some(55397ac8-60b4-4a69-bbe5-8de7e033e4b7),Some(keep-alive))
11-01-2021T11:14:12.219+0000 [11:14:12.219UTC] DEBUG [Materializer] - [sse-events] Element: Event("",Some(aff3a797-df93-4912-a320-9d5856979f61),Some(keep-alive))
11-01-2021T11:14:12.529+0000 [11:14:12.529UTC] DEBUG [Materializer] - [sse-events] Element: Event("",Some(2d51964b-459d-4f42-b040-0475a41b4ebc),Some(keep-alive))
11-01-2021T11:14:18.129+0000 [11:14:18.129UTC] DEBUG [Materializer] - [sse-events] Element: Event("",Some(7ee87bb9-7b93-4d3c-9166-cb4e49e54249),Some(keep-alive))

When I previously mentioned that I am also suspecting our load balancer, I meant to say:

Our load balancer is a full proxy, involving 2 separate connections: 1. client - F5, 2. F5 - server.
What we might need is also a Keep-Alive check at the level of the load balancer towards the client. That might not yet have been put in place at the moment. I was inspired by this article, but obviously I am not an F5 specialist and it remains to be seen what our F5 administrator thinks about this.

Meanwhile we have done a test without F5 in the middle.

There the problem does not present itself.
Test case 2 shows that all injected connections are also closed and cleaned up as expected on the server side.
Test case 3 correctly terminates the server side resources when F5 is not part of the equation.

This appears to indicate we need to look at that level and the issue is probably not at the level of Play / Akka HTTP.

I will keep you posted on the outcome of the investigation, just in case other people are having a similar kind of load balancer.

We are having a rather difficult discussion with the vendor of our F5.

Their view on this issue is that, in the test cases where our server-side resources fail to close, the network capture shows only a FIN packet sent to our server, but no RST packet.

Below is a short snippet (the source/destination IP addresses are replaced by F5 or Play, as applicable). This is the sequence we see: a FIN is sent, and then just acknowledged.

11:49:28.712029 IP F5 > Play: Flags [F.], seq 8203, ack 3255, win 7094, length 0
11:49:28.753148 IP Play > F5: Flags [.], ack 8204, win 44800, length 0

So it looks like Play / Akka HTTP, when receiving only a FIN packet, just acknowledges it instead of sending a FIN itself and triggering the shutdown of the stream.

Could it be that there is still something not right in the behavior of Akka HTTP itself?

Shouldn’t it also start the cleanup when only a FIN packet is received ?
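For reference, the FIN-only sequence can be reproduced outside Play with two plain TCP sockets. The sketch below is not Akka HTTP code and says nothing about how Akka HTTP itself reacts. It only shows, at the socket level, that a peer's FIN closes one direction of the connection: a server that only writes keep-alives sees no error, and the FIN becomes visible only as end-of-stream on a read. The loopback listener and the keep-alive payload are stand-ins chosen for illustration.

```python
import socket

# Loopback listener standing in for the server side of the F5 connection.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)

# Client standing in for the F5 side of the connection.
client = socket.create_connection(listener.getsockname())
server_side, _ = listener.accept()

# The client sends a FIN only (half-close) and keeps its read side open.
client.shutdown(socket.SHUT_WR)

# A write from the server still succeeds: the FIN closed one direction only.
sent = server_side.send(b"event: keep-alive\ndata:\n\n")

# The client can still read that data.
received = client.recv(1024)

# Only a read on the server side reveals the FIN, as an empty result.
server_side.settimeout(2.0)
eof = server_side.recv(1024)

print("bytes written after FIN:", sent)
print("client still received data:", len(received) > 0)
print("server read returned end-of-stream:", eof == b"")

client.close()
server_side.close()
listener.close()
```

If the F5 keeps accepting data after sending its FIN (an assumption on my side, not something we have confirmed), the keep-alive writes alone would never fail, which would be consistent with what we see in the trace.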