LocalActorRef memory leak and Full GC (Allocation Failure) causing GC pauses that take too long

Hi

We use Akka Cluster with Persistence, version 2.6.6 (Scala 2.13), and our application runs on Raspberry Pi 3s (compute module).

With -Xmx set to 128m for the JVM, it takes about 26 hours until the heap has almost no space left. There is no OutOfMemoryError, but the max GC pause grows to 2 sec 339 ms and a Full GC happens every minute (analyzed using GCeasy).
This is not acceptable because we have set up 1-second timeouts in our actors, which means they always time out whenever the application is paused for more than 1 second.
Goal: The max GC pause should not exceed 700 millis

Increasing the heap size from 128m to 256m makes it even worse, as expected (the max GC pause is then 16 sec 48 ms!).

Heap dump analysis with Eclipse's Memory Analyzer Tool (MAT) points to a memory problem with an instance of the class akka.actor.LocalActorRef.
One of our actors, called “propertyHostChannel”, holds an ActorRef instance (a LocalActorRef) which occupies 42.22% of the heap, and this gets worse the longer the application runs.

Additionally, we can see per-minute bursts of ~400 (!) debug logs like:

[DEBUG] [a.a.L.Deserialization] [akka.actor.LocalActorRefProvider.Deserialization] [SELServer-akka.actor.default-dispatcher-28] - Resolve (deserialization) of path [user/SELServer/PropertyHostChannelRouter/VirtualPropertyHost/$Q7#65017950] doesn't match an active actor. It has probably been stopped, using deadLetters.

Possibly relevant code excerpt from the ChannelRouterActor (the actor behind the “propertyHostChannel” instance):

private PSet<ActorSelection> clusterRoutersOfSameType = HashTreePSet.empty();

@Override
public Receive createReceive() {
    return receiveBuilder()
            .match(AddIdWithPropsToRegistry.class, this::onAddIdWithPropsToRegistry)
            .match(RemoveIdWithPropsFromRegistry.class, this::onRemoveIdWithPropsFromRegistry)
            .match(SendTo.class, this::onSendTo)
            .match(SendToCluster.class, this::onSendToCluster)
            // cluster events
            .match(ClusterEvent.CurrentClusterState.class, this::onCurrentClusterState)
            .match(ClusterEvent.MemberUp.class, mUp ->
                    addToClusterRouters(mUp.member())
            )
            .match(ClusterEvent.ReachableMember.class, reachableMember ->
                    addToClusterRouters(reachableMember.member())
            )
            .match(ClusterEvent.UnreachableMember.class, unreachableMember ->
                    removeFromClusterRouters(unreachableMember.member())
            )
            .match(ClusterEvent.MemberLeft.class, memberLeft ->
                    removeFromClusterRouters(memberLeft.member())
            )
            .match(ClusterEvent.MemberDowned.class, memberDowned ->
                    removeFromClusterRouters(memberDowned.member())
            )
            .build();
}

private void addToClusterRouters(Member member) {
    final ActorSelection cousinRouter = getCousinRouter(member);
    if (cousinRouter.anchor().path().address().host().isDefined()) {
        clusterRoutersOfSameType = clusterRoutersOfSameType.plus(
                cousinRouter
        );
        log.debug("Added cousinRouter: {} to clusterRoutersOfSameType: {}", cousinRouter, clusterRoutersOfSameType);
    }
}

private void removeFromClusterRouters(Member member) {
    final ActorSelection cousinRouter = getCousinRouter(member);
    clusterRoutersOfSameType = clusterRoutersOfSameType.minus(
            cousinRouter
    );
    log.debug("Removed cousinRouter: {} from clusterRoutersOfSameType: {}", cousinRouter, clusterRoutersOfSameType);
}

private ActorSelection getCousinRouter(Member member) {
    final String name = getContext().getSelf().path().name();

    // each node has a ChannelRouterActor at "/user/SELServer/" + name, so select that path
    return getContext().actorSelection(member.address() + "/user/SELServer/" + name);
}

Questions

  1. Why do we have bursts of the mentioned DEBUG logs?
  2. Why is there one instance of LocalActorRef which occupies so much heap?
  3. How can we get rid of the DEBUG log bursts, and how can we fix the memory issue so GC does not take more than 700 ms?

Please let us know if you need more information.

Thanks

Hi @SF,

Regarding the heap memory issue: Akka's ActorSystem organizes actors in a hierarchy with the root guardian at the top. In a typical Akka application most data is referenced by some actor, and parent actors reference their child actors. So, in summary, the guardian actor should reference most of the heap basically by definition.

That said, there could of course also be a single user actor which references all the data; you will have to check that in detail.

You can check which actor (or ActorRef) you are looking at in MAT by inspecting the field values of the LocalActorRef in question, in particular the path field, or the _actor field inside the cell, which contains the actual Actor instance of the actor in question.

Another good strategy is to use the histogram view instead of the hierarchy view to get a first idea of what could be wrong. A common (mis)use of heap is “overflowing” actor mailboxes, which can happen when an actor isn't able to process its (unbounded) mailbox fast enough to keep up with incoming messages.
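
To make that concrete, here is a contrived, minimal sketch of that failure mode (all class and message names are made up, not taken from your app): a consumer on the default unbounded mailbox that cannot keep up with the rate at which messages arrive, so the mailbox, and with it the heap, keeps growing:

import akka.actor.AbstractActor;
import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;

public class SlowConsumer extends AbstractActor {

    public static Props props() {
        return Props.create(SlowConsumer.class, SlowConsumer::new);
    }

    @Override
    public Receive createReceive() {
        return receiveBuilder()
                .match(String.class, msg ->
                        // simulate work that is slower than the arrival rate
                        Thread.sleep(50)
                )
                .build();
    }

    public static void main(String[] args) {
        final ActorSystem system = ActorSystem.create("mailbox-demo");
        final ActorRef consumer = system.actorOf(SlowConsumer.props(), "slowConsumer");
        // messages are produced far faster than they are consumed, so the
        // (unbounded) default mailbox accumulates them on the heap
        for (int i = 0; i < 1_000_000; i++) {
            consumer.tell("work-" + i, ActorRef.noSender());
        }
    }
}

A bounded mailbox or some form of backpressure between producer and consumer avoids this kind of unbounded growth.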

Hope that helps.
Johannes


Thanks a lot for your reply @jrudolph!
I'll investigate your suggestions and post insights here.

Best,
Simon

Hi @jrudolph

Checking the heap dump (of the running app, not yet in trouble) shows that it is the actor called PropertyHostChannelRouter.

I don't feel comfortable with the ChannelRouterActor code posted above; might the error be in there?
For reference, PropertyHostChannelRouter is an instance of ChannelRouterActor.

I'm stuck on the problem; does anyone else have a different approach?

Try sorting the histogram by shallow heap. Retained heap counts include each other when one object refers to another. In particular, in an Akka application most objects will be referenced by one actor or another, so it's no wonder the actors turn up there.

It seems the PCollections data structures are using most of the heap. They probably contain the state of the application? I only see clusterRoutersOfSameType in your code. Check that these collections don't grow too big.
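
For example, a small sketch (the LogSizes message class is made up, and this assumes imports for java.time.Duration and akka.actor.ActorRef) of how ChannelRouterActor could periodically log the size of its state collections, so unbounded growth shows up in the logs long before the heap fills up:

// marker message the actor sends to itself
public static final class LogSizes {}

@Override
public void preStart() {
    // if preStart already subscribes to cluster events, add this scheduling there;
    // consider cancelling the returned Cancellable in postStop()
    getContext().getSystem().scheduler().scheduleWithFixedDelay(
            Duration.ofMinutes(5), Duration.ofMinutes(5),
            getSelf(), new LogSizes(),
            getContext().getDispatcher(), ActorRef.noSender());
}

// and an additional case in createReceive():
//     .match(LogSizes.class, msg ->
//             log.info("clusterRoutersOfSameType size: {}", clusterRoutersOfSameType.size())
//     )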

Also try to make sense of the numbers: the histogram shows 82 instances of PropertyHostActor and there are ~49000 instances of IntTreePMap. Does that make sense?

In general, if you are memory-constrained like this, try to make sure you use compact representations. Sometimes that means you have to give up a bit of convenience or immutability, because high-level data structures sometimes trade memory size for other properties.
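
As a concrete (untested) sketch of that trade-off, applied to the only state visible in your excerpt: because an actor processes one message at a time, clusterRoutersOfSameType does not have to be a persistent PSet; a plain mutable set owned by the actor is safe and usually more compact (getCousinRouter and the rest of the actor stay as in your excerpt):

// sketch: same add/remove logic with a plain java.util.HashSet instead of a PSet
private final Set<ActorSelection> clusterRoutersOfSameType = new HashSet<>();

private void addToClusterRouters(Member member) {
    final ActorSelection cousinRouter = getCousinRouter(member);
    if (cousinRouter.anchor().path().address().host().isDefined()) {
        clusterRoutersOfSameType.add(cousinRouter);
        log.debug("Added cousinRouter: {} to clusterRoutersOfSameType: {}", cousinRouter, clusterRoutersOfSameType);
    }
}

private void removeFromClusterRouters(Member member) {
    clusterRoutersOfSameType.remove(getCousinRouter(member));
    log.debug("Removed cousinRouter for member: {}", member);
}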

Johannes


Thanks again @jrudolph!

It turns out there is a memory leak within PropertyHostChannelRouter which causes GC to take longer the longer the application runs. I fixed it and am now running tests to see whether the bug is really solved.

Question 2 is answered and part of question 3 is answered too.

I still cannot understand why we have those DEBUG log bursts like:

[DEBUG] [a.a.L.Deserialization] [akka.actor.LocalActorRefProvider.Deserialization] [SELServer-akka.actor.default-dispatcher-28] - Resolve (deserialization) of path [user/SELServer/PropertyHostChannelRouter/VirtualPropertyHost/$Q7#65017950] doesn't match an active actor. It has probably been stopped, using deadLetters.

Is this normal behaviour that can be accepted? Note that the number of such logs increases over time. I don't understand much about the a.a.L.Deserialization internals.

It is probably logged when the local node receives a message from another node that is addressed to an ActorRef whose actor is not alive any more.

You could subscribe to DeadLetters to see what kind of messages end up in there.
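
For example, a minimal sketch of such a listener (not specific to your app), which subscribes itself to DeadLetter events on the event stream and logs them:

import akka.actor.AbstractActor;
import akka.actor.DeadLetter;
import akka.actor.Props;
import akka.event.Logging;
import akka.event.LoggingAdapter;

public class DeadLetterListener extends AbstractActor {

    private final LoggingAdapter log = Logging.getLogger(getContext().getSystem(), this);

    public static Props props() {
        return Props.create(DeadLetterListener.class, DeadLetterListener::new);
    }

    @Override
    public void preStart() {
        // receive a copy of every message that ends up in dead letters
        getContext().getSystem().eventStream().subscribe(getSelf(), DeadLetter.class);
    }

    @Override
    public Receive createReceive() {
        return receiveBuilder()
                .match(DeadLetter.class, dl ->
                        log.info("Dead letter: {} from {} to {}",
                                dl.message(), dl.sender(), dl.recipient())
                )
                .build();
    }
}

Started with system.actorOf(DeadLetterListener.props(), "deadLetterListener"), it would log the message, sender and intended recipient of everything that ends up in dead letters, which should tell you who is still sending to the stopped actors.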

Johannes
