LocalActorRef memory leak and Full GC (Allocation Failure) causing GC pauses that take too long

Hi

We use Akka Cluster with Persistence, version 2.6.6 (Scala 2.13), and our application runs on Raspberry Pi 3s (compute module).

With -Xmx set to 128m for the JVM, it takes about 26 hours until the heap has almost no space left. There is no OutOfMemoryError, but the max GC pause grows to 2 sec 339 ms and a Full GC happens every minute (analyzed using GCeasy).
This is not acceptable because we have set up 1-second timeouts in our actors, which means they always time out whenever the application is paused for more than 1 second.
Goal: The max GC pause should not exceed 700 millis

Increasing the heap size from 128m to 256m makes it even worse, as expected (the max GC pause is then 16 sec 48 ms!).

Heap dump analysis with Eclipse's Memory Analyzer Tool (MAT) points to a memory problem with an instance of the class akka.actor.LocalActorRef.
One of our actors, called “propertyHostChannel”, holds an ActorRef instance (a LocalActorRef) which occupies 42.22% of the heap, and this gets worse the longer the application runs.

Additionally, we can see per-minute bursts of ~400 (!) debug logs like:

[DEBUG] [a.a.L.Deserialization] [akka.actor.LocalActorRefProvider.Deserialization] [SELServer-akka.actor.default-dispatcher-28] - Resolve (deserialization) of path [user/SELServer/PropertyHostChannelRouter/VirtualPropertyHost/$Q7#65017950] doesn't match an active actor. It has probably been stopped, using deadLetters.

Possibly relevant code excerpt from the ChannelRouterActor (the actor behind the “propertyHostChannel” instance):

private PSet<ActorSelection> clusterRoutersOfSameType = HashTreePSet.empty();

@Override
public Receive createReceive() {
    return receiveBuilder()
            .match(AddIdWithPropsToRegistry.class, this::onAddIdWithPropsToRegistry)
            .match(RemoveIdWithPropsFromRegistry.class, this::onRemoveIdWithPropsFromRegistry)
            .match(SendTo.class, this::onSendTo)
            .match(SendToCluster.class, this::onSendToCluster)
            // cluster events
            .match(ClusterEvent.CurrentClusterState.class, this::onCurrentClusterState)
            .match(ClusterEvent.MemberUp.class, mUp ->
                    addToClusterRouters(mUp.member())
            )
            .match(ClusterEvent.ReachableMember.class, reachableMember ->
                    addToClusterRouters(reachableMember.member())
            )
            .match(ClusterEvent.UnreachableMember.class, unreachableMember ->
                    removeFromClusterRouters(unreachableMember.member())
            )
            .match(ClusterEvent.MemberLeft.class, memberLeft ->
                    removeFromClusterRouters(memberLeft.member())
            )
            .match(ClusterEvent.MemberDowned.class, memberDowned ->
                    removeFromClusterRouters(memberDowned.member())
            )
            .build();
}

private void addToClusterRouters(Member member) {
    final ActorSelection cousinRouter = getCousinRouter(member);
    if (cousinRouter.anchor().path().address().host().isDefined()) {
        clusterRoutersOfSameType = clusterRoutersOfSameType.plus(
                cousinRouter
        );
        log.debug("Added cousinRouter: {} to clusterRoutersOfSameType: {}", cousinRouter, clusterRoutersOfSameType);
    }
}

private void removeFromClusterRouters(Member member) {
    final ActorSelection cousinRouter = getCousinRouter(member);
    clusterRoutersOfSameType = clusterRoutersOfSameType.minus(
            cousinRouter
    );
    log.debug("Removed cousinRouter: {} from clusterRoutersOfSameType: {}", cousinRouter, clusterRoutersOfSameType);
}

private ActorSelection getCousinRouter(Member member) {
    final String name = getContext().getSelf().path().name();

    // each node has a ChannelRouterActor at "/user/SELServer/" + name, so select that path
    return getContext().actorSelection(member.address() + "/user/SELServer/" + name);
}

Questions

  1. Why do we have bursts of the mentioned DEBUG logs?
  2. Why is there one instance of LocalActorRef which occupies so much heap?
  3. How can we get rid of the DEBUG log bursts, and how can we fix the memory issue so GC does not take more than 700 ms?

Please let us know if you need more information.

Thanks

Hi @SF,

Regarding the heap memory issue: Akka's ActorSystem organizes actors in a hierarchy with the root guardian at the top. In a typical Akka application most data is referenced by some actor, and parent actors reference their child actors. So, in summary, the guardian actor should reference most of the heap basically by definition.

That said, there could of course also be a single user actor which references all the data; you will have to check that in detail.

You can check which actor (or ActorRef) you are looking at in MAT by inspecting the field values of the LocalActorRef in question, in particular the path field, or the _actor field inside the cell, which contains the actual Actor instance of the actor in question.

Another good strategy is to use the histogram view instead of the hierarchy view to get a first idea of what could be wrong. A common (mis)use of heap is “overflowing” actor mailboxes, which can happen when an actor isn't able to process its (unbounded) mailbox fast enough to keep up with incoming messages.
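
To make that concrete, here is a contrived, minimal sketch of that failure mode (all class and message names are made up, not taken from your app): a consumer on the default unbounded mailbox that cannot keep up with the rate at which messages arrive, so the mailbox, and with it the heap, keeps growing:

import akka.actor.AbstractActor;
import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;

public class SlowConsumer extends AbstractActor {

    public static Props props() {
        return Props.create(SlowConsumer.class, SlowConsumer::new);
    }

    @Override
    public Receive createReceive() {
        return receiveBuilder()
                .match(String.class, msg ->
                        // simulate work that is slower than the arrival rate
                        Thread.sleep(50)
                )
                .build();
    }

    public static void main(String[] args) {
        final ActorSystem system = ActorSystem.create("mailbox-demo");
        final ActorRef consumer = system.actorOf(SlowConsumer.props(), "slowConsumer");
        // messages are produced far faster than they are consumed, so the
        // (unbounded) default mailbox accumulates them on the heap
        for (int i = 0; i < 1_000_000; i++) {
            consumer.tell("work-" + i, ActorRef.noSender());
        }
    }
}

A bounded mailbox or some form of backpressure between producer and consumer avoids this kind of unbounded growth.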

Hope that helps.
Johannes


Thanks a lot for your reply @jrudolph!
I'll investigate your suggestions and post insights here.

Best,
Simon

Hi @jrudolph

Checking the heap dump (of the running app, not yet in trouble) shows that it is the actor called PropertyHostChannelRouter.

I don't feel comfortable with the ChannelRouterActor code posted above; might the error be in there?
For reference, PropertyHostChannelRouter is an instance of ChannelRouterActor.

I'm stuck on the problem; does anyone else have a different approach?

Try sorting the histogram by shallow heap. Retained heap counts include each other when one object refers to another. In particular, in an Akka application most objects will be referenced by one actor or another, so it's no wonder the actors turn up there.

It seems the PCollections data structures are using most of the heap. They probably contain the state of the application? I only see clusterRoutersOfSameType in your code. Check that these collections don't grow too big.
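
For example, a small sketch (the LogSizes message class is made up, and this assumes imports for java.time.Duration and akka.actor.ActorRef) of how ChannelRouterActor could periodically log the size of its state collections, so unbounded growth shows up in the logs long before the heap fills up:

// marker message the actor sends to itself
public static final class LogSizes {}

@Override
public void preStart() {
    // if preStart already subscribes to cluster events, add this scheduling there;
    // consider cancelling the returned Cancellable in postStop()
    getContext().getSystem().scheduler().scheduleWithFixedDelay(
            Duration.ofMinutes(5), Duration.ofMinutes(5),
            getSelf(), new LogSizes(),
            getContext().getDispatcher(), ActorRef.noSender());
}

// and an additional case in createReceive():
//     .match(LogSizes.class, msg ->
//             log.info("clusterRoutersOfSameType size: {}", clusterRoutersOfSameType.size())
//     )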

Also try to make sense of the numbers: the histogram shows 82 instances of PropertyHostActor and there are ~49000 instances of IntTreePMap. Does that make sense?

In general, if you are memory-constrained like this, try to make sure you use compact representations. Sometimes that means you have to give up a bit of convenience or immutability, because high-level data structures sometimes trade memory size for other properties.
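
As a concrete (untested) sketch of that trade-off, applied to the only state visible in your excerpt: because an actor processes one message at a time, clusterRoutersOfSameType does not have to be a persistent PSet; a plain mutable set owned by the actor is safe and usually more compact (getCousinRouter and the rest of the actor stay as in your excerpt):

// sketch: same add/remove logic with a plain java.util.HashSet instead of a PSet
private final Set<ActorSelection> clusterRoutersOfSameType = new HashSet<>();

private void addToClusterRouters(Member member) {
    final ActorSelection cousinRouter = getCousinRouter(member);
    if (cousinRouter.anchor().path().address().host().isDefined()) {
        clusterRoutersOfSameType.add(cousinRouter);
        log.debug("Added cousinRouter: {} to clusterRoutersOfSameType: {}", cousinRouter, clusterRoutersOfSameType);
    }
}

private void removeFromClusterRouters(Member member) {
    clusterRoutersOfSameType.remove(getCousinRouter(member));
    log.debug("Removed cousinRouter for member: {}", member);
}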

Johannes


Thanks again @jrudolph!

It turns out there is a memory leak within PropertyHostChannelRouter which causes GC to take longer the longer the application runs. I fixed it and am now running tests to see whether the bug is really solved.

Question 2 is answered and part of question 3 is answered too.

I still cannot understand why we have those DEBUG log bursts like:

[DEBUG] [a.a.L.Deserialization] [akka.actor.LocalActorRefProvider.Deserialization] [SELServer-akka.actor.default-dispatcher-28] - Resolve (deserialization) of path [user/SELServer/PropertyHostChannelRouter/VirtualPropertyHost/$Q7#65017950] doesn't match an active actor. It has probably been stopped, using deadLetters.

Is this normal behaviour that can be accepted? Note that the number of such logs increases over time. I don't understand much about the a.a.L.Deserialization internals.

It is probably logged when the local node receives a message from another node that is addressed to an ActorRef whose actor is not alive any more.

You could subscribe to DeadLetters to see what kind of messages end up in there.
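
For example, a minimal sketch of such a listener (not specific to your app), which subscribes itself to DeadLetter events on the event stream and logs them:

import akka.actor.AbstractActor;
import akka.actor.DeadLetter;
import akka.actor.Props;
import akka.event.Logging;
import akka.event.LoggingAdapter;

public class DeadLetterListener extends AbstractActor {

    private final LoggingAdapter log = Logging.getLogger(getContext().getSystem(), this);

    public static Props props() {
        return Props.create(DeadLetterListener.class, DeadLetterListener::new);
    }

    @Override
    public void preStart() {
        // receive a copy of every message that ends up in dead letters
        getContext().getSystem().eventStream().subscribe(getSelf(), DeadLetter.class);
    }

    @Override
    public Receive createReceive() {
        return receiveBuilder()
                .match(DeadLetter.class, dl ->
                        log.info("Dead letter: {} from {} to {}",
                                dl.message(), dl.sender(), dl.recipient())
                )
                .build();
    }
}

Started with system.actorOf(DeadLetterListener.props(), "deadLetterListener"), it would log the message, sender and intended recipient of everything that ends up in dead letters, which should tell you who is still sending to the stopped actors.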

Johannes
