[Resolved] Akka 2.6.15 ClusterActorRefProvider warning - Error while resolving ActorRef - Wrong protocol of akka.tcp, expected akka

I am in the process of updating one of our clusters from akka-2.5 to akka-2.6, including moving from Netty (classic remoting) to Artery. I am getting the following warning when the cluster starts up, and so far I have been unable to determine what’s causing it or what its full implications are, so I am unsure how to properly fix it.

akkaAddress: akka://{system}@{host-a}:{port-a} 
level: WARN
logger: akka.cluster.ClusterActorRefProvider
message: Error while resolving ActorRef [akka.tcp://{system}@{host-b}:{port-b}/system/sharding/{shard}#{id}] due to [Wrong protocol of [akka.tcp://{system}@{host-b}:{port-b}/system/sharding/{shard}], expected [akka]]

The log appears multiple times for each member of the cluster.

My debugging so far:

In this log, host-a is the source logging the message, and host-b is the destination; both are IPv4 addresses. Sometimes host-b is the same as host-a, so this warning appears for both local and remote destinations. Curiously, even when host-a and host-b are the same, port-a and port-b can be different, and in such cases port-b doesn’t seem to correlate with any port we have configured.

The expected [akka] portion tells me the ArteryTcpTransport is correctly in use on host-a. A similar warning logged by host-b tells me the same is true there. The akka.tcp:// portion suggests that the resolution is somehow being attempted via classic remoting, i.e. an AkkaProtocolTransport wrapping a NettyTransport.

Looking at the source for ClusterActorRefProvider, I see the Error while resolving ActorRef can come from a few places in the RemoteActorRefProvider superclass while constructing a RemoteActorRef. The part that produces akka.tcp in the error message comes from a localAddress, and all but one of those come from an invocation of transport.localAddressForRemote. It’s already established that transport is correctly an ArteryTcpTransport, and its localAddressForRemote always sets the akka protocol. So the erroneous path must go through the remaining location, where the localAddress is passed in: RemoteActorRefProvider.resolveActorRefWithLocalAddress.

This method appears to be used only by classic remoting, via AkkaPduProtobufCodec; following that, I got as far as some code in Remoting.listens that dynamically creates legacy Transport instances. I then found that akka.remote.classic.enabled-transports = ["akka.remote.classic.netty.tcp"] by default. I tried setting akka.remote.classic.enabled-transports = [], but the warning persists.
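For reference, my understanding of the relevant settings is roughly this (a sketch; the values shown are the Akka 2.6 defaults as I understand them, not our exact config):

```hocon
akka.remote {
  artery {
    # Artery is enabled by default in Akka 2.6; classic remoting is
    # only used when this is explicitly set to off.
    enabled = on
    # The tcp transport produces plain akka:// addresses, as opposed
    # to classic remoting's akka.tcp://.
    transport = tcp
  }

  # What I tried, to rule out classic transports being started:
  classic.enabled-transports = []
}
```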

I thought maybe an akka-2.5 artifact was being loaded, but there is no message warning about mixed versions, and no akka-2.5 dependency anywhere in the tree.

We have another cluster for which we have already made this update where the warning does not appear. There are two main differences between them:

  • this one uses akka-persistence-jdbc=4.0.0 while the other does not use akka-persistence at all, and the shard is also a PersistentActor
  • this one uses akka.remote.artery.large-message-destinations for the shard while the other does not use any large message destinations

I tried not including the shard as a large message destination but the warning persists.
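For context, the large-message-destinations setting in question looks something like this (a sketch; the shard name is a placeholder, not our real one):

```hocon
akka.remote.artery {
  # Routes messages for matching paths over a dedicated large-message
  # lane; sharded entities live under /system/sharding/{shard-name}.
  large-message-destinations = [
    "/system/sharding/MyShard*"
  ]
}
```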

This warning does not prevent a cluster from forming. However, because this warning appears for the cluster shard, and the code path that logs it returns an EmptyLocalActorRef instead of a RemoteActorRef, I want to make sure messages sent to the shard are actually delivered to the singleton instance within the cluster and not to a local instance, since we rely on the singleton aspect for correctness. I believe this is the case, but I’ve not been able to follow the usage of the EmptyLocalActorRef far enough to validate that belief.

So my questions are:

  1. What could cause akka-2.6 to use classic remoting at all while artery is enabled?
  2. What are the effects and implications of seeing this message for a cluster shard on startup?
  3. What can I do about it?

In the events you persist, do you include ActorRefs or actor paths somewhere? That could explain why the classic addresses are present, if they come from events stored before the upgrade.

Thanks for the response!

We do not store actor refs or paths in events we create, at least not intentionally. Maybe akka-persistence is adding that behind the scenes?

Actually, asking that question made me go looking through our journal table, and I do see entries with a persistence_id of /sharding/{shard}Coordinator whose messages contain full actor paths; some include akka.tcp, some include just akka, and they also include the /system/sharding/{shard}#{id}. This feels like the likely cause, but we don’t create or store anything like a “Coordinator”, so I have no idea what this is; it must be coming from the Akka libraries.

This also happens during startup, and only then, which is before we process any new messages that would read events from the database. When a message does come in that reads events, they are processed without the warning. So it now looks like akka-persistence is loading information it previously stored, from the database, when the plugin is initialized. It also appears to be examining every message, not just the latest one, because the message with the latest ordering does not contain akka.tcp.

My gut says this coordinator is interacting with the akka.cluster.sharding.state-store-mode = "persistence" setting that we have had for years. I know this setting is deprecated in favor of ddata, but we’ve not assessed what changing it would actually do or impact for us.

So I guess the new questions are:

  • What is this coordinator and why is it stored in the database at all?
  • Is it possible/reasonable to disable storing and loading info about the coordinator?
  • Assuming we can’t disable it, is it possible to make it not examine every message and only use the latest one?
  • What is the impact of failing to resolve something for the coordinator on startup?

That makes sense as the reason you see those warnings.

You can clean those out from your database; there’s a tool for doing that (akka.cluster.sharding.RemoveInternalClusterShardingData): Cluster Sharding • Akka Documentation

The coordinator keeps track of which shards run on which nodes in the cluster. If the node it lives on crashes, it needs to be able to start on a new node and recover its state; that’s why it uses either persistence (storing the state in a database) or ddata (replicating the state across all nodes in the cluster).

I’d recommend that you consider switching over to the ddata store unless you find a good reason to stay with persistence. You can read more about the two state-store modes here: Cluster Sharding • Akka Documentation
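The switch itself is a one-line config change (a sketch; note that any old coordinator data left behind by the persistence store should still be cleaned out separately):

```hocon
akka.cluster.sharding {
  # Coordinator state is replicated with Distributed Data across the
  # cluster instead of being event-sourced to a database.
  state-store-mode = ddata
}
```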

Ah, okay thanks, I see now.

There is a PersistentShardCoordinator, which is a PersistentActor, so it treats all of this as event-sourced data, but without any identifier tying it to the particular cluster incarnation it belongs to. So it starts up and tries to recover state, replaying the snapshot and events, in case the previous shutdown was a crash.

But crashes are super rare, and we shut down the cluster when we deploy, so for us that data will almost always describe previous, no-longer-existent clusters, and recovery will almost always fail since those nodes usually don’t exist anymore. This failure has presumably been happening all along, and we didn’t know about it because there was no protocol mismatch to alert us. The coordinator then just creates new state, since the old state is unrecoverable. Finally, given enough time, a new snapshot would be created, old events containing akka.tcp would no longer be replayed, and the message would go away on its own.

Since this shard information is only relevant per cluster, I think we don’t care about the info in the database at all, and we can safely and trivially move to ddata, then delete anything with a persistence_id starting with /sharding.