Lagom cluster production problem with readside

There are two services, ‘abc-service’ and ‘relation-service’, each with an entity and a ReadSideProcessor. When each one registers to its own seed node, both work well. But when they form a single cluster, with one registering to the other's seed node, an error occurs.

I created the read-side keyspaces and tables manually, so the ReadSideHandler is built without a setGlobalPrepare handler.

error:

[2020-11-04 14:56:17,188] [ERROR] [akka.actor.OneForOneStrategy] [] [abc-service-akka.actor.default-dispatcher-18] - Ask timed out on [Actor[akka://abc-service/user/readSideGlobalPrepare-RetrieveReadProcessor-singletonProxy#-261176397]] after [20000 ms]. Message of type [com.lightbend.lagom.internal.persistence.cluster.ClusterStartupTaskActor$Execute$]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.

akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://abc-service/user/readSideGlobalPrepare-RetrieveReadProcessor-singletonProxy#-261176397]] after [20000 ms]. Message of type [com.lightbend.lagom.internal.persistence.cluster.ClusterStartupTaskActor$Execute$]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
	at akka.pattern.PromiseActorRef$.$anonfun$defaultOnTimeout$1(AskSupport.scala:647)
	at akka.pattern.PromiseActorRef$.$anonfun$apply$1(AskSupport.scala:668)
	at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:476)
	at scala.concurrent.ExecutionContext$parasitic$.execute(ExecutionContext.scala:164)
	at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:358)
	at akka.actor.LightArrayRevolverScheduler$$anon$3.executeBucket$1(LightArrayRevolverScheduler.scala:309)
	at akka.actor.LightArrayRevolverScheduler$$anon$3.nextTick(LightArrayRevolverScheduler.scala:313)
	at akka.actor.LightArrayRevolverScheduler$$anon$3.run(LightArrayRevolverScheduler.scala:265)
	at java.lang.Thread.run(Thread.java:748)

Here is the related config:

abc-service config:

lagom.services = {
  abc-service = "http://127.0.0.1:"${http.port}""
  relation-service = "http://127.0.0.1:9003"
}

akka {
  actor {
    provider = "cluster"
  }
  remote.artery {
    transport = tcp
    canonical {
      hostname = "127.0.0.1"
      port = 2551
    }
  }

  cluster {
    seed-nodes = [
      "akka://"${play.akka.actor-system}"@127.0.0.1:2551"
    ]
  }
  cluster.roles = ["abc-service"]

  cluster.min-nr-of-members = 1
  cluster.role {
    abc-service.min-nr-of-members = 1
    relation-service.min-nr-of-members = 1
  }
  cluster.sharding {
    role = "abc-service"
  }
}

akka {
  discovery.method = config
}

akka.discovery.config.services = {
  abc-service = {
    endpoints = [{
      host = "127.0.0.1"
      port = ${http.port}
    }]
  },

  relation-service = {
    endpoints = [{
      host = "127.0.0.1"
      port = 9003
    }]
  }
}


relation-service:


lagom.services = {
  relation-service = "http://127.0.0.1:"${http.port}""
  abc-service = "http://127.0.0.1:9003"
}

akka {
  actor {
    provider = "cluster"
  }
  remote.artery {
    transport = tcp
    canonical {
      hostname = "127.0.0.1"
      port = 2552
    }
  }

  cluster {
    seed-nodes = [
      "akka://"${play.akka.actor-system}"@127.0.0.1:2551"
    ]
  }
  cluster.roles = ["relation-service"]

  cluster.min-nr-of-members = 1
  cluster.role {
    abc-service.min-nr-of-members = 1
    relation-service.min-nr-of-members = 1
  }
  cluster.sharding {
    role = "relation-service"
  }
}

akka.discovery.config.services = {
  relation-service = {
    endpoints = [{
      host = "127.0.0.1"
      port = ${http.port}
    }]
  },

  abc-service = {
    endpoints = [{
      host = "127.0.0.1"
      port = 9003
    }]
  }
}

AkkaVersion = "2.6.5"

Hi @sampleblood,

it looks like there’s an error in relation-service's application.conf: the seed-nodes entry points at 127.0.0.1:2551, but the port for Akka remote in the relation-service is not 2551 but 2552.

I assume you are deploying each instance of abc-service and relation-service in separate containers/pods. If you are not using isolated containers, then with 127.0.0.1:2551 as a seed node the relation-service would use abc-service as its seed, which is not correct.

Cheers,

@ignasi35
thank you for the quick reply.

Maybe I misunderstood clusters: a cluster does not need to be formed by all services; a single service with its nodes can be a cluster.

So relation-service does not need to join abc-service to form one big cluster; it can be a cluster by itself.

The terminology (cluster, service, …) is overloaded and can lead to confusion.

In a world of microservices, we recommend that each service be an isolated Akka cluster. Communication between different services should then be based on HTTP/gRPC, or be asynchronous over a broker (e.g. Kafka).
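
As a minimal sketch of that recommendation, relation-service's application.conf could list its own remoting address as the seed node instead of abc-service's. The hostname, port and role are taken from the configs posted above. Dropping the abc-service role requirement is an assumption on my part: as far as I know, a role minimum for a role that never joins this cluster would keep the leader from moving members to Up.

```hocon
akka {
  remote.artery {
    transport = tcp
    canonical {
      hostname = "127.0.0.1"
      port = 2552
    }
  }

  cluster {
    # Seed with this service's own address (port 2552), not abc-service's 2551.
    seed-nodes = [
      "akka://"${play.akka.actor-system}"@127.0.0.1:2552"
    ]
    roles = ["relation-service"]

    # Assumption: only require roles that actually exist in this cluster.
    min-nr-of-members = 1
    role {
      relation-service.min-nr-of-members = 1
    }
  }
}
```

abc-service would keep seeding on 2551 in the same way, and the two services would still reach each other through the lagom.services / akka.discovery entries over HTTP.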

@ignasi35 may I ask another question that has confused me a lot? abc-service has an entity and a ReadSideProcessor. When I restart this service, it sometimes logs the following, the CPU is overloaded, and I have to kill the service.

Scheduled sending of heartbeat was delayed. Previous heartbeat was sent [10525] ms ago, expected interval is [1000] ms. This may cause failure detection to mark members as unreachable. The reason can be thread starvation, e.g. by running blocking tasks on the default dispatcher, CPU overload, or GC

But this behaviour is random; sometimes the restart succeeds. When the heartbeat is delayed, the ReadSideProcessor does not process events, while the entity still processes commands normally. So whenever I do not see the read side processing events, I know the restart has failed.

I am not sure whether something is wrong with the read-side timestamp or something else; the restart always succeeds after one hour.

Can you please help?

During startup, if the JVMs are limited in resources, excessive JIT compilation or GC can cause the JVM to stop the world and delay the heartbeat. Another reason could be that there is too much traffic between nodes, causing the heartbeat messages to be delayed.

Try increasing the CPU and memory resources of the JVM.
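
The expected interval of [1000] ms in that warning matches the default akka.cluster.failure-detector.heartbeat-interval of 1 s. If short pauses during startup cannot be avoided, one complementary option is to make the failure detector more tolerant. This is only a sketch: the 10 s value is an arbitrary assumption, and it masks pauses rather than fixing whatever is starving the threads.

```hocon
akka.cluster.failure-detector {
  # Default is 1 s; this is the "expected interval" reported in the warning.
  heartbeat-interval = 1 s

  # Default is 3 s. A larger value tolerates longer GC/JIT pauses before
  # members are marked unreachable, at the cost of slower failure detection.
  acceptable-heartbeat-pause = 10 s
}
```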

Cheers,

@ignasi35
I am not sure whether something is wrong with the read-side timestamp or something else; the restart always succeeds after one hour.

I just went through this again: before 21:00 I restarted over and over and it failed every time; after 21:00 the restart succeeded and the read side processed events normally.

I have tried many times and am almost sure there is a read-side issue related to timestamps and event size. Maybe the Cassandra read side queries events from buckets based on time and event size. I will dig into this.