Handling failure in inter-service communication

There is a common pattern I’ve seen in Lagom for asynchronous communication between services. Say I have two Services A and B. When a change in a persistent entity in Service B needs to be communicated to Service A, there are two approaches:

  1. In Service B subscribe to the event stream of the persistent entity and make calls to the API of Service A to notify the service
  2. In Service A subscribe to a topic of Service B and react to the events about the change

I’ve seen that the second approach is preferred because it more loosely couples the services, e.g., if Service A is down then Service B will continue without error and Service A will catch up when it comes back online.

My question has to do with how to handle errors that require notifying Service B about the failure so that it can take some compensating action.

In the first approach a failure can be handled by the event processor in Service B. Since the processor is in Service B it can send a command to the relevant persistent entity so that it can initiate a compensating action.

The second approach is trickier. The topic processor is in Service A and must communicate the failure back to Service B. My initial thought was to simply make an API call to Service B. However, that undoes the loose coupling we gained by using the second approach. To maintain loose coupling we’d need to use a topic to publish a failure notification event back to Service B. The only way I can see to accomplish that is to have the persistent entity in Service A persist a failure event. That seems inappropriate because failure events aren’t state changes and because it would mix Service B business logic into Service A.

I can see the advantages of the second approach and why it seems to be a common pattern. However, I’m trying to figure out how to handle failure notifications without negating those advantages.

I’d appreciate the community’s experience or insight on this issue.


Failure notifications can be communicated via a Service A topic.
Currently the Lagom Broker API only supports publishing to Kafka through a topic producer. The topic producer guarantees at-least-once semantics by using the persisted event stream.
Implementing a persistent entity just for this is unnecessary overhead in your use case.

You should publish directly to the Service A topic from your subscriber to the Service B topic. Chaining the subscribe and publish operations will guarantee at-least-once semantics for publishing.

For now this cannot be done using the Lagom Broker API, so you will need to implement it using Alpakka Kafka (the underlying implementation of the Lagom Broker API for Kafka).
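To make the chaining idea concrete, here is a minimal in-memory sketch in plain Java (it should port to Scala directly). It is not Alpakka Kafka code: `Publisher`, `consume`, and `run` are hypothetical names, and a list stands in for a topic partition. The only point is the ordering: the offset is committed after the publish succeeds, so a crash between the two steps leads to a redelivery, and possibly a duplicate, rather than a lost notification.

```java
import java.util.ArrayList;
import java.util.List;

final class ChainedPublishSketch {
    /** Hypothetical stand-in for a Kafka producer. */
    interface Publisher {
        void publish(String message) throws Exception;
    }

    /**
     * Reads the log from the last committed offset and publishes one failure
     * notification per message. Returns the new committed offset.
     */
    static int consume(List<String> log, int committedOffset, Publisher out) {
        int offset = committedOffset;
        while (offset < log.size()) {
            try {
                out.publish("failed:" + log.get(offset));
            } catch (Exception e) {
                // Not committed: this message is processed again after a restart.
                return offset;
            }
            offset++; // commit only after the publish has succeeded
        }
        return offset;
    }

    /** Simulates a consumer that crashes once, on the given publish call (0 = never). */
    static List<String> run(List<String> log, int crashOnCall) {
        List<String> published = new ArrayList<>();
        int[] calls = {0};
        Publisher flaky = message -> {
            published.add(message); // the broker accepted the message...
            if (++calls[0] == crashOnCall) {
                // ...but the consumer died before it could commit the offset
                throw new Exception("crash before commit");
            }
        };
        int offset = 0;
        while (offset < log.size()) {
            offset = consume(log, offset, flaky); // restart from the committed offset
        }
        return published;
    }
}
```

With a crash injected on the second publish, `run(List.of("e1", "e2", "e3"), 2)` publishes the notification for `e2` twice, which is why whoever consumes the failure topic should handle duplicates.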

You can check implementation examples here.

Hope this helps.


It’s useful to make a distinction between failures that cause the service to stop functioning correctly and errors that are expected and meaningful within the domain of the service.

Generally, failures should be isolated between services. If service A can’t proceed because, say, the database is down, it should retry until it succeeds and service B doesn’t need to know.

Errors are domain logic: maybe in an airline booking system, service A knows that a flight has been cancelled and needs to notify service B to issue notifications to passengers and refund their tickets. This is absolutely appropriate to model as state changes and error events using the normal mechanisms with persistent entities and topics.

In some cases, you may want to escalate a system failure to a domain error. Maybe the payment gateway service trying to issue a refund for a particular passenger is failing repeatedly and you’d like it to be processed manually by an agent. In that case, you can put failure handling logic in the payment service that retries internally up to a certain threshold and then treats it as a domain event.
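As a rough illustration of that escalation, here is a minimal sketch in plain Java. The names (`RefundEscalationSketch`, `PaymentGateway`, `RefundEvent`) are made up for the example and are not part of Lagom. A real implementation would also back off between attempts and would persist the resulting event through a persistent entity so that it reaches a topic.

```java
final class RefundEscalationSketch {
    /** Hypothetical domain events the payment service could persist and publish. */
    enum RefundEvent { REFUND_ISSUED, MANUAL_REFUND_REQUESTED }

    /** Hypothetical client for the external payment gateway. */
    interface PaymentGateway {
        void refund(String passengerId) throws Exception;
    }

    /**
     * Retries the gateway call up to maxAttempts times. Failures stay inside
     * this service while attempts remain; once the threshold is reached the
     * failure is escalated to a domain event that other services can react to.
     */
    static RefundEvent refund(PaymentGateway gateway, String passengerId, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                gateway.refund(passengerId);
                return RefundEvent.REFUND_ISSUED;
            } catch (Exception e) {
                // System failure: swallow it and try again (no backoff in this sketch).
            }
        }
        return RefundEvent.MANUAL_REFUND_REQUESTED;
    }
}
```

Other services only ever see `REFUND_ISSUED` or `MANUAL_REFUND_REQUESTED`; the individual gateway failures never leave the payment service.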


Thank you for the thoughtful reply. That’s a very useful distinction between errors and failures. My question mixes the two together; I was thinking of errors. I see the value of domain errors in the examples you provided, but I’m still stuck on one thing.

Say that the two services are a user service and a task service. Tasks are small jobs to be done and users claim tasks to complete. A task can only be claimed by a single user. In this system users that are not currently working any tasks can set a flag to try and claim the next available task.

So, a new task is created, the user service receives that event and two different users both enter a pending state attempting to claim the task. The task service will receive two events for the two users; one will succeed in claiming the task and the other will fail with an error because the task is already claimed. In this case the error has to do with the user request to claim the task. The task itself is not in error (unlike a cancelled flight), but we need a way to asynchronously tell the user that their attempt to claim the task met with an error (an acceptable result that it can compensate for).

It still seems odd for the task to persist an event that basically says someone sent me an invalid command. Persisting that event would solve the problem, but I don’t want to solve the problem the wrong way.
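For concreteness, the variant where the task does persist the rejection might look roughly like this. It is a plain Java sketch with an in-memory journal standing in for a persistent entity; `TaskClaimSketch`, `ClaimEvent`, and the event kinds are hypothetical names, not Lagom API.

```java
import java.util.ArrayList;
import java.util.List;

final class TaskClaimSketch {
    enum Kind { TASK_CLAIMED, CLAIM_REJECTED }

    /** Hypothetical event; the user id is the context the user service needs. */
    static final class ClaimEvent {
        final Kind kind;
        final String taskId;
        final String userId;

        ClaimEvent(Kind kind, String taskId, String userId) {
            this.kind = kind;
            this.taskId = taskId;
            this.userId = userId;
        }
    }

    /** In-memory stand-in for the task persistent entity. */
    static final class Task {
        private final String taskId;
        private String claimedBy; // null until the first successful claim
        final List<ClaimEvent> journal = new ArrayList<>();

        Task(String taskId) {
            this.taskId = taskId;
        }

        ClaimEvent claim(String userId) {
            Kind kind;
            if (claimedBy == null || claimedBy.equals(userId)) {
                // First claim wins; a redelivered claim from the current owner
                // is acknowledged again rather than rejected.
                claimedBy = userId;
                kind = Kind.TASK_CLAIMED;
            } else {
                kind = Kind.CLAIM_REJECTED;
            }
            ClaimEvent event = new ClaimEvent(kind, taskId, userId);
            journal.add(event); // would be persisted and published to the task topic
            return event;
        }
    }
}
```

The rejection event carries the id of the user whose claim failed, so the user service can route it to the right entity without a read-side.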

An approach I considered was having the user service react to the successful claim event. However, the successful claim event doesn’t contain any information about the other user that tried to claim the task. I would need a read-side in the user service to track these relationships to figure out which persistent entity to notify. I was trying to avoid that by having the error event path do or generate something because it has the appropriate context.


That’s an interesting approach that bypasses the persistent entities in Service A. I assume that if Kafka is configured to ever delete messages there is a chance that this event will be lost if not consumed in time. Do you know if that is correct?

I’ll have to play around with the examples. My project is all Scala so I’ll need to port the Kotlin. Thanks for the info!

You can configure message retention per topic:

kafka-topics --zookeeper <ip>:2181 --alter --topic <topic name> --config retention.ms=<retention period in milliseconds>

The default is 7 days.

The example is the good work of @ihostage & @lynxpluto. Hope you guys don’t mind me sharing it.

I’m working on showcase projects for Java and Scala myself but haven’t found time to finish them yet. I’ll share them when they’re done.

Really happy to see such references to me and @ihostage )) I think it is useful to treat Kafka topics as a kind of JSON-based API. One service (A) may push messages to a topic that is consumed by another one (B), and B may likewise push to a topic that is consumed by A. Errors may occur while messages are being consumed, and you will probably want to distinguish which of them should cancel or restart consumption and which should not. You may also want to push a compensating event to the same or a different topic and then handle it in a separate consumer group, for example.
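One way to picture that distinction is the plain Java sketch below. `ConsumerErrorPolicySketch`, `Decision`, and `DomainException` are hypothetical names, and a list stands in for the compensation topic. The assumption is that domain errors are signalled with a dedicated exception type and that anything else is a system failure.

```java
import java.util.List;
import java.util.function.Consumer;

final class ConsumerErrorPolicySketch {
    enum Decision { COMMIT, COMPENSATE_AND_COMMIT, RESTART_CONSUMER }

    /** Hypothetical marker for errors that are meaningful in the domain. */
    static final class DomainException extends RuntimeException {
        DomainException(String message) {
            super(message);
        }
    }

    /**
     * Runs the handler for one message and decides what the consumer does next.
     * Domain errors produce a compensating message and let consumption go on;
     * any other runtime failure asks for the consumer to be restarted so the
     * message is delivered again.
     */
    static Decision handle(String message, Consumer<String> handler, List<String> compensationTopic) {
        try {
            handler.accept(message);
            return Decision.COMMIT;
        } catch (DomainException e) {
            compensationTopic.add("compensate:" + message);
            return Decision.COMPENSATE_AND_COMMIT;
        } catch (RuntimeException e) {
            return Decision.RESTART_CONSUMER;
        }
    }
}
```

Keeping the decision as a value makes the policy easy to test separately from whatever stream or consumer library actually commits offsets and restarts.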