I am designing heartbeat monitoring across our microservices, and we are currently undecided between two approaches:
1. A simple ping-pong approach (i.e. MicroserviceB has a ping REST endpoint that MicroserviceC invokes periodically. If the ping succeeds, MicroserviceB is up; if it fails, MicroserviceB is down), or
2. Kafka topics (i.e. MicroserviceB publishes a heartbeat event every X seconds and MicroserviceC listens for it. If MicroserviceC doesn't receive a heartbeat event soon enough, it can infer that MicroserviceB is down).
The advantage of #1 is that it allows external tools (like Nagios or ELK) to monitor those ping endpoints, while the advantage of #2 is that it seems closer to the “Lagom way” of doing things.
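To make option #2 a bit more concrete, here is a rough sketch of how the listening side could keep track of heartbeats. The `HeartbeatTracker` class, its method names and the timeout are made up for illustration (they are not part of Lagom or Kafka), and the actual Kafka subscription is left out. Timestamps are passed in explicitly so the logic is easy to test.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper (not a Lagom or Kafka API): remembers when each
// service was last heard from and decides whether it still counts as up.
final class HeartbeatTracker {
    private final Duration timeout;
    private final Map<String, Instant> lastSeen = new ConcurrentHashMap<>();

    HeartbeatTracker(Duration timeout) {
        this.timeout = timeout;
    }

    // Call this from the topic subscriber whenever a heartbeat event arrives.
    // Keeps the newest timestamp in case events arrive out of order.
    void recordHeartbeat(String serviceName, Instant receivedAt) {
        lastSeen.merge(serviceName, receivedAt,
                (previous, latest) -> latest.isAfter(previous) ? latest : previous);
    }

    // A service counts as alive if its newest heartbeat is at most `timeout` old.
    // A service that never sent a heartbeat is treated as down.
    boolean isAlive(String serviceName, Instant now) {
        Instant seen = lastSeen.get(serviceName);
        return seen != null && !seen.plus(timeout).isBefore(now);
    }
}
```

The timeout would presumably need to be somewhat larger than the X-second publish interval, so that a single delayed heartbeat does not mark the service as down.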
We have a microservice that subscribes to two other microservices and aggregates their information: MicroserviceA publishes an event EventA, MicroserviceB publishes an event EventB, and MicroserviceC subscribes to both. MicroserviceA and MicroserviceB publish events at different intervals, and MicroserviceC needs to find out whether the events it’s getting from them are still valid and up to date, or whether one of those microservices has already gone down. For example:
1. MicroserviceA publishes an event.
2. MicroserviceB publishes an event.
3. MicroserviceC gets the two events, processes them and creates its own event.
4. MicroserviceA publishes a new event.
5. MicroserviceC gets this new event and combines it with the last seen event from MicroserviceB and publishes its own new event.
6. Then MicroserviceA publishes a 3rd event, and MicroserviceB goes down.
7. MicroserviceC should then process the 3rd event of MicroserviceA but not combine it with the data it last got from MicroserviceB because that data is no longer valid.
So the question is: how do we know whether a microservice’s data is still valid (like MicroserviceB’s data in steps #3 and #5) or no longer valid (like in step #7)? We’re addressing this with heartbeat monitoring. If there’s a better way, I’d like to know more.
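For reference, this is roughly the behaviour we want from MicroserviceC in the scenario above, sketched under some simplifying assumptions: event payloads are plain strings, only MicroserviceB is checked, everything runs on a single thread, and the `Aggregator` class with its `onEventA`, `onEventB` and `onHeartbeatB` methods is hypothetical rather than anything Lagom provides.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical aggregator for MicroserviceC. Not thread-safe; a real
// subscriber would need to serialize access to this state.
final class Aggregator {
    private final Duration heartbeatTimeout;
    private String lastEventB;       // last payload seen from MicroserviceB, or null
    private Instant lastHeartbeatB;  // time of the last heartbeat from MicroserviceB, or null

    Aggregator(Duration heartbeatTimeout) {
        this.heartbeatTimeout = heartbeatTimeout;
    }

    void onEventB(String payload) {
        lastEventB = payload;
    }

    void onHeartbeatB(Instant receivedAt) {
        lastHeartbeatB = receivedAt;
    }

    // Combines A's event with B's last event only while B's heartbeat is recent
    // (steps 3 and 5); otherwise A's event is processed on its own (step 7).
    String onEventA(String payload, Instant now) {
        boolean bIsValid = lastEventB != null
                && lastHeartbeatB != null
                && !lastHeartbeatB.plus(heartbeatTimeout).isBefore(now);
        return bIsValid ? payload + "+" + lastEventB : payload;
    }
}
```

In this sketch only explicit heartbeats keep MicroserviceB's data valid. Whether a regular EventB should also count as a sign of life is a design choice we have not settled.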