Heartbeat Monitoring

Hi,

I am currently trying to design heartbeat monitoring across our microservices, and we are undecided whether to go for

  1. A simple ping-pong approach (i.e. MicroserviceB exposes a ping REST endpoint which MicroserviceC periodically invokes; if the ping succeeds, MicroserviceB is up, and if it fails, MicroserviceB is down), or
  2. A Kafka-topic approach (i.e. MicroserviceB publishes a heartbeat event every X seconds, MicroserviceC listens for it, and if it doesn't receive a heartbeat event soon enough it can infer that MicroserviceB is down).

The advantage of #1 is that it allows external tools (like Nagios or the ELK stack) to monitor those ping endpoints, while the advantage of #2 is that it seems closer to the “lagom way” of doing things.
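
For option #2, the subscriber side can be as simple as remembering when it last saw a heartbeat from each service and treating a service as down once that timestamp is too old. Below is a minimal Scala sketch of that idea; `HeartbeatTracker`, its method names and the 30-second window are made up for illustration and are not part of any Lagom or Kafka API - the actual heartbeat events would arrive through whatever topic subscriber you wire up.

```scala
import java.time.{Duration, Instant}
import scala.collection.concurrent.TrieMap

// Consumer side of option #2: remember the last heartbeat seen per service
// and treat a service as down once that heartbeat is older than the allowed
// silence window. HeartbeatTracker and its method names are illustrative,
// not a Lagom or Kafka API.
final class HeartbeatTracker(maxSilence: Duration) {
  private val lastSeen = TrieMap.empty[String, Instant]

  // Call this from whatever subscriber receives the heartbeat events.
  def recordHeartbeat(serviceName: String, at: Instant = Instant.now()): Unit =
    lastSeen.put(serviceName, at)

  // A service counts as up only if a heartbeat arrived within maxSilence.
  def isUp(serviceName: String, now: Instant = Instant.now()): Boolean =
    lastSeen.get(serviceName).exists { seen =>
      Duration.between(seen, now).compareTo(maxSilence) <= 0
    }
}

object HeartbeatTrackerExample extends App {
  val tracker = new HeartbeatTracker(Duration.ofSeconds(30))
  tracker.recordHeartbeat("microservice-b")
  println(tracker.isUp("microservice-b")) // true while the heartbeat is fresh
}
```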

Background Info
We have a microservice which subscribes to two other microservices and aggregates their information: MicroserviceA publishes EventA, MicroserviceB publishes EventB, and MicroserviceC subscribes to both. MicroserviceA and MicroserviceB publish events at different intervals, and MicroserviceC needs to find out whether the events it is getting from them are still valid and up to date, or whether one of those microservices has already gone down.

Example:

  1. MicroserviceA publishes an event.
  2. MicroserviceB publishes an event.
  3. MicroserviceC gets the two events, processes them and creates its own event.
  4. MicroserviceA publishes a new event.
  5. MicroserviceC gets this new event and combines it with the last seen event from MicroserviceB and publishes its own new event.
  6. Then MicroserviceA publishes a 3rd event, and MicroserviceB goes down.
  7. MicroserviceC should then process the 3rd event of MicroserviceA but not combine it with the data it last got from MicroserviceB because that data is no longer valid.

So the question is: how do we know whether the data from a microservice is still valid (like the data from MicroserviceB in steps #3 and #5) or no longer valid (like in step #7)? We are addressing this with heartbeat monitoring, but if there's a better way, I'd like to know more :smiley:
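
To make step #7 concrete, here is a rough Scala sketch of how MicroserviceC could keep the last event it received from MicroserviceB and only combine it with a new EventA while that data is still fresh. The case classes, field names and the freshness window are assumptions for the sake of the example, not our actual model.

```scala
import java.time.{Duration, Instant}

// How MicroserviceC could decide (step #7) whether the last EventB is still
// usable. The case classes, field names and the freshness window are
// assumptions for this example only.
final case class EventA(payload: String, receivedAt: Instant)
final case class EventB(payload: String, receivedAt: Instant)

final class Aggregator(maxAge: Duration) {
  private var lastB: Option[EventB] = None

  def onEventB(e: EventB): Unit = lastB = Some(e)

  // Combine only if B's data is still fresh; otherwise publish an aggregate
  // based on A alone, as in step #7 of the example above.
  def onEventA(a: EventA, now: Instant = Instant.now()): String =
    lastB match {
      case Some(b) if Duration.between(b.receivedAt, now).compareTo(maxAge) <= 0 =>
        s"combined(${a.payload}, ${b.payload})"
      case _ =>
        s"aOnly(${a.payload})" // B's last event is stale or missing
    }
}
```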

Thanks!

@franz
I would say that the key to your solution lies in the way event validation is done.

Microservice C, which does the aggregation, should control the domain of the aggregation. Validating each event, and the aggregation of events itself, should be Microservice C's responsibility.
From my perspective, event validation could be done in one of these ways:

  1. by checking event metadata
  2. by tracking logic in the event stream (time between events, sequence numbers, …)
  3. validation done by the event producer service, i.e. using a service call to the event producer to validate a certain event

#1 is for me the recommended way: model events so that they can be self-validated. #2 can be an extension of #1 if the validation process depends not just on a single event's metadata but on the event stream.
#3 is something that may be needed in some cases (when #1 and #2 are not usable), but it introduces a runtime dependency between the event producer and the subscriber.
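
To illustrate #1 and #2, a small Scala sketch: the event carries a timestamp and a sequence number (assumed fields - your event model may differ), so the subscriber can validate it on its own and, optionally, also track the stream for gaps. This is just an illustration, not Lagom-specific code.

```scala
import java.time.{Duration, Instant}

// Sketch of #1 and #2: the event carries a timestamp and a sequence number
// (assumed fields - your event model may differ), so the subscriber can
// validate it on its own and, optionally, track the stream for gaps.
final case class MeteredEvent(producer: String, sequenceNr: Long,
                              producedAt: Instant, payload: String)

final class EventValidator(validFor: Duration) {
  private var lastSequenceNr = Map.empty[String, Long]

  // #1: the event is self-validating via its own validity window.
  def isFresh(e: MeteredEvent, now: Instant = Instant.now()): Boolean =
    Duration.between(e.producedAt, now).compareTo(validFor) <= 0

  // #2: track the stream itself, e.g. reject events that skip sequence numbers.
  def isInSequence(e: MeteredEvent): Boolean = {
    val inOrder = lastSequenceNr.get(e.producer).forall(_ + 1 == e.sequenceNr)
    if (inOrder) lastSequenceNr += e.producer -> e.sequenceNr
    inOrder
  }
}
```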

Validating an event by checking the event publisher's health check does not sound right to me. Maybe in your case it is appropriate, but I cannot conclude that from your example.

Hope this helps.

Br,
Alan