Akka design advice

Hi everyone,

I am very new to Akka and therefore an absolute beginner. I am trying to get more familiar with it by writing a small personal project, and I am wondering how to do it “the right way”.

The basic idea is to build an app that crawls some data. For the crawling part I built three actors:

  1. ProxyActor
    This actor is instantiated every time a new proxy comes up. It maintains the state of the proxy it holds and frequently checks whether the proxy is still alive.

  2. ProxiesActor
    This actor checks for new proxies and spawns new ProxyActors accordingly. It is meant to be a singleton within the ActorSystem and may be asked for a live proxy (by querying all ProxyActors with tail chopping).

  3. CrawlerActor
    This actor is instantiated by any actor that needs to crawl a website. It needs access to a live proxy and therefore needs to talk to ProxiesActor.

Unfortunately, the ProxiesActor/ProxyActor are spawned under a dedicated branch (ActorSystem -> ProxiesActor (1) -> ProxyActor (N)), while a CrawlerActor may be spawned on other branches by whichever actor needs it.

So the point where I am stuck is how to obtain a reference to the ProxiesActor within every CrawlerActor the right way. The only thing I found for talking to actors outside your own branch is the Receptionist pattern. Is this the way to go here, or am I missing something simpler or better in the context of modelling actors?

If the Receptionist pattern is the right choice, I am wondering whether it is the only way and how it is meant to be used. Do I understand correctly that the only correct way to obtain a ProxiesActor reference is to subscribe to the Receptionist (as CrawlerActor) and swap the internal reference to the ProxiesActor every time I get an update from the Receptionist? I was also wondering why there is no mechanism to register for specific actor events in this case (like Created or Removed), only a simple query like servicesWereAddedOrRemoved, which tells me nothing about the concrete actors returned by getAllServiceInstances.

Cheers,
Aleks

It is quite hard to say whether the structure is good or bad. You will have to think carefully about how the application benefits from the structure with regard to what can happen in parallel/concurrently and which parts can fail in isolation. Figuring out good structures in a paradigm that is new to you is always hard at first, and it often takes a couple of tries before you get a feeling for which choices will benefit your application and which will only complicate things. I’d recommend that you draw up a couple of different possible designs and think about the pros and cons of each.

For the concrete question about receptionist/looking actors up:

When one actor needs the service of another actor, you can either see it as a service that it owns and is responsible for the lifecycle of. In that case it makes sense to make it a child that is not shared with other actors.

Or it is a service provided by one or more actors living elsewhere. In that case it often makes sense to know as little as possible about the actor providing the service, only the protocol to interact with it, and to use a routing actor: either a built-in router, or a parent actor of your own design that routes messages to children whose lifecycle it controls. You then inject that parent’s ActorRef into the actor needing the service.

Thanks for your reply Johan, really appreciate it!

Will do. I just thought there might be some DO or DON’T advice from someone like you regarding the current structure.

> When one actor needs the service of another actor you can either see it as a service that it owns and is responsible for the lifecycle of. In that case it makes sense to make it a child, that is not shared with other actors.

This one is clear and covered very well by the documentation and introductory videos.

> Or it is a service provided by one or more actors living elsewhere, in that case it often makes sense to know as little as possible about the actor providing the service, only the protocol to interact with it. And use a routing actor, either a built-in router, or have a parent actor of your own design that you let route messages to children that it controls the lifecycle of and inject that parent ActorRef to the actor needing the service.

Yes, that’s the interesting and confusing case, as it is not covered that much in the documentation. So when it comes to looking up other actors living outside the local scope - is the general advice to build something on your own that suits your needs? I’ve just been very confused that this kind of basic need is not covered by something built-in.

I also looked up some other ideas on this topic and found the paypal/squbs project, which provides a custom actor registry that lets you talk to outside actors via their protocol without spawning them as a child.

The built-in routers do cover that out of the box: a pool router will spawn routees as children and forward messages to them using a configurable strategy, while a group router picks up actors registered with the receptionist (on any node if you are using a cluster). Additional, somewhat related tools are Cluster Sharding, which routes all messages to the same “entity” in a cluster based on an id in the message, and Distributed Pub Sub, which is more like a one-to-many router that can be used both locally and in a cluster.

Samples of both kinds of routers can be found in the routing docs, and for lower-level interaction with the receptionist yourself you can look at the Actor Discovery section.

I’d advise, though, that if you do not need the level of flexibility that the receptionist provides, it is probably better to start as minimal as possible: for example, a pool router as the parent of your workers, and then passing the router ActorRef[T] to the actors that need the service. You can always switch to more advanced routing the day you realise you need it.

I’d only go for a custom implementation of a routing parent if you have some very specific requirements for how the messages are routed or the routees managed that are not covered by the existing router API (and the others I have mentioned).
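
To make the "pass the router reference, know only the protocol" idea a bit more concrete, here is a minimal, library-free sketch in plain Java. It is not Akka code and not how Akka implements its routers: `RoundRobinPool`, `CrawlerClient` and the `AcquireProxy` message are hypothetical names made up for illustration, a `Consumer<M>` stands in for an `ActorRef[M]`, and everything runs synchronously on one thread, which real actors do not.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical protocol message: the only thing a client knows about the service.
final class AcquireProxy {
    final String requester;

    AcquireProxy(String requester) {
        this.requester = requester;
    }
}

// Stand-in for a pool-style router: it owns its routees and forwards
// each incoming message to the next one in round-robin order.
final class RoundRobinPool<M> implements Consumer<M> {
    private final List<Consumer<M>> routees;
    private int next = 0;

    RoundRobinPool(List<Consumer<M>> routees) {
        if (routees.isEmpty()) {
            throw new IllegalArgumentException("a pool needs at least one routee");
        }
        this.routees = new ArrayList<>(routees);
    }

    @Override
    public void accept(M message) {
        Consumer<M> target = routees.get(next);
        next = (next + 1) % routees.size();
        target.accept(message);
    }
}

// The client gets the router reference injected and never sees the routees.
final class CrawlerClient {
    private final Consumer<AcquireProxy> proxies;

    CrawlerClient(Consumer<AcquireProxy> proxies) {
        this.proxies = proxies;
    }

    void requestProxy(String crawlerName) {
        proxies.accept(new AcquireProxy(crawlerName));
    }
}
```

The point of the sketch is only the dependency direction: the client depends on the message type and on one injected reference, so the thing behind that reference can later be swapped for a different routing strategy without touching the client.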

Thanks for the great explanation, Johan, and for pointing me to the group routers; it looks like I overlooked them! This seems to be exactly what I want, as I do not need to track the lifecycle of actors registered behind a ServiceKey by subscribing to the receptionist myself.

The group router works perfectly, but I am still wondering how basic communication between actors that need information from another actor is meant to be designed in Akka.

As every actor may only communicate via well-defined message protocols across asynchronous boundaries, I am not quite sure what to do with actors that depend on responses from other actors.

I hope I am able to illustrate it with the following example:

I am wondering how the Crawler should continue processing its initial Crawl message (1) after it gets the Acquire-Proxy message (3) back from the Proxies actor. To continue its work, I would need to route the initial Crawl message (1) in a generic way through the Acquire-Proxy message (2) to the Proxies actor and back through its response, the Acquire-Proxy message (3), so that the Crawler knows what to crawl when it gets the proxy back.

I am asking myself whether this is a common case or a misconception on my side, as I am maybe not thinking the Akka way?

Note that I’m no expert, just a fellow learner myself. But if I understand correctly, you want to do the actual “crawling” work after receiving the (3) Acquired Proxy message. Once that is done, you want to send the result of the crawling work back in (4), and you are wondering how to hook the different message exchanges together?

If so, I think you may want a way to correlate the (3) Acquired Proxy response back to the original (1) crawl message you received.

I think there are different ways to achieve that, but your scenario looks similar to the one described in this section of the docs:

The sample code there shows how the actor receives a command message and then itself sends a message to a back-end (similar to yours to the “Proxies”). It stores a taskId etc. in an “inProgress” data structure, so that it can come back to this “inProgress” work later, upon receipt of the response message, which would correspond to your (3).

In that example code, the taskId is stored together with the replyTo address of the actor to be replied to in the data structure. The replyTo is the sender of the original (1) message, and the taskId is included in (2) and (3) messages. This data is used to essentially correlate the messages together.
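
As a rough illustration of that correlation idea (not the code from the docs, and not Akka code at all), here is a small plain-Java sketch. All names in it (`CrawlRequest`, `AcquireProxyMsg`, `ProxyAcquiredMsg`, `CorrelatingCrawler`) are made up, `replyTo` is just a string standing in for an actor reference, and the reply in (4) is modelled by appending to a list instead of sending a message.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Message (1): what to crawl and whom to answer (replyTo is a stand-in for an ActorRef).
final class CrawlRequest {
    final String url;
    final String replyTo;

    CrawlRequest(String url, String replyTo) {
        this.url = url;
        this.replyTo = replyTo;
    }
}

// Message (2): request to the proxies service, carrying only the correlation id.
final class AcquireProxyMsg {
    final long taskId;

    AcquireProxyMsg(long taskId) {
        this.taskId = taskId;
    }
}

// Message (3): response from the proxies service, echoing the same id.
final class ProxyAcquiredMsg {
    final long taskId;
    final String proxy;

    ProxyAcquiredMsg(long taskId, String proxy) {
        this.taskId = taskId;
        this.proxy = proxy;
    }
}

final class CorrelatingCrawler {
    // Original requests that are still waiting for a proxy, keyed by taskId.
    private final Map<Long, CrawlRequest> inProgress = new HashMap<>();
    private long nextTaskId = 0;

    // Stand-in for sending message (4) back to replyTo.
    final List<String> results = new ArrayList<>();

    // Handles (1): remember the request, return the message (2) to send out.
    AcquireProxyMsg onCrawl(CrawlRequest request) {
        long taskId = nextTaskId++;
        inProgress.put(taskId, request);
        return new AcquireProxyMsg(taskId);
    }

    // Handles (3): look up the original request by taskId and finish the work.
    void onProxyAcquired(ProxyAcquiredMsg response) {
        CrawlRequest original = inProgress.remove(response.taskId);
        if (original == null) {
            return; // unknown or duplicate response, nothing to continue
        }
        results.add(original.replyTo + ": crawled " + original.url + " via " + response.proxy);
    }

    int pending() {
        return inProgress.size();
    }
}
```

With this shape the Proxies side never needs to know anything about crawling: it only has to echo the taskId, and responses can arrive in any order.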