Resolve 1000+ DNS with async-dns without OOM error

Is it possible to resolve 1000+ domains via akka async-dns?

In our actor we have the following line:

(IO(Dns) ? DnsProtocol.Resolve(domain, ipRequestType(ipv4 = true, ipv6 = true)))(1 second)

Even when it is wrapped in try/catch logic with a recovery section added, we get the following issue:

19:08:48.140 [grpcTestServer-akka.actor.default-dispatcher-105] ERROR akka.io.dns.internal.AsyncDnsResolver - Resolve failed. Trying next name server
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://grpcTestServer/system/IO-DNS/async-dns/$c/$a#1143717252]] after [500 ms]. Message of type [akka.io.dns.internal.DnsClient$Question6] was sent by [Actor[akka://grpcTestServer/system/IO-DNS/async-dns/$c#281115170]]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
	at akka.pattern.PromiseActorRef$.$anonfun$defaultOnTimeout$1(AskSupport.scala:648)
	at akka.pattern.PromiseActorRef$.$anonfun$apply$1(AskSupport.scala:669)
	at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:202)
	at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:875)
	at scala.concurrent.BatchingExecutor.execute(BatchingExecutor.scala:113)

In short, we don’t care if some of the lookups fail. We return a mock response in the recovery section.
But the AskTimeoutExceptions slow the app down extremely, and after a while we get an OOM error.

akka configuration:

akka.io.dns.resolver = async-dns
akka.io.dns.async-dns.provider-object = "akka.io.dns.internal.AsyncDnsProvider"
akka.io.dns.async-dns.resolve-timeout = 0.5s
akka.io.dns.async-dns.positive-ttl = forever
akka.io.dns.async-dns.negative-ttl = 5m

We have the above issue even if resolve-timeout is set to 150 sec.
But via Wireshark I could see successful lookups (requests/responses).

It doesn’t matter what the heap size is. With a bigger heap, the app still fails or hangs forever, just later.

The app runs in Docker with the following JAVA_OPTS:

-XX:+AggressiveOpts
-Xms3g
-Xmx3g

Docker is limited to 4g of memory and 1 CPU.
A larger number of CPUs does not help.

Have a look at the first error. From the moment that error happens, it’s possible the failure snowballs.

But via Wireshark I could see successful lookups (requests/responses).

In Wireshark you will see successful requests/responses, but it’s possible there’s a circuit breaker in the implementation preventing a flood downstream.

My point is: your OOME is apparently a consequence of accumulated exceptions. Instead of investigating the OOME, investigate what is causing the exceptions in the first place. Then review your implementation and add backpressure so that, as soon as the DNS server starts to fall behind under the avalanche of requests, you slow down (or even halt) your requests.

In Python I could do the required number of lookups per second.
The DNS server is extremely fast.

My question is how to get rid of the AskTimeouts in Akka when I’m doing DNS lookups.
I don’t care if a lookup has failed, I just want to continue with the next one.

investigate what is causing the exceptions in the first place

line 105: https://github.com/akka/akka/blob/0e4d41ad33dbeb00b598cb75b4d29899371bdc8c/akka-actor/src/main/scala/akka/io/dns/internal/AsyncDnsManager.scala

Use the APIs that go via the cache without an Ask. Docs: https://doc.akka.io/docs/akka/current/io-dns.html

But that assumes that the cache is useful for the use case and also only returns entries from the cache. If you really want to run queries against a DNS server it doesn’t help, or does it?

Created https://github.com/akka/akka/pull/29376 to remove the stack trace from the exception which might help improve throughput in situations where timeouts are expected and frequent.

In any case, @ignasi35’s answer is also correct. Asynchronous systems are usually quite efficient and it can be easy to overload backend systems. So you must make sure to throttle requests to external systems to a reasonable rate (also for your own sake or you will trigger DDOS protection on the target (DNS) servers).

In your case, it seems you even overloaded your own system by not being able to handle all the timeouts which would have made them eligible for GC.

The easiest way to add throttling would be to put the hostnames to query into an Akka Streams Source and call the DNS resolver through a mapAsync with a reasonable parallelism setting. Then maybe even add a throttle stage to limit overall throughput.
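A minimal sketch of that suggestion, assuming Akka 2.6 (where an implicit ActorSystem is enough to run a stream). The object name `ThrottledDnsLookup`, the helper names `resolve` and `lookupAll`, and the limits of 64 in-flight lookups and 500 lookups per second are illustrative assumptions, not values from this thread. Failed or timed-out lookups are mapped to `None` so the stream simply continues with the next domain.

```scala
import akka.actor.ActorSystem
import akka.io.{Dns, IO}
import akka.io.dns.DnsProtocol
import akka.pattern.ask
import akka.stream.scaladsl.{Sink, Source}
import akka.util.Timeout

import scala.collection.immutable
import scala.concurrent.Future
import scala.concurrent.duration._

object ThrottledDnsLookup {
  // Hypothetical limits: tune them to what the resolver and the JVM can sustain.
  val MaxInFlight = 64
  val MaxPerSecond = 500

  // Ask the DNS extension for A and AAAA records.
  // A timeout or any other failure becomes None instead of a failed Future.
  def resolve(domain: String)(implicit system: ActorSystem): Future[Option[DnsProtocol.Resolved]] = {
    import system.dispatcher
    implicit val timeout: Timeout = Timeout(1.second)
    (IO(Dns) ? DnsProtocol.Resolve(domain, DnsProtocol.Ip(ipv4 = true, ipv6 = true)))
      .mapTo[DnsProtocol.Resolved]
      .map(Option(_))
      .recover { case _: Exception => None }
  }

  // Backpressured pipeline: throttle limits how many lookups start per second,
  // mapAsync limits how many are outstanding at once and keeps the input order.
  def lookupAll[A](domains: immutable.Seq[String])(lookup: String => Future[A])(
      implicit system: ActorSystem): Future[immutable.Seq[A]] =
    Source(domains)
      .throttle(MaxPerSecond, 1.second)
      .mapAsync(MaxInFlight)(lookup)
      .runWith(Sink.seq)
}
```

With an implicit ActorSystem in scope, this could be called as `ThrottledDnsLookup.lookupAll(domains)(d => ThrottledDnsLookup.resolve(d))`. For a long-running service, a sink that processes results one at a time (for example `Sink.foreach`) would avoid collecting every result in memory the way `Sink.seq` does.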

Thanks for support.
I’ll take a look into akka stream and mapAsync.

In your case, it seems you even overloaded your own system by not being able to handle all the timeouts which would have made them eligible for GC.

As I’m new to Scala, I will take your word for it.
The only thing is, I could resolve the required number of domains in Python on multiple cores. But thanks for pointing me in this direction. I will check one more time.

Can you try again with Akka 2.6.8? We now removed the stack trace from the exception which should speed up things and also use less memory.

Tested Akka 2.6.8 with
Xmx=1g Xms=1g, Docker limited to 1 CPU and 2G max memory.
Still seeing lots of “Ask timed out” via JProfiler.
No errors in STDOUT, which is good.

But if I set Xms and Xmx to 256m and
limit Docker to 1 CPU and 1G memory,
I see very few "Ask timed out"s, or none at all.

Also, with the older version of Akka it crashed after 5 minutes of load with the above configuration.
With Akka 2.6.8 it survived at least 30 minutes.
From a GC perspective everything still looks very sad.


Looks like it will fail with 256m heap size after some time because of DNS cache.

The cache in Akka’s DNS that @chbatey mentioned above will keep the data for as long as the TTL in the resolved data indicates. If you are resolving a big number of domains and all have a rather large TTL (e.g. hours or even days), then you’re filling your memory.

You can tweak some settings to overwrite the TTL values received or even disable caching.
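As a sketch of what that tweak might look like, using the same `akka.io.dns.async-dns` keys as the configuration posted earlier in this thread (the 30s value is only an illustrative assumption):

```
# Cap how long successful lookups stay cached, regardless of the record's own TTL.
akka.io.dns.async-dns.positive-ttl = 30s
# Or do not cache successful lookups at all:
# akka.io.dns.async-dns.positive-ttl = never

# Do not cache failed lookups.
akka.io.dns.async-dns.negative-ttl = never
```

Note that the configuration posted above sets positive-ttl to forever, which places no upper bound on how long entries stay cached.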

Cheers,

It’s only my guess. I’m not sure. I could try with only one domain.

But after 70 minutes Akka hangs forever.

@devrivne, it’s really hard to tell what’s going on without seeing any code or having more details wrt the data you are managing.

(IO(Dns) ? DnsProtocol.Resolve(domain, ipRequestType(ipv4 = true, ipv6 = true)))(1 second)

For now, no processing is needed. I just want to make a number of requests and see if it survives prolonged load.

This is almost all the code. How to feed data (domains) is up to you.
The configuration is described above.

I’m running an HTTP or gRPC server in order to feed data with any free hammer tool (a load generator like ghz or JMeter).

Load from 500 domains/sec up to 1500 domains/sec
Even reading Alexa 1M from file and feeding domains with above throttling would be good for test.

I’ll make one more test with disabled DNS cache.

Disabled cache

Did you also collect another memory profile?

Unfortunately, no.
I could rerun and gather needed stats.
What are you interested in?

The removal of the stack trace in the 2.6.8 release shifts this from an OOME in minutes to an OOME after an hour, on a rail test.

These were taken with Akka 2.6.6.

The test rig is Akka gRPC receiving inbound requests. Part of the service requirements is to make a DNS call to resolve the IP address(es) for hostnames. This needs to occur at a particular high speed. As we operate the DNS resolvers, there is no fear of DDoS protection kicking in, or having our requests throttled/limited.

I’ve suspected that the JVM wasn’t garbage collecting properly. My research is documented here: Akka Actor OOME workaround.

I could patch in some Akka gRPC code, and include a ghz (tool) example if required.

Even without the 2.6.8 fix, this seems to alleviate the OOME issue, but the stack trace removal is greatly appreciated nevertheless.

Cheers~

OOM after an hour sounds like there is a resource leak, could be in your application or some bug in the DNS subsystem of Akka (could it be that there is no bounding of the cache perhaps?)

Making sure that a heap dump is done on OOM and then looking at what objects are filling up the heap is needed to figure out what is going on (you can start the JVM with -XX:+HeapDumpOnOutOfMemoryError).
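For example, the JAVA_OPTS listed earlier in this thread could be extended roughly as follows. The dump path is a hypothetical location; inside Docker it would need to point at a mounted volume so the file survives the container.

```
-Xms3g
-Xmx3g
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/tmp/heapdump.hprof
```

The resulting .hprof file can then be opened in a heap analyzer to see which object types dominate the heap.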