The ghost domain problem in DNS, and what we're doing about it

Posted by Mojah 5 days ago

Comments

Comment by quuxplusone 1 day ago

From my read of the article, this isn't a problem "in DNS." OP runs an uptime monitoring service that purports to check whether DNS can resolve your domain — but today OP learned that because he's hitting his ISP's recursive resolver, he doesn't notice downtime until the TTL of the previous response expires.

Solution (which everyone else does, and OP has now implemented): don't use your ISP's recursive resolver! Run your own instance of bind9 or whatever, with the cache disabled. Or it seems like `dig +trace` would probably work, too.

"Cached resources remain visible for their whole TTL, even if the original becomes unreachable or changes" seems like one of the very first principles someone should learn when deciding to go into business selling an uptime monitoring service.

It's not a "ghost domain," it's a Time-To-Live field!

Comment by Bender 1 day ago

Adding to this, I would take it a step further. If you use anycast DNS (hundreds or thousands of DNS servers spread around the world), run monitoring probes from many locations that query the authoritative servers directly. That increases the likelihood of catching a cluster that is out of sync (SOA serial mismatch), down, or experiencing packet loss. There are services that do this, or you could just use in-house monitoring from your data centers in each continent and region. Even, and especially, at the biggest DNS providers, anycast clusters can get out of sync for a myriad of reasons, which is why the probe should always check the SOA serial of each zone.
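A rough sketch of the comparison step, assuming the probes have already collected one SOA serial per (location, nameserver) pair. The function name, the input shape, and the sample values are made up for illustration. It also assumes the highest serial seen is the current one, which ignores serial wraparound.

```python
def find_serial_mismatches(observations):
    """Compare SOA serials gathered by probes in different locations.

    observations maps (probe_location, nameserver) -> SOA serial (int),
    or None when the probe got no answer (timeout, packet loss, ...).
    """
    answered = [s for s in observations.values() if s is not None]
    unreachable = sorted(k for k, s in observations.items() if s is None)
    if not answered:
        return {"expected": None, "stale": [], "unreachable": unreachable}

    # Assumption: the highest serial is the newest (no wraparound handling).
    expected = max(answered)
    stale = sorted(
        k for k, s in observations.items() if s is not None and s != expected
    )
    return {"expected": expected, "stale": stale, "unreachable": unreachable}


# Hypothetical probe results for one zone.
observations = {
    ("fra", "ns1.example.com"): 2024010102,
    ("nyc", "ns1.example.com"): 2024010101,  # this cluster lags behind
    ("syd", "ns2.example.com"): None,        # no answer from this one
}
report = find_serial_mismatches(observations)
```

Anything in `stale` or `unreachable` is worth an alert, even when ordinary lookups through a caching resolver still look healthy.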

Comment by toast0 1 day ago

If you disable caching on your recursive resolver, I guess that works...

But I think there is a real issue if caching recursive resolvers don't revalidate delegations. If .com tells you the example.com nameserver is ns1.example.com with some TTL, and ns1.example.com tells you it's the nameserver for example.com every time you ask for www.example.com, you should still check in with the .com nameservers every TTL to validate.

I'm not surprised the default behavior doesn't do that, because when I migrated a high profile domain to new nameservers, we still saw requests to the old servers for more than a month... but it's still ugh, recursive DNS should do better!
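To make the revalidation idea concrete, here is a toy sketch: the child zone may refresh the cached NS set, but never past the deadline set by the parent's delegation TTL. The class and method names are hypothetical, and this is not how any particular resolver is implemented.

```python
class DelegationCache:
    """Toy cache for one zone's NS set (illustrative only)."""

    def __init__(self):
        self.nameservers = None
        self.expires_at = 0.0       # when the cached NS set stops being usable
        self.parent_deadline = 0.0  # hard limit from the parent's delegation TTL

    def store_from_parent(self, nameservers, ttl, now):
        # Only an answer from the parent (e.g. .com) may move the deadline.
        self.nameservers = tuple(nameservers)
        self.parent_deadline = now + ttl
        self.expires_at = self.parent_deadline

    def store_from_child(self, nameservers, ttl, now):
        # The child re-asserting itself must not outlive the parent's TTL.
        if self.nameservers is None or now >= self.parent_deadline:
            return  # nothing valid to refresh; the caller must ask the parent
        self.nameservers = tuple(nameservers)
        self.expires_at = min(now + ttl, self.parent_deadline)

    def lookup(self, now):
        return self.nameservers if now < self.expires_at else None


cache = DelegationCache()
cache.store_from_parent(["ns1.example.com"], ttl=100, now=0)
cache.store_from_child(["ns1.example.com"], ttl=1000, now=90)  # capped at t=100
```

With this rule, a delegation that the parent has removed disappears from the cache within one parent TTL, no matter how often the old nameserver keeps answering.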

Comment by account42 1 day ago

The point is that that TTL is appropriate for regular lookups but not necessarily for an uptime monitor which has different requirements and tradeoffs to consider.

Comment by whatthesmack 1 day ago

I get where you're coming from, but I think the part you're missing is that caching is not optional (per the "with the cache disabled" comment). If you're running a service like theirs, you need to do caching to some degree.

My understanding of their implementation is that it does caching on a per-worker basis with lower TTLs, so it balances caching with accuracy but doesn't fully "fix" visibility into the problem. However, it narrows the window enough that their method of monitoring will very likely expose a pulled domain sooner.

Comment by wahern 1 day ago

Caching is always optional. If you want sophisticated control over caching, or just no caching at all, you can do recursive lookups directly. Some existing DNS libraries can do recursive lookups directly, including libunbound. Ideally you would do caching of TLD nameserver addresses so you're not hammering the root servers for each query. But otherwise I don't think it's necessarily a big deal for a monitoring service to query the .de nameservers every time it checks example.de. If you had a monitor check every 5 minutes for example.de, it's only a 12x load factor between querying .de directly every check versus once per hour. Unless you're AWS Route53, that's not much of a difference relative to the traffic TLD servers handle even if you were checking tens of thousands of domains. Clamping your cache at 1 hour is probably already a much greater load factor. The .de nameservers return a TTL of 1 day for the example.de NS RR sets, so that's already a 24x factor right there. At that point you're clearly giving the .de nameservers the middle finger and relying on your own judgement of what it means to play nice.
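The arithmetic behind those factors, as a quick sketch. The function name is made up, and it simplifies by assuming one upstream query per check when uncached, and one per TTL when the TTL is longer than the check interval.

```python
SECONDS_PER_DAY = 86400


def upstream_queries_per_day(check_interval_s, cache_ttl_s=0):
    """Approximate queries per day sent to the TLD nameservers
    for a single monitored domain."""
    # A cached answer is reused until it expires, so the effective
    # refresh interval is whichever of the two is longer.
    effective_interval = max(check_interval_s, cache_ttl_s)
    return SECONDS_PER_DAY / effective_interval


uncached = upstream_queries_per_day(300)               # check every 5 minutes
hourly = upstream_queries_per_day(300, 3600)           # cache clamped at 1 hour
daily = upstream_queries_per_day(300, SECONDS_PER_DAY) # honor the 1 day TTL

print(uncached / hourly)  # 5 minute checks vs 1 hour clamp
print(hourly / daily)     # 1 hour clamp vs the 1 day TTL
```

The first ratio is the 12x factor and the second is the 24x factor mentioned above.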

Comment by johnhtodd 1 day ago

One approach to solving this for a very limited set of intervals is to actually block namespace that has been removed at the registry level. There is a paper on this from Raffaele Sommese:

https://static.sched.com/hosted_files/icann83/5b/Rafaelle%20...

Quad9 (9.9.9.9) consumes this feed from U Twente of "just deleted" names, as most of them are malicious, and blocking them even if they are NOT malicious causes zero harm. Currently, this is only names that are very short-lived, so may not catch the longer intervals where names are deleted and become ghosts.

Another model using something similar would be to specifically clear those "just-deleted" name cached entries out of the recursive resolver, but that is expensive. Also, with blocking instead of removal it is possible to get high-level metrics on how often those are being abused where NXDOMAIN tracking is not measured in the same dimensions.

(disclaimer: I work for Quad9)

Comment by winstonwinston 1 day ago

Technically every domain is a ‘ghost’ domain until its TTL expires, and NS RRs are usually cached for a very long time.