Some ways DNS can break

When I first learned about it, DNS didn't seem like it should be THAT complicated. Like, there are DNS records, they're stored on a server, what's the big deal?

But with DNS, reading about how it works in a textbook doesn't prepare you for the sheer volume of different ways DNS can break your system in practice. It's not just caching problems!

So I asked people on Twitter for examples of DNS problems they've run into, especially DNS problems that didn't initially appear to be DNS problems (the popular “it's always DNS” meme).

I'm not going to discuss how to solve or avoid any of these problems in this post, but I've linked to webpages discussing the problem where I could find them.

problem: slow network requests

Your network requests are a little bit slower than expected, and it's actually because your DNS resolver is slow for some reason. This might be because the resolver is under a lot of load, or it has a memory leak, or something else.

I've run into this before with my router's DNS forwarder: all of my DNS requests were slow, and restarting the router fixed the problem.
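One way to see whether the slowness is in the lookup is to time name resolution separately from the rest of the request. Here's a minimal sketch of that idea. `timed_lookup` is a made-up helper name, and the `resolve` parameter is only there so the function can be tried without a network; by default it goes through the system resolver via `socket.getaddrinfo`.

```python
import socket
import time


def timed_lookup(hostname, resolve=socket.getaddrinfo):
    """Return (seconds_elapsed, result) for a single name lookup.

    `resolve` is injectable (an assumption made for this sketch) so the
    helper can be exercised with a fake resolver; the default calls the
    system resolver through socket.getaddrinfo.
    """
    start = time.perf_counter()
    result = resolve(hostname, None)
    elapsed = time.perf_counter() - start
    return elapsed, result
```

Calling `timed_lookup("example.com")` a few times in a row gives a rough idea of how much of a request's latency is spent on DNS. Repeated calls may be answered from a cache somewhere along the way, so the first number is usually the interesting one.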

problem: DNS timeouts

A couple of people mentioned network requests that were taking 2+ seconds or 30 seconds because of DNS queries that were timing out. This is sort of the same as “slow requests”, but it's worse because queries can take several seconds to time out.

Sophie Haskins has a great blog post, “Misadventures with Kube DNS”, about DNS timeouts with Kubernetes.

problem: ndots

A few people mentioned a specific issue where Kubernetes sets ndots:5 in its /etc/resolv.conf.

Here's an example /etc/resolv.conf, taken from the article “Kubernetes pods /etc/resolv.conf ndots:5 option and why it may negatively affect your application performances”:


    nameserver 100.64.0.10
    search namespace.svc.cluster.local svc.cluster.local cluster.local eu-west-1.compute.internal
    options ndots:5

My understanding is that if this is your /etc/resolv.conf and you look up google.com, your application will call the C getaddrinfo function, and getaddrinfo will:

  1. look up google.com.namespace.svc.cluster.local.
  2. look up google.com.svc.cluster.local.
  3. look up google.com.cluster.local.
  4. look up google.com.eu-west-1.compute.internal.
  5. look up google.com.

Basically it checks if google.com is actually a subdomain of everything on the search line.

So every time you look up a name like google.com (one with fewer than 5 dots), you need to wait for 4 DNS queries to fail before you get to the actual real DNS query that succeeds.
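The lookup order above can be sketched in a few lines of Python. This is a simplified model of how I understand glibc-style search lists to work, not the real resolver code, and `candidate_names` is a name I made up for the sketch.

```python
def candidate_names(name, search, ndots):
    """List the fully qualified names a resolver would try, in order.

    Simplified model of glibc-style search-list handling (an assumption):
    - a name ending in "." is already absolute and is tried alone
    - a name with at least `ndots` dots is tried as-is first
    - otherwise every search domain is tried before the name itself
    """
    if name.endswith("."):
        return [name]
    as_is = name + "."
    with_search = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        return [as_is] + with_search
    return with_search + [as_is]


if __name__ == "__main__":
    search = [
        "namespace.svc.cluster.local",
        "svc.cluster.local",
        "cluster.local",
        "eu-west-1.compute.internal",
    ]
    for candidate in candidate_names("google.com", search, ndots=5):
        print(candidate)
```

In this model, writing the name with a trailing dot (`google.com.`) skips the search list entirely, which is why the fully qualified form only produces one query.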

problem: it's hard to tell what DNS resolver(s) your system is using

This isn't a bug by itself, but when you run into a problem with DNS, often it's related in some way to your DNS resolver. I don't know of any foolproof way to tell what DNS resolver is being used.

A few things I know:

  • on Linux, I think that most things use /etc/resolv.conf to choose a DNS resolver. There are definitely exceptions though, for example your browser might ignore /etc/resolv.conf and use a different DNS-over-HTTPS service instead.
  • if you're using UDP DNS, you can use sudo tcpdump port 53 to see where DNS requests are being sent. This doesn't work if you're using DNS over HTTPS or DNS over TLS though.
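For the /etc/resolv.conf case, here's a small sketch that pulls the resolver settings out of the file's text. `parse_resolv_conf` is a hypothetical helper, and it only understands the three keywords that appear in this post (nameserver, search, options).

```python
def parse_resolv_conf(text):
    """Extract nameservers, search domains and options from resolv.conf text."""
    config = {"nameservers": [], "search": [], "options": []}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines (starting with "#" or ";").
        if not line or line[0] in "#;":
            continue
        keyword, *values = line.split()
        if keyword == "nameserver" and values:
            config["nameservers"].append(values[0])
        elif keyword == "search":
            config["search"] = values  # a later search line replaces an earlier one
        elif keyword == "options":
            config["options"].extend(values)
    return config
```

On a real machine you'd call it as `parse_resolv_conf(open("/etc/resolv.conf").read())`. This only tells you what the file says: if the nameserver listed there is a local forwarder, the resolver that actually answers your queries is configured somewhere else.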

I also vaguely remember it being even more confusing on macOS than on Linux, though I don't know why.

problem: DNS servers that return NXDOMAIN instead of NOERROR

Here's a problem that I ran into once, where nginx couldn't resolve a domain.

  • I set up nginx to use a specific DNS server to resolve DNS queries
  • when visiting the domain, nginx made 2 queries, one for an A record, and one for an AAAA record
  • the DNS server returned a NXDOMAIN reply for the A query
  • nginx decided “ok, that domain doesn't exist”, and gave up
  • the DNS server returned a successful reply for the AAAA query
  • nginx ignored the AAAA record because it had already given up

The problem was that the DNS server should have returned NOERROR: that domain did exist, it was just that there weren't any A records for it. I reported the bug, they fixed it, and that fixed the problem.

I've implemented this bug myself too, so I understand why it happens. It's easy to think “there aren't any records for this query, I should return an NXDOMAIN error”.
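Here's a sketch of the distinction using a toy in-memory zone. The `answer` function and the zone layout are made up for illustration, and a real server has more cases to think about (wildcards, delegations, and so on), but the core rule is that NXDOMAIN is about the name, not about the record type.

```python
NOERROR = 0   # RCODE 0: the name exists (the answer may still be empty)
NXDOMAIN = 3  # RCODE 3: the name does not exist at all


def answer(zone, name, qtype):
    """Return (rcode, records) for a query against a toy in-memory zone.

    `zone` maps a name to a dict of {record type: [values]}.
    """
    if name not in zone:
        return NXDOMAIN, []
    # The name exists, so the reply is NOERROR even when there are no
    # records of the requested type (an empty "NODATA" answer).
    return NOERROR, zone[name].get(qtype, [])


if __name__ == "__main__":
    zone = {"example.test.": {"AAAA": ["2001:db8::1"]}}
    print(answer(zone, "example.test.", "A"))      # name exists, no A records
    print(answer(zone, "example.test.", "AAAA"))   # name exists, one AAAA record
    print(answer(zone, "missing.test.", "A"))      # name does not exist
```

The buggy version is the one that returns NXDOMAIN whenever the record list for the requested type is empty.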

problem: negative DNS caching

If you visit a domain before creating a DNS record for it, the absence of the record will be cached. This is very surprising the first time you run into it; I only learned about this last year!

The TTL for the cache entry is the TTL of the domain's SOA record. For example, for jvns.ca, it's an hour.
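To make the behaviour concrete, here's a rough sketch of a negative cache. The class and method names are invented for this example. It follows my reading of RFC 2308, where the negative TTL is the smaller of the SOA record's own TTL and the SOA MINIMUM field; the clock is passed in so the example doesn't have to wait an hour.

```python
class NegativeCache:
    """Toy cache that remembers "this record does not exist" for a while."""

    def __init__(self, clock):
        self._clock = clock   # function returning the current time in seconds
        self._expires = {}    # (name, qtype) -> time at which the entry expires

    def remember_missing(self, name, qtype, soa_ttl, soa_minimum):
        # Assumption based on RFC 2308: the negative TTL is the minimum of
        # the SOA record's TTL and its MINIMUM field.
        ttl = min(soa_ttl, soa_minimum)
        self._expires[(name, qtype)] = self._clock() + ttl

    def is_known_missing(self, name, qtype):
        expires = self._expires.get((name, qtype))
        return expires is not None and self._clock() < expires
```

With a one-hour SOA TTL, a lookup made before the record was created keeps returning "missing" from the cache until the hour is up, even if the record was added a minute later.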

problem: nginx caching DNS records forever

If you put this in your nginx config:


    location / {
        proxy_pass https://some.domain.com;
    }

then nginx will resolve some.domain.com once on startup and never again. This is especially dangerous if the IP address for some.domain.com changes infrequently, because it might keep happily working for months and then suddenly break at 2am one day.

There are pretty well-known ways to fix this and this post isn't about nginx so I won't get into it, but it's surprising the first time you run into it.

Here's a blog post with a story of how this happened to someone with an AWS load balancer.

problem: Java caching DNS records forever

Same thing, but for Java: Apparently depending on how you configure Java, “the JVM default TTL [might be] set so that it will never refresh DNS entries until the JVM is restarted.”

I haven't run into this myself but I asked a friend about it who writes more Java than me and they told me that it's happened to them.

Of course, literally any software could have this problem of caching DNS records forever, but the main cases I've heard of in practice are nginx and Java.

problem: that entry in /etc/hosts you forgot about

Another variant on caching issues: entries in /etc/hosts that override your usual DNS settings!

This is extra confusing because dig ignores /etc/hosts, so everything SEEMS like it should be fine (“dig whatever.com is working!”).
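A quick way to rule this out is to look for the hostname in the hosts file directly. Here's a minimal sketch; `hosts_overrides` is a hypothetical helper that works on the file's text so it's easy to try on a sample.

```python
def hosts_overrides(hosts_text, hostname):
    """Return the addresses that hosts-file text assigns to `hostname`."""
    addresses = []
    wanted = hostname.lower()
    for raw_line in hosts_text.splitlines():
        # Everything after "#" is a comment.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        address, *names = line.split()
        if wanted in (name.lower() for name in names):
            addresses.append(address)
    return addresses


if __name__ == "__main__":
    sample = """
    127.0.0.1   localhost
    10.0.0.5    whatever.com www.whatever.com   # that entry you forgot about
    """
    print(hosts_overrides(sample, "whatever.com"))
```

On a real system you'd pass in `open("/etc/hosts").read()`. A non-empty result for a name that dig resolves to something else is a strong hint that the hosts file is what your application is actually seeing.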

problem: your email isn't being sent / is going to spam

The way email is sent and validated is through DNS (MX records, SPF records, DKIM records), so a lot of email problems are DNS problems.

problem: internationalized domain names don't work

You can register domain names with non-ASCII characters or emoji like https://💩.la.

The way this works with DNS is that 💩.la gets translated into xn--ls8h.la with an encoding called “punycode”.
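Python's built-in `punycode` codec is enough to reproduce that translation. This is only a sketch of the encoding step: the `to_ascii` helper is made up for this example, and it skips the normalization and validation that real IDNA handling does before encoding.

```python
def to_ascii(domain):
    """Encode each non-ASCII label of `domain` as "xn--" + punycode.

    Sketch only: real IDNA processing also normalizes and validates
    labels first, which this helper does not attempt.
    """
    labels = []
    for label in domain.split("."):
        if label.isascii():
            labels.append(label)
        else:
            labels.append("xn--" + label.encode("punycode").decode("ascii"))
    return ".".join(labels)


if __name__ == "__main__":
    print(to_ascii("💩.la"))
```

Running it prints `xn--ls8h.la`, which is the name that actually gets sent in the DNS query.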

But even though there's a clear standard for how they should work with DNS, a lot of software doesn't handle internationalized domain names well! There's a fun story about this in Julian Squires' great talk “The emoji that Killed Chrome!!”.

problem: TCP DNS is blocked by a firewall

A couple of people mentioned that some firewalls allow UDP port 53 but not TCP port 53. But large DNS queries need to use TCP port 53, so this can cause weird intermittent problems that are hard to debug.

problem: musl doesn't support TCP DNS

A lot of applications use libc's getaddrinfo to make DNS queries. musl is an alternative to glibc that's used in Alpine Docker containers, and it doesn't support TCP DNS. This can cause problems if you make DNS queries where the response would be too big to fit inside a regular DNS UDP packet (512 bytes).

I'm still a bit fuzzy on this so I might have it wrong, but my understanding of how this can break is:

  1. musl's getaddrinfo makes a DNS query
  2. the DNS server notices that the response is too big to fit in a single DNS response packet
  3. the DNS server returns an empty truncated response, expecting that the client will retry by making a TCP DNS query
  4. musl does not support TCP so it does not retry
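The steps above can be sketched from the client's side. A DNS message starts with a 12-byte header, and the “truncated” signal is the TC bit in its flags field. The function names here are made up, and `next_step` just describes the decision in words rather than doing any networking.

```python
import struct

TC_FLAG = 0x0200  # the "truncated" (TC) bit in the DNS header flags field


def is_truncated(response):
    """Return True if a raw DNS response has the TC bit set."""
    if len(response) < 12:
        raise ValueError("a DNS message starts with a 12-byte header")
    _query_id, flags = struct.unpack("!HH", response[:4])
    return bool(flags & TC_FLAG)


def next_step(response, client_supports_tcp):
    """Describe what a client does with a UDP response (illustrative only)."""
    if not is_truncated(response):
        return "use the answer"
    if client_supports_tcp:
        return "retry the query over TCP"
    return "no retry: the full answer never arrives"
```

A client that supports TCP takes the second branch and gets the full response; the situation described in the list is the third branch.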

A blog post about this: DNS resolution issue in Alpine Linux

problem: round robin DNS doesn't work with getaddrinfo

One way you could approach load balancing is to use “round robin DNS”. The idea is that every time you make a DNS query, you get a different IP address. Apparently this works if you use gethostbyname to make DNS queries, but it does not work if you use getaddrinfo because getaddrinfo sorts the IP responses it receives.
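Here's a toy simulation of why sorting defeats the rotation. Everything in it is made up for illustration: the server class just rotates a list, and plain `sorted()` on the address strings is a stand-in for getaddrinfo's real address-sorting rules, which are more involved.

```python
from collections import deque


class RoundRobinServer:
    """Toy DNS server that rotates its address list on every query."""

    def __init__(self, addresses):
        self._addresses = deque(addresses)

    def query(self):
        answer = list(self._addresses)
        self._addresses.rotate(-1)  # the next query starts one address later
        return answer


def first_as_received(answer):
    # Client that connects to the first address in the order it arrived.
    return answer[0]


def first_after_sorting(answer):
    # Client that sorts the addresses before picking one. sorted() is only
    # a stand-in (an assumption) for getaddrinfo's sorting rules.
    return sorted(answer)[0]


if __name__ == "__main__":
    server = RoundRobinServer(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
    answers = [server.query() for _ in range(3)]
    print([first_as_received(a) for a in answers])
    print([first_after_sorting(a) for a in answers])
```

The first client spreads its connections across all three addresses, while the second one picks the same address every time, no matter how the server orders its answers.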

So you could run into an upsetting problem if you switch from gethostbyname to getaddrinfo behind the scenes without realising that this will break your DNS load balancing.

This is especially insidious because you might not realize that you're switching from gethostbyname to getaddrinfo at all. If you're not writing a C program, those function calls are hidden inside some library, so the switch could be part of a seemingly innocuous upgrade.

Here are a couple of pages discussing this:

problem: a race condition when starting a service

A problem someone mentioned with Kubernetes DNS: they had 2 containers which started simultaneously and immediately tried to resolve each other. But the DNS lookup failed because the Kubernetes DNS change hadn't happened yet, and then the failure was cached so it kept failing.

that's all!

I've definitely missed some important DNS problems here, so I'd love to hear what I've missed. I'd also love links to blog posts that write up examples of these problems. I think it's really useful to see how the problem specifically manifests in practice and how people debugged it.


via: https://jvns.ca/blog/2022/01/15/some-ways-dns-can-break/

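Author: Julia Evans. Topic selection: lujun9972. Translator: (translator ID). Proofreader: (proofreader ID).

This article was originally compiled by LCTT and is proudly presented by Linux China.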