选题[tech]: 20220115 Some ways DNS can break

sources/tech/20220115 Some ways DNS can break.md
This commit is contained in:
DarkSun 2022-01-18 05:03:00 +08:00
parent b7e509d714
commit 5d31b841d9

View File

@ -0,0 +1,206 @@
[#]: subject: "Some ways DNS can break"
[#]: via: "https://jvns.ca/blog/2022/01/15/some-ways-dns-can-break/"
[#]: author: "Julia Evans https://jvns.ca/"
[#]: collector: "lujun9972"
[#]: translator: " "
[#]: reviewer: " "
[#]: publisher: " "
[#]: url: " "
Some ways DNS can break
======
When I first learned about it, DNS didnt seem like it should be THAT complicated. Like, there are DNS records, theyre stored on a server, whats the big deal?
But with DNS, reading about how it works in a textbook doesnt prepare you for the sheer volume of different ways DNS can break your system in practice. Its not just caching problems!
So I [asked people on Twitter][1] for example of DNS problems theyve run into, especially DNS problems that **didnt initially appear to be DNS problems**. (the popular “its always DNS” meme)
Im not going to discuss how to solve or avoid any of these problems in this post, but Ive linked to webpages discussing the problem where I could find them.
### problem: slow network requests
Your network requests are a little bit slower than expected, and its actually because your DNS resolver is slow for some reason. This might be because the resolver is under a lot of load, or it has a memory leak, or something else.
Ive run into this before with my routers DNS forwarder all of my DNS requests were slow, and I restarted my router and that fixed the problem.
### problem: DNS timeouts
A couple of people mentioned network requests that were taking 2+ seconds or 30 seconds because of DNS queries that were timing out. This is sort of the same as “slow requests”, but its worse because queries can take several seconds to time out.
Sophie Haskins has a great blog post [Misadventures with Kube DNS][2] about DNS timeouts with Kubernetes.
### problem: ndots
A few people mentioned a specific issue where Kubernetes sets `ndots:5` in its `/etc/resolv.conf`
Heres an example /etc/resolv.conf from [Kubernetes pods /etc/resolv.conf ndots:5 option and why it may negatively affect your application performances][3].
```
nameserver 100.64.0.10
search namespace.svc.cluster.local svc.cluster.local cluster.local eu-west-1.compute.internal
options ndots:5
```
My understanding is that if this is your `/etc/resolv.conf` and you look up `google.com`, your application will call the C `getaddrinfo` function, and `getaddrinfo` will:
1. look up `google.com.namespace.svc.cluster.local.`
2. look up `google.com.svc.cluster.local.`
3. look up `google.com.cluster.local.`
4. look up `google.com.eu-west-1.compute.internal.`
5. look up `google.com.`
Basically it checks if `google.com` is actually a subdomain of everything on the `search` line.
So every time you make a DNS query, you need to wait for 4 DNS queries to fail before you can get to the actual real DNS query that succeeds.
### problem: its hard to tell what DNS resolver(s) your system is using
This isnt a bug by itself, but when you run into a problem with DNS, often its related in some way to your DNS resolver. I dont know of any foolproof way to tell what DNS resolver is being used.
A few things I know:
* on Linux, I think that most things use /etc/resolv.conf to choose a DNS resolver. There are definitely exceptions though, for example your browser might ignore /etc/resolv.conf and use a different DNS-over-HTTPS service instead.
* if youre using UDP DNS, you can use `sudo tcpdump port 53` to see where DNS requests are being sent. This doesnt work if youre using DNS over HTTPS or DNS over TLS though.
I also vaguely remember it being even more confusing on MacOS than on Linux, though I dont know why.
### problem: DNS servers that return NXDOMAIN instead of NOERROR
Heres a problem that I ran into once, where nginx couldnt resolve a domain.
* I set up nginx to use a specific DNS server to resolve DNS queries
* when visiting the domain, nginx made 2 queries, one for an `A` record, and one for an `AAAA` record
* the DNS server returned a `NXDOMAIN` reply for the `A` query
* nginx decided “ok, that domain doesnt exist”, and gave up
* the DNS server returned a successful reply for the `AAAA` query
* nginx ignored the `AAAA` record because it had already given up
The problem was that the DNS server should have returned `NOERROR` that domain _did_ exist, it was just that there werent any `A` records for it. I reported the bug, they fixed it, and that fixed the problem.
Ive implemented this bug myself too, so I understand why it happens its easy to think “there arent any records for this query, I should return an `NXDOMAIN` error”.
### problem: negative DNS caching
If you visit a domain before creating a DNS record for it, the **absence** of the record will be cached. This is very surprising the first time your run into it I only learned about this last year!
The TTL for cache entry is the TTL of the domains SOA record for example for `jvns.ca`, its an hour.
### problem: nginx caching DNS records forever
If you put this in your nginx config:
```
location / {
proxy_pass https://some.domain.com;
}
```
then nginx will resolve `some.domain.com` once on startup and never again. This is especially dangerous if the IP address for `some.domain.com` changes infrequently, because it might keep happily working for months and then suddenly break at 2am one day.
There are pretty well-known ways to fix this and this post isnt about nginx so I wont get into it, but its surprising the first time you run into it.
Heres a [blog post][4] with a story of how this happened to someone with an AWS load balancer.
### problem: Java caching DNS records forever
Same thing, but for Java: [Apparently][5] depending on how you configure Java, “the JVM default TTL [might be] set so that it will never refresh DNS entries until the JVM is restarted.”
I havent run into this myself but I asked a friend about it who writes more Java than me and they told me that its happened to them.
Of course, literally any software could have this problem of caching DNS records forever, but the main cases Ive heard of in practice are nginx and Java.
### problem: that entry in /etc/hosts you forgot about
Another variant on caching issues: entries in `/etc/hosts` that override your usual DNS settings!
This is extra confusing because `dig` ignores `/etc/hosts`, so everything SEEMS like it should be fine (”`dig whatever.com` is working!“).
### problem: your email isnt being sent / is going to spam
The way email is sent and validated is through DNS (MX records, SPF records, DKIM records), so a lot of email problems are DNS problems.
### problem: internationalized domain names dont work
You can register domain names with non-ASCII characters or emoji like [https://💩.la][6].
The way this works with DNS is that `💩.la` gets translated into `xn--ls8h.la` with an encoding called “punycode”.
But even though theres a clear standard for how they should work with DNS, a lot of software doesnt handle internationalized domain names well! Theres a fun story about this in Julian Squires great talk [The emoji that Killed Chrome!!][7].
### problem: TCP DNS is blocked by a firewall
A couple of people mentioned that some firewalls allow UDP port 53 but not TCP port 53. But large DNS queries need to use TCP port 53, so this can cause weird intermittent problems that are hard to debug.
### problem: musl doesnt support TCP DNS
A lot of applications use libcs `getaddrinfo` to make DNS queries. musl is an alternative to `glibc` thats used in Alpine Docker container which doesnt support TCP DNS. This can cause problems if you make DNS queries where the response would be too big to fit inside a regular DNS UDP packet (512 bytes).
Im still a bit fuzzy on this so I might have it wrong, but my understanding of how this can break is:
1. musls getaddrinfo makes a DNS query
2. the DNS server notices that the response is too big to fit in a single DNS response packet
3. the DNS server returns an **empty** truncated response, expecting that the client will retry by making a TCP DNS query
4. `musl` does not support TCP so it does not retry
A blog post about this: [DNS resolution issue in Alpine Linux][8]
### problem: round robin DNS doesnt work with `getaddrinfo`
One way you could approach load balancing is to use “round robin DNS”. The idea is that every time you make a DNS query, you get a different IP address. Apparently this works if you use `gethostbyname` to make DNS queries, but it does not work if you use `getaddrinfo` because `getaddrinfo` sorts the IP responses it receives.
So you could run into an upsetting problem if you switch from `gethostbyname` to `getaddrinfo` behind the scenes without realising that this will break your DNS load balancing.
This is especially insidious because you might not realize that youre switching to `gethostbyname` to `getaddrinfo` at all if youre not writing a C program, those functions calls are hidden inside some library. So it could be part of a seemingly innocuous upgrade.
Here are a couple of pages discussing this:
* [getaddrinfo breaks round robin DNS][9]
* [getaddrinfo with round robin DNS and happy eyeballs][10]
### problem: a race condition when starting a service
A problem someone [mentioned][11] with Kubernetes DNS: they had 2 containers which started simultaneously and immediately tried to resolve each other. But the DNS lookup failed because the Kubernetes DNS change hadnt happened yet, and then the failure was cached so it kept failing.
### thats all!
Ive definitely missed some important DNS problems here, so Id love to hear what Ive missed. Id also love links to blog posts that write up examples of these problems I think its really useful to see how the problem specifically manifests in practice and how people debugged it.
--------------------------------------------------------------------------------
via: https://jvns.ca/blog/2022/01/15/some-ways-dns-can-break/
作者:[Julia Evans][a]
选题:[lujun9972][b]
译者:[译者ID](https://github.com/译者ID)
校对:[校对者ID](https://github.com/校对者ID)
本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出
[a]: https://jvns.ca/
[b]: https://github.com/lujun9972
[1]: https://twitter.com/b0rk/status/1481265429897261058
[2]: https://blog.sophaskins.net/blog/misadventures-with-kube-dns/
[3]: https://pracucci.com/kubernetes-dns-resolution-ndots-options-and-why-it-may-affect-application-performances.html
[4]: https://medium.com/driven-by-code/dynamic-dns-resolution-in-nginx-22133c22e3ab
[5]: https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/java-dg-jvm-ttl.html
[6]: https://💩.la/
[7]: https://www.youtube.com/watch?v=UE-fJjMasec
[8]: https://christoph.luppri.ch/fixing-dns-resolution-for-ruby-on-alpine-linux
[9]: https://groups.google.com/g/consul-tool/c/AGgPjrrkw3g
[10]: https://daniel.haxx.se/blog/2012/01/03/getaddrinfo-with-round-robin-dns-and-happy-eyeballs/
[11]: https://mobile.twitter.com/omatskiv/status/1481305175440646148