Merge pull request #29717 from toknow-gh/tr0717

Translated
Xingyu.Wang 2023-07-17 10:13:56 +08:00 committed by GitHub
commit bddee35094
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
2 changed files with 207 additions and 206 deletions

@ -1,206 +0,0 @@
[#]: subject: "Some ways DNS can break"
[#]: via: "https://jvns.ca/blog/2022/01/15/some-ways-dns-can-break/"
[#]: author: "Julia Evans https://jvns.ca/"
[#]: collector: "lujun9972"
[#]: translator: "toknow-gh"
[#]: reviewer: " "
[#]: publisher: " "
[#]: url: " "
Some ways DNS can break
======
When I first learned about it, DNS didn't seem like it should be THAT complicated. Like, there are DNS records, they're stored on a server, what's the big deal?
But with DNS, reading about how it works in a textbook doesn't prepare you for the sheer volume of different ways DNS can break your system in practice. It's not just caching problems!
So I [asked people on Twitter][1] for examples of DNS problems they've run into, especially DNS problems that **didn't initially appear to be DNS problems**. (the popular “it's always DNS” meme)
I'm not going to discuss how to solve or avoid any of these problems in this post, but I've linked to webpages discussing the problem where I could find them.
### problem: slow network requests
Your network requests are a little bit slower than expected, and it's actually because your DNS resolver is slow for some reason. This might be because the resolver is under a lot of load, or it has a memory leak, or something else.
I've run into this before with my router's DNS forwarder: all of my DNS requests were slow, and I restarted my router and that fixed the problem.
### problem: DNS timeouts
A couple of people mentioned network requests that were taking 2+ seconds or 30 seconds because of DNS queries that were timing out. This is sort of the same as “slow requests”, but it's worse because queries can take several seconds to time out.
Sophie Haskins has a great blog post [Misadventures with Kube DNS][2] about DNS timeouts with Kubernetes.
### problem: ndots
A few people mentioned a specific issue where Kubernetes sets `ndots:5` in its `/etc/resolv.conf`.
Here's an example `/etc/resolv.conf` from [Kubernetes pods /etc/resolv.conf ndots:5 option and why it may negatively affect your application performances][3].
```
nameserver 100.64.0.10
search namespace.svc.cluster.local svc.cluster.local cluster.local eu-west-1.compute.internal
options ndots:5
```
My understanding is that if this is your `/etc/resolv.conf` and you look up `google.com`, your application will call the C `getaddrinfo` function, and `getaddrinfo` will:
1. look up `google.com.namespace.svc.cluster.local.`
2. look up `google.com.svc.cluster.local.`
3. look up `google.com.cluster.local.`
4. look up `google.com.eu-west-1.compute.internal.`
5. look up `google.com.`
Basically it checks if `google.com` is actually a subdomain of everything on the `search` line.
So every time you make a DNS query, you need to wait for 4 DNS queries to fail before you can get to the actual real DNS query that succeeds.
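To make the search-list expansion concrete, here's a minimal Python sketch. It follows my reading of the `ndots` description in the resolv.conf man page, and it is not what `getaddrinfo` literally runs: the function name `candidate_names` is made up for illustration, and real resolvers have more special cases.

```python
def candidate_names(name, search, ndots):
    """Return the fully qualified names a stub resolver would try, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as already fully qualified.
        return [name]
    absolute = name + "."
    searched = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as-is first, then the search list.
        return [absolute] + searched
    # Too few dots: walk the search list first, then try the name as-is.
    return searched + [absolute]


search = [
    "namespace.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "eu-west-1.compute.internal",
]
for candidate in candidate_names("google.com", search, ndots=5):
    print(candidate)
```

`google.com` only has 1 dot, which is fewer than 5, so the search list goes first. Writing `google.com.` with a trailing dot skips the search list entirely.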
### problem: it's hard to tell what DNS resolver(s) your system is using
This isn't a bug by itself, but when you run into a problem with DNS, often it's related in some way to your DNS resolver. I don't know of any foolproof way to tell what DNS resolver is being used.
A few things I know:
* on Linux, I think that most things use /etc/resolv.conf to choose a DNS resolver. There are definitely exceptions though, for example your browser might ignore /etc/resolv.conf and use a different DNS-over-HTTPS service instead.
* if you're using UDP DNS, you can use `sudo tcpdump port 53` to see where DNS requests are being sent. This doesn't work if you're using DNS over HTTPS or DNS over TLS though.
I also vaguely remember it being even more confusing on macOS than on Linux, though I don't know why.
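As a starting point on Linux, here's a small Python sketch that pulls the `nameserver` lines out of the text of a resolv.conf file. The function name `nameservers` is hypothetical, and this only tells you what the file says: a browser doing DNS over HTTPS, or anything else that ignores the file, won't show up here.

```python
def nameservers(resolv_conf_text):
    """Return the addresses on `nameserver` lines, in file order."""
    servers = []
    for line in resolv_conf_text.splitlines():
        # Drop comments, which start with '#' or ';'.
        for marker in ("#", ";"):
            line = line.split(marker, 1)[0]
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


sample = """\
# generated file
nameserver 100.64.0.10
search namespace.svc.cluster.local svc.cluster.local
options ndots:5
"""
print(nameservers(sample))
```

To check a real machine you could pass in the contents of `/etc/resolv.conf` instead of the sample string.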
### problem: DNS servers that return NXDOMAIN instead of NOERROR
Here's a problem that I ran into once, where nginx couldn't resolve a domain.
* I set up nginx to use a specific DNS server to resolve DNS queries
* when visiting the domain, nginx made 2 queries, one for an `A` record, and one for an `AAAA` record
* the DNS server returned a `NXDOMAIN` reply for the `A` query
* nginx decided “ok, that domain doesn't exist”, and gave up
* the DNS server returned a successful reply for the `AAAA` query
* nginx ignored the `AAAA` record because it had already given up
The problem was that the DNS server should have returned `NOERROR`: that domain _did_ exist, it was just that there weren't any `A` records for it. I reported the bug, they fixed it, and that fixed the problem.
I've implemented this bug myself too, so I understand why it happens: it's easy to think “there aren't any records for this query, I should return an `NXDOMAIN` error”.
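Here's a toy Python sketch of the distinction, using a made-up in-memory zone (the `response_code` function and the `zone` layout are assumptions for illustration, not how any real DNS server is written). It also ignores wildcards and names that only exist because they have subdomains.

```python
def response_code(zone, name, rtype):
    """Pick the response code and answers for a query against a toy zone.

    `zone` maps a domain name to a dict of {record type: [values]}.
    """
    if name not in zone:
        # No records of any type under this name: the name does not exist.
        return "NXDOMAIN", []
    # The name exists. If it has no records of the requested type, the
    # right reply is NOERROR with an empty answer section.
    return "NOERROR", zone[name].get(rtype, [])


zone = {"example.test": {"AAAA": ["2001:db8::1"]}}
print(response_code(zone, "example.test", "A"))
print(response_code(zone, "example.test", "AAAA"))
print(response_code(zone, "missing.test", "A"))
```

The buggy version is the one that returns `NXDOMAIN` whenever the answer list comes back empty.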
### problem: negative DNS caching
If you visit a domain before creating a DNS record for it, the **absence** of the record will be cached. This is very surprising the first time you run into it. I only learned about this last year!
The TTL for the cache entry is the TTL of the domain's SOA record. For example, for `jvns.ca`, it's an hour.
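Here's a rough Python sketch of what a negative cache does. As far as I understand RFC 2308, the TTL that gets used is the smaller of the SOA record's own TTL and its MINIMUM field, so the sketch takes both. The `NegativeCache` class and its method names are made up for illustration.

```python
import time


class NegativeCache:
    """Remembers that a (name, type) pair had no records, for a limited time."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._expires = {}

    def remember_missing(self, name, rtype, soa_ttl, soa_minimum):
        # Negative TTL: the smaller of the SOA TTL and the SOA MINIMUM field.
        ttl = min(soa_ttl, soa_minimum)
        self._expires[(name, rtype)] = self._clock() + ttl

    def is_known_missing(self, name, rtype):
        expiry = self._expires.get((name, rtype))
        return expiry is not None and self._clock() < expiry


cache = NegativeCache()
cache.remember_missing("new.example.test", "A", soa_ttl=3600, soa_minimum=3600)
print(cache.is_known_missing("new.example.test", "A"))
```

Until the entry expires, creating the real record doesn't help: the resolver keeps answering from the cached absence.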
### problem: nginx caching DNS records forever
If you put this in your nginx config:
```
location / {
    proxy_pass https://some.domain.com;
}
```
then nginx will resolve `some.domain.com` once on startup and never again. This is especially dangerous if the IP address for `some.domain.com` changes infrequently, because it might keep happily working for months and then suddenly break at 2am one day.
There are pretty well-known ways to fix this and this post isn't about nginx so I won't get into it, but it's surprising the first time you run into it.
Here's a [blog post][4] with a story of how this happened to someone with an AWS load balancer.
### problem: Java caching DNS records forever
Same thing, but for Java: [Apparently][5] depending on how you configure Java, “the JVM default TTL [might be] set so that it will never refresh DNS entries until the JVM is restarted.”
I haven't run into this myself but I asked a friend about it who writes more Java than me and they told me that it's happened to them.
Of course, literally any software could have this problem of caching DNS records forever, but the main cases I've heard of in practice are nginx and Java.
### problem: that entry in /etc/hosts you forgot about
Another variant on caching issues: entries in `/etc/hosts` that override your usual DNS settings!
This is extra confusing because `dig` ignores `/etc/hosts`, so everything SEEMS like it should be fine (“`dig whatever.com` is working!”).
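Since `dig` won't show you these, here's a small Python sketch for checking the text of a hosts file for a forgotten entry. The function name `hosts_overrides` and the sample addresses are made up for illustration.

```python
def hosts_overrides(hosts_text, hostname):
    """Return the addresses that a hosts file maps `hostname` to."""
    wanted = hostname.lower()
    addresses = []
    for line in hosts_text.splitlines():
        # Strip comments, then split into: address, name, optional aliases.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and wanted in (f.lower() for f in fields[1:]):
            addresses.append(fields[0])
    return addresses


sample = """\
127.0.0.1   localhost
203.0.113.7 whatever.com www.whatever.com  # old staging box
"""
print(hosts_overrides(sample, "whatever.com"))
```

An empty list means the hosts file isn't what's overriding your lookup.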
### problem: your email isn't being sent / is going to spam
The way email is sent and validated is through DNS (MX records, SPF records, DKIM records), so a lot of email problems are DNS problems.
### problem: internationalized domain names don't work
You can register domain names with non-ASCII characters or emoji like [https://💩.la][6].
The way this works with DNS is that `💩.la` gets translated into `xn--ls8h.la` with an encoding called “punycode”.
But even though there's a clear standard for how they should work with DNS, a lot of software doesn't handle internationalized domain names well! There's a fun story about this in Julian Squires' great talk [The emoji that Killed Chrome!!][7].
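To see the punycode translation in action, here's a minimal Python sketch that uses the standard library's `punycode` codec label by label. It's a simplification: the helper names are made up, and it skips the normalization and validation steps that real internationalized domain name handling does before encoding.

```python
def to_ascii_label(label):
    """Encode one dot-separated label, leaving plain ASCII labels alone."""
    if label.isascii():
        return label
    # Non-ASCII labels get punycode-encoded and an "xn--" prefix.
    return "xn--" + label.encode("punycode").decode("ascii")


def to_ascii_domain(domain):
    """Encode every label of a domain name."""
    return ".".join(to_ascii_label(label) for label in domain.split("."))


print(to_ascii_domain("💩.la"))
```

The ASCII form is the name that actually gets sent in DNS queries.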
### problem: TCP DNS is blocked by a firewall
A couple of people mentioned that some firewalls allow UDP port 53 but not TCP port 53. But large DNS queries need to use TCP port 53, so this can cause weird intermittent problems that are hard to debug.
### problem: musl doesn't support TCP DNS
A lot of applications use libc's `getaddrinfo` to make DNS queries. musl is an alternative to `glibc` that's used in Alpine Docker containers, and it doesn't support TCP DNS. This can cause problems if you make DNS queries where the response would be too big to fit inside a regular DNS UDP packet (512 bytes).
I'm still a bit fuzzy on this so I might have it wrong, but my understanding of how this can break is:
1. musl's `getaddrinfo` makes a DNS query
2. the DNS server notices that the response is too big to fit in a single DNS response packet
3. the DNS server returns an **empty** truncated response, expecting that the client will retry by making a TCP DNS query
4. `musl` does not support TCP so it does not retry
A blog post about this: [DNS resolution issue in Alpine Linux][8]
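The “truncated” signal in step 3 is a single bit (TC) in the flags field of the DNS message header. Here's a minimal Python sketch of how a client could check for it in a raw response. The function name `is_truncated` is made up, and the sample bytes are a hand-built header, not a captured packet.

```python
import struct

TC_FLAG = 0x0200  # "truncated" bit in the 16-bit flags field of the DNS header


def is_truncated(message):
    """Return True if a raw DNS message has the TC (truncated) bit set."""
    if len(message) < 12:
        raise ValueError("a DNS message starts with a 12-byte header")
    # The header starts with a 16-bit query ID followed by 16 bits of flags.
    _query_id, flags = struct.unpack("!HH", message[:4])
    return bool(flags & TC_FLAG)


# Header only: ID 0x1234, flags = response + truncated + RD + RA, 1 question.
truncated_reply = struct.pack("!HHHHHH", 0x1234, 0x8380, 1, 0, 0, 0)
print(is_truncated(truncated_reply))
```

A client that supports TCP DNS would see this bit and retry the same query over TCP.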
### problem: round robin DNS doesn't work with `getaddrinfo`
One way you could approach load balancing is to use “round robin DNS”. The idea is that every time you make a DNS query, you get a different IP address. Apparently this works if you use `gethostbyname` to make DNS queries, but it does not work if you use `getaddrinfo` because `getaddrinfo` sorts the IP responses it receives.
So you could run into an upsetting problem if you switch from `gethostbyname` to `getaddrinfo` behind the scenes without realising that this will break your DNS load balancing.
This is especially insidious because you might not realize that you're switching from `gethostbyname` to `getaddrinfo` at all: if you're not writing a C program, those function calls are hidden inside some library. So it could be part of a seemingly innocuous upgrade.
Here are a couple of pages discussing this:
* [getaddrinfo breaks round robin DNS][9]
* [getaddrinfo with round robin DNS and happy eyeballs][10]
### problem: a race condition when starting a service
A problem someone [mentioned][11] with Kubernetes DNS: they had 2 containers which started simultaneously and immediately tried to resolve each other. But the DNS lookup failed because the Kubernetes DNS change hadn't happened yet, and then the failure was cached so it kept failing.
### that's all!
I've definitely missed some important DNS problems here, so I'd love to hear what I've missed. I'd also love links to blog posts that write up examples of these problems. I think it's really useful to see how the problem specifically manifests in practice and how people debugged it.
--------------------------------------------------------------------------------
via: https://jvns.ca/blog/2022/01/15/some-ways-dns-can-break/
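Author: [Julia Evans][a]
Topic selection: [lujun9972][b]
Translator: [translator ID](https://github.com/译者ID)
Proofreader: [proofreader ID](https://github.com/校对者ID)
This article was originally compiled by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux中国](https://linux.cn/)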
[a]: https://jvns.ca/
[b]: https://github.com/lujun9972
[1]: https://twitter.com/b0rk/status/1481265429897261058
[2]: https://blog.sophaskins.net/blog/misadventures-with-kube-dns/
[3]: https://pracucci.com/kubernetes-dns-resolution-ndots-options-and-why-it-may-affect-application-performances.html
[4]: https://medium.com/driven-by-code/dynamic-dns-resolution-in-nginx-22133c22e3ab
[5]: https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/java-dg-jvm-ttl.html
[6]: https://💩.la/
[7]: https://www.youtube.com/watch?v=UE-fJjMasec
[8]: https://christoph.luppri.ch/fixing-dns-resolution-for-ruby-on-alpine-linux
[9]: https://groups.google.com/g/consul-tool/c/AGgPjrrkw3g
[10]: https://daniel.haxx.se/blog/2012/01/03/getaddrinfo-with-round-robin-dns-and-happy-eyeballs/
[11]: https://mobile.twitter.com/omatskiv/status/1481305175440646148

@ -0,0 +1,207 @@
[#]: subject: "Some ways DNS can break"
[#]: via: "https://jvns.ca/blog/2022/01/15/some-ways-dns-can-break/"
[#]: author: "Julia Evans https://jvns.ca/"
[#]: collector: "lujun9972"
[#]: translator: "toknow-gh"
[#]: reviewer: " "
[#]: publisher: " "
[#]: url: " "
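Some ways DNS can break
======
When I first learned about DNS, I thought it shouldn't be very complicated. It's just some DNS records stored on a server. What's the big deal?
But textbooks only explain how DNS works. They don't tell you how many different ways DNS can break your system in practice.
So I [asked a question on Twitter][1] to collect DNS problems people have run into, especially ones that didn't initially look like they had anything to do with DNS. (the “it's always DNS” meme)
I'm not going to discuss how to solve or avoid these problems in this post, but I've linked to pages that discuss them where I could find them.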
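### problem: slow network requests
Your network requests are slower than expected, and the cause turns out to be a slow DNS resolver. The resolver might be overloaded, or have a memory leak, or something else.
I ran into this once with my router's DNS forwarder: all of my DNS requests were slow. Restarting the router fixed it.
### problem: DNS timeouts
A few people mentioned network requests that took more than 2 seconds, or even 30 seconds, because DNS queries were timing out. This is similar to “slow network requests”, but worse, because a query can take several seconds just to time out.
Sophie Haskins has a blog post about DNS timeouts with Kubernetes: [Misadventures with Kube DNS][2].
### problem: ndots
A few people mentioned a specific issue where Kubernetes sets `ndots:5` in `/etc/resolv.conf`.
Here's an example `/etc/resolv.conf` from [Kubernetes pods /etc/resolv.conf ndots:5 option and why it may negatively affect your application performances][3].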
```
nameserver 100.64.0.10
search namespace.svc.cluster.local svc.cluster.local cluster.local eu-west-1.compute.internal
options ndots:5
```
If this is your `/etc/resolv.conf` and you look up `google.com`, your application calls the `getaddrinfo` function, which looks up the following names in order:
1. `google.com.namespace.svc.cluster.local.`
2. `google.com.svc.cluster.local.`
3. `google.com.cluster.local.`
4. `google.com.eu-west-1.compute.internal.`
5. `google.com.`
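Basically, it checks whether `google.com` is a subdomain of each of the domains on the `search` line.
So for every DNS lookup, you have to wait for the first 4 queries to fail before you get the final result.
### problem: it's hard to tell which DNS resolver your system is using
This isn't a bug in itself, but when you run into a DNS problem, it usually has something to do with your DNS resolver. I don't know of any foolproof way to tell which DNS resolver is being used.
A few things I know:
* on Linux, most things use `/etc/resolv.conf` to choose a DNS resolver. There are exceptions, though: for example, your browser might ignore `/etc/resolv.conf` and use a DNS-over-HTTPS service instead.
* if you're using UDP DNS, you can run `sudo tcpdump port 53` to see where DNS requests are being sent. This doesn't work if you're using DNS over HTTPS or DNS over TLS, though.
I vaguely remember this being even more confusing on macOS, though I don't know why.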
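### problem: DNS servers that return NXDOMAIN instead of NOERROR
Here's a problem I ran into once, where nginx couldn't resolve a domain.
* I set up nginx to use a specific DNS server to resolve DNS queries
* when visiting the domain, nginx made 2 queries, one for an `A` record and one for an `AAAA` record
* the DNS server returned `NXDOMAIN` for the `A` query
* nginx decided the domain didn't exist and gave up
* the DNS server returned a successful reply for the `AAAA` query
* nginx ignored the `AAAA` reply because it had already given up
The problem was that the DNS server should have returned `NOERROR`: the domain did exist, there just weren't any `A` records for it. I reported the bug and they fixed it.
I've written this bug myself too, so I understand why it happens: it's easy to assume “there aren't any records for this query, so I should return an `NXDOMAIN` error”.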
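### problem: negative DNS caching
If you visit a domain before creating a DNS record for it, the absence of the record gets cached. This is very surprising the first time you run into it. I only learned about it last year!
The TTL of the cache entry is the TTL of the domain's SOA (Start of Authority) record. For `jvns.ca`, for example, it's an hour.
### problem: nginx caching DNS records forever
If you put this in your nginx config: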
```
location / {
    proxy_pass https://some.domain.com;
}
```
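then nginx will resolve `some.domain.com` once at startup and never again. This is especially dangerous when the IP address for `some.domain.com` changes infrequently: it might run fine for months and then suddenly drag you out of bed at 2am one day.
There are well-known ways to fix this, and this post isn't about nginx, so I won't go into them. But it's surprising the first time you run into it.
Here's a [blog post][4] about this happening to someone with an AWS load balancer.
### problem: Java caching DNS records forever
The same problem, but in Java: [apparently][5], depending on how Java is configured, “the JVM default TTL [might be] set so that it will never refresh DNS entries until the JVM is restarted.”
I haven't run into this myself, but a friend who writes more Java than I do has.
Of course, any software could have this problem of caching DNS records forever, but the cases I've heard about most in practice are nginx and Java.
### problem: that entry in /etc/hosts you forgot about
Another variant of caching issues: entries in `/etc/hosts` override your usual DNS settings!
What makes this confusing is that `dig` ignores `/etc/hosts`, so when you check with `dig whatever.com`, everything looks fine.
### problem: your email isn't being sent / is going to spam
Email is sent and validated through DNS (MX records, SPF records, DKIM records), so some email problems are really DNS problems.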
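### problem: internationalized domain names don't work
You can register domain names with non-ASCII characters or even emoji, like [https://💩.la][6].
DNS can handle internationalized domain names because `💩.la` gets converted to `xn--ls8h.la` with punycode encoding.
Even though there's a standard for how internationalized domain names work with DNS, a lot of software doesn't handle them well. Julian Squires' talk [The emoji that Killed Chrome!!][7] has a fun story about this.
### problem: TCP DNS is blocked by a firewall
A few people mentioned that some firewalls allow UDP on port 53 but block TCP. But large DNS queries need TCP on port 53, so this can cause intermittent problems that are hard to debug.
### problem: musl doesn't support TCP DNS
A lot of applications use `libc`'s `getaddrinfo` to make DNS queries. `musl` is an alternative to `glibc` that's used in Alpine Docker containers, and it doesn't support TCP DNS. This causes problems if the response to your DNS query is bigger than a regular DNS UDP packet (512 bytes).
I'm still a bit fuzzy on this, so my understanding below might be wrong:
1. `musl`'s `getaddrinfo` makes a DNS query
2. the DNS server notices that the response is too big to fit in a single DNS packet
3. the DNS server returns an empty truncated response, expecting the client to retry with a TCP DNS query
4. `musl` doesn't support TCP DNS, so it never retries
A blog post about this: [DNS resolution issue in Alpine Linux][8].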
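### problem: round robin DNS doesn't work with getaddrinfo
Round robin DNS is a load balancing technique where every DNS query gets a different IP address. Apparently this works if you use `gethostbyname` to make DNS queries, but not if you use `getaddrinfo`, because `getaddrinfo` sorts the IP addresses it receives.
So when you switch from `gethostbyname` to `getaddrinfo`, you might not realize at all that this can break your load balancing.
This can be very insidious: if you're not writing C, those function calls are hidden inside libraries, and you might not realize the change happened at all. So a seemingly harmless upgrade could break your DNS load balancing.
Here are a few pages discussing this:
* [getaddrinfo breaks round robin DNS][9]
* [getaddrinfo with round robin DNS and happy eyeballs][10]
### problem: a race condition when starting a service
Someone [mentioned][11] a problem with Kubernetes DNS: they had 2 containers that started at the same time and immediately tried to resolve each other. The DNS lookup failed because the Kubernetes DNS change hadn't happened yet, and the failure was cached, so lookups kept failing.
### that's all!
What I've listed here is just the tip of the iceberg of DNS problems, so I'd love to hear about the problems I've missed, along with links to write-ups. I want to see how these problems actually show up in practice and how people solved them.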
--------------------------------------------------------------------------------
via: https://jvns.ca/blog/2022/01/15/some-ways-dns-can-break/
Author: [Julia Evans][a]
Topic selection: [lujun9972][b]
Translator: [toknow-gh](https://github.com/toknow-gh)
Proofreader: [proofreader ID](https://github.com/校对者ID)
This article was originally compiled by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux中国](https://linux.cn/)
[a]: https://jvns.ca/
[b]: https://github.com/lujun9972
[1]: https://twitter.com/b0rk/status/1481265429897261058
[2]: https://blog.sophaskins.net/blog/misadventures-with-kube-dns/
[3]: https://pracucci.com/kubernetes-dns-resolution-ndots-options-and-why-it-may-affect-application-performances.html
[4]: https://medium.com/driven-by-code/dynamic-dns-resolution-in-nginx-22133c22e3ab
[5]: https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/java-dg-jvm-ttl.html
[6]: https://💩.la/
[7]: https://www.youtube.com/watch?v=UE-fJjMasec
[8]: https://christoph.luppri.ch/fixing-dns-resolution-for-ruby-on-alpine-linux
[9]: https://groups.google.com/g/consul-tool/c/AGgPjrrkw3g
[10]: https://daniel.haxx.se/blog/2012/01/03/getaddrinfo-with-round-robin-dns-and-happy-eyeballs/
[11]: https://mobile.twitter.com/omatskiv/status/1481305175440646148