How do you tell if a problem is caused by DNS?

I was looking into problems people were having with DNS a few months ago and I noticed one common theme: a lot of people have server issues (“my server is down! or it's slow!”), but they can't tell if the problem is caused by DNS or not.

So here are a few tools I use to tell if a problem I'm having is caused by DNS, as well as a few DNS debugging stories from my life.

I don't try to interpret browser error messages

First, let's talk briefly about browser error messages. You might think that your browser will tell you if the problem is DNS or not! And it could, but mine doesn't seem to do so in any obvious way.

On my machine, if Firefox fails to resolve DNS for a site, it gives me the error: Hmm. We're having trouble finding that site. We can't connect to the server at bananas.wizardzines.com.

But if the DNS succeeds and it just can't establish a TCP connection to that service, I get the error: Unable to connect. Firefox can't establish a connection to the server at localhost:1324

These two error messages (“we can't connect to the server” and “firefox can't establish a connection to the server”) are so similar that I don't try to distinguish them. If I see any kind of “connection failure” error in the browser, I'll immediately go to the command line to investigate.

tool 1: error messages

I was complaining about browser error messages being misleading, but if you're writing a program, there's usually some kind of standard error message that you get for DNS errors. It often won't say “DNS” in it; it'll usually be something about “unknown host” or “name or service not known” or “getaddrinfo”.

For example, let's run this Python program:


    import requests
    r = requests.get('http://examplezzz.com')

This gives me the error message:


    socket.gaierror: [Errno -2] Name or service not known

If I write the same program in Ruby, I get this error:


    Failed to open TCP connection to examplezzzzz.com:80 (getaddrinfo: Name or service not known)

If I write the same program in Java, I get:


    Exception in thread "main" java.net.UnknownHostException: examplezzzz.com

In Node, I get:


    Error: getaddrinfo ENOTFOUND examplezzzz.com

These error messages aren't quite as uniform as I thought they would be: there are quite a few different error messages in different languages for exactly the same problem, and it depends on the library you're using too. But if you Google the error you can find out if it means “resolving DNS failed” or not.
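In Python specifically, the resolver failure has its own exception type, so a program can separate “DNS failed” from “the connection failed” itself. Here's a minimal sketch using only the standard library. The names `classify_error` and `check_host` are made up for this example, and lumping every other `OSError` into “connection” is a simplifying assumption, not a rule.

```python
import socket


def classify_error(exc: BaseException) -> str:
    """Roughly sort a network exception into 'dns', 'connection', or 'other'."""
    # gaierror/herror come from name lookups (getaddrinfo and friends).
    # They are subclasses of OSError, so they must be checked first.
    if isinstance(exc, (socket.gaierror, socket.herror)):
        return "dns"
    # Refused connections, timeouts, unreachable networks, etc. are all
    # OSError subclasses. Treating them all as "connection" is an assumption.
    if isinstance(exc, OSError):
        return "connection"
    return "other"


def check_host(host: str, port: int = 80, timeout: float = 5.0) -> str:
    """Try to resolve and connect to host:port; return 'ok' or an error category."""
    try:
        # create_connection() resolves the name first, then tries to connect.
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except OSError as exc:
        return classify_error(exc)
```

Calling `check_host("examplezzz.com")` should come back as `"dns"` when the name doesn't resolve, while a host that resolves but refuses the connection should come back as `"connection"`.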

tool 2: use dig to make sure it's a DNS problem

For example, the other day I was setting up a new subdomain, let's say it was https://bananas.wizardzines.com.

I set up my DNS, but when I went to the site in Firefox, it wasn't working. So I ran dig to check whether the DNS was resolving for that domain, like this:


    $ dig bananas.wizardzines.com
    (empty response)

I didn't get a response, which is a failure. A success looks like this:


    $ dig wizardzines.com
    wizardzines.com.    283 IN  A   172.64.80.1

Even if my programming language gives me a clear DNS error, I like to use dig to independently confirm because there are still a lot of different error messages and I find them confusing.
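If you want to script that dig check, one option is to run `dig +short`, which prints only the answer data, and treat “no answer lines” as a failure. This is a rough sketch: `parse_dig_short` and `dig_short` are hypothetical helper names, and it assumes `dig` is installed and on the PATH. An empty result doesn't say why there was no answer (NXDOMAIN, no records of that type, or a timeout all look the same here).

```python
import subprocess
from typing import List, Optional


def parse_dig_short(output: str) -> List[str]:
    """Return the answer lines from `dig +short` output (empty list = no records)."""
    answers = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and dig's ';'-prefixed comments and warnings.
        if line and not line.startswith(";"):
            answers.append(line)
    return answers


def dig_short(name: str, server: Optional[str] = None) -> List[str]:
    """Run `dig +short name` (optionally against @server) and parse the output."""
    cmd = ["dig", "+short"]
    if server:
        cmd.append(f"@{server}")
    cmd.append(name)
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=15)
    return parse_dig_short(result.stdout)
```

With this, `dig_short("bananas.wizardzines.com")` returning an empty list corresponds to the failure case above.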

tool 3: check against more than one DNS server

There are LOTS of DNS servers, and they often don't have the same information. So when I'm investigating a potential DNS issue, I like to query more than one server.

For example, if it's a site on the public internet I'll use both my local DNS server (dig domain.com) and a big public DNS server like 1.1.1.1 or 8.8.8.8 or 9.9.9.9 (dig @8.8.8.8 domain.com).

The other day, I'd set up a new domain, let's say it was https://bananas.wizardzines.com.

Here's what I did:

  1. go to https://bananas.wizardzines.com in a browser (spoiler: huge mistake!)
  2. go to my DNS provider and set up bananas.wizardzines.com
  3. try to go to https://bananas.wizardzines.com in my browser. It fails! Oh no!

I wasn't sure why it failed, so I checked against 2 different DNS servers:


    $ dig bananas.wizardzines.com
    $ dig @8.8.8.8 bananas.wizardzines.com
    feedback.wizardzines.com. 300   IN  A   172.67.209.237
    feedback.wizardzines.com. 300   IN  A   104.21.85.200

From this I could see that 8.8.8.8 actually did have DNS records for my domain, and it was just my local DNS server that didn't.
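The comparison itself is simple enough to write down. Here's a small sketch that takes the answers you collected from each server (for example by running dig against each one) and reports which servers came back empty while another server had records. The function name and the `"local"` label are invented for illustration.

```python
from typing import Dict, List


def servers_missing_records(answers_by_server: Dict[str, List[str]]) -> List[str]:
    """Return the servers with no answers when at least one other server has some."""
    if not any(answers_by_server.values()):
        # Nobody has records, so this isn't a disagreement between servers.
        return []
    return [server for server, answers in answers_by_server.items() if not answers]


# The situation from this section: the local resolver is empty, 8.8.8.8 is not.
observed = {
    "local": [],
    "8.8.8.8": ["172.67.209.237", "104.21.85.200"],
}
stale = servers_missing_records(observed)
```

When one resolver is in the “missing” list and a public one isn't, a stale or negative cache on that resolver is a reasonable first guess.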

This was because I'd gone to https://bananas.wizardzines.com in my browser before I'd created the DNS record (huge mistake!), and then my ISP's DNS server cached the absence of a DNS record, so it was returning an empty response until the negative cache entry expired.

I googled “negative cache time” and found a Stack Overflow post explaining where I could find the negative cache TTL (by running dig SOA wizardzines.com). It turned out the TTL was 3600 seconds or 1 hour, so I just needed to wait an hour for my ISP to update its cache.
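For reference, the negative cache time comes from the SOA record: per RFC 2308, resolvers cache a “no such record” answer for the smaller of the SOA record's own TTL and its last field (MINIMUM). Here's a sketch that pulls that number out of one SOA answer line as printed by dig. The function name is made up, and the SOA line in the usage note below is a hypothetical example, not the real record for wizardzines.com.

```python
def negative_cache_ttl(soa_line: str) -> int:
    """Upper bound, in seconds, on how long a negative answer may be cached."""
    fields = soa_line.split()
    # Expected layout of a dig answer line for SOA:
    # name TTL class SOA mname rname serial refresh retry expire minimum
    if len(fields) != 11 or fields[3].upper() != "SOA":
        raise ValueError(f"not a SOA answer line: {soa_line!r}")
    record_ttl = int(fields[1])
    minimum = int(fields[10])
    # RFC 2308: the negative caching TTL is min(SOA TTL, SOA MINIMUM).
    return min(record_ttl, minimum)
```

For a made-up line like `example.com. 86400 IN SOA ns1.example.com. hostmaster.example.com. 2021110401 7200 3600 1209600 3600`, this gives 3600 seconds. Treat the result as an upper bound: a resolver may apply its own lower cap, and a TTL read from a caching resolver may already be counting down.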

tool 4: spy on the DNS requests being made with tcpdump

Another of my favourite things to do is spy on the DNS requests being made and check if they're failing. There are at least 3 ways to do this:

  1. Use tcpdump (sudo tcpdump -i any port 53)
  2. Use wireshark
  3. Use a command line tool I wrote called dnspeep, which is like tcpdump but just for DNS queries and with friendlier output

I'm going to give you 2 examples of DNS problems I diagnosed by looking at the DNS requests being made with tcpdump.

problem: the case of the slow websites

One day five years ago, my internet was slow. Really slow: it was taking 10+ seconds to get to websites. I thought “hmm, maybe it's DNS!”, so I started tcpdump and then opened one of the slow sites in my browser.

Here's what I saw in tcpdump:


    $ sudo tcpdump -n -i any port 53
    12:05:01.125021 wlp3s0 Out IP 192.168.1.181.56164 > 192.168.1.1.53: 11760+ [1au] A? ask.metafilter.com. (59)
    12:05:06.191382 wlp3s0 Out IP 192.168.1.181.56164 > 192.168.1.1.53: 11760+ [1au] A? ask.metafilter.com. (59)
    12:05:11.145056 wlp3s0 Out IP 192.168.1.181.56164 > 192.168.1.1.53: 11760+ [1au] A? ask.metafilter.com. (59)
    12:05:11.746358 wlp3s0 In  IP 192.168.1.1.53 > 192.168.1.181.56164: 11760 2/0/1 CNAME metafilter.com., A 54.244.168.112 (91)

The first 3 lines are DNS requests, and they're separated by 5 seconds. Basically this is my browser timing out its DNS queries and retrying them.

Finally, on the 3rd query, a response comes back.
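Spotting retries like this by eye gets harder when there's a lot of traffic, so here's a rough sketch that finds repeated queries in saved tcpdump output and reports the gap between sends. The regular expression is an assumption based only on the output lines shown in this post (other tcpdump versions and flags may print differently), `find_retries` is a made-up name, and it ignores captures that cross midnight.

```python
import re
from datetime import datetime

# Matches query lines such as:
# 12:05:01.125021 wlp3s0 Out IP 192.168.1.181.56164 > 192.168.1.1.53: 11760+ [1au] A? ask.metafilter.com. (59)
QUERY_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s.*?:\s(?P<id>\d+)\+?\s"
    r"(?:\[[^\]]*\]\s)?(?P<qtype>[A-Z]+)\?\s(?P<name>\S+)"
)


def find_retries(lines):
    """Map (query id, type, name) -> gaps in seconds between repeated sends."""
    sent = {}
    for line in lines:
        m = QUERY_RE.match(line.strip())
        if m:
            key = (m["id"], m["qtype"], m["name"])
            when = datetime.strptime(m["time"], "%H:%M:%S.%f")
            sent.setdefault(key, []).append(when)
    # Only queries that were sent more than once count as retries.
    return {
        key: [round((b - a).total_seconds(), 3) for a, b in zip(times, times[1:])]
        for key, times in sent.items()
        if len(times) > 1
    }
```

Run on the four lines above, this reports one retried query (id 11760, an A lookup for ask.metafilter.com.) with two gaps of roughly 5 seconds each. The response line is ignored because it isn't a query.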

I don't actually know exactly why this happened, but I restarted my router and the problem went away. Hooray!

(by the way the reason I know that this is the tcpdump output I got 5 years ago is that I wrote about it in my zine on tcpdump, you can read that zine for free!)

problem: the case of the nginx failure

Earlier this year, I was using https://fly.io to set up a website, and I was having trouble getting nginx to redirect to my site: all the requests were failing.

I eventually got SSH access to the server and ran tcpdump, and here's what I saw:


    $ tcpdump -i any port 53
    17:16:04.216161 IP6 fly-local-6pn.55356 > fdaa::3.53: 46219+ A? myservice.internal. (42)
    17:16:04.216197 IP6 fly-local-6pn.55356 > fdaa::3.53: 11993+ AAAA? myservice.internal. (42)
    17:16:04.216946 IP6 fdaa::3.53 > fly-local-6pn.55356: 46219 NXDomain- 0/0/0 (42)
    17:16:04.217063 IP6 fly-local-6pn.43938 > fdaa::3.53: 32351+ PTR? 3.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.a.a.d.f.ip6.arpa. (90)
    17:16:04.218378 IP6 fdaa::3.53 > fly-local-6pn.55356: 11993- 1/0/0 AAAA fdaa:0:bff:a7b:aa2:d426:1ab:2 (70)
    17:16:04.461646 IP6 fdaa::3.53 > fly-local-6pn.43938: 32351 NXDomain 0/1/0 (154)

This is a bit confusing to read, but basically:

  1. nginx requests an A record
  2. nginx requests an AAAA record
  3. the DNS server returns an NXDOMAIN reply for the A record
  4. the DNS server returns a successful reply for the AAAA record, with an IPv6 address

The NXDOMAIN response made nginx think that the domain didn't exist, so it ignored the IPv6 address it got later.

This was happening because there was a bug in the DNS server: according to the DNS spec, it should have been returning NOERROR instead of NXDOMAIN for the A record. I reported the bug and they fixed it right away.

I think it would have been literally impossible for me to guess what was happening here without using tcpdump to see what queries nginx was making.
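This kind of inconsistency (NXDOMAIN for one record type, real answers for another type on the same name) can also be checked mechanically. Here's a sketch that matches tcpdump queries to responses by query ID and flags names where that happened. Both regular expressions are assumptions based only on the tcpdump lines shown in this post, and `inconsistent_names` is a made-up name.

```python
import re
from collections import defaultdict

# Query lines look like:    ...: 46219+ A? myservice.internal. (42)
QUERY_RE = re.compile(
    r":\s(?P<id>\d+)\+?\s(?:\[[^\]]*\]\s)?(?P<qtype>[A-Z]+)\?\s(?P<name>\S+)"
)
# Response lines look like: ...: 46219 NXDomain- 0/0/0 (42)   or   ...: 11993- 1/0/0 AAAA ...
RESPONSE_RE = re.compile(
    r":\s(?P<id>\d+)(?:\s(?P<rcode>[A-Za-z]+))?[*\-|$]*\s(?P<answers>\d+)/\d+/\d+"
)


def inconsistent_names(lines):
    """Return names that got NXDomain for one type but answers for another."""
    queries = {}                 # query id -> (qtype, name)
    results = defaultdict(dict)  # name -> {qtype: (rcode, answer count)}
    for line in lines:
        q = QUERY_RE.search(line)
        if q:
            queries[q["id"]] = (q["qtype"], q["name"])
            continue
        r = RESPONSE_RE.search(line)
        if r and r["id"] in queries:
            qtype, name = queries[r["id"]]
            # tcpdump prints no response code when the response is NOERROR.
            results[name][qtype] = (r["rcode"] or "NOERROR", int(r["answers"]))
    flagged = []
    for name, by_type in results.items():
        got_nxdomain = any(rcode == "NXDomain" for rcode, _ in by_type.values())
        got_answers = any(count > 0 for _, count in by_type.values())
        if got_nxdomain and got_answers:
            flagged.append(name)
    return flagged
```

On the six lines above, this flags `myservice.internal.` (NXDomain for A, one answer for AAAA) and leaves the PTR lookup alone, since that name only ever got NXDomain.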

if there are no DNS failures, it can still be a DNS problem

I originally wrote “if you can see the DNS requests, and there are no timeouts or failures, the problem isn't DNS”. But someone on Twitter pointed out that this isn't true!

One way you can have a DNS problem even without DNS failures is if your program is doing its own DNS caching. Here's how that can go wrong:

  1. Your program makes a DNS request and caches the result
  2. 6 days pass
  3. Your program never updates its IP address
  4. The IP address for the site changes
  5. You start getting errors

This is a DNS problem (your program should be requesting DNS updates more often!) but you have to diagnose it by noticing that there are missing DNS queries. This one is very tricky and the error messages you'll get won't look like they have anything to do with DNS.
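To make the failure mode concrete, here's a sketch of an in-program cache that avoids it by re-resolving once an entry gets too old. It's an illustration, not how any particular library does it: `ExpiringResolver` is a made-up class, and because `socket.getaddrinfo` doesn't expose the record's TTL, the fixed `max_age` is an assumption standing in for it.

```python
import socket
import time


class ExpiringResolver:
    """Cache lookups, but re-resolve once an entry is older than max_age seconds."""

    def __init__(self, max_age=300.0, resolve=None, clock=time.monotonic):
        self.max_age = max_age
        # resolve and clock are injectable so the cache can be tested offline.
        self._resolve = resolve or self._system_resolve
        self._clock = clock
        self._cache = {}  # host -> (addresses, time the lookup was made)

    @staticmethod
    def _system_resolve(host):
        infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
        return sorted({info[4][0] for info in infos})

    def lookup(self, host):
        now = self._clock()
        cached = self._cache.get(host)
        if cached and now - cached[1] < self.max_age:
            return cached[0]  # still fresh: no DNS query is made
        addresses = self._resolve(host)  # stale or missing: ask DNS again
        self._cache[host] = (addresses, now)
        return addresses
```

The bug described above is what you get if the freshness check is missing: the first result is returned forever, and no new DNS queries ever show up in tcpdump.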

that's all for now

This definitely isn't a complete list of ways to tell if it's DNS or not, but I hope it helps!

I'd love to hear methods of checking “is it DNS?” that I missed. I'm pretty sure I've missed at least one important method.


via: https://jvns.ca/blog/2021/11/04/how-do-you-tell-if-a-problem-is-caused-by-dns/

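Author: Julia Evans. Topic selected by: lujun9972. Translator: (translator ID). Proofreader: (proofreader ID).

This article was originally compiled by LCTT and is proudly presented by Linux China.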