DNS and Yesterday's Outage

As I’m sure at least some of you noticed, Scientopia – along with a huge number of other sites – was unreachable for most of yesterday. I say “unreachable” instead of “down” because the site was up. But if you wanted to get to it, and you didn’t know it’s numeric address, you couldn’t reach it. That’s because your computer uses a service called DNS to find things.

So what’s this DNS thing? And what brought down so much of the net yesterday?

DNS stands for “domain name service”. What it does is provide a bridge between the way that you think of the name of a website, and the way that your computer actually talks to the website. You think of a website by a URL, like http://scientopia.org/blogs/goodmath. That URL basically says “the thing named ”blogs/goodmath” on a server named ”scientopia.org”, which you can access using a system called ”http”.”.

The catch is that your computer can’t just send a message to something named “scientopia.org”. Scientopia is a server in a datacenter somewhere, and to send a message to it, it needs to know a numeric address. Numeric addresses are usually written as a set of four numbers, each ranging from 0 to 255. For example, scientopia.org is currently 184.106.221.182.

You don’t want to have to remember those four meaningless numbers in order to connect to scientopia. But there’s more to it than just remembering: those numbers can change. The physical computer that scientopia works on has been moved around several times. It lives in a datacenter somewhere, and as that datacenter has grown, and hardware has needed to be either expanded or replaced, the numeric address has changed. Even though I’m the administrator of the site, there’ve been a couple of times where it’s changed, and I can’t pinpoint exactly when!

The reason that all of that works is DNS. When you tell your computer to connect to another computer by name, it uses DNS to ask “What’s the numeric address for this name?”, and once it gets that numeric address, it uses that to connect to the other computer.

When I first got access to the internet in college, it was shortly before DNS first got deployed. At the time, there was one canonical computer somewhere in California. Every other computer on the (then) arpanet would contact that computer every night, and download a file named “hosts.txt”. hosts.txt contained a list of every computer on the internet, and its numeric address. That was clearly not working any more, and DNS was designed as the replacement. When they decided to replace hosts.txt, they wanted to replace it with something much more resilient, and something that didn’t need to have all of the data in one place, and which didn’t rely on any single machine in order to work.

The result was DNS. It’s a fascinating system. It’s the first real distributed system that I ever learned about, and I still think it’s really interesting. I’m not going to go into all of the details: distributed systems always wind up with tons of kinks, to deal with the complexity of decentralized information in the real world. But I can explain the basics pretty easily.

The basic way that DNS works is really simple. You take a domain name. For example, my personal machine is named “sunwukong”, and my family network is “phouka.net” – so the hostname that I’m typing this on right now is “sunwukong.phouka.net”. To look it up, what happens is you take the host name, and split it at the dots. So for the example, that’s [“sun-wukong”, “phouka”, “net”].

You start with the last element at the list, which is called the top-level-domain, and you contact a special server called the root nameserver and ask “who do I ask for addresses ending with this tld? (“Who do I ask for ‘.net’?”) It sends a response to you, giving you the address of another nameserver. Then you contact that, and ask it “who do I ask for the next element”? (“Who do I ask for phouka?”) Finally, when you get to the last one, the address that it gives you is the address of the machine you’re looking for, instead of the name of another DNS server. At each level of the heirarchy, there’s a server or set of servers that’s registered with the level above to be the canonical server for some set of names. The “.com” servers are registered with the root level servers. The next level of domains is registered with the TLD servers. And so on.

So, to look up sunwukong.phouka.net, what happens is:

  1. You send a request to the root nameserver, and ask for the server for “.net”?
  2. The root server responds, giving you the “.net” server.
  3. You send a request to the .net server address that you just got, asking “Who can tell
    me about phouka?”
  4. The .net server responds, and gives you an address for a DNS server that can tell
    you where things are in phouka.net.
  5. You ask the “phouka.net” server for the address of sunwukong
  6. It finally gives you the address of your server.

This distributes the information nicely. You can have lots of root name servers – all they need to be able to do is find DNS servers for the list of top-level domains. And there can be – and are – thousands and thousands of nameservers for addresses in those top-level domains. And then each administrator of a private network can have a top-level nameserver for the computers that they directly manage. In this mess, any nameserver at any level can disappear, and all that will be affected are the names that it manages.

The problem is, there’s no one place to get information; and more importantly, every time you need to talk to another computer, you need to go through this elaborate sequence of steps.

To get around that last issue, you’ve got something called caches, on multiple levels. A cache is a copy of the canonical data, which gets kept around for some period of time. In the DNS system, you don’t need to talk to the canonical DNS server for a domain if someone has the data they need in a cache. You can (and generally do) talk to cacheing DNS servers. With a cacheing DNS server, you tell it the whole domain name that you want to look up, and it does the rest of the lookup process, and gives you the address you’re looking for. Every time it looks something up, it remembers what it did, so that it doesn’t need to ask again. So when you look up a “.com” address, your cacheing service will remember who it can ask for “.com” addresses. So most of the time, you only talk to one server, which does the hard work, and does it in a smart way so that it doesn’t need to unnecessarily repeat requests.

So now, we can get to what happened yesterday. GoDaddy, one of the major american DNS registries, went down. Exactly why it went down is not clear – a group of jerks claimed to have done a denial-of-service attack against them; but the company claims to have just had a configuration glitch. Honestly, I’m inclined to believe that it was the hackers, but I don’t know for sure.

But – the canonical server for scientopia.org was gone. So if you needed to look up scientopia.org and it wasn’t in your cache, then your cacheing nameserver would go to the .org server, and ask for scientopia; the .org nameserver would return the address of GoDaddy’s nameserver, and that wouldn’t respond. So you’d never reach us.

That’s also why some people were able to reach the site, and others weren’t. If your cacheing server had cached the address for scientopia, then you’d get the server address, and since the server was up the whole time, you’d connect right up, and everything would work.

6 thoughts on “DNS and Yesterday's Outage

  1. John Miller

    To add just a bit to your description, typically there is not just one DNS server associated with a name, but a set. DNS servers go down as often as anything else; usually there is a redundant system or two ( more like 12 for TLD/Root) that picks up the slack. The odd thing about GoDaddy is not that one DNS serve failed it is that all of them did at the same time, and it took more then a few minutes to bring any replacement service back up.

    Reply
  2. Simon Farnsworth

    Your lookup process is right in broad outline, but misses an important detail that you have to understand to cope with odd DNS setups that I’ve encountered in offices.

    You don’t ask the root nameserver for “.net”; you ask it for “sunwukong.phouka.net”; it replies with “The best I can do for you is tell you that these servers know about .net domains”. In turn, you ask the .net servers for “sunwukong.phouka.net”; they reply with “You need to ask these servers about phouka.net domains”. And then you ask the phouka.net server for “sunwukong.phouka.net” again; it finally replies with the right answer.

    This doesn’t matter for most purposes; however, it does matter for things like .org.uk, where that particular layer of naming has nothing in it.If you look up the machine I’m using, “logos.farnz.org.uk”; it starts out the same way (the root sends you to the .uk servers), but the .uk servers respond not with “Ask these machines about .org.uk”, but “Ask these machines about farnz.org.uk”.

    Reply
  3. Pingback: DNS and Yesterday's Outage | Good Math, Bad Math | DNS Internet

  4. Keith Gaughan

    The big lesson to learn from GoDaddy’s outages is that it’s never a good idea to host all your nameservers in the same AS (autonomous system, network basically). Part of the point of having multiple nameservers Is that if a network hosting one of them goes down, the others that are hosted on other networks can pick up the slack.

    Given GoDaddy’s explanation of the problem, it sounds almost as if the nameserver Scientopia use are hosted on the same network, and possibly in the same datacentre, which, if true, is something they ought to rectify.

    Also, if a host can be reached by IP address, but not by domain name, it’s not unreachable, but the domain name is unresolvable.

    Reply
    1. MarkCC Post author

      How’s this for a sad comment on the state of the internet?

      I almost deleted your comment, because it looked almost exactly like the tons of fake-compliment spam that I’ve gotten on the post.

      Reply

Leave a Reply