
https://technology.blog.gov.uk/2017/07/19/making-the-gov-uk-dns-more-resilient/

Making the GOV.UK DNS more resilient

Categories: GOV.UK


Our job in the Infrastructure Team is to make sure GOV.UK is available. Find out how we've improved the uptime of GOV.UK by making our DNS more robust.

Our DNS can now be served by 3 providers, meaning we can be more confident our website will be available even if one of the providers fails. Our DNS resource records can also be automatically validated, checked and deployed to our providers with a single click. In this post I’ll explain why we’ve made these changes.

Problems we had with our DNS

An incident occurred in October when our DNS supplier, Dyn, was taken offline by a distributed denial-of-service (DDoS) attack. We had a single point of failure because we’d used a single provider, and this is something we’re keen to avoid in the future.

The October outage also highlighted another problem: the state of our DNS was only stored in one place, with our provider. This meant Dyn had the only version of our DNS zone file (which listed all of our records), and we couldn’t download a copy while the system was unavailable.

Building a more resilient DNS

Firstly, to get rid of the single point of failure, we made Amazon Route53 our secondary DNS provider. This involved copying the existing BIND-formatted zone file from Dyn and uploading it to Route53. Although this was an improvement, it meant future modifications to the zone file now had to be made in two places: Dyn and Route53. This introduced the risk of copying errors creeping in, or of us forgetting to update one of the providers.

We had to remove these risks and stop relying on a provider to hold the only copy of our DNS zone file. We wanted a single, canonical copy of the file, controlled by us, so that our team was always fully aware of the state of our DNS.

To do this, we copied our DNS zone file into our Git repository. We then wrote some simple scripts to convert the space-delimited BIND zone file into a more explicit format using YAML. We went with YAML as our developers are familiar with it and we can explicitly state what each piece of data is instead of relying on whitespace-based columns (editing and maintaining field:value formats is generally easier than whitespace columns).
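To give a flavour of that conversion, here is a minimal sketch in Python of turning space-delimited BIND lines into explicit YAML records. The field names (subdomain, record_type and so on) are purely illustrative, not the exact layout we use:

    # A minimal sketch of the BIND-to-YAML idea; field names are illustrative.
    import yaml  # PyYAML

    def bind_lines_to_yaml(bind_lines):
        """Convert simple space-delimited BIND records into explicit YAML."""
        records = []
        for line in bind_lines:
            if not line.strip() or line.startswith(";"):  # skip blanks and comments
                continue
            name, ttl, _class, rtype, data = line.split(None, 4)
            records.append({
                "subdomain": name,
                "ttl": int(ttl),
                "record_type": rtype,
                "data": data,
            })
        return yaml.safe_dump({"records": records}, default_flow_style=False)

    print(bind_lines_to_yaml([
        "www 3600 IN CNAME www-origin.example.com.",
        "@   300  IN A     192.0.2.10",
    ]))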

With the zone file in our repository we have more control: we can issue a pull request when we want to change our DNS (creating a history of modifications), then review the changes, comment on them and check them before they are deployed.

We think these changes make us more secure because we are able to:

  • track versions and changes
  • audit all changes (we know who changed what, when and why)

Deploying the zone file to our DNS providers simultaneously

Now that the zone file was in version control, we wanted a way to easily deploy the latest version to our DNS providers simultaneously. We wanted to avoid manually copying and pasting values into web forms.

To create this automatic deployment, we used the infrastructure management tool Terraform. We created a script (and GitHub repository) that consumes the zone file and produces a JSON file formatted for Terraform. This JSON file is used by Terraform to produce the DNS records (Terraform calls these “resources”) and deploy them to our providers via their APIs.
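As a rough illustration of that step (a sketch, not our exact script), the snippet below emits Terraform’s JSON syntax for Route53 records. The aws_route53_record arguments (zone_id, name, type, ttl, records) are standard Terraform; the record structure and zone ID are made up for the example:

    # A sketch: turn the YAML-style records into Terraform JSON resources.
    import json

    def records_to_route53_json(records, zone_id, domain):
        resources = {}
        for i, rec in enumerate(records):
            resources[f"record_{i}"] = {
                "zone_id": zone_id,
                "name": f"{rec['subdomain']}.{domain}",
                "type": rec["record_type"],
                "ttl": rec["ttl"],
                "records": [rec["data"]],
            }
        return json.dumps({"resource": {"aws_route53_record": resources}}, indent=2)

    records = [{"subdomain": "www", "ttl": 3600,
                "record_type": "CNAME", "data": "www-origin.example.com."}]
    print(records_to_route53_json(records, zone_id="Z1EXAMPLE", domain="example.com"))

Terraform reads this JSON just as it would HCL, so the same set of records can be applied to each provider with terraform apply.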

Now we can start using a new provider (as long as it’s supported by Terraform) by adding a few lines of code to the script, which produces a JSON file with the correct structure. We’ve added functionality for three providers now: our previous two (Dyn and Route53) as well as a new provider, Google CloudDNS. Writing these simple functions is a much quicker and simpler task than writing custom scripts, and it has the advantage of being something we can test.
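For example, supporting Google CloudDNS mostly means mapping the same records onto Terraform’s google_dns_record_set resource (managed_zone, name, type, ttl, rrdatas). Again, this is a sketch rather than our production code, and the managed zone name is hypothetical:

    # A sketch of per-provider output: the same records as Google CloudDNS
    # resources. Cloud DNS names must be fully qualified (trailing dot).
    def records_to_clouddns_resources(records, managed_zone, domain):
        return {
            f"record_{i}": {
                "managed_zone": managed_zone,
                "name": f"{rec['subdomain']}.{domain}.",
                "type": rec["record_type"],
                "ttl": rec["ttl"],
                "rrdatas": [rec["data"]],
            }
            for i, rec in enumerate(records)
        }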

At this point we had effectively solved our original problem: we now had our DNS supplied by multiple providers, we had the zone file in code and we could deploy it with the push of a button. By automating the zone file deployment process, we’ve removed a lot of the provider-specific requirements that can slow down making changes and introduce errors. We also picked up a few additional bonuses along the way: we can validate records in the zone file and verify the live DNS.

We made validating records part of our pull request process, so that when developers add new records, common mistakes are caught early (and automatically). We have also added a nightly job that queries public DNS to check that it matches our repository.
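A check like that nightly job can be sketched with the dnspython library: query public DNS for each record in the repository and flag any mismatch. The values below are illustrative and the real job is structured differently:

    # A minimal sketch of a "does live DNS match the repo?" check, using dnspython.
    import dns.resolver

    def verify_record(fqdn, record_type, expected_values):
        """Query public DNS and compare the answers to the values held in the repo."""
        try:
            answers = dns.resolver.resolve(fqdn, record_type)
        except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
            return False
        live = sorted(str(rdata).rstrip(".") for rdata in answers)
        expected = sorted(v.rstrip(".") for v in expected_values)
        return live == expected

    if not verify_record("www.example.com", "CNAME", ["www-origin.example.com."]):
        print("Public DNS does not match the repository - investigate")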

Your thoughts

Operating our DNS properly is vital to GOV.UK’s survival, so we treat it with the same rigour as (if not more than) any other piece of code. We’re keen to hear your thoughts on what we’ve done, or the steps you’ve taken to make your DNS more resilient. Please share your views or stories using the comments section below.

 Sign up now for email updates from this blog or subscribe to the feed.

If this sounds like a good place to work, take a look at Working for GDS - we're usually in search of talented people to come and join the team.


3 comments

  1. Comment by Oli posted on

    Thanks for the excellent write up. I'd be interested in how you monitor your live DNS infrastructure but also how you structure your DNS/networking outside of production (test etc). Do you mirror live? How do you test your changes to production DNS?

  2. Comment by samcook posted on

    Hi Oli,

    We don't currently do any live monitoring of our DNS. The closest we have is a nightly job that checks against our expected state.

    We don't use DNS for much outside of service access so there's not much structure. We have staging and integration environments which are pointed to by the main zone file so it's all fairly flat.

    In terms of testing it's mainly done using dummy zones to check that the scripts run cleanly and the changes are what we expect.

  3. Comment by Brian posted on

    Out of curiosity, is there any reason you chose this approach over the conventional master-slave method?

    I can understand the logic of using version control (although we use a custom database to generate them, with change tracking), but distributing the zones to individual servers (well, services) seems overly cumbersome.