https://gdstechnology.blog.gov.uk/2014/08/27/taking-another-look-at-gov-uks-disaster-recovery/

Taking another look at GOV.UK's disaster recovery

As GOV.UK gets bigger, we often need to revisit the ways that we originally solved some problems. One thing that's changed recently is how we prepare for disaster recovery.

Disaster Recovery

The reality of working in technology is that software systems fail, more often than we'd like and usually in ways that are beyond our control. The process of thinking up high-level failure scenarios and solutions for them is called disaster recovery.

One extreme scenario for GOV.UK is that all our infrastructure disappears. All of our applications, the servers they run on and even the infrastructure provider we're using. Gone.

Trying to solve this problem directly and up front is difficult: it's too generic, and until it happens we won't know the root cause. What we can do is buy ourselves some time to assess the situation while we start to resume normal service for our users.

Creating a static copy of GOV.UK

Back in 2012, one of the nice properties of GOV.UK was that the majority of pages were static. Static HTML, CSS, JavaScript and images. Additionally, there were only 20,000 pages in total.

One of our solutions at the time was to run a task overnight to visit every page on GOV.UK that wasn't a form and save it to disk. This process would take a couple of hours to complete. We'd then transfer the files to some backup machines that were ready to be switched over to should the worst occur.

We called the code that did this the GOV.UK Mirror.

Fast forward to present day

Various agencies and departments have transitioned to GOV.UK and we're now up to over 140,000 pages. We found that the GOV.UK Mirror was taking close to 50 hours to complete, which meant a given page could be up to two days out of date should we need to switch to our mirror. For pages like foreign travel advice, which can update more than once a day, this was a problem.

What problems are we trying to solve?

We knew that the full crawl, at 50 hours, was taking too long; we wanted it to complete within a day. We also couldn't crawl frequently updated pages, such as foreign travel advice, on an ad-hoc basis. Finally, there was no way of pausing or stopping the mirror process mid-run: we couldn't continue from the last good state, so we had to restart from the beginning each time.

Building it

We made a conscious decision to split the GOV.UK Mirror into two components: a producer to give us the initial set of URLs, and a consumer to crawl them and write the pages to disk. The two components would communicate using a message queue. This way, we'd remove our reliance on the nightly task to complete the work and could use the message queue for crawling ad-hoc pages. Using a message queue also meant we could continue where we left off.

The producer is now a simpler process that retrieves a list of URLs from our Content API and publishes them to an exchange. Most of the work is done in the consumer component, which is written in Go; the message queue broker we're using is RabbitMQ.

We wanted the ability to horizontally scale out crawling to improve the rate at which we completed the work we were given. We could achieve parallelism on the queue by increasing the number of consumers, but we wouldn't be able to keep track of URLs that had been crawled across the nodes. We needed to think beyond a single process running at one time.

We use Redis to share state across the workers: it keeps track of URLs that have been crawled before, and each consumer checks it before crawling a URL it picks up from the message queue. We can now run many message queue consumers to get through work faster, based on our workload. The total time for a full crawl is now 4 hours.

What have we learnt?

We had been running the GOV.UK Mirror for long enough to know which areas we didn't like, from operating it through to functionality we knew to be missing. Not only that, but we also understood the problem better than we did at GOV.UK's release. There was no magical epiphany that occurred; this is the nature of writing software – you have to adapt and update as you know more.

After two years of running GOV.UK we're finding that we have to revisit many of the choices we made. The site has grown a lot and it's time to take another look at many of the applications we built back in 2012.

Iterate. Then iterate again.

You can find the code here: https://github.com/alphagov/govuk_crawler_worker/

If this sounds like a good place to work, take a look at Working for GDS - we're usually in search of talented people to come and join the team.


14 comments

  1. Comment by Jemima posted on

    Does the CMS add pages to the queue when pages are updated? Allowing you to prioritise updated content?

    Reply
  2. Comment by Felicity posted on

    Is this really a unique problem to GDS and thus requires an expensive custom solution?

    Reply
    • Replies to Felicity

      Comment by Brad Wright posted on

      Hi Felicity, while static mirrors of content-managed data isn't a unique problem, our particular administration system is entirely bespoke because it's built around user needs. As Kush says, this mirroring is part of our disaster recovery setup so we absolutely need the mirror to be up to date and comprehensive. This requires integration with our publishing tools which we wouldn't easily get from a commercial system without customisation and lock-in - we outline some of our software buying vs. building thinking in the choosing technology page on the Service Manual.

      Reply
  3. Comment by Peter Smith posted on

    Could not you just rsync the webserver contents in a cron job? Do you not have all the documents source controlled anyway? Appears to be massively over-engineered, unless there's something I'm missing.

    Reply
    • Replies to Peter Smith

      Comment by Brad Wright posted on

      Hi Peter, GOV.UK's content is served from a content management system and is passed through some presentation layers to provide consistent headers and footers - it's not stored on disk in full form, hence the need for a mirroring process to give us the content to rsync.

      Reply
  4. Comment by Sam posted on

    You mentioned it now only takes 6 hours for a full backup. Out of curiosity what now seems to be the bottleneck preventing you going any faster (other than just adding more crawler workers)?

    Reply
  5. Comment by serverhorror posted on

    Sounds like you could also achieve vast speedups by using information in the headers (like those being used for client caching)

    Reply
    • Replies to serverhorror

      Comment by Kushal Pisavadia posted on

      Potentially, but our intention is to use this as part of a workflow where when we 'publish' a page we can then immediately send a message to the exchange for the crawler. That way, we don't have to rely as much on cache control headers.

      Reply
  6. Comment by Storix posted on

    Your approach seems to be "over engineered". The pages are published through a CMS using templates and data from within a database. This is a very common approach and most administrators create a backup or sync the template data and the database on the web server. In this case, I would recommend using a backup product rather than crawling the pages for content.

    One important aspect to remember is backing up relational databases. You should either stop the database temporarily, or create a snapshot of the data so that it is backed up in a consistent state. I'm not here to sell you a solution, but we have customers who use our backup product to accomplish exactly what you are trying to achieve. You might want to re-evaluate your disaster recovery plans by incorporating backup software. Good luck.

    Reply
    • Replies to Storix

      Comment by Kushal Pisavadia posted on

      As already stated in the blog post, this is just one of our many disaster recovery options. We have backups in place for our databases.

      In a scenario where our infrastructure disappears we would have to provision new infrastructure. This would need to complete before applying any database backups. Even more so if that's occurred across many providers or data centres.

      The static mirror gives us some extra time whilst we work on reprovisioning new infrastructure. It means we reduce the time it takes for users to access common areas of the site.

      Reply
