https://gdstechnology.blog.gov.uk/2018/05/16/how-we-load-tested-our-email-notification-system-when-moving-to-gov-uk-notify/

How we load-tested our email notification system when moving to GOV.UK Notify


This post explains how our GOV.UK Email team used load testing and code profiling when migrating to GOV.UK Notify to make sure we could continue to send out 1 million emails per hour during peak times.  

This migration also meant GOV.UK Notify had to scale its systems.

Beginning our migration

As part of our migration to GOV.UK Notify, we needed to optimise our delivery system. To do this, we divided our internal email delivery system into 4 layers of processing (Diagram 1).

  1. The GOV.UK Email notification system starts by identifying when publishers make significant content changes to GOV.UK pages.
  2. Our system then searches for subscribers who have signed up to receive notifications about specific content changes.
  3. The system generates an email to send to the user.
  4. The user receives the email.

This division made it easier to optimise each system layer quickly and efficiently.

Diagram 1

This shows the 4 layer system we developed to send emails

Load testing our email notification system

We started load testing by working backwards through our 4 layer email notification system, fixing bottlenecks in each layer to speed up the overall delivery time. We began by testing how long it took to deliver emails to end users.

Our results showed that the email delivery layer was able to send the number of emails we wanted on its own, but not when combined with the email generation layer. This is because the time taken to generate emails was longer than the time it took to deliver them. We realised that one of our tasks for optimising would be to reduce the time it took to generate emails.

Next, we found that matching a content change with the end users who had signed up for a notification about it was slow. This was because the code we’d created to run this task wasn’t optimised for speed. Other optimisation tasks included improving the efficiency of our database record insertions and improving our rate limiting.

How we monitored load testing

We built dashboards using Grafana and StatsD to visualise our progress with load testing on the GOV.UK Notify system in near real time. We used the dashboards to check:

  • the number of notification requests we were making to GOV.UK Notify
  • whether the requests were successes or failures
  • how long requests were taking to complete
  • the size of our email notification process queue
  • how much memory and CPU time we were using at any given time
 We visualised our load testing using dashboards

To work towards our aim of sending 1 million emails per hour, we started with a smaller load test: sending 83,333 emails in 5 minutes (1 million emails divided into twelve 5-minute chunks).
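The arithmetic behind that smaller test can be sketched as follows. This is only an illustration: the helper name `emails_for_window` is hypothetical and is not part of the GOV.UK Email codebase.

```ruby
# Hypothetical helper: how many emails a short load test must send
# to match an hourly target rate.
def emails_for_window(hourly_target, window_minutes)
  windows_per_hour = 60 / window_minutes # 12 windows of 5 minutes
  hourly_target / windows_per_hour       # integer division rounds down
end

chunk = emails_for_window(1_000_000, 5)
per_second = (chunk / (5 * 60.0)).round

puts chunk      # 83333
puts per_second # 278
```

In other words, a 5 minute test at this size needs to sustain roughly 278 emails per second.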

How we optimised our email notification system

We combined our load testing findings with data generated by a code profiling tool to identify 3 areas we could optimise across all 4 layers of our notification system.

1. Speeding up email generation with queue prioritisation and reduced query time

Our email application uses Sidekiq, a queue processing tool. Jobs are placed into queues, and when there is a free slot available for processing, Sidekiq takes a notification request from the next priority queue and performs the task.

We chose to give email generation a higher priority than email delivery to speed up the overall delivery process. This helped, but we found that generating emails required a long-running query against the whole database. This query was initially triggered in response to a publisher content change and ran once for every email sent.

To reduce this query time, we decided to generate emails continuously instead of generating an email every time the content changed. Every 5 seconds, we queried the database and generated an email with all of the content changes that occurred in that time period. This meant we ran fewer queries and could speed up the overall email delivery process.
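As a rough sketch of this batching idea, the code below collects every content change seen in one polling window and builds one email per subscriber from them. The `ContentChange` struct, its fields and the `build_batched_emails` method are assumptions made for illustration; the post does not describe the real schema or code.

```ruby
# Hypothetical record: the real GOV.UK Email schema is not described in the post.
ContentChange = Struct.new(:title, :subscriber_ids)

# Groups all changes from one polling window by subscriber, so each
# subscriber gets a single email and the data is read once per window
# rather than once per email.
def build_batched_emails(changes)
  emails = Hash.new { |hash, subscriber_id| hash[subscriber_id] = [] }
  changes.each do |change|
    change.subscriber_ids.each { |id| emails[id] << change.title }
  end
  emails
end

# Changes that arrived during one 5 second window (made-up data).
window = [
  ContentChange.new("Page A updated", [1, 2]),
  ContentChange.new("Page B updated", [2])
]

build_batched_emails(window).each do |subscriber_id, titles|
  puts "#{subscriber_id}: #{titles.join(', ')}"
end
# 1: Page A updated
# 2: Page A updated, Page B updated
```

In a real system a scheduled job would call something like this every 5 seconds with the changes recorded since the previous run.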

2. Optimising our database using bulk SQL queries

We discovered that we could optimise our database use. We were inserting many database records, such as emails, subscribers and content changes, using individual queries when it would be much faster to insert all the records in a single query. We used a Ruby gem called activerecord-import, which let us add database records to our system in bulk. This gave an approximate speed increase of 30 percent across all parts of the GOV.UK Email system.
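To show why bulk insertion saves work, here is a minimal sketch that builds one multi-row INSERT statement for a batch of rows instead of one statement per row. It does not use activerecord-import and does not show that gem's API; the `bulk_insert_sql` helper, the `emails` table and its columns are hypothetical.

```ruby
# Hypothetical helper: builds a single parameterised multi-row INSERT.
# The values themselves would be passed separately as bind parameters.
def bulk_insert_sql(table, columns, row_count)
  row_placeholder = "(" + (["?"] * columns.size).join(", ") + ")"
  all_rows = ([row_placeholder] * row_count).join(", ")
  "INSERT INTO #{table} (#{columns.join(', ')}) VALUES #{all_rows}"
end

# Made-up rows for illustration.
rows = [
  ["a@example.com", "Subject 1"],
  ["b@example.com", "Subject 2"],
  ["c@example.com", "Subject 3"]
]

sql = bulk_insert_sql("emails", %w[address subject], rows.size)
binds = rows.flatten # six bind values, in row order

puts sql
# INSERT INTO emails (address, subject) VALUES (?, ?), (?, ?), (?, ?)
puts binds.size
# 6
```

Three rows now cost one round trip to the database rather than three, which is the kind of saving a bulk import library automates.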

3. Improving our rate limiting

GOV.UK Notify uses rate limiting to protect itself from denial of service (DoS) attacks. This means that a notification service like GOV.UK Email can only make up to 360 requests per second to Notify. Any requests above this limit are rejected. We implemented our own rate limiter into the GOV.UK code to avoid going over this limit.

We had a small problem with our rate limiting implementation as our subsystem assumed that all workers were sending emails. In reality, a few workers were always doing other tasks like generating emails or matching content changes with subscribers. We ended up with a lower rate limit than we wanted (300 requests per second) because of this assumption.

We reworked our rate limiter to use a Ruby gem called ratelimit, which uses the Redis database. Reworking our rate limiter meant our system no longer assumed all workers were dedicated to delivering emails. This made our rate limit calculations more accurate and meant we could send more emails in the same time period.
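The idea of a limiter that counts actual requests, rather than assuming how many workers are sending, can be sketched as a shared fixed-window counter. This is an illustration only: it is not the ratelimit gem's API, the `SharedRateLimiter` class is hypothetical, and an in-memory Hash stands in for a shared store such as Redis.

```ruby
# Hypothetical fixed-window limiter: every worker asks the same counter
# before sending, so the limit holds however many workers are delivering.
class SharedRateLimiter
  def initialize(limit_per_second, clock: -> { Time.now.to_f })
    @limit = limit_per_second
    @clock = clock
    @counts = Hash.new(0) # stand-in for a shared store such as Redis
    @mutex = Mutex.new
  end

  # Returns true if a request may be sent now, false if the caller should wait.
  def allow?
    window = @clock.call.floor # one counter per whole second
    @mutex.synchronize do
      @counts.delete_if { |key, _| key < window } # forget finished windows
      return false if @counts[window] >= @limit

      @counts[window] += 1
      true
    end
  end
end

# A fake clock keeps the example deterministic.
now = 0.0
limiter = SharedRateLimiter.new(360, clock: -> { now })

puts 400.times.count { limiter.allow? } # 360
now = 1.0
puts limiter.allow?                     # true
```

Because the counter tracks requests and not workers, workers that are busy generating emails or matching subscribers do not reduce the rate available to the workers that are delivering.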

Reaching our 1 million email target

Our initial calculations assumed it would take an average of 100 milliseconds to send an email to end users. Based on this, we could theoretically send 300 requests every second with 30 concurrent workers and meet our target. But it turned out that this wasn’t correct.

Our load testing results showed that it took a total of 250 milliseconds to send email notifications to subscribers using GOV.UK Notify. With our initial concurrency setup of 30 parallel workers, we only had a maximum of 120 requests per second. The maximum number of emails we could send was 432,000 notifications per hour, falling quite short of our 1 million target.

We increased our parallel workers to 90 as this would help us deliver a total of 1.3 million emails per hour when each request takes 250 milliseconds. We could send 360 requests every second for one hour (3,600 seconds). We then carried out sustained hour-long load testing to make sure we could handle the load for longer periods of time.
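The figures in this section follow from a simple model in which each worker sends one request at a time. The sketch below reproduces them; the method names are hypothetical and the model ignores any overhead other than request latency.

```ruby
# Requests per second when each worker handles one request at a time.
def max_requests_per_second(workers, latency_ms)
  workers * 1000 / latency_ms
end

def emails_per_hour(workers, latency_ms)
  max_requests_per_second(workers, latency_ms) * 3600
end

# Smallest worker count that reaches an hourly target at a given latency.
def workers_needed(hourly_target, latency_ms)
  per_worker_per_hour = 3600 * 1000 / latency_ms
  (hourly_target.to_f / per_worker_per_hour).ceil
end

puts max_requests_per_second(30, 100) # 300     (initial assumption)
puts max_requests_per_second(30, 250) # 120     (measured latency)
puts emails_per_hour(30, 250)         # 432000
puts max_requests_per_second(90, 250) # 360
puts emails_per_hour(90, 250)         # 1296000 (about 1.3 million)
puts workers_needed(1_000_000, 250)   # 70
```

Under this model 70 workers would be the minimum for 1 million emails per hour at 250 milliseconds per request, while 90 workers produce 360 requests per second, which matches the GOV.UK Notify rate limit.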

What we learned

By optimising our 4 layer notification system, we reached our target of delivering 1 million emails per hour, and we got there in under 3 weeks.

Moving to GOV.UK Notify will help us manage our email subscription system internally so we can save money, improve reliability, and increase security and accessibility. We expect GOV.UK Notify to help us continue to scale and potentially improve our elasticity so we can cope with higher peaks as demand for notifications continues to increase.

Next steps

We’re continually trying to improve the performance of our email generation and delivery layers. For example, we will look into whether we can run the email generation and subscription matching workers more often than every 5 seconds by using a long-running service in parallel with Sidekiq. This will help us send emails faster because it frees up queue space for delivering emails. Using this method also means we won’t need to queue up the email generation workers at regular intervals and won’t lose time when they’re not queued up.
