https://gdstechnology.blog.gov.uk/2017/02/03/providing-access-to-datasets-through-apis/

Providing access to datasets through APIs

In this blog post I’ll discuss our thoughts on providing access to datasets through APIs. These thoughts are based on my experience with the Registers team, who have been conducting user research into how people use data, and they agree with the US API standards.

There are two ways to access a large dataset: you can query for individual pieces of data through an API, or you can download the whole dataset and query your local copy at your leisure. Your needs will normally dictate which type of data access you choose.

Performing individual record queries

If you’re building a service, you might want to access data about a user’s local authority without having to download data on all local authorities. This access pattern has many advantages: the data is ‘live’ and you only need to download the data you actually need. Building directly on an API can help when prototyping or building low volume clients for a small number of users.
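As a rough sketch of this access pattern, the Python below looks up one record by its key. The URL layout (`<base>/record/<key>.json`) and the function names are assumptions made for illustration, not a documented API. The HTTP call is passed in as a parameter so the lookup logic can be tried without a network connection.

```python
import json
import urllib.request
from typing import Callable
from urllib.parse import quote


def http_get(url: str) -> str:
    """Fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8")


def fetch_record(base_url: str, key: str,
                 get: Callable[[str], str] = http_get) -> dict:
    """Fetch a single record by key instead of downloading the whole dataset.

    The '/record/<key>.json' path is a hypothetical layout, not a documented one.
    """
    url = f"{base_url.rstrip('/')}/record/{quote(key, safe='')}.json"
    return json.loads(get(url))
```

A service would call `fetch_record` with the base URL of whichever API it depends on and the key it needs, and only that one record crosses the network.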

However, there are disadvantages to this access pattern: if the originating API has a period of unavailability, your service will not be able to provide the data users need. Furthermore, the API will restrict the kinds of queries you can do on the data. For example, you won’t be able to perform a text search on a natural-language dataset if the API doesn’t support this kind of query.

Downloading a whole dataset in bulk

Some people may need to perform a task that requires access to the whole dataset. For instance, you may wish to plot a graph of school catchment areas in England.

Using a record-by-record API to access the entire school dataset would be suboptimal, both for the consumer and for the API. You could encounter barriers such as rate limits, which slow down access or could even prevent you from downloading the whole dataset. Furthermore, if the dataset is being updated while the record-by-record download is in progress, you may get an inconsistent view of the world, with some records up to date while others are stale.

To support use cases like this, it would be more convenient to download the whole dataset and run your data analysis on your own machine.

This has advantages. You can easily index your local copy of data using your database technology of choice and then perform a query to meet your needs. Future API downtime won’t affect you because you already have all the data you need.
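To illustrate, here is a minimal sketch that loads a downloaded dataset into SQLite (chosen only because it ships with Python) and runs a text search that the originating API might not offer. The table layout and the field names `id`, `name` and `town` are made up for the example.

```python
import sqlite3
from typing import Iterable, List, Tuple


def build_local_index(records: Iterable[dict]) -> sqlite3.Connection:
    """Load downloaded records into an in-memory SQLite table and index it."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: real datasets will have their own fields.
    conn.execute("CREATE TABLE school (id TEXT PRIMARY KEY, name TEXT, town TEXT)")
    conn.executemany(
        "INSERT INTO school (id, name, town) VALUES (:id, :name, :town)",
        records,
    )
    # Index the column we expect to filter on most often.
    conn.execute("CREATE INDEX school_town ON school (town)")
    conn.commit()
    return conn


def schools_matching(conn: sqlite3.Connection, text: str) -> List[Tuple[str, str]]:
    """Substring search on school names, run entirely against the local copy."""
    cursor = conn.execute(
        "SELECT id, name FROM school WHERE name LIKE ? ORDER BY id",
        (f"%{text}%",),
    )
    return cursor.fetchall()
```

Once the data is local, the kinds of query you can run are limited by your database rather than by the API.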

Keeping your local dataset copy up to date

If you have performed a bulk data download, it’s liable to become stale as updates are made to the original dataset. If you are going to continue using this downloaded data, it’s important to think about how stale it may become and what you can do to keep it up to date.

A simple approach is just to redownload the whole dataset periodically. You can see this in the code for the Petitions site, which downloads the contents of the country register every night. If the country register is unavailable, the Petitions site can just use the old version of the data and the service can continue uninterrupted.
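The general shape of this approach can be sketched as follows. This is not the Petitions site's code; it is an illustrative Python version in which the `download` callable, the file path and the JSON file format are all assumptions.

```python
import json
import os
import tempfile
from typing import Callable


def refresh_local_copy(path: str, download: Callable[[], list]) -> bool:
    """Try to replace the local copy; keep the old file if the download fails.

    Returns True if the copy was refreshed, False if the old copy was kept.
    """
    try:
        records = download()
    except OSError:
        # urllib's network errors subclass OSError; the old copy is untouched.
        return False
    # Write to a temporary file first, then swap it in with os.replace so
    # readers never see a half-written dataset.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as tmp:
        json.dump(records, tmp)
    os.replace(tmp_path, path)
    return True


def load_local_copy(path: str) -> list:
    """Read whichever version of the dataset was last saved successfully."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

Run from a nightly scheduled job, `refresh_local_copy` either swaps in fresh data or leaves yesterday's file in place, so the service reading the file keeps working in both cases.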

For larger datasets, this strategy becomes wasteful and impractical. One strategy we’re experimenting with on the Open Registers project is to let users download incremental lists of changes to a dataset. So rather than downloading the whole dataset again, they can download only the records that have changed since their last download. This should keep their local copy of the data up to date without having to redownload the whole dataset again and again.
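A minimal sketch of how a client might consume such a list of changes is shown below. The shape of a change (`entry_number`, `key`, `item`) and the `fetch_changes` callable are hypothetical, chosen for illustration rather than taken from any published format.

```python
from typing import Callable, Dict, List


def apply_changes(local: Dict[str, dict], last_seen: int,
                  fetch_changes: Callable[[int], List[dict]]) -> int:
    """Apply changes newer than last_seen to the local copy.

    Returns the new position, to be stored and passed in on the next run.
    """
    changes = sorted(fetch_changes(last_seen), key=lambda c: c["entry_number"])
    for change in changes:
        # Each change is assumed to carry an increasing 'entry_number',
        # the record 'key', and the full new version of the record in 'item'.
        if change["entry_number"] <= last_seen:
            continue  # already applied, so re-delivered changes are harmless
        local[change["key"]] = change["item"]
        last_seen = change["entry_number"]
    return last_seen
```

The client only has to remember one number between runs. This sketch does not handle deleted records, which would need an explicit marker in the change list.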

There is no one obvious standard for this pattern, so we are researching a number of different approaches, including encoding data in Atom/RSS feeds and emerging patterns such as the event streams used by products like Apache Kafka.

Summary

Data-focused APIs present a particular set of challenges to API design. We are trying a few techniques to fit the varied needs of different data-focused API users. We learn by building things and seeing how people use them, and observing what works and what doesn’t.

If you’re interested in the approach described above, why not build something against our Registers API, which supports both bulk downloads and API access, and see what you think? For example, if you’re building an application like Petitions, which needs an up-to-date list of countries, use our Countries Register API to keep that list up to date.

We’d love to hear your feedback in the comments section below.

3 comments

  1. Comment by Joao Fernandes posted on

    You may want to review the links after "Registers API". Unless the random letters with links are the clue to some sort of puzzle of course.

  2. Comment by Jan Suchal posted on

    One very easy way to do this updates/sync API is this. https://gist.github.com/jsuchal/360308ac6e703d21512a

    This works if you never delete any data. If you do, keep a separate event stream (you could use triggers in relational database to keep it in sync) and use the same API on top of it.

    The nice thing about this API is, that you can use it for initial import and updates without any change. Works like charm.
