https://gdstechnology.blog.gov.uk/2015/01/07/validating-a-distributed-architecture-with-json-schema/

Validating a distributed architecture with JSON Schema

One of the challenges involved in developing a loosely-coupled publishing architecture like www.gov.uk is ensuring that all the components are speaking the same language.

At a very basic level, that can just mean agreeing on a standard set of REST endpoints that accept or emit JSON. But that's not really enough to ensure that the actual content of that JSON is what's expected.

Our "publishing 2.0" architecture is split into three layers: the backend publishing applications; the frontend apps that display the content to users; and the middle tier, which we've called the content store. These three components can be worked on by three completely separate developers, or even three separate teams. So the problem is how to keep them in sync, so that:

  • the data being emitted by the publishing app remains in a format that can be accepted by the content store
  • the content store in turn produces data that can be displayed by the frontend app
  • and finally that the frontend app knows what to do with all the data it is being sent

We've chosen to use JSON Schema to validate the contents of our JSON data, and then use the schemas as part of a contract-based testing suite to ensure that all the apps are using the data correctly.

Defining a JSON Schema

One of the benefits of JSON over something like XML has always been its relatively fluid and unstructured nature. However, that very flexibility can become a liability when working in a distributed manner. JSON Schema is a draft set of conventions for describing the structure of a JSON document and validating that individual documents conform to that structure. Schemas are written in JSON, and there are validation libraries for a range of languages, including Ruby.

A schema, at its most basic level, defines what elements are allowed inside the document. Optionally, it can do some basic validation on the types of those elements, or constrain them to certain values. Here's a very cut-down version of the schema we are using to validate the Case Study format (for the full version, see the file in the govuk-content-schemas repo on GitHub):

{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "object",
  "additionalProperties": false,
  "required": [
    "base_path",
    "format",
    "update_type"
  ],
  "properties": {
    "base_path": {
      "type": "string"
    },
    "content_id": {
      "type": "string",
      "pattern": "^[a-f0-9]{8}-[a-f0-9]{4}-[1-5][a-f0-9]{3}-[89ab][a-f0-9]{3}-[a-f0-9]{12}$"
    },
    "title": {
      "type": "string"
    },
    "format": {
      "type": "string",
      "enum": [
        "case_study"
      ]
    },
    "public_updated_at": {
      "type": "string",
      "format": "date-time"
    },
    "update_type": {
      "enum": [
        "major",
        "minor",
        "republish"
      ]
    },
    "details": {
      "type": "object",
      "additionalProperties": false,
      "required": [
        "body"
      ],
      "properties": {
        "body": {
          "type": "string"
        },
        "description": {
          "type": "string"
        }
      }
    }
  }
}

There's a fair bit going on here, so let's take it in order. Since a schema is itself a JSON document, the information is stored as properties of an object (i.e. what other languages call a hash or dict). The first property, $schema, defines the version of the schema we are using; in this case, the latest draft, 04. We then provide definitions for the base level of the data: here we know it will be an object (as opposed to an array).

additionalProperties is a boolean determining whether the object is allowed to contain properties we have not explicitly defined; we don't want that. Next, required is an array of the properties that must be present for the document to be valid.
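To make those two keywords concrete, here is a minimal hand-rolled sketch in plain Ruby of what required and additionalProperties mean for the top level of a document. It is an illustration only, not how a JSON Schema library implements them; the method name top_level_errors, the simplified EXAMPLE_SCHEMA and the sample document are all made up for this example.

```ruby
require 'json'

# Simplified, hypothetical schema: a cut-down cousin of the one above.
EXAMPLE_SCHEMA = JSON.parse(<<~SCHEMA)
  {
    "type": "object",
    "additionalProperties": false,
    "required": ["base_path", "format", "update_type"],
    "properties": {
      "base_path": { "type": "string" },
      "format": { "type": "string" },
      "update_type": { "type": "string" },
      "title": { "type": "string" }
    }
  }
SCHEMA

# Illustration only: reports missing required keys and, when
# additionalProperties is false, any keys the schema does not define.
def top_level_errors(schema, document)
  allowed  = schema.fetch('properties', {}).keys
  required = schema.fetch('required', [])

  errors = (required - document.keys).map { |key| "missing required property '#{key}'" }

  if schema['additionalProperties'] == false
    errors += (document.keys - allowed).map { |key| "unexpected property '#{key}'" }
  end

  errors
end

document = {
  'base_path' => '/case-studies/example',
  'format' => 'case_study',
  'colour' => 'blue'
}

puts top_level_errors(EXAMPLE_SCHEMA, document)
# missing required property 'update_type'
# unexpected property 'colour'
```

A real validator does far more than this (types, enums, patterns, nesting), but the two set differences capture the intent of the keywords.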

Finally, we get to the properties themselves. Here we have a nested object whose keys are the names of properties in our document, and whose values are further objects describing what data is valid for that property. For base_path, for example, we only specify that it must be a string. For content_id we supply a regex that checks the value is a UUID, and public_updated_at must match the ISO date-time format (note there is no date type in JSON, so dates are sent as strings).
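As a rough illustration of what the content_id pattern accepts, the same regex can be tried out directly in Ruby. This is a sketch rather than what a validation library does internally; the constant and method names are invented for the example. One assumption worth flagging: in Ruby, ^ and $ match at line boundaries, so the sketch anchors with \A and \z instead.

```ruby
# The pattern from the schema, re-anchored with \A and \z so that it must
# match the whole string (Ruby's ^ and $ only anchor to individual lines).
CONTENT_ID_PATTERN =
  /\A[a-f0-9]{8}-[a-f0-9]{4}-[1-5][a-f0-9]{3}-[89ab][a-f0-9]{3}-[a-f0-9]{12}\z/

# Hypothetical helper: true only for lower-case UUID strings.
def valid_content_id?(value)
  value.is_a?(String) && CONTENT_ID_PATTERN.match?(value)
end

puts valid_content_id?('5c7a3f2e-8b1d-4e6a-9f0c-1234567890ab')
puts valid_content_id?('5C7A3F2E-8B1D-4E6A-9F0C-1234567890AB')
puts valid_content_id?('not-a-uuid')
```

Note that the pattern only allows lower-case hex digits, so an upper-case UUID is rejected.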

details is more interesting, because it defines a nested object that can itself be validated. In the content-store data structure, the top level contains the general elements that are supplied for all document types, which we have already seen (base_path, update_type etc.), whereas all the format-specific content goes under details. In JSON Schema, this is done by defining what is effectively another schema inside the first: we leave out the $schema declaration, but otherwise the elements are the same as at the top level of the schema: type, required, additionalProperties, and then the properties themselves.

There's quite a bit more possible in JSON Schema, including validating the contents of arrays, and creating reusable definitions which can be referenced throughout a document. A good reference is Understanding JSON Schema.
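As a small sketch of those two features, the Ruby hash below builds a draft-04 schema fragment in which an array's items refer to a reusable definition via $ref. The attachments and attachment names, and their fields, are hypothetical and not taken from govuk-content-schemas.

```ruby
require 'json'

# Hypothetical schema fragment: "attachments" is an array whose items must
# each match the reusable "attachment" definition.
ATTACHMENTS_SCHEMA = {
  '$schema' => 'http://json-schema.org/draft-04/schema#',
  'type' => 'object',
  'definitions' => {
    'attachment' => {
      'type' => 'object',
      'additionalProperties' => false,
      'required' => ['url'],
      'properties' => {
        'url' => { 'type' => 'string' },
        'title' => { 'type' => 'string' }
      }
    }
  },
  'properties' => {
    'attachments' => {
      'type' => 'array',
      'items' => { '$ref' => '#/definitions/attachment' }
    }
  }
}.freeze

# Serialise to the JSON form that would live in a schema file.
puts JSON.pretty_generate(ATTACHMENTS_SCHEMA)
```

The same definition could then be referenced from any other property in the schema rather than being repeated.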

Validating the data

Now that we have a schema that defines our data structure, we can use it to check that the data we're producing is actually valid. Since our publishing app Whitehall, the content-store, and the frontend app are all written in Ruby, we can use the json-schema gem to do this. It's as simple as:

JSON::Validator.fully_validate(@schema_path, @data)

where @schema_path is the path to the schema file, and @data is a string containing the data to be validated. This method returns an array of error messages, which is empty if the data is valid.

We don't want to do this validity check live. Apart from any overhead, it's just unnecessary: the data is being produced by a presenter, and that presenter (like all our code) is fully covered by unit tests. So we can add a test that checks the presenter produces valid output. We've wrapped this up in a utility class which we can call from our presenter tests:

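require 'json-schema'

class GovukContentSchema

  # The formats we currently have schemas for
  VALID_SCHEMA_NAMES = [
    'case_study',
    'unpublishing',
    'redirect',
    'coming_soon',
  ].freeze

  # Returns the path to the schema file for a known format,
  # or nil if the name is not in VALID_SCHEMA_NAMES
  def self.schema_path(schema_name)
    if VALID_SCHEMA_NAMES.include? schema_name
      Rails.root.join("lib/govuk_content_schemas/#{schema_name}.json").to_s
    end
  end

  class Validator
    # data is a JSON string to be checked against the named schema
    def initialize(schema_name, data)
      @schema_path = GovukContentSchema.schema_path(schema_name)
      @data = data
    end

    def valid?
      errors.empty?
    end

    # Validation runs once; the resulting array of errors is memoised
    def errors
      @errors ||= JSON::Validator.fully_validate(@schema_path, @data)
    end
  end
end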

and the tests themselves look something like this:

# The schema to validate against; must be one of VALID_SCHEMA_NAMES
schema_name = 'case_study'

case_study = create(:published_case_study,
                    title: 'Case study title',
                    summary: 'The summary',
                    body: 'Some content')
content_item_hash = PublishingApiPresenters::CaseStudy.new(case_study).as_json

validator = GovukContentSchema::Validator.new(schema_name, content_item_hash.to_json)
assert validator.valid?, "JSON not valid against #{schema_name} schema: #{validator.errors}"

We do this with all the various permutations of data produced by the presenter, which gives us a good degree of certainty that it's doing the right thing.

Contract testing

By itself, simply having the schemas and running unit tests against them is not hugely valuable. Although we've ensured that the content coming out of the publishing app is "valid", we haven't done anything to ensure that the other layers in the architecture are expecting the data in that format, or that they correctly process it.

Where the schemas really come into their own is when they're used as part of a suite of contract tests between the different layers. David Heath, Tekin Suleyman, Edd Sowden and I have designed a workflow process that uses these contract tests to ensure backwards compatibility when working on changes to the various apps.

We're implementing this by moving the schemas themselves to a central repo, govuk-content-schemas, and pointing the tests in all the apps to the schemas there. The tests use an environment variable to determine the location of the local checkout of the schemas repo and run their validation against that.
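A minimal sketch of that lookup might look like the following. The variable name GOVUK_CONTENT_SCHEMAS_PATH, the default checkout location and the flat file layout are all assumptions made for illustration; the real apps and repo may differ.

```ruby
# Hypothetical default: a checkout of the schemas repo next to the app.
DEFAULT_SCHEMAS_PATH = '../govuk-content-schemas'

# Resolves the schema file for a format from the checkout named by the
# (hypothetical) GOVUK_CONTENT_SCHEMAS_PATH environment variable.
# env is injectable so the lookup can be exercised without touching ENV.
def content_schema_path(schema_name, env = ENV)
  root = env.fetch('GOVUK_CONTENT_SCHEMAS_PATH', DEFAULT_SCHEMAS_PATH)
  File.join(root, "#{schema_name}.json")
end

puts content_schema_path('case_study', { 'GOVUK_CONTENT_SCHEMAS_PATH' => '/var/schemas' })
# /var/schemas/case_study.json
```

Keeping the location in one environment variable means a CI job can point every app's test suite at a branch of the schemas repo without changing any application code.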

The schemas repo also includes a set of curated example content for each schema, in the form of JSON files that illustrate the common case and the main variations for each format. For example, we have a file for a standard case study, one for a translated case study, and one for a case study that has been archived.
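One way a test could walk those example files is sketched below. The helper name failing_examples and the flat directory of .json files are assumptions; the actual check is passed in as a block, so the sketch does not depend on any particular validation library.

```ruby
require 'json'

# Parses every .json example in examples_dir and yields it to the given
# block, which should return an array of error messages (empty if valid).
# Returns a hash of file name => errors for the examples that failed.
def failing_examples(examples_dir)
  Dir.glob(File.join(examples_dir, '*.json')).sort.each_with_object({}) do |path, failures|
    errors = yield(JSON.parse(File.read(path)))
    failures[File.basename(path)] = errors unless errors.empty?
  end
end
```

In a real test the block would call something like the JSON::Validator.fully_validate method shown earlier, passing example.to_json as the data, and the test would assert that the returned hash is empty.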

The benefit of this approach is that changes to the schema can be proposed by developers working on any of the systems, and they can trigger the tests to run in all the other apps to ensure that nothing is broken.

Here's an example of how the workflow might operate.

A frontend developer is iterating on the Publication format on GOV.UK. They want to include details of when each publication attachment was uploaded, but this information is not currently present in the content store representation of a publication.

They add the new field to the example file for a Publication in govuk-content-schemas, and iterate on the frontend template design to include the new information on the page.

Next they make a pull request for govuk-content-schemas that updates the schemas for a publication and updates the existing examples to include the new data.

The tests are run against their branch, checking that their changes still validate against the schema and, importantly, that the existing tests on the publishing app still pass when run against the new schema. That ensures that the changes are backwards compatible and that nothing is broken.

The changes to the frontend app and the schemas can now be merged, and the team maintaining the backend app can schedule their own work to start populating the new data.

We're still trying out this process, and it's yet to be used for real, so no doubt it will evolve as we iron out the problems. But we're hoping it will provide us with a more consistent and reliable approach to collaboration between the different teams working on the publishing architecture.
