So, one year ago I wrote a post sharing some, well, open thinking about how I wanted to do some open working and create some open data. This post contains some thoughts on a few things I have noticed whilst actually doing it, in the form of developing and launching a public beta of some new functionality on the ONS website.
So. Let's set the scene. I'm still working in the Office for National Statistics, near a roundabout, near the edge of Newport in South Wales. Statistics are still important. Our national discourse is complex, and the need for the calm inclusion of statistics in it feels ever more important. I continue to work on a digital service to publish ONS statistics with as much utility for all of our users as possible. I happen to think that, for statistics, The Web counts as just as important a user as anyone else, and this post is mainly about that.
So, the web as a user, hey? Sounds like the kind of thing I would say, right? But what does it mean? We tend to think of users as an abstract concept. ONS (massive shout out to Al Davies here) helps to contextualise users on a spectrum from experts to (more or less) everyday citizens with differing needs. We, at times, use these personas to help us understand the type of user we may want to do some face-to-face research with. The web is a harder user to speak to, so we had to work this one out ourselves. The starting point was recognising that we needed to evolve, and that the way we publish statistics is changing because of this. We are trying to move from attaching inconsistently formatted Excel files to HTML pages towards consistent CSV files, both attached to HTML pages (for humans) and, in addition, gently woven into the fabric of the web. The latter is still for humans, but ones whose context we don't know. Some examples of this are:
Fact checkers who want to be able to, in real time, query a national statistical institution to see if a statement is using a statistic that is correct and contextually valid
A search engine being able to understand that a set of statistics we publish is a dataset, treat it as a thing in its own right, and so show it differently within a set of search results
A voice interface/assistant that you can ask questions of that is able to (behind the scenes) draw from the statistical information published
Countless civic activists and media staff who want to be able to quickly and easily make use of consistent and well described data to create a variety of time series charts and visualisations from the published information.
(In my day job I am speaking to people who are trying to do all of the above, and I am very keen to speak to many, many more. DM me. I will reply.)
For any of these to work we need to publish data openly, license it correctly and describe it using the conventions of the web, so that it is easy for humans and machines to find and use information with as little effort, and as much context, as possible.
To step into the nerd zone for a second, this is where it gets tough.
To make this work a collection of things had to happen. We've written about this in another post, but essentially, over the last 12 months, a bunch of Python scripts have been used to process messy data held in Excel files and turn it into tidy, consistently structured CSV files. These are pushed into a graph database so that they can be treated as individual data points. Individual data points help answer specific questions, and consistency in how they aggregate makes it easier to work with the data.
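As a rough sketch of what that tidying step involves (the function and column names here are illustrative, not the actual ONS pipeline), the job is essentially to unpivot a wide, spreadsheet-shaped table into one-observation-per-row CSV:

```python
import csv
import io

def tidy(wide_rows, id_column, value_columns):
    """Unpivot a wide table: yield one row per (id, period, value) observation."""
    for row in wide_rows:
        for col in value_columns:
            yield {id_column: row[id_column], "period": col, "value": row[col]}

# A messy, wide-format extract: one column per time period (invented data).
wide = [
    {"area": "Newport", "2016": "12.1", "2017": "12.4"},
    {"area": "Cardiff", "2016": "30.2", "2017": "30.9"},
]

observations = list(tidy(wide, "area", ["2016", "2017"]))

# Write the tidy rows out as a consistently structured CSV.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["area", "period", "value"])
writer.writeheader()
writer.writerows(observations)
tidy_csv = buf.getvalue()
```

Every row is now a single data point, which is what makes it possible to load the output into a graph database and query individual observations.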
It has been really pleasing to see this consistent layer of data start to build across the outputs of ONS, but we have lots more ground to cover. However, even in doing this, a few things really jumped out at me.
CSV-W is a wonderful thing
Despite having a name that has caused no little confusion for some, I would suggest we all bet our farms on using CSV on the Web. Having a light-touch way of describing what is contained within a dataset, both in terms of the columns and the descriptive metadata, is staggeringly useful. It isn't magic. Consistently described bad data would still be bad data, but as a few kb of JSON it is so sensible. Let's all agree to use it? Deal. (It also seems like such a sensible step towards creating RDF. I'm still not 100% convinced that is actually needed for the web as a user, but that is probably a different post.)
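To show how little JSON we're talking about, here is a minimal, hand-rolled CSVW metadata document for a small tidy CSV (the title, filename and columns are invented for illustration; the vocabulary itself is defined by the W3C CSV on the Web specs):

```python
import json

# A minimal CSVW metadata document describing one CSV file:
# descriptive metadata plus a schema for each column.
metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "observations.csv",  # illustrative filename
    "dc:title": "Example observations",
    "dc:license": "http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/",
    "tableSchema": {
        "columns": [
            {"name": "area", "titles": "Area", "datatype": "string"},
            {"name": "period", "titles": "Period", "datatype": "gYear"},
            {"name": "value", "titles": "Value", "datatype": "decimal"},
        ]
    },
}

doc = json.dumps(metadata, indent=2)
print(doc)
```

That really is the whole thing: a few kb of JSON sitting next to the CSV, telling both humans and machines what each column means and under what licence the data is published.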
Swagger is top
JSON-LD doesn't have to be complicated
Related to the CSV-W thoughts, a starting point for JSON-LD can be really simple. We've built a single context file that is linked to from our APIs. This adds context and meaning to the data and acts as a very basic, quasi-semantic link to the wider web. Just telling the world what we mean by a few simple terms was really easy, and it can be built on as we and the web evolve.
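As an illustration (the term names and URL here are made up, not the actual ONS context file), a shared context can be nothing more than a handful of term-to-URI mappings, mostly borrowed from schema.org, that every API response points at:

```python
import json

# A single, shared @context: each plain term used in API responses
# is mapped to a well-known URI.
context = {
    "@context": {
        "schema": "http://schema.org/",
        "title": "schema:name",
        "description": "schema:description",
        "issued": "schema:datePublished",
        "keywords": "schema:keywords",
    }
}

# An API response can then use the short, friendly terms
# and simply link back to the context file.
response = {
    "@context": "/context.json",  # illustrative URL for the shared context
    "title": "Example dataset",
    "issued": "2017-11-01",
}

context_doc = json.dumps(context, indent=2)
```

A JSON-LD-aware consumer expands `title` to `schema:name`; everyone else just sees ordinary, readable JSON.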
Schema.org is so fit.
Just a shout out to how important this little site is. It is by no means perfect, and some use cases can be tricky to map onto its types, but it is a foundational example of why we don't need to keep inventing new things. Smart people have thought about how to represent information in a way that means we can develop common understandings. This is what The Web is good at.
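For example (all the values here are placeholders), describing a statistical release using schema.org's Dataset type is a small block of JSON-LD, which is exactly what lets a search engine treat it as a dataset in its own right:

```python
import json

# JSON-LD using schema.org's Dataset and DataDownload types, as it
# might be embedded in a <script type="application/ld+json"> tag.
dataset = {
    "@context": "http://schema.org",
    "@type": "Dataset",
    "name": "Example time series",  # placeholder title
    "description": "An illustrative dataset description.",
    "license": "http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/",
    "distribution": {
        "@type": "DataDownload",
        "encodingFormat": "text/csv",
        "contentUrl": "https://example.org/observations.csv",  # placeholder URL
    },
}

jsonld = json.dumps(dataset, indent=2)
```

Nothing here was invented by us: the types and properties are common vocabulary that crawlers already understand.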
We missed a trick in conneg
I am absolutely kicking myself for having an api. subdomain that is separate from the website. We have loads of practical reasons for doing this, but it has broken a little bit of the purity of what the site could and should be. It means that the CSV-W doesn't quite sit where I want it to yet. Content negotiation is not a new thing, but I think it might actually still be one of the most important. (HTTP calls, response headers, conneg. These are the building blocks.)
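The idea, sketched very loosely (this is not the ONS implementation, and real conneg also handles q-values, wildcards like text/* and ordering rules), is that one URL serves different representations depending on the request's Accept header, rather than pushing the machine-readable formats off to a separate subdomain:

```python
def negotiate(accept_header, available):
    """Pick the first offered content type that the Accept header allows.

    Deliberately naive: ignores q-values and partial wildcards.
    """
    accepted = [part.split(";")[0].strip() for part in accept_header.split(",")]
    for content_type in accepted:
        if content_type == "*/*":
            return available[0]
        if content_type in available:
            return content_type
    return None

# One resource, several representations: HTML for people,
# CSV for data users, CSVW metadata for machines.
offers = ["text/html", "text/csv", "application/csvm+json"]

print(negotiate("text/csv", offers))                         # a data user asking for CSV
print(negotiate("text/html,application/xhtml+xml", offers))  # a browser
```

With that in place, the HTML page, the CSV and its CSV-W description can all live at the same URL, which is where I want them to sit.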
These are just a few quick points, but it feels important to share the good and bad things that we have learnt. Please do get in touch if you want to easily and consistently bake stats into what you are building. I'd love to help you make a difference so that we can all save the world.