Turning to Tables: A Case for ‘Flat’ Files in Open Data

Data is changing. In the dim and distant past, when data was created, collected and used on paper, all that mattered was the information on the page. The dawn of the computer added a further dimension to our consideration of data: the file format. To the lay person, a file format is simply an extension on a given file that denotes which computer software it has been designed to work with. For example, a Microsoft Excel file has a ‘.XLS’ extension and an Adobe Acrobat file a ‘.PDF’ extension. However, a file format also describes the structure of a file: the way the information contained inside it is organised and retrieved. File formats are critical for Open Data, an umbrella term describing any publicly available dataset that can be accessed and reused by anyone. Open Data changed the way people consider data formats because, to be useful, data now had to be in a file format that people could easily open and play with.

The formats debate has undergone a gradual evolution over the past decade. In the early years of Open Data, many people championed the Resource Description Framework (RDF), a model that represents data as a graph, because it was considered to be the most accurate way to release Open Data. Despite its high technical accuracy, many ordinary people rejected RDF because it is highly complex to create and therefore excludes less technically advanced data owners. From 2010, eXtensible Markup Language (XML), a format created by W3C, took over the crown as ‘the best choice’ for publishing Open Data. XML was championed because it supported a strong ‘schema’ used to organise the dataset and provide easy retrieval of information. However, since 2013, there has been a backlash against XML due to the difficulty of building the schema into mobile apps. Many users and publishers now regard Comma Separated Values (CSV), a schema-free format that uses tables to display data, as a superior choice for publication of Open Data. These tables are ‘flat’, meaning each data point is linked only to a column and a row. This trend toward flat data is now backed by W3C’s highly influential Technical Architecture Group, who declared 2014 ‘The Year of CSV’. Overall, the gradual evolution toward CSV indicates a desire for simplicity among data users, but it also points to a deeper demand that now drives many Open Data advocates: a rejection of Data Provenance.

The trend toward flattened datasets supports what 21c have found from working to open the data of over 120 Local Authorities across the world: the less provenance or ‘baggage’ a dataset carries, the more useful it is. The challenge with the RDF and XML formats is that they preserve a relationship between the data points that is defined before the data is accessed by the end user. This relationship or ‘schema’ is like having a passenger tell you which parking space to choose; they may be right, but everyone prefers to choose for themselves. On the other hand, ‘flat’ data, like CSV, has no pre-defined relationships, just a table of data points. This flat structure gives users a free hand to interpret the data how they want. The ‘flatter’ a dataset is, the better for users, and this is why we always recommend Open Data be released as flat tables.
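To make the ‘free hand’ point concrete, here is a minimal sketch in Python, using only the standard library. The dataset, its column names and the total_by helper are all hypothetical, invented purely for illustration. The same flat table is grouped two different ways, and the choice of grouping is made by the user when reading the file, not by the publisher when releasing it.

```python
import csv
import io
from collections import defaultdict

# Hypothetical flat dataset: one observation per row, no nesting,
# no relationships defined in advance by the publisher.
FLAT_CSV = """authority,ward,year,potholes_repaired
Northtown,East,2013,120
Northtown,West,2013,95
Southville,Central,2013,210
Northtown,East,2014,140
"""


def total_by(column, text=FLAT_CSV, value="potholes_repaired"):
    """Sum the value column, grouped by whichever column the user picks."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(text)):
        totals[row[column]] += int(row[value])
    return dict(totals)


# The same table, interpreted two ways by the end user.
by_authority = total_by("authority")
by_year = total_by("year")
print(by_authority)
print(by_year)
```

A nested format would typically fix one of these groupings in the file itself (for example, wards nested inside authorities), so a user who wanted the other view would first have to flatten the data again.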

2 comments

  1. harrywood · December 8, 2014

    XML peaked a lot earlier than 2010. More like 2000, I would say. But maybe it depends which audience you’re talking about. I suppose Google Trends shows the general public’s understanding of it: http://www.google.com/trends/explore#q=XML (doesn’t go back far enough to show when it peaked)

    Back in its heyday people rejoiced in the “human readability” and “self-describing” nature of XML, but it seemed to lose that as the “enterprise modelling” people got carried away with it. There’s a lot of complex cruft that tends to come with XML, particularly all the description languages for schemas. The best choice depends on use cases of course. If your data naturally has a nested schema, then XML might be a better choice than CSV.

    But then there’s a recent trend to go for JSON and YAML for some of these cases. (This is more like a 2010 thing, I would say.) These are a nice half-way house. A bit of nested structure, but a bit more flat-file-ish. We use JSON for TransportAPI.com. We’re looking into Swagger/RAML specs for this, so maybe it’s getting too complex and going the way of XML!

    We do also output some good old CSV. In fact CSV can be more complicated than you might imagine. Check out the “what is bad about CSV?” section here: http://data.okfn.org/doc/csv

    • 21cleaders · December 8, 2014

      A very good addition to the article. In terms of the dates, we use the popularity ‘peaks’ specifically in relation to Open Data rather than data more generally, where XML has been a popular choice for a longer period. In terms of capability though, the fundamental point I believe is that CSV is the ‘foundation’ for Open Data. You can use a CSV foundation to build a ‘JSON’ house or a ‘YAML’ house or a ‘RAML’ house, but CSV remains the most widely applicable format if you were to release in just one.

      Indeed, here at 21c we advise our clients to output in CSV for maximum re-use and also in JSON (specifically with embedded semantic information) to increase their appeal to developers. We have tools specifically designed to convert CSV to JSON, and I would agree that JSON’s tree schema is superior to the XML version because the schema nesting is more user-friendly.

      One further point to make about CSV is that portals like CKAN are now allowing users to write direct queries of flat CSV files on demand through API-style scripts. The ability to write pluggable queries of a CSV (even if this results in a crude dump file rather than a direct app plugin) takes CSV closer to mainstream acceptance and will potentially see it increase further in popularity.
