Data is changing. In the dim and distant past, when data was created, collected and used on paper, all that mattered was the information on the page. The dawn of the computer added a new dimension to our consideration of data: the file format. To the lay person, a file format is simply an extension on a file name that denotes which computer software it is designed to work with. For example, a Microsoft Excel file has an ‘.XLS’ extension and an Adobe Acrobat file a ‘.PDF’. However, a file format also describes the structure of a file: the way the information contained inside it is organised and retrieved. File formats are critical for Open Data, an umbrella term for any publicly available dataset that can be accessed and reused by anyone. Open Data changed the way people think about data formats because, to be useful, data now has to be in a file format that people can easily open and work with.
The formats debate has undergone a gradual evolution over the past decade. In the early years of Open Data, many people championed the Resource Description Framework (RDF), a format that structures data as a graph, because it was considered the most accurate way to release Open Data. Despite its technical rigour, many ordinary users rejected RDF because it is highly complex to create and therefore excludes less technically advanced data owners. From 2010, eXtensible Markup Language (XML), a format standardised by the W3C, took over the crown as ‘the best choice’ for publishing Open Data. XML was championed because it supports a strong ‘schema’ that organises the dataset and allows easy retrieval of information. Since 2013, however, there has been a backlash against XML due to the difficulty of building its schemas into mobile apps. Users and publishers now broadly agree that Comma Separated Values (CSV), a schema-free format that presents data as tables, is a superior choice for publishing Open Data. These tables are ‘flat’, meaning each value is linked only to a column and a row. The trend toward flat data is now backed by the W3C’s highly influential Technical Architecture Group, which declared 2014 ‘The Year of CSV’. Overall, the gradual evolution toward CSV indicates a desire for simplicity among data users, but it also points to a deeper demand that now drives many Open Data advocates: a rejection of Data Provenance.
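To make the contrast between the three formats concrete, here is a sketch of how a single hypothetical record, a car park and its capacity, might be expressed in each. The element names, URIs and column headers are invented for illustration, not drawn from any real published vocabulary:

```
# RDF (Turtle syntax): data as a graph of subject-predicate-object triples
<http://example.org/carpark/1> <http://example.org/vocab#capacity> "120" .

# XML: data nested according to a pre-defined schema
<carParks>
  <carPark id="1">
    <capacity>120</capacity>
  </carPark>
</carParks>

# CSV: a flat table, one row per record
id,capacity
1,120
```

All three carry the same fact, but only the CSV version can be opened, read and understood without first learning the vocabulary or schema the publisher chose.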
The trend toward flattened datasets confirms what 21c have found from working to open the data of over 120 Local Authorities across the world: the less provenance, or ‘baggage’, a dataset carries, the more useful it is. The challenge with the RDF and XML formats is that they preserve relationships between data points that are defined before the data reaches the end user. This pre-defined relationship, or ‘schema’, is like having a passenger tell you which parking space to choose: they may be right, but everyone prefers to choose for themselves. ‘Flat’ data such as CSV, on the other hand, has no pre-defined relationships, just a table of data points. This flat structure gives users a free hand to interpret the data as they wish. The flatter a dataset is, the better for users, and this is why we always recommend that Open Data be released as flat tables.
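The practical appeal of flat data is easy to demonstrate: Python’s standard library can read a CSV table in a couple of lines, with no schema or vocabulary to learn first. A minimal sketch, using invented column names and values for illustration:

```python
import csv
import io

# A flat Open Data table: every value is addressed only by a column and a row.
flat_data = """name,spaces,postcode
Market Street,120,LS1 6DT
Riverside,85,LS2 3AB
"""

# csv.DictReader imposes no pre-defined relationships; each row is simply
# a mapping from column header to value, left for the user to interpret.
rows = list(csv.DictReader(io.StringIO(flat_data)))

for row in rows:
    print(row["name"], row["spaces"])
```

Nothing in the file dictates how the rows relate to each other or to other datasets; the user decides that for themselves, which is exactly the freedom the flat format is valued for.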