The Future of Data Sharing
By Charles Roper, Sussex Biodiversity Record Centre
First published in Adastra 2013; February 2014
We at SxBRC would like to improve the way we share and distribute data so that we're better able to exploit modern technology and maximise reuse. In order to do this we need to sharpen up on our data sharing agreements and start the process of applying clear, machine-readable licenses to datasets. We need to do this in collaboration with data creators and with the blessing of recorders and groups. Read on for the details on what, why and how.
Table of Contents
- The World Wide Web
- A little data history
- So what is data sharing exactly?
- But what does it mean to be open data?
- Realising the potential of digital biological data
- The Creative Commons
- Creative Commons: Open Data
- Creative Commons: Some Rights Reserved
- Our Recommendations
The World Wide Web
Think back for a moment to Summer 2012. In four short hours, the British perception of the London Olympics spun from cynicism to delight to pride. Mid-way through this epic spectacle - Danny Boyle's aptly titled "Isles of Wonder" - a suburban house levitates, revealing a man tapping away at a computer. The words "this is for everyone" enigmatically flash around the arena on vast LED screens. As with much of the ceremony, you probably thought "what does it mean?" while being enthralled nonetheless.
The man at the computer was Sir Tim Berners-Lee, inventor of the World Wide Web, and you may not have realised the message was a tweet. It was a simple message posted live on the World Wide Web via the Twitter social network at the same time as being illuminated around the Olympic stadium and on into our homes via satellite broadcast television. The extraordinary story of the World Wide Web, invented just 24 years previously by the CERN scientist, reached a crescendo at that moment and so laid an important milestone in our history. The message, the medium and the moment were indeed for everyone. Berners-Lee summarised it thus:
"The Web is about connecting people through technology, not about documents. The Olympics are about connecting people too. It would be nice if the Olympics bring people to use the Web to understand each other, break down national and cultural barriers and look at each other from a more beautiful point of view."
The staggering achievement of the Web must not be underestimated. It is not merely a technology invented by one person or even a team, but one of humanity's finest examples of collaboration, agreement and freedom. It is easy to take for granted how simple it is now to shop online, keep in touch with friends, look up information, research, work and pass the time via computers on our desks, and smart devices in our pockets. More and more people around the world are meeting online, starting friendships, falling in love. Most importantly, our ability to contribute knowledge is enhanced beyond anything anyone could have imagined. We can create a website, edit Wikipedia, blog, tweet, post pictures, upload data, write comments and answer questions. In doing so, anyone can become a publisher. Anyone can contribute. Anyone with something to say can now say it and anyone with a web-capable device can see it. That is extraordinary. Twenty years ago such civilised, democratic notions were the stuff of science fiction, but now they're woven into our everyday lives.
It is easy to forget that, at the time the web was invented, computers were abstruse (still are in many ways), difficult machines. The domain of experts and computer "whizz-kids". They were expensive, unfriendly and inaccessible. Furthermore, getting one computer to talk to another computer to exchange information was extremely difficult. Copying files from one system to another was fraught with complexity. Computers were essentially islands cut off from one another to all but a few intrepid techies. And companies such as Microsoft and Apple had a vested interest in keeping it so - choice and freedom may be good for us, but they're not good for business (or so they thought at the time). So in creating a highly fault-tolerant way of allowing computers to read "pages" of information from other computers, and in creating a relatively simple way to do it, and in getting others to agree to use the same simple ways, and doing it in such a way that no one company, organisation, country or individual could completely control it, and in actually getting it off the ground at all - well, that's a little piece of genius.
Today, we can access vast tracts of human knowledge instantly with a simple search on a device we carry in our pocket. Paying for access to content is the exception rather than the rule. This free, super-abundant access to information, and the unrestricted ease with which we can publish it, has fuelled further content creation, insight, inspiration and collaboration. It's an information bloom. A virtuous circle of human creativity and consumption. And it's all based on agreement and collaboration.
A little data history
Four months after the Olympic opening ceremony, Berners-Lee, together with fellow scientist Sir Nigel Shadbolt (Professor of Artificial Intelligence and Head of the Web and Internet Science Group at the University of Southampton), founded the Open Data Institute. In this new initiative Sir Tim and Sir Nigel are catalysing the next chapter not only in the story of the Web but also in the much longer history of our human instinct to share information. They have said the burgeoning Open Data movement is roughly at the stage the web was 20 years ago. The blue touch-paper has been lit and it's about to take off. Together with those same visionaries who kicked off the Web (and a few younger ones), they're hoping to do for sharing data what we've already achieved for sharing general information via the web.
So what is data sharing exactly?
It's not a new concept. Data has been around since humans have been able to think and record. Sumerian writing and Egyptian hieroglyphs from around 3500BC recorded many aspects of our lives. Pythagoras (572BC-495BC), on whose ideas and theories we have based much of our knowledge of the universe, is quoted as saying "Friends share all things." Lascaux cave paintings in southwestern France depict hunting scenes with animals - data, in other words - messages, ideas, concepts. Indigenous Australians created "songlines" whereby they would sing out "the name of everything that crossed their path - birds, animals, plants, rocks, waterholes – and so singing the world into existence". Such creations are said to have guided them along the routes of their ancestors. The Inuit laid monumental "Inukshuk" as way-finding markers to fishing grounds, and the 150BC Turin Papyrus was one of the first topographical maps.
19th-century oceanographer and meteorologist Matthew Fontaine Maury was a pioneer. He extensively analysed ships' logs and weather, mapped his data and in 1855 published his book, The Physical Geography of the Sea, so that "each may have before him, at a glance, the experience of all". He shared his findings, encouraged others to contribute and created a worldwide project that has formed the basis of safe global maritime navigation ever since. He was the Tim Berners-Lee of his day, in a sense. Other examples are plentiful: Carl Linnaeus, Alfred Russel Wallace, Charles Darwin, Dr John Snow. All worked to produce data and knowledge for the good of all. Our contemporary biological recording community follows a similar tradition. Friends collect data and share it. A loose, decentralised network of people working together for the benefit of all.
But what does it mean to be open data?
Open data is certain information the creator or custodian has explicitly decided to freely share with others for the good of all. By definition Open Data should be "freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control." In so doing, the data becomes a part of the Commons[^1].
But here's the rub: in the past, before the advent of computer networks, we would publish data in books. In the case of biological recording we would publish atlases, journals and papers, sometimes for a nominal sum, sometimes for free, but rarely with the intent of pure monetary gain. Publishers need their cut, and authors deserve remuneration for their efforts, but as the publications age, most would agree the true value lies in the data being easily accessible and usable for future generations. That is the spirit of Open Data. Where many see data as a commercial asset, to be exploited for business gain for the benefit of companies and shareholders, the Open Data ethos is that knowledge should be for everyone. However, it is difficult to pull data off the printed page and computerise it in such a way as to enable the powerful aggregation and analysis our digital age affords us. That's partly what Local Record Centres are for - we do much of the tricky digital dirty work. But digitising biological records into a database is only part of the story - the data also needs to be brought to a wide audience in a stable, sustainable way in order for the benefit to be realised. Doing so presents us with challenges.
Realising the potential of digital biological data
"If you love something, let it go."
In much the same way the Web has enabled general sharing of information and documents, mostly for free and without permission, Open Data will unlock the potential of sharing structured data. Companies such as Google have become adept at "mining" the web, allowing us to find what we're looking for quickly and easily. But even Google, with its amassed billions in advertising revenue, has difficulty mining data from the "deep" and the "dark" web; that is, data locked up in inaccessible and disconnected places. To counter this and to enable the plunging of buckets into these deep, dark, rich wells of knowledge we need to do certain things.
- We must make the data available on the Web.
- We must make it available as structured, computer-readable data (e.g., Excel instead of a PDF or scanned pages).
- We must use non-proprietary formats (e.g., CSV instead of Excel; open source software instead of commercially owned).
- We must make data accessible at stable web addresses (URLs) in the same way most reliable websites are accessible at stable addresses (e.g., you can always access Google by visiting google.com).
- We must allow data to be linked to other data so that records may be efficiently described (e.g., a taxon name should link to a canonical taxonomic database).
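To make the points above a little more concrete, here is a minimal sketch in Python of what publishing records in this way might look like. It writes records as CSV (a non-proprietary, computer-readable format) and gives each record a stable web address and a link to a taxon entry. The field names, the `records_to_csv` helper and both `example.org` addresses are hypothetical, chosen purely for illustration; they are not SxBRC's actual schema or URLs.

```python
import csv
import io

# Hypothetical base addresses, for illustration only.
RECORD_BASE_URL = "https://data.example.org/records/"
TAXON_BASE_URL = "https://taxa.example.org/taxon/"


def records_to_csv(records):
    """Serialise records as CSV text, adding stable record and taxon URLs."""
    buffer = io.StringIO()
    fieldnames = ["record_url", "taxon_name", "taxon_url", "date", "grid_ref"]
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow({
            # A stable address for the record itself.
            "record_url": RECORD_BASE_URL + record["id"],
            "taxon_name": record["taxon_name"],
            # A link from the name to an entry in a taxonomic database.
            "taxon_url": TAXON_BASE_URL + record["taxon_id"],
            "date": record["date"],
            "grid_ref": record["grid_ref"],
        })
    return buffer.getvalue()


# A made-up record, not real SxBRC data.
example_records = [
    {"id": "r0001", "taxon_name": "Erithacus rubecula", "taxon_id": "t42",
     "date": "2013-05-04", "grid_ref": "TQ3478"},
]
csv_text = records_to_csv(example_records)
print(csv_text)
```

Because the output is plain CSV, it can be opened in a spreadsheet or read by almost any programming language without special software.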
These are primarily technical concerns to be untangled by me and my technologist colleagues at the record centre. That's all part of what we do day in, day out. However, in order to tackle this thicket of technical challenge we also require some thinking, soul-searching and decision-making on the part of data-creators; i.e., naturalists and biological recorders. We need your permission. Some decisions to be made are:
- Do I want to share my data and ensure future use for generations to come?
- What data do I want to share and with whom?
- Do I want to restrict access to certain data, and for how long?
The opening of data is not an all-or-nothing decision. We can be nuanced about it. Much, if not most, information contained on Web sites is copyright restricted and yet is still free-to-access. Much content has also been designated as part of the commons. Wikipedia and the StackExchange network are two notable examples. Plus many commons-licensed photos, videos, music and other media are made available via sites such as Flickr, YouTube, Vimeo, SoundCloud, Wikimedia Commons and more. Other sites such as The Guardian allow their content - over 1M articles going back over 10 years in addition to today's continuously updated content - to be used and remixed with permission.
So, we can put information and data online. We can designate some as a part of the cultural commons while retaining control over others. We can open data gradually over time as its sensitivity or commercial benefit wanes. We can control the spatial resolution we make the data available at (10km, 2km, 1km or full resolution). We make these decisions explicit by applying a license to the appropriate data in the appropriate form and at the appropriate time. If we fail to explicitly decide how we wish for our data to be licensed in this way, then it remains locked in its default, copyrighted form for 15 years at the very minimum.
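As an illustration of controlling spatial resolution, the sketch below coarsens a grid reference to 10km, 2km or 1km. It assumes records carry British National Grid references with a two-letter square prefix (for example "TQ 3456 7890") and that 2km squares use the conventional tetrad letters A to Z with "O" omitted; the function name `coarsen_grid_ref` is hypothetical and this is not a description of SxBRC's own software.

```python
# Tetrad letters run south to north, then west to east, within a 10km square.
TETRAD_LETTERS = "ABCDEFGHIJKLMNPQRSTUVWXYZ"  # 25 letters; "O" is omitted


def coarsen_grid_ref(grid_ref, resolution_km):
    """Reduce a grid reference to a 10km, 2km (tetrad) or 1km square."""
    ref = grid_ref.replace(" ", "").upper()
    square, digits = ref[:2], ref[2:]
    if (not square.isalpha() or not digits.isdigit()
            or len(digits) % 2 or len(digits) < 4):
        raise ValueError("expected two letters and an even number (4+) of digits")
    half = len(digits) // 2
    easting, northing = digits[:half], digits[half:]
    if resolution_km == 10:
        return square + easting[0] + northing[0]
    if resolution_km == 2:
        # The second digit of each half gives the 1km position (0-9)
        # inside the 10km square; halving it gives the tetrad column/row.
        index = (int(easting[1]) // 2) * 5 + int(northing[1]) // 2
        return square + easting[0] + northing[0] + TETRAD_LETTERS[index]
    if resolution_km == 1:
        return square + easting[:2] + northing[:2]
    raise ValueError("resolution_km must be 10, 2 or 1")


print(coarsen_grid_ref("TQ 3456 7890", 10))  # TQ37
print(coarsen_grid_ref("TQ 3456 7890", 2))   # TQ37P
print(coarsen_grid_ref("TQ 3456 7890", 1))   # TQ3478
```

The full-resolution reference never needs to leave the database: only the coarsened form is published.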
The Creative Commons
"Creative Commons develops, supports, and stewards legal and technical infrastructure that maximizes digital creativity, sharing, and innovation."
If licensing data[^2] seems like a dry, legal, complicated prospect, fear not: we have help at hand which makes it very, very easy. You may or may not have heard of Creative Commons before. Founded in 2001, they are a non-profit organisation devoted to expanding the range of creative works - and data - available for others to build upon legally and to share. To quote their vision:
"Our vision is nothing less than realizing the full potential of the Internet — universal access to research and education, full participation in culture — to drive a new era of development, growth, and productivity."
They realise this vision primarily through the creation of a set of licences. These legally-rigorous tools are free, easy-to-use, and provide "a simple, standardized way to give the public permission to share and use your creative work — on conditions of your choice." They provide the licensor with a flexible range of options from "all rights reserved" to "public domain". Further, they provide an easy way to select a license that suits you using their License Chooser.
Most critically, whichever one you choose, these licenses are easily understood by both ordinary people browsing the web and computers reading them in an automated fashion. So, as long as we at SxBRC are getting our bit right and making the data available online in the right format, and as long as a Creative Commons license has been applied, the world at large can find the data, easily determine under what conditions the data may be used, and then use it. That's a huge step in the right direction.
Creative Commons: Open Data
A Creative Commons license gives us the tools we need to grant rights to our data in a simple, clear way. But there are only a few varieties that truly make data "open" in the strict sense of the definition. These are, broadly, Creative Commons Attribution and Creative Commons Attribution Share-Alike. These grant rights that allow for anyone to use the data for any purpose, so long as the creator is attributed. In the latter case, "Share-Alike" means that any derivative work based on the data must also be published under the same license and the same terms. Simple.
These liberal open data licenses are what we should be ultimately aiming for. They provide the basis for long-term, stable data publishing. If SxBRC were to become defunct, the licenses provide us with the legal framework necessary to ensure the data endures. It's the digital equivalent of donating a collection to a museum. But while they are to be thought of as the gold standard of our aspirations for opening access to data, it is still perfectly acceptable to start moving towards openness without going all-the-way just yet.
Creative Commons: Some Rights Reserved
A common license type with some rights reserved is Creative Commons Attribution Non-Commercial. This allows for the sharing of data, so long as it is not used for commercial purposes. This license already aligns neatly with the operational code of SxBRC: we charge for most commercial uses of data, while providing services for free to non-commercial customers and clients. Creative Commons offers further, even more restrictive licenses and these are worth investigating too.
Being a resource-strapped record centre, we rely on commercial data requests for just over a quarter of our income. Being non-profit, we plough the proceeds back into running the centre and working with the recording community. The commercial data requests we receive mostly concern a small subset of the data we hold: protected and BAP species, bat and certain bird records, rare species and invasive aliens. Of these it is the contemporary records - anything from the last 5 years, say - that are of interest to commercial enquirers. Beyond those, a small number of records are excluded at the request of the recorder due to sensitivity. Public knowledge of sensitive records could be detrimental to the conservation of said species and therefore we hold back on these, at least initially. That leaves the vast majority of biological records stored in our database largely underused. So, how can we put the data to better use? Here are our initial suggestions:
- Non-sensitive historical data greater than 10 years old should be released with an Open Data conformant attribution, share-alike license on the Web.
- Non-sensitive, non-commercially viable data less than 10 years old should be released with a non-commercial, attribution, share-alike license on the Web.
- Commercially viable data should only be published on the web at 10km or 2km resolution, but should continue to be made fully available on request from SxBRC under our standard terms. We will continue to charge for supply of this data.
- Sensitive data will remain sensitive and permission for use will only be granted on a case-by-case basis.
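The suggestions above amount to a small set of rules, which could be sketched in Python roughly as follows. The `Record` fields and the `release_policy` function are hypothetical names invented for this illustration. Two details are assumptions where the suggestions are silent: sensitivity takes precedence over commercial viability, and a record that is exactly 10 years old is treated as the newer category.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical minimal description of a biological record."""
    age_years: int
    sensitive: bool = False
    commercially_viable: bool = False


def release_policy(record):
    """Return the suggested release terms for a record."""
    # Assumption: sensitivity overrides every other consideration.
    if record.sensitive:
        return "case-by-case permission only"
    if record.commercially_viable:
        return "10km or 2km resolution on the web; full data on request (charged)"
    # Assumption: exactly 10 years old counts as the newer category.
    if record.age_years > 10:
        return "attribution, share-alike (open data)"
    return "non-commercial, attribution, share-alike"


print(release_policy(Record(age_years=25)))
print(release_policy(Record(age_years=3)))
print(release_policy(Record(age_years=3, commercially_viable=True)))
print(release_policy(Record(age_years=40, sensitive=True)))
```

Writing the rules down this explicitly also makes it easy to see which cases still need a decision during consultation.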
In short, older data will be freely available; more contemporary data will be freely available, but only for non-commercial use; while data we derive our all-important income from will be restricted, as will sensitive data. This is also known as a "freemium" model. The only truly new things we're suggesting here are the first two suggestions in the list. We believe making these categories of data more widely available will drive custom and reputation for the record centre while providing an important public good - the Sussex recorders' contribution to the data commons. It will raise our profile and the profile of recording generally, and ease the administrative burden. The data will become more discoverable and more useful. Assigning licenses will also clarify what the data can be used for and how. We're confident that this is a positive step forward and very much in the spirit of the data sharing of times past and present.
We do not want to do this without the blessing of the data creators; that is, recorders. We have data going back hundreds of years with over 8,000 recorders - asking each and every one is an impossible task. So we'd like to propose a two-point plan:
- Allow for a 6 month consultation period for any recorder or group to come to us, discuss any aspect of the plans so that we may choose a suitable license together, OR choose to refrain from Creative Commons licensing altogether, in effect continuing arrangements we have at present.
- After 6 months we shall assume you don't object to the proposals and shall proceed to implement the data licensing suggestions as outlined above or as per the wishes of data owners.
We hope this plan seems reasonable. It should be emphasised: this is a draft plan and is likely to change. The idea is to give anyone the opportunity to opt out while ensuring other data will be opened up, made discoverable and put to good use. We should re-emphasise, it is your data and so you have final say. If you don't want to Creative Commons license it, please do let us know and we will happily exclude it.
We're pleased to say we already have willing and eager agreement from Patrick Roper, who has contributed one of the largest and most diverse datasets we hold. Hopefully many more of you will choose to follow along. As Isaac Newton famously wrote, "If I have seen further it is by standing on the shoulders of giants." Together we can provide the shoulders upon which the next generation will stand as they strive to protect our planet. Join us.
If you have any questions, please do get in touch with me, Charles Roper, at the Sussex Biodiversity Record Centre. My email address is email@example.com
[^1]: "The commons were traditionally defined as the elements of the environment - forests, atmosphere, rivers, fisheries or grazing land - that are shared, used and enjoyed by all. Today, the commons are also understood within a cultural sphere. These commons include literature, music, arts, design, film, video, television, radio, information, software and sites of heritage. The commons can also include public goods such as public space, public education, health and the infrastructure that allows our society to function (such as electricity or water delivery systems). There also exists the ‘life commons’, e.g. the human genome."; http://en.wikipedia.org/wiki/Commons
[^2]: If you have no idea what "licensing" your data means, have a read of the ODI's excellent Publishers' Guide to Open Data Licensing: http://theodi.org/guides/publishers-guide-open-data-licensing
 "This is for everyone" tweet: https://twitter.com/timberners_lee/status/228960085672599552
 The Open Data Institute: http://theodi.org/
 Auer, S. R.; Bizer, C.; Kobilarov, G.; Lehmann, J.; Cyganiak, R.; Ives, Z. (2007). "DBpedia: A Nucleus for a Web of Open Data". The Semantic Web. Lecture Notes in Computer Science 4825. p. 722. doi:10.1007/978-3-540-76298-0_52. ISBN 978-3-540-76297-3.
 UK Copyright Service Factsheet No. P-01; http://www.copyrightservice.co.uk/ukcs/docs/edupack.pdf
 Creative Commons Licenses: http://creativecommons.org/licenses/
 Open Data definition: http://opendefinition.org/
 Freemium: http://en.wikipedia.org/wiki/Freemium