

								<html xmlns="http://www.w3.org/1999/xhtml">

								  <head>

								    <meta name="generator" content=

								    "HTML Tidy for Mac OS X (vers 31 October 2006 - Apple Inc. build 13), see www.w3.org" />

								    <title>

								      Putting Government Data online - Design Issues

								    </title>

								    <link rel="Stylesheet" href="di.css" type="text/css" />

								    <meta http-equiv="Content-Type" content="text/html" />

								  </head>

								  <body bgcolor="#DDFFDD" text="#000000">

								    <address>

								      Tim Berners-Lee<br />

								      Date: 2009-06, last change: $Date: 2009/06/30 15:49:50

								      $<br />

								      Status: personal view only. Editing status: Good enough for

								      folk. Notes after talking with various people in UK and US

								      governments who would like to put data on the web and want to

								      know the next steps.

								    </address>

								    <p>

								      <a href="./">Up to Design Issues</a>

								    </p>

								    <hr />

								    <h1>

								      Putting Government Data online

								    </h1>

								    <h4>

								      Abstract

								    </h4>

								    <p class="abstract">

								      Government data is being put online to increase

								      accountability, contribute valuable information about the

								      world, and to enable government, the country, and the world

								      to function more efficiently. All of these purposes are

								      served by putting the information on the Web as Linked Data.

								      Start with the "low-hanging fruit". Whatever else, the raw

								      data should be made available as soon as possible.

								      Preferably, it should be put up as Linked Data. As a third

								      priority, it should be linked to other sources. As a lower

								      priority, nice user interfaces should be made to it -- if

								      interested communities outside government have not already

								      done it. The Linked Data technology, unlike any other

								      technology, allows any data communication to be composed of

								      many mixed vocabularies. Each vocabulary is from a community,

								      be it international, national, state or local; or specific to

								      an industry sector. This optimizes the usual trade-off

								      between the expense and difficulty of getting wide agreement,

								      and the practicality of working in a smaller community.

								      Effort toward interoperability can be spent where most

								      needed, making the evolution with time smoother and more

								      productive.

								    </p>

								    <h2>

								      Introduction

								    </h2>

								    <p>

								      This, 2009, is the year for putting government data online.

								      Both <a href=

								      "http://www.whitehouse.gov/the_press_office/Transparency_and_Open_Government/">

								      US</a> and <a href=

								      "http://www.cabinetoffice.gov.uk/newsroom/news_releases/2009/090610_web.aspx">

								      UK</a> governments made public commitments toward open data.

								      The <a href=

								      "http://www.ted.com/index.php/talks/tim_berners_lee_on_the_next_web.html">

								      TED talk on Linked Data</a> was in February. Groups from the

								      <a href=

								      "http://www.guardian.co.uk/technology/free-our-data">Guardian</a>

								      to the <a href="http://www.sunlightfoundation.com/">Sunlight

								      Foundation</a> had already been pushing for it for a long

								      time. People like Watchdog.net, mysociety.org, and

								      govtrack.us had been pushing by publishing government data

								      themselves in various formats, including Linked Data.

								    </p>

								    <p>

								      So if you want to do this, what should you do? This article

								      addresses this question very briefly, and makes a set of

								      points which will probably be outdated by later developments,

								      but answer a set of relevant questions, asked or not.

								    </p>

								    <h2>

								      Using Linked Data as the interconnection bus

								    </h2>

								    <p>

								      Government data is put online typically for 3 reasons:

								    </p>

								    <ol>

								      <li>Increasing citizen awareness of government functions to

								      enable greater accountability;

								      </li>

								      <li>Contributing valuable information about the world; and

								      </li>

								      <li>Enabling the government, the country, and the world to

								      function more efficiently.

								      </li>

								    </ol>

								    <p>

								      Each of these purposes is best served by using Linked Data

								      techniques.

								    </p>

								    <p>

								      In general Linked Data is:

								    </p>

								    <p>

								      <strong>Open</strong>: Linked Data is accessible through an

								      unlimited variety of applications because it

								      is expressed in open, non-proprietary formats.

								    </p>

								    <p>

								      <strong>Modular</strong>: Linked Data can be combined

								      (mashed-up) with any other piece of Linked Data. For example,

								      government data on health care expenditures for a given

								      geographical area can be combined with other data about the

								      characteristics of the population of that region in order to

								      assess effectiveness of the government programs. No advance

								      planning is required to integrate these data sources as long

								      as they both use Linked Data standards.

								    </p>
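								    <p>
								      As a rough sketch of why no advance planning is needed: if
								      two sources identify the same region by the same URI,
								      combining them amounts to a union of their statements. The
								      URIs, property names and figures below are invented for
								      illustration, and triples are modelled as plain Python tuples
								      rather than with an RDF library.
								    </p>

```python
# Triples are (subject, predicate, object) tuples. All URIs and numbers
# here are made up for illustration; they are not real government data.
REGION = "http://id.example.gov/region/42"
EXPENDITURE = "http://example.gov/health#expenditure"
POPULATION = "http://example.org/census#population"

health_spending = {(REGION, EXPENDITURE, 1200000)}
population_stats = {(REGION, POPULATION, 48000)}

# Combining two sources is a set union: no schema merge is needed,
# because both already name the region with the same URI.
merged = health_spending | population_stats


def properties_of(graph, subject):
    """Collect every predicate/object pair known about one subject."""
    return {p: o for (s, p, o) in graph if s == subject}


facts = properties_of(merged, REGION)
spend_per_head = facts[EXPENDITURE] / facts[POPULATION]
print(spend_per_head)
```

								    <p>
								      The join works only because both sources share the URI for
								      the region; where they do not, a small set of linking
								      statements would have to be added first.
								    </p>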

								    <p>

								      <strong>Scalable</strong>: It's easy to add more Linked Data

								      to what's already there, even when the terms and definitions

								      that are used change over time.

								    </p>

								    <p>

								      The essential message is that whatever data format people

								      want the data in, and whatever format they give it to you in,

								      you use the RDF model as the interconnection bus. That's

								      because RDF connects better than any other model.

								    </p>

								    <ul>

								      <li>It uses URIs and so allows linking of things and concepts

								      </li>

								      <li>It allows separate systems designed independently to be

								      later joined at the edges

								      </li>

								      <li>It allows interoperability to be added where

								      cost-effective

								      </li>

								      <li>It allows any data to be expressed in a mixture of

								      vocabularies.

								      </li>

								    </ul>

								    <p>

								      That's enough about why it is useful. That is elaborated

								      elsewhere, but it can be difficult for those familiar with

								      other technologies to understand the difference. Sometimes it

								      is better just to do it.

								    </p>

								    <h2>

								      Just do it

								    </h2>

								    <p>

								      The chances are quite high that the data your

								      department/agency runs off will be largely in relational

								      databases, often with a large amount in spreadsheets.

								    </p>

								    <p>

								      There are two philosophies to putting data on the web. The

								      top-down one is to make a corporate or national plan, by

								      getting committees together of all the interested parties,

								      and make a consistent set of terms (<em>ontology</em>) into

								      which everything fits. This in fact takes so long it is often

								      never finished, and anyway does not in fact get corporate or

								      national consensus in the end. The other method, which experience

								      recommends, is to do it bottom-up. A top-level mandate is

								      extremely valuable, but grass-roots action is essential. Put

								      the data up where it is: join it together later.

								    </p>

								    <p>

								      A wise and cautious step is to make a thorough inventory of

								      all the data you have, and figure out which dataset is going

								      to be most cost-effective to put up as linked data. However,

								      the survey may take longer than just doing it. So, take some

								      data.

								    </p>

								    <p>

								      A really important rule when considering which data could be

								      put on the web is not to threaten or disturb the systems and

								      the people who currently are responsible for that data. It

								      often takes years of negotiation to put together a given set

								      of data. The people involved may be very invested in it.

								      There are social as well as technical systems which have been

								      set up. So you leave the existing system undisturbed, and

								      find a way of extracting the data from it using existing

								      export or conversion facilities. You add a thin shim to

								      adapt the existing system to the standard.

								    </p>

								    <p>

								      Ok, so you have some data. What form is it in?

								    </p>

								    <h3>

								      Relational databases

								    </h3>

								    <p>

								      There are (2009) a number of open source tools for putting

								      relational databases up as Linked Data, <em>D2RServer</em>

								      and <em>Triplify</em> being two.

								    </p>

								    <p>

								      These each use a mapping file, in some language, to explain

								      how the database structure actually represents things and the

								      relationships. <sup><a href="#L419">1</a></sup>

								    </p>

								    <p>

								      You probably don't want to run a publicly available server

								      on your existing database unless it is generally set up for

								      high volume use. You might want to take a copy of the whole

								      database, and run a live semantic web server from it, or you

								      can generate the RDF once and make a copy of that to serve.

								    </p>
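								    <p>
								      To give a feel for what such a mapping does, here is a
								      minimal sketch that works on a copy of the data and generates
								      the RDF once. It is not how D2RServer or Triplify work
								      internally; the table, the URI pattern and the terms#town
								      property are assumptions made up for the example, and the
								      output is written as N-Triples.
								    </p>

```python
import sqlite3

# Hypothetical mapping: which table column becomes which RDF property.
BASE = "http://id.example.gov/school/"
COLUMN_TO_PROPERTY = {
    "name": "http://xmlns.com/foaf/0.1/name",
    "town": "http://example.gov/terms#town",
}


def literal(value):
    """Quote a value as an N-Triples string literal."""
    text = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return '"' + text.replace("\n", "\\n") + '"'


def rows_to_ntriples(conn):
    """Turn each row of the 'school' table into one triple per mapped column."""
    conn.row_factory = sqlite3.Row
    lines = []
    for row in conn.execute("SELECT id, name, town FROM school ORDER BY id"):
        subject = "<%s%s>" % (BASE, row["id"])
        for column, prop in COLUMN_TO_PROPERTY.items():
            lines.append("%s <%s> %s ." % (subject, prop, literal(row[column])))
    return lines


# A throwaway in-memory copy, so the production database is never touched.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE school (id INTEGER PRIMARY KEY, name TEXT, town TEXT)")
conn.execute("INSERT INTO school VALUES (1, 'Hill Road Primary', 'Exampleton')")
triples = rows_to_ntriples(conn)
print("\n".join(triples))
```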

								    <h4>

								      Using other people's terms

								    </h4>

								    <p>

								      It is wise and friendly and interoperable, when you publish

								      RDF data, to use terms other people are already sharing. Like

								      foaf:name for the name of a person, or dc:title for the title

								      of something, and so on. Like geo:lat and geo:long for

								      latitude and longitude<sup><a href="#L470">2</a></sup>. There are

								      a number of these, growing of course. The <a href=

								      "http://www.w3.org/2001/sw/interest/">Semantic Web Interest

								      Group</a> is a community which can help you find them: there

								      are also online tools such as <a href=

								      "http://swoogle.umbc.edu/">Swoogle</a>, Sindice, etc.


								    </p>
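								    <p>
								      A small sketch of what reusing shared terms looks like in
								      practice: the description below is written in Turtle and uses
								      only foaf:name, geo:lat and geo:long. The office URI, its
								      name and its coordinates are invented, and the helper assumes
								      the name contains no quote characters.
								    </p>

```python
# Namespace URIs of two widely shared vocabularies.
PREFIXES = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "geo": "http://www.w3.org/2003/01/geo/wgs84_pos#",
}


def describe_office(uri, name, lat, lon):
    """Return a Turtle description that reuses shared terms only."""
    lines = ["@prefix %s: <%s> ." % item for item in sorted(PREFIXES.items())]
    lines.append("")
    lines.append('<%s> foaf:name "%s" ;' % (uri, name))
    lines.append('    geo:lat "%s" ;' % lat)
    lines.append('    geo:long "%s" .' % lon)
    return "\n".join(lines)


# The office URI and coordinates are invented for illustration.
turtle = describe_office("http://id.example.gov/office/7", "Records Office", 51.5, -0.12)
print(turtle)
```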

								    <h3>

								      Spreadsheets

								    </h3>

								    <p>

								      In many organizations a surprising amount of information,

								      sometimes critical information, is emailed around in

								      spreadsheets. Much of the early recovery.gov data was

								      published in spreadsheet form. Some of these are raw tables,

								      with a header in the top row. These are close to raw data.

								      You can export them as a comma-separated (or tab-separated)

								      file, CSV. Others are spreadsheets with a lot of

								      substructure, and little headings and notes all over them for

								      the human user. These are less easy to convert.

								    </p>

								    <p>

								      There are a number of <a href=

								      "http://esw.w3.org/topic/ConverterToRdf">tools</a> for

								      converting the format of a spreadsheet, typically in CSV

								      form, into RDF.

								    </p>
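								    <p>
								      For the simple case of a raw table with a header in the top
								      row, the conversion can be sketched in a few lines. This is
								      an illustration rather than a replacement for the tools
								      linked above: the sample rows, the URI pattern for each row
								      and the vocabulary namespace are all assumptions.
								    </p>

```python
import csv
import io

# Hypothetical export of a spreadsheet with a header in the top row.
CSV_TEXT = "id,town,grant_gbp\n1,Exampleton,25000\n2,Sampleford,40000\n"
BASE = "http://id.example.gov/grant/"       # assumed URI pattern for rows
TERMS = "http://example.gov/terms#"         # assumed vocabulary for columns


def csv_to_ntriples(text):
    """Emit one triple per cell, using the header row to name properties."""
    lines = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = "<%s%s>" % (BASE, row["id"])
        for column, value in row.items():
            if column == "id":
                continue  # the id is already part of the subject URI
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            lines.append('%s <%s%s> "%s" .' % (subject, TERMS, column, escaped))
    return lines


triples = csv_to_ntriples(CSV_TEXT)
print("\n".join(triples))
```

								    <p>
								      Spreadsheets with substructure and scattered notes would need
								      cleaning into this tabular shape first.
								    </p>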

								    <h3>

								      XML

								    </h3>

								    <p>

								      If you have existing data in XML, first, put that XML up on

								      the web while you think. Then, figure out what the XML is

								      about, what things and what relationships. Then, commission

								      or write a program, possibly a simple script, maybe written

								      in XSLT, or your favorite scripting language, to convert each

								      XML file into RDF. You might need to add a file which points

								      to all the things you have data about, if they are not

								      already linked.

								    </p>
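								    <p>
								      Such a script might look like the following sketch, which
								      treats each record element as a thing and each child element
								      as a relationship. The XML layout, element names and URIs are
								      invented, and the values are assumed to contain no quote
								      characters.
								    </p>

```python
import xml.etree.ElementTree as ET

# Hypothetical XML; the element and attribute names are invented.
XML_TEXT = """
<inspections>
  <inspection id="17"><site>Depot A</site><result>pass</result></inspection>
  <inspection id="18"><site>Depot B</site><result>fail</result></inspection>
</inspections>
"""
BASE = "http://id.example.gov/inspection/"   # assumed URI pattern
TERMS = "http://example.gov/terms#"          # assumed vocabulary


def xml_to_ntriples(text):
    """Each <inspection> is a thing; each child element is a relationship."""
    lines = []
    for item in ET.fromstring(text).findall("inspection"):
        subject = "<%s%s>" % (BASE, item.get("id"))
        for child in item:
            lines.append('%s <%s%s> "%s" .' % (subject, TERMS, child.tag, child.text))
    return lines


triples = xml_to_ntriples(XML_TEXT)
print("\n".join(triples))
```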

								    <h3>

								      Random application formats

								    </h3>

								    <p>

								      Ok, so your data is not in any of the above forms. It is in a

								      proprietary format, or managed by a proprietary program. But

								      there is some way you can get at it. So someone will have to

								      write a program somewhere, to get it out, and convert it to

								      one of the Linked Data standard forms.

								    </p>

								    <p>

								      (It is actually fairly simple. First, you think of what

								      things the data is about. You make up URIs for those things.

								      Suppose for example your data is about books and shelves. You

								      decide the URI for the books will be

								      http://id.example.com/id/isbn/123457890 and the URIs for

								      shelves will be like http://id.example.com/id/shelf/746 .

								      Then you write a (CGI) script which, when given a URI

								      like that, extracts the data about the book (including which

								      shelf it is on) and outputs it, or similarly for the shelf

								      (including a list of the books on the shelf). It outputs it

								      in RDF/XML or N3. That script is your web server of virtual

								      linked data.)

								    </p>
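								    <p>
								      The core of the books-and-shelves script can be sketched as a
								      single function. The dictionary stands in for whatever the
								      proprietary system exposes, the onShelf and holds properties
								      are invented names, and the output is N-Triples, which is
								      also valid N3. Wiring the function to a web server is left
								      out.
								    </p>

```python
# Stand-in for the proprietary system: which shelf each book is on.
# The ISBN and shelf number are the ones used in the example URIs above.
BOOK_TO_SHELF = {"123457890": "746"}
BASE = "http://id.example.com/id/"
ON_SHELF = "http://id.example.com/terms#onShelf"    # assumed property
HOLDS = "http://id.example.com/terms#holds"         # assumed property


def describe(uri):
    """Return N-Triples lines about the book or shelf that the URI names."""
    if not uri.startswith(BASE):
        return []
    kind, _, key = uri[len(BASE):].partition("/")
    if kind == "isbn" and key in BOOK_TO_SHELF:
        shelf_uri = BASE + "shelf/" + BOOK_TO_SHELF[key]
        return ["<%s> <%s> <%s> ." % (uri, ON_SHELF, shelf_uri)]
    if kind == "shelf":
        return ["<%s> <%s> <%sisbn/%s> ." % (uri, HOLDS, BASE, isbn)
                for isbn, shelf in sorted(BOOK_TO_SHELF.items())
                if shelf == key]
    return []


book_lines = describe(BASE + "isbn/123457890")
shelf_lines = describe(BASE + "shelf/746")
print("\n".join(book_lines + shelf_lines))
```

								    <p>
								      Because the book's description links to the shelf URI and the
								      shelf's description links back to the book, a client can
								      follow its nose from one to the other.
								    </p>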

								    <h3>

								      Existing Web Site

								    </h3>

								    <p>

								      If you have an existing web site with, maybe, a page about

								      each thing, there is an easy way of putting the data in those

								      pages into Linked Data. You can change the scripts which

								      generate the site so that the data which is behind each page

								      is in fact put into the page so that it can be re-extracted

								      by others as data. The technology to do this is called

								      <a href="http://rdfa.info/">RDFa</a> <sup><a href=

								      "#L451">3</a></sup>. An alternative is for each web page

								      to have a parallel page which has the data in RDF/XML.

								      <sup><a href="#L454">4</a></sup>

								    </p>
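								    <p>
								      As an illustration of the RDFa route, a page-generating
								      script could emit markup along these lines, so that the title
								      and publisher shown to the reader are also extractable as
								      data. The report URI and values are invented, and the choice
								      of dc:title and dc:creator is an assumption about what the
								      page describes.
								    </p>

```python
from html import escape

DC = "http://purl.org/dc/elements/1.1/"


def page_fragment(uri, title, creator):
    """Render a human-readable block that also carries the data as RDFa."""
    return (
        '<div xmlns:dc="%s" about="%s">\n'
        '  <h2 property="dc:title">%s</h2>\n'
        '  <p>Published by <span property="dc:creator">%s</span></p>\n'
        '</div>'
    ) % (DC, escape(uri), escape(title), escape(creator))


# Invented example values; escape() keeps the markup well-formed.
fragment = page_fragment(
    "http://id.example.gov/report/2009-06",
    "Road Safety & Cycling",
    "Department of Transport",
)
print(fragment)
```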

								    <h2>

								      Giving access to data

								    </h2>

								    <p>

								      Ok, so you have your data in RDF as Linked Data. Now what?

								    </p>

								    <h3>

								      Index it

								    </h3>

								    <p>

								      The semantic web toolkit includes the SPARQL query language

								      which allows a client anywhere on the net to query a SPARQL

								      service. Some methods of publishing data, like D2RServer,

								      provide a built-in SPARQL service. If you have generated a

								      bunch of linked data, then there are various products, free

								      or commercial, which will scoop it up into a "triple store"

								      and provide a SPARQL service.

								    </p>

								    <p>

								      A SPARQL service is a generally useful tool for technically

								      aware users. Many clients and analytical tools just use a

								      SPARQL server. A SPARQL server looks for patterns in the data

								      and, for each match, outputs what it found in one of a

								      number of formats, including constructed RDF, XML and, in

								      some cases, JSON, and maybe even CSV.

								    </p>

								    <h3>

								      Generating XML with SPARQL

								    </h3>

								    <p>

								      SPARQL, then, can be used as an RDF to XML converter. You

								      amass a heap of linked data. Then you think of a combination

								      of data, involving connections across different data. There

								      is a SPARQL query for that data with the results expressed in

								      XML. That SPARQL query can be encoded into a long URI, a URI

								      for a virtual XML document for that particular view.

								    </p>
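								    <p>
								      Encoding a query into such a URI is mechanical. In the sketch
								      below the endpoint address and the properties in the query
								      are hypothetical; the query is passed in the "query"
								      parameter, as the SPARQL protocol specifies for GET requests.
								    </p>

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ENDPOINT = "http://data.example.gov/sparql"   # hypothetical SPARQL service

QUERY = """
SELECT ?town ?amount WHERE {
  ?grant <http://example.gov/terms#town> ?town ;
         <http://example.gov/terms#grant_gbp> ?amount .
}
"""


def query_url(endpoint, query):
    """Encode a SPARQL query as the 'query' parameter of a GET URL."""
    return endpoint + "?" + urlencode({"query": query.strip()})


url = query_url(ENDPOINT, QUERY)
print(url)

# Round trip: decoding the URI gives back exactly the query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
```

								    <p>
								      The resulting URI can be bookmarked, linked to, or fetched by
								      any HTTP client, which is what makes it behave like a
								      document for that particular view.
								    </p>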

								    <h3>

								      Generating CSV files and JSON

								    </h3>

								    <p>

								      Some SPARQL servers also support JSON, and in some cases CSV,

								      as an output format. JSON is easy to use in Web applications.

								    </p>
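								    <p>
								      Where a server offers JSON but not CSV, a client can flatten
								      one into the other. The sketch below reads a response in the
								      standard SPARQL JSON results layout (a list of variable names
								      plus a list of bindings); the sample values are invented.
								    </p>

```python
import csv
import io
import json

# A response in the SPARQL Query Results JSON format; the values are invented.
RESPONSE = """
{"head": {"vars": ["town", "amount"]},
 "results": {"bindings": [
   {"town": {"type": "literal", "value": "Exampleton"},
    "amount": {"type": "literal", "value": "25000"}},
   {"town": {"type": "literal", "value": "Sampleford"}}
 ]}}
"""


def results_to_csv(text):
    """Flatten SPARQL JSON results into CSV, leaving unbound variables blank."""
    data = json.loads(text)
    names = data["head"]["vars"]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(names)
    for binding in data["results"]["bindings"]:
        writer.writerow([binding.get(n, {}).get("value", "") for n in names])
    return out.getvalue()


table = results_to_csv(RESPONSE)
print(table)
```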

								    <h3>

								      Generating nice web pages

								    </h3>

								    <p>

								      The priority first is to get raw data onto the net, and

								      preferably converted into Linked Data form. This is partly

								      because there may be other sites, commercial or not, who pick

								      it up and make great interfaces to that data. Of course there

								      are times when the government site must provide an easy human

								      interface for ordinary users to access the data.

								    </p>

								    <p>

								      There are many routes to pretty HTML for real users. Tools

								      like Exhibit provide facetted browser views, given a

								      configuration set up by the web master, for example.

								    </p>

								    <p>

								      Webmasters can run scripts in languages (not standardized

								      yet) like XSPARQL or N3 rules, or write custom code in their

								      favorite programming language such as PHP, Python, Ruby, or

								      server-side Javascript.

								    </p>

								    <p>

								      Note, though, that there are two ways in which a department or

								      agency web site can never be expected to compete with

								      external sites. One is because there are as yet no user

								      interface techniques which allow a normal user to create

								      their own query, (though tools like Tabulator are getting

								      close).

								    </p>

								    <p>

								      The second is that an external site will add value to the

								      data by joining it to other data from different sites for a

								      particular purpose. If the Department of Transport publishes

								      road accident data, a cycling site selects the cycle accident

								      subset, and can publish it as a map adding cycle routes and

								      hills, and cycle shops. An agency publishes data about the

								      amount of money given to different towns, another maps it

								      against the per capita income levels in those towns. And so

								      on in uncountable permutations.

								    </p>

								    <p>

								      An informal random sample of some public feedback suggests

								      that there are users who would prefer each of these formats

								      above, so a system which generates them automatically is

								      clearly called for.

								    </p>

								    <h2>

								      Metadata

								    </h2>

								    <p>

								      When you write or generate a small RDF file for each dataset

								      exported, the results can be harvested as more useful linked

								      data to form a catalog. Like the data, this can be

								      distributed as linked data, and also sucked into a

								      repository to be indexed and SPARQLed. Remember that, as with

								      the data, RDF allows you to mix vocabularies, so you can

								      record everything you or others may feel is important about

								      the datasets. This provenance information is very valuable.

								      It is clearly one of the many areas this note touches on, about

								      which much more could be said.

								    </p>

								    <p>

								      Nor does this note really address licensing issues. In the US,

								      government data is generally in the Public Domain. It is good

								      to put the fact that a given resource has a given license in

								      a machine-readable way. The Creative Commons cc:license term

								      is appropriate. Creative Commons has also produced a "CC0"

								      waiver which disclaims all rights appropriately (and where

								      possible) for each country.

								    </p>
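								    <p>
								      A small metadata file of this kind, including the license in
								      machine-readable form, might be generated as sketched here.
								      The dataset URI, title and source are invented; using
								      dc:source to record where the data came from is one possible
								      choice among several, and the title is assumed to contain no
								      quote characters.
								    </p>

```python
CC0 = "http://creativecommons.org/publicdomain/zero/1.0/"


def dataset_metadata(dataset_uri, title, source, license_uri=CC0):
    """Return a small Turtle file describing one exported dataset."""
    return "\n".join([
        "@prefix dc: <http://purl.org/dc/elements/1.1/> .",
        "@prefix cc: <http://creativecommons.org/ns#> .",
        "",
        '<%s> dc:title "%s" ;' % (dataset_uri, title),
        "    dc:source <%s> ;" % source,
        "    cc:license <%s> ." % license_uri,
    ])


# Invented dataset and source URIs, for illustration only.
meta = dataset_metadata(
    "http://data.example.gov/dataset/road-accidents-2008",
    "Road accidents 2008",
    "http://www.example.gov/transport/accidents.xls",
)
print(meta)
```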

								    <h2>

								      Privacy

								    </h2>

								    <p>

								      A very common and important concern is the privacy of data

								      which contains personally identifiable information. This

								      article does not suggest that all data should be made public,

								      nor does it discuss issues with anonymisation of data.

								      Systems where PII is an issue will probably not be an early

								      choice when selecting those to put on the web. However, in

								      cases in which these issues have already been resolved and

								      the data is already public but not in the standard form,

								      converting it to Linked Data is an excellent idea. In

								      general, new government systems should be built to be aware

								      of the provenance of the data they use, and of the

								      appropriate use to which it may be put. But the design of

								      these <a href=

								      "http://dig.csail.mit.edu/2008/06/info-accountability-cacm-weitzner.pdf">

								      accountable systems</a> is another topic we do not have space

								      for here.

								    </p>

								    <h2>

								      Conclusion

								    </h2>

								    <p>

								      This brief note is too short to go into great detail, and has

								      ignored many important topics. It has stressed the practical

								      technical steps. Deeper information, about techniques and

								      also about the social issues and challenges, is being

								      produced frequently elsewhere. Many cities have Semantic Web

								      gatherings or <a href="http://semweb.meetup.com/">meetup

								      groups</a>, which can be a source of mutual support for those

								      involved in or interested in the technology. The W3C eGov

								      Interest Group is an international group of people sharing

								      challenges and solutions.

								    </p>

								    <hr />

								    <h4>

								      Footnote: Do's and Don'ts

								    </h4>

								    <ul>

								      <li>Do pick URIs which are likely to be <a href=

								      "../Provider/Style/URI">persistent</a>

								      </li>

								      <li>Do put RDF metadata giving the license.

								      </li>

								      <li>Do use the RDF and SPARQL standards

								      </li>

								      <li>Make sure your human readable pages are <a href=

								      "http://www.w3.org/WAI">accessible</a>.

								      </li>

								    </ul>

								    <ul>

								      <li>Do NOT hide data files inside zip files unless they are

								      also available directly.

								      </li>

								      <li>Do NOT put data up in proprietary formats.

								      </li>

								      <li>Do NOT wait until you have a complete schema or ontology

								      to publish data.

								      </li>

								      <li>Do NOT seek to replace existing data systems.

								      </li>

								    </ul>

								    <p>

								      <a name="L419" id="L419">[1]</a> D2RServer will generate a

								      default mapping file, which will not make a very good RDF

								      graph. Browsing the resulting RDF with an RDF browser (such

								      as Tabulator) will however often show up the deficiencies and

								      suggest improvements.

								    </p>

								    <p>

								      <a name="L470" id="L470">[2]</a> WGS84 latitude and

								      longitude, like you get from a normal GPS unit. (<a href=

								      "http://www.w3.org/2003/01/geo/">more</a>)

								    </p>

								    <p>

								      <a name="L451" id="L451">[3]</a> RDFa is used, for example,

								      in the UK <a href=

								      "http://www.civilservice.gov.uk/jobs/index.aspx">Civil

								      Service Jobs</a> web site. (<a href=

								      "http://www.civilservice.gov.uk/jobs/careers-detail.aspx?JobId=4730">example</a>)

								    </p>

								    <p>

								      <a name="L454" id="L454">[4]</a> Separate RDF/XML web pages

								      are used, for example, in the <a href=

								      "http://www.bbc.co.uk/programmes">BBC programmes</a> data.

								      Here content negotiation gives RDF/XML to data clients, and

								      HTML to document browsers. (<a href=

								      "http://www.bbc.co.uk/programmes/genres/comedy#genre">example</a>)

								    </p>

								    <h2>

								      References and Resources

								    </h2>

								    <ul>

								      <li>

								        <a href=

								        "http://www.thenationaldialogue.org/ideas/linked-open-data">

								        Linked Open Data</a>, in "The National Dialogue" about US

								        recovery transparency.

								      </li>

								      <li>

								        <a href=

								        "http://ShowUsABetterWay.com/">ShowUsABetterWay.com</a>

								        (UK)

								      </li>

								      <li>

								        <a href=

								        "http://www.showusabetterway.co.uk/call/data.html">Example

								        UK Data available for reuse</a>

								      </li>

								      <li>

								        <a href=

								        "http://TheNationalDialog.org/">TheNationalDialog</a>.org

								        (US)

								      </li>

								      <li>

								        <a href="http://www.whitehouse.gov/open/">Open Government

								        Initiative</a> (US)

								      </li>

								      <li>

								        <a href=

								        "http://www.cabinetoffice.gov.uk/reports/power_of_information.aspx">

								        The Power of Information Taskforce Report</a> (UK Gov) one

								        of whose recommendations is linked government data

								      </li>

								      <li>

								        <a href="http://www.w3.org/2007/eGov/">eGovernment at

								        W3C</a>

								      </li>

								      <li>

								        <a href="http://www.w3.org/2007/eGov/IG/">W3C eGovernment

								        Interest Group</a>

								      </li>

								      <li>

								        <a href="http://www.w3.org/TR/egov-improving/">Improving

								        Access to Government through Better Use of the Web</a>, W3C

								        eGov IG

								      </li>

								      <li>

								        <a href=

								        "http://www.whitehouse.gov/the_press_office/Transparency_and_Open_Government/">

								        Transparency and Open Government</a>, Memorandum for the

								        Heads of Executive Departments and Agencies, Barack Obama,

								        2009-01-21

								      </li>

								      <li>

								        <a href="http://eprints.ecs.soton.ac.uk/14429/">Paper on

								        the lessons from the UK AKTivePSI project</a>

								      </li>

								      <li>

								        <a href="http://esw.w3.org/topic/SemanticWebTools">Semantic

								        Web Development Tools</a>, eSW Wiki.

								      </li>

								      <li>

								        <a href="http://esw.w3.org/topic/ConverterToRdf">Tools to

								        convert data into RDF</a>, in eSW Wiki. Don't just look in

								        the wiki for things -- add things you have found!

								      </li>

								      <li>

								        <a href="http://rdfa.info/">RDFA.info</a> a resource about

								        RDFa. Ben Adida.

								      </li>

								    </ul>

								    <h4>

								      Acknowledgements

								    </h4>

								    <p>

								      <small>Thanks for input to this article from Nigel Shadbolt

								      and Danny Weitzner. Thanks also to the chairs (John Sheridan

								      and Kevin Novak) and members of the W3C eGov interest group,

								      and all those in UK and US governments with whom we have

								      discussed these issues at these early stages.</small>

								    </p>

								    <hr />

								    <p>

								      <a href="Overview.html">Up to Design Issues</a>

								    </p>

								    <p>

								      <a href="../People/Berners-Lee">Tim BL</a>

								    </p>

								  </body>

								</html>