@NTerpo
Last active April 9, 2016 14:33
{
  "language": "en",
  "name": "Paris Data",
  "description": "City of Paris Open Data portal",
  "url": "http://opendata.paris.fr/",
  "linked_portals": ["http://data.gouv.fr", "http://data.iledefrance.fr"],
  "data_language": ["fr"],
  "modified": "2016-03-04T13:44:44+00:00",
  "themes": [
    "Culture, Heritage",
    "Education, Training, Research, Teaching",
    "Environment",
    "Transport, Movements",
    "Spatial Planning, Town Planning, Buildings, Equipment, Housing",
    "Health",
    "Economy, Business, SME, Economic development, Employment",
    "Services, Social",
    "Administration, Government, Public finances, Citizenship",
    "Justice, Safety, Police, Crime",
    "Sports, Leisure",
    "Accommodation, Hospitality Industry"
  ],
  "links": [
    {"url": "http://opendata.paris.fr/explore/download/", "rel": "Catalog CSV"},
    {"url": "http://opendata.paris.fr/api/", "rel": "API v1"},
    {"url": "http://opendata.paris.fr/api/datasets/1.0/search?format=rdf", "rel": "Catalog RDF"}
  ],
  "version": "1.0",
  "number_of_datasets": 176,
  "organization_in_charge_of_the_portal": {
    "name": "City of Paris",
    "url": "http://www.paris.fr/"
  },
  "spatial": {
    "country": "FR",
    "coordinates": [
      48.8567,
      2.3508
    ],
    "locality": "Paris",
    "data_spatial_coverage": "a Geojson with the data coverage"
  },
  "type": "Local Government Open Data Portal",
  "datapackages": [
    "http://opendata.paris.fr/explore/dataset/liste_des_sites_des_hotspots_paris_wifi/datapackage.json",
    "http://opendata.paris.fr/explore/dataset/points-de-vote-du-budget-participatif/datapackage.json",
    "http://opendata.paris.fr/explore/dataset/cinemas-a-paris/datapackage.json"
  ]
}
@jpmckinney

Just a quick general note: I think dataportal.json can be implemented alongside DCAT (which ProjectOpenData is based on); it is complementary. I don't think publishers need to choose one or the other. They fulfill different use cases.

In terms of the schema, is this the only available documentation? Anyhow, here are some recommendations to bring it more in line with other international standards:

  • lang:
    • What is the definition/semantics of this field? Is it describing the language of the values in the dataportal.json file? In DCAT, a Catalog's language is "the language used in the textual metadata describing titles, descriptions, etc.".
    • Rename to language. There's no reason for abbreviations. language also aligns with Dublin Core's language term, also used by DCAT. Dublin Core is the most used vocabulary.
    • Are values restricted to ISO 639-1 codes or are BCP 47 codes allowed?
  • data_lang: For consistency with datasets_count and DCAT, this should be named dataset_language. At minimum, it should be renamed to data_language to avoid an unnecessary abbreviation.
  • linked_portals: What is the definition/semantics of this field? Some portals link to dozens of other portals, just as a convenience to users; the portals may not even be related. I don't think it's useful to replicate such a list here.
  • last_update: Rename to modified to match Dublin Core's modified term, also used by DCAT.
  • catalog: I'm not sure of the semantics of this field, considering it has a format property. Is this meant to be a URL to download all the catalog's metadata? That's not implied by the word catalog. It should be renamed to something else, or moved into the links field that I propose below.
  • version: Is this the version of the dataportal.json schema being used, or the version of the file?
  • datasets_count: While Rails and other frameworks would use this term automatically, "[plural] count" is not grammatical. So, it should either be renamed dataset_count or number_of_datasets. number_of_datasets is unambiguously clear.
  • organization: What is the definition/semantics of this field? If it's the publisher, then it would be clearer to name this field publisher, which matches Dublin Core's publisher term, also used by DCAT.
  • address: Portals don't have addresses. You don't visit the physical address, or send mail to the address, etc. Rename to spatial to match Dublin Core's spatial term, also used by DCAT.
    • country: To avoid having country names in multiple languages across different files, which is not useful, the value should be restricted to ISO 3166-1 alpha-2 codes. If so, the field should be renamed to country_code, used by the GeoNames ontology, to distinguish it from the country name.
    • location: Since these are coordinates, rename to coordinates, used by the GeoJSON Point object.
    • city: Rename to locality, as used by vCard, which is reused by Schema.org and others. Not everything is a city, which is why so many standards use "locality".
  • type: Unless this field's value is restricted to a controlled vocabulary, it is somewhat useless.
  • api_url: To allow for additional related URL fields to be added, I propose changing to a links field, which has an array of objects as its value, where each object has url and rel fields. rel is used by loads of hyper-media APIs.
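The renames proposed in the list above are mechanical, so they could be applied with a small script. What follows is only a minimal Python sketch: `migrate_portal` and both rename tables are hypothetical names, `data_lang` is mapped to the minimal suggestion `data_language`, the `rel` value `"API"` is an assumption, and fields whose semantics are still in question (`catalog`, `organization`, `type`, `linked_portals`) are passed through unchanged.

```python
# Hypothetical helper: applies the field renames proposed in the list above.
TOP_LEVEL_RENAMES = {
    "lang": "language",
    "data_lang": "data_language",  # minimal suggestion; dataset_language was preferred
    "last_update": "modified",
    "datasets_count": "number_of_datasets",
    "address": "spatial",
}
SPATIAL_RENAMES = {
    "country": "country_code",
    "location": "coordinates",
    "city": "locality",
}


def migrate_portal(old: dict) -> dict:
    """Return a copy of an old-style dataportal.json object with renamed keys."""
    new = {TOP_LEVEL_RENAMES.get(key, key): value for key, value in old.items()}

    # Rename the nested keys of the former "address" object.
    if isinstance(new.get("spatial"), dict):
        new["spatial"] = {
            SPATIAL_RENAMES.get(key, key): value
            for key, value in new["spatial"].items()
        }

    # Fold api_url into a links array of {url, rel} objects.
    api_url = new.pop("api_url", None)
    if api_url is not None:
        links = list(new.get("links", []))
        links.append({"url": api_url, "rel": "API"})  # rel value is an assumption
        new["links"] = links
    return new
```

Unknown keys are kept as they are, so the function can be run on files that have already been partly migrated.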

@jpmckinney

GitHub does not send notifications for updates to gists, so please send me a note at the email in my profile if you reply.

@technickle

Agree with @jpmckinney's comments. Also, the location field shouldn't be a simple point. It would be more useful for it to be a GeoJSON MultiPolygon object that defines the boundaries of the jurisdiction operating the portal.

Additionally, it would be helpful to have a similar MultiPolygon object that defines the boundaries within which all the spatial data on the portal falls.

@jpmckinney

+1 GeoJSON MultiPolygon object

@rebeccawilliams

I've added this as a Project Open Data Issue as well: project-open-data/project-open-data.github.io#554; your input on data.json is always welcome!

@philipashlock

I disagree with @jpmckinney in that I think this is almost entirely redundant with the Catalog class of DCAT. Some of the fields are certainly complementary, but I would just add them as extensions to the Catalog class rather than recreate it. The Project Open Data data.json schema currently only uses one field in the DCAT Catalog class, but extends the Dataset and Distribution classes with a few additional fields. I've included an object model diagram of the Project Open Data data.json schema with a more complete listing of the DCAT Catalog class fields below and described the duplicated dataportal.json fields here:

  • lang: This is already provided by the language field in the DCAT Catalog class. See documentation
  • name: This is already provided by the title field in the DCAT Catalog class. See documentation
  • description: This is already provided by the description field in the DCAT Catalog class. See documentation
  • url: This is already provided by the homepage field in the DCAT Catalog class. See documentation
  • linked_portals: I'd echo James' comments here
  • data_lang: This is already provided by the language field in the DCAT Dataset class. It doesn't really make sense to include in the Catalog class. See documentation
  • last_update: This is already provided by the modified field in the DCAT Catalog class. See documentation
  • themes: This is already provided by the theme field in the DCAT Catalog class. See documentation
  • catalog: This is already provided by the dataset field in the DCAT Catalog class. With data.json it's embedded directly, but it could just as easily be referenced by a URL. See documentation
  • version: I'm not quite sure what this is, but it might be redundant with the modified field.
  • datasets_count: This should be automatically derived from what's listed under dataset but seems useful. I'd echo James' comments tho
  • organization: This is already provided by the publisher field in the DCAT Catalog class. See documentation
  • address: This is already provided by the spatial field in the DCAT Catalog class. See documentation
  • type: I'm not sure what this is
  • api_url: James' suggestion seems reasonable here, but in some cases this could also be redundant with the dataset field/URL. An API endpoint URL is not very helpful without documentation either, so you might consider whether this is pointing directly to an API endpoint or to documentation about the API. With Project Open Data we require each agency to include its catalog as a dataset in its catalog, which also allows it to list an API endpoint and documentation, so that's another way to approach this. Currently, though, there's no common way to identify which dataset listing is the catalog itself.
  • datapackages: In DCAT this would be provided using the Distribution class within the Dataset class. With Project Open Data we extended the Distribution class with a few additional fields, including one to reference schemas (describedBy) and one for schema types (describedByType) that define a dataset. So for a tabular data package, you'd point to the CSV for the Distribution's downloadURL, you'd point to the datapackage.json schema for the describedBy URL, and you'd specify the describedByType as application/vnd.datapackage+json, as discussed here
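To make the datapackages point concrete, here is a minimal Python sketch of such a Distribution entry. Only the three field names and the media type come from the description above; `tabular_distribution` is a hypothetical helper, the CSV URL is a made-up placeholder, and a real data.json entry would carry more fields than shown.

```python
import json


def tabular_distribution(csv_url: str, datapackage_url: str) -> dict:
    """Build a Distribution entry for a tabular data package (illustrative only)."""
    return {
        "downloadURL": csv_url,          # the CSV itself
        "describedBy": datapackage_url,  # the datapackage.json describing it
        "describedByType": "application/vnd.datapackage+json",
    }


distribution = tabular_distribution(
    "http://example.org/cinemas-a-paris.csv",  # hypothetical CSV location
    "http://opendata.paris.fr/explore/dataset/cinemas-a-paris/datapackage.json",
)
print(json.dumps(distribution, indent=2))
```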

The diagram below is the Project Open Data schema with more of the catalog class fields from DCAT. Here's the full DCAT object model diagram and here's the currently documented Project Open Data version.

[Diagram: Project Open Data schema with additional DCAT Catalog class fields]

@NTerpo
Author

NTerpo commented Mar 25, 2016

Hi, thanks everyone for your comments :)

I've responded to some concerns on project-open-data/project-open-data.github.io#554. You're not wrong in saying that many of the fields may be redundant with DCAT/data.json. The point is either to add new fields to existing work (and to see how to make it more widely used outside of the US), or to experiment around dataportal.json. Either way is totally fine for us: the only thing we care about is getting the world's open data portals to massively share basic information in the same place :D (which is not the case yet).

I've made a lot of changes to the dataportal.json example following your suggestions:

  • lang => renamed language. It does indeed describe the language of the values in the dataportal.json file. I have no opinion about restricting it to ISO 639-1 codes; what do you think?
  • data_lang => renamed data_language
  • linked_portals: An array of URLs of other data portals. Do you have examples of the portals linking to the other portals you talk about? That's totally something we should encourage IMO. And it's not something I've seen in a lot of portals. It's also a way to tell bots where to look.
  • last_update=> Renamed to modified
  • catalog: In my mind it may be a way to link to all the catalog descriptions implemented by the portal: at OpenDataSoft we have a DCAT RDF catalog and a CSV catalog => I changed it to a links field that includes the API.
  • version: version of the dataportal.json schema being used.
  • datasets_count => renamed number_of_datasets
  • organization: the publisher field is quite limited to the name: I'm totally willing to rename it, but I find it useful to let "publishers" easily describe themselves in more depth with URLs, contact information or anything else.
  • address => renamed spatial.
    • country: I agree on ISO 3166-1 alpha-2 codes => renamed to country_code.
    • location => renamed to coordinates.
    • city => renamed to locality.
    • data_spatial_coverage: GeoJSON.
      About the GeoJSON: there are two relevant things about the spatial field: the location of the people in charge of the portal, and the spatial coverage of the data. I'd keep point coordinates for the people in charge and add a GeoJSON field for the data coverage.
  • type: Yeah we had a controlled vocabulary in mind.

Once again thank you everyone. I think there are two paradoxical things:

  1. What to do to have most of the portals use the same file, in the same place, with the same standardized basic info.
  2. How to not be redundant and how not to make everybody re-work on the same issues all the time.

And we are honestly willing to discuss every option, and are totally willing to abandon dataportal.json if we are the only ones to find it useful. Remember, we want there to be something at the portal level so that when I or a non-technical person search for certain data on Google or anywhere else, we get more relevant results, more easily. But the most important thing is that it's neither useful nor complete if only a few of the portals implement it.

@jpmckinney

Re: linked_portals Canada has this page, but including this field in the schema is just like how back in the early 2000s we would have blogs with "blog rolls" in a sidebar with favorite blogs, or Geocities websites with links to unrelated websites including links to search engines. People added those links because, back in those days, those websites were not easily discoverable. But that's no longer the case.

Data catalogs may have discoverability problems today, but the long-term solution isn't to replicate blog rolls. It's for those websites to get better SEO over time, for data to be better federated across portals (using things like data.json and DCAT), etc.

I propose dropping linked_portals from the schema.

@jpmckinney

I created a fork which you can diff. Some fields are missing as we would need to find a way to add them back in. Main changes:

  • name -> title
  • url -> homepage (also inside organization_in_charge_of_the_portal)
  • version -> conformsTo (and change the value to the eventual URL of the schema)
  • organization_in_charge_of_the_portal -> publisher

Other feedback:

  • language should use BCP 47, which is a superset of ISO 639.
  • publisher is not limited to a name. See documentation. It can be a foaf:Agent, which can have lots of properties.
  • data_language: I tend to agree with Phil. If people want to filter to only data in a language they understand, they need to filter at the dataset level, not the catalog level. Most data is nonlinguistic (geospatial, CSVs of numbers, etc.). Filtering at the catalog level would eliminate a lot of potentially relevant datasets. I recommend removing this field.
  • spatial is not much specified by DCAT. The EU's DCAT-AP recommends using its Core Location Vocabulary. data.json specifies possible values for a dataset's spatial field. Canada's dataset's spatial field allows GeoJSON. I'd like to see an example of a GML Simple Features Profile (one of the options in data.json) before deciding between GeoJSON or GML.
  • type Until a controlled vocabulary is defined, I don't recommend adding the field.
  • links: I recommend either following Phil's suggestion, or moving this out of this file, and instead recommending the use of discovery standards like Web Host Metadata, which has a JSON representation.
  • datapackages: I didn't really understand what this field was doing the first time, but I think it should be eliminated in favor of just using dataset from DCAT, in the way that Phil describes. That said, if the purpose of dataportal.json is to just describe the catalog, not the datasets, then datapackages should be eliminated entirely.
  • themes: While DCAT does have themeTaxonomy at the catalog level, its representation won't be as simple as an array of strings. It would be like:
"@context": {
  "containsConcept": { "@reverse": "http://www.w3.org/2004/02/skos/core#inScheme" }
},
},
"themeTaxonomy": {
  "containsConcept": [
    {"prefLabel": "Environment"},
    {"prefLabel": "Health"}
  ]
}
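
Going from the flat themes array to that shape is a small transformation. A minimal Python sketch, assuming the JSON-LD structure shown in the fragment above is the target; `themes_to_taxonomy` is a hypothetical name, and the output is only the fragment, not a complete JSON-LD document.

```python
from typing import List


def themes_to_taxonomy(themes: List[str]) -> dict:
    """Wrap a flat list of theme labels in the themeTaxonomy structure above."""
    return {
        "@context": {
            # containsConcept is declared as the reverse of skos:inScheme.
            "containsConcept": {
                "@reverse": "http://www.w3.org/2004/02/skos/core#inScheme"
            }
        },
        "themeTaxonomy": {
            "containsConcept": [{"prefLabel": label} for label in themes]
        },
    }


taxonomy = themes_to_taxonomy(["Environment", "Health"])
```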

@NTerpo
Author

NTerpo commented Mar 28, 2016

Cool for the fork :)

  • language : fine for BCP 47
  • publisher: yeah, I know there is the FOAF possibility. But going through DCAT is already a huge step for most of the portal owners we meet; I'm pretty sure most stop before getting to FOAF. I agree that's annoying, but as open data grows, there will be more and more people in charge of open data who know nothing about standards or the links between ontologies. That means we have to do the job for them. We can extract the most useful FOAF fields and make them mandatory.
  • data_language: I agree about the filter, but that is not the use case I imagine. I may want to create a world map of Open Data portals and add a language facet. Or maybe compare the global trends in Open Data between Spanish-speaking and English-speaking countries. For most portals it's kind of easy to say "there are data in both French and English". Filtering and searching for datasets should be handled by the portal and have nothing to do with dataportal.json.
  • type defining a controlled vocabulary should not be too complicated (or I'm still a bit naive here haha). What about :
    • Multi-national Open Data Portal
    • Country Open Data Portal
    • Agency Open Data Portal
    • Local Government Open Data Portal
    • Business Open Data Portal
    • Non-Profit Open Data Portal
    • Individual Open Data Portal
  • I don't really understand what the problem is with the links field. dataportal.json or data.json should describe the portal and be useful. The links field should be like a "what's next". You obviously need some documentation to use an API, but if I want to compare Open Data portals around the world, I want to know if there is an API available (and obviously link to it), and I want to know if there is a Turtle file describing the catalog. The real goal of dataportal.json is to allow this kind of comparison and the development of new services that help people find data.
  • datapackages: yeah, it may be removed => it's redundant with data.json. Linking to the datasets' metadata is important. We liked datapackages because it's really light, but DCAT metadata are totally fine. Linking to the datasets' metadata (or to data.json) can be done in the links field.
  • I don't really understand why the themeTaxonomy representation is better. It may be more grammatically correct, but it feels more "obscure" for a beginner.
  • linked_portals indeed looks like "blog rolls", but "blog rolls" became obsolete once there was sufficient content, users, and data about it to replace them. Until it is economically viable to develop a really user-friendly Google for data, "blog rolls" look like a cheap and useful way to do it. For now we still rely on SEO when looking for data, but SEO optimizes for content, not really for data, and I'm not sure we want Google to be the gateway to open data. The linked_portals field is clearly a short-term solution, but it's a solution. Also, I do agree that federated data is a really nice solution: let's have a way to give the information directly in the portal metadata (not only at the dataset level) :)
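
The comparison use case for links described above can be sketched in a few lines of Python. This is only an illustration: `has_api` and `link_rels` are hypothetical names, and the rule that an API link carries the word "API" in its rel value is an assumption modelled on the "API v1" entry in the Paris example, not something the format defines.

```python
def link_rels(portal: dict) -> list:
    """Return the rel values declared in a dataportal.json object."""
    return [link.get("rel", "") for link in portal.get("links", [])]


def has_api(portal: dict) -> bool:
    """Guess whether a portal advertises an API.

    Assumption: an API link has "API" as a separate word in its rel value.
    """
    return any("api" in rel.lower().split() for rel in link_rels(portal))


paris = {
    "links": [
        {"url": "http://opendata.paris.fr/explore/download/", "rel": "Catalog CSV"},
        {"url": "http://opendata.paris.fr/api/", "rel": "API v1"},
    ]
}
csv_only = {"links": [{"url": "http://example.org/catalog.csv", "rel": "Catalog CSV"}]}

portals_with_api = [p for p in (paris, csv_only) if has_api(p)]
```

Because rel is free text, this kind of matching only works if portals converge on similar wording.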

Thanks once again for the comments.

@jpmckinney

  • publisher: Implementers of data.json (and of dataportal.json) don't need to know the RDF ontologies on which they are based. They just need to know the JSON Schema, the definition of the JSON fields, which fields are required, etc. So, using FOAF behind the scenes doesn't add any burden to adopters, because they don't need to know that FOAF is being used.
  • data_language: For the use cases you describe, which of those can't be handled by the other language field?
  • type: The types here seem to all be describing the publisher. If that's all we want, then we should add a classification field to the publisher (from W3C Organization Ontology). I think it would be fair to leave the code list open for consultation longer than the schema, as there is often more disagreement around what values to include in a code list.
  • datapackages: Yes, I agree with removal. It's strange to set up a two-tier system where some datasets would appear under data.json's datasets, and others would be promoted to a special datapackages field - especially considering the low adoption of Data Packages.
  • themeTaxonomy: It's not better, I was just giving an example of how it would need to be done for this file to be expressible as JSON-LD, and which is how data.json is likely to do it, since they are already compliant with JSON-LD.
  • linked_portals: Standards are designed as long-term solutions, not short-term solutions. I think there are alternatives you can pursue to fulfill your use case here. I don't think it's appropriate to put this into a proposed standard.
  • links: I think there is a fairly significant risk that this field will become useless across catalogs, and will just become a field with a wide range of links that publishers consider relevant to share. A standard is only useful if the data from different publishers is comparable. A highly-varied list of links would not be useful...

@NTerpo
Author

NTerpo commented Apr 4, 2016

  • publisher: I totally agree with you, we should use FOAF behind the scenes.
  • data_language: the other language field only describes the dataportal.json document, whereas data_language aims to describe the data themselves.
  • type : yes I also believe there may be a lot of discussion about what to include in the list. But that's the whole point of our approach : let usage decide and give time to experimentation.
  • themeTaxonomy : ok :)
  • linked_portals, links: I do understand your point, but I don't exactly want to design a long-term standard. I would prefer a set of common practices: something that's really easy for people and organizations to implement right now, that can give concrete returns right away, and that will be easy (because of the easy implementation) to abandon the day Linked Data is mainstream, or at least the day every issue has a better solution. Both fields may become irrelevant, but at least we will have an idea of what publishers want to link to in real life. If the whole document is pushed as a set of common practices and not as a standard, then when publishers implement it (if they want to, which is not certain haha) they will make a trade-off between how they think they will optimize the diffusion of their data and how other data portals are dealing with that document. For now we have consistent and compliant standards, but we don't have a lot of portals describing their catalogs. We have to understand why, and how we can design something they will really use.

@jpmckinney

The best way to understand why a standard isn't being adopted is to ask the potential adopters (with an unbiased questionnaire, methodology, etc.). That said, I don't think the problem is that DCAT is too hard. I think it's that:

  1. Publishers don't know what standards to adopt. When talking to publishers, this is really the most common reason in my experience.
  2. Publishers don't know how to interpret the standard's documentation. The solution to this is to provide good documentation written for implementers for existing standards. The W3C documentation for DCAT is written for RDF experts; a user-friendly, implementation-focused, jargon-free version of those docs would go a long way towards easing adoption.
  3. Publishers are using third-party software that doesn't provide machine-readable catalog metadata out of the box. The solution to this isn't to introduce some new practice – which will similarly not be adopted by those suppliers. The solution is to convince the major suppliers to implement a common standard (like DCAT).

In short we need:

  1. Awareness building
  2. Better documentation
  3. Vendor adoption

Creating some new format is not a solution for any of those. I really don't believe the problem is, "DCAT is hard." Let's at least validate what the real problems are before investing time and effort into a solution. Does that make sense?

@jpmckinney

Anyway, for better alignment with DCAT please change:

  • name -> title
  • url -> homepage
  • version -> conformsTo and change the value to the URL for the documentation of this format (which should be a versioned web page)
  • organization_in_charge_of_the_portal -> publisher
    • url -> homepage
  • spatial -> make the value an actual GeoJSON feature, so:
{
  "type": "Feature",
  "geometry": {
    "type": ...,
    "coordinates": ...
  },
  "properties": {
    "name": "Paris",
    "country": "FR"
  }
}
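
Converting the current spatial object into such a feature could look like the Python sketch below. Several things here are assumptions rather than part of the proposal: the geometry type is taken to be a Point, the existing coordinates array is read as [latitude, longitude] (which matches the Paris values), and `spatial_to_feature` is a hypothetical name. GeoJSON positions are ordered longitude first, so the pair is swapped.

```python
def spatial_to_feature(spatial: dict) -> dict:
    """Turn the example's spatial object into a GeoJSON Feature (Point assumed)."""
    # Assumption: the source array is [latitude, longitude], as in the Paris example.
    latitude, longitude = spatial["coordinates"]
    return {
        "type": "Feature",
        "geometry": {
            "type": "Point",
            # GeoJSON positions are [longitude, latitude].
            "coordinates": [longitude, latitude],
        },
        "properties": {
            "name": spatial["locality"],
            "country": spatial["country"],
        },
    }


feature = spatial_to_feature(
    {"country": "FR", "coordinates": [48.8567, 2.3508], "locality": "Paris"}
)
```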

@ColinMaudry

I've created a draft JSON-LD fork in order to actually enable round tripping with RDF: https://gist.github.com/ColinMaudry/5163ecade149a837aa25694fdd7ac46f. It's still incomplete, but it gives an idea.

And here is how it behaves when processing the JSON to RDF with the context: http://tinyurl.com/hdza9yp

Suggestions if we want to go further in that direction:

  • type value should either be a keyword that we can resolve to a URI (ex: local-government) or a URI
  • As-is, themes values cannot be used in a UI in a language other than English. A pity for French data :) Setting up a list of theme URIs would enable multilingual support. As for type, in the JSON, the themes values could either be lower-case keywords in English or URIs
  • I'm not very comfortable with property names in plural form. I assume it's a hint that the value is an array.
