@RubenVerborgh
Last active April 30, 2017 20:19

Accessing data through APIs

Update: my blog post The lie of the API details the issues with current APIs.

Background: I'm a researcher in semantic hypermedia, at the moment comparing different APIs for accessing metadata for human and machine consumption.

Story: I am browsing a cultural website and want to retrieve the metadata of the object I'm looking at in a machine-readable format. The steps below are the actual steps that I've undertaken on different sites.

Example: Cooper-Hewitt museum

I'm looking at the object http://collection.cooperhewitt.org/objects/35460799/.

  1. To retrieve this in JSON, I just copy that URL and do:
$ curl -H "Accept: application/json" http://collection.cooperhewitt.org/objects/35460799/

Example: DBpedia

I'm looking at the person http://dbpedia.org/resource/Arthur_Rimbaud

  1. To retrieve this in JSON, I just copy that URL and do:
$ curl -L -H "Accept: application/json" http://dbpedia.org/resource/Arthur_Rimbaud

There's even RDF if I need it (same URL):
$ curl -L -H "Accept: text/turtle" http://dbpedia.org/resource/Arthur_Rimbaud
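
Both examples above rely on plain HTTP content negotiation: the URL stays the same and only the Accept header changes. A minimal sketch of that pattern as a reusable shell function follows; the name fetch_as is hypothetical, chosen only for illustration.

```shell
# fetch_as MEDIA_TYPE URL
# Hypothetical helper: request one URL in a chosen representation.
fetch_as() {
  media_type=$1
  url=$2
  # -L follows redirects (the DBpedia calls above use it),
  # -sS hides the progress meter but still prints errors.
  curl -sS -L -H "Accept: $media_type" "$url"
}
```

With this in place, `fetch_as application/json http://dbpedia.org/resource/Arthur_Rimbaud` and `fetch_as text/turtle http://dbpedia.org/resource/Arthur_Rimbaud` would issue the same requests as the two DBpedia commands above.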

Example: Europeana

I'm looking at the object http://www.europeana.eu/portal/record/92037/_http___www_bl_uk_onlinegallery_onlineex_apac_addorimss_s_019addor0000002u00000000_html.html?start=1&query=david+ochterlony+hookah&startPage=1&rows=24

  1. To retrieve JSON, I try
$ curl -H "Accept: application/json" http://www.europeana.eu/portal/record/92037/_http___www_bl_uk_onlinegallery_onlineex_apac_addorimss_s_019addor0000002u00000000_html.html
  2. I try to make sense of the following output:
<html><head><title>Apache Tomcat/6.0.24 - Error report</title><style><!--H1 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:22px;} H2 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:16px;} H3 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:14px;} BODY {font-family:Tahoma,Arial,sans-serif;color:black;background-color:white;} B {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;} P {font-family:Tahoma,Arial,sans-serif;background:white;color:black;font-size:12px;}A {color : black;}A.name {color : black;}HR {color : #525D76;}--></style> </head><body><h1>HTTP Status 406 - </h1><HR size="1" noshade="noshade"><p><b>type</b> Status report</p><p><b>message</b> <u></u></p><p><b>description</b> <u>The resource identified by this request is only capable of generating responses with characteristics not acceptable according to the request "accept" headers ().</u></p><HR size="1" noshade="noshade"><h3>Apache Tomcat/6.0.24</h3></body></html>
  3. I search for the documentation.
  4. I end up on this page and click "API documentation".
  5. I end up on the Introduction page, where I see that I have to register.
  6. On the registration page, I enter my e-mail address.
  7. I receive an e-mail and click the link.
  8. I receive my API key.
  9. I click through to Working with the API and make a mental note of a field named apikey.
  10. I go to Sample code. No, that's not it.
  11. I go to API methods and see that record.json (is it a method or a file?) looks like what I need, so I click it.
  12. I am informed that I need to use the URL template http://europeana.eu/api/v2/record/[recordID].json. This URL template has the parameters recordID, callback, profile. I only understand the second one without reading, but I don't need it (not using JSON-P).
  13. Hoping to find the Record ID, I go back to the page I opened in the beginning. I look through the whole page and find nothing called "Record ID", but I find a field "Identifier" with string 019ADDOR0000002U00000000.
  14. I now feel ready to make my first API call and try
$ curl http://europeana.eu/api/v2/record/019ADDOR0000002U00000000.json?apikey=xxxxxxxxx

where xxxxxxxxx is my actual API key, using the apikey field name I found earlier.

  15. I try to make sense of the following output:

<html><head><title>Apache Tomcat/6.0.24 - Error report</title><style><!--H1 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:22px;} H2 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:16px;} H3 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:14px;} BODY {font-family:Tahoma,Arial,sans-serif;color:black;background-color:white;} B {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;} P {font-family:Tahoma,Arial,sans-serif;background:white;color:black;font-size:12px;}A {color : black;}A.name {color : black;}HR {color : #525D76;}--></style> </head><body><h1>HTTP Status 404 - /api/v2/record/019ADDOR0000002U00000000.json</h1><HR size="1" noshade="noshade"><p><b>type</b> Status report</p><p><b>message</b> <u>/api/v2/record/019ADDOR0000002U00000000.json</u></p><p><b>description</b> <u>The requested resource (/api/v2/record/019ADDOR0000002U00000000.json) is not available.</u></p><HR size="1" noshade="noshade"><h3>Apache Tomcat/6.0.24</h3></body></html>
  16. Thinking I might not have used the API key properly, I go back to Working with the API and now see something about a wskey parameter. So the field is called apikey but the parameter wskey. I assume this is a URL query string parameter.
  17. I try the request again:
$ curl http://europeana.eu/api/v2/record/019ADDOR0000002U00000000.json?wskey=xxxxxxxxx
  18. I visually check whether the error output is the same:
<html><head><title>Apache Tomcat/6.0.24 - Error report</title><style><!--H1 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:22px;} H2 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:16px;} H3 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:14px;} BODY {font-family:Tahoma,Arial,sans-serif;color:black;background-color:white;} B {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;} P {font-family:Tahoma,Arial,sans-serif;background:white;color:black;font-size:12px;}A {color : black;}A.name {color : black;}HR {color : #525D76;}--></style> </head><body><h1>HTTP Status 404 - /api/v2/record/019ADDOR0000002U00000000.json</h1><HR size="1" noshade="noshade"><p><b>type</b> Status report</p><p><b>message</b> <u>/api/v2/record/019ADDOR0000002U00000000.json</u></p><p><b>description</b> <u>The requested resource (/api/v2/record/019ADDOR0000002U00000000.json) is not available.</u></p><HR size="1" noshade="noshade"><h3>Apache Tomcat/6.0.24</h3></body></html>
  19. I suspect I might have gotten the identifier wrong. I go back to the original page and start looking through the source code to see whether I can find an identifier. I only find 019ADDOR0000002U00000000, which I have tried already.
  20. I go back to the Working with the API page and click the link Europeana ID next to the recordID field, where I read the following explanation: "Digital records delivered to Europeana are assigned a unique identifier, Europeana ID, that serves to further identify the records when using the API. Usually, this identifier is based on the original metadata that are provided for the record and internal Europeana identifiers of the provider and the dataset containing the record. For example, a Europeana ID of an object can look as follows: /09102/_GNM_1234 where 091 is the identifier of the provider, 02 is the id of the dataset and GNM_1234 is derived from the unique identifier of the record in the context of the provider."
  21. I inspect the URL to see whether I can find such an identifier: http://www.europeana.eu/portal/record/92037/_http___www_bl_uk_onlinegallery_onlineex_apac_addorimss_s_019addor0000002u00000000_html.html?start=1&query=david+ochterlony+hookah&startPage=1&rows=24. Indeed, there is a part "92037/", but the thing that follows it does not look like that. I find this strange, but try it anyway:
$ curl http://europeana.eu/api/v2/record/92037/_http___www_bl_uk_onlinegallery_onlineex_apac_addorimss_s_019addor0000002u00000000_html?apikey=xxxxxxxxx
  22. I get the error message
<html><head><title>Apache Tomcat/6.0.24 - Error report</title><style><!--H1 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:22px;} H2 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:16px;} H3 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:14px;} BODY {font-family:Tahoma,Arial,sans-serif;color:black;background-color:white;} B {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;} P {font-family:Tahoma,Arial,sans-serif;background:white;color:black;font-size:12px;}A {color : black;}A.name {color : black;}HR {color : #525D76;}--></style> </head><body><h1>HTTP Status 404 - /api/v2/record/92037/_http___www_bl_uk_onlinegallery_onlineex_apac_addorimss_s_019addor0000002u00000000_html</h1><HR size="1" noshade="noshade"><p><b>type</b> Status report</p><p><b>message</b> <u>/api/v2/record/92037/_http___www_bl_uk_onlinegallery_onlineex_apac_addorimss_s_019addor0000002u00000000_html</u></p><p><b>description</b> <u>The requested resource (/api/v2/record/92037/_http___www_bl_uk_onlinegallery_onlineex_apac_addorimss_s_019addor0000002u00000000_html) is not available.</u></p><HR size="1" noshade="noshade"><h3>Apache Tomcat/6.0.24</h3></body></html>
  23. I Google for "http://europeana.eu/api/v2/record" to see if anybody else got the API working.
  24. I arrive at the npm package registry and find a JSON fragment that mentions the link http://europeana.eu/api/v2/record/08501/03F4577D418DC84979C4E2EE36F99FECED4C7B11.json?wskey=abc123.
  25. I add my own API key to test whether I can retrieve this random object:
$ curl http://europeana.eu/api/v2/record/08501/03F4577D418DC84979C4E2EE36F99FECED4C7B11.json?wskey=xxxxxxxxx
  26. This works, but it's not the object that I wanted. Now let's try replacing the object identifier by 92037/_http___www_bl_uk_onlinegallery_onlineex_apac_addorimss_s_019addor0000002u00000000_html:
$ curl http://europeana.eu/api/v2/record/92037/_http___www_bl_uk_onlinegallery_onlineex_apac_addorimss_s_019addor0000002u00000000_html.json?wskey=xxxxxxxxx

This works.

  27. I wonder why it didn't work in step 21, only to find out that I had not added the extension .json. I also wonder if there is any other way of getting the object ID instead of copying from the URL.
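
The manual copy-and-paste that finally worked can be written down as a small string transformation. This is only a sketch under the assumption that portal URLs keep the shape seen above (a /portal/record/ prefix, a trailing .html, an optional query string) and that the v2 URL template and the wskey parameter behave as they did in this walkthrough. The function name record_api_url is hypothetical.

```shell
# record_api_url PORTAL_URL WSKEY
# Hypothetical helper: derive the JSON API URL from a portal page URL.
record_api_url() {
  portal_url=$1
  wskey=$2
  path=${portal_url%%\?*}         # drop the query string, if any
  path=${path#*/portal/record/}   # keep "<dataset>/<record>.html"
  record_id=${path%.html}         # drop the page's .html extension
  # The .json extension and the wskey parameter are both required.
  printf 'http://europeana.eu/api/v2/record/%s.json?wskey=%s\n' "$record_id" "$wskey"
}
```

Calling `curl "$(record_api_url "$page_url" "$my_key")"` with the portal URL from the start of this example would produce the same request as the last curl command above.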
@DavidHaskiya

Hi Ruben,
Ouch! And thanks. Well, it's obvious that we have a lot of improvements to make to our documentation! We'll take your experience to heart as we're now working on a major update of our API docs. I'll get back to you once we've improved our docs, and hopefully your next review will be a bit more positive.

As to your question in step 27, I guess one of the mistakes we've made is that we wrongly assumed that API users would begin with a search, e.g. http://www.europeana.eu/portal/api/console.html?function=search&query=multatuli and then pick up the id and/or the provided record call directly from the response (both are included) for the full record call, e.g. http://www.europeana.eu/portal/api/console.html?function=record&profile=full&recordId=%2F92062%2F8E88751AB58C3D950E96A4C92505DB8600BB99C4

Bad assumption.

Cheers,
David

@RubenVerborgh

Hi David,

Thanks for getting back on this, I appreciate it and would be glad to check out the improved version.

Documentation is indeed one part (and often neglected, but you obviously invested a lot in it). The assumption you mention is an important one. It indicates that people had an RPC-style scenario in mind: first call this, then call that.

For me, a huge collection is all about the resources and how they interlink, and not so much about a sequence of operations performed on those resources. And this, unfortunately, is about API design as well and I know that is much more difficult to change than documentation. I'm afraid that the major issue here is that users have to read the documentation before they can get started. With the Cooper-Hewitt and DBpedia APIs above, I didn't have to read documentation: the identifier of each object is the URL, and this URL allows me to retrieve the object both manually and programmatically. I have a hard time understanding the technical necessity to make APIs more complex than that. (Of course, there are other necessities.)

But even if there are non-technical reasons to have a separate API, they should correspond to each other. The Object ID problem illustrates this: I had to manually find a part of the path in my URL and then paste this into another URL. It would be a good idea for the HTML version to list this ID; and a good idea for the JSON version to link to the HTML version by its URL. Another problem is that I cannot share the JSON version: it is impossible for me to link to it, as the URL includes my private key. Even worse, I cannot share the JSON body as-is, because it also contains my API key.
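
As a stopgap for the sharing problem, one could at least strip the key from a URL before passing it on. The sketch below assumes the key travels only in a wskey query parameter, as in the calls above; the name strip_wskey is hypothetical, and the resulting URL will of course no longer work on its own as long as the API requires a key.

```shell
# strip_wskey URL
# Hypothetical helper: remove the wskey query parameter from a URL.
strip_wskey() {
  # First expression: drop "wskey=..." and a following "&", keeping
  # the "?" or "&" in front of it.
  # Second expression: drop a "?" or "&" left dangling at the end.
  printf '%s\n' "$1" | sed -E -e 's/([?&])wskey=[^&]*&?/\1/' -e 's/[?&]$//'
}
```

This does nothing about the key that is echoed in the JSON body, which would still have to be removed separately.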

Furthermore, I'm also afraid that this API key makes it impossible to develop AJAX applications that use the Europeana API. Suppose I have a museum website: how can my pages retrieve objects from Europeana in a dynamic way? It's impossible to add this to the client-side code, because this would mean disclosing my API key. This means it has to happen on the server side, but that was possible before anyway. I'm not saying I assume every API must be open; it's just that having one representation open (HTML) and another closed (JSON) is strange. The information is not shielded off; the representation is: we could equally well scrape the HTML pages and extract the JSON information out of them, as they all follow a structured template. In that regard, it doesn't make sense to provide different affordances to human and machine clients. Limiting access per IP address is far more effective to combat misuse, given that API access is free. Plus, a keyless API would allow use in Web applications (where IP addresses are distributed across clients, so blocking is not an issue).

Finally, perhaps even more important than the documentation is improving the error messages. They are not helpful at all; sometimes HTML and sometimes JSON. Couldn't they just include links to example calls I can do with my API key? Or if I got the key wrong, indicate how I need to add it? Self-descriptiveness is important here.
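
Until the errors describe themselves, a client has to look at the status code and the Content-Type header before it dares to parse anything. A rough sketch of that defensive check follows; the helper names classify_response and probe are hypothetical, and the categories are only the ones that came up in this walkthrough.

```shell
# classify_response STATUS CONTENT_TYPE
# Hypothetical helper: summarize what kind of response came back.
classify_response() {
  status=$1
  content_type=$2
  case $status in
    2??) kind=ok ;;
    404) kind=not-found ;;
    406) kind=not-acceptable ;;
    *)   kind=error ;;
  esac
  case $content_type in
    *json*) format=json ;;
    *html*) format=html ;;
    *)      format=unknown ;;
  esac
  printf '%s %s\n' "$kind" "$format"
}

# probe URL
# Print the status code and content type of a response, discarding the body.
probe() {
  curl -sS -o /dev/null -w '%{http_code} %{content_type}\n' "$1"
}
```

The two can be combined as `probe "$url" | { read -r status ctype; classify_response "$status" "$ctype"; }`, so that a script only hands the body to a JSON parser when the answer is "ok json".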

Fortunately, APIs are living things, and I'm sure that your improvement work will change Europeana for the better!

Cheers,

Ruben

@RubenVerborgh

See my blog post for more.
Also, our forthcoming book Linked Data for Libraries, Archives and Museums handles the topic in detail.
