Quick introduction to the OmicIDX GraphQL API
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import requests"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introduction to the OmicIDX API\n",
"\n",
"OmicIDX parses and then serves public genomics repository metadata. These metadata\n",
"are growing quickly, updated often, and are now very large when taken as a whole.\n",
"Because of the interrelated nature of the metadata and the myriad approaches and use cases that \n",
"exist, including search, bulk download, and even data mining, we serve the data via\n",
"a GraphQL endpoint.\n",
"\n",
"Currently, OmicIDX contains the SRA and Biosample metadata sets. These overlap with \n",
"each other, but SRA metadata contains deeper metadata than Biosample data on the same\n",
"samples. However, Biosample contains many more samples and currently includes metadata \n",
"about a subset of NCBI GEO, all SRA samples, and some additional samples from projects \n",
"like Genbank.\n",
"\n",
"# GraphQL for accessing OmicIDX data\n",
"\n",
"[GraphQL] is a query language for APIs and a runtime for fulfilling those queries with existing data. \n",
"GraphQL provides a complete and understandable description of the data in the API, gives clients \n",
"the power to ask for exactly what they need and nothing more, makes it easier to evolve APIs over time, \n",
"and enables powerful developer tools.\n",
"\n",
"GraphQL has only a *single url*, called the endpoint, which allows access to all data in the \n",
"API. GraphQL is also a *query language*. It is the GraphQL query that is submitted to the GraphQL\n",
"endpoint that results in data being returned. \n",
"\n",
"\n",
"[GraphQL]: https://graphql.org/\n",
"\n",
"## What is a GraphQL query?\n",
"\n",
"A GraphQL query looks a bit like JSON, except without quotes or commas. Here is an example\n",
"GraphQL query for a fictitious GraphQL API.\n",
"\n",
"```\n",
"{ \n",
" allCharacters {\n",
" name\n",
" }\n",
"}\n",
"```\n",
"\n",
"If we had a server that contained Star Wars trivia, the response from the server might look like:\n",
"\n",
"```\n",
"{ \"data\": {\n",
" \"allCharacters\": [\n",
" { \n",
" \"name\":\"Luke\"\n",
" },\n",
" { \n",
" \"name\": \"Darth\"\n",
" },\n",
" ...\n",
" ]\n",
" }\n",
"}\n",
"```\n",
"\n",
"If we changed the query to:\n",
"\n",
"```\n",
"{ \n",
" allCharacters {\n",
" name\n",
" mass\n",
" }\n",
"}\n",
"```\n",
"\n",
"the response would now look like:\n",
"\n",
"```\n",
"{ \"data\": {\n",
" \"allCharacters\": [\n",
" { \n",
" \"name\":\"Luke\",\n",
" \"mass\": 80\n",
" },\n",
" { \n",
" \"name\": \"Darth\",\n",
" \"mass\": 140\n",
" },\n",
" ...\n",
" ]\n",
" }\n",
"}\n",
"```\n",
"\n",
"## How do I know what is in the GraphQL endpoint?\n",
"\n",
"The GraphQL **schema** describes the data model(s) contained in the GraphQL endpoint. GraphQL is \n",
"strongly typed, has the concept of relationships between data types, and is self-documenting. \n",
"In fact, one can use the GraphQL endpoint to discover what is in the endpoint. I will not go \n",
"into the details right now, but this *introspection* capability makes possible some powerful\n",
"tooling. One of the most ubiguitous is the so-called **Graph*i*QL** (note the *i* in the name) \n",
"tool.\n",
"\n",
"## Exercise 1\n",
"\n",
"Navigate to the [OmicIDX GraphiQL] and follow along with [this video](https://youtu.be/1Zg_Fbt56kc).\n",
"\n",
"[OmicIDX GraphiQL]: http://graphql-omicidx.cancerdatasci.org/graphiql\n",
"\n",
"# Querying OmicIDX programmatically\n",
"\n",
"GraphQL is quite easy to work with programmatically. All queries are made via a post request to the GraphQL endpoint.\n",
"\n",
"- Current OmicIDX GraphQL endpoint: http://graphql-omicidx.cancerdatasci.org/graphql\n",
"\n",
"For example, let us get the first 500 SRA studies (500 is a limit to the number of results that we can retrieve in one go. \n",
"The GraphQL might look like this:\n",
"\n",
"```\n",
"{\n",
" allSraStudies {\n",
" edges {\n",
" node {\n",
" accession\n",
" title\n",
" abstract\n",
" }\n",
" }\n",
" }\n",
"}\n",
"```\n",
"\n",
"From exercise 1, you know that you can copy this query into the Graph*i*QL browser and get results. How about using [curl](https://curl.haxx.se/)?\n",
"\n",
"```\n",
"curl \\\n",
" -X POST \\\n",
" -H \"Content-Type: application/json\" \\\n",
" http://graphql-omicidx.cancerdatasci.org/graphql \\\n",
" --data @- << EOF\n",
" \n",
" { \"query\":\"{\n",
" allSraStudies {\n",
" edges {\n",
" node {\n",
" accession\n",
" title\n",
" abstract\n",
" }\n",
" }\n",
" }\n",
"}\"\n",
"}\n",
"EOF\n",
"```\n",
"\n",
"Performing the same query in python is also straightforward. Making a post request with the [requests](http://docs.python-requests.org/en/master/user/quickstart/#more-complicated-post-requests) library is pretty straightforward\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"graphql_endpoint = \"http://graphql-omicidx.cancerdatasci.org/graphql\""
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"query_dict = {'query' : \"\"\"\n",
"{\n",
" allSraStudies(first: 3) {\n",
" edges {\n",
" node {\n",
" accession\n",
" title\n",
" abstract\n",
" }\n",
" }\n",
" }\n",
"}\n",
"\"\"\"}"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"response ok? True\n"
]
}
],
"source": [
"response = requests.post(graphql_endpoint, json = query_dict)\n",
"print(\"response ok?\", response.ok)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And the `response.json()` value is exactly what we would have seen in the Graph*i*QL browser."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'data': {'allSraStudies': {'edges': [{'node': {'accession': 'DRP000001',\n",
" 'title': 'Bacillus subtilis subsp. natto BEST195 genome sequencing project',\n",
" 'abstract': '<b><i>Bacillus subtilis</i> subsp. <i>natto</i> BEST195</b>. i>Bacillus subtilis</i> subsp. <i>natto</i> BEST195 was isolated from fermented soybeans and will be used for comparative genome analysis.'}},\n",
" {'node': {'accession': 'DRP000002',\n",
" 'title': 'Model organism for prokaryotic cell differentiation and development',\n",
" 'abstract': None}},\n",
" {'node': {'accession': 'DRP000003',\n",
" 'title': 'Comprehensive identification and characterization of the nucleosome structure',\n",
" 'abstract': 'Comprehensive identification and characterization of the nucleosome structure in mammalian genes were attempted. We used Nucleosome-Seq method, in which next gene sequencing technology and micrococcus nuclease digestion assay were combined.'}}]}}}"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"response.json()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can make querying a bit more flexible and incorporate variables and gzip compression using a simple function. \n",
"All GraphQL queries return results in JSON format that conforms to the query, so complex libraries are not \n",
"required when dealing with GraphQL APIs. "
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"def query(query, variables = None, headers = {\"Accept-Encoding\": \"gzip\"}):\n",
" \"\"\"Perform query against a graphql endpoint\n",
" \n",
" GraphQL is very easy to program with, as there is just one\n",
" endpoint for all queries. \n",
" \n",
" Parameters\n",
" ----------\n",
" query: str\n",
" a graphql query string\n",
" variables: dict\n",
" a dictionary of variables to substitute into the graphql query\n",
" headers: dict\n",
" for now, the accept-encoding header is the only one\n",
" of importance. As specified, it asks the server to\n",
" gzip results. Most clients (including python requests)\n",
" can unzip on the fly.\n",
" \"\"\"\n",
" resp = requests.post('http://graphql-omicidx.cancerdatasci.org/graphql', \n",
" json = {\"query\" : query, \"variables\": variables},\n",
" headers = headers\n",
" )\n",
" if(resp.ok):\n",
" # All queries return well-formated json if successful.\n",
" ret = resp.json()\n",
" ret.update({\"ok\": True})\n",
" return ret\n",
" else:\n",
" return({\"ok\": False,\n",
" \"status_code\": resp.status_code})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following query incorporates the concept of variables. "
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"q3 = \"\"\"\n",
"query allStudies($first: Int=500 $after: Cursor=null) {\n",
" allSraStudies(first: $first after: $after) {\n",
" edges {\n",
" node {\n",
" bioproject\n",
" gse\n",
" abstract\n",
" alias\n",
" attributes\n",
" brokerName\n",
" centerName\n",
" description\n",
" identifiers\n",
" studyType\n",
" title\n",
" xrefs\n",
" status\n",
" updated\n",
" published\n",
" received\n",
" visibility\n",
" bioProject\n",
" replacedBy\n",
" }\n",
" }\n",
" pageInfo {\n",
" hasNextPage\n",
" endCursor\n",
" }\n",
" }\n",
"}\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"z = query(q3, {\"first\": 2}) # Feel free to change \"2\" to some other value, but values over 500 are automatically capped"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2\n",
"[{'node': {'abstract': '<b><i>Bacillus subtilis</i> subsp. <i>natto</i> '\n",
" 'BEST195</b>. i>Bacillus subtilis</i> subsp. '\n",
" '<i>natto</i> BEST195 was isolated from fermented '\n",
" 'soybeans and will be used for comparative genome '\n",
" 'analysis.',\n",
" 'alias': 'DRP000001',\n",
" 'attributes': None,\n",
" 'bioProject': 'PRJDA38027',\n",
" 'bioproject': 'PRJDA38027',\n",
" 'brokerName': None,\n",
" 'centerName': 'KEIO',\n",
" 'description': None,\n",
" 'gse': None,\n",
" 'identifiers': '[{\"id\":\"PRJDA38027\",\"namespace\":\"BioProject\"},{\"id\":\"DRP000001\",\"namespace\":\"KEIO\"}]',\n",
" 'published': '2015-07-31T15:20:44',\n",
" 'received': '2009-06-20T02:48:02',\n",
" 'replacedBy': None,\n",
" 'status': 'live',\n",
" 'studyType': 'Whole Genome Sequencing',\n",
" 'title': 'Bacillus subtilis subsp. natto BEST195 genome sequencing '\n",
" 'project',\n",
" 'updated': '2019-01-25T16:06:49',\n",
" 'visibility': 'public',\n",
" 'xrefs': '[{\"db\":\"pubmed\",\"id\":\"20398357\"},{\"db\":\"pubmed\",\"id\":\"25329997\"}]'}},\n",
" {'node': {'abstract': None,\n",
" 'alias': 'DRP000002',\n",
" 'attributes': None,\n",
" 'bioProject': 'PRJDA39275',\n",
" 'bioproject': 'PRJDA39275',\n",
" 'brokerName': None,\n",
" 'centerName': 'KEIO',\n",
" 'description': None,\n",
" 'gse': None,\n",
" 'identifiers': '[{\"id\":\"PRJDA39275\",\"namespace\":\"BioProject\"},{\"id\":\"DRP000002\",\"namespace\":\"KEIO\"}]',\n",
" 'published': '2010-03-24T03:11:55',\n",
" 'received': '2009-08-04T07:37:05',\n",
" 'replacedBy': None,\n",
" 'status': 'live',\n",
" 'studyType': 'Whole Genome Sequencing',\n",
" 'title': 'Model organism for prokaryotic cell differentiation and '\n",
" 'development',\n",
" 'updated': '2017-09-17T10:08:49',\n",
" 'visibility': 'public',\n",
" 'xrefs': '[{\"db\":\"pubmed\",\"id\":\"20398357\"}]'}}]\n"
]
}
],
"source": [
"import pprint\n",
"if(z['ok']): \n",
" if('data' in z):\n",
" print(len(z['data']['allSraStudies']['edges']))\n",
" pprint.pprint(z['data']['allSraStudies']['edges'][0:5])"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dict_keys(['edges', 'pageInfo'])"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"z['data']['allSraStudies'].keys()"
]
},
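{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `pageInfo` object requested in `q3` carries the paging state: `hasNextPage` says whether another page exists, and `endCursor` is the opaque cursor to hand back as the `after` variable. A quick, illustrative look:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# inspect the paging state returned alongside the edges\n",
"z['data']['allSraStudies']['pageInfo']"
]
},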
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Up to now, we have been simply fetching the default first 500 records. We can use the pageInfo \n",
"object and the cursor to page through results to get the first 3000 records in chunks of 500. \n",
"Note how the cursor is used below. "
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"True\n",
"WyJwcmltYXJ5X2tleV9hc2MiLFsiRFJQMDAwNTgwIl1d\n",
"True\n",
"WyJwcmltYXJ5X2tleV9hc2MiLFsiRFJQMDAxMDkxIl1d\n",
"True\n",
"WyJwcmltYXJ5X2tleV9hc2MiLFsiRFJQMDAxNjMxIl1d\n",
"True\n",
"WyJwcmltYXJ5X2tleV9hc2MiLFsiRFJQMDAyMjU2Il1d\n",
"True\n",
"WyJwcmltYXJ5X2tleV9hc2MiLFsiRFJQMDAyNzU4Il1d\n",
"True\n",
"WyJwcmltYXJ5X2tleV9hc2MiLFsiRFJQMDAzMjU4Il1d\n",
"True\n",
"WyJwcmltYXJ5X2tleV9hc2MiLFsiRFJQMDAzNzYxIl1d\n"
]
}
],
"source": [
"hasNext=True\n",
"cursor=None\n",
"import json\n",
"n = 0\n",
"n_max = 3000\n",
"import gzip\n",
"with gzip.open('/tmp/allStudies.json.gz', 'wb') as outfile:\n",
" while(hasNext and n <= n_max):\n",
" res = query(q3,{\"after\": cursor} )\n",
" n+=len(res['data']['allSraStudies']['edges'])\n",
" for row in res['data']['allSraStudies']['edges']:\n",
" outfile.write(bytes(json.dumps(row['node'])+ \"\\n\", encoding='UTF-8'))\n",
" if(n % 5000 == 0):\n",
" print(n)\n",
" hasNext = res['data']['allSraStudies']['pageInfo']['hasNextPage']\n",
" print(hasNext)\n",
" cursor = res['data']['allSraStudies']['pageInfo']['endCursor']\n",
" print(cursor)"
]
},
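{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since each line of the gzipped file written above holds one JSON-encoded study record, the standard library is enough to read it back. A minimal sketch, assuming the loop above has populated `/tmp/allStudies.json.gz`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import gzip\n",
"import json\n",
"\n",
"# read the file back: one JSON object (one study record) per line\n",
"with gzip.open('/tmp/allStudies.json.gz', 'rt', encoding='UTF-8') as infile:\n",
"    records = [json.loads(line) for line in infile]\n",
"print(len(records))\n",
"print(records[0]['title'])"
]
},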
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}