@SandyRogers
Last active March 22, 2022 10:24
Fetch paginated tabular data from MGnify, the EBI Metagenomics API. These are Python examples of how to fetch data from its JSON:API-based (https://jsonapi.org) endpoints. The API Browser is available at https://www.ebi.ac.uk/metagenomics/api
# Using common libraries.
#
# Dependencies:
# pandas jsonapi-client
# Install them from the command line, with e.g.
# $ pip install pandas jsonapi-client
from jsonapi_client import Session
import pandas as pd
# See https://www.ebi.ac.uk/metagenomics/api/docs/ for endpoints and API documentation.
endpoint = 'super-studies'
with Session("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    # Iterate over every page of results, keeping each resource's raw JSON,
    # then flatten the nested JSON into a table and save it as CSV.
    resources = map(lambda r: r.json, mgnify.iterate(endpoint))
    resources = pd.json_normalize(resources)
    resources.to_csv(f"{endpoint}.csv")

# Using Python 3 standard library only, with no extra python packages needed:
import urllib.request
import json
import csv
# See https://www.ebi.ac.uk/metagenomics/api/docs/ for endpoints and API documentation
# including attributes you may want as CSV columns.
endpoint = 'super-studies'
attribute_columns = ["super-study-id", "title", "description"]
def get_page(url):
    # Generator: yields each page of 'data' from a paginated MGnify endpoint,
    # following the 'next' link until it is null on the last page.
    next_url = url
    while next_url:
        with urllib.request.urlopen(next_url) as page:
            response = json.loads(page.read().decode())
        data = response['data']
        yield data
        next_url = response['links']['next']

with open(f"{endpoint}.csv", "w") as csv_file:
    c = csv.writer(csv_file)
    c.writerow(attribute_columns)
    for page in get_page(f"https://www.ebi.ac.uk/metagenomics/api/v1/{endpoint}"):
        for resource in page:
            c.writerow([resource['attributes'].get(col) for col in attribute_columns])

# Using common libraries.
#
# Dependencies:
# pandas jsonapi-client
# Install them from the command line, with e.g.
# $ pip install pandas jsonapi-client
# This example includes a Filter.
# You can explore the Filters available for each endpoint using the interactive API browser:
# https://www.ebi.ac.uk/metagenomics/api/v1
#
# If you don't use any Filters, endpoints with many pages of data (like /samples) might fail.
from jsonapi_client import Session, Filter
import pandas as pd
class MGnifyFilter(Filter):
    def format_filter_query(self, **kwargs: 'FilterKeywords') -> str:
        """
        The MGnify API uses a slimmer syntax for filters than the JSON:API default.
        Filter keywords are not wrapped by the word "filter", like filter[foo]=bar,
        but are instead plain, like foo=bar.
        """
        def jsonify_key(key):
            # Convert Python-style keyword names to API query parameter names,
            # e.g. sample_name -> sample-name.
            return key.replace('__', '.').replace('_', '-')
        return '&'.join(f'{jsonify_key(key)}={value}'
                        for key, value in kwargs.items())

endpoint = 'samples'
filters = {
    'lineage': 'root:Engineered:Bioremediation'
}

with Session("https://www.ebi.ac.uk/metagenomics/api/v1/") as mgnify:
    # Iterate over only the pages matching the filter, keeping each resource's raw JSON.
    resources = map(lambda r: r.json, mgnify.iterate(endpoint, MGnifyFilter(**filters)))
    resources = pd.json_normalize(resources)
    resources.to_csv(f"{endpoint}.csv")
@taylorreiter commented Mar 18, 2022

fetch_paginated_mgnify_data.py works for me when I use biomes or super-studies, but fails with this error for the samples endpoint:

---------------------------------------------------------------------------
JSONDecodeError                           Traceback (most recent call last)
File ~/miniconda3/envs/tidy_jupyter/lib/python3.10/site-packages/requests/models.py:910, in Response.json(self, **kwargs)
    909 try:
--> 910     return complexjson.loads(self.text, **kwargs)
    911 except JSONDecodeError as e:
    912     # Catch JSON-related errors and raise as requests.JSONDecodeError
    913     # This aliases json.JSONDecodeError and simplejson.JSONDecodeError

File ~/miniconda3/envs/tidy_jupyter/lib/python3.10/json/__init__.py:346, in loads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    343 if (cls is None and object_hook is None and
    344         parse_int is None and parse_float is None and
    345         parse_constant is None and object_pairs_hook is None and not kw):
--> 346     return _default_decoder.decode(s)
    347 if cls is None:

File ~/miniconda3/envs/tidy_jupyter/lib/python3.10/json/decoder.py:337, in JSONDecoder.decode(self, s, _w)
    333 """Return the Python representation of ``s`` (a ``str`` instance
    334 containing a JSON document).
    335 
    336 """
--> 337 obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    338 end = _w(s, end).end()

File ~/miniconda3/envs/tidy_jupyter/lib/python3.10/json/decoder.py:355, in JSONDecoder.raw_decode(self, s, idx)
    354 except StopIteration as err:
--> 355     raise JSONDecodeError("Expecting value", s, err.value) from None
    356 return obj, end

<longer traceback trimmed by Gist author>

@SandyRogers (Author)

@taylorreiter
This is likely because /samples returns a lot of pages of data. So, you have two strategies to pick from:

  • Use Filters to limit the amount of data you're requesting. This is almost always what you want to do in a real use case. I've added a third example to this gist, showing how to do that.
  • Throttle your requests to limit how quickly you pull data, if you really do need a lot of unfiltered data. I haven't included a copy-and-paste example of this here because it's usually less efficient than limiting the data you request in the first place. You can achieve it with the jsonapi_client package, e.g. by fetching the first page with page = mgnify.fetch_document_by_url(f'{mgnify.url_prefix}/{endpoint}'), then, after a time.sleep, following page.links.next, and so on; a rough sketch follows below this list.
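
For reference, here is a minimal sketch of that throttled, page-by-page approach. It is not part of the original gist: it assumes jsonapi_client Document objects expose resources and a links.next link with an href (which matches the calls mentioned above, but may vary between package versions), so treat it as a starting point rather than a drop-in script.

import time
from jsonapi_client import Session
import pandas as pd

endpoint = 'samples'
rows = []

with Session("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    # Fetch the first page of the endpoint directly by URL.
    document = mgnify.fetch_document_by_url(f'{mgnify.url_prefix}/{endpoint}')
    while document:
        rows.extend(resource.json for resource in document.resources)
        next_link = getattr(document.links, 'next', None)
        if not next_link or not next_link.href:
            break  # last page reached
        time.sleep(1)  # throttle: pause between page requests
        document = mgnify.fetch_document_by_url(next_link.href)

pd.json_normalize(rows).to_csv(f"{endpoint}.csv")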

Please feel free to contact the MGnify helpdesk if you need a large query / dataset and aren't able to access it this way.

@taylorreiter

Thank you @SandyRogers! I did need the whole set of metadata, but your second gist worked to retrieve it. I had tried to use filters to download it piecemeal and couldn't figure it out, so the third example is a wonderful new resource.

This is the notebook I ended up with: https://github.com/taylorreiter/2022-sra-gather/blob/main/notebooks/20220318_fetch_paginated_data_ebi.ipynb

I really appreciate you making these gists available, they were instrumental in figuring out how to use the API!

@SandyRogers (Author)

Brilliant, thanks @taylorreiter for the feedback. I will add some more examples of fetching paginated data to our Jupyter Lab Notebooks as well :)
