

@SandyRogers
Last active March 22, 2022 10:24
Fetch paginated tabular data from MGnify, the EBI Metagenomics API. These are Python examples of how to fetch data from its https://jsonapi.org based endpoints. The API Browser is available at https://www.ebi.ac.uk/metagenomics/api
# Using common libraries.
#
# Dependencies:
# pandas jsonapi-client
# Install them from the command line, with e.g.
# $ pip install pandas jsonapi-client
from jsonapi_client import Session
import pandas as pd
# See https://www.ebi.ac.uk/metagenomics/api/docs/ for endpoints and API documentation.
endpoint = 'super-studies'
with Session("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    resources = map(lambda r: r.json, mgnify.iterate(endpoint))
    resources = pd.json_normalize(resources)
    resources.to_csv(f"{endpoint}.csv")
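The heavy lifting above is done by `pd.json_normalize`, which flattens each nested JSON:API resource into one CSV row. A minimal standalone sketch of that flattening, using made-up resources in the same shape (the ids and titles here are illustrative, not real API data):

```python
import pandas as pd

# Each JSON:API resource is a dict with nested 'attributes'.
resources = [
    {'id': 'example-1', 'attributes': {'title': 'First study'}},
    {'id': 'example-2', 'attributes': {'title': 'Second study'}},
]

# json_normalize flattens nested keys into dotted column names.
df = pd.json_normalize(resources)
print(list(df.columns))  # ['id', 'attributes.title']
```

This is why the CSV produced by the snippet above has columns like `attributes.samples-count` rather than a nested structure.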
# Using Python 3 standard library only, with no extra python packages needed:
import urllib.request
import json
import csv
# See https://www.ebi.ac.uk/metagenomics/api/docs/ for endpoints and API documentation
# including attributes you may want as CSV columns.
endpoint = 'super-studies'
attribute_columns = ["super-study-id", "title", "description"]
def get_page(url):
    next_url = url
    while next_url:
        with urllib.request.urlopen(next_url) as page:
            response = json.loads(page.read().decode())
            data = response['data']
            yield data
            next_url = response['links']['next']
# Open with newline="" so csv.writer does not emit blank rows on Windows.
with open(f"{endpoint}.csv", "w", newline="") as csv_file:
    c = csv.writer(csv_file)
    c.writerow(attribute_columns)
    for page in get_page(f"https://www.ebi.ac.uk/metagenomics/api/v1/{endpoint}"):
        for resource in page:
            c.writerow([resource['attributes'].get(col) for col in attribute_columns])
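The pagination loop above relies on the JSON:API response shape: each page has a `data` list of resources and a `links.next` URL that is null on the last page. A minimal illustration of that walk using canned pages instead of real HTTP (the page contents here are invented for the example):

```python
# Two fake pages in the JSON:API shape the stdlib example consumes.
pages = {
    'page1': {'data': [{'attributes': {'title': 'A'}}],
              'links': {'next': 'page2'}},
    'page2': {'data': [{'attributes': {'title': 'B'}}],
              'links': {'next': None}},  # null 'next' ends the walk
}

def iterate(start_url):
    url = start_url
    while url:
        response = pages[url]  # stands in for urllib.request + json.loads
        yield from response['data']
        url = response['links']['next']

titles = [r['attributes']['title'] for r in iterate('page1')]
print(titles)  # ['A', 'B']
```

The real endpoint works the same way, just with HTTP fetches in place of the dictionary lookup.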
# Using common libraries.
#
# Dependencies:
# pandas jsonapi-client
# Install them from the command line, with e.g.
# $ pip install pandas jsonapi-client
# This example includes a Filter.
# You can explore the Filters available for each endpoint using the interactive API browser:
# https://www.ebi.ac.uk/metagenomics/api/v1
#
# If you don't use any Filters, endpoints with many pages of data (like /samples) might fail.
from jsonapi_client import Session, Filter
import pandas as pd
class MGnifyFilter(Filter):
    def format_filter_query(self, **kwargs: 'FilterKeywords') -> str:
        """
        The MGnify API uses a slimmer syntax for filters than the JSON:API default.
        Filter keywords are not wrapped in the word "filter" (like filter[foo]=bar)
        but are plain, like foo=bar.
        """
        def jsonify_key(key):
            return key.replace('__', '.').replace('_', '-')
        return '&'.join(f'{jsonify_key(key)}={value}'
                        for key, value in kwargs.items())
endpoint = 'samples'
filters = {
    'lineage': 'root:Engineered:Bioremediation'
}

with Session("https://www.ebi.ac.uk/metagenomics/api/v1/") as mgnify:
    resources = map(lambda r: r.json, mgnify.iterate(endpoint, MGnifyFilter(**filters)))
    resources = pd.json_normalize(resources)
    resources.to_csv(f"{endpoint}.csv")
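The only custom logic in `MGnifyFilter` is the key translation: double underscores become dots (for nested attributes) and single underscores become hyphens. A standalone sketch of that translation, with no jsonapi-client dependency (the keyword names below are illustrative):

```python
# The same key-to-query rules as MGnifyFilter.format_filter_query above,
# extracted so they can be tried without the jsonapi-client library.
def format_filter_query(**kwargs) -> str:
    def jsonify_key(key):
        # '__' -> '.' for nested attributes, '_' -> '-' for hyphenated names
        return key.replace('__', '.').replace('_', '-')
    return '&'.join(f'{jsonify_key(key)}={value}'
                    for key, value in kwargs.items())

print(format_filter_query(lineage='root:Engineered:Bioremediation'))
# lineage=root:Engineered:Bioremediation
print(format_filter_query(experiment_type='metagenomic'))
# experiment-type=metagenomic
```

Python keyword arguments cannot contain hyphens or dots, so this mangling lets you write `experiment_type=...` in Python while the API receives `experiment-type=...`.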
@taylorreiter

Thank you @SandyRogers! I did need the whole set of metadata, and your second example worked to retrieve it. I had tried to use filters to download it piecemeal and couldn't figure that out, so the third example is a wonderful new resource.

This is the notebook I ended up with: https://github.com/taylorreiter/2022-sra-gather/blob/main/notebooks/20220318_fetch_paginated_data_ebi.ipynb

I really appreciate you making these gists available; they were instrumental in figuring out how to use the API!

@SandyRogers

Brilliant, thanks @taylorreiter for the feedback. I will add some more examples of fetching paginated data to our Jupyter Lab Notebooks as well :)
