@andrewyatz
Last active January 22, 2020 16:32
A possible cave entrance for refget. Want to move into a more central place

Refget: access to reference sequences

What is refget?

Refget is a specification that defines a standard way to access reference sequences using an identifier derived from the sequence itself. It is a fundamental building block of GA4GH, giving other GA4GH standards a way to access sequences and to identify them unambiguously.

How refget works

How do I use refget?

Getting sequence

Refget is an HTTP-based standard, so all you need to access a refget service is an HTTP library and a known identifier. The following Python code retrieves the first 10 bases of Saccharomyces cerevisiae chromosome I (MD5 identifier 6681ac2f62509cfc220d78751b8dc524).

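import requests

# Identifier for S. cerevisiae chromosome I
url = 'https://refget.herokuapp.com/sequence/{}'.format('6681ac2f62509cfc220d78751b8dc524')
# Ask for plain text and limit the response to the first 10 bases
r = requests.get(url, headers={'Accept':'text/plain'}, params={'start':0, 'end':10})
print(r.text)
# Output: CCACACCACA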

Omit the start and end parameters to retrieve the entire sequence.
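
The effect of including or omitting the start and end parameters can be sketched without contacting a server. The helper below is hypothetical (sequence_url is not part of any refget library) and only builds the request URL, assuming the same base URL and query parameter names as the example above.

```python
from urllib.parse import urlencode

# Base URL taken from the example above; any refget server could be substituted.
BASE_URL = 'https://refget.herokuapp.com'


def sequence_url(identifier, start=None, end=None, base_url=BASE_URL):
    """Build a refget sequence URL (hypothetical helper for illustration)."""
    url = '{}/sequence/{}'.format(base_url.rstrip('/'), identifier)
    # Only include the parameters the caller actually supplied
    params = {}
    if start is not None:
        params['start'] = start
    if end is not None:
        params['end'] = end
    if params:
        url = '{}?{}'.format(url, urlencode(params))
    return url


yeast_chr_i = '6681ac2f62509cfc220d78751b8dc524'
print(sequence_url(yeast_chr_i))                   # URL for the entire sequence
print(sequence_url(yeast_chr_i, start=0, end=10))  # URL for the first 10 bases
```

The resulting URL can be passed to requests.get with the same Accept header as before.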

Getting sequence metadata

Append /metadata to the sequence URL to retrieve the sequence's length, checksums and aliases as JSON.

import requests

# Same identifier as above, with /metadata appended to the path
url = 'https://refget.herokuapp.com/sequence/{}/metadata'.format('6681ac2f62509cfc220d78751b8dc524')
# Metadata is returned as JSON
r = requests.get(url, headers={'Accept':'application/json'})
print(r.json())
# Output:
# {'metadata': {'aliases': [{'alias': 'I', 'naming_authority': 'unknown'}], 'length': 230218, 'md5': '6681ac2f62509cfc220d78751b8dc524', 'trunc512': '959cb1883fc1ca9ae1394ceb475a356ead1ecceff5824ae7'}}
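
As a rough illustration of working with this response, the sketch below pulls the commonly needed fields out of the metadata document. The payload is copied from the example response above, and summarise_metadata is a hypothetical helper name, not part of any refget client.

```python
# Payload copied from the example metadata response above
payload = {
    'metadata': {
        'aliases': [{'alias': 'I', 'naming_authority': 'unknown'}],
        'length': 230218,
        'md5': '6681ac2f62509cfc220d78751b8dc524',
        'trunc512': '959cb1883fc1ca9ae1394ceb475a356ead1ecceff5824ae7',
    }
}


def summarise_metadata(payload):
    """Collect length, identifiers and alias names (hypothetical helper)."""
    meta = payload['metadata']
    return {
        'length': meta['length'],
        # Either identifier can be used to request the same sequence
        'identifiers': [meta['md5'], meta['trunc512']],
        'names': [entry['alias'] for entry in meta['aliases']],
    }


print(summarise_metadata(payload))
```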

Essential refget information

Specifications

Compliance

Running implementations

What does an identifier derived from the sequence mean?

Sequences such as reference genomes have a multitude of names. For example, chromosome 1 from the latest build of the human genome (GRCh38) can be known as chr1, 1, CM000663.2 or NC_000001.11, depending on where you accessed your sequence from. Refget instead uses a cryptographic hash function to create an identifier from the sequence content: the As, Cs, Gs and Ts of a chromosome are passed through the MD5 or SHA-512 algorithm, and the resulting digest, written as a hexadecimal string, becomes the identifier. Using MD5, chromosome 1 can now be referred to as 6aef897c3d6ff0c78aff06ac189178dd.
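
A minimal sketch of deriving such identifiers with Python's hashlib follows. It assumes the sequence is uppercased before digesting and that the TRUNC512 identifier is the first 24 bytes of the SHA-512 digest written as hex, which matches the 48-character trunc512 value in the metadata example; the refget specification remains the authority on the exact normalisation rules. The function name refget_identifiers is hypothetical.

```python
import hashlib


def refget_identifiers(sequence):
    """Return (md5, trunc512) hex identifiers for a sequence string."""
    # Assumption: identifiers are computed over the uppercased sequence
    data = sequence.upper().encode('ascii')
    md5_id = hashlib.md5(data).hexdigest()
    # Assumption: TRUNC512 keeps the first 24 bytes of the SHA-512 digest
    trunc512_id = hashlib.sha512(data).digest()[:24].hex()
    return md5_id, trunc512_id


# Toy sequence for illustration; a real chromosome would be read from a FASTA file
md5_id, trunc512_id = refget_identifiers('ACGT')
print(md5_id)
print(trunc512_id)
```

Because the identifier depends only on the sequence content, two copies of the same chromosome obtained under different names produce the same identifier.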

How to get involved

You can contribute changes to the hts-specs GitHub repository. If you want to be more involved, we host regular conference calls.
