@wleepang
Last active November 22, 2022 14:45
read data from cloud object store with biopython
"""
Reads a fastq file directly from the 1000genomes AWS public dataset
into a Bio.SeqRecord set
Requires an AWS Account
"""
from smart_open import open
from Bio import SeqIO
# File-like handle to a ~60MB object in S3; smart_open decompresses
# the .gz stream transparently based on the file extension
with open('s3://1000genomes/phase3/data/NA12878/sequence_read/SRR622461.filt.fastq.gz') as fh:
    for record in SeqIO.parse(fh, 'fastq'):
        print(record.id)
"""
Reads a fastq file directly from the 1000genomes AWS public dataset
into a Bio.SeqRecord set
Does not require an AWS Account
"""
import io
from gzip import GzipFile
import s3fs
from Bio import SeqIO
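# anon=True sends unsigned requests, so no AWS credentials are needed
fs = s3fs.S3FileSystem(anon=True)
with fs.open('1000genomes/phase3/data/NA12878/sequence_read/SRR622461.filt.fastq.gz', 'rb') as f:
    # The object is gzip-compressed bytes: decompress, then decode to text for SeqIO
    for record in SeqIO.parse(io.TextIOWrapper(GzipFile(fileobj=f)), 'fastq'):
        print(record.id)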
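The s3fs variant wraps the binary handle in GzipFile and then io.TextIOWrapper because the object arrives as compressed bytes while the FASTQ parser expects text. A minimal offline sketch of that wrapping follows, assuming an in-memory io.BytesIO as a stand-in for the handle returned by fs.open, two made-up FASTQ records, and a hypothetical read_ids helper that replaces SeqIO.parse with a naive four-line reader so that only the standard library is needed:

```python
import gzip
import io

# Two tiny FASTQ records, made up purely for illustration
FASTQ_TEXT = (
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nTTGA\n+\nIIII\n"
)


def read_ids(binary_handle):
    """Yield record ids from a gzip-compressed FASTQ byte stream.

    Hypothetical helper: a naive stand-in for SeqIO.parse(..., 'fastq').
    """
    # Decompress the bytes, then decode them to text: the same wrapping
    # that is applied to the s3fs handle
    text = io.TextIOWrapper(gzip.GzipFile(fileobj=binary_handle), encoding="ascii")
    for line_number, line in enumerate(text):
        # Assumes the simple four-line layout: header, sequence, '+', quality
        if line_number % 4 == 0:
            yield line[1:].split()[0]


# io.BytesIO stands in for the binary file object returned by fs.open(..., 'rb')
fake_s3_object = io.BytesIO(gzip.compress(FASTQ_TEXT.encode("ascii")))
ids = list(read_ids(fake_s3_object))
print(ids)
```

Running it prints the two record ids without touching the network, which makes it a convenient way to check the decompress-then-decode chain before pointing it at a real bucket.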
@wleepang
Author

wleepang commented Jul 27, 2019

An alternative to smart_open would be to use s3fs. The latter is based on the newer boto3 package.

s3fs allows anonymous reads from public S3 buckets, which makes it possible to test against AWS Public Datasets like 1000genomes.

@peterjc

peterjc commented Jul 27, 2019

Anonymous access sounds great for an example in the main Biopython tutorial 👍

@peterjc

peterjc commented Jul 27, 2019

I wonder if smart_open will be updated to use boto3 and then support anonymous access?

@wleepang
Author

Looks like smart_open does use boto3, but does not provide anonymous access to S3. That feature seems unique to s3fs.
