@wleepang
Last active November 22, 2022 14:45
Read data from a cloud object store with Biopython
"""
Reads a FASTQ file directly from the 1000genomes AWS public dataset
as a stream of Bio.SeqRecord objects
Requires an AWS account (credentials available to boto3)
"""
from smart_open import open
from Bio import SeqIO

# File-like handle to a ~60 MB object in S3.
# smart_open decompresses the .gz transparently based on the file extension.
with open('s3://1000genomes/phase3/data/NA12878/sequence_read/SRR622461.filt.fastq.gz') as fh:
    for record in SeqIO.parse(fh, 'fastq'):
        print(record.id)

"""
Reads a FASTQ file directly from the 1000genomes AWS public dataset
as a stream of Bio.SeqRecord objects
Does not require an AWS account
"""
import io
from gzip import GzipFile

import s3fs
from Bio import SeqIO

# anon=True makes s3fs access the bucket anonymously, so no credentials are needed
fs = s3fs.S3FileSystem(anon=True)
with fs.open('1000genomes/phase3/data/NA12878/sequence_read/SRR622461.filt.fastq.gz', 'rb') as f:
    # Decompress the binary stream, then wrap it as text for SeqIO
    for record in SeqIO.parse(io.TextIOWrapper(GzipFile(fileobj=f)), 'fastq'):
        print(record.id)
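
The second snippet layers GzipFile and io.TextIOWrapper over a binary handle so the parser receives decompressed text. A minimal sketch of that layering, runnable offline with only the standard library, might look like the following. Everything here is an assumption made for illustration: the in-memory payload stands in for the S3 object, and `fastq_ids` is a hypothetical helper that replaces `SeqIO.parse` and only handles strict four-line FASTQ records.

```python
import gzip
import io


def fastq_ids(binary_handle):
    """Yield record ids from a gzip-compressed FASTQ stream (hypothetical helper)."""
    # GzipFile decompresses the raw bytes; TextIOWrapper decodes them into str lines.
    text = io.TextIOWrapper(gzip.GzipFile(fileobj=binary_handle), encoding="ascii")
    for line_number, line in enumerate(text):
        # Assumes strict four-line records, so every 4th line is a header.
        if line_number % 4 == 0:
            # A header looks like "@id optional description"; keep only the id.
            yield line[1:].split()[0]


# Made-up two-record FASTQ payload standing in for the object fetched from S3.
raw = b"@read1 first\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n"
compressed = io.BytesIO(gzip.compress(raw))

ids = list(fastq_ids(compressed))
print(ids)
```

Any binary file-like object can take the place of `compressed`, including the handle returned by `fs.open(..., 'rb')` in the snippet above. For real data, `SeqIO.parse` remains the better choice because it validates records and exposes sequences and qualities.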
@peterjc

peterjc commented Jul 27, 2019

I wonder if smart_open will be updated to use boto3 and then support anonymous access?

@wleepang
Author

Looks like smart_open does use boto3, but does not provide anonymous access to S3. That feature seems unique to s3fs.
