@wleepang
Last active November 22, 2022 14:45
read data from cloud object store with biopython
"""
Reads a fastq file directly from the 1000genomes AWS public dataset
into a Bio.SeqRecord set
Requires an AWS Account
"""
from smart_open import open
from Bio import SeqIO
# file handle-like reference to ~60MB object in S3;
# smart_open decompresses the .gz transparently based on the file extension
with open('s3://1000genomes/phase3/data/NA12878/sequence_read/SRR622461.filt.fastq.gz') as fh:
    for record in SeqIO.parse(fh, 'fastq'):
        print(record.id)
"""
Reads a fastq file directly from the 1000genomes AWS public dataset
into a Bio.SeqRecord set
Does not require an AWS Account
"""
import io
from gzip import GzipFile
import s3fs
from Bio import SeqIO
fs = s3fs.S3FileSystem(anon=True)
with fs.open('1000genomes/phase3/data/NA12878/sequence_read/SRR622461.filt.fastq.gz', 'rb') as f:
    # s3fs returns raw bytes: decompress with GzipFile, then wrap as text for SeqIO
    for record in SeqIO.parse(io.TextIOWrapper(GzipFile(fileobj=f)), 'fastq'):
        print(record.id)
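The decompress-then-decode wrapping used in the second snippet can be tried without S3 or Biopython. The following is a minimal sketch, not part of the original gist: `iter_fastq_ids` is a hypothetical helper that stands in for `SeqIO.parse`, the two reads are made-up data, and it assumes strict four-line FASTQ records.

```python
import gzip
import io


def iter_fastq_ids(binary_stream):
    """Yield record IDs from a gzip-compressed FASTQ byte stream (hypothetical helper)."""
    # GzipFile decompresses the bytes; TextIOWrapper decodes them into text lines
    text = io.TextIOWrapper(gzip.GzipFile(fileobj=binary_stream), encoding='ascii')
    for line_number, line in enumerate(text):
        # assumption: strict 4-line records (header, sequence, '+', quality)
        if line_number % 4 == 0:
            # drop the leading '@' and keep the first whitespace-delimited token
            yield line[1:].split()[0]


# two made-up reads, compressed in memory to stand in for the S3 object
raw = b"@read1 desc\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n"
payload = io.BytesIO(gzip.compress(raw))
ids = list(iter_fastq_ids(payload))
print(ids)
```

Any binary file-like object, such as the handle returned by `fs.open(..., 'rb')`, could be passed in place of the `BytesIO` buffer.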
@wleepang
Author

Looks like smart_open does use boto3, but does not provide anonymous access to S3. That feature seems unique to s3fs.
