@alexpreynolds
Last active April 25, 2024 13:26
Create an indexed tabix file from a Pandas dataframe via "pure" Python
#!/usr/bin/env python
'''
Create an indexed tabix file from a Pandas dataframe
via "pure" Python (i.e., no subprocess)
'''
import os
import io
import pandas as pd
import pysam
import bgzip
ds = io.StringIO('''chr1\t842320\t842327
chr1\t842328\t842330
chr1\t842328\t842330
chr1\t855426\t855427
chr1\t855739\t855740''')
df = pd.read_csv(ds, delimiter='\t', header=None)
df.columns = ['chrom', 'start', 'stop']
out_bgz_fn = "test_pd.bed.gz"
# Write each row as a tab-delimited BED line into a BGZF (bgzip) stream
with open(out_bgz_fn, "wb") as out_bgz:
    with bgzip.BGZipWriter(out_bgz) as out_bgz_fh:
        for index, row in df.iterrows():
            out_line = '{}\t{}\t{}\n'.format(row['chrom'], row['start'], row['stop'])
            out_bgz_fh.write(out_line.encode())
if not os.path.exists(out_bgz_fn):
    raise Exception("Error: Could not create bgzip archive")
out_index_fn = "{}.tbi".format(out_bgz_fn)
# Build the tabix index (.tbi) only if one does not already exist
if not os.path.exists(out_index_fn):
    pysam.tabix_index(out_bgz_fn, preset="bed")
if not os.path.exists(out_index_fn):
    raise Exception("Error: Could not create index of bgzip archive")
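A quick sanity check on the archive, offered as a sketch rather than as part of the original gist: a BGZF file is a series of concatenated gzip members, so the standard-library `gzip` module should be able to read it back without pysam or bgzip. The helper name `read_bed_rows` is hypothetical.

```python
import gzip


def read_bed_rows(path):
    """Return the rows of a gzip- or bgzip-compressed BED file as lists of string fields."""
    rows = []
    # gzip.open reads multi-member gzip streams transparently, which covers BGZF
    with gzip.open(path, "rt") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if line:  # skip blank lines
                rows.append(line.split("\t"))
    return rows
```

Assuming the script above has already run, `read_bed_rows("test_pd.bed.gz")` should give back the five rows of the dataframe, with every field as a string.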
@dbolser

dbolser commented Apr 25, 2024

How does this compare (speed-wise) to df.to_csv(..., compression="gzip")?

Obviously I know that a gzip'ed file can't be tabix indexed...

@dbolser

dbolser commented Apr 25, 2024

Wow... the interaction of bgzip and pandas is weird. I find that just importing bgzip means to_csv automatically writes f"{filename}.gz" in bgzip format... like it's hooking into it somehow...
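One way to test an observation like this rather than guess is to inspect the header of the file that was written. The sketch below uses only the standard library and relies on the BGZF layout from the SAM specification: a gzip header with the FEXTRA flag set and an extra subfield tagged `BC` with a 2-byte payload. The function name `is_bgzf` is hypothetical. It checks only the first block, and it says nothing about how or why pandas produced the file.

```python
import struct


def is_bgzf(path):
    """Return True if the file begins with a BGZF block header."""
    with open(path, "rb") as fh:
        header = fh.read(12)
        if len(header) < 12:
            return False
        # Fixed gzip header: ID1, ID2, CM, FLG, MTIME, XFL, OS, XLEN
        id1, id2, cm, flg, _mtime, _xfl, _os, xlen = struct.unpack("<BBBBIBBH", header)
        # Require the gzip magic bytes, deflate compression, and the FEXTRA flag (bit 2)
        if (id1, id2, cm) != (0x1F, 0x8B, 8) or not flg & 4:
            return False
        extra = fh.read(xlen)
    # Walk the extra subfields looking for 'B', 'C' with a 2-byte payload (BSIZE)
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
        if (si1, si2, slen) == (66, 67, 2):
            return True
        pos += 4 + slen
    return False
```

Running `is_bgzf` on a file written by `to_csv` with and without `import bgzip` in the session would show whether the output format really changes.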
