@dceoy
Last active January 23, 2024 12:42
[Python] Read VCF (variant call format) as pandas.DataFrame
#!/usr/bin/env python
import io

import pandas as pd


def read_vcf(path):
    """Read a VCF file into a pandas.DataFrame, skipping the '##' meta lines."""
    with open(path, 'r') as f:
        # Keep the '#CHROM' header line and the records; drop '##' metadata.
        lines = [l for l in f if not l.startswith('##')]
    return pd.read_csv(
        io.StringIO(''.join(lines)),
        dtype={'#CHROM': str, 'POS': int, 'ID': str, 'REF': str, 'ALT': str,
               'QUAL': str, 'FILTER': str, 'INFO': str},
        sep='\t'
    ).rename(columns={'#CHROM': 'CHROM'})
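
As posted, the gist only defines read_vcf, so running the file on its own prints nothing. The sketch below shows one way to call the function and inspect the result. The two-record VCF text and the file name example.vcf are made up for illustration; in practice you would pass the path to your own file (for example sys.argv[1]).

```python
import io
import os
import tempfile

import pandas as pd


def read_vcf(path):
    """Same logic as the gist: drop '##' meta lines, parse the rest as TSV."""
    with open(path, 'r') as f:
        lines = [l for l in f if not l.startswith('##')]
    return pd.read_csv(
        io.StringIO(''.join(lines)),
        dtype={'#CHROM': str, 'POS': int, 'ID': str, 'REF': str, 'ALT': str,
               'QUAL': str, 'FILTER': str, 'INFO': str},
        sep='\t'
    ).rename(columns={'#CHROM': 'CHROM'})


# Made-up VCF content, only so the example runs without an input file.
EXAMPLE_VCF = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t100\t.\tG\tA\t.\tPASS\tDP=10\n"
    "chr2\t101\trs1\tT\tC\t50\tPASS\tDP=12;AF=0.5\n"
)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'example.vcf')  # hypothetical file name
    with open(path, 'w') as f:
        f.write(EXAMPLE_VCF)
    df = read_vcf(path)  # replace `path` with your own VCF path

# The function returns a DataFrame; print it (or part of it) to see output.
print(df)
print(df.dtypes)
```

Samples and FORMAT columns, if present in the file, are not listed in dtype and are left to pandas' type inference.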
@YoavEtzioni

Nice. Very useful.

@dharbi

dharbi commented Nov 19, 2018

Really convenient!

@hansonglee

Oh thank you

@pdorsaint

Hi,

Thank you so much for this script! I am trying to run it on a VCF file.
Do you run the script like this: "python read_vcf.py vcf_filename"?

Thanks!

@dceoy
Author

dceoy commented Jul 13, 2019

I developed the pdbio package. Please use it, @pdorsaint.

https://github.com/dceoy/pdbio

This package is a pandas-based data handling tool that can also be used from the command line.

Example of VCF data handling:

$ pdbio vcf2csv --tsv ./test/example.vcf

@DouglasAbrams

DouglasAbrams commented May 7, 2020

Another way to do this, which uses all fields of any VCF, is with PyVCF: https://pyvcf.readthedocs.io/en/v0.4.6/INTRO.html

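import pandas as pd
import vcf  # PyVCF

def read(f):
    # Build one row per record, closing the file once all records are read.
    with open(f) as fh:
        df = pd.DataFrame([vars(r) for r in vcf.Reader(fh)])
    # Expand the per-record INFO dicts into their own columns.
    out = df.merge(pd.DataFrame(df.INFO.tolist()),
                   left_index=True, right_index=True)
    return out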

Then run read(your_vcf).
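
If PyVCF is not installed, the INFO column can also be expanded with plain Python. This is only a minimal sketch: it assumes INFO is a semicolon-separated list of key=value pairs, with bare keys acting as flags and "." meaning no data, and parse_info is a hypothetical helper name, not part of any library. Values are kept as strings.

```python
def parse_info(info):
    """Turn an INFO string such as 'DP=12;AF=0.5;DB' into a dict."""
    if info in ('.', ''):
        return {}
    fields = {}
    for item in info.split(';'):
        key, sep, value = item.partition('=')
        # A key without '=' is a flag; store it as True.
        fields[key] = value if sep else True
    return fields


# Made-up INFO strings for illustration.
records = [parse_info(s) for s in ['DP=10', 'DP=12;AF=0.5;DB', '.']]
print(records)
```

With the DataFrame from read_vcf in the gist, pd.DataFrame(df['INFO'].map(parse_info).tolist()) should give one column per INFO key, which can then be merged back on the index as in the PyVCF version.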

@sbslee

sbslee commented May 4, 2021

If anyone's interested, I was looking for a way to do this too and ended up writing the pyvcf submodule:

A quick example of pyvcf.VcfFrame:

from fuc import pyvcf  # assumption: this pyvcf is the submodule of the fuc package, not PyVCF

data = {
    'CHROM': ['chr1', 'chr2'],
    'POS': [100, 101],
    'ID': ['.', '.'],
    'REF': ['G', 'T'],
    'ALT': ['A', 'C'],
    'QUAL': ['.', '.'],
    'FILTER': ['.', '.'],
    'INFO': ['.', '.'],
    'FORMAT': ['GT', 'GT'],
    'Steven': ['0/1', '1/1']
}
vf = pyvcf.VcfFrame.from_dict([], data)
vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   G   A    .      .    .     GT    0/1
1  chr2  101  .   T   C    .      .    .     GT    1/1

To read a VCF file into VcfFrame:

vf = pyvcf.VcfFrame.from_file('example.vcf')

@Vicbuz

Vicbuz commented Jun 21, 2021

This was so so useful. Thank you very much @dceoy

@upendrak

It works great. Thanks

@Mohammed-Alfayyadh

Hi,
Did you find a solution to getting no output after running the Python script? I am facing the same issue.

@SciNanda

SciNanda commented Nov 7, 2022

This was all I needed for now. Thank you very much!! :)

@NajlaAbassi

That was indeed useful! Thank you very much!!
