Skip to content

Instantly share code, notes, and snippets.

Embed
What would you like to do?
Melt VCF files into a reasonable tab delimited format, one genotype per line
#!/usr/bin/env python
""" Melt a VCF file into a tab delimited set of calls, one per line
VCF files have all the calls from different samples on one line. This
script reads vcf on stdin and writes all calls to stdout in tab delimited
format with one call in one sample per line. This makes it easy to find
a given sample's genotype with, say, grep.
"""
import sys
import csv
out = csv.writer(sys.stdout, delimiter='\t')
fixed = None
samples = None
format = None
def parse_samples(line):
toks = line[1:].split('\t')
fixed, samples = toks[:9], toks[9:]
return fixed, samples
def parse_format(line):
return line.split('\t')[8].split(':')
def parse_variant(line):
return line.split('\t')[:8]
def parse_calls(line):
for call in line.split('\t')[9:]:
yield call.split(':')
for line in sys.stdin:
line = line.rstrip()
if line.startswith('#'):
if not line.startswith('##'):
fixed, samples = parse_samples(line)
else:
if not format:
format = parse_format(line)
out.writerow(['SAMPLE'] + format + fixed)
variant = parse_variant(line)
for sample, call in zip(samples, parse_calls(line)):
while len(call) < len(format):
call.append('.')
out.writerow([sample] + call + variant)
@hbeale

This comment has been minimized.

Copy link

@hbeale hbeale commented Sep 4, 2012

This was fantastically useful. Thanks!

@hbeale

This comment has been minimized.

Copy link

@hbeale hbeale commented Sep 5, 2012

It turned out not to work on the full length file. I think the problem had to do with line breaks; there were only line breaks between each variant in the vcf file, but the output contained lines that didn't start with reference IDs (i.e. chr1).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment