@darencard
Created January 12, 2017 17:14
Filter away samples from a VCF/BCF that have high amounts of missing data

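Replace <<INPUT>>, <<OUTPUT>>, and <<PROP>> with the input file name, the output file name, and the proportion of missing data above which samples are excluded, respectively. For example, 0.75 means that samples with more than 75% missing data are filtered away. Only genotypes written exactly as ./. are counted as missing. The output is a bgzip-compressed VCF. Requires bcftools v. 1.2+, bgzip, and a shell that supports process substitution, such as bash.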

bcftools view -S ^<(paste <(bcftools query -f '[%SAMPLE\t]\n' <<INPUT>> | head -1 | tr '\t' '\n') <(bcftools query -f '[%GT\t]\n' <<INPUT>> | awk '{n=NF; for (i=1;i<=NF;i++) if ($i == "./.") sum[i]+=1 } END {for (i=1;i<=n;i++) print sum[i] / NR }') | awk '{ if ($2 > <<PROP>>) print $1 }') <<INPUT>> | bgzip > <<OUTPUT>>
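
The counting step in the middle of the one-liner can be tried on its own, without bcftools or a real VCF. The sketch below is only an illustration under stated assumptions: high_missing_samples and toy_table are hypothetical helper names, and the toy input puts the sample names on the first row rather than pasting them in from a second query as the command above does. The genotype rows mimic the tab-terminated layout that '[%GT\t]\n' produces.

```shell
# Hypothetical helper, not part of bcftools.
# stdin: line 1 holds tab-separated sample names; every later line is one
#        site with one genotype per sample.
# $1:    proportion of missing data above which a sample is reported.
high_missing_samples() {
  awk -v prop="$1" '
    NR == 1 { n = NF; for (i = 1; i <= n; i++) name[i] = $i; next }
    { sites++; for (i = 1; i <= n; i++) if ($i == "./.") miss[i]++ }
    END {
      if (sites == 0) exit
      for (i = 1; i <= n; i++)
        if (miss[i] / sites > prop) print name[i]
    }'
}

# Toy data (an assumption for illustration): 3 samples, 4 sites.
# s1 has 0/4 missing, s2 has 3/4 missing, s3 has 2/4 missing.
toy_table() {
  printf 's1\ts2\ts3\t\n'
  printf '0/0\t./.\t./.\t\n'
  printf '0/1\t./.\t1/1\t\n'
  printf '0/0\t./.\t./.\t\n'
  printf '1/1\t0/0\t0/1\t\n'
}

toy_table | high_missing_samples 0.5   # prints: s2
```

Because the comparison is strictly greater than, s3 at exactly 0.5 is kept when the threshold is 0.5, and s1 is handled correctly even though it has no missing genotypes at all.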