Skip to content

Instantly share code, notes, and snippets.

@janxkoci
Last active February 1, 2023 12:27
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save janxkoci/ca3294ffadd66de29d87335559014b8c to your computer and use it in GitHub Desktop.
Save janxkoci/ca3294ffadd66de29d87335559014b8c to your computer and use it in GitHub Desktop.
get derived allele counts for set of populations (HQ archaics, africans)
#!/usr/bin/awk -f
BEGIN {
OFS = "\t"
print "chrom_pos", "ref", "alt", "aa", "da", "Altai", "Denisova3", "Chagyrskaya", "Vindija", "Africa"
}
$4 == $5 { # chimp matches gorilla
## concatenate African genotypes to a string (columns 10..NF)
afr = ""
for(i = 10; i <= NF; i++) {
afr = afr$i
}
## count derived alleles per pop
## gsub() returns number of replacements in a string
if ($4 ~ /\./) { # chimp no data
next
} else if ($4 ~ /0/) { # chimp has REF
aa = $2 # REF
da = $3 # ALT
altai = gsub(1, "", $6)
denis = gsub(1, "", $7)
chagyr = gsub(1, "", $8)
vindija = gsub(1, "", $9)
africa = gsub(1, "", afr)
} else if ($4 ~ /1/) { # chimp has ALT
aa = $3 # ALT
da = $2 # REF
altai = gsub(0, "", $6)
denis = gsub(0, "", $7)
chagyr = gsub(0, "", $8)
vindija = gsub(0, "", $9)
africa = gsub(0, "", afr)
}
print $1,$2,$3, aa, da, altai, denis, chagyr, vindija, africa #, afr
}
@janxkoci
Copy link
Author

janxkoci commented Oct 14, 2022

README

The script first filters to sites where chimp and gorilla have a matching genotype. Then it finds the derived allele at each of these sites and counts derived alleles across populations.

N.B.: The script was prepared to process particular datasets and is not very general (i.e. there are hardcoded names and positions of populations), however it should be easy to modify for other datasets.

Input

The expected input is as follows:

  • 3 columns with position info: CHROM_POS, REF, ALT
  • 2 columns for apes' genotypes (numeric format, e.g. 0/1)
  • 4 columns for high-coverage archaics' genotypes
  • variable number of columns for some modern population (starting at column 10)

Having a list of samples (samples_arch_yoruba.txt, created with e.g. bcftools query -l and grep), this input can be produced from a VCF/BCF file with bcftools query:

VCF=input.vcf
SAMPLES=samples_arch_yoruba.txt

(echo CHROM_POS REF ALT $(cat $SAMPLES) | tr " " "\t";\
bcftools query -S $SAMPLES -f '%CHROM_%POS\t%REF\t%ALT[\t%GT]\n' $VCF) > $(basename -s .vcf $VCF).tsv

The resulting TSV table will have a header with sample names included. This TSV table is then processed by the above awk script:

Run script

The script takes the input table and produces a new one with per-population derived allele counts:

awk -f derived.awk table.tsv > table_dac.tsv

Positions with any missing data can be filtered out with grep:

grep -vF './.' table.tsv | awk -f derived.awk > table_dac.tsv

Output postprocessing

The file can then be loaded into R to produce site pattern counts:

library(tidyverse)

snps <- read_tsv("table_dac.tsv")

full <- snps %>% 
    group_by(Altai,Denisova3,Chagyrskaya,Vindija,Africa) %>% 
    summarise(count = n())

To filter only positions with G/C or A/T polymorphisms, we can use the following:

filtered <- snps %>% 
    mutate(aa = toupper(aa), da = toupper(da), gt = paste0(aa, da)) %>% 
    subset(grepl("AT|TA|CG|GC", .$gt)) %>% 
    group_by(Altai,Denisova3,Chagyrskaya,Vindija,Africa) %>% 
    summarise(at_gc = n())

Finally, we can merge the two tables:

final <- merge(full, filtered, all.x = TRUE)
write_tsv(final, "dac_yoruba_final.tsv")

@janxkoci
Copy link
Author

janxkoci commented Oct 25, 2022

Future enhancements

A more robust way to approach this problem would be to provide multiple files to the script, one file per population (with the first one being outgroup, since awk has a simple idiom for treating the first input file differently, i.e. the FNR==NR pattern).

Alternatively, the program could read a popfile defining populations in the data file. This is probably easier to implement, as most of the current logic can stay the same.

Update

The updated script with popfile support is now available in my new repo PopGen.awk.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment