Skip to content

Instantly share code, notes, and snippets.

@janxkoci
Last active February 1, 2023 12:27
Show Gist options
  • Save janxkoci/ca3294ffadd66de29d87335559014b8c to your computer and use it in GitHub Desktop.
Save janxkoci/ca3294ffadd66de29d87335559014b8c to your computer and use it in GitHub Desktop.
get derived allele counts for set of populations (HQ archaics, africans)
#!/usr/bin/awk -f
BEGIN {
OFS = "\t"
print "chrom_pos", "ref", "alt", "aa", "da", "Altai", "Denisova3", "Chagyrskaya", "Vindija", "Africa"
}
$4 == $5 { # chimp matches gorilla
## concatenate African genotypes to a string (columns 10..NF)
afr = ""
for(i = 10; i <= NF; i++) {
afr = afr$i
}
## count derived alleles per pop
## gsub() returns number of replacements in a string
if ($4 ~ /\./) { # chimp no data
next
} else if ($4 ~ /0/) { # chimp has REF
aa = $2 # REF
da = $3 # ALT
altai = gsub(1, "", $6)
denis = gsub(1, "", $7)
chagyr = gsub(1, "", $8)
vindija = gsub(1, "", $9)
africa = gsub(1, "", afr)
} else if ($4 ~ /1/) { # chimp has ALT
aa = $3 # ALT
da = $2 # REF
altai = gsub(0, "", $6)
denis = gsub(0, "", $7)
chagyr = gsub(0, "", $8)
vindija = gsub(0, "", $9)
africa = gsub(0, "", afr)
}
print $1,$2,$3, aa, da, altai, denis, chagyr, vindija, africa #, afr
}
@janxkoci
Copy link
Author

janxkoci commented Oct 25, 2022

Future enhancements

A more robust way to approach this problem would be to provide multiple files to the script, one file per population (with the first one being outgroup, since awk has a simple idiom for treating the first input file differently, i.e. the FNR==NR pattern).

Alternatively, the program could read a popfile defining populations in the data file. This is probably easier to implement, as most of the current logic can stay the same.

Update

The updated script with popfile support is now available in my new repo PopGen.awk.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment