Last active
February 1, 2023 12:27
-
-
Save janxkoci/ca3294ffadd66de29d87335559014b8c to your computer and use it in GitHub Desktop.
get derived allele counts for set of populations (HQ archaics, africans)
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
#!/usr/bin/awk -f | |
BEGIN { | |
OFS = "\t" | |
print "chrom_pos", "ref", "alt", "aa", "da", "Altai", "Denisova3", "Chagyrskaya", "Vindija", "Africa" | |
} | |
$4 == $5 { # chimp matches gorilla | |
## concatenate African genotypes to a string (columns 10..NF) | |
afr = "" | |
for(i = 10; i <= NF; i++) { | |
afr = afr$i | |
} | |
## count derived alleles per pop | |
## gsub() returns number of replacements in a string | |
if ($4 ~ /\./) { # chimp no data | |
next | |
} else if ($4 ~ /0/) { # chimp has REF | |
aa = $2 # REF | |
da = $3 # ALT | |
altai = gsub(1, "", $6) | |
denis = gsub(1, "", $7) | |
chagyr = gsub(1, "", $8) | |
vindija = gsub(1, "", $9) | |
africa = gsub(1, "", afr) | |
} else if ($4 ~ /1/) { # chimp has ALT | |
aa = $3 # ALT | |
da = $2 # REF | |
altai = gsub(0, "", $6) | |
denis = gsub(0, "", $7) | |
chagyr = gsub(0, "", $8) | |
vindija = gsub(0, "", $9) | |
africa = gsub(0, "", afr) | |
} | |
print $1,$2,$3, aa, da, altai, denis, chagyr, vindija, africa #, afr | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Future enhancements
A more robust way to approach this problem would be to provide multiple files to the script, one file per population (with the first one being outgroup, since awk has a simple idiom for treating the first input file differently, i.e. the
FNR==NR
pattern).Alternatively, the program could read a popfile defining populations in the data file. This is probably easier to implement, as most of the current logic can stay the same.
Update
The updated script with popfile support is now available in my new repo PopGen.awk.