Last active
February 1, 2023 12:27
-
-
Save janxkoci/ca3294ffadd66de29d87335559014b8c to your computer and use it in GitHub Desktop.
get derived allele counts for set of populations (HQ archaics, africans)
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
#!/usr/bin/awk -f | |
BEGIN { | |
OFS = "\t" | |
print "chrom_pos", "ref", "alt", "aa", "da", "Altai", "Denisova3", "Chagyrskaya", "Vindija", "Africa" | |
} | |
$4 == $5 { # chimp matches gorilla | |
## concatenate African genotypes to a string (columns 10..NF) | |
afr = "" | |
for(i = 10; i <= NF; i++) { | |
afr = afr$i | |
} | |
## count derived alleles per pop | |
## gsub() returns number of replacements in a string | |
if ($4 ~ /\./) { # chimp no data | |
next | |
} else if ($4 ~ /0/) { # chimp has REF | |
aa = $2 # REF | |
da = $3 # ALT | |
altai = gsub(1, "", $6) | |
denis = gsub(1, "", $7) | |
chagyr = gsub(1, "", $8) | |
vindija = gsub(1, "", $9) | |
africa = gsub(1, "", afr) | |
} else if ($4 ~ /1/) { # chimp has ALT | |
aa = $3 # ALT | |
da = $2 # REF | |
altai = gsub(0, "", $6) | |
denis = gsub(0, "", $7) | |
chagyr = gsub(0, "", $8) | |
vindija = gsub(0, "", $9) | |
africa = gsub(0, "", afr) | |
} | |
print $1,$2,$3, aa, da, altai, denis, chagyr, vindija, africa #, afr | |
} |
Future enhancements
A more robust way to approach this problem would be to provide multiple files to the script, one file per population (with the first one being outgroup, since awk has a simple idiom for treating the first input file differently, i.e. the FNR==NR
pattern).
Alternatively, the program could read a popfile defining populations in the data file. This is probably easier to implement, as most of the current logic can stay the same.
Update
The updated script with popfile support is now available in my new repo PopGen.awk.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
README
The script first filters to sites where chimp and gorilla have a matching genotype. Then it finds the derived allele at each of these sites and counts derived alleles across populations.
N.B.: The script was prepared to process particular datasets and is not very general (i.e. there are hardcoded names and positions of populations), however it should be easy to modify for other datasets.
Input
The expected input is as follows:
0/1
)Having a list of samples (
samples_arch_yoruba.txt
, created with e.g.bcftools query -l
andgrep
), this input can be produced from a VCF/BCF file withbcftools query
:The resulting TSV table will have a header with sample names included. This TSV table is then processed by the above
awk
script:Run script
The script takes the input table and produces a new one with per-population derived allele counts:
awk -f derived.awk table.tsv > table_dac.tsv
Positions with any missing data can be filtered out with
grep
:Output postprocessing
The file can then be loaded into
R
to produce site pattern counts:To filter only positions with G/C or A/T polymorphisms, we can use the following:
Finally, we can merge the two tables: