Last active
May 6, 2022 15:01
-
-
Save janxkoci/5fca0cf87a20f168eaeadd78c921e41f to your computer and use it in GitHub Desktop.
scripts for plink clustering (MDS and PCA) using either plink or VCF formats as input
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
#!/bin/bash | |
## PLINK CLUSTERING (plink1.9) | |
## takes input (any path) and creates clustering reports in working dir (to not mix data and results) | |
## OUTPUT PREFIX | |
prefix=$1 | |
## MDS DIMENSIONS (optional, default = 2) | |
dims=${2:-2} | |
## LD PRUNNING | |
## https://www.cog-genomics.org/plink/1.9/ld#indep | |
plink --indep 50 5 2 --bfile $prefix --out $prefix | |
## DISTANCE | |
## https://www.cog-genomics.org/plink/1.9/ibd | |
## https://www.cog-genomics.org/plink/1.9/distance | |
plink --bfile $prefix --genome --out $prefix --extract ${prefix}.prune.in --recode # --distance --distance-matrix | |
## CLUSTERING | |
## https://www.cog-genomics.org/plink/1.9/strat | |
## MDS | |
plink --file $prefix --read-genome ${prefix}.genome --cluster --mds-plot $dims --out ${prefix}_${dims} # --read-dists ${prefix}.dists | |
## PCA | |
plink --file $prefix --read-genome ${prefix}.genome --cluster --pca --out ${prefix} # --read-dists ${prefix}.dists |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
#!/bin/bash | |
## PLINK CLUSTERING (plink1.9) | |
## takes input (any path) and creates clustering reports in working dir (to not mix data and results) | |
## OUTPUT PREFIX | |
prefix=$(basename -s .gz $1) # remove .gz if present | |
prefix=$(basename -s .vcf $prefix) # remove also .vcf | |
## MDS DIMENSIONS (optional, default = 2) | |
dims=${2:-2} | |
## LD PRUNNING | |
## https://www.cog-genomics.org/plink/1.9/ld#indep | |
plink --indep 50 5 2 --vcf $1 --out $prefix | |
## DISTANCE | |
## https://www.cog-genomics.org/plink/1.9/ibd | |
## https://www.cog-genomics.org/plink/1.9/distance | |
plink --vcf $1 --genome --out $prefix --extract ${prefix}.prune.in --recode # --distance --distance-matrix | |
## CLUSTERING | |
## https://www.cog-genomics.org/plink/1.9/strat | |
## MDS | |
plink --file $prefix --read-genome ${prefix}.genome --cluster --mds-plot $dims --out ${prefix}_${dims} # --read-dists ${prefix}.dists | |
## PCA | |
plink --file $prefix --read-genome ${prefix}.genome --cluster --pca --out ${prefix} # --read-dists ${prefix}.dists |
These scripts are probably fine for a one-off analysis of samples. But after working with them today, rerunning stuff a few times, I realized there are several rather dumb decisions and the scripts can be somewhat improved. Things like:
- Doing
--pca
and--mds-plot
in the same command. They really don't need to be run separately. - Not using a text file set with
--recode
for the prunned data. This comes from a tutorial I've found back in the day and I guess it was there to solve a name collision, as using the same--bfile
would cause it, but using--file
can avoid it. Or, you know, just make a new name for the prunned dataset, so you can keep using--bfile
(i.e. the smaller and more efficient format). - Using
--genome
and--read-genome
can be completelly ommited. Plink will do it automatically with--cluster
(or whatever different option if feels is better atm). Also this way it can be removed from the prunning step, so all clustering can be done within one step. I may add the options again later (only to the last clustering steps) to make rerunning the scripts more efficient, but that will need some detection of files to work. - Some log files are getting overwritten by the following commands, because they use the same
--out
names. This should be easy to fix, but needs a little thought for rerunning (i.e. when some steps may get skipped).
There may be some other improvements, probably more involved, like detecting if prunning was already done and skipping the step if so, to make rerunning easier, but that will be for another time.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
TODO
git clone
) with links and explanation of different options (e.g. distances)