Below are instructions on how to extract single SNP data from large genetic datasets.
- Genotype data (
*.map
,*.ped
or*.bed
,*.bim
,*.fam
) - Gentoype imputed data (
*.gen
or its binary version:*.bgen
) - Sequencing data (
*.vcf.gz
)
More on genetic data file formats here.
To extract SNPs from genotype data in plink format, you will need to first install plink. This is a command line tool with an excellent online documentation.
As an example, let's assume you want to extract two SNPs: rs1883832
and rs11569323
.
If you have data.ped
and data.map
files, turn these files first into the binary format:
plink --file path-to-data/data --make-bed
If you have data.bed
, data.bim
and data.fam
files, the command is:
plink --noweb --bfile path-to-data/data --recodeA --out path-to-data/extract-data --snps rs1883832,rs11569323
Here, the option recodeA
will calculate the additive allele counts, yielding values of 0, 1, 2.
Either of the commands will extract the two snps from the data files and store it in path-to-data/extract-data.raw
. The column labels reflect the rs-id with the name of the minor allele appended.
If you have a long list of SNPs, you can store them in a text file and provide plink with that textfile.
plink --noweb --bfile path-to-data/data --recodeA --out path-to-data/extract-data --extract snpfile
The snpfile
contains a list of SNPs (one per line, no header).
snpfile
rs1883832
rs11569323
Beside extracting SNPs, plink offers tons of other options. For example extracting data for a subset of individuals or extracting data from a genomic range.
Extracting data from genotype imputed data is slightly more complicated because there are often 22 or 23 datasets, for each chromosome one. So we will need to know the chromosome for each SNP.
As an example, we want to extract data for SNP rs3181108
, a SNP on chromosome 2.
-
Install
qctool
. This software will perform the main tasks. -
If not already named
gen.gz
, copy yourdata_chr2.gz
file of chromosome 2, and rename itdata_chr2.gen.gz
cp data_chr2.gz data_chr2.gen.gz
-
Store text file that contains the rsids and call it
snpfile
(no header, one SNP per line). -
Extract SNP with
qctool
:
qctool_v2 -g data_chr2.gen.gz -og path-to-data/extract-data.bed -ofiletype binary_ped -incl-rsids snpfile
As -ofiletype
there are several options, but here we choose a binary plink format.
- Turn binary plink to ped
plink --bfile path-to-data/extract-data --tab --recode --out path-to-data/extract-data --noweb
- Turn ped file into a raw file (additive mode)
plink --noweb --bfile path-to-data/extract-data --recodeA --out path-to-data/extract-data
This will produce a file path-to-data/extract-data.raw
.
Here is a detailed documentation of how qctool
filters variants.
TBD
Thanks for your great tutorial. When I use the .bed, .fam and .bim files to extract SNPs, I run the following code:
plink --bfile IC22 --recode A --out IC22res --snps rs7290901
However, I get the following error:
Error: Duplicate ID 'rs567226117'.
Upon checking the .bim file, I found that:
22 rs567226117 0 16341242 G A
22 rs567226117 0 16341242 G C
After searching for the SNP on NCBI, it shows that:
rs567226117
A/C/G single-nucleotide variation on chromosome 22
How can I resolve this error? Is there another PLINK code you would recommend?