sinarueeger/extract-genetic-data.md

## extract-genetic-data.md

      
    Raw
  

              extract-genetic-data.md
            
          
    Extract genetic data for a subset of SNPs

Below are instructions on how to extract single SNP data from large genetic datasets.

Genotype data (*.map, *.ped or *.bed, *.bim,*.fam)
Gentoype imputed data (*.gen or its binary version: *.bgen)
Sequencing data (*.vcf.gz)

More on genetic data file formats here.
Genotype data

To extract SNPs from genotype data in plink format, you will need to first install plink. This is a command line tool with an excellent online documentation.
Small set of SNPs

As an example, let's assume you want to extract two SNPs: rs1883832 and rs11569323.
If you have data.ped and data.map files, turn these files first into the binary format:
plink --file path-to-data/data --make-bed
If you have data.bed, data.bim and data.fam files, the command is:
plink --noweb --bfile path-to-data/data --recodeA --out path-to-data/extract-data --snps rs1883832,rs11569323
Here, the option recodeA will calculate the additive allele counts, yielding values of 0, 1, 2.
Either of the commands will extract the two snps from the data files and store it in path-to-data/extract-data.raw. The column labels reflect the rs-id with the name of the minor allele appended.
Many SNPs

If you have a long list of SNPs, you can store them in a text file and provide plink with that textfile.
plink --noweb --bfile path-to-data/data --recodeA --out path-to-data/extract-data --extract snpfile
The snpfile contains a list of SNPs (one per line, no header).
snpfile
rs1883832
rs11569323

Documentation

Beside extracting SNPs, plink offers tons of other options. For example extracting data for a subset of individuals or extracting data from a genomic range.
Genotype imputed data

Extracting data from genotype imputed data is slightly more complicated because there are often 22 or 23 datasets, for each chromosome one. So we will need to know the chromosome for each SNP.
As an example, we want to extract data for SNP rs3181108, a SNP on chromosome 2.


Install qctool. This software will perform the main tasks.


If not already named gen.gz, copy your data_chr2.gz file of chromosome 2, and rename it data_chr2.gen.gz


cp data_chr2.gz data_chr2.gen.gz


Store text file that contains the rsids and call it snpfile (no header, one SNP per line).


Extract SNP with qctool:


qctool_v2 -g data_chr2.gen.gz -og path-to-data/extract-data.bed -ofiletype binary_ped -incl-rsids snpfile
As -ofiletype there are several options, but here we choose a binary plink format.

Turn binary plink to ped

plink --bfile path-to-data/extract-data --tab --recode --out path-to-data/extract-data --noweb

Turn ped file into a raw file (additive mode)

plink --noweb --bfile path-to-data/extract-data --recodeA --out path-to-data/extract-data
This will produce a file path-to-data/extract-data.raw.
Documentation

Here is a detailed documentation of how qctool filters variants.
Sequencing data

TBD