Skip to content

Instantly share code, notes, and snippets.

@sinarueeger
Last active March 13, 2024 14:21
Show Gist options
  • Star 5 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save sinarueeger/868a98f102058f20b257ad45cb1e4385 to your computer and use it in GitHub Desktop.
Save sinarueeger/868a98f102058f20b257ad45cb1e4385 to your computer and use it in GitHub Desktop.
Extract data for single SNPs from large genetic datasets

Extract genetic data for a subset of SNPs

Below are instructions on how to extract single SNP data from large genetic datasets.

More on genetic data file formats here.

Genotype data

To extract SNPs from genotype data in plink format, you will need to first install plink. This is a command line tool with an excellent online documentation.

Small set of SNPs

As an example, let's assume you want to extract two SNPs: rs1883832 and rs11569323.

If you have data.ped and data.map files, turn these files first into the binary format:

plink --file path-to-data/data --make-bed

If you have data.bed, data.bim and data.fam files, the command is:

plink --noweb --bfile path-to-data/data --recodeA --out path-to-data/extract-data --snps rs1883832,rs11569323

Here, the option recodeA will calculate the additive allele counts, yielding values of 0, 1, 2.

Either of the commands will extract the two snps from the data files and store it in path-to-data/extract-data.raw. The column labels reflect the rs-id with the name of the minor allele appended.

Many SNPs

If you have a long list of SNPs, you can store them in a text file and provide plink with that textfile.

plink --noweb --bfile path-to-data/data --recodeA --out path-to-data/extract-data --extract snpfile

The snpfile contains a list of SNPs (one per line, no header).

snpfile

rs1883832
rs11569323

Documentation

Beside extracting SNPs, plink offers tons of other options. For example extracting data for a subset of individuals or extracting data from a genomic range.

Genotype imputed data

Extracting data from genotype imputed data is slightly more complicated because there are often 22 or 23 datasets, for each chromosome one. So we will need to know the chromosome for each SNP.

As an example, we want to extract data for SNP rs3181108, a SNP on chromosome 2.

  1. Install qctool. This software will perform the main tasks.

  2. If not already named gen.gz, copy your data_chr2.gz file of chromosome 2, and rename it data_chr2.gen.gz

cp data_chr2.gz data_chr2.gen.gz

  1. Store text file that contains the rsids and call it snpfile (no header, one SNP per line).

  2. Extract SNP with qctool:

qctool_v2 -g data_chr2.gen.gz -og path-to-data/extract-data.bed -ofiletype binary_ped -incl-rsids snpfile

As -ofiletype there are several options, but here we choose a binary plink format.

  1. Turn binary plink to ped

plink --bfile path-to-data/extract-data --tab --recode --out path-to-data/extract-data --noweb

  1. Turn ped file into a raw file (additive mode)

plink --noweb --bfile path-to-data/extract-data --recodeA --out path-to-data/extract-data

This will produce a file path-to-data/extract-data.raw.

Documentation

Here is a detailed documentation of how qctool filters variants.

Sequencing data

TBD

@yellowbridge
Copy link

Thanks for your great tutorial. When I use the .bed, .fam and .bim files to extract SNPs, I run the following code:
plink --bfile IC22 --recode A --out IC22res --snps rs7290901
However, I get the following error:
Error: Duplicate ID 'rs567226117'.

Upon checking the .bim file, I found that:
22 rs567226117 0 16341242 G A
22 rs567226117 0 16341242 G C

After searching for the SNP on NCBI, it shows that:
rs567226117
A/C/G single-nucleotide variation on chromosome 22

How can I resolve this error? Is there another PLINK code you would recommend?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment