Last active
March 12, 2020 09:37
-
-
Save shadiakiki1986/542457d02ac4ef07342f5969a151eab9 to your computer and use it in GitHub Desktop.
t3-replicating plink epistasis.r.ipynb
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"nbformat": 4, | |
"nbformat_minor": 0, | |
"metadata": { | |
"colab": { | |
"name": "t3-replicating plink epistasis.r.ipynb", | |
"provenance": [], | |
"collapsed_sections": [], | |
"authorship_tag": "ABX9TyP9FLInj1E6+gwTeXEbJ4g2", | |
"include_colab_link": true | |
}, | |
"kernelspec": { | |
"name": "ir", | |
"display_name": "R" | |
} | |
}, | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "view-in-github", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"<a href=\"https://colab.research.google.com/gist/shadiakiki1986/542457d02ac4ef07342f5969a151eab9/t3-replicating-plink-epistasis-r.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "fyMTc_4iOpIU", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"Trying to replicate plink's `--epistasis` calculations using [1][2][3]\n", | |
"\n", | |
"- [1] http://zzz.bwh.harvard.edu/plink/epi.shtml#snp\n", | |
"- [2] https://www.cog-genomics.org/plink2/epistasis\n", | |
"- [3] https://www.r-bloggers.com/how-to-perform-a-logistic-regression-in-r/\n", | |
"\n", | |
"This notebook uses dummy data generated by plink as follows:\n", | |
"\n", | |
"```\n", | |
"./plink19 --silent --dummy 513 1423 0.02 --out dummy_cc1\n", | |
"```\n", | |
"\n", | |
"The expected output from `./plink19 --bfile dummy_cc1 --epistasis --out test3_epistasis_epi1=1_epi2=1 --epi1 1` for a few pairs of snp39 are\n", | |
"\n", | |
"```\n", | |
"# plink --epistasis result\n", | |
"# (.cc file)\n", | |
"CHR1 SNP1 CHR2 SNP2 OR_INT STAT P\n", | |
"1 snp39 1 snp69 1.23109 1.1938 0.2746\n", | |
"1 snp39 1 snp752 0.908365 0.270368 0.6031\n", | |
"1 snp39 1 snp1383 1.76582 8.82644 0.002969\n", | |
"\n", | |
"# (.cc.summary file)\n", | |
"CHR SNP N_SIG N_TOT PROP BEST_CHISQ BEST_CHR BEST_SNP\n", | |
"1 snp39 10 1422 0.007032 8.826 1 snp1383\n", | |
"```\n", | |
"\n", | |
"Got help from the plink google group at\n", | |
"\n", | |
"https://groups.google.com/forum/#!topic/plink2-users/blUGVNImcs8\n", | |
"\n", | |
"The numbers from the above pasted plink output which are replicated by this notebook are the 1.23, 1.19, and 0.27." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "UHyVInjkNx1o", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"## Manual step\n", | |
"\n", | |
"Upload the 3 files `dummy_cc1.{bed,bim,fam}` generated above" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "4pXb4sKq-D0P", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"## Loading data" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "H9GcncSfOv6H", | |
"colab_type": "code", | |
"outputId": "e31b6c29-429a-4ca8-93cc-14b752f7fed3", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 272 | |
} | |
}, | |
"source": [ | |
"# install R package for read.plink function\n", | |
"install.packages(\"BiocManager\")\n", | |
"BiocManager::install(\"snpStats\")" | |
], | |
"execution_count": 1, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"Installing package into ‘/usr/local/lib/R/site-library’\n", | |
"(as ‘lib’ is unspecified)\n", | |
"\n", | |
"Bioconductor version 3.10 (BiocManager 1.30.10), R 3.6.3 (2020-02-29)\n", | |
"\n", | |
"Installing package(s) 'BiocVersion', 'snpStats'\n", | |
"\n", | |
"also installing the dependencies ‘BiocGenerics’, ‘zlibbioc’\n", | |
"\n", | |
"\n", | |
"Old packages: 'broom', 'covr', 'curl', 'dplyr', 'farver', 'forcats', 'fs',\n", | |
" 'ggplot2', 'lifecycle', 'plyr', 'processx', 'rlang', 'roxygen2', 'rprojroot',\n", | |
" 'shiny', 'testthat', 'vctrs', 'xfun', 'xml2', 'xtable', 'foreign', 'nlme',\n", | |
" 'survival'\n", | |
"\n" | |
], | |
"name": "stderr" | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "NhBUQ6nwTmyr", | |
"colab_type": "code", | |
"outputId": "192a1ccc-b93b-43b2-af8e-5969537b68ad", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 425 | |
} | |
}, | |
"source": [ | |
"# reads in original input data into a snpMatrix object\n", | |
"library(snpStats)\n", | |
"geno <- read.plink(\"dummy_cc1\")\n", | |
"str(geno)" | |
], | |
"execution_count": 2, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"Loading required package: survival\n", | |
"\n", | |
"Loading required package: Matrix\n", | |
"\n" | |
], | |
"name": "stderr" | |
}, | |
{ | |
"output_type": "stream", | |
"text": [ | |
"List of 3\n", | |
" $ genotypes:Formal class 'SnpMatrix' [package \"snpStats\"] with 1 slot\n", | |
" .. ..@ .Data: raw [1:513, 1:1423] 03 02 02 01 ...\n", | |
" .. .. ..- attr(*, \"dimnames\")=List of 2\n", | |
" .. .. .. ..$ : chr [1:513] \"per0\" \"per1\" \"per2\" \"per3\" ...\n", | |
" .. .. .. ..$ : chr [1:1423] \"snp0\" \"snp1\" \"snp2\" \"snp3\" ...\n", | |
" $ fam :'data.frame':\t513 obs. of 6 variables:\n", | |
" ..$ pedigree: chr [1:513] \"per0\" \"per1\" \"per2\" \"per3\" ...\n", | |
" ..$ member : chr [1:513] \"per0\" \"per1\" \"per2\" \"per3\" ...\n", | |
" ..$ father : logi [1:513] NA NA NA NA NA NA ...\n", | |
" ..$ mother : logi [1:513] NA NA NA NA NA NA ...\n", | |
" ..$ sex : int [1:513] 2 2 2 2 2 2 2 2 2 2 ...\n", | |
" ..$ affected: int [1:513] 2 2 1 1 1 2 1 1 1 2 ...\n", | |
" $ map :'data.frame':\t1423 obs. of 6 variables:\n", | |
" ..$ chromosome: int [1:1423] 1 1 1 1 1 1 1 1 1 1 ...\n", | |
" ..$ snp.name : chr [1:1423] \"snp0\" \"snp1\" \"snp2\" \"snp3\" ...\n", | |
" ..$ cM : logi [1:1423] NA NA NA NA NA NA ...\n", | |
" ..$ position : int [1:1423] NA 1 2 3 4 5 6 7 8 9 ...\n", | |
" ..$ allele.1 : chr [1:1423] \"A\" \"B\" \"A\" \"A\" ...\n", | |
" ..$ allele.2 : chr [1:1423] \"B\" \"A\" \"B\" \"B\" ...\n" | |
], | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "-o4czVnogAPi", | |
"colab_type": "code", | |
"outputId": "1ee1ddf2-0cdb-46c9-84c3-d419329321a4", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 85 | |
} | |
}, | |
"source": [ | |
"# transform the data from a snpMatrix object to a numeric matrix\n", | |
"# These values are 0, 1, 2\n", | |
"# Quote: \"Yes, the coding is linear (0/1/2 allele counts).\"\n", | |
"# https://groups.google.com/d/msg/plink2-users/Hp1Gia80WUI/hCAYdKVGCwAJ\n", | |
"\n", | |
"genotypes <- t(as(geno$genotypes, \"numeric\"))\n", | |
"\n", | |
"# str(genotypes[\"snp39\",])\n", | |
"str(genotypes)" | |
], | |
"execution_count": 3, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
" num [1:1423, 1:513] 2 1 1 0 1 0 1 2 2 2 ...\n", | |
" - attr(*, \"dimnames\")=List of 2\n", | |
" ..$ : chr [1:1423] \"snp0\" \"snp1\" \"snp2\" \"snp3\" ...\n", | |
" ..$ : chr [1:513] \"per0\" \"per1\" \"per2\" \"per3\" ...\n" | |
], | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "9Y6t6t9VgNHy", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"### logistic regression" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "k5AvkSmoaRbf", | |
"colab_type": "code", | |
"outputId": "c77c5309-0ffc-46d4-bbac-e8097ec91e82", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 102 | |
} | |
}, | |
"source": [ | |
"# define the dataframe for regression\n", | |
"tab <- data.frame(phenotype = geno$fam$affected,\n", | |
" snp1=genotypes[\"snp39\",],\n", | |
" snp2=genotypes[\"snp69\",],\n", | |
"\n", | |
" ### vvvvv Not sure how to do this\n", | |
" # snp1_snp2=pmax(genotypes[\"snp39\",] * genotypes[\"snp69\",], 2)\n", | |
" # snp1_snp2=genotypes[\"snp39\",] + genotypes[\"snp69\",]\n", | |
" snp1_snp2=genotypes[\"snp39\",] * genotypes[\"snp69\",]\n", | |
" )\n", | |
"\n", | |
"\n", | |
"# convert 1's and 2's to True and False\n", | |
"tab$phenotype = tab$phenotype == 2\n", | |
"str(tab)" | |
], | |
"execution_count": 4, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"'data.frame':\t513 obs. of 4 variables:\n", | |
" $ phenotype: logi TRUE TRUE FALSE FALSE FALSE TRUE ...\n", | |
" $ snp1 : num 0 2 1 2 0 1 2 0 1 1 ...\n", | |
" $ snp2 : num 2 0 1 0 1 2 1 0 0 1 ...\n", | |
" $ snp1_snp2: num 0 0 1 0 0 2 2 0 0 1 ...\n" | |
], | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "JLNgZrojcmxo", | |
"colab_type": "code", | |
"outputId": "0b8aeaf3-4f4a-4a98-bd52-7f6b25324128", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 425 | |
} | |
}, | |
"source": [ | |
"# logistic regression like https://www.r-bloggers.com/how-to-perform-a-logistic-regression-in-r/\n", | |
"model = glm(phenotype ~ snp1 + snp2 + snp1_snp2, data = tab, family = \"binomial\")\n", | |
"summary(model)" | |
], | |
"execution_count": 5, | |
"outputs": [ | |
{ | |
"output_type": "display_data", | |
"data": { | |
"text/plain": [ | |
"\n", | |
"Call:\n", | |
"glm(formula = phenotype ~ snp1 + snp2 + snp1_snp2, family = \"binomial\", \n", | |
" data = tab)\n", | |
"\n", | |
"Deviance Residuals: \n", | |
" Min 1Q Median 3Q Max \n", | |
"-1.35912 -1.17728 -0.03907 1.17754 1.27366 \n", | |
"\n", | |
"Coefficients:\n", | |
" Estimate Std. Error z value Pr(>|z|)\n", | |
"(Intercept) -0.002466 0.295985 -0.008 0.993\n", | |
"snp1 -0.095247 0.244684 -0.389 0.697\n", | |
"snp2 -0.110501 0.229146 -0.482 0.630\n", | |
"snp1_snp2 0.207898 0.190273 1.093 0.275\n", | |
"\n", | |
"(Dispersion parameter for binomial family taken to be 1)\n", | |
"\n", | |
" Null deviance: 684.83 on 493 degrees of freedom\n", | |
"Residual deviance: 682.10 on 490 degrees of freedom\n", | |
" (19 observations deleted due to missingness)\n", | |
"AIC: 690.1\n", | |
"\n", | |
"Number of Fisher Scoring iterations: 3\n" | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
} | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "TE_YUf76CLEo", | |
"colab_type": "code", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 34 | |
}, | |
"outputId": "47978924-7c3b-42fb-c64c-81b36b798330" | |
}, | |
"source": [ | |
"# odds ratio reported by plink (value 1.23 in the pasted plink result at the top of this notebook)\n", | |
"exp(summary(model)$coef['snp1_snp2',1])" | |
], | |
"execution_count": 12, | |
"outputs": [ | |
{ | |
"output_type": "display_data", | |
"data": { | |
"text/plain": [ | |
"[1] 1.231088" | |
], | |
"text/latex": "1.23108773895034", | |
"text/markdown": "1.23108773895034", | |
"text/html": [ | |
"1.23108773895034" | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
} | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "jg_Oy91kXYh2", | |
"colab_type": "code", | |
"outputId": "7d4ddd55-0444-4343-b453-16522486e5f4", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 162 | |
} | |
}, | |
"source": [ | |
"# Note the 1.19 and 0.27 values below match with the chi-squared and p-value from plink\n", | |
"# (check pasted result on top of this notebook)\n", | |
"anova(model, test=\"Chisq\")" | |
], | |
"execution_count": 13, | |
"outputs": [ | |
{ | |
"output_type": "display_data", | |
"data": { | |
"text/plain": [ | |
" Df Deviance Resid. Df Resid. Dev Pr(>Chi) \n", | |
"NULL NA NA 493 684.8294 NA\n", | |
"snp1 1 1.041385 492 683.7880 0.3074995\n", | |
"snp2 1 0.486748 491 683.3013 0.4853815\n", | |
"snp1_snp2 1 1.199859 490 682.1014 0.2733500" | |
], | |
"text/latex": "A anova: 4 × 5\n\\begin{tabular}{r|lllll}\n & Df & Deviance & Resid. Df & Resid. Dev & Pr(>Chi)\\\\\n & <int> & <dbl> & <int> & <dbl> & <dbl>\\\\\n\\hline\n\tNULL & NA & NA & 493 & 684.8294 & NA\\\\\n\tsnp1 & 1 & 1.041385 & 492 & 683.7880 & 0.3074995\\\\\n\tsnp2 & 1 & 0.486748 & 491 & 683.3013 & 0.4853815\\\\\n\tsnp1\\_snp2 & 1 & 1.199859 & 490 & 682.1014 & 0.2733500\\\\\n\\end{tabular}\n", | |
"text/markdown": "\nA anova: 4 × 5\n\n| <!--/--> | Df <int> | Deviance <dbl> | Resid. Df <int> | Resid. Dev <dbl> | Pr(>Chi) <dbl> |\n|---|---|---|---|---|---|\n| NULL | NA | NA | 493 | 684.8294 | NA |\n| snp1 | 1 | 1.041385 | 492 | 683.7880 | 0.3074995 |\n| snp2 | 1 | 0.486748 | 491 | 683.3013 | 0.4853815 |\n| snp1_snp2 | 1 | 1.199859 | 490 | 682.1014 | 0.2733500 |\n\n", | |
"text/html": [ | |
"<table>\n", | |
"<caption>A anova: 4 × 5</caption>\n", | |
"<thead>\n", | |
"\t<tr><th></th><th scope=col>Df</th><th scope=col>Deviance</th><th scope=col>Resid. Df</th><th scope=col>Resid. Dev</th><th scope=col>Pr(>Chi)</th></tr>\n", | |
"\t<tr><th></th><th scope=col><int></th><th scope=col><dbl></th><th scope=col><int></th><th scope=col><dbl></th><th scope=col><dbl></th></tr>\n", | |
"</thead>\n", | |
"<tbody>\n", | |
"\t<tr><th scope=row>NULL</th><td>NA</td><td> NA</td><td>493</td><td>684.8294</td><td> NA</td></tr>\n", | |
"\t<tr><th scope=row>snp1</th><td> 1</td><td>1.041385</td><td>492</td><td>683.7880</td><td>0.3074995</td></tr>\n", | |
"\t<tr><th scope=row>snp2</th><td> 1</td><td>0.486748</td><td>491</td><td>683.3013</td><td>0.4853815</td></tr>\n", | |
"\t<tr><th scope=row>snp1_snp2</th><td> 1</td><td>1.199859</td><td>490</td><td>682.1014</td><td>0.2733500</td></tr>\n", | |
"</tbody>\n", | |
"</table>\n" | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
} | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "dd25MsMKXYeO", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
} | |
] | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment