@catethos
Created January 11, 2022 11:58
{
"cells": [
{
"cell_type": "markdown",
"id": "23980c9b-a9e3-482f-9d00-35dbe226b1e5",
"metadata": {},
"source": [
"# Shapiro-Wilk Test vs Kolmogorov-Smirnov Test\n",
"\n",
"Shapiro-Wilk is to test if **a distribution is normal**; whereas Kolmogorov-Smirnow is to test if **two distributions are equal**. Do notice that the null-hypothesis of Shapiro-Wilk test is that the distribution is normally distributed, so if the test is significant (p < 0.05), it means the distribution is not normal. "
]
},
{
"cell_type": "markdown",
"id": "d4f224e2-8150-4a65-b7f4-2c2e9ccd9c76",
"metadata": {},
"source": [
"## Simulation"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "a5014757-fc98-4abe-bce0-6c660b35ba5c",
"metadata": {},
"outputs": [],
"source": [
"import scipy.stats as stats"
]
},
{
"cell_type": "markdown",
"id": "42dd489d-6947-4c8d-9f19-91dfcda7ffb0",
"metadata": {},
"source": [
"Suppose have some generated data, and we want to test if they are normally distributed."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "36351753-1bba-4562-9dc4-e0cb7d83cee3",
"metadata": {},
"outputs": [],
"source": [
"# data1 is beta distributed\n",
"data1 = stats.beta.rvs(2,1, size=100)\n",
"\n",
"# data2 is normally distributed\n",
"data2 = stats.norm.rvs(size=100, loc=0, scale=1)"
]
},
{
"cell_type": "markdown",
"id": "1b00b69e-db14-4561-8638-88b17ff547ae",
"metadata": {},
"source": [
"We want to show how to\n",
"1. use Shapiro-Wilk to tell data1 is not normal but data2 is normal\n",
"2. use Kolmogorov-Smirnow to tell data 1 is not normal but data2 is normal (one-sample)\n",
"3. use Kolmogorov-Smirnow to tell data1 and data2 are different (two-sample)"
]
},
{
"cell_type": "markdown",
"id": "2509d315-3524-40d6-8255-0bc2d67d2adc",
"metadata": {},
"source": [
"## use Shapiro-Wilk to tell data1 is not normal but data2 is normal"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "e3c91830-eeb7-4218-a54e-dc2ec7ee796b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3.1107210816117004e-06"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# p-value for data1 < 0.05 (significant), so it is not normally distributed\n",
"shapiro_result = stats.shapiro(data1)\n",
"shapiro_result.pvalue"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "05c65c26-afe1-402f-b245-e357bdc18d3b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.7431479096412659"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# p-value for data2 > 0.05 (not significant), so it is normally distributed\n",
"shapiro_result = stats.shapiro(data2)\n",
"shapiro_result.pvalue"
]
},
{
"cell_type": "markdown",
"id": "13e8745e-1e39-4d2f-b9bf-6669f6c9c068",
"metadata": {},
"source": [
"## use Kolmogorov-Smirnow to tell data 1 is not normal but data2 is normal (one-sample)\n",
"\n",
"There are two ways of specifying the distribution to be compared against the data\n",
"1. using the cdf of the distribution directly\n",
"2. is the distribution is inside scipy, then specifying the name is enough"
]
},
{
"cell_type": "markdown",
"id": "a00cf812-0a88-4cb8-8afa-dcc5b45329d9",
"metadata": {},
"source": [
"### using CDF directly"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "0c3eb47f-522d-49e1-895d-43e85b312d0c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3.440096620569547e-27"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# p-value for data1 < 0.05 (significant), so it is not normally distributed\n",
"ks_result = stats.kstest(data1, stats.norm().cdf)\n",
"ks_result.pvalue"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "afe58dec-ed61-4ae3-8311-0026f33c179d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.6110210550418044"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# p-value for data2 > 0.05 (not significant), so it is normally distributed\n",
"ks_result = stats.kstest(data2, stats.norm().cdf)\n",
"ks_result.pvalue"
]
},
{
"cell_type": "markdown",
"id": "6507b1f6-fe5e-4a79-8f06-72f127d67ded",
"metadata": {},
"source": [
"### using name"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "58df4ab1-2d8e-4c13-b768-3098c3cc7182",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3.440096620569547e-27"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# p-value for data1 < 0.05 (significant), so it is not normally distributed\n",
"ks_result = stats.kstest(data1, \"norm\")\n",
"ks_result.pvalue"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "687136ef-b438-47ec-91ba-053070ae0aec",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.6110210550418044"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# p-value for data2 > 0.05 (not significant), so it is normally distributed\n",
"ks_result = stats.kstest(data2, \"norm\")\n",
"ks_result.pvalue"
]
},
{
"cell_type": "markdown",
"id": "7efddb28-da46-4b6a-8045-2079ceb70fa4",
"metadata": {},
"source": [
"## use Kolmogorov-Smirnow to tell data1 and data2 are different (two-sample)\n",
"\n",
"Sometime we don't have a theoretical distribution to be compared against, and we just want to know if two set of data are coming from the same distribution."
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "a7e2a715-488b-4c20-89d2-8185f7992c5f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3.35710076793659e-13"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# the null hypothesis of two-sample KS test is that\n",
"# the two distribution is the same, so if the result\n",
"# is significant, then we reject the null hypothesis\n",
"# and hence they are not from the same distribution.\n",
"\n",
"ks_result = stats.kstest(data2, data1)\n",
"ks_result.pvalue"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "48027569-c41e-4c06-b630-bfd172b5d7d7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1.0"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ks_result = stats.kstest(data2, data2)\n",
"ks_result.pvalue"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "3155f810-48a6-4c33-be27-72783f00e285",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1.0"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ks_result = stats.kstest(data1, data1)\n",
"ks_result.pvalue"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Poetry",
"language": "python",
"name": "poetry-kernel"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
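
The one-sample KS calls in the notebook above compare the data against the standard normal, N(0, 1), which is why they work for `data2` (generated with `loc=0, scale=1`). The sketch below illustrates what may happen with a normal sample that is not standard: the default reference rejects it, while passing the parameters through `args` compares against the matching normal. The sample `shifted`, its mean of 5 and standard deviation of 2, and the fixed seed are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
import scipy.stats as stats

# Fixed seed so the sketch is reproducible (an assumption, not in the notebook)
rng = np.random.default_rng(0)

# Hypothetical sample: normally distributed, but NOT standard normal
shifted = stats.norm.rvs(loc=5, scale=2, size=100, random_state=rng)

# Reference is N(0, 1) by default, so this tests "is it standard normal?"
p_standard = stats.kstest(shifted, "norm").pvalue

# Reference is N(5, 2): args are forwarded to the CDF as (loc, scale)
p_matched = stats.kstest(shifted, "norm", args=(5, 2)).pvalue

# Shapiro-Wilk tests normality in general, whatever the mean and scale
p_shapiro = stats.shapiro(shifted).pvalue

print("KS vs N(0, 1):", p_standard)
print("KS vs N(5, 2):", p_matched)
print("Shapiro-Wilk: ", p_shapiro)
```

Here the reference parameters are known in advance. If the mean and standard deviation were instead estimated from the same sample, the ordinary KS p-value would no longer be valid, and a corrected procedure such as the Lilliefors test would usually be preferred.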