catethos/normality_testing.ipynb

## normality_testing.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "23980c9b-a9e3-482f-9d00-35dbe226b1e5",
   "metadata": {},
   "source": [
    "# Shapiro-Wilk Test vs Kolmogorov-Smirnov Test\n",
    "\n",
    "The Shapiro-Wilk test checks whether **a sample comes from a normal distribution**, whereas the Kolmogorov-Smirnov test checks whether **two distributions are equal** (a sample against a reference distribution, or two samples against each other). Note that the null hypothesis of the Shapiro-Wilk test is that the data are normally distributed, so a significant result (p < 0.05) is evidence against normality, while a non-significant result only means that normality cannot be rejected."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d4f224e2-8150-4a65-b7f4-2c2e9ccd9c76",
   "metadata": {},
   "source": [
    "## Simulation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "a5014757-fc98-4abe-bce0-6c660b35ba5c",
   "metadata": {},
   "outputs": [],
   "source": [
    "import scipy.stats as stats"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "42dd489d-6947-4c8d-9f19-91dfcda7ffb0",
   "metadata": {},
   "source": [
    "Suppose we have some generated data, and we want to test whether they are normally distributed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "36351753-1bba-4562-9dc4-e0cb7d83cee3",
   "metadata": {},
   "outputs": [],
   "source": [
    "# data1 is drawn from a Beta(2, 1) distribution\n",
    "data1 = stats.beta.rvs(2, 1, size=100)\n",
    "\n",
    "# data2 is drawn from a standard normal distribution\n",
    "data2 = stats.norm.rvs(size=100, loc=0, scale=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1b00b69e-db14-4561-8638-88b17ff547ae",
   "metadata": {},
   "source": [
    "We want to show how to\n",
    "1. use Shapiro-Wilk to tell data1 is not normal but data2 is normal\n",
    "2. use Kolmogorov-Smirnov to tell data1 is not normal but data2 is normal (one-sample)\n",
    "3. use Kolmogorov-Smirnov to tell data1 and data2 are different (two-sample)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2509d315-3524-40d6-8255-0bc2d67d2adc",
   "metadata": {},
   "source": [
    "## use Shapiro-Wilk to tell data1 is not normal but data2 is normal"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "e3c91830-eeb7-4218-a54e-dc2ec7ee796b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3.1107210816117004e-06"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# p-value for data1 < 0.05 (significant), so it is not normally distributed\n",
    "shapiro_result = stats.shapiro(data1)\n",
    "shapiro_result.pvalue"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "05c65c26-afe1-402f-b245-e357bdc18d3b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.7431479096412659"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# p-value for data2 > 0.05 (not significant), so we cannot reject normality\n",
    "shapiro_result = stats.shapiro(data2)\n",
    "shapiro_result.pvalue"
   ]
  },
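  {
   "cell_type": "markdown",
   "id": "13e8745e-1e39-4d2f-b9bf-6669f6c9c068",
   "metadata": {},
   "source": [
    "## use Kolmogorov-Smirnov to tell data1 is not normal but data2 is normal (one-sample)\n",
    "\n",
    "There are two ways of specifying the reference distribution that the data are compared against:\n",
    "1. passing the CDF of the distribution directly\n",
    "2. if the distribution is available in `scipy.stats`, passing its name as a string\n",
    "\n",
    "In both cases below the reference is the standard normal distribution (mean 0, standard deviation 1), so the test checks agreement with that specific distribution rather than normality in general."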
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a00cf812-0a88-4cb8-8afa-dcc5b45329d9",
   "metadata": {},
   "source": [
    "### using CDF directly"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "0c3eb47f-522d-49e1-895d-43e85b312d0c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3.440096620569547e-27"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# p-value for data1 < 0.05 (significant), so it is not normally distributed\n",
    "ks_result = stats.kstest(data1, stats.norm().cdf)\n",
    "ks_result.pvalue"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "afe58dec-ed61-4ae3-8311-0026f33c179d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.6110210550418044"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# p-value for data2 > 0.05 (not significant), so we cannot reject that it is standard normal\n",
    "ks_result = stats.kstest(data2, stats.norm().cdf)\n",
    "ks_result.pvalue"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6507b1f6-fe5e-4a79-8f06-72f127d67ded",
   "metadata": {},
   "source": [
    "### using name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "58df4ab1-2d8e-4c13-b768-3098c3cc7182",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3.440096620569547e-27"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# p-value for data1 < 0.05 (significant), so it is not normally distributed\n",
    "ks_result = stats.kstest(data1, \"norm\")\n",
    "ks_result.pvalue"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "687136ef-b438-47ec-91ba-053070ae0aec",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.6110210550418044"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# p-value for data2 > 0.05 (not significant), so we cannot reject that it is standard normal\n",
    "ks_result = stats.kstest(data2, \"norm\")\n",
    "ks_result.pvalue"
   ]
  },
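  {
   "cell_type": "markdown",
   "id": "ks-one-sample-sketch-note",
   "metadata": {},
   "source": [
    "As a rough illustration of what the one-sample test measures, the cell below is a minimal standard-library sketch of the KS statistic: the largest gap between the sample's empirical CDF and a reference CDF. The helper `ks_statistic` and the quantile-based sample `shifted` are hypothetical names introduced here, not part of scipy, and the sketch returns only the statistic, not a p-value. Because `stats.kstest(data, \"norm\")` with no extra arguments compares against the standard normal, a sample that is perfectly normal but centred at 5 still shows a very large gap; against a reference with matching mean and standard deviation, the gap nearly vanishes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ks-one-sample-sketch-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "from statistics import NormalDist\n",
    "\n",
    "def ks_statistic(sample, cdf):\n",
    "    # Hypothetical helper: largest gap between the empirical CDF of\n",
    "    # `sample` and the reference `cdf` (the one-sample KS statistic).\n",
    "    xs = sorted(sample)\n",
    "    n = len(xs)\n",
    "    d = 0.0\n",
    "    for i, x in enumerate(xs):\n",
    "        f = cdf(x)\n",
    "        # The empirical CDF jumps from i/n to (i+1)/n at x, so check both sides.\n",
    "        d = max(d, (i + 1) / n - f, f - i / n)\n",
    "    return d\n",
    "\n",
    "# Idealised normal sample (assumed data): 100 evenly spaced quantiles of N(5, 1).\n",
    "shifted = [NormalDist(5, 1).inv_cdf((i + 0.5) / 100) for i in range(100)]\n",
    "\n",
    "# Against the standard normal the gap is close to 1 ...\n",
    "d_standard = ks_statistic(shifted, NormalDist(0, 1).cdf)\n",
    "# ... while against N(5, 1) it is tiny.\n",
    "d_matched = ks_statistic(shifted, NormalDist(5, 1).cdf)\n",
    "d_standard, d_matched"
   ]
  },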
  {
   "cell_type": "markdown",
   "id": "7efddb28-da46-4b6a-8045-2079ceb70fa4",
   "metadata": {},
   "source": [
    "## use Kolmogorov-Smirnov to tell data1 and data2 are different (two-sample)\n",
    "\n",
    "Sometimes we don't have a theoretical distribution to compare against, and we just want to know whether two sets of data come from the same distribution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "a7e2a715-488b-4c20-89d2-8185f7992c5f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3.35710076793659e-13"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# The null hypothesis of the two-sample KS test is that\n",
    "# the two samples come from the same distribution, so if\n",
    "# the result is significant, we reject the null hypothesis\n",
    "# and conclude that they come from different distributions.\n",
    "\n",
    "ks_result = stats.kstest(data2, data1)\n",
    "ks_result.pvalue"
   ]
  },
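  {
   "cell_type": "markdown",
   "id": "ks-two-sample-sketch-note",
   "metadata": {},
   "source": [
    "The two-sample variant replaces the reference CDF with the second sample's empirical CDF. The cell below is a minimal standard-library sketch that computes this statistic for small hand-made samples; `ks_2samp_statistic` is a hypothetical helper written for illustration, and it returns only the statistic, not a p-value. Identical samples give a statistic of 0, while samples that do not overlap at all give the maximum value of 1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ks-two-sample-sketch-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "from bisect import bisect_right\n",
    "\n",
    "def ks_2samp_statistic(a, b):\n",
    "    # Hypothetical helper: largest gap between the two empirical CDFs.\n",
    "    a_sorted, b_sorted = sorted(a), sorted(b)\n",
    "    d = 0.0\n",
    "    for x in a_sorted + b_sorted:\n",
    "        # Fraction of each sample that is <= x (its empirical CDF at x).\n",
    "        fa = bisect_right(a_sorted, x) / len(a_sorted)\n",
    "        fb = bisect_right(b_sorted, x) / len(b_sorted)\n",
    "        d = max(d, abs(fa - fb))\n",
    "    return d\n",
    "\n",
    "# Small hand-made samples (assumed data, chosen so the answers are easy to check).\n",
    "same = ks_2samp_statistic([1, 2, 3, 4], [1, 2, 3, 4])\n",
    "disjoint = ks_2samp_statistic([1, 2, 3], [10, 11, 12])\n",
    "overlap = ks_2samp_statistic([1, 2, 3, 4], [3, 4, 5, 6])\n",
    "same, disjoint, overlap"
   ]
  },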
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "48027569-c41e-4c06-b630-bfd172b5d7d7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.0"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Sanity check: a sample compared with itself has a KS statistic of 0, so the p-value is 1\n",
    "ks_result = stats.kstest(data2, data2)\n",
    "ks_result.pvalue"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "3155f810-48a6-4c33-be27-72783f00e285",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.0"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Same sanity check for data1: identical samples give a p-value of 1\n",
    "ks_result = stats.kstest(data1, data1)\n",
    "ks_result.pvalue"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Poetry",
   "language": "python",
   "name": "poetry-kernel"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}