@tabrez
Last active July 30, 2020 17:45
basic_statistics.ipynb
{
"cells": [
{
"metadata": {},
"cell_type": "markdown",
"source": "### A. What is statistics?\n* A __population__ is an individial or group that represents all the members of a certain group or category of interest.\n* Values generated from, or applied to, a population are called __parameters__.\n* A __sample__ is a subset drawn from the larger population. If the subset is drawn randomly, then it's called a random sample.\n* Values generated from, or applied to, a sample are called __statistics__.\n* __Descriptive statistics__ apply only to the sample data that we collected where as __inferential statistics__ allow us to reach some conclusions about the larger population.\n* One of our goals is to be able to determine how well the results from the sample (obtained using statistics) generalize to the larger population. "
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### B. Types of sampling data\n* __Random sampling__ means every member of a population has an equal chance of being selected into a sample.\n* __Convenience sample__ means members from a population are selected based on proximity, ease-of-access and willingness to participate.\n* __Representative sampling__ means selecting cases so that they will match the larger population on specific characteristics(e.g. % of male vs female, % of children vs adults).\n* Any sampling method that does not differ from population of interest _in ways that influence the outcome of the study_ is an acceptable sampling method."
},
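{
"metadata": {},
"cell_type": "markdown",
"source": "A minimal sketch of simple random sampling, assuming a hypothetical population of 100 numbered members (the IDs, the sample size of 10 and the seed are made up for illustration). `random.sample` draws without replacement, so every member has an equal chance of being selected."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "import random\n\n# Hypothetical population: member IDs 1 to 100 (illustrative only)\npopulation = list(range(1, 101))\n\nrandom.seed(0)  # fixed seed so the draw is reproducible\n\n# Simple random sample of 10 members, drawn without replacement\nrandom_sample = random.sample(population, 10)\nprint(random_sample)",
"execution_count": null,
"outputs": []
},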
{
"metadata": {},
"cell_type": "markdown",
"source": "### C. Types of variables\n* A __variable__ has more than a single value; a __constant__ on the other hand has a single value.\n* Variables can be __quantitative(continuous)__ or __qualitative(categorical)__: \n * a quantitative variable indicates some sort of amount\n (e.g. 'height' variable)\n * a qualitative variable doesn't indicate more or less of a certain quality\n (e.g. 'country' variable)"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### D. Four different scales of measurement for variables\n\n* Different levels of a __nominally scaled variable__ have no numeric value.\n(e.g. 'gender' variable)\n* The values of __ordinal variables__ have weight but do not contain information about the distance between the values. (e.g. ranks of the fastest sprint runners)\n* Variables scored using __interval__ and __ratio__ scales contain information about both relative value as well as distance. (e.g. 'height' variable)\n* Ration scale includes a zero value"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "import pandas as pd\nimport matplotlib.pyplot as plt\nimport matplotlib.style as style\nstyle.use('fivethirtyeight')\n%matplotlib inline\nx = ['Democrats', 'Republicans', 'Independents']\ns = pd.Series({'Democrats': 40, 'Republicans': 45, 'Independents': 15}, index = x)\ns.plot(kind='bar')",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "s.plot(kind='pie')",
"execution_count": null,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### Hours spent doing acitivities\n\n| Country | TV | Homework |\n|:---------|:--------:|:--------:|\n| US | 6 | 2 |\n| Mexico | 3 | 1 |\n| China | 4 | 4 |\n| Norway | 2 | 3 |\n| Japan | 3 | 4 |"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "index = ['US', 'Mexico', 'China', 'Norway', 'Japan']\ntv_hours = [6, 3, 4, 2, 3]\nhw_hours = [2, 1, 4, 3, 4]\ngym_hours = [1, 1, 1, 2, 2]\ntable = pd.DataFrame({'tv': tv_hours, 'homework': hw_hours, 'gym': gym_hours}, \n index = index, \n columns=['tv', 'homework', 'gym'])\nax = table.plot(kind='bar', figsize=(12, 8))\n# ax = table.plot(kind='bar', figsize=(12, 8), stacked=True)\n# ax.set_ylim(0, 11)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "ax.get_figure().savefig('hours.png', dpi=1000)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "gpa_ranges = ['1.0-1.4', '1.5-1.9', '2.0-2.4', '2.5-2.9', '3.0-3.4', '3.5-4.0']\nfrequencies = [15, 2, 17, 14, 28, 26]\nplt.ylim(0, 35)\nplt.plot(gpa_ranges, frequencies, marker='s', markersize=10)",
"execution_count": null,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### E. Central Tendency\nA __distribution__ is arrangement of scores of one variable in order from lowest to highest\nOne of the distribution characteristics we are interested in is called central tendency, which consists of mean, median and mode.\n\n* The __mean__ is the arithematic average of a distribution of scores.\n* The __median__ is the score in the distribution that marks the 50th percentile i.e. 50% of the scores in the distribution fall above the median and 50% fall below it.\n* The __mode__ is the score that occurs most often in the distribution and is fairly useless.\n* A bell-shaped frequency distribution of scores that has the mean, median, and mode in the middle of the distribution & is symmetrical and asymptotic is called as a __normal distribution__\n\nExamples of other distributions are _t_ distribution, _F_ distribution and _chi-square_ distribution."
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### F. Formulas for calculating sample and population mean of a distribution:\n\n$$\\mu = \\frac{\\varSigma X}{N}$$\n\n$~$\n\n$$\\bar{X} = \\frac{\\varSigma X}{n}$$\n\nwhere:\n\n$\\bar{X}$ is the sample mean\n\n$\\mu$ is the population mean\n \n$\\varSigma$ means 'the sum of'\n \nX is an individual score in the distribution\n \nn is the number of scores in the sample\n \nN is the number of scores in the population\n \n\n\n\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### G. Measures of Central Tendency\n\ne.g. 86 90 95 100 100 100 110 110 115 120\n\n_Mode_: 100\n\n_Median_: (100 + 100) / 2 = 100\n\n_Mean_: (86 + 90 + 95 + 100 + 100 + 100 + 110 + 115 + 120) / 10 = 102.6\n\nNormal distribution?\n"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "import matplotlib.pyplot as plt\n%matplotlib inline\n\nx = [86, 90, 95, 100, 100, 100, 110, 110, 115, 120]\na = plt.hist(x)",
"execution_count": null,
"outputs": []
},
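{
"metadata": {},
"cell_type": "markdown",
"source": "The hand calculations in section G can be cross-checked with Python's built-in `statistics` module. This is only a sketch; the variable name `scores` is introduced here for illustration and holds the same ten values as the example."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "import statistics\n\n# Same ten scores as in the worked example\nscores = [86, 90, 95, 100, 100, 100, 110, 110, 115, 120]\n\nprint('Mode:', statistics.mode(scores))      # most frequent score\nprint('Median:', statistics.median(scores))  # average of the two middle scores\nprint('Mean:', statistics.mean(scores))      # sum of scores divided by their count",
"execution_count": null,
"outputs": []
},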
{
"metadata": {},
"cell_type": "markdown",
"source": "### H. Mean, Median, and Mode of a Skewed Distribution\n\nQ. Is it important to do well in school?\nOptions from 1 to 5 => 1 = \"not at all important\" & 5 = \"very important\"\n\n1 1 1 2 2 2 3 3 3 3\n4 4 4 4 4 4 4 4 5 5\n5 5 5 5 5 5 5 5 5 5\n\n_Mode_: 5\n\n_Median_: (4 + 4) / 2 = 4\n\n_Mean_: 113 / 30 ~= 3.77\n\nNormal distribution?"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "x = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]\na = plt.hist(x)\n\n# sum(x)",
"execution_count": null,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### I. Measures of Variability\n\n* __Range__ is the difference between the largest score(the maximum value) and the smallest score(the minimum value)\n\n* _Interquantile Range_ is the difference between the score that marks the 75th percentile(the third quartile) and the score that marks the 25th percentile(the first quartile).\n\n* _Deviation_ is the difference between an individual score in a distribution and the mean for the distibution.\n\n* __Variance__ is the statistical average of the amount of dispersion in a distribution of scores.\n\n* __Standard Deviation__ is the average deviation between individual scores in a distribution and the mean of the distribution\n\n_Range_ is a measure of the total spread in a distribution whereas the _variance_ and _standard deviation_ are measures of the average amount spread with the distribution.\n\n$~$\n## Show Interquantile range and deviation in figures\n\n$~$\n\n#### Variance and Standard Deviation Formulas\n\n$~$\n\n__Variance__\n\n$$v = s^2 = \\frac{\\varSigma(X - \\bar{X})^2}{n-1}$$\n\n$~$\n\n__Standard Deviation__\n\n$$s = \\sqrt\\frac{\\varSigma(X - \\bar{X})^2}{n-1}$$\n\n$~$\n\nwhere:\n\n$\\varSigma$ = to sum\n\nX = a score in the distibution\n\n$$\\bar{X} = the sample mean\n\nn = the number of cases in the sample\n"
},
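{
"metadata": {},
"cell_type": "markdown",
"source": "The variance and standard deviation formulas above can be sketched in plain Python. The scores are borrowed from the central tendency example purely for illustration, and the variable names are assumptions of this sketch. `statistics.variance` and `statistics.stdev` also divide by n - 1, so they should agree with the manual calculation."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "import math\nimport statistics\n\n# Illustrative scores (same values as the central tendency example)\nscores = [86, 90, 95, 100, 100, 100, 110, 110, 115, 120]\nn = len(scores)\nsample_mean = sum(scores) / n\n\n# Deviation of each score from the sample mean\ndeviations = [value - sample_mean for value in scores]\n\n# Sample variance: sum of squared deviations divided by n - 1\nvariance = sum(d ** 2 for d in deviations) / (n - 1)\n\n# Standard deviation: square root of the variance\nstd_dev = math.sqrt(variance)\n\nprint('Range:', max(scores) - min(scores))\nprint('Variance:', variance, '| statistics.variance:', statistics.variance(scores))\nprint('Standard deviation:', std_dev, '| statistics.stdev:', statistics.stdev(scores))",
"execution_count": null,
"outputs": []
},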
{
"metadata": {},
"cell_type": "markdown",
"source": "Q. \"If I have enough time, I can do even the most difficult work in this class\"\n\nOptions: 1 to 5 => 1 = \"not at all true\" and 5 = \"very true\"\n \n Sample size = 491\n Mean = 4.21\n Standard deviation = 0.98\n Variance = 0.96\n Range = 5 - 1"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "score = [1, 2, 3, 4, 5]\nfrequency = [9, 23, 73, 139, 247]\n\nbars = plt.bar(score, frequency)\nplt.xlabel('Scores on Confidence Item')\nplt.ylabel('Frequency')\n\nfor bar in bars:\n height = bar.get_height()\n plt.text(bar.get_x() + bar.get_width()/2., 1.05*height,\n height,\n ha='center', va='bottom')",
"execution_count": null,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Q. \"I would feel really good if I were the only one who could answer the teacher's question in class.\"\nOptions: 1 to 5 => 1 = \"strongly agree\" and 2 = \"strongly disagree\"\n\nCalculate sample size, mean, standard deviation, variance, range and draw a bar graph."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "score = [1, 2, 3, 4, 5]\nfrequency = [115, 81, 120, 77, 98]",
"execution_count": null,
"outputs": []
},
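{
"metadata": {},
"cell_type": "markdown",
"source": "One possible way to approach this exercise, offered as a sketch rather than the intended solution: treat the frequency table as weights and apply the sample formulas from section I (dividing by n - 1). The names `n`, `mean`, `variance` and `std_dev` are introduced here for illustration."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "import math\nimport matplotlib.pyplot as plt\n\nscore = [1, 2, 3, 4, 5]\nfrequency = [115, 81, 120, 77, 98]\n\n# Sample size is the total number of responses\nn = sum(frequency)\n\n# Weighted mean: each score counts as many times as it was chosen\nmean = sum(value * count for value, count in zip(score, frequency)) / n\n\n# Sample variance (n - 1 in the denominator) and standard deviation\nvariance = sum(count * (value - mean) ** 2 for value, count in zip(score, frequency)) / (n - 1)\nstd_dev = math.sqrt(variance)\n\nprint('Sample size =', n)\nprint('Mean =', round(mean, 2))\nprint('Standard deviation =', round(std_dev, 2))\nprint('Variance =', round(variance, 2))\nprint('Range =', max(score) - min(score))\n\nplt.bar(score, frequency)\nplt.xlabel('Score')\nplt.ylabel('Frequency')",
"execution_count": null,
"outputs": []
},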
{
"metadata": {},
"cell_type": "markdown",
"source": "### Normal Distribution (bell curve)\n* __Descriptive statistics__: What is the average number of calories consumed by the 1000 people in the sample each day?\n* __Probability statistics__: If the average person in a sample of 1000 people consumes 2000 calories per day, what is the probability of having a student in the sample who consumes 3000 calories per day?\n* __Inferential statistics__: Does the phenomenon observed in a sample respresents an actual phenomenonin the population from which the sampe was drawn or was it by chance?\n\nTheoritical normal distribution is what statisticians use to develop probabilities. E.g. ~68% of scores fall with-in one standard deviation from the mean of the scores in a normal distribution. z-score. etc. The probabilities generated from the distribution depend on (1) bell curve shape of the distribution (2) absence of any sampling bias\n\n$~$\n* Positivel skewed vs negatively skewed distributions\n* leptokurtic vs platykurtic distributions\n\n$~$\n## Show Normal, Non-normal, and Skewed distributions in figures\n## Show Normal distribution divided into Std. Dev. units"
}
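,
{
"metadata": {},
"cell_type": "markdown",
"source": "The ~68% figure can be checked numerically. For a normal distribution, the probability of a score falling within k standard deviations of the mean equals erf(k / sqrt(2)). The helper `prob_within` below is a hypothetical name used only for this sketch, and it relies solely on the standard `math` module."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "import math\n\ndef prob_within(k):\n    # Probability that a normally distributed score lies within\n    # k standard deviations of the mean\n    return math.erf(k / math.sqrt(2))\n\nfor k in (1, 2, 3):\n    print('within', k, 'standard deviation(s):', round(prob_within(k), 4))",
"execution_count": null,
"outputs": []
}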
],
"metadata": {
"kernelspec": {
"name": "python3",
"display_name": "Python 3",
"language": "python"
},
"language_info": {
"name": "python",
"version": "3.6.3",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"nbconvert_exporter": "python",
"file_extension": ".py"
},
"gist": {
"id": "8ae26b866840e4847202793b69f1a972",
"data": {
"description": "basic_statistics.ipynb",
"public": true
}
},
"_draft": {
"nbviewer_url": "https://gist.github.com/8ae26b866840e4847202793b69f1a972"
}
},
"nbformat": 4,
"nbformat_minor": 2
}