Descriptive Statistics with Python
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Descriptive Statistics with Python Project\n",
"\n",
"\n",
"**Descriptive Statistics** is the subject matter of this project. Descriptive statistics gives us the basic summary measures about the dataset. The summary measures include measures of central tendency (mean, median and mode) and measures of variability (variance, standard deviation, minimum/maximum values, IQR (Interquartile Range), skewness and kurtosis). I have used the fortune 500 dataset from the data world website for this project."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Table of Contents\n",
"\n",
"\n",
"1.\tIntroduction to descriptive statistics\n",
"2.\tMeasures of central tendency\n",
" -\tMean\n",
" -\tMedian\n",
" -\tMode\n",
"3.\tMeasures of dispersion\n",
" -\tVariance\n",
" -\tStandard deviation\n",
" -\tCoefficient of variation\n",
" -\tIQR (Interquartile range)\n",
" -\tSkewness\n",
" -\tKurtosis\n",
"4.\tDataset description\n",
"5.\tImport libraries\n",
"6.\tImport dataset\n",
"7.\tExploratory data analysis\n",
"8.\tDescriptive statistics with `describe()` function\n",
" -\tSummary statistics of numerical columns\n",
" -\tSummary statistics of character columns\n",
" -\tSummary statistics of all the columns\n",
"9.\tComputation of measures of central tendency\n",
" -\tMean\n",
" -\tMedian\n",
" -\tMode\n",
"10.\tComputation of measures of dispersion or variability\n",
" -\tMinimum and maximum values\n",
" -\tRange\n",
" -\tVariance\n",
" -\tStandard deviation\n",
" -\tMedian\n",
" -\tInterquartile Range\n",
"11.\tComputation of measures of shape of distribution\n",
" -\tSkewness\n",
" -\tKurtosis\n",
"12.\tResults and conclusion\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Introduction to descriptive statistics\n",
"\n",
"\n",
"Descriptive statistics are numbers that are used to describe and summarize the data. They are used to describe the basic features of the data under consideration. They provide simple summary measures which give an overview of the dataset. Summary measures that are commonly used to describe a data set are measures of central tendency and measures of variability or dispersion. \n",
"\n",
"\n",
"Measures of central tendency include the `mean`, `median` and `mode`. These measures summarize a given data set by providing a single data point. These measures describe the center position of a distribution for a data set. We analyze the frequency of each data point in the distribution and describes it using the mean, median or mode. They provide the average of a data set. They can be either a representation of entire population or a sample of the population.\n",
"\n",
"\n",
"Measures of variability or dispersion include the `variance` or `standard deviation`, `coefficient of variation`, `minimum` and `maximum` values, `IQR (Interquartile Range)`, `skewness and `kurtosis`. These measures help us to analyze how spread-out the distribution is for a dataset. So, they provide the shape of the data set.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Measures of central tendency\n",
"\n",
"\n",
"**Central tendency** means a central value which describe a probability distribution. It may also be called a center or location of the distribution. The most common measures of central tendency are **mean**, **median** and **mode**. The most common measure of central tendency is the **mean**. For skewed distribution or when there is concern about outliers, the **median** may be preferred. So, median is more robust measure than the mean.\n",
"\n",
"\n",
"\n",
"### Mean\n",
"\n",
"- The most common measure of central tendency is the mean.\n",
"- Mean is also known as the simple average.\n",
"- It is denoted by greek letter µ for population and by ¯x for sample.\n",
"- We can find mean of a number of elements by adding all the elements in a dataset and then dividing by the number of elements in the dataset.\n",
"- It is the most common measure of central tendency but it has a drawback.\n",
"- The mean is affected by the presence of outliers.\n",
"- So, mean alone is not enough for making business decisions.\n",
"\n",
"\n",
"### Median\n",
"\n",
"- Median is the number which divides the dataset into two equal halves.\n",
"- To calculate the median, we have to arrange our dataset of n numbers in ascending order.\n",
"- The median of this dataset is the number at (n+1)/2 th position, if n is odd.\n",
"- If n is even, then the median is the average of the (n/2)th number and (n+2)/2 th number.\n",
"- Median is robust to outliers.\n",
"- So, for skewed distribution or when there is concern about outliers, the median may be preferred.\n",
"\n",
"\n",
"### Mode\n",
"\n",
"- Mode of a dataset is the value that occurs most often in the dataset.\n",
"- Mode is the value that has the highest frequency of occurrence in the dataset.\n",
"\n",
"\n",
"There is no best measure that give us the complete picture. So, these measures of central tendency (mean, median and mode) should be used together to represent the full picture. \n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Measures of dispersion or variability\n",
"\n",
"\n",
"**Dispersion** is an indicator of how far away from the center, we can find the data values. The most common measures of dispersion are **variance**, **standard deviation** and **interquartile range (IQR)**. **Variance** is the standard measure of spread. The **standard deviation** is the square root of the variance. The **variance** and **standard deviation** are two useful measures of spread. \n",
"\n",
"\n",
"\n",
"### Variance\n",
"\n",
"-\tVariance measures the dispersion of a set of data points around their mean value.\n",
"-\tIt is the mean of the squares of the individual deviations.\n",
"-\tVariance gives results in the original units squared.\n",
"\n",
"\n",
"\n",
"### Standard deviation\n",
"\n",
"-\tStandard deviation is the most common used measure of variability.\n",
"-\tIt is the square-root of the variance.\n",
"-\tFor Normally distributed data, approximately 95% of the values lie within 2 s.d. of the mean. \n",
"-\tStandard deviation gives results in the original units.\n",
"\n",
"\n",
"### Coefficient of Variation (CV)\n",
"\n",
"-\tCoefficient of Variation (CV) is equal to the standard deviation divided by the mean.\n",
"-\tIt is also known as `relative standard deviation`.\n",
"\n",
"\n",
"### IQR (Interquartile range)\n",
"\n",
"-\tA third measure of spread is the **interquartile range (IQR)**.\n",
"-\tThe IQR is calculated using the boundaries of data situated between the 1st and the 3rd quartiles. \n",
"-\tThe interquartile range (IQR) can be calculated as follows:-\n",
" IQR = Q3 – Q1\n",
"-\tIn the same way that the median is more robust than the mean, the IQR is a more robust measure of spread than variance and standard deviation and should therefore be preferred for small or asymmetrical distributions. \n",
"-\tIt is a robust measure of spread.\n",
"\n",
"\n",
"\n",
"### Measures of shape\n",
"\n",
"Now, we will take a look at measures of shape of distribution. There are two statistical measures that can tell us about the shape of the distribution. These measures are **skewness** and **kurtosis**. These measures can be used to convey information about the shape of the distribution of the dataset.\n",
"\n",
"\n",
"### Skewness\n",
"-\t**Skewness** is a measure of a distribution's symmetry or more precisely lack of symmetry. \n",
"-\tIt is used to mean the absence of symmetry from the mean of the dataset. \n",
"-\tIt is a characteristic of the deviation from the mean. \n",
"-\tIt is used to indicate the shape of the distribution of data.\n",
"\n",
"\n",
"#### Negative skewness\n",
"\n",
"-\tNegative values for skewness indicate negative skewness. \n",
"-\tIn this case, the data are skewed or tail to left. \n",
"-\tBy skewed left, we mean that the left tail is long relative to the right tail. \n",
"-\tThe data values may extend further to the left but concentrated in the right. \n",
"-\tSo, there is a long tail and distortion is caused by extremely small values which pull the mean downward so that it is less than the median. \n",
"-\tHence, in this case we have\n",
" **Mean < Median < Mode**\n",
" \n",
"\n",
"#### Zero skewness\n",
"\n",
"-\tZero skewness means skewness value of zero. \n",
"-\tIt means the dataset is symmetrical. \n",
"-\tA data set is symmetrical if it looks the same to the left and right to the center point. \n",
"-\tThe dataset looks bell shaped or symmetrical. \n",
"-\tA perfectly symmetrical data set will have a skewness of zero. \n",
"-\tSo, the normal distribution which is perfectly symmetrical has a skewness of 0. \n",
"-\tSo, in this case, we have\n",
" **Mean = Median = Mode**\n",
" \n",
"\n",
"#### Positive skewness\n",
"\n",
"-\tPositive values for skewness indicate positive skewness. \n",
"-\tThe dataset are skewed or tail to right. \n",
"-\tBy skewed right, we mean that the right tail is long relative to the left tail. \n",
"-\tThe data values are concentrated in the right. \n",
"-\tSo, there is a long tail to the right that is caused by extremely large values which pull the mean upward so that it is greater than the median. \n",
"-\tSo, we have\n",
" **Mean > Median > Mode**\n",
" \n",
"\n",
"#### Reference range on skewness values\n",
"\n",
"The rule of thumb for skewness values are:\n",
"\n",
"-\tIf the skewness is between -0.5 and 0.5, the data are fairly symmetrical.\n",
"-\tIf the skewness is between -1 and – 0.5 or between 0.5 and 1, the data are moderately skewed.\n",
"-\tIf the skewness is less than -1 or greater than 1, the data are highly skewed.\n",
"\n",
"\n",
"### Kurtosis\n",
"\n",
"-\tKurtosis is the degree of peakedness of a distribution. \n",
"-\tData sets with high kurtosis tend to have a distinct peak near the mean, decline rather rapidly and have heavy tails.\n",
"-\tData sets with low kurtosis tend to have a flat top near the mean rather than a sharp peak. \n",
"\n",
"\n",
"#### Reference range for kurtosis\n",
"-\tThe reference standard is a normal distribution, which has a kurtosis of 3. \n",
"-\tOften, **excess kurtosis** is presented instead of kurtosis, where **excess kurtosis** is simply **kurtosis - 3**. \n",
"\n",
"#### Mesokurtic curve\n",
"-\tA normal distribution has kurtosis exactly 3 (**excess kurtosis** exactly 0). \n",
"-\tAny distribution with kurtosis ≈3 (excess ≈ 0) is called **mesokurtic**.\n",
"\n",
"#### Platykurtic curve\n",
"-\tA distribution with kurtosis < 3 (**excess kurtosis** < 0) is called **platykurtic**. \n",
"-\tAs compared to a normal distribution, its central peak is lower and broader, and its tails are shorter and thinner.\n",
"\n",
"#### Leptokurtic curve\n",
"\n",
"-\tA distribution with kurtosis > 3 (**excess kurtosis** > 0) is called **leptokurtic**. \n",
"-\tAs compared to a normal distribution, its central peak is higher and sharper, and its tails are longer and fatter.\n",
"\n",
"\n",
"\n",
"### Summary\n",
"\n",
"\n",
"So far, we have looked at the measures of central tendency of the data which include `mean`, `median` and `mode`. Also, we have taken a look at measures of spread of the data which consists of `variance`, `standard deviation`, `interquartile range (IQR)`, `minimum` and `maximum` values. We have also discussed `skewness` and `kurtosis` as measures of shape. These quantities can only be used for quantitative variables not for categorical variables.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Dataset description\n",
"\n",
"\n",
"I have used the `fortune 500 dataset` for this project. I have downloaded this dataset from the data world website. This data set can be downloaded from the following url –\n",
"\n",
"\n",
"https://data.world/alexandra/fortune-500\n",
"\n",
"\n",
"The data set consists of revenue and profit figures of fortune 500 companies along with their rank.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Import libraries"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Ignore warnings"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import warnings\n",
"warnings.filterwarnings('ignore')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Import dataset"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"data = 'C:/datasets/fortune500.csv'\n",
"\n",
"df = pd.read_csv(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Exploratory data analysis\n",
"\n",
"\n",
"Now, I will explore the data to gain insights about the data."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### View dimensions of dataset"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(25500, 5)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that there are 25500 instances and 5 variables in the data set."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Preview the dataset"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Year</th>\n",
" <th>Rank</th>\n",
" <th>Company</th>\n",
" <th>Revenue (in millions)</th>\n",
" <th>Profit (in millions)</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1955</td>\n",
" <td>1</td>\n",
" <td>General Motors</td>\n",
" <td>9823.5</td>\n",
" <td>806</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1955</td>\n",
" <td>2</td>\n",
" <td>Exxon Mobil</td>\n",
" <td>5661.4</td>\n",
" <td>584.8</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1955</td>\n",
" <td>3</td>\n",
" <td>U.S. Steel</td>\n",
" <td>3250.4</td>\n",
" <td>195.4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1955</td>\n",
" <td>4</td>\n",
" <td>General Electric</td>\n",
" <td>2959.1</td>\n",
" <td>212.6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1955</td>\n",
" <td>5</td>\n",
" <td>Esmark</td>\n",
" <td>2510.8</td>\n",
" <td>19.1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Year Rank Company Revenue (in millions) Profit (in millions)\n",
"0 1955 1 General Motors 9823.5 806\n",
"1 1955 2 Exxon Mobil 5661.4 584.8\n",
"2 1955 3 U.S. Steel 3250.4 195.4\n",
"3 1955 4 General Electric 2959.1 212.6\n",
"4 1955 5 Esmark 2510.8 19.1"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### View summary of dataset"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 25500 entries, 0 to 25499\n",
"Data columns (total 5 columns):\n",
"Year 25500 non-null int64\n",
"Rank 25500 non-null int64\n",
"Company 25500 non-null object\n",
"Revenue (in millions) 25500 non-null float64\n",
"Profit (in millions) 25500 non-null object\n",
"dtypes: float64(1), int64(2), object(2)\n",
"memory usage: 996.2+ KB\n"
]
}
],
"source": [
"df.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Observations\n",
"\n",
"- We can see that the `Year` and `Rank` variables have integer data types as expected. The `Company` variable is of object data type. \n",
"\n",
"- The `Revenue (in millions)` variable is of float data type.\n",
"\n",
"- The `Profit (in millions)` variable is of object data type. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Check for missing values"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Year 0\n",
"Rank 0\n",
"Company 0\n",
"Revenue (in millions) 0\n",
"Profit (in millions) 0\n",
"dtype: int64"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.isnull().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The above command shows that there are no missing values in the dataset."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 8. Descriptive statistics with `describe()` function\n",
"\n",
"\n",
"\n",
"Descriptive or summary statistics in python – pandas, can be obtained by using the `describe()` function. The `describe()` function gives us the `count`, `mean`, `standard deviation(std)`, `minimum`, `Q1(25%)`, `median(50%)`, `Q3(75%)`, `IQR(Q3 - Q1)` and `maximum` values.\n",
"\n",
"\n",
"I will demonstrate the usage of `describe()` function as follows."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Summary statistics of numerical columns"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Year</th>\n",
" <th>Rank</th>\n",
" <th>Revenue (in millions)</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>25500.00000</td>\n",
" <td>25500.000000</td>\n",
" <td>25500.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>1980.00000</td>\n",
" <td>250.499765</td>\n",
" <td>4273.329635</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>14.71989</td>\n",
" <td>144.339963</td>\n",
" <td>11351.884979</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>1955.00000</td>\n",
" <td>1.000000</td>\n",
" <td>49.700000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>1967.00000</td>\n",
" <td>125.750000</td>\n",
" <td>362.300000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>1980.00000</td>\n",
" <td>250.500000</td>\n",
" <td>1019.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>1993.00000</td>\n",
" <td>375.250000</td>\n",
" <td>3871.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>2005.00000</td>\n",
" <td>500.000000</td>\n",
" <td>288189.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Year Rank Revenue (in millions)\n",
"count 25500.00000 25500.000000 25500.000000\n",
"mean 1980.00000 250.499765 4273.329635\n",
"std 14.71989 144.339963 11351.884979\n",
"min 1955.00000 1.000000 49.700000\n",
"25% 1967.00000 125.750000 362.300000\n",
"50% 1980.00000 250.500000 1019.000000\n",
"75% 1993.00000 375.250000 3871.000000\n",
"max 2005.00000 500.000000 288189.000000"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the `describe()` function excludes the character columns and gives summary statistics of numeric columns only."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Summary statistics of character columns\n",
"\n",
"\n",
"- The `describe()` function with an argument named `include` along with `value` object(include='object') gives the summary statistics of the character columns."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Company</th>\n",
" <th>Profit (in millions)</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>25500</td>\n",
" <td>25500</td>\n",
" </tr>\n",
" <tr>\n",
" <th>unique</th>\n",
" <td>1887</td>\n",
" <td>6977</td>\n",
" </tr>\n",
" <tr>\n",
" <th>top</th>\n",
" <td>CBS</td>\n",
" <td>N.A.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>freq</th>\n",
" <td>57</td>\n",
" <td>369</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Company Profit (in millions)\n",
"count 25500 25500\n",
"unique 1887 6977\n",
"top CBS N.A.\n",
"freq 57 369"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.describe(include=['object'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Summary statistics of all the columns\n",
"\n",
"\n",
"- The `describe()` function with include='all' gives the summary statistics of all the columns.\n",
"\n",
"\n",
"- We need to add a variable named include='all' to get the summary statistics or descriptive statistics of both numeric and character columns."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Year</th>\n",
" <th>Rank</th>\n",
" <th>Company</th>\n",
" <th>Revenue (in millions)</th>\n",
" <th>Profit (in millions)</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>25500.00000</td>\n",
" <td>25500.000000</td>\n",
" <td>25500</td>\n",
" <td>25500.000000</td>\n",
" <td>25500</td>\n",
" </tr>\n",
" <tr>\n",
" <th>unique</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1887</td>\n",
" <td>NaN</td>\n",
" <td>6977</td>\n",
" </tr>\n",
" <tr>\n",
" <th>top</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>CBS</td>\n",
" <td>NaN</td>\n",
" <td>N.A.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>freq</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>57</td>\n",
" <td>NaN</td>\n",
" <td>369</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>1980.00000</td>\n",
" <td>250.499765</td>\n",
" <td>NaN</td>\n",
" <td>4273.329635</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>14.71989</td>\n",
" <td>144.339963</td>\n",
" <td>NaN</td>\n",
" <td>11351.884979</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>1955.00000</td>\n",
" <td>1.000000</td>\n",
" <td>NaN</td>\n",
" <td>49.700000</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>1967.00000</td>\n",
" <td>125.750000</td>\n",
" <td>NaN</td>\n",
" <td>362.300000</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>1980.00000</td>\n",
" <td>250.500000</td>\n",
" <td>NaN</td>\n",
" <td>1019.000000</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>1993.00000</td>\n",
" <td>375.250000</td>\n",
" <td>NaN</td>\n",
" <td>3871.000000</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>2005.00000</td>\n",
" <td>500.000000</td>\n",
" <td>NaN</td>\n",
" <td>288189.000000</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Year Rank Company Revenue (in millions) \\\n",
"count 25500.00000 25500.000000 25500 25500.000000 \n",
"unique NaN NaN 1887 NaN \n",
"top NaN NaN CBS NaN \n",
"freq NaN NaN 57 NaN \n",
"mean 1980.00000 250.499765 NaN 4273.329635 \n",
"std 14.71989 144.339963 NaN 11351.884979 \n",
"min 1955.00000 1.000000 NaN 49.700000 \n",
"25% 1967.00000 125.750000 NaN 362.300000 \n",
"50% 1980.00000 250.500000 NaN 1019.000000 \n",
"75% 1993.00000 375.250000 NaN 3871.000000 \n",
"max 2005.00000 500.000000 NaN 288189.000000 \n",
"\n",
" Profit (in millions) \n",
"count 25500 \n",
"unique 6977 \n",
"top N.A. \n",
"freq 369 \n",
"mean NaN \n",
"std NaN \n",
"min NaN \n",
"25% NaN \n",
"50% NaN \n",
"75% NaN \n",
"max NaN "
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.describe(include='all')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 9. Computation of measures of central tendency \n",
"\n",
"\n",
"- In this section, I will compute the measures of central tendency - mean, median and mode. \n",
"\n",
"- These statistics give us a approximate value of the middle of a numeric variable.\n",
"\n",
"- I will use the `Revenue (in millions)` variable for calculations."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Mean"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"4273.32963529412\n"
]
}
],
"source": [
"mean = df['Revenue (in millions)'].mean()\n",
"\n",
"print(mean)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Median"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1019.0\n"
]
}
],
"source": [
"median = df['Revenue (in millions)'].median()\n",
"\n",
"print(median)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Mode"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 85.0\n",
"1 86.2\n",
"2 90.0\n",
"dtype: float64\n"
]
}
],
"source": [
"mode = df['Revenue (in millions)'].mode()\n",
"\n",
"print(mode)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Observation\n",
"\n",
"\n",
"- We can see that `mean > median > mode`. So, the distribution of `Revenue (in millions)` is positively skewed. I will plot its distribution to confirm the same."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Plot the distribution "
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0xb0caa9b38>"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"data = df['Revenue (in millions)']\n",
"\n",
"sns.distplot(data, bins=10, hist=True, kde=True, label = 'Revenue (in millions)')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The above plot confirms that the `Revenue (in millions)` is positively skewed."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 10. Computation of measures of dispersion or variability\n",
"\n",
"\n",
"- In this section, I will compute the measures of dispersion or variability - minimum and maximum values, range, variance, standard-deviation, IQR. \n",
"\n",
"- Again, I will use the `Revenue (in millions)` variable for calculations.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Minimum value"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"49.7"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Revenue (in millions)'].min()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Maximum value"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"288189.0"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Revenue (in millions)'].max()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Range"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"288139.3"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Revenue (in millions)'].max() - df['Revenue (in millions)'].min()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Variance"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"128865292.56794235"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Revenue (in millions)'].var()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Standard deviation"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"11351.88497862546"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Revenue (in millions)'].std()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Median (Q2 or 50th percentile)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1019.0"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Q2 = df['Revenue (in millions)'].quantile(0.5)\n",
"\n",
"Q2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Q3 or 75th percentile"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3871.0"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Q3 = df['Revenue (in millions)'].quantile(0.75)\n",
"\n",
"Q3"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Q1 or 25th percentile"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"362.3"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Q1 = df['Revenue (in millions)'].quantile(0.25)\n",
"\n",
"Q1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Interquartile Range\n"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3508.7"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"IQR = Q3 - Q1\n",
"\n",
"IQR"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Draw boxplot"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.boxplot(df['Revenue (in millions)'])\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 11. Computation of measures of shape of distribution\n",
"\n",
"\n",
"- In this section, I will compute the measures of shape of distribution - skewness and kurtosis. \n",
"\n",
"- Again, I will use the `Revenue (in millions)` variable for calculations.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Skewness\n"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"9.32673729580641"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Revenue (in millions)'].skew()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Interpretation\n",
"\n",
"I find the skewness to be 9.3267. So, it is greater than 1. Hence, we can conclude that the `Revenue (in millions)` data is highly skewed."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Kurtosis"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"132.04561027793167"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Revenue (in millions)'].kurt()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Interpretation\n",
"\n",
"I find the kurtosis to be 132.0456. So, it is greater than 3 and so excess kurtosis > 0. Hence, we can conclude that the `Revenue (in millions)` curve is a leptokurtic curve. As compared to a normal distribution, its central peak is higher and sharper, and its tails are longer and fatter."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 12. Results and conclusion\n",
"\n",
"\n",
"1.\tIn this project, I describe the descriptive statistics that are used to summarize a dataset. \n",
"2.\tIn particular, I have described the measures of central tendency (mean, median and mode). I have also described the measures of dispersion or variability (variance, standard deviation, coefficient of variation, minimum and maximum values, IQR) and measures of shape (skewness and kurtosis).\n",
"3.\tI have demonstrated how to calculate the summary statistics with `describe()` function.\n",
"4.\tI have computed the measures of central tendency-mean, median and mode for the `Revenue (in millions)`variable. I have found `mean > median > mode`. So, the distribution of `Revenue (in millions)` is positively skewed. I have plotted its distribution to confirm the same.\n",
"5.\tI have computed the measures of dispersion or variability-range, variance, standard-deviation, median and IQR for the `Revenue (in millions)`variable.\n",
"6.\tI have also computed the measures of shape-skewness and kurtosis for the `Revenue (in millions)`variable.\n",
"7.\tI find the skewness to be 9.3267. So, it is greater than 1. Hence, we can conclude that the `Revenue (in millions)` data is highly skewed.\n",
"8.\tI find the kurtosis to be 132.0456. So, it is greater than 3 and so excess kurtosis > 0. Hence, we can conclude that the `Revenue (in millions)` curve is a leptokurtic curve. As compared to a normal distribution, its central peak is higher and sharper, and its tails are longer and fatter.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}