@mepsrajput
Last active September 13, 2020 09:11
Statistics Notes for Data Science and ML

Exploratory data analysis

anecdotal evidence: Evidence, often personal, that is collected casually rather than by a well-designed study.

population: A group we are interested in studying. “Population” often refers to a group of people, but the term is used for other subjects, too.

cross-sectional study: A study that collects data about a population at a particular point in time.

cycle: In a repeated cross-sectional study, each repetition of the study is called a cycle.

longitudinal study: A study that follows a population over time, collecting data from the same group repeatedly.

record: In a dataset, a collection of information about a single person or other subject.

respondent: A person who responds to a survey.

sample: The subset of a population used to collect data.

representative: A sample is representative if every member of the population has the same chance of being in the sample.

oversampling: The technique of increasing the representation of a sub-population in order to avoid errors due to small sample sizes.

raw data: Values collected and recorded with little or no checking, calculation or interpretation.

recode: A value that is generated by calculation and other logic applied to raw data.

data cleaning: Processes that include validating data, identifying errors, translating between data types and representations, etc.

distribution: The values that appear in a sample and the frequency of each.

histogram: A mapping from values to frequencies, or a graph that shows this mapping.

frequency: The number of times a value appears in a sample.

mode: The most frequent value in a sample, or one of the most frequent values.

normal distribution (Gaussian distribution): An idealization of a bell-shaped distribution.

uniform distribution: A distribution in which all values have the same frequency.

outlier: A value far from the central tendency.
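The histogram, frequency and mode entries above can be sketched in a few lines of Python. This is only an illustration: the sample values are made up (they are not from these notes), and `collections.Counter` is one convenient way, among others, to build the value-to-frequency mapping.

```python
from collections import Counter

# Hypothetical sample, chosen so that two values tie for most frequent.
sample = [1, 2, 2, 3, 3, 5]

# Histogram: a mapping from each value to its frequency.
hist = Counter(sample)

# Frequency: the number of times one particular value appears.
freq_of_2 = hist[2]

# Mode(s): every value whose frequency equals the highest frequency.
max_freq = max(hist.values())
modes = sorted(value for value, freq in hist.items() if freq == max_freq)

print(dict(hist))  # {1: 1, 2: 2, 3: 2, 5: 1}
print(modes)       # [2, 3]
```

Because 2 and 3 both appear twice, the sample has two modes, which matches the "one of the most frequent values" wording in the definition.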

Types Of Analysis

Quantitative Analysis: Also called Statistical Analysis; the science of collecting and interpreting data with numbers and graphs to identify patterns and trends.

Qualitative Analysis: Also called Non-Statistical Analysis; it gives general, descriptive information drawn from text, sound and other forms of media.

Categories In Statistics

Descriptive Statistics: Descriptive Statistics uses the data to provide descriptions of the population, either through numerical calculations or graphs or tables.

  • Descriptive Statistics helps organize data and summarizes its characteristics with measures such as the mean and variance.

Inferential Statistics: Inferential Statistics makes inferences and predictions about a population based on a sample of data taken from the population in question.

  • Inferential Statistics generalizes from a sample to the larger population and applies probability to arrive at a conclusion. It allows you to infer population parameters from sample statistics and to build models on them.

Mean (x̄)

  • The mean is the average of the numbers, i.e. (sum of numbers) / (count of numbers): $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$
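As a small sketch of the definition above, the mean can be computed by hand or with the standard library's `statistics.fmean`. The list `values` is a hypothetical example, not data from these notes.

```python
from statistics import fmean

# Hypothetical values for illustration.
values = [2, 4, 4, 10]

# (sum of numbers) / (count of numbers)
mean_manual = sum(values) / len(values)

# The same result from the standard library (always returns a float).
mean_lib = fmean(values)

print(mean_manual)  # 5.0
print(mean_lib)     # 5.0
```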

Variance / Mean Squared Deviation (σ^2)

  • Describes the variability or spread of a data distribution.
  • The average of the squared differences from the mean: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2$

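Population Variance is the average of the squared deviations from the population mean μ (divide by N).

Sample Variance is the sum of the squared differences from the sample mean x̄ divided by N - 1 rather than N, which corrects for the bias introduced by estimating the mean from the same sample.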
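The difference between the two variances is only the divisor. The sketch below assumes the usual convention, consistent with the SD formulas in these notes, that the population variance divides by N and the sample variance by N - 1. The helper `variance_of` and the list `data` are hypothetical names and values used for illustration; `statistics.pvariance` and `statistics.variance` are the standard-library equivalents.

```python
from statistics import pvariance, variance

def variance_of(data, sample=False):
    """Population variance by default; sample variance (N - 1) if sample=True."""
    n = len(data)
    mean = sum(data) / n
    # Sum of squared differences from the mean.
    squared_devs = sum((x - mean) ** 2 for x in data)
    return squared_devs / (n - 1 if sample else n)

# Hypothetical data: mean is 5, squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

pop_var = variance_of(data)                 # 32 / 8 = 4.0
samp_var = variance_of(data, sample=True)   # 32 / 7, about 4.571

print(pop_var, pvariance(data))
print(samp_var, variance(data))
```

The sample variance is always slightly larger than the population variance computed on the same values, and the gap shrinks as N grows.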

Standard Deviation (σ)

  • The Standard Deviation is a measure of how spread out numbers are.
  • In other words, it is the square root of the Variance.

Population SD: $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$

Sample SD: $s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2}$
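Both formulas have standard-library counterparts, `statistics.pstdev` (population) and `statistics.stdev` (sample). The following sketch uses a hypothetical list `scores` to show that each SD is the square root of the matching variance.

```python
import math
from statistics import pstdev, pvariance, stdev, variance

# Hypothetical data: mean is 3, squared deviations sum to 10.
scores = [1, 2, 3, 4, 5]

pop_sd = pstdev(scores)    # sqrt(10 / 5), about 1.414
sample_sd = stdev(scores)  # sqrt(10 / 4), about 1.581

# Each SD is the square root of the corresponding variance.
pop_matches = math.isclose(pop_sd, math.sqrt(pvariance(scores)))
sample_matches = math.isclose(sample_sd, math.sqrt(variance(scores)))

print(round(pop_sd, 3), round(sample_sd, 3))  # 1.414 1.581
print(pop_matches, sample_matches)            # True True
```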

Calculation of Variance & SD

  • Data: 15, 16, 18, 19, 22, 24, 29, 30, 34
  • Mean: 207 / 9 = 23
  • Distances from Mean (|Mean - Data Point|): 8, 7, 5, 4, 1, 1, 6, 7, 11
  • Squaring & Adding: 8^2 + 7^2 + ... + 11^2 = 362
  • Population Variance = 362 / 9 = 40.22
  • Population SD = root (40.22) = 6.34
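The calculation can be checked step by step in Python. Recomputing from the nine data points gives a mean of 207 / 9 = 23, which is the value the listed distances are measured from. The sample variance and sample SD at the end are an addition for comparison (dividing by N - 1 = 8); the notes themselves only work through the population figures.

```python
import math

# Data from the worked example.
data = [15, 16, 18, 19, 22, 24, 29, 30, 34]
n = len(data)

mean = sum(data) / n                       # 207 / 9 = 23.0
distances = [abs(mean - x) for x in data]  # 8, 7, 5, 4, 1, 1, 6, 7, 11
sum_sq = sum(d ** 2 for d in distances)    # 362.0

pop_var = sum_sq / n                       # about 40.22
pop_sd = math.sqrt(pop_var)                # about 6.34

# For comparison only: the sample versions divide by n - 1.
sample_var = sum_sq / (n - 1)              # 45.25
sample_sd = math.sqrt(sample_var)          # about 6.73

print(mean, sum_sq)
print(round(pop_var, 2), round(pop_sd, 2))
print(round(sample_var, 2), round(sample_sd, 2))
```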
