
@jaeddy
Last active June 10, 2022 14:35

Data science / big data techniques, described in 40 words or less

Collection of common data science terms, tools, and concepts with definitions, as assembled by Vincent Granville in an AnalyticBridge blog post (accessed 07/24/2014).

Adjusted R^2 (R-Square)

The method preferred by statisticians for determining which variables to include in a model. It is a modified version of R^2 that penalizes each new variable on the basis of how many have already been admitted. By construction, R^2 never decreases as you add new variables, which results in models that over-fit the data and have poor predictive ability. Adjusted R^2 yields more parsimonious models that admit a new variable only if the improvement in fit is larger than the penalty, which serves the ultimate goal of out-of-sample prediction. (Submitted by Santiago Perez)
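
As a rough illustration of the penalty, the sketch below uses the standard textbook formula, 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The function name and the sample numbers are made up for this example and are not from the original post.

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Standard adjusted R^2: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n_obs - n_predictors - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical numbers: a 4th variable lifts R^2 only from 0.900 to 0.901.
before = adjusted_r2(0.900, n_obs=20, n_predictors=3)
after = adjusted_r2(0.901, n_obs=20, n_predictors=4)

# The tiny gain in fit does not cover the penalty, so the adjusted value drops.
print(before > after)
```

Plain R^2 would favor the larger model here, while the adjusted value favors the smaller one.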

Cluster Analysis

Methods to assign a set of objects into groups. These groups are called clusters, and objects in a cluster are more similar to each other than to those in other clusters. Well-known algorithms include hierarchical clustering, k-means, fuzzy clustering, and supervised clustering. (submitted by Markus Schmidberger)

Decision Trees

A tree of questions to guide an end user to a conclusion based on values from a single vector of data. The classic example is a medical diagnosis based on a set of symptoms for a particular patient. A common problem in data science is to automatically or semi-automatically generate decision trees based on large sets of data coupled to known conclusions. Example algorithms are CART and ID3. (Submitted by Michael Malak)
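
To make the "tree of questions" concrete, here is a minimal hand-written sketch. It does not learn a tree from data the way CART or ID3 would. It only shows how a finished tree is walked for a single record. The symptoms, conclusions, and names (`TREE`, `decide`) are invented for illustration.

```python
# Each internal node is (question_key, subtree_if_yes, subtree_if_no);
# each leaf is a plain string holding the conclusion.
# The symptoms and conclusions are invented for illustration only.
TREE = (
    "fever",
    ("cough", "suspect flu", "suspect other infection"),
    ("sneezing", "suspect allergy", "no finding"),
)


def decide(node, record):
    """Walk the tree, answering each question from one record (a dict of booleans)."""
    while not isinstance(node, str):
        key, if_yes, if_no = node
        node = if_yes if record[key] else if_no
    return node


patient = {"fever": True, "cough": False, "sneezing": False}
print(decide(TREE, patient))
```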

Factor Analysis

Used as a variable reduction technique to identify groups of clustered variables. (submitted by Vincent Granville)

Goodness of Fit

The degree to which the predicted values created by a model minimize errors in cross-validation tests. However, over-fitting the data can be dangerous, as it results in a model that has little or no predictive power on fresh data. True goodness of fit is determined by how well the model fits new data, i.e., its predictive ability. (submitted by Santiago Perez)

Hadoop

Hadoop is an open-source framework that supports large-scale data analysis by allowing one to decompose questions into discrete chunks that can be executed independently, very close to slices of the data in question, and ultimately reassembled into an answer to the question posed. (submitted by Philip Best)

K-Means

Popular clustering algorithm that, for a given (a priori) K, finds K clusters by iteratively moving each cluster center to the center of gravity of its cluster and then updating the cluster assignments. (Submitted by Michael Malak)
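
The loop described above can be sketched in plain Python. This is a minimal version of the usual iteration (often called Lloyd's algorithm), assuming Euclidean distance and a naive seeding that takes the first K points as initial centers. Both are simplifications chosen for this example, and production code would normally use a library implementation.

```python
import math


def kmeans(points, k, iterations=100):
    """Minimal K-means on a list of equal-length tuples."""
    centers = [points[i] for i in range(k)]  # naive seeding: first k points
    assignment = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the centers are stable
        assignment = new_assignment
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, assignment


data = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centers, labels = kmeans(data, k=2)
print(labels)
```

The result depends on the starting centers, which is why real implementations usually try several random seedings.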

Mahout

Apache Mahout is an Apache project to produce free implementations of distributed or otherwise scalable machine learning algorithms, focused primarily on collaborative filtering, clustering, and classification, often leveraging, but not limited to, the Hadoop platform.

MapReduce

Model for processing large amounts of data efficiently. Original problem is "mapped" to smaller problems (which may themselves become "original" problems). Smaller problems are processed in parallel. Results of smaller problems are combined, or "reduced", into solution to original problem. (submitted by Melanie Jutras)
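
A word count is the customary toy example of this model. The sketch below runs everything in one process to show the map, group, and reduce stages. It is not Hadoop code, and the function names (`map_phase`, `shuffle`, `reduce_phase`) are chosen for this example only.

```python
from collections import defaultdict
from functools import reduce


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group emitted values by key so each key can be reduced on its own."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine each key's values into a single result."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}


chunks = ["big data big", "data science", "big science"]
# In a real cluster each chunk would be mapped on a different machine;
# here the map calls simply run one after another.
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
print(counts)
```

Because each chunk is mapped independently and each key is reduced independently, both stages can be spread across many workers.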

Monte-Carlo Simulations

Computing expectations and probabilities in models of random phenomena using many randomly sampled values. Akin to estimating the probability of winning a given roulette bet (say black) by repeatedly placing it and counting the success ratio. Useful in complex models characterized by uncertainty. (submitted by Renato Vitolo)
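
The roulette example can be simulated directly. The sketch below assumes a European wheel (37 pockets, 18 of them black), which the original entry does not specify. The function name and the fixed seed are likewise choices made for this example.

```python
import random


def estimate_black_probability(n_spins, seed=0):
    """Estimate P(black) by spinning a simulated wheel n_spins times."""
    rng = random.Random(seed)
    # Assumed wheel: pockets 0..36, of which 18 are black (European layout).
    # Which pocket numbers count as black does not matter for the estimate,
    # so pockets 1..18 stand in for the black ones.
    wins = sum(1 for _ in range(n_spins) if 1 <= rng.randrange(37) <= 18)
    return wins / n_spins


estimate = estimate_black_probability(200_000)
exact = 18 / 37
print(round(estimate, 3), round(exact, 3))
```

Here the exact answer is known, so the simulation serves only as a check. The technique is most useful when no closed-form answer is available.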

NoSQL

"Not only SQL" refers to a group of database management systems. Data is not stored in tables as in a relational database, and the model is not based on mathematical relationships between tables. It is a way of storing and retrieving unstructured data quickly. (submitted by Markus Schmidberger)

Multidimensional Scaling

Reduce the dimension of the space by projecting an N×N (N = number of observations) similarity or distance matrix into a 2-dimensional visual representation. A classical example is producing a geographic map of cities when the only data available is the travel time between each pair of cities. (submitted by Vincent Granville)

Pig

Pig is a scripting interface to Hadoop, meaning a lack of MapReduce programming experience won't hold you back. It's also known for being able to process a large variety of different data types.

Stepwise Regression

Variable selection process for multivariate regression. In forward stepwise selection, a seed variable is selected and each additional variable is entered into the model, but only kept if it significantly improves goodness of fit (as measured by increases in R^2). Backward selection starts with all variables and removes them one by one until removing an additional one decreases R^2 by a non-trivial amount. Two deficiencies of this method are that the seed chosen disproportionately impacts which variables are kept, and that the decision is made using R^2, not Adjusted R^2. (submitted by Santiago Perez)

Time Series

A set of (t, x) values where x is usually a scalar (though could be a vector) and the t values are usually sampled at regular intervals (though some time series are irregularly sampled). In the case of regularly sampled time series, the t is usually dropped from the actual data, replaced with just a t0 (start time) and delta-t that apply to the whole series. (Submitted by Michael Malak)
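
The compact form for regularly sampled data might look like the following sketch, in which the class and field names (`RegularSeries`, `t0`, `delta_t`, `values`) are illustrative. Only the start time, the step, and the x values are stored, and the (t, x) pairs are rebuilt when needed.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass
class RegularSeries:
    """Regularly sampled series: one start time and one step for all values."""
    t0: float
    delta_t: float
    values: List[float]

    def points(self) -> Iterator[Tuple[float, float]]:
        """Rebuild the explicit (t, x) pairs on demand."""
        for i, x in enumerate(self.values):
            yield (self.t0 + i * self.delta_t, x)


series = RegularSeries(t0=100.0, delta_t=5.0, values=[1.2, 1.5, 1.1])
print(list(series.points()))  # [(100.0, 1.2), (105.0, 1.5), (110.0, 1.1)]
```

An irregularly sampled series cannot be compressed this way and has to keep every t explicitly.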
