
Massimo Albani (Mashimo)

  • Officina Mutante
@Mashimo
Mashimo / Random Forest
Last active April 29, 2018 22:38
Random forest
A single decision tree, tasked with learning a dataset, might not perform well due to outliers and the breadth and depth complexity of the data. So instead of relying on a single tree, random forests rely on a forest of cleverly grown decision trees. Each tree within the forest is allowed to become highly specialized in a specific area, but still retains some general knowledge about most areas. When a random forest classifies a sample, it is actually each tree in the forest working together to cast votes on what label that sample should be assigned.
Instead of sharing the entire dataset with each decision tree, the forest performs an operation which is essentially a train / test split of the training data. Each decision tree in the forest randomly samples from the overall training data set. By doing so, each tree exists in an independent subspace and the variation between trees is controlled. This technique is known as tree bagging, or bootstrap aggregating.
In addition to the tree bagging …
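The bagging-and-voting scheme above can be sketched with scikit-learn's RandomForestClassifier; this is a minimal sketch on synthetic data, and the dataset and parameter choices are illustrative assumptions, not taken from the gist:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic classification data, assumed purely for illustration.
X, y = make_classification(n_samples=300, n_features=8, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

# bootstrap=True is the tree bagging described above: each of the 100 trees
# is grown on a random sample, drawn with replacement, of the training data.
forest = RandomForestClassifier(n_estimators=100, bootstrap=True, random_state=7)
forest.fit(X_train, y_train)

# predict()/score() run the forest-wide vote: each tree casts a label,
# and the majority label wins.
score = forest.score(X_test, y_test)
```

Because each tree sees a different bootstrap sample, the individual trees disagree on hard cases, and the vote averages their errors out.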
Mashimo / Decision Tree
Last active April 29, 2018 22:39
Decision Tree
Decision trees are supervised, probabilistic machine learning classifiers that are often used as decision support tools. Like any other classifier, they are capable of predicting the label of a sample, and they do this by examining the probabilistic outcomes of the sample's features.
Decision trees are one of the oldest and most used machine learning algorithms, perhaps even pre-dating machine learning itself. They are very popular and have been around for decades. Following through with sequential cause-and-effect decisions comes very naturally.
Decision trees are a good tool to use when you want backing evidence to support a decision.
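A minimal sketch of such a classifier with scikit-learn's DecisionTreeClassifier; the iris dataset and the depth limit are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# max_depth limits how many sequential cause-and-effect decisions the tree
# may chain together before it commits to a label.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)

# Predict the label of the first sample by walking it down the tree.
pred = tree.predict(X[:1])
```

The fitted tree can be exported (e.g. with `sklearn.tree.export_text`) to show exactly which feature thresholds led to a decision, which is what makes it useful as backing evidence.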
Support vector machines are a set of supervised learning algorithms that you can use for classification, regression and outlier detection. scikit-learn has many classes for SVM usage, depending on your purpose. The one we'll be focusing on is the Support Vector Classifier, SVC.
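A minimal SVC sketch; the iris data, the linear kernel, and C=1.0 are assumed choices for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A linear-kernel Support Vector Classifier; the kernel and the
# regularisation strength C are the two knobs you would tune first.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X_train, y_train)
svc_score = clf.score(X_test, y_test)
```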
Mashimo / Regression
Last active April 29, 2018 22:37
Regression
Examples of regression models for prediction
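As one minimal example of such a prediction model, a linear regression on assumed toy data (y = 3x + 2, not taken from the gist):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noiseless toy data following y = 3x + 2.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + 2.0

reg = LinearRegression()
reg.fit(X, y)

# Predict the response for an unseen input, x = 10.
prediction = reg.predict([[10.0]])[0]
```

On noiseless data the fitted coefficients recover the generating line exactly, so the prediction at x = 10 is 32.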
Mashimo / Classification K-nearest neighbours
Last active April 29, 2018 22:37
Clustering supervised
Clustering groups samples that are similar into the same cluster.
Supervised: data samples have labels associated with them.
Uses the k-nearest-neighbours algorithm.
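A minimal k-nearest-neighbours sketch; the iris data and k=5 are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Each test sample is assigned the majority label of its 5 nearest
# labelled neighbours in the training set.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
knn_score = knn.score(X_test, y_test)
```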
Mashimo / Clustering unsupervised
Last active August 5, 2022 02:45
Clustering data
Clustering groups samples that are similar into the same cluster.
Unsupervised: no labels provided in the data samples.
Uses the K-means algorithm.
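A minimal K-means sketch on assumed synthetic data: two well-separated blobs, clustered without ever seeing a label:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated 2-D blobs; note that no labels are provided.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
               rng.normal(5.0, 0.3, size=(50, 2))])

# K-means iteratively moves 2 centroids and reassigns points to the
# nearest centroid until the assignments stabilise.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
```

With blobs this far apart, each blob ends up entirely in one cluster; on real data the choice of `n_clusters` is the hard part.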
Mashimo / Isomap
Last active November 16, 2020 10:54
Data dimensionality reduction via isomap
Isomap is a nonlinear dimensionality reduction method.
The algorithm provides a simple method for estimating the intrinsic geometry of a data manifold, based on a rough estimate of each data point's neighbours.
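A minimal Isomap sketch; the swiss-roll data and the neighbourhood size are assumed for illustration:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# A 3-D "swiss roll": a nonlinear manifold with only 2 intrinsic dimensions.
X, _ = make_swiss_roll(n_samples=300, random_state=0)

# n_neighbors is the rough neighbourhood estimate mentioned above; Isomap
# uses it to approximate geodesic distances along the manifold, then embeds
# the points in 2-D while preserving those distances.
embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
```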
Mashimo / PCA
Last active July 21, 2020 16:57
PCA - Principal Component Analysis
Principal Component Analysis
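A minimal PCA sketch; the iris data and the choice of 2 components are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Project the 4 measured features onto the 2 orthogonal directions of
# largest variance in the data.
pca = PCA(n_components=2)
X2 = pca.fit_transform(X)

# Fraction of the total variance the 2 retained components account for.
explained = pca.explained_variance_ratio_.sum()
```

Inspecting `explained_variance_ratio_` is the usual way to decide how many components to keep.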
Mashimo / data visualisation
Last active April 29, 2018 22:36
wheat seeds data visualisation
Check the data-visualisation-README file below.
Mashimo / readNHL.py
Last active April 29, 2018 22:39
Read NHL Historic Player Points Statistics
import pandas as pd

# Load up the table for the 2014-2015 season, and extract the dataset out of it.
# (The ESPN page layout, and hence the scraped columns, may have changed since
# this was written.)
url = "http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/2015/seasontype/2"
table_df = pd.read_html(url, header=1)[0]

# Columns get automatic names. Rename the columns so that they are similar to
# the column definitions on the website, e.g. (hypothetical mapping; the real
# automatic names depend on the scraped page):
# table_df = table_df.rename(columns={"Unnamed: 0": "RK"})
print(table_df.head())