Random forest
A single decision tree, tasked to learn a dataset, might not perform well due to outliers and the breadth and depth complexity of the data. So instead of relying on a single tree, random forests rely on a forest of cleverly grown decision trees. Each tree within the forest is allowed to become highly specialized in a specific area, but still retains some general knowledge about most areas. When a random forest classifies a sample, it is actually each tree in the forest working together to cast votes on what label that sample should be assigned.
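A rough sketch of that voting step (the trees and labels here are made up for illustration):

from collections import Counter

# Hypothetical votes cast by three trees for a single sample:
tree_votes = ['sitting', 'walking', 'sitting']

# The forest's prediction is the label backed by the most trees.
label, n_votes = Counter(tree_votes).most_common(1)[0]
print(label, n_votes)  # sitting 2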
Instead of sharing the entire dataset with each decision tree, the forest performs an operation which is essentially a train / test split of the training data. Each decision tree in the forest randomly samples from the overall training dataset. Through doing so, each tree exists in an independent subspace and the variation between trees is controlled. This technique is known as tree bagging, or bootstrap aggregating.
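A minimal sketch of one tree's bootstrap draw (the sample size here is arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n = 1000  # hypothetical number of training samples

# One tree's bootstrap sample: n draws *with* replacement. Repeats are
# expected, and the rows never drawn become that tree's out-of-bag set
# (on average about 1/e, roughly 37%, of the data).
in_bag = rng.choice(n, size=n, replace=True)
out_of_bag = np.setdiff1d(np.arange(n), in_bag)
print(len(np.unique(in_bag)), len(out_of_bag))  # roughly 632 and 368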
In addition to the tree bagging of training samples at the forest level, each individual decision tree further 'feature bags' at each node-branch split. This is helpful because some datasets contain a feature that is highly correlated with the target (the 'y' label). By selecting a random sampling of features at every split, if such a feature were to exist, it wouldn't show up on as many branches of the tree and there would be more diversity in the features examined.
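In scikit-learn (used in the script below) this per-split feature sampling is exposed as the max_features parameter; a sketch:

from sklearn.ensemble import RandomForestClassifier

# 'sqrt' means each split considers only sqrt(n_features) randomly
# chosen candidate features instead of all of them.
model = RandomForestClassifier(n_estimators=100, max_features='sqrt')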
Since each tree within the forest is only trained using a subset of the overall training set, the forest ensemble has the ability to error-test itself. It does this by scoring each tree's predictions against that tree's out-of-bag samples. A tree's out-of-bag samples are those forest training samples that were withheld from that specific tree during training.
One of the advantages of using the out-of-bag error is that it eliminates the need for you to split your data into training / testing sets before feeding it into the forest model, since that is part of the forest algorithm. However, the out-of-bag error metric often underestimates the actual performance improvement and the optimal number of training iterations.
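A self-contained sketch on toy data (the dataset and sizes here are arbitrary), showing the OOB score coming straight out of a fitted forest:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data, purely illustrative
X_demo, y_demo = make_classification(n_samples=500, random_state=0)

forest = RandomForestClassifier(n_estimators=50, oob_score=True,
                                random_state=0)
forest.fit(X_demo, y_demo)

# Each sample is scored only by the trees that never saw it during
# training, giving an honest estimate without a separate hold-out set.
print("OOB score:", forest.oob_score_)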
"""
Predict human activity by looking at data from wearables.
Train a random forest against a public domain Human Activity Dataset titled
Wearable Computing: Accelerometers' Data Classification of Body Postures and
Movements, containing 165,633 records, one of which is invalid.
Within the dataset, there are five target activities:
Sitting
Sitting Down
Standing
Standing Up
Walking
These activities were captured from four people wearing accelerometers mounted
on their waist, left thigh, right arm, and right ankle.
"""
import pandas as pd
import time
# Grab the DLA HAR dataset from:
# http://groupware.les.inf.puc-rio.br/har
# http://groupware.les.inf.puc-rio.br/static/har/dataset-har-PUC-Rio-ugulino.zip
#
# : Load up the dataset into dataframe 'X'
#
X = pd.read_csv("Datasets/dataset-har-PUC-rio-ugulino.csv", sep=';')
#
# : Encode the gender column, 0 as male, 1 as female
#
X.gender = X.gender.map({'Woman':1, 'Man':0})
#
# : Clean up any column with commas in it
# so that they're properly represented as decimals instead
#
X.how_tall_in_meters = X.how_tall_in_meters.str.replace(',','.').astype(float)
X.body_mass_index = X.body_mass_index.str.replace(',','.').astype(float)
#
# INFO: Check data types
print (X.dtypes)
# column z4 is type "object". Something is wrong with the dataset.
#
# : Convert that column into numeric
# Use errors='raise'. This will alert you if something ends up being
# problematic
#
#X.z4 = pd.to_numeric(X.z4, errors='coerce')
#print (X[pd.isnull(X).any(axis=1)])
# 122076 --> z4 = -14420-11-2011 04:50:23.713
#
# INFO: This is a wrongly coded record; drop it before calling the
# to_numeric method ...
X.drop(X.index[[122076]], inplace=True)
# Alternative: fix the value in place instead, e.g. X.at[122076, 'z4'] = -144
X.z4 = pd.to_numeric(X.z4, errors='raise')
print (X.dtypes)
# everything ok now
# Activity to predict is in "class" column
# : Encode 'y' value as a dummies version of dataset's "class" column
#
y = pd.get_dummies(X['class'].copy())
# this produces a 5 column wide dummies dataframe as the y value
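# NOTE: passing a one-hot (dummies) y makes scikit-learn treat this as a
# multi-output problem; model.score() then reports exact-match accuracy.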
#
# : Get rid of the user and class columns in X
#
X.drop(['class','user'], axis=1, inplace=True)
print (X.describe())
#
# INFO: An easy way to show which rows have nans in them
print (X[pd.isnull(X).any(axis=1)])
# no NANs
#
# : Create an RForest classifier 'model'
#
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=30, max_depth= 10, random_state=0,
oob_score=True)
#
# : Split data into test / train sets
#
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=7)
print ("Fitting...")
s = time.time()
model.fit(X_train, y_train)
print("Fitting completed in: ", time.time() - s)
#
# INFO: Display the OOB Score of data
score = model.oob_score_
print ("OOB Score: ", round(score*100, 3))
print ("Scoring...")
s = time.time()
score = model.score(X_test, y_test)
print ("Score: ", round(score*100, 3))
print ("Scoring completed in: ", time.time() - s)