@roycoding
Last active January 23, 2018 08:05
Beat the Benchmark: Forest Cover Type Prediction

Beating the Forest Cover Type Prediction benchmark

Day 4 of the Beat 5 Kaggle Benchmarks in 5 Days challenge

For the Forest Cover Type Prediction competition on Kaggle, the goal is to predict the predominant type of trees in a given section of forest. The score is based on average classification accuracy for the 7 different tree cover classes.

To beat the all-fir/spruce benchmark, I obviously tried a random forest. Using the default settings of scikit-learn's RandomForestClassifier, I beat the benchmark with an accuracy score of 0.72718 on the competition leaderboard. Using 100 estimators (versus the then-default of 10) raised the score to 0.75455.
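As a rough illustration of what a single-class benchmark measures, here is a minimal sketch in plain Python. The labels are toy values invented for this example, not competition data, and the assumption that class 1 stands for spruce/fir is mine. The `accuracy` helper is hypothetical and simply counts matching predictions.

```python
# Toy labels invented for illustration; they are not from the competition data.
# Assumption: class 1 stands for spruce/fir.
y_true = [1, 2, 1, 3, 7, 1, 2, 5]


def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# A single-class benchmark predicts the same cover type for every row.
benchmark_pred = [1] * len(y_true)
benchmark_acc = accuracy(y_true, benchmark_pred)
print(benchmark_acc)
```

Any model that gets more rows right than this constant prediction beats the benchmark on the same data.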

Random Forest Cover Types

Using pandas, I loaded the training and test data sets into Python. I then used every column except Id as a feature for the model, with Cover_Type as the target; conveniently, the features are all numerical. Here is the Python code for the scikit-learn random forest classifier:

import pandas as pd
from sklearn import ensemble
# model_selection replaces cross_validation, which was removed in scikit-learn 0.20
from sklearn import model_selection
from sklearn import metrics

# Load the training and test data sets
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# Create numpy arrays for use with scikit-learn
train_X = train.drop(['Id','Cover_Type'],axis=1).values
train_y = train.Cover_Type.values
test_X = test.drop('Id',axis=1).values

# Split the training set into training and validation sets (80% / 20%)
X,X_,y,y_ = model_selection.train_test_split(train_X,train_y,test_size=0.2)

# Train and predict with the random forest classifier
# (n_estimators defaulted to 10 in older scikit-learn; 0.22 and later default to 100)
rf = ensemble.RandomForestClassifier()
rf.fit(X,y)
y_rf = rf.predict(X_)
print(metrics.classification_report(y_,y_rf))
print(metrics.accuracy_score(y_,y_rf))

# Retrain with entire training set and predict test set.
rf.fit(train_X,train_y)
y_test_rf = rf.predict(test_X)

# Write to CSV (sorting the column names in descending order puts Id before Cover_Type)
pd.DataFrame({'Id':test.Id.values,'Cover_Type':y_test_rf})\
            .sort_index(ascending=False,axis=1).to_csv('rf1.csv',index=False)
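
The effect of the number of estimators can be checked on a validation split before submitting. The following is a minimal sketch, not the exact code behind the leaderboard scores: it uses a synthetic 7-class data set from `make_classification` as a stand-in for the competition files, and the sample sizes and `random_state` values are assumptions chosen only to make the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic 7-class stand-in for the competition data (an assumption for illustration).
X_all, y_all = make_classification(
    n_samples=2000,
    n_features=20,
    n_informative=10,
    n_classes=7,
    n_clusters_per_class=1,
    random_state=0,
)

# Hold out 20% of the rows for validation, as in the code above.
X, X_val, y, y_val = train_test_split(X_all, y_all, test_size=0.2, random_state=0)

# Fit one forest per setting and record the validation accuracy of each.
scores = {}
for n in (10, 100):
    rf = RandomForestClassifier(n_estimators=n, random_state=0)
    rf.fit(X, y)
    scores[n] = accuracy_score(y_val, rf.predict(X_val))
    print(n, scores[n])
```

On the real data, replace the synthetic arrays with `train_X` and `train_y`. Validation accuracy will not necessarily match the leaderboard score, but it gives a quick way to compare settings.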