Beating the Random Acts of Pizza Benchmark

The Random Acts of Pizza competition is about predicting whether a request for a free pizza on the Random Acts of Pizza subreddit will be granted. The benchmark is simply guessing that no pizzas are given (or that all are), which results in an AUC score of 0.5.

To beat the AUC = 0.5 benchmark with a simple model, I first looked at the training and test data for simple features. I decided to use the word counts of the request title and request text, as longer requests might be skipped by readers.
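
As a quick sanity check of that baseline (with made-up labels, not the competition data): any constant prediction collapses the ROC curve onto the diagonal, so its AUC is 0.5 regardless of the label distribution.

import numpy as np
from sklearn import metrics

# Hypothetical labels: 1 = pizza received, 0 = not
y_true = np.array([0, 1, 0, 0, 1, 0])

# Predict the same score for every request ("no pizza given")
y_const = np.zeros_like(y_true)

print(metrics.roc_auc_score(y_true, y_const))  # 0.5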

To build the model, I first extracted only the desired fields from the original JSON files with jq and used json2csv to write out pipe-delimited CSV files.

cat train.json | jq -c '.[]' | json2csv -p -k=request_id,requester_received_pizza,request_title,request_text_edit_aware -d="|" > train_1.csv

cat test.json | jq -c '.[]' | json2csv -p -k=request_id,request_title,request_text_edit_aware -d="|" > test_1.csv
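
If you would rather stay in Python for this step, roughly the same extraction can be sketched with pandas (assuming, as the jq call above implies, that each file is a plain JSON array of request objects):

import json
import pandas as pd

# Load the raw JSON array and keep only the fields used below
with open('train.json') as f:
    train_raw = pd.DataFrame(json.load(f))

cols = ['request_id', 'requester_received_pizza',
        'request_title', 'request_text_edit_aware']
train_raw[cols].to_csv('train_1.csv', sep='|', index=False)

The same pattern works for test.json, minus the requester_received_pizza column.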

I then built a very basic random forest model using the default settings in scikit-learn. With an 80/20 train/validation split, I achieved a local AUC of about 0.52 (from a single validation split). Using the entire training set to build a random forest, I scored an AUC of 0.51274 on the competition leaderboard.

Not great, but not bad for a very simple model.

import pandas as pd
from sklearn import cross_validation  # train_test_split moved to sklearn.model_selection in newer scikit-learn versions
from sklearn import ensemble
from sklearn import metrics

train = pd.read_csv('train_1.csv', delimiter='|')
test = pd.read_csv('test_1.csv', delimiter='|')

# Create text word count and title word count fields and binarize pizza received
train.requester_received_pizza = train.requester_received_pizza.apply(lambda x: 1 if x else 0)
train['title_count'] = train.request_title.apply(lambda x: len(x.split()))
train['text_count'] = train.request_text_edit_aware.apply(lambda x: len(str(x).split()))

test['title_count'] = test.request_title.apply(lambda x: len(x.split()))
test['text_count'] = test.request_text_edit_aware.apply(lambda x: len(str(x).split()))

# Create training and testing arrays as well as validation splits
train_X = train.drop(['request_id', 'requester_received_pizza', 'request_title', 'request_text_edit_aware'], axis=1).values
train_y = train.requester_received_pizza.values
X, X_, y, y_ = cross_validation.train_test_split(train_X, train_y, test_size=0.2)

test_X = test.drop(['request_id', 'request_title', 'request_text_edit_aware'], axis=1).values

# Train and test random forest model
rf = ensemble.RandomForestClassifier()  # Default values
rf.fit(X, y)
y_rf = rf.predict(X_)  # Hard 0/1 predictions
print(metrics.roc_auc_score(y_, y_rf))

# Train model with full training data and predict test y's
rf.fit(train_X, train_y)
y_test_rf = rf.predict(test_X)

# Write submission file
test_out = pd.DataFrame({'request_id': test.request_id.values, 'requester_received_pizza': y_test_rf.astype('int')})
test_out.to_csv('rf1.csv', index=False)
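
One refinement worth trying (not the submission scored above): AUC only depends on how the predictions rank the requests, so submitting the positive-class probability from predict_proba instead of hard 0/1 labels will generally score at least as well. A sketch, using a hypothetical alternative output file:

# Hypothetical alternative submission: rank requests by predicted probability of receiving a pizza
y_test_proba = rf.predict_proba(test_X)[:, 1]
test_out = pd.DataFrame({'request_id': test.request_id.values, 'requester_received_pizza': y_test_proba})
test_out.to_csv('rf1_proba.csv', index=False)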