Skip to content

Instantly share code, notes, and snippets.

@jspoelstra
Created April 28, 2017 05:59
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save jspoelstra/2d616576eb650d5ba109fd8e864a352e to your computer and use it in GitHub Desktop.
Save jspoelstra/2d616576eb650d5ba109fd8e864a352e to your computer and use it in GitHub Desktop.
Titanic data exploration and age imputation
Display the source blob
Display the rendered blob
Raw
{"nbformat_minor": 0, "metadata": {"kernelspec": {"language": "R", "name": "r", "display_name": "R"}, "language_info": {"codemirror_mode": "r", "file_extension": ".r", "version": "3.1.1", "name": "R", "pygments_lexer": "r", "mimetype": "text/x-r-source"}}, "cells": [{"source": "## Exploring the Titanic Dataset\nClosely following the excellent notebook published by Curt Wehrley here: (https://github.com/wehrley/wehrley.github.io/blob/master/SOUPTONUTS.md)\n\nFirst load the AzureML-imported dataset:", "cell_type": "markdown", "metadata": {}}, {"source": "library(\"AzureML\")\nws <- workspace()\ndat <- download.datasets(ws, \"Titanic Passengers\")", "cell_type": "code", "metadata": {"collapsed": true}, "execution_count": null, "outputs": []}, {"source": "Take a quick look at the first few lines", "cell_type": "markdown", "metadata": {}}, {"source": "head(dat)", "cell_type": "code", "metadata": {"collapsed": false}, "execution_count": null, "outputs": []}, {"source": "### Missing Data", "cell_type": "markdown", "metadata": {}}, {"source": "We first install some required packages", "cell_type": "markdown", "metadata": {}}, {"source": "install.packages(\"Amelia\")\ninstall.packages(\"corrgram\")", "cell_type": "code", "metadata": {"collapsed": false}, "execution_count": null, "outputs": []}, {"source": "Plot all missing values", "cell_type": "markdown", "metadata": {}}, {"source": "require(Amelia)\nmissmap(dat, \n main=\"Titanic Training Data - Missings Map\", \n col=c(\"yellow\", \"black\"), \n legend=FALSE)", "cell_type": "code", "metadata": {"collapsed": false}, "execution_count": null, "outputs": []}, {"source": "## Visualization\nLet's take a quick look at some of our columns by survival rate:", "cell_type": "markdown", "metadata": {}}, {"source": "barplot(table(dat$survived),\n names.arg = c(\"Perished\", \"Survived\"),\n main=\"Survived (passenger fate)\", col=\"black\")\nbarplot(table(dat$pclass), \n names.arg = c(\"first\", \"second\", \"third\"),\n main=\"Pclass (passenger traveling class)\", col=\"firebrick\")\nbarplot(table(dat$sex), main=\"Sex (gender)\", col=\"darkviolet\")\nhist(dat$age, main=\"Age\", xlab = NULL, col=\"brown\")\nbarplot(table(dat$sibsp), main=\"SibSp (siblings + spouse aboard)\", \n col=\"darkblue\")\nbarplot(table(dat$parch), main=\"ParCh (parents + kids aboard)\", \n col=\"gray50\")\nhist(dat$fare, main=\"Fare (fee paid for ticket[s])\", xlab = NULL, \n col=\"darkgreen\")\nbarplot(table(dat$embarked),\n names.arg = c(\"Missing\", \"Cherbourg\", \"Queenstown\", \"Southampton\"),\n main=\"Embarked (port of embarkation)\", col=\"sienna\")", "cell_type": "code", "metadata": {"scrolled": true, "collapsed": false}, "execution_count": null, "outputs": []}, {"source": "Let's look at this again, by odds of survival.\n#### Ticket Class", "cell_type": "markdown", "metadata": {}}, {"source": "mosaicplot(dat$pclass ~ dat$survived, \n main=\"Passenger Fate by Traveling Class\", shade=FALSE, \n color=TRUE, xlab=\"Pclass\", ylab=\"Survived\")", "cell_type": "code", "metadata": {"collapsed": false}, "execution_count": null, "outputs": []}, {"source": "#### Gender", "cell_type": "markdown", "metadata": {}}, {"source": "mosaicplot(dat$sex ~ dat$survived, \n main=\"Passenger Fate by Gender\", shade=FALSE, color=TRUE, \n xlab=\"Sex\", ylab=\"Survived\")", "cell_type": "code", "metadata": {"collapsed": false}, "execution_count": null, "outputs": []}, {"source": "#### Age", "cell_type": "markdown", "metadata": {}}, {"source": "boxplot(dat$age ~ dat$survived, \n main=\"Passenger Fate by Age\",\n xlab=\"Survived\", ylab=\"Age\")", "cell_type": "code", "metadata": {"collapsed": false}, "execution_count": null, "outputs": []}, {"source": "mosaicplot(cut(dat$age, breaks=seq(0,80,10)) ~ dat$survived,\n shade=FALSE, color=TRUE,\n main=\"Passenger Fate by Age\",\n xlab=\"Age\", ylab=\"Survived\")", "cell_type": "code", "metadata": {"collapsed": false}, "execution_count": null, "outputs": []}, {"source": "#### Port of Embarcation", "cell_type": "markdown", "metadata": {}}, {"source": "mosaicplot(dat$embarked ~ dat$survived, \n main=\"Passenger Fate by Port of Embarkation\",\n shade=FALSE, color=TRUE, xlab=\"Embarked\", ylab=\"Survived\")", "cell_type": "code", "metadata": {"collapsed": false}, "execution_count": null, "outputs": []}, {"source": "## Correlation between features", "cell_type": "markdown", "metadata": {}}, {"source": "require(corrgram)\nrequire(plyr)\ncorrgram.data <- dat\n## change features of factor type to numeric type for inclusion on correlogram\ncorrgram.data$survived <- as.numeric(corrgram.data$survived)\ncorrgram.data$pclass <- as.numeric(corrgram.data$pclass)\ncorrgram.data$embarked <- revalue(corrgram.data$embarked, \n c(\"C\" = 1, \"Q\" = 2, \"S\" = 3))\n## generate correlogram\ncorrgram.vars <- c(\"survived\", \"pclass\", \"sex\", \"age\", \n \"sibsp\", \"parch\", \"fare\", \"embarked\")\ncorrgram(corrgram.data[,corrgram.vars], order=FALSE, \n lower.panel=panel.ellipse, upper.panel=panel.pie, \n text.panel=panel.txt, main=\"Titanic Training Data\")", "cell_type": "code", "metadata": {"collapsed": false}, "execution_count": null, "outputs": []}, {"source": "# Dealing with Missing Age values\nFollowing the sophisticated imputation of missing ages", "cell_type": "markdown", "metadata": {}}, {"source": "summary(dat$age)", "cell_type": "code", "metadata": {"collapsed": false}, "execution_count": null, "outputs": []}, {"source": "boxplot(dat$age ~ dat$pclass, \n main=\"Age by class\",\n xlab=\"Ticket Class\", ylab=\"Age\")", "cell_type": "code", "metadata": {"collapsed": false}, "execution_count": null, "outputs": []}, {"source": "Already better than just the global median, but we can do better. Look at passanger names:", "cell_type": "markdown", "metadata": {}}, {"source": "head(dat$name, n=10L)", "cell_type": "code", "metadata": {"collapsed": false}, "execution_count": null, "outputs": []}, {"source": "We migth be able to use titles to guess ages. Extract using RegEx", "cell_type": "markdown", "metadata": {}}, {"source": "## function for extracting honorific (i.e. title) from the Name feature\ngetTitle <- function(name) {\n title.dot.start <- regexpr(\"\\\\, [A-Z]*\\\\.\", name, TRUE)+2\n title.dot.end <- title.dot.start + attr(title.dot.start, \"match.length\")-3\n title <- substr(name, title.dot.start, title.dot.end)\n return (title)\n}", "cell_type": "code", "metadata": {"collapsed": true}, "execution_count": null, "outputs": []}, {"source": "Extract from names and assign to new column", "cell_type": "markdown", "metadata": {}}, {"source": "dat$title <- getTitle(dat$name)\nunique(dat$title)", "cell_type": "code", "metadata": {"collapsed": false}, "execution_count": null, "outputs": []}, {"source": "Let's check which of these have missing age values", "cell_type": "markdown", "metadata": {}}, {"source": "options(digits=2)\nrequire(Hmisc)\nstats <- bystats(dat$age, dat$title, fun=function(x) c(Mean=mean(x),Median=median(x)))\nstats", "cell_type": "code", "metadata": {"collapsed": false}, "execution_count": null, "outputs": []}, {"source": "OK, let's get all the titles with missing values", "cell_type": "markdown", "metadata": {}}, {"source": "titles.na <- attr(stats, \"dimnames\")[[1]][stats[,2]>0]\n# Remove last row\ntitles.na <- titles.na[1:length(titles.na)-1]\ntitles.na", "cell_type": "code", "metadata": {"collapsed": false}, "execution_count": null, "outputs": []}, {"source": "Impute for each of these", "cell_type": "markdown", "metadata": {}}, {"source": "imputeMedian <- function(impute.var, filter.var, var.levels) {\n for (v in var.levels) {\n impute.var[ which( filter.var == v)] <- \n impute(impute.var[which( filter.var == v)])\n }\n return (impute.var)\n}", "cell_type": "code", "metadata": {"collapsed": false}, "execution_count": null, "outputs": []}, {"source": "dat$age[which(dat$title == 'Dr.')]", "cell_type": "code", "metadata": {"collapsed": false}, "execution_count": null, "outputs": []}, {"source": "dat$imputed.age <- is.na(dat$age)\ndat$age <- imputeMedian(dat$age, dat$title, titles.na)\nbystats(dat$age, dat$title, fun=function(x) c(Mean=mean(x),Median=median(x)))", "cell_type": "code", "metadata": {"collapsed": false}, "execution_count": null, "outputs": []}, {"source": "bystats(dat$age[dat$imputed.age] , dat$title[dat$imputed.age], fun=function(x) c(Mean=mean(x),Median=median(x)))", "cell_type": "code", "metadata": {"collapsed": false}, "execution_count": null, "outputs": []}, {"source": "# Done - save back", "cell_type": "markdown", "metadata": {}}, {"source": "upload.dataset(dat, ws, \"Titanic Imputed\")", "cell_type": "code", "metadata": {"collapsed": false}, "execution_count": null, "outputs": []}, {"source": "", "cell_type": "code", "metadata": {"collapsed": true}, "execution_count": null, "outputs": []}, {"source": "", "cell_type": "code", "metadata": {"collapsed": true}, "execution_count": null, "outputs": []}], "nbformat": 4}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment