Created
April 28, 2017 05:59
-
-
Save jspoelstra/2d616576eb650d5ba109fd8e864a352e to your computer and use it in GitHub Desktop.
Titanic data exploration and age imputation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{"nbformat_minor": 0, "metadata": {"kernelspec": {"language": "R", "name": "r", "display_name": "R"}, "language_info": {"codemirror_mode": "r", "file_extension": ".r", "version": "3.1.1", "name": "R", "pygments_lexer": "r", "mimetype": "text/x-r-source"}}, "cells": [{"source": "## Exploring the Titanic Dataset\nClosely following the excellent notebook published by Curt Wehrley here: (https://github.com/wehrley/wehrley.github.io/blob/master/SOUPTONUTS.md)\n\nFirst load the AzureML-imported dataset:", "cell_type": "markdown", "metadata": {}}, {"source": "library(\"AzureML\")\nws <- workspace()\ndat <- download.datasets(ws, \"Titanic Passengers\")", "cell_type": "code", "metadata": {"collapsed": true}, "execution_count": null, "outputs": []}, {"source": "Take a quick look at the first few lines", "cell_type": "markdown", "metadata": {}}, {"source": "head(dat)", "cell_type": "code", "metadata": {"collapsed": false}, "execution_count": null, "outputs": []}, {"source": "### Missing Data", "cell_type": "markdown", "metadata": {}}, {"source": "We first install some required packages", "cell_type": "markdown", "metadata": {}}, {"source": "install.packages(\"Amelia\")\ninstall.packages(\"corrgram\")", "cell_type": "code", "metadata": {"collapsed": false}, "execution_count": null, "outputs": []}, {"source": "Plot all missing values", "cell_type": "markdown", "metadata": {}}, {"source": "require(Amelia)\nmissmap(dat, \n main=\"Titanic Training Data - Missings Map\", \n col=c(\"yellow\", \"black\"), \n legend=FALSE)", "cell_type": "code", "metadata": {"collapsed": false}, "execution_count": null, "outputs": []}, {"source": "## Visualization\nLet's take a quick look at some of our columns by survival rate:", "cell_type": "markdown", "metadata": {}}, {"source": "barplot(table(dat$survived),\n names.arg = c(\"Perished\", \"Survived\"),\n main=\"Survived (passenger fate)\", col=\"black\")\nbarplot(table(dat$pclass), \n names.arg = c(\"first\", \"second\", \"third\"),\n main=\"Pclass (passenger traveling class)\", col=\"firebrick\")\nbarplot(table(dat$sex), main=\"Sex (gender)\", col=\"darkviolet\")\nhist(dat$age, main=\"Age\", xlab = NULL, col=\"brown\")\nbarplot(table(dat$sibsp), main=\"SibSp (siblings + spouse aboard)\", \n col=\"darkblue\")\nbarplot(table(dat$parch), main=\"ParCh (parents + kids aboard)\", \n col=\"gray50\")\nhist(dat$fare, main=\"Fare (fee paid for ticket[s])\", xlab = NULL, \n col=\"darkgreen\")\nbarplot(table(dat$embarked),\n names.arg = c(\"Missing\", \"Cherbourg\", \"Queenstown\", \"Southampton\"),\n main=\"Embarked (port of embarkation)\", col=\"sienna\")", "cell_type": "code", "metadata": {"scrolled": true, "collapsed": false}, "execution_count": null, "outputs": []}, {"source": "Let's look at this again, by odds of survival.\n#### Ticket Class", "cell_type": "markdown", "metadata": {}}, {"source": "mosaicplot(dat$pclass ~ dat$survived, \n main=\"Passenger Fate by Traveling Class\", shade=FALSE, \n color=TRUE, xlab=\"Pclass\", ylab=\"Survived\")", "cell_type": "code", "metadata": {"collapsed": false}, "execution_count": null, "outputs": []}, {"source": "#### Gender", "cell_type": "markdown", "metadata": {}}, {"source": "mosaicplot(dat$sex ~ dat$survived, \n main=\"Passenger Fate by Gender\", shade=FALSE, color=TRUE, \n xlab=\"Sex\", ylab=\"Survived\")", "cell_type": "code", "metadata": {"collapsed": false}, "execution_count": null, "outputs": []}, {"source": "#### Age", "cell_type": "markdown", "metadata": {}}, {"source": "boxplot(dat$age ~ dat$survived, \n main=\"Passenger Fate by Age\",\n xlab=\"Survived\", ylab=\"Age\")", "cell_type": "code", "metadata": {"collapsed": false}, "execution_count": null, "outputs": []}, {"source": "mosaicplot(cut(dat$age, breaks=seq(0,80,10)) ~ dat$survived,\n shade=FALSE, color=TRUE,\n main=\"Passenger Fate by Age\",\n xlab=\"Age\", ylab=\"Survived\")", "cell_type": "code", "metadata": {"collapsed": false}, "execution_count": null, "outputs": []}, {"source": "#### Port of Embarcation", "cell_type": "markdown", "metadata": {}}, {"source": "mosaicplot(dat$embarked ~ dat$survived, \n main=\"Passenger Fate by Port of Embarkation\",\n shade=FALSE, color=TRUE, xlab=\"Embarked\", ylab=\"Survived\")", "cell_type": "code", "metadata": {"collapsed": false}, "execution_count": null, "outputs": []}, {"source": "## Correlation between features", "cell_type": "markdown", "metadata": {}}, {"source": "require(corrgram)\nrequire(plyr)\ncorrgram.data <- dat\n## change features of factor type to numeric type for inclusion on correlogram\ncorrgram.data$survived <- as.numeric(corrgram.data$survived)\ncorrgram.data$pclass <- as.numeric(corrgram.data$pclass)\ncorrgram.data$embarked <- revalue(corrgram.data$embarked, \n c(\"C\" = 1, \"Q\" = 2, \"S\" = 3))\n## generate correlogram\ncorrgram.vars <- c(\"survived\", \"pclass\", \"sex\", \"age\", \n \"sibsp\", \"parch\", \"fare\", \"embarked\")\ncorrgram(corrgram.data[,corrgram.vars], order=FALSE, \n lower.panel=panel.ellipse, upper.panel=panel.pie, \n text.panel=panel.txt, main=\"Titanic Training Data\")", "cell_type": "code", "metadata": {"collapsed": false}, "execution_count": null, "outputs": []}, {"source": "# Dealing with Missing Age values\nFollowing the sophisticated imputation of missing ages", "cell_type": "markdown", "metadata": {}}, {"source": "summary(dat$age)", "cell_type": "code", "metadata": {"collapsed": false}, "execution_count": null, "outputs": []}, {"source": "boxplot(dat$age ~ dat$pclass, \n main=\"Age by class\",\n xlab=\"Ticket Class\", ylab=\"Age\")", "cell_type": "code", "metadata": {"collapsed": false}, "execution_count": null, "outputs": []}, {"source": "Already better than just the global median, but we can do better. Look at passanger names:", "cell_type": "markdown", "metadata": {}}, {"source": "head(dat$name, n=10L)", "cell_type": "code", "metadata": {"collapsed": false}, "execution_count": null, "outputs": []}, {"source": "We migth be able to use titles to guess ages. Extract using RegEx", "cell_type": "markdown", "metadata": {}}, {"source": "## function for extracting honorific (i.e. title) from the Name feature\ngetTitle <- function(name) {\n title.dot.start <- regexpr(\"\\\\, [A-Z]*\\\\.\", name, TRUE)+2\n title.dot.end <- title.dot.start + attr(title.dot.start, \"match.length\")-3\n title <- substr(name, title.dot.start, title.dot.end)\n return (title)\n}", "cell_type": "code", "metadata": {"collapsed": true}, "execution_count": null, "outputs": []}, {"source": "Extract from names and assign to new column", "cell_type": "markdown", "metadata": {}}, {"source": "dat$title <- getTitle(dat$name)\nunique(dat$title)", "cell_type": "code", "metadata": {"collapsed": false}, "execution_count": null, "outputs": []}, {"source": "Let's check which of these have missing age values", "cell_type": "markdown", "metadata": {}}, {"source": "options(digits=2)\nrequire(Hmisc)\nstats <- bystats(dat$age, dat$title, fun=function(x) c(Mean=mean(x),Median=median(x)))\nstats", "cell_type": "code", "metadata": {"collapsed": false}, "execution_count": null, "outputs": []}, {"source": "OK, let's get all the titles with missing values", "cell_type": "markdown", "metadata": {}}, {"source": "titles.na <- attr(stats, \"dimnames\")[[1]][stats[,2]>0]\n# Remove last row\ntitles.na <- titles.na[1:length(titles.na)-1]\ntitles.na", "cell_type": "code", "metadata": {"collapsed": false}, "execution_count": null, "outputs": []}, {"source": "Impute for each of these", "cell_type": "markdown", "metadata": {}}, {"source": "imputeMedian <- function(impute.var, filter.var, var.levels) {\n for (v in var.levels) {\n impute.var[ which( filter.var == v)] <- \n impute(impute.var[which( filter.var == v)])\n }\n return (impute.var)\n}", "cell_type": "code", "metadata": {"collapsed": false}, "execution_count": null, "outputs": []}, {"source": "dat$age[which(dat$title == 'Dr.')]", "cell_type": "code", "metadata": {"collapsed": false}, "execution_count": null, "outputs": []}, {"source": "dat$imputed.age <- is.na(dat$age)\ndat$age <- imputeMedian(dat$age, dat$title, titles.na)\nbystats(dat$age, dat$title, fun=function(x) c(Mean=mean(x),Median=median(x)))", "cell_type": "code", "metadata": {"collapsed": false}, "execution_count": null, "outputs": []}, {"source": "bystats(dat$age[dat$imputed.age] , dat$title[dat$imputed.age], fun=function(x) c(Mean=mean(x),Median=median(x)))", "cell_type": "code", "metadata": {"collapsed": false}, "execution_count": null, "outputs": []}, {"source": "# Done - save back", "cell_type": "markdown", "metadata": {}}, {"source": "upload.dataset(dat, ws, \"Titanic Imputed\")", "cell_type": "code", "metadata": {"collapsed": false}, "execution_count": null, "outputs": []}, {"source": "", "cell_type": "code", "metadata": {"collapsed": true}, "execution_count": null, "outputs": []}, {"source": "", "cell_type": "code", "metadata": {"collapsed": true}, "execution_count": null, "outputs": []}], "nbformat": 4} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment