@shubham808
Created May 29, 2017 21:38
{
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.0"
}
},
"nbformat": 4,
"nbformat_minor": 0,
"cells": [
{
"cell_type": "markdown",
"source": "# Titanic Data Science \n\nObjective of this Notebook is to Simplify the Titanic Survival Prediction Problem by using a Multiple Layer Perceptron\n\nThe Various Steps are:\n1. Question or problem definition.\n2. Acquire training and testing data.\n3. Wrangle, prepare, cleanse the data.\n4. Analyze, identify patterns, and explore the data.\n5. Model, predict and solve the problem.\n6. Visualize, report, and present the problem solving steps and final solution.\n7. Supply or submit the results.\n\n## Question and problem definition\n\n> Knowing from a training set of samples listing passengers who survived or did not survive the Titanic disaster, can our model determine based on a given test dataset not containing the survival information, if these passengers in the test dataset survived or not.\n\n Here are the highlights to note:\n- On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. Translated 32% survival rate.\n- One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew.\n- Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.\n\nBelow is all you need to import for this to work so make sure you can import everything.",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"source": "\nimport pandas as pd\nimport numpy as np\nimport random as rnd\nimport matplotlib.pyplot as plt\n\n#Deep Learning\n\nfrom keras.models import Sequential\nfrom keras.utils import np_utils\nfrom keras.layers.core import Dense, Activation, Dropout\nfrom keras.metrics import categorical_accuracy as accuracy\nfrom keras import regularizers\nfrom sklearn import linear_model\nfrom sklearn.metrics import accuracy_score\nfrom sklearn import svm\nfrom sklearn.ensemble import RandomForestClassifier",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": "## Acquire data\n\nThe Python Pandas packages helps us work with our datasets. We start by acquiring the training and testing datasets into Pandas DataFrames. We also combine these datasets to run certain operations on both datasets together.\nWe will seperate the labels from the train features",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"source": "features_train = pd.read_csv('train_titanic.csv')\nfeatures_test = pd.read_csv('test_titanic.csv')\n\nlabels_train = features_train['Survived']\nfeatures_train.drop('Survived', axis=1, inplace=True)",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": "## Analyze by describing data\n\nPandas also helps describe the datasets answering following questions early in our project.\n\n**Which features are available in the dataset?**\n\nNoting the feature names for directly manipulating or analyzing these. These feature names are described on the [Kaggle data page here](https://www.kaggle.com/c/titanic/data).",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"source": "print(features_train.columns.values)",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": "**Which features are categorical?**\n\nThese values classify the samples into sets of similar samples. Within categorical features are the values nominal, ordinal, ratio, or interval based? Among other things this helps us select the appropriate plots for visualization.\n\n- Categorical: Survived, Sex, and Embarked. Ordinal: Pclass.\n\n**Which features are numerical?**\n\nWhich features are numerical? These values change from sample to sample. Within numerical features are the values discrete, continuous, or timeseries based? Among other things this helps us select the appropriate plots for visualization.\n\n- Continous: Age, Fare. Discrete: SibSp, Parch.",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"source": "\nfeatures_train.head()",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": "**Which features are mixed data types?**\n\nNumerical, alphanumeric data within same feature. These are candidates for correcting goal.\n\n- Ticket is a mix of numeric and alphanumeric data types. Cabin is alphanumeric.\n\n**Which features may contain errors or typos?**\n\nThis is harder to review for a large dataset, however reviewing a few samples from a smaller dataset may just tell us outright, which features may require correcting.\n\n- Name feature may contain errors or typos as there are several ways used to describe a name including titles, round brackets, and quotes used for alternative or short names.",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"source": "features_train.tail()",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": "**Which features contain blank, null or empty values?**\n\nThese will require correcting.\n\n- Cabin > Age > Embarked features contain a number of null values in that order for the training dataset.\n- Cabin > Age are incomplete in case of test dataset.\n\n**What are the data types for various features?**\n\nHelping us during converting goal.\n\n- Seven features are integer or floats. Six in case of test dataset.\n- Five features are strings (object).",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": "### Assumtions based on data analysis\n\nWe arrive at following assumptions based on data analysis done so far. We may validate these assumptions further before taking appropriate actions.\n\n1. We may want to complete Age feature as it is definitely correlated to survival.\n2. We may want to complete the Embarked feature as it may also correlate with survival or another important feature.\n\nDropping Features\n\n3. Ticket feature may be dropped from our analysis as it contains high ratio of duplicates.\n4. Cabin feature may be dropped as it is highly incomplete or contains many null values both in training and test dataset.\n5. PassengerId may be dropped from training dataset as it does not contribute to survival.\n6. Name feature may not contribute directly to survival, so it maybe dropped.\n7. Pclass may also be dropped.\n\n\nCreating New Features\n\n1. We may want to create a new feature called Family based on Parch and SibSp to get total count of family members on board.\n2. We may want to engineer the Name feature to extract Title as a new feature.",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": "###Scaling Features\n\nAs few of our most informative features are numerical in nature.\nTherefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the output and not generate distortions.\nWe will focus on the features Age and Fare while Scaling them.\nWe calculate the max value then subtract the mean from each.\nThis will be done after we pre-process the features for now we just look at the logic.",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"source": "#Scaling\nscale=max(combined_features['Age'])\ncombined_features['Age']/=scale\nmean=np.mean(combined_features['Age'])\ncombined_features['Age']-=mean\n\nscale=max(combined_features['Fare'])\ncombined_features['Fare']/=scale\nmean=np.mean(combined_features['Fare'])\ncombined_features['Fare']-=mean",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": "###Dropping Features\n\nThis is a good starting goal to execute. By dropping features we are dealing with fewer data points. Speeds up our notebook and eases the analysis.\nBased on our assumptions and decisions we want to drop the Cabin, Pclass and Ticket features.\n\nNote that where applicable we perform operations on both training and testing datasets together to stay consistent.\nTo do this we will first combine them both.",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"source": "combined_features = features_train.append(features_test)\ncombined_features.reset_index(inplace=True)\ncombined_features.drop('index', axis=1, inplace=True)",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": "### Creating new feature extracting from existing\n\nWe want to analyze if Name feature can be engineered to extract titles and test correlation between titles and survival, before dropping Name and PassengerId features.\n\nIn the following code we extract Title feature using regular expressions. The RegEx pattern `(\\w+\\.)` matches the first word which ends with a dot character within Name feature. The `expand=False` flag returns a DataFrame.\n\nWhen we plot Title, Age, and Survived, we note the following observations.\n\n- Most titles band Age groups accurately. For example: Master title has Age mean of 5 years.\n- Survival among Title Age bands varies slightly.\n- Certain titles mostly survived (Mme, Lady, Sir) or did not (Don, Rev, Jonkheer).\n\n- We decide to retain the new Title feature for model training.",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"source": "combined_features['Title']= combined_features['Name'].map(\n lambda name: name.split(',')[1].split('.')[0].strip())\nTitle_Dictionary = {\n \"Capt\": \"Officer\",\n \"Col\": \"Officer\",\n \"Major\": \"Officer\",\n \"Jonkheer\": \"Royalty\",\n \"Don\": \"Royalty\",\n \"Sir\" : \"Royalty\",\n \"Dr\": \"Officer\",\n \"Rev\": \"Officer\",\n \"the Countess\":\"Royalty\",\n \"Dona\": \"Royalty\",\n \"Mme\": \"Mrs\",\n \"Mlle\": \"Miss\",\n \"Ms\": \"Mrs\",\n \"Mr\" : \"Mr\",\n \"Mrs\" : \"Mrs\",\n \"Miss\" : \"Miss\",\n \"Master\" : \"Master\",\n \"Lady\" : \"Royalty\"\n\n }\ncombined_features['Title']=combined_features.Title.map(Title_Dictionary)",
"execution_count": null,
"outputs": [],
"metadata": {}
},
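{
"cell_type": "markdown",
"source": "The same extraction can also be sketched with a regular expression. This is only an illustration on two sample names in the dataset's format, not part of the original pipeline: the pattern and the variable names `names` and `titles` are assumptions. With a single capture group, `str.extract(..., expand=False)` returns a Series rather than a DataFrame.",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"source": "import pandas as pd\n\n# Two sample names in the 'Surname, Title. Given names' format\nnames = pd.Series(['Braund, Mr. Owen Harris',\n                   'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)'])\n\n# Capture everything between the comma and the first period\ntitles = names.str.extract(r',\\s*([^.]+)\\.', expand=False).str.strip()\nprint(titles.tolist())",
"execution_count": null,
"outputs": [],
"metadata": {}
},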
{
"cell_type": "markdown",
"source": "We can convert the categorical titles to ordinal.",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": "Now we can safely drop the Name feature from training and testing datasets. We also do not need the PassengerId feature in the training dataset.",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": "### Converting a categorical feature\n\nNow we can convert features which contain strings to numerical values. This is required by most model algorithms. Doing so will also help us in achieving the feature completing goal.\n\nLet us start by converting Sex feature to a new feature called Gender where female=1 and male=0.",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"source": "\n# Processing sex\ncombined_features['Sex'] = combined_features['Sex'].map({'male':1,'female':0})",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": "### Completing a numerical continuous feature\n\nNow we should start estimating and completing features with missing or null values. We will first do this for the Age feature.\n\nWe can consider three methods to complete a numerical continuous feature.\n\n1. A simple way is to generate random numbers between mean and [standard deviation](https://en.wikipedia.org/wiki/Standard_deviation).\n\n2. More accurate way of guessing missing values is to use other correlated features. In our case we note correlation among Age, Gender, and Pclass. Guess Age values using [median](https://en.wikipedia.org/wiki/Median) values for Age across sets of Pclass and Gender feature combinations. So, median Age for Pclass=1 and Gender=0, Pclass=1 and Gender=1, and so on...\n\n3. Another meathod would be to use the Title feature we created above and use the average value of the age for each title to fill the gap. This provides a more realistic assumption. \n",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"source": "# Processing the ages\ngrouped_train = combined_features.head(891).groupby(['Sex','Pclass','Title'])\ngrouped_median_train = grouped_train.median()\n\ngrouped_test = combined_features.iloc[891:].groupby(['Sex','Pclass','Title'])\ngrouped_median_test = grouped_test.median()\ndef fillAges(row, grouped_median):\n if row['Sex'] == 'female' and row['Pclass'] == 1:\n if row['Title'] == 'Miss':\n return grouped_median.loc['female', 1, 'Miss']['Age']\n elif row['Title'] == 'Mrs':\n return grouped_median.loc['female', 1, 'Mrs']['Age']\n elif row['Title'] == 'Officer':\n return grouped_median.loc['female', 1, 'Officer']['Age']\n elif row['Title'] == 'Royalty':\n return grouped_median.loc['female', 1, 'Royalty']['Age']\n\n elif row['Sex'] == 'female' and row['Pclass'] == 2:\n if row['Title'] == 'Miss':\n return grouped_median.loc['female', 2, 'Miss']['Age']\n elif row['Title'] == 'Mrs':\n return grouped_median.loc['female', 2, 'Mrs']['Age']\n\n elif row['Sex'] == 'female' and row['Pclass'] == 3:\n if row['Title'] == 'Miss':\n return grouped_median.loc['female', 3, 'Miss']['Age']\n elif row['Title'] == 'Mrs':\n return grouped_median.loc['female', 3, 'Mrs']['Age']\n\n elif row['Sex'] == 'male' and row['Pclass'] == 1:\n if row['Title'] == 'Master':\n return grouped_median.loc['male', 1, 'Master']['Age']\n elif row['Title'] == 'Mr':\n return grouped_median.loc['male', 1, 'Mr']['Age']\n elif row['Title'] == 'Officer':\n return grouped_median.loc['male', 1, 'Officer']['Age']\n elif row['Title'] == 'Royalty':\n return grouped_median.loc['male', 1, 'Royalty']['Age']\n\n elif row['Sex'] == 'male' and row['Pclass'] == 2:\n if row['Title'] == 'Master':\n return grouped_median.loc['male', 2, 'Master']['Age']\n elif row['Title'] == 'Mr':\n return grouped_median.loc['male', 2, 'Mr']['Age']\n elif row['Title'] == 'Officer':\n return grouped_median.loc['male', 2, 'Officer']['Age']\n\n elif row['Sex'] == 'male' and row['Pclass'] == 3:\n if row['Title'] == 'Master':\n return 
grouped_median.loc['male', 3, 'Master']['Age']\n elif row['Title'] == 'Mr':\n return grouped_median.loc['male', 3, 'Mr']['Age']\n\ncombined_features.head(891).Age = combined_features.head(891).apply(lambda r: fillAges(r, grouped_median_train) if np.isnan(r['Age'])\n else r['Age'], axis=1)\n\ncombined_features.iloc[891:].Age = combined_features.iloc[891:].apply(lambda r: fillAges(r, grouped_median_test) if np.isnan(r['Age'])\n else r['Age'], axis=1)\n\n",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": "### Create new feature combining existing features\n\nWe can create a new feature for FamilySize which combines Parch and SibSp. This will enable us to drop Parch and SibSp from our datasets.",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"source": "\n# Processing family\ncombined_features['FamilySize'] = combined_features['Parch'] + combined_features['SibSp'] + 1\ncombined_features['Singleton'] = combined_features['FamilySize'].map(lambda s: 1 if s == 1 else 0)\ncombined_features['SmallFamily'] = combined_features['FamilySize'].map(lambda s: 1 if 2 <= s <= 4 else 0)\ncombined_features['LargeFamily'] = combined_features['FamilySize'].map(lambda s: 1 if 5 <= s else 0)",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": "### Completing a categorical feature\n\nEmbarked feature takes S, Q, C values based on port of embarkation. Our training dataset has two missing values. We simply fill these with the most common occurance.",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"source": "\n#Processing Cabin:\ncombined_features.Cabin.fillna('U', inplace=True)",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": "### Quick completing and converting a numeric feature\n\nWe can now complete the Fare feature for single missing value in test dataset using mode to get the value that occurs most frequently for this feature. We do this in a single line of code.\n\nNote that we are not creating an intermediate new feature or doing any further analysis for correlation to guess missing feature as we are replacing only a single value. The completion goal achieves desired requirement for model algorithm to operate on non-null values.\n\nWe may also want round off the fare to two decimals as it represents currency.",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"source": "\n# Processing Fares: fill empty spaces with mean\ncombined_features.head(891).Fare.fillna(combined_features.head(891).Fare.mean(), inplace=True)\ncombined_features.iloc[891:].Fare.fillna(combined_features.iloc[891:].Fare.mean(), inplace=True)",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": "###Dropping Features that are not informative\nWe find that \n\n\n1. Name can be Dropped\n2. Also PassengerId does not contribute to survival\n3. Embarked feature can be dropped\n4. Cabin Feature can be dropped\n5. As information from SibSp and Parch was extracted into Family Size they will be dropped\n6. Pclass will be dropped\n7. Ticket will be dropped",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"source": "\n# Processing Name\ncombined_features.drop('Name', axis=1, inplace=True)\n# Processing PassengerId\ncombined_features.drop('PassengerId', axis=1, inplace=True)\n#Processing Embarked: drop it\ncombined_features.drop('Embarked', axis=1, inplace=True)\n\n#Processing Ticket: drop it\ncombined_features.drop('Ticket', axis=1, inplace=True)\n\n#Processing Cabin:\ncombined_features.Cabin.fillna('U', inplace=True)\n\n# Processing pclass\npclass_dummies = pd.get_dummies(combined_features['Pclass'], prefix=\"Pclass\")\ncombined_features = pd.concat([combined_features, pclass_dummies], axis=1)\ncombined_features.drop('Pclass', axis=1, inplace=True)\n\n# Processing PassengerId\ncombined_features.drop('PassengerId', axis=1, inplace=True)",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": "## Model, predict and solve\n\nNow we are ready to train a model and predict the required solution. We must understand the type of problem and solution requirement to narrow down to a select few models which we can evaluate. Our problem is a classification and regression problem. We want to identify relationship between output (Survived or not) with other variables or features (Gender, Age, Port...). With these two criteria - Supervised Learning plus Classification and Regression, we can narrow down our choice of models to a few. These include:\n\n- Logistic Regression\n- Support Vector Machines\n- Decision Tree\n- Random Forrest\n- Artificial neural network",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": "Logistic Regression is a useful model to run early in the workflow. Logistic regression measures the relationship between the categorical dependent variable (feature) and one or more independent variables (features) by estimating probabilities using a logistic function, which is the cumulative logistic distribution. Reference [Wikipedia](https://en.wikipedia.org/wiki/Logistic_regression).\n\nNote the confidence score generated by the model based on our training dataset.",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"source": "# Logistic Regression\n\nlogreg = LogisticRegression()\nlogreg.fit(features_train, labels_train)\npred = logreg.predict(features_test)\nacc_log = round(logreg.score(features_train, labels_train) * 100, 2)\nacc_log",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": "We can use Logistic Regression to validate our assumptions and decisions for feature creating and completing goals. This can be done by calculating the coefficient of the features in the decision function.\n\nPositive coefficients increase the log-odds of the response (and thus increase the probability), and negative coefficients decrease the log-odds of the response (and thus decrease the probability).\n\n- Sex is highest positivie coefficient, implying as the Sex value increases (male: 0 to female: 1), the probability of Survived=1 increases the most.\n- This way Age*Class is a good artificial feature to model as it has second highest negative correlation with Survived.\n- So is Title as second highest positive correlation.",
"execution_count": null,
"outputs": [],
"metadata": {}
},
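{
"cell_type": "markdown",
"source": "A minimal sketch of how such coefficients can be inspected, shown on synthetic data rather than the Titanic features. The column names `a` and `b`, the random data and the `coeff` table are assumptions made for illustration; `coef_` is the fitted scikit-learn attribute that holds one coefficient per feature. To apply the idea to this notebook, one would replace `X` and `y` with `features_train` and `labels_train`.",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"source": "import numpy as np\nimport pandas as pd\nfrom sklearn.linear_model import LogisticRegression\n\n# Synthetic data for illustration only: 'a' pushes the label towards 1, 'b' towards 0\nrng = np.random.RandomState(0)\nX = pd.DataFrame({'a': rng.normal(size=200), 'b': rng.normal(size=200)})\ny = (X['a'] - 0.5 * X['b'] + rng.normal(scale=0.5, size=200) > 0).astype(int)\n\nmodel = LogisticRegression().fit(X, y)\n\n# One coefficient per feature; the sign gives the direction of the effect on the log-odds\ncoeff = pd.DataFrame({'Feature': X.columns, 'Coefficient': model.coef_[0]})\nprint(coeff.sort_values(by='Coefficient', ascending=False))",
"execution_count": null,
"outputs": [],
"metadata": {}
},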
{
"cell_type": "markdown",
"source": "Next we model using Support Vector Machines which are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. Given a set of training samples, each marked as belonging to one or the other of **two categories**, an SVM training algorithm builds a model that assigns new test samples to one category or the other, making it a non-probabilistic binary linear classifier. Reference [Wikipedia](https://en.wikipedia.org/wiki/Support_vector_machine).\n\nNote that the model generates a confidence score which is higher than Logistics Regression model.",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"source": "# Support Vector Machines\n\nsvc = SVC()\nsvc.fit(features_train, labels_train)\npred = svc.predict(features_test)\nacc_svc = round(svc.score(features_train, labels_train) * 100, 2)\nacc_svc",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": "In pattern recognition, the k-Nearest Neighbors algorithm (or k-NN for short) is a non-parametric method used for classification and regression. A sample is classified by a majority vote of its neighbors, with the sample being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor. Reference [Wikipedia](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm).\n\nKNN confidence score is better than Logistics Regression but worse than SVM.",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"source": "knn = KNeighborsClassifier(n_neighbors = 3)\nknn.fit(X_train, Y_train)\nY_pred = knn.predict(X_test)\nacc_knn = round(knn.score(X_train, Y_train) * 100, 2)\nacc_knn",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": "This model uses a decision tree as a predictive model which maps features (tree branches) to conclusions about the target value (tree leaves). Tree models where the target variable can take a finite set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees. Reference [Wikipedia](https://en.wikipedia.org/wiki/Decision_tree_learning).\n\nThe model confidence score is the highest among models evaluated so far.",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"source": "# Decision Tree\n\ndecision_tree = DecisionTreeClassifier()\ndecision_tree.fit(features_train, labels_train)\nY_pred = decision_tree.predict(features_test)\nacc_decision_tree = round(decision_tree.score(features_train, labels_train) * 100, 2)\nacc_decision_tree",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": "The next model Random Forests is one of the most popular. Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees (n_estimators=100) at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Reference [Wikipedia](https://en.wikipedia.org/wiki/Random_forest).\n\nThe model confidence score is the highest among models evaluated so far. We decide to use this model's output (Y_pred) for creating our competition submission of results.",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"source": "# Random Forest\n\nrandom_forest = RandomForestClassifier(n_estimators=100)\nrandom_forest.fit(features_train, labels_train)\nY_pred = random_forest.predict(features_test)\nrandom_forest.score(features_train, labels_train)\nacc_random_forest = round(random_forest.score(features_train, labels_train) * 100, 2)\nacc_random_forest",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": "###Multiple Layer Perceptron\n\nNow we will try to use an arificial Neural Network to this problem.\nWe see that the number of input features is 25 we will make a sequential model containing 3 Dense Layers with input dimensions as 25.\nAlso we will add regularization. This is necessary due to deficiency of training data which could and in this case does lead to Overfitting.\nRegularization is added in the form of Dropout Layers mingled between the Dense Layers\nYou may try to experiment with the ordering and number of layers to get the best output\nSo, Try playing Lego a bit with them and try to find something interesting.\nLet me know if you do ;) ",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"source": "model = Sequential()\nmodel.add(Dense(512, input_dim=25))\nmodel.add(Activation('relu'))\nmodel.add(Dropout(0.25))\nmodel.add(Dense(128))\nmodel.add(Activation('relu'))\nmodel.add(Dense(128))\nmodel.add(Activation('relu'))\nmodel.add(Dropout(0.5))\nmodel.add(Dense(2))\nmodel.add(Activation('softmax'))",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": "Before Fitting the Model we convert the labels to catagorical features using np_utils so that we may get the softmax probability output which we can later use to predict survival based on probability being greater than 0.5",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"source": "labels_train = np_utils.to_categorical(labels_train)",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": "we'll use categorical cross entropy for the loss, and adam as the optimizer\nYou may want to play around with these too\nits always good to experiment",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"source": "\nmodel.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])\nclf=model",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": "Next we fit the model and split the validation at 0.1 or 10 percent this allows us to keep track of overfitting.",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"source": "clf.fit(np.array(features_train),labels_train,validation_split=0.1,epochs=88)\nout = clf.predict(np.array(features_test))",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": "Next we write the predicts out to a file",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"source": "pred=[]\nfor x in out:\n if(x[1]>0.5):\n pred.append(1)\n else:\n pred.append(0)\n\nout1=pred\nidk = open('test_titanic.csv','r')\nidk = csv.DictReader(idk)\npred = []\nfor row in idk:\n pred.append(row[\"PassengerId\"])\nest = csv.writer(open('result.csv', 'w'), delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)\nest.writerows([[\"PassengerID\",\"Survived\"]])\nk=0\nfor x in pred:\n est.writerows([[str(x), str(out1[k])]])\n k+=1\n",
"execution_count": null,
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": "Our submission to the competition site Kaggle results in scoring 2211 of 6953 competition entries. This result is indicative while the competition is running. This result only accounts for part of the submission dataset. Not bad for our first attempt. Any suggestions to improve our score are most welcome.",
"execution_count": null,
"outputs": [],
"metadata": {}
}
]
}