Orhan Yalcin ogyalcin
@ogyalcin
ogyalcin / clean_traindata.py
Last active October 8, 2020 14:18
Clean the Training Data
import pandas as pd
train = pd.read_csv("train.csv") # load the training data from disk
train = train.drop(columns=['Cabin']) # drop 'Cabin' first because it has a lot of null values
train = train.dropna() # delete the remaining rows that contain empty values
y = train['Survived'] # select the column representing survival
X = train.drop(columns=['Survived', 'PassengerId', 'Name', 'Ticket']) # drop the target and the irrelevant columns, keep the rest
X = pd.get_dummies(X) # convert non-numerical variables to dummy variables
ogyalcin / train_model.py
Created July 14, 2018 06:56
Create and Train a ML Model
from sklearn import tree
dtc = tree.DecisionTreeClassifier()
dtc.fit(X, y)
ogyalcin / clean_testdata.py
Created July 14, 2018 06:57
Clean the Test Data
import pandas as pd
test = pd.read_csv("test.csv") # load the test data
ids = test[['PassengerId']] # keep the passenger IDs as a sub-dataset for the submission file
test = test.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin']) # drop the irrelevant columns and keep the rest
test = test.fillna(2) # fill (instead of drop) empty values with the placeholder 2 so the row count matches what the submission requires
test = pd.get_dummies(test) # convert non-numerical variables to dummy variables
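When `pd.get_dummies` is applied to the train and test data separately, a category that appears in only one of the two files yields mismatched columns, and the fitted model then rejects the test frame. Below is a minimal sketch of one way to guard against this. It uses small made-up frames in place of the real CSV files; the names `X_demo`, `test_demo`, and `test_aligned` are hypothetical and not part of the original gists.

```python
import pandas as pd

# Tiny stand-ins for the cleaned train/test frames (made-up values).
X_demo = pd.get_dummies(pd.DataFrame({
    "Pclass": [1, 3, 2],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
}))
test_demo = pd.get_dummies(pd.DataFrame({
    "Pclass": [3, 1],
    "Sex": ["female", "male"],
    "Embarked": ["S", "S"],  # 'C' and 'Q' never appear in this frame
}))

# Reindex the test frame to the training columns: dummy columns missing from
# the test frame are added and filled with 0, columns unknown to the training
# frame are dropped, and the column order is made identical.
test_aligned = test_demo.reindex(columns=X_demo.columns, fill_value=0)

print(list(test_aligned.columns) == list(X_demo.columns))
```

With the real data, the same call would be `test = test.reindex(columns=X.columns, fill_value=0)`, run before the prediction step.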
ogyalcin / make_predictions.py
Created July 14, 2018 06:58
Make Predictions on Test Data
predictions = dtc.predict(test)
ogyalcin / save_to_file.py
Created July 14, 2018 06:59
Save the Results to a CSV File
results = ids.assign(Survived = predictions) # assign predictions to ids
results.to_csv("titanic-results.csv", index=False) # write the final dataset to a csv file.
ogyalcin / combined_df.py
Created August 2, 2018 08:57
Combining train and test datasets to check the missing values.
import pandas as pd
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
combined = pd.concat([train.drop('Survived',axis=1),test])
ogyalcin / null_values_heatmap.py
Created August 2, 2018 08:59
Create a Heatmap to Detect Null Values
#For iPython
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.figure(figsize=(10,25))
sns.heatmap(combined.isnull(),cmap="viridis",yticklabels=False,cbar=False)
ogyalcin / combined_info.py
Created August 2, 2018 09:00
Information about Null Values of Combined Titanic Dataset
combined.info()
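`combined.info()` lists the non-null count for each column, so the number of missing values has to be worked out by subtraction. The missing counts can also be read directly, as in this small sketch. The frame `demo` is a made-up stand-in for `combined`, and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame with a few missing entries, shaped like some Titanic columns.
demo = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 8.05, 8.46],
})

# Count the missing values in each column and list the worst columns first.
null_counts = demo.isnull().sum().sort_values(ascending=False)
print(null_counts)
```

With the real data, `combined.isnull().sum()` would give the same kind of per-column summary.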
ogyalcin / readcsv.py
Created August 2, 2018 11:37
Importing Numpy and Pandas & Reading CSV Files
import numpy as np
import pandas as pd
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
ogyalcin / prep_train.py
Created August 2, 2018 11:44
Preparing the Train Dataset
train['Age'] = train['Age'].fillna(train['Age'].median()) # Imputing missing Age values with the median age
train['Embarked'] = train['Embarked'].fillna(train['Embarked'].value_counts().index[0]) # Imputing missing Embarked values with the most frequent port
d = {1:'1st',2:'2nd',3:'3rd'} # Creating a dictionary to convert Passenger Class from 1,2,3 to 1st,2nd,3rd
train['Pclass'] = train['Pclass'].map(d) # Mapping the column based on the dictionary
train = train.drop(columns=['PassengerId','Name','Ticket','Cabin']) # Dropping unnecessary columns
categorical_vars = train[['Pclass','Sex','Embarked']] # Getting Dummies of Categorical Variables
dummies = pd.get_dummies(categorical_vars,drop_first=True)
train = train.drop(['Pclass','Sex','Embarked'],axis=1) #Dropping the Original Categorical Variables to avoid duplicates
train = pd.concat([train,dummies],axis=1) #Now, concat the new dummy variables
train.head() #Check the clean version of the train data.
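The test data needs the same preparation before it can be fed to a model, so the steps above can be collected into one function. This is a sketch only: the helper `prepare` and the made-up frame `demo` are hypothetical and do not come from the original gists. The fill values are passed in as arguments, so statistics computed on the training data can be reused for the test data.

```python
import pandas as pd

def prepare(df, age_median, embarked_mode):
    """Apply the preparation steps to a copy of df (hypothetical helper)."""
    df = df.copy()
    df["Age"] = df["Age"].fillna(age_median)               # impute missing ages
    df["Embarked"] = df["Embarked"].fillna(embarked_mode)  # impute missing ports
    df["Pclass"] = df["Pclass"].map({1: "1st", 2: "2nd", 3: "3rd"})
    df = df.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"])
    categorical = ["Pclass", "Sex", "Embarked"]
    dummies = pd.get_dummies(df[categorical], drop_first=True)
    # Replace the original categorical columns with their dummy columns.
    return pd.concat([df.drop(columns=categorical), dummies], axis=1)

# Made-up rows shaped like the Titanic training columns.
demo = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 2],
    "Name": ["A", "B", "C"],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, None, 30.0],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Ticket": ["t1", "t2", "t3"],
    "Fare": [7.25, 71.28, 7.92],
    "Cabin": [None, "C85", None],
    "Embarked": ["S", None, "Q"],
})

clean = prepare(demo, demo["Age"].median(), demo["Embarked"].mode()[0])
print(clean.columns.tolist())
```

Because `prepare` works on a copy, the frame passed in is left unchanged and the function can be rerun safely in a notebook.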