@WillKoehrsen
Created April 11, 2018 15:33
import pandas as pd
from sklearn.model_selection import train_test_split


def format_data(df):
    # Targets are the final grade of each student
    labels = df['G3']

    # Drop the school and the grades from the features
    df = df.drop(columns=['school', 'G1', 'G2', 'G3'])

    # One-hot encode the categorical variables
    df = pd.get_dummies(df)

    # Rank features by absolute correlation with the target
    df['y'] = list(labels)
    most_correlated = df.corr().abs()['y'].sort_values(ascending=False)

    # Keep correlations of at least 0.2 in absolute value;
    # [1:] skips the first entry, which is 'y' itself (correlation 1.0)
    most_correlated = most_correlated[most_correlated >= 0.2][1:]
    # .loc replaces the deprecated .ix indexer (removed in pandas 1.0)
    df = df.loc[:, most_correlated.index]

    # `higher_yes` already encodes higher education, so drop the redundant `higher_no`
    df = df.drop(columns='higher_no')

    # Split into training/testing sets, holding out 25% for testing
    X_train, X_test, y_train, y_test = train_test_split(df, labels,
                                                        test_size=0.25,
                                                        random_state=42)

    # Return the training and testing data
    return X_train, X_test, y_train, y_test
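
The correlation filter inside `format_data` can be tried on its own with a small made-up table. This is a minimal sketch, not the gist author's data: the `toy` frame and its `studytime` and `noise` columns are hypothetical, and it removes the target with `.drop('y')` where the function slices with `[1:]`.

```python
import pandas as pd

# Hypothetical toy data; only 'G3' and 'higher' mirror names used in format_data
toy = pd.DataFrame({
    'studytime': [1, 2, 3, 4, 5, 6],
    'noise': [1, 3, 3, 1, 1, 3],
    'higher': ['no', 'no', 'yes', 'yes', 'yes', 'yes'],
    'G3': [5, 7, 9, 12, 14, 16],
})

labels = toy['G3']

# One-hot encode; dtype=float keeps the dummy columns numeric for corr()
features = pd.get_dummies(toy.drop(columns=['G3']), dtype=float)

# Absolute correlation of every column with the target
features['y'] = list(labels)
corr = features.corr().abs()['y'].sort_values(ascending=False)

# Keep features with |correlation| >= 0.2, excluding the target itself
selected = corr[corr >= 0.2].drop('y')
kept = features.loc[:, selected.index]

# `higher_yes` and `higher_no` carry the same information; keep one
kept = kept.drop(columns='higher_no')

print(sorted(kept.columns))
```

Here `noise` is nearly uncorrelated with the grade and falls below the 0.2 threshold, while `studytime` and `higher_yes` survive.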