Sorted Split - Create train, valid, and test datasets using custom code
import pandas as pd
df = pd.read_csv('/kaggle/input/bluebook-for-bulldozers/TrainAndValid.csv', parse_dates=['saledate'], low_memory=False)
# Split the data 80:10:10 into train:valid:test datasets
train_size = 0.8
valid_size = 0.1
# Sort by the desired column first so the split is chronological;
# reset the index so row labels match the sorted positions
df = df.sort_values(by='saledate', ascending=True).reset_index(drop=True)
train_index = int(len(df) * train_size)
valid_index = int(len(df) * valid_size)  # number of rows in the validation set
# Slice by position: earliest rows go to train, the next block to valid, the rest to test
df_train = df.iloc[:train_index]
df_valid = df.iloc[train_index:train_index + valid_index]
df_test = df.iloc[train_index + valid_index:]
X_train, y_train = df_train.drop(columns='SalePrice').copy(), df_train['SalePrice'].copy()
X_valid, y_valid = df_valid.drop(columns='SalePrice').copy(), df_valid['SalePrice'].copy()
X_test, y_test = df_test.drop(columns='SalePrice').copy(), df_test['SalePrice'].copy()
print(X_train.shape, y_train.shape)
print(X_valid.shape, y_valid.shape)
print(X_test.shape, y_test.shape)
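
The same sorted split can be wrapped in a reusable function. The sketch below is only an illustration: `sorted_split` is a hypothetical helper name, and the `demo` frame is synthetic stand-in data (100 shuffled daily sales), not the bulldozer CSV. It sorts by the chosen column, cuts the frame by position, and prints the size of each piece.

```python
import numpy as np
import pandas as pd


def sorted_split(df, sort_col, train_size=0.8, valid_size=0.1):
    """Sort df by sort_col and cut it into contiguous train/valid/test frames.

    Hypothetical helper: the test set receives whatever rows remain after
    the train and valid fractions have been taken.
    """
    if not 0 < train_size + valid_size < 1:
        raise ValueError("train_size + valid_size must be between 0 and 1")
    # Sort a copy and reset the index so labels match sorted positions
    ordered = df.sort_values(by=sort_col, ascending=True).reset_index(drop=True)
    train_end = int(len(ordered) * train_size)
    valid_end = train_end + int(len(ordered) * valid_size)
    return (
        ordered.iloc[:train_end],
        ordered.iloc[train_end:valid_end],
        ordered.iloc[valid_end:],
    )


# Synthetic stand-in data (an assumption for this example): 100 daily sales, shuffled
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "saledate": pd.date_range("2011-01-01", periods=100, freq="D"),
    "SalePrice": rng.integers(10_000, 90_000, size=100),
}).sample(frac=1, random_state=0)

demo_train, demo_valid, demo_test = sorted_split(demo, "saledate")
print(len(demo_train), len(demo_valid), len(demo_test))
```

Because the rows are sorted before slicing, every date in the train piece precedes every date in the valid piece, which in turn precedes the test piece. That ordering is what keeps later sales from leaking into training on time-dependent data.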