@lettergram
Last active December 28, 2018 06:04
import numpy as np

# NOTE: create_word_embedding is not defined in this snippet; it is assumed to be defined elsewhere.


def encode_and_split_data(comments, categories, data_split=0.8):
    '''
    :param comments: List of lists containing all comments
    :param categories: List containing the labeled category for each comment
    :param data_split: Fraction of samples used for training (0.8 gives an 80/20 train/test split)
    :return x_train: Numpy array of encoded training samples (comments)
    :return x_test: Numpy array of encoded testing samples (comments)
    :return y_train: Numpy array of encoded training labels (categories)
    :return y_test: Numpy array of encoded testing labels (categories)
    '''
    # Word + Punctuation + POS Tags embedding
    encoded_comments = create_word_embedding(comments, add_pos_tags=True)
    # Word embedding only; do not add POS tags to the categories
    encoded_categories = create_word_embedding(categories, add_pos_tags=False)
    # Index of the first testing sample (everything before it is training data)
    training_sample = int(len(encoded_comments) * data_split)
    # Split the dataset sequentially (no shuffling) into training and testing sets
    x_train = np.array(encoded_comments[:training_sample])
    x_test = np.array(encoded_comments[training_sample:])
    y_train = np.array(encoded_categories[:training_sample])
    y_test = np.array(encoded_categories[training_sample:])
    return x_train, x_test, y_train, y_test
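
The split above can be tried end to end with a small, self-contained sketch. The gist does not include `create_word_embedding`, so the version below is a hypothetical stand-in (not the author's implementation): it maps each token to an integer id, zero-pads every sample to the same length, and ignores `add_pos_tags`. The toy comments and categories are invented purely for illustration.

```python
import numpy as np


def create_word_embedding(samples, add_pos_tags=False):
    # Hypothetical stand-in, NOT the gist author's function:
    # assigns each distinct token an integer id (starting at 1) and
    # zero-pads every sample to the longest length. add_pos_tags is ignored.
    vocab = {}
    encoded = []
    for sample in samples:
        tokens = sample if isinstance(sample, list) else [sample]
        encoded.append([vocab.setdefault(tok, len(vocab) + 1) for tok in tokens])
    width = max(len(row) for row in encoded)
    return [row + [0] * (width - len(row)) for row in encoded]


def encode_and_split_data(comments, categories, data_split=0.8):
    # Same logic as the function above: encode, then split sequentially.
    encoded_comments = create_word_embedding(comments, add_pos_tags=True)
    encoded_categories = create_word_embedding(categories, add_pos_tags=False)
    training_sample = int(len(encoded_comments) * data_split)
    x_train = np.array(encoded_comments[:training_sample])
    x_test = np.array(encoded_comments[training_sample:])
    y_train = np.array(encoded_categories[:training_sample])
    y_test = np.array(encoded_categories[training_sample:])
    return x_train, x_test, y_train, y_test


# Toy data, invented for this example only
comments = [["great", "post"], ["thanks"], ["nice", "work", "!"],
            ["spam"], ["hello", "world"]]
categories = ["positive", "neutral", "positive", "negative", "neutral"]

x_train, x_test, y_train, y_test = encode_and_split_data(comments, categories)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
```

Because the split is a plain slice, the last 20% of the input always becomes the test set. If the data is ordered (for example, grouped by category), shuffling comments and categories together before calling the function is likely worth considering.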