@prakhar21
Created January 19, 2020 17:03
Data Segment
import pandas as pd


def segment_data(data_file):
    """Shuffle data_file and write whole/train/test/dev TSVs (80/10/10).

    Output goes to ./data, which must already exist.
    """
    # Shuffle all rows, then drop exact duplicate rows.
    data = pd.read_csv(data_file, encoding='latin-1').sample(frac=1).drop_duplicates()
    data = data[['classes', 'title']].rename(columns={'classes': 'label', 'title': 'text'})
    data['label'] = '__label__' + data['label'].astype(str)
    # Vectorized string methods pass missing titles through instead of raising.
    data['text'] = data['text'].str.lower().str.strip()

    # Split boundaries: first 80% train, next 10% test, final 10% dev.
    train_end, test_end = int(len(data) * 0.8), int(len(data) * 0.9)
    to_tsv = dict(sep='\t', index=False, header=False)
    data.to_csv('./data/whole.csv', **to_tsv)
    data.iloc[:train_end].to_csv('./data/train.csv', **to_tsv)
    data.iloc[train_end:test_end].to_csv('./data/test.csv', **to_tsv)
    data.iloc[test_end:].to_csv('./data/dev.csv', **to_tsv)
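
The split boundaries used by the three `iloc` slices above can be checked without pandas. The following is a minimal sketch, not part of the original gist: `split_bounds` is a hypothetical helper, and the 25 sample rows are made up for illustration. It mirrors the `int(len(data)*0.8)` and `int(len(data)*0.9)` arithmetic and shows that truncation can leave the test and dev parts unequal in size.

```python
def split_bounds(n_rows, train_end_frac=0.8, test_end_frac=0.9):
    """Return (train_end, test_end) row indices for an 80/10/10 split.

    Hypothetical helper: same truncating arithmetic as the slices in
    segment_data, i.e. int(n * 0.8) and int(n * 0.9).
    """
    train_end = int(n_rows * train_end_frac)
    test_end = int(n_rows * test_end_frac)
    return train_end, test_end


# Made-up rows in the same "<label>\t<text>" layout the function writes.
rows = [f"__label__{i % 3}\ttitle {i}" for i in range(25)]

train_end, test_end = split_bounds(len(rows))
train = rows[:train_end]
test = rows[train_end:test_end]
dev = rows[test_end:]

# Every row lands in exactly one part; sizes differ because int() truncates.
print(len(train), len(test), len(dev))  # prints: 20 2 3
```

Because the three slices share their boundary indices, no row is dropped or duplicated across train, test, and dev, whatever the row count.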