@prrao87
Created January 13, 2019 00:03
Clean input tweet data to only have ASCII characters
import numpy as np
import pandas as pd


def _stance(path, topic=None):
    def clean_ascii(text):
        # Remove non-ASCII characters (code points >= 128) from the data
        return ''.join(i for i in text if ord(i) < 128)

    # Read the tab-separated file; the first row holds the column names
    orig = pd.read_csv(path, delimiter='\t', header=0, encoding="latin-1")
    orig['Tweet'] = orig['Tweet'].apply(clean_ascii)
    df = orig
    # Get only those tweets that pertain to a single topic in the training data
    if topic is not None:
        df = df.loc[df['Target'] == topic]
    X = df.Tweet.values
    # Map each stance label to an integer class index
    stances = ["AGAINST", "FAVOR", "NONE", "UNKNOWN"]
    class_nums = {s: i for i, s in enumerate(stances)}
    Y = np.array([class_nums[s] for s in df.Stance])
    return X, Y
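
To make the three steps in the function concrete (ASCII filtering, topic selection, label encoding), here is a minimal sketch that runs on a few in-memory rows instead of a file. The sample tweets and the "Topic A" / "Topic B" targets are made up for illustration and are not taken from the dataset. It uses plain lists in place of pandas and NumPy so it runs with the standard library alone.

```python
# Illustrative sketch only: the rows below are hypothetical, not real dataset entries.
STANCES = ["AGAINST", "FAVOR", "NONE", "UNKNOWN"]
CLASS_NUMS = {s: i for i, s in enumerate(STANCES)}


def clean_ascii(text):
    # Keep only characters whose code point is below 128 (plain ASCII).
    return ''.join(ch for ch in text if ord(ch) < 128)


rows = [
    {"Tweet": "Caf\u00e9 owners back the plan", "Target": "Topic A", "Stance": "FAVOR"},
    {"Tweet": "Not convinced \u2026", "Target": "Topic A", "Stance": "AGAINST"},
    {"Tweet": "Unrelated post", "Target": "Topic B", "Stance": "NONE"},
]

# Same logic as the function: filter by topic, clean the text, encode the labels.
topic = "Topic A"
selected = [r for r in rows if r["Target"] == topic]
X = [clean_ascii(r["Tweet"]) for r in selected]
Y = [CLASS_NUMS[r["Stance"]] for r in selected]

print(X)  # ['Caf owners back the plan', 'Not convinced ']
print(Y)  # [1, 0]
```

Note that the filter drops non-ASCII characters instead of transliterating them, so "Café" becomes "Caf". A stance label outside the four listed values would raise a KeyError in the lookup, both in this sketch and in the function.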