@PrithivirajDamodaran
Created May 10, 2019 13:08
Enron email dataset splitter/formatter
# Assumes you have the enron email dataset as emails.csv
import pandas as pd
data = pd.read_csv("emails.csv")
# Show full column contents; None removes the width limit (-1 was deprecated in pandas 1.0).
pd.set_option("display.max_colwidth", None)
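# Split each raw message into its first 15 lines plus the remaining body.
# This assumes every message carries the same headers in the same order; a missing
# or multi-line header (for example a long "To:" list) shifts the columns below.
new = data["message"].str.split("\n", n=15, expand=True)
data["from"] = new[2]      # "From:" header line
data["fromn"] = new[8]     # "X-From:" header line (sender display name)
data["to"] = new[3]        # "To:" header line
data["ton"] = new[9]       # "X-To:" header line (recipient display name)
data["subject"] = new[4]   # "Subject:" header line
data["msg"] = new[15]      # everything after the 15th line break (the body)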
# Drop the raw columns now that the fields have been extracted.
data.drop(columns=["message", "file"], inplace=True)
# Remove the header names and surrounding whitespace. The .str accessor passes
# missing values through instead of raising, unlike apply() with val.replace().
data["from"] = data["from"].str.replace("From:", "", regex=False).str.strip()
data["fromn"] = data["fromn"].str.replace("X-From:", "", regex=False).str.strip()
data["to"] = data["to"].str.replace("To:", "", regex=False).str.strip()
data["ton"] = data["ton"].str.replace("X-To:", "", regex=False).str.strip()
data["subject"] = data["subject"].str.replace("Subject:", "", regex=False).str.strip()
data["msg"] = data["msg"].str.replace("\n", " ", regex=False)
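# Let's look only at short emails (body under 100 characters) that are not replies.
# Note: .str.len() counts characters; use .str.split().str.len() to count words instead.
data[(data["msg"].str.len() < 100) & ~data["subject"].str.contains("Re:", na=False, regex=False)].sample(5)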
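
The snippet above picks headers by line position, which only works when every message has the same headers in the same order. As a hedged alternative, here is a minimal sketch that looks headers up by name with Python's standard `email` module. The function name `parse_message` and the output keys are assumptions chosen to mirror the column names used above; they are not part of the original gist.

```python
from email import message_from_string


def _clean(value):
    """Collapse whitespace, including the line breaks left by folded headers."""
    return " ".join(str(value).split())


def parse_message(raw):
    """Parse one raw email string into a flat dict (hypothetical helper)."""
    msg = message_from_string(raw)
    if msg.is_multipart():
        # Join the text of every leaf part; Enron mails are mostly single-part.
        parts = [p.get_payload() for p in msg.walk() if not p.is_multipart()]
        body = "\n".join(parts)
    else:
        body = msg.get_payload()
    return {
        # A missing header yields an empty string instead of shifting columns.
        "from": _clean(msg.get("From", "")),
        "fromn": _clean(msg.get("X-From", "")),
        "to": _clean(msg.get("To", "")),
        "ton": _clean(msg.get("X-To", "")),
        "subject": _clean(msg.get("Subject", "")),
        # Same body treatment as above: newlines become spaces.
        "msg": body.replace("\n", " ").strip(),
    }
```

Assuming `data` still holds the raw `message` column, something like `pd.DataFrame(data["message"].map(parse_message).tolist())` should build the same six columns in one pass.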
@iedmrc

iedmrc commented Nov 8, 2019

Hi,
How can I find a .csv version of enron dataset? Do you have any link?
Thanks!

@joseph-francis

> Hi,
> How can I find a .csv version of enron dataset? Do you have any link?
> Thanks!

Did you find the link?

@iedmrc

iedmrc commented Feb 16, 2020

> > Hi,
> > How can I find a .csv version of enron dataset? Do you have any link?
> > Thanks!
>
> Did you find the link?

Please check https://data.world/brianray/enron-email-dataset

@gokulnath27

> > Hi,
> > How can I find a .csv version of enron dataset? Do you have any link?
> > Thanks!
>
> Did you find the link?

https://www.kaggle.com/wcukierski/enron-email-dataset
