
@paultopia
Last active September 12, 2015 19:41
code that behaves very oddly. not reproducible probably
# creating an example list item plus an example list:
samplerow = [1, u'Message-ID: <20625717.1075857797770.JavaMail.evans@thyme>\r\nDate: Wed, 13 Dec 2000 06:03:00 -0800 (PST)\r\nFrom: thane.twiggs@enron.com\r\nTo: jeffery.ader@enron.com, mark.bernstein@enron.com, scott.healy@enron.com, \r\n\tjanelle.scheuer@enron.com, tom.dutta@enron.com, dana.davis@enron.com, \r\n\tpaul.broderick@enron.com, chris.dorland@enron.com, \r\n\tgautam.gupta@enron.com, michael.brown@enron.com, \r\n\tjohn.llodra@enron.com, george.wood@enron.com, joe.gordon@enron.com, \r\n\tstephen.plauche@enron.com, jennifer.stewart@enron.com, \r\n\tdavid.guillaume@enron.com, tom.may@enron.com, \r\n\trobert.stalford@enron.com, jeffrey.miller@enron.com, \r\n\tnarsimha.misra@enron.com, joe.quenet@enron.com, \r\n\tpaul.thomas@enron.com, ricardo.perez@enron.com, \r\n\tkevin.presto@enron.com, sarah.novosel@enron.com, \r\n\tchristi.nicolay@enron.com\r\nSubject: ISO-NE failure to mitigate ICAP market -- Release of ISO NE\r\n confidential information\r\nMime-Version: 1.0\r\nContent-Type: text/plain; charset=us-ascii\r\nContent-Transfer-Encoding: 7bit\r\nX-From: Thane Twiggs\r\nX-To: Jeffery Ader, Mark Bernstein, Scott Healy, Janelle Scheuer, Tom Dutta, Dana Davis, Paul J Broderick, Chris Dorland, Gautam Gupta, Michael Brown, John Llodra, George Wood, Joe Gordon, Stephen Plauche, Jennifer N Stewart, David Guillaume, Tom May, Robert Stalford, Jeffrey Miller, Narsimha Misra, Joe Quenet, Paul D Thomas, Ricardo Perez, Kevin M Presto, Sarah Novosel, Christi L Nicolay\r\nX-cc: \r\nX-bcc: \r\nX-Folder: \\Joseph_Quenet_Dec2000\\Notes Folders\\All documents\r\nX-Origin: Quenet-J\r\nX-FileName: jquenet.nsf\r\n\r\nThe New England Conference of Public Utilities Commissioners (NECUPUC) filed \nan Answer if Support of the Motion of the Maine Public Utilities Commission \nfor Disclosure of Information.\nNECUPUC supports the request for the release of the unredacted copies of \nISO-NE\'s September 21, 2000 Answer in this case. 
In the alternative, they \nwould ask that the Commission provide to the regulators that are parties to \nthe proceeding unredacted copies of the ISO\'s September 21, 2000 Answer \nsubject to an appropriate protective order.\n\n\n\nDuke Energy North America (DENA) filed an Answer that opposes the MPUC \nrequest for public information.\n\nDENA argues that only a three month lag in the release of confidential \ninformation is impermissible under a prior FERC ruling in the NSTAR Services \nCo. case which set out a six-month lag rule for the release of information.\nTheir second argument was that the request seeks information for all NEPOOL \nmarkets and not just the ICAP market which is subject to the suit.\nIf the FERC authorizes the release of confidential information, then it \nshould be subject to a protective order which contains the following:\nThe information may only be used for the purposes of this docket.\nOnly specifically named "reviewing Reps associated with the MPUC may review \nthe information.\nThe confidential materials may not be removed from the NE-ISO\'s premises.\nThe reviewing rep must execute a nondisclosure certificate.\n\n\nAnswer to the MPUC\'s Motion for Disclosure of Information from Northeast \nUtilities Service Company and Select Energy and request for expedited \ncommission action. NUSCO and Select Energy Support MPUC\'s request for \ndisclosure of information that the ISO has filed under seal, but opposes the \nselective disclosure of this information to the MPUC and other regulatory \ncommissions and not other participants. NUSCO and Select Energy request \nexpedited action due the financial implications and there is also \nconsiderable uncertainty regarding prices in the residual ICAP market in \nJanuary, February and March of 2000 due to the suspended settlement pending \nCommission guidance.\n\n']
enronEmails = [samplerow[:], samplerow[:]]
import email, nltk, re
nltk.download("stopwords")
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
def parseEmail(document):
    # strip unnecessary headers, header text, etc.
    theMessage = email.message_from_string(document)
    tofield = theMessage['to']
    fromfield = theMessage['from']
    subjectfield = theMessage['subject']
    bodyfield = theMessage.get_payload()
    wholeMsgList = [tofield, fromfield, subjectfield, bodyfield]
    # get rid of any fields that don't exist in the email
    cleanMsgList = [x for x in wholeMsgList if x is not None]
    # now return a string with all that stuff run together
    return ' '.join(cleanMsgList)

def lettersOnly(document):
    return re.sub("[^a-zA-Z]", " ", document)

def wordBag(document):
    return lettersOnly(parseEmail(document)).lower().split()

def cleanDoc(document):
    dasbag = wordBag(document)
    # get rid of "enron" for obvious reasons, also the .com
    bagB = [word for word in dasbag if word not in ['enron', 'com']]
    unstemmed = [word for word in bagB if word not in stopwords.words("english")]
    return [stemmer.stem(word) for word in unstemmed]
# THIS WILL WORK:
print cleanDoc(enronEmails[0][1])
# BUT THEN:
def atLeastThreeString(cleandoc):
    return ' '.join([w for w in cleandoc if len(w) > 2])
# THIS WILL STILL WORK:
print atLeastThreeString(cleanDoc(enronEmails[0][1]))
# THIS WILL NOT: throws errors, sometimes alleging that I'm trying to pass a list to email.message_from_string()
# and sometimes alleging that I'm trying to feed it unicode (I AM trying to feed it unicode, but it's been unicode
# all along and worked before)
justEmails = [email[1] for email in enronEmails]
bigEmailsList = [atLeastThreeString(cleanDoc(email)) for email in justEmails]
# Then after running this, it turns around and starts throwing errors back when I try to rerun line 38 too.
# moreover, looping doesn't work either.
cleanedlist = []
failedlist = []
for email in justEmails:
    try:
        cleanedlist.append(atLeastThreeString(cleanDoc(email)))
    except Exception:
        failedlist.append(email)
print len(cleanedlist)
print len(failedlist)
# generates 0 in cleanedlist, and the entire thing in failedlist
# BUT using map does work:
def wholeTask(document):
    return atLeastThreeString(cleanDoc(document))
fullList = map(wholeTask, justEmails)
# finally produces the expected output.
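
One plausible explanation, assuming this ran under Python 2 in a single session (a notebook, for example): the loop variable `email` in the list comprehensions and the `for` loop rebinds the global name `email`, so `parseEmail` no longer sees the `email` module. That would match both error messages, since `email` ends up being a list after iterating over `enronEmails` and a unicode string after iterating over `justEmails`. It would also explain why `map` works once the module has been imported again, because `map` introduces no loop variable. The sketch below reproduces the effect with a `for` loop, which leaks its variable on Python 2 and 3 alike. `parse_subject` and `raw_messages` are hypothetical names made up for the illustration, not part of the gist.

```python
import email  # the standard-library module that parseEmail relies on


def parse_subject(document):
    # Looks up the name `email` at call time, just as parseEmail does.
    return email.message_from_string(document)['subject']


raw_messages = ["Subject: first\n\nbody one", "Subject: second\n\nbody two"]

# Works while `email` still refers to the module.
before = parse_subject(raw_messages[0])

# A for loop (and, on Python 2, a list comprehension) leaves its loop
# variable bound in the enclosing scope, so `email` is now a string.
for email in raw_messages:
    pass

try:
    parse_subject(raw_messages[0])
    error_name = None
except AttributeError as exc:
    # The string has no attribute `message_from_string`.
    error_name = type(exc).__name__

# Re-importing restores the module; a different loop variable keeps it intact.
import email
subjects = [parse_subject(msg) for msg in raw_messages]
```

If this is the cause, renaming the loop variable in the gist (for instance `for msg in justEmails`) and re-running the import should make the comprehension and the loop behave like the `map` version.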