@adamhooper
Last active September 5, 2018 22:56
Loading a PST into Overview
  1. Download a bunch of PSTs into a directory
  2. Download psts-to-files.sh and fiddle-with-rtf-bodies.rb to the same directory
  3. Install readpst (it ships with libpst; the Debian and Ubuntu package is pst-utils)
  4. chmod +x psts-to-files.sh fiddle-with-rtf-bodies.rb
  5. Run ./psts-to-files.sh
  6. Upload the files directory to www.overviewdocs.com
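
Step 5 is where each message's headers get merged into its RTF body. The idea can be sketched as below. This is a simplified illustration rather than the gist's script: rtf_escape, headers_to_rtf, and prepend_headers are hypothetical names, and the sample RTF is made up.

```ruby
# Escape the three characters RTF treats specially. The block form of gsub is
# deliberate: in a replacement string, backslashes are themselves escapes.
def rtf_escape(text)
  text.gsub(/[\\{}]/) { |char| "\\#{char}" }
end

# Turn header lines into RTF paragraphs, followed by one empty paragraph.
def headers_to_rtf(header_lines)
  header_lines.map { |line| "#{rtf_escape(line)}\\par\n" }.join + "\\par\n"
end

# Insert the header paragraphs before the first line that starts with a
# control word, keeping that control word intact.
def prepend_headers(rtf, header_lines)
  rtf.sub(/^\\/) { "#{headers_to_rtf(header_lines)}\\" }
end

# Made-up sample input, loosely shaped like an RTF body.
rtf = "{\\rtf1\\ansi\n\\pard Hello\\par\n}"
puts prepend_headers(rtf, ['From: a\\b@example.com', 'Subject: {draft}'])
```

The headers land in the document body, so they show up as ordinary text above the message once the file is imported.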

This approach works best for fewer than 20,000 emails. The scripts convert email messages to RTF to preserve formatting, and Overview can take a while to import that many files.
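
To check where a converted directory stands relative to that rough 20,000 mark, one option is to count the message files before uploading. The helper below is a sketch, not part of the gist: message_paths is a hypothetical name, and it assumes messages are the paths ending in a number or a number plus .rtf, the same convention files-to-email-csv.rb relies on.

```ruby
# Hypothetical helper: keeps only the paths that look like converted messages,
# i.e. a numeric filename with an optional .rtf extension. Anything else
# (attachments, for instance) is skipped.
def message_paths(paths)
  paths.select { |path| path =~ %r{/[0-9]+(\.rtf)?\z} }
end

# Run from the directory that contains "files"; prints 0 if it does not exist.
count = message_paths(Dir['files/**/*']).size
puts "#{count} messages found"
puts 'Consider the CSV route for a set this large' if count >= 20_000
```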

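If you want to handle even more emails, you can upload a CSV instead. The CSV drops all attachments and occasionally some body text. Here's what to do once ./psts-to-files.sh has run:

  1. Install unrtf
  2. Download files-to-email-csv.rb to the same directory and chmod +x it
  3. Run ./files-to-email-csv.rb
  4. Upload the resulting emails.csv to www.overviewdocs.com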
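
For a sense of what the CSV contains, here is a minimal sketch of the quoting rule and the column layout that files-to-email-csv.rb writes: an id, a title, the seven headers, then the text. The sample values are made up for illustration, and COLUMNS and csv_row are stand-in names for this example.

```ruby
COLUMNS = %w(id title Bcc Cc Date From Message-Id Subject To text)

# Wrap a value in double quotes when it contains a comma, a double quote, or a
# control character such as a newline; double any embedded quotes.
def quote_csv_value(value)
  value =~ /[\x00-\x1f",]/ ? "\"#{value.gsub('"', '""')}\"" : value
end

def csv_row(values)
  values.map { |v| quote_csv_value(v.to_s) }.join(',') + "\n"
end

# Made-up sample message; missing headers become empty cells.
sample = [
  'files/Inbox/1', 'files/Inbox/1',
  '', '', 'Mon, 01 Jun 2015 09:00:00 -0400', 'alice@example.com',
  '', 'Re: budget, "final"', 'bob@example.com',
  "Subject: Re: budget, \"final\"\n\nSee attached."
]

print csv_row(COLUMNS)
print csv_row(sample)
```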
#!/usr/bin/env ruby
#
# Modifies the `files` directory in-place: converts the `readpst`-output files
# into Overview-friendly ones.
#
# Implementation:
#
# Searches the `files` directory for an email with filename `ABC-rtf-body.rtf`,
# accompanied by a plaintext version `ABC`. Deletes both; adds the basic
# headers from `ABC` to the top of `ABC-rtf-body.rtf` and outputs the result
# as `ABC.rtf`.
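def plain_email_to_rtf_headers(plain_filename)
  plain = IO.read(plain_filename, binmode: true)
  # Headers end at the first blank line; use the whole file if there is none
  end_of_header_index = plain.index(/\n\r?\n/) || plain.length
  header = plain[0...end_of_header_index]
  # Escape RTF's special characters (\, { and }). The block form of gsub is
  # needed: gsub(/\\/, '\\\\') replaces each backslash with a single backslash.
  header
    .split(/\r?\n/)
    .select { |s| s =~ /^(From|To|Cc|Bcc|Date|Message-Id|Subject):/i }
    .sort
    .map { |s| s.gsub(/[\\{}]/) { |c| "\\#{c}" } + "\\par\n" }
    .join('') + "\\par\n\\par\n"
end
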
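def mix_plain_into_rtf(plain_filename, rtf_filename)
  rtf_headers = plain_email_to_rtf_headers(plain_filename)
  rtf_contents = IO.read(rtf_filename, binmode: true)
  # Insert the headers before the first line that starts with a control word.
  # The block form inserts rtf_headers literally (a string replacement would
  # treat its backslashes as escapes) and keeps the matched backslash.
  new_contents = rtf_contents.sub(/^\\/) { "#{rtf_headers}\\" }
  output_filename = "#{plain_filename}.rtf"
  IO.write(output_filename, new_contents, binmode: true)
end
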
Dir['files/**/*-rtf-body.rtf'].each do |rtf_filename|
  # Strip the 13-character "-rtf-body.rtf" suffix to get the plaintext filename
  plain_filename = rtf_filename[0..-14]
  if File.exist?(plain_filename)
    puts "Processing #{plain_filename}..."
    mix_plain_into_rtf(plain_filename, rtf_filename)
    File.delete(plain_filename)
    File.delete(rtf_filename)
  end
end
#!/usr/bin/env ruby
#
# Reads emails from the `files` directory (as created by `psts-to-files.sh`
# and `fiddle-with-rtf-bodies.rb`), and writes them to `emails.csv`.
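UsefulHeaders = %w(Bcc Cc Date From Message-Id Subject To)
# Anchored at the start of the line so that, for example, "X-Cc:" does not match "Cc"
UsefulHeadersRegexp = Regexp.new("\\A(#{UsefulHeaders.join('|')}): (.*)", Regexp::IGNORECASE)
# NOENCODING replaces the old third argument 'n', which newer Ruby versions (3.3+) reject
TrimAttachmentsRegexp = Regexp.new("\\A(.*?)\r?\n--(?:alt-)?--boundary-LibPST-iamunique-\\d+_-_-\r?\nContent-Type: (text/html|application)", Regexp::MULTILINE | Regexp::NOENCODING)
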
def quote_csv_value(datum)
  # Quote values that contain control characters, double quotes, or commas
  if datum =~ /[\x00-\x1f",]/
    "\"#{datum.gsub(/"/, '""')}\""
  else
    datum
  end
end

def array_to_csv_row(arr)
  arr.map { |value| quote_csv_value(value) }.join(',') + "\n"
end

def plain_email_to_csv_row(filename, blob)
  headers_string, body_string = blob.split(/\r?\n\r?\n/, 2)

  # Collect the useful headers into a hash keyed by lowercased name
  headers = headers_string.to_s
    .split(/\r?\n/)
    .each_with_object({}) do |line, hash|
      hash[$1.downcase] = $2 if line =~ UsefulHeadersRegexp
    end

  # The text column repeats the useful headers above the body
  header_lines = UsefulHeaders.map do |header_name|
    value = headers[header_name.downcase]
    "#{header_name}: #{value}" unless value.nil?
  end
  body = header_lines.compact.join("\n") + "\n\n#{body_string}"

  # The id and title columns are both the filename
  arr = [ filename ] * 2
  UsefulHeaders.each do |h|
    # Parentheses are required: `arr << x || ''` parses as `(arr << x) || ''`
    arr << (headers[h.downcase] || '')
  end
  arr << body
  array_to_csv_row(arr)
end

def rtf_filename_to_csv_row(rtf_filename)
  unrtf_output = IO.popen([ 'unrtf', rtf_filename, '--nopict', '--text' ], binmode: true, &:read)
  # The unrtf --quiet option doesn't seem to work (v0.21.9), so strip its
  # header by hand. The /m flag lets `.` match newlines, which the pattern
  # needs whenever the header or the text spans more than one line.
  text = if unrtf_output =~ /\A### +Translation from RTF.*?---------\n(.*)/m
    $1
  else
    unrtf_output
  end
  plain_email_to_csv_row(rtf_filename, text)
end

def eml_filename_to_csv_row(eml_filename)
  full_text = IO.read(eml_filename, binmode: true)
  # Keep only the text before the first HTML or attachment MIME part
  text = if full_text =~ TrimAttachmentsRegexp
    "#{$1}\n\n [ attachment(s) truncated ]"
  else
    full_text
  end
  plain_email_to_csv_row(eml_filename, text)
end

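File.open('emails.csv', 'wb') do |f|
  f.write("id,title,#{UsefulHeaders.join(',')},text\n")
  Dir['files/**/*'].each do |filename|
    # Skip directories, including any that happen to have numeric names
    next unless File.file?(filename)
    if filename =~ /\/[0-9]+\.rtf\z/
      # Messages with an RTF body, as merged by fiddle-with-rtf-bodies.rb
      f.write(rtf_filename_to_csv_row(filename))
    elsif filename =~ /\/[0-9]+\z/
      # Plain messages, as written by readpst
      f.write(eml_filename_to_csv_row(filename))
    end
  end
end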
#!/bin/bash
#
# When run in this directory, converts `*.pst` in this directory into a `files`
# subdirectory containing all email messages (in RTF format) and attachments
# (unmodified, except for the media and archive types deleted below).

# Put all files into the "files" directory
mkdir -p files
# -f keeps rm quiet when "files" is already empty
rm -rf files/*
for f in *.pst; do
  readpst -S -8 -D -o files "$f"
done

# Delete media, archive, and .eml attachments, matched by file extension
find files -type f -iregex '.*\.\(wmv\|gif\|jpg\|jpeg\|png\|eml\|zip\|rar\|avi\)$' -exec rm {} \;

./fiddle-with-rtf-bodies.rb