@drjwbaker
Last active January 24, 2018 16:15
Email Preservation: How Hard Can It Be? 2, 24 January 2018


Live notes, so an incomplete, partial record of what actually happened.

Tags: dpcemail2

My asides in {}

http://www.dpconline.org/events/briefing-day/email-2018


Talks

1035 – Introductory talk (Chris Prom, University of Illinois at Urbana-Champaign, and Kate Murray, Library of Congress)

Not about constructing policy, rather recommendations .. but looking to move into policy directions in dissemination .. bibliography: http://www.emailarchivestaskforce.org/bibliography/ .. plan to publish in May .. since first briefing, additions: need to emphasise building a user community, challenge of sensitive emails, tools to process at scale, more AI for classification, bring in more on what users want from email, workforce skills, more call to arms, sustainable open source tools.

Report in sum:

  • Section 1: The Untapped Potential of Email Archives .. email matters and that is not the prevailing opinion .. email is bounded by technology ..

Email Preservation has to focus on potential access. Yes! Really, all of our Digital Preservation efforts should be considering this. @chrisprom #dpcemail2 pic.twitter.com/EWPcBu1q5c

— Natalie Harrower (@natalieharrower) January 24, 2018

.. what needs to change in archival processes: embrace email as complex research data {so not really an archive}; do more AI/NLP; more active role for content creators

  • Section 2: {missed the title, probably accession workflows}

#DPCEmail2 #DPC @chrisprom - Report distinguishes between "organisational" and "personal" #recordsmanagement and #archives perspectives - This can drive changes in processing work-flows. Tools are common, tasks may differ.

— Tim Gollins (@timgollins) January 24, 2018

Diagram of email processing chain for repositories #dpcemail2 #digipres #digitalpreservation #archives #recordsmanagement pic.twitter.com/37RswEefrL

— Natalie Harrower (@natalieharrower) January 24, 2018
  • Section 3: Email as a Documentary Technology

Email is a system for creating and communicating messages .. extensible .. where is the boundary around what constitutes a message? .. laying out the architecture of email reminds us of the complexity .. one of the more successful internet technologies .. ~ 'average email in a corporate archive may exist in 30/40 places in different formats' .. email in a constant state of transition ..
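{A minimal sketch of the 'where is the boundary?' question, using Python's standard `email` package. The sample message, addresses and filename are made up for illustration. It shows that what looks like one message is a tree of MIME parts: a container, a body and an attachment.}

```python
from email import message_from_string, policy

# A made-up message: one plain-text body plus one CSV attachment.
RAW = """\
From: alice@example.org
To: bob@example.org
Subject: Minutes
Message-ID: <1@example.org>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="XX"

--XX
Content-Type: text/plain; charset="utf-8"

See attached.
--XX
Content-Type: text/csv; charset="utf-8"
Content-Disposition: attachment; filename="minutes.csv"

item,owner
1,alice
--XX--
"""


def describe_parts(raw):
    """Return the Message-ID and a (content type, filename) pair for every MIME part."""
    msg = message_from_string(raw, policy=policy.default)
    parts = [(part.get_content_type(), part.get_filename()) for part in msg.walk()]
    return msg["Message-ID"], parts


if __name__ == "__main__":
    message_id, parts = describe_parts(RAW)
    print(message_id)
    for content_type, filename in parts:
        print(content_type, filename)
```

Whether the attachment counts as 'the message' or as a separate object is a preservation decision, not something the format settles.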

  • Section 4: Current Services and Trends

Challenges around getting, keeping, reviewing, making available.

  • Section 5: Potential Solutions and Sample Workflows

Preservation Approaches: bit level; forward migration gets more attention because stable email standards usually make it possible and easy(ish); emulation is good for user experience and for working with attachments ..

Potential email Preservation processes #dpcemail2 pic.twitter.com/huNNNJKR1f

— Natalie Harrower (@natalieharrower) January 24, 2018

.. Tools: lots out there! Bad news is they need to be chained together into rather complex workflows; ePADD: good for import/export, entity extraction and NLP; TOMES https://www.ncdcr.gov/resources/records-management/tomes/tomes-resources aimed at government records community and scales up well, looking to integrate machine learning; DArcMail (Smithsonian) converts to the EAXS format; Harvard Electronic Archiving System: takes and migrates attachments, unlike something like ePADD; + proprietary tools, eg Emailchemy, AccessData FTK, Preservica ..

#DPCEmail2 #DPC @chrisprom - discussing EPADD https://t.co/0nOgmEoR3Y and TOMES https://t.co/2MPr51UUzz and DArcMail https://t.co/OVB24G5mt0 - Most appear to be "migration based"

— Tim Gollins (@timgollins) January 24, 2018

.. Workflow: might include getting a PST out of FTK/BitCurator to then pass it through things like ePADD ..

Harvard workflow scenario (using ePadd) #dpcemail2 pic.twitter.com/YS5EIAeWFe

— David Underdown (@DavidUnderdown9) January 24, 2018

Very interesting that we're at a moment where research data preservation has a higher priority that preserving historical collections of 'letters' - if email can be seen as both. #DPCEmail2 https://t.co/rQPkx9LyNh

— Natalie Harrower (@natalieharrower) January 24, 2018

What would it mean to archive email as research data rather than as letters or personal papers?

Given complexity of the work-flows (consequent expected fragility) being presented I really feel that that "minimal ingest" and "autonomous agent processing" approaches to #digipres should be considered - we will reap effects of ingest mis-processing at our leisure #dpcemail2

— Tim Gollins (@timgollins) January 24, 2018

Ability to go back to the bit level should be fundamentally important in case things go wrong: human error and software tool black boxedness. And the report stresses that.
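{A minimal sketch of what 'going back to the bit level' relies on: a checksum recorded at ingest that can be recomputed later. SHA-256 and the function names are my assumptions, not anything the report prescribes.}

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large mailbox files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_ok(path, recorded_digest):
    """True if the file still matches the digest recorded at ingest."""
    return sha256_of(path) == recorded_digest
```

If the untouched original passes this check, any migration or processing step that went wrong can be rerun from it.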

Quite fascinated by the archival and historiographical implications of approaching preservation of email as research data vs. as 'collections' of correspondence a la historical letters. #DPCEmail2 Fodder for our next chat @WilliamKilbride! @dpc_chat

— Natalie Harrower (@natalieharrower) January 24, 2018

Is email closer to 'research data' or 'personal letters' when it comes to archival practices? Does it matter? Thread here from #dpcemail2 briefing day. https://t.co/vSIkQYWJEo

— Digital Repository (@dri_ireland) January 24, 2018

1130 – Using email archives in Research (James Baker, University of Sussex)

Me!

1200 – The reconstruction of narrative in E-Discovery investigation (Simon Attfield, Middlesex University and Larry Chapin, Attorney)

Work around civil and criminal procedures ..

#dpcemail2 definition of eDiscovery pic.twitter.com/6l38E7ck5u

— David Underdown (@DavidUnderdown9) January 24, 2018

#dpcemail2 Chapin believes lawyers need to think in terms of narrative to recover the real meaning of the documents that are produced by eDiscovery tools

— David Underdown (@DavidUnderdown9) January 24, 2018

'email is the motherload of evidence' - Larry Chapin at #dpcemail2

— William Kilbride (@WilliamKilbride) January 24, 2018

.. the LIBOR story is made of what might never have been known had emails not been preserved .. email a place full of stories ..

#dpcemail2 relationships between people that became obvious from emails shaped direction of enquiries. Also explicit mention that some things weren’t being recorded in emails to avoid regulatory attention

— David Underdown (@DavidUnderdown9) January 24, 2018

.. eg use of nicknames .. What are lawyers using: predictive coding, story order tends to persuade juries better ..

#dpcemail2 now over to Simon Attfield, describing what predictive coding is and how it’s used. Lawyer provides computer with some documents known to be useful. Computer responds with “similar” ones. These are reviewed by lawyer and process iterates

— David Underdown (@DavidUnderdown9) January 24, 2018

.. better the structure of the documents = better the recall and precision, quicker document review
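{A toy sketch of one round of the loop described above: a reviewer marks some documents as useful, the computer ranks the rest by similarity, and the result is scored with precision and recall. This is plain bag-of-words cosine similarity, swapped in for illustration; real predictive coding tools train classifiers and I am not describing any particular product. The documents and names are invented.}

```python
import math
import re
from collections import Counter


def vectorise(text):
    """Bag-of-words term counts for one document."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_unreviewed(docs, relevant_ids):
    """Rank documents not yet marked relevant by similarity to the relevant seed set."""
    centroid = Counter()
    for doc_id in relevant_ids:
        centroid.update(vectorise(docs[doc_id]))
    scores = {
        doc_id: cosine(centroid, vectorise(text))
        for doc_id, text in docs.items()
        if doc_id not in relevant_ids
    }
    return sorted(scores, key=scores.get, reverse=True)


def precision_recall(retrieved, truly_relevant):
    """Precision: share of retrieved that is relevant. Recall: share of relevant that was retrieved."""
    retrieved, truly_relevant = set(retrieved), set(truly_relevant)
    hits = len(retrieved & truly_relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(truly_relevant) if truly_relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    docs = {
        "a": "please fix the libor rate submission today",
        "b": "libor submission needs to be lower, fix it",
        "c": "lunch on friday?",
        "d": "the rate submission for libor is done",
        "e": "holiday photos attached",
    }
    ranked = rank_unreviewed(docs, relevant_ids={"a"})
    print(ranked[:2], precision_recall(ranked[:2], {"b", "d"}))
```

In practice the reviewer would confirm or reject the top-ranked documents, add the confirmed ones to the seed set, and run the ranking again.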

1230 – Email as corporate record (James Lappin)

now, with @JamesLappin we move on to the routine disposal of email #dpcemail2

— William Kilbride (@WilliamKilbride) January 24, 2018

#dpcemail2 now James Lappin on routine deletion policies for email and the impact for capture of emails. Expectation is that emails that constitute a “record” will be moved to a record management system

— David Underdown (@DavidUnderdown9) January 24, 2018

Gap between what is sent and what is moved into document management: anything between 1 in 20 and 1 in 100 .. or we could ask people to get rid of the personal and non-business material and assume the rest is significant, the problem with which is that governments and businesses would be forced to keep stuff (and pay for keeping that stuff) that they can't use ..

#dpcemail2 so those emails belonging to predecessor (now retired etc) becomes risk and burden, rather than valuable asset. Cannot be practically used by the organisation due to the personal material etc

— David Underdown (@DavidUnderdown9) January 24, 2018

.. TNA Guidelines: auto-deletion ..

@JamesLappin #dpcemail2 #DPC - This is an AWESOME articulation of the CORE paradox at the heart of e-mail archiving in organisations (especially government). I hope the slides will be on-line. #archives and #digipres communities - this really matters.

— Tim Gollins (@timgollins) January 24, 2018

.. if government fears the cost and risk of treating business emails as records, it creates a situation in which deletion is normal ..

@JamesLappin #dpcemail2 #DPC - articulating the risk of a voluminous and increasing granular tipping point that might make it impractical for organisations - 1 in 100 emails being kept might be considered acceptable because that filters out the risky material.

— Tim Gollins (@timgollins) January 24, 2018

@JamesLappin #dpcemail2 #DPC - citing Sir Alex Allen report https://t.co/vmam6mkN7s and https://t.co/bXJc68iURo as evidence that e-mail is NOT being kept in volume.

— Tim Gollins (@timgollins) January 24, 2018

.. what are the circumstances in which it would be appropriate to capture an individual's account in totality and indefinitely?

1400 – The Future of Email Archiving: Four Propositions (Jason R. Baron, Drinker Biddle LLP)

Massive growth in presidential and federal records: 32m for Bill Clinton, 300m+ expected for Obama .. access massive issue when stuff not physical .. 1) someone has to be preserving, got to abandon manual. Trying to apply Schellenberg style record schedules to email just doesn't work (too much manual effort!)

#dpcemail2 Proposition 1: manual approaches result in non-compliance

— David Underdown (@DavidUnderdown9) January 24, 2018

Jason R. Baron: "Applying [traditional] records schedules to email does not work." #dp0c #dpcemail2

— somaya langley (@criticalsenses) January 24, 2018

End users are always the Achilles heel (that is, asking people to drag and drop or print to paper etc their emails). M-12-18 Obama directive: by 2019 federal agencies will manage all important records in electronic format .. 2) Automated Capture. Pushed Capstone approach: choose what is scheduled to be deleted by role. 1.5% of stuff will be kept indefinitely. And even that is still 'Niagara Falls', billions of emails. AI is coming, less hierarchical capture will replace Capstone: bigger buckets .. 3) Advanced Search will aid access. See eDiscovery, telling stories, common in lawyer world. Lawyers need to provide 100% recall. Keywords are rubbish. Predictive Coding: supervised learning with small seed set and human input to correlate likeness {basically, clustering tech} .. 4) Sensitivity review needs to be embedded. Because beyond easy stuff (dates, names, bank account numbers) it is super hard.

We need faith in data analytics. Because otherwise archives will be less open not more.
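{A minimal sketch of the Capstone idea from proposition 2: the retention decision hangs on the account holder's role, not on anyone reading individual messages. The role names and the two retention labels are hypothetical placeholders of mine, not the actual schedule.}

```python
from dataclasses import dataclass

PERMANENT = "permanent"
TEMPORARY = "temporary"

# Hypothetical list of senior roles whose whole account is kept.
CAPSTONE_ROLES = {"agency head", "deputy head", "chief of staff"}


@dataclass(frozen=True)
class Account:
    address: str
    role: str


def retention_for(account, capstone_roles=CAPSTONE_ROLES):
    """Decide retention for a whole account from the holder's role alone."""
    role = account.role.strip().lower()
    return PERMANENT if role in capstone_roles else TEMPORARY


def permanent_share(accounts):
    """Fraction of accounts that end up in the permanent bucket."""
    if not accounts:
        return 0.0
    kept = sum(1 for account in accounts if retention_for(account) == PERMANENT)
    return kept / len(accounts)
```

The appeal is that no end user has to file anything; the cost is that everything in a permanent account is kept, personal material included.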

1430 – Panel discussion introduced by Tim Gollins (National Records of Scotland): technology assisted review, fact, fiction or jam tomorrow and why this matters for email preservation

Technology Assisted Review, Predictive Coding, et al: one big bucket of tech terms used interchangeably .. in the eDiscovery space it works in reviewing for 'responsiveness to the "Matter"' .. lots of commercial investment: economic driver, money to be made .. it is largely information retrieval via machine learning .. TAR is good, but borrowing tech built for other sectors' tasks to do 'our' tasks won't necessarily do the job: e-Discovery is not e-Archiving selection, appraisal, or sensitivity review - they are related tasks, and the tech needs to be tuned for ours .. use of TAR in a case of email: Cormack and Grossman on release of the email of Governor Tim Kaine (Virginia)

We need more research into practical solutions to sensitivity review for things like email archives.
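{A minimal sketch of the 'easy stuff' in sensitivity review that Baron mentioned (dates, bank account numbers), to show why the rest is hard. The patterns are illustrative assumptions of mine (UK-style dates and sort codes, 8-digit account numbers), not a vetted ruleset.}

```python
import re

# Illustrative patterns only; real review would need far more than this.
PATTERNS = {
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "sort_code": re.compile(r"\b\d{2}-\d{2}-\d{2}\b"),
    "account_number": re.compile(r"\b\d{8}\b"),
}


def flag_easy_sensitivities(text):
    """Return every pattern match in the text, keyed by pattern name."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}


if __name__ == "__main__":
    print(flag_easy_sensitivities("Paid on 24/01/2018 from sort code 12-34-56 account 12345678."))
    print(flag_easy_sensitivities("Let's keep the usual arrangement with our friend off email."))
```

The second sentence returns no matches at all, yet it is exactly the kind of text a human reviewer would flag, which is the gap the research would need to close.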

1530 – Review of Report Recommendations and Roadmap (Chaired by William Kilbride)

  • Section Six: The Path Forward and Next Steps

Community Development and Advocacy. Low-Barrier Activities: demystify email archiving for donors and users, assess readiness, skills, maintain the tool assessment that the panel has done. High-Barrier Activities: sustain the email archiving community (not ready for BitCurator style model), specification planning for new tools, develop authenticity criteria, improve standards, develop PDF archiving options (good for small institutions)

Tool Support, Testing, Development. Low-Barrier Activities: test existing tools re loss, improve validation tools. High-Barrier Activities: work on integration and interoperability, self-archiving tools, sensitivity review.

#dpcemail2 in response @WilliamKilbride comments that we need to demonstrate value of email archives, asset, not a liability. Mentions eg of paper correspondence having insurance valuations and acceptance in lieu

— David Underdown (@DavidUnderdown9) January 24, 2018

#dpcemail2 @j_w_baker initially mentions test corpus might also help with donor confidence: let them see how email archives work and what can be found. Moves on to question of communal model for tool development

— David Underdown (@DavidUnderdown9) January 24, 2018

Some admin...

Creative Commons Licence
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Exceptions: embeds to and from external sources, and direct quotations from speakers
