drjwbaker/Mining Digital Repositories.md

## Mining Digital Repositories.md

      
###Mining Digital Repositories: Challenges and Horizons, KB, Den Haag, 10 April 2014
Live notes, so an incomplete, partial record of what actually happened.
Tag #digrep14

####Challenges (Thursday)
Hans Jansen, KB
KB research lab.
Focus on engaging with humanities scholars.
Realistically, there is a need for partnerships with commercial bodies, but there will of course be tension.
Aim: assess opportunities and challenges for using trans-national, multi-lingual, multi-media digital repositories.
#digrep14 aims of the workshop pic.twitter.com/SDkJl3jRd1
— Melvin Wevers (@melvinwevers) April 10, 2014
Agenda:

What do we really mean by reuse?
Cross-border data flows.
Digital multimedia and open services for research ... not a trivial problem.
Stimulate international collaboration.
...try not to get bogged down in technical details.

Joris van Eijnatten, ASYMENC
European project. HERA.
Reference cultures - mental constructs used to refer to societies, that offer a model of that society. Gestate over long periods of time.
How do public discourses around reference cultures change over time, 1815-1952 here ... and how can large scale analysis of textual corpora illuminate this?
3 themes: cities and metropolises; European mass culture; emergence of products and services.
Bridging those who read texts and those who rip them apart; look closely and read from afar; write and hack.

####Deborah Thomas, LoC - Expanding Frontiers: American Newspapers in the Digital Humanities
'Frontiers are the edge of the horizons in the American ethos' Nice analogy to the (unruly) edges of DH. Even a link to mining and the potential therein at the physical and digital frontiers...
Newspapers Landscape in US. Repositories rarely overlap, different financial models.
National Digital Newspaper Program, 2005- (NEH, LoC). Digitisation at state level (funds distributed by NEH), aggregated and made free and open through LoC at Chronicling America.
7.5 million pages now online.
Working with newspapers: high use by anyone studying the past, but they were not intended to last as physical objects... So poor quality is a major part of working with newspapers in the US.
Guide at http://www.loc.gov/ndnp/guidelines > focus on replicable, non-proprietary, and segmentation only by column, page (due to complex visual structure across the newspaper set)
API allows data to be pushed through Flickr, Google Maps, et cetera.
At request of researchers, ability to download in bulk the OCR data http://chroniclingamerica.loc.gov/ocr/
This download and API access has allowed for advanced uses of the newspapers, eg Viral Texts.
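A minimal sketch of what querying the API might look like, assuming the page search endpoint and the `andtext`, `format`, `rows` and `page` parameters behave as they did in the public Chronicling America documentation (worth checking against the current LoC docs before use). The helper name `build_search_url` and the search terms are invented for illustration, and the code only builds the URL rather than fetching it.

```python
from urllib.parse import urlencode

# Assumed base URL of the Chronicling America page search; verify against
# the current Library of Congress documentation before relying on it.
SEARCH_URL = "https://chroniclingamerica.loc.gov/search/pages/results/"


def build_search_url(terms, page=1, rows=20):
    """Return a search URL asking for JSON results for the given terms."""
    params = {
        "andtext": " ".join(terms),  # assumed: all terms must appear on the page
        "format": "json",            # assumed: ask for JSON instead of HTML
        "rows": rows,                # assumed: results per page
        "page": page,                # assumed: which page of results
    }
    return SEARCH_URL + "?" + urlencode(params)


# Hypothetical search terms, purely for illustration.
url = build_search_url(["telegraph", "railroad"])
print(url)
```

The resulting URL could then be fetched with any HTTP client; the bulk OCR download is probably the better route for anything corpus-sized.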
Great examples of DH research w/ Chronicling America: http://t.co/n7MzLGOJ9R, http://t.co/m1LVrQEzzc, http://t.co/pW6KpL8p2B #digrep14
— Harriett Green (@greenharr) April 10, 2014
Share delivery platform as open source code on GitHub.
Library of Congress software for digitised newspaper interface is open source https://t.co/4fb0ydO1er #digrep14
— Alastair Dunning (@alastairdunning) April 10, 2014
Different OCR software being used, each outputting different accuracy measures.
Solr only supports indexing for 8 (or so) languages.
Microfilm is more sustainable in the long term than digital.
By being prompted to look hard, states have found plenty of digitisation they didn't know was out there...
Focus on newspaper community. Data models shaped by cataloguing requirements around what a newspaper is (at least weekly, general news ... otherwise specialist publication).
Trove the model for crowdsourcing.

####Workshop 3: Mining big data in a global context (led by v.d. Heuvel, UvA)
Charles v/d Heuvel describes topic modeling used in Circulation of Knowledge project, http://t.co/KhnKFL22Gx #digrep14
— Harriett Green (@greenharr) April 10, 2014
Working with complex datasets to understand circulation of knowledge ... beyond just letter exchange to the content of letters.
Epistolarium tool built http://ckcc.huygens.knaw.nl/epistolarium/
Combining big data with small data sets ... bouncing back and forth ... but tools fail, for the most part, when language is implicit.
Uncertainty both in qualitative and quantitative terms ... but then the same goes for opportunities!
Bram Mellink
How can we trace cultural influences by digital means?
Transtlantis project traces cultural influence of U.S. on Dutch public debates w/data  from @KB_Nederland newspaper database #digrep14
— Harriett Green (@greenharr) April 10, 2014
Texcavator http://dev.wahsp.nl/texcavator/
Count words... But what are we actually looking at? Computer cannot distinguish how the word is used... Influence cannot be measured by word count alone.
Also, searching for words tends towards unsurprising results, searches that are around things you already know about. How can we avoid clichés? How can we find the unknown unknowns?
The machine is a counting machine [why not use the machine as something other than a counting machine]
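To make the word count point concrete, a minimal sketch (not Texcavator's method, just an illustration) that sets a bare count next to a keyword-in-context listing. The sample sentences are invented, and `tokenize` and `keyword_in_context` are hypothetical helper names.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def keyword_in_context(tokens, keyword, window=3):
    """Return each occurrence of keyword with `window` tokens either side."""
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((" ".join(left), token, " ".join(right)))
    return hits


# Invented sample sentences, not taken from any newspaper corpus.
sample = (
    "The new cinema shows only America and its stars. "
    "Critics warn that America is no model for our youth."
)
tokens = tokenize(sample)
counts = Counter(tokens)                         # the bare count
contexts = keyword_in_context(tokens, "america")  # the surrounding words

print(counts["america"])
for left, word, right in contexts:
    print(f"{left} [{word}] {right}")
```

The count says the word appears twice; only the context lines hint that one use is admiring and the other wary, and even then a person still has to do the reading.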
Melvin Wevers
Breaking down types of reference by geography, symbolism, technological.
But should also look at networks in which references are made, trace complexity ... in short, we need to think of what digitisation has forced us into.
So NER, Linked Data.
Discussion points...

digitisation should be about quality rather than scale.
small patterns on small subsets then big patterns on larger data.
can computational techniques allow us to analyse complex and messy things in ways we couldn't before?
what does this require of people, faculty, repositories?


####Reports from workshops, wrap-up
Can we mine for something that isn't spoken about, only alluded to?
Jaap Verheul: the Voldemort problem in text mining, how to mine the absence of concepts & hidden debates #digrep14
— Ulrich Tiedau (@Uli_T) April 10, 2014
Best quote from #digrep14 so far: ‘Can you mine the dog that didn’t bark.’
— Owain Roberts (@owainrr) April 10, 2014
Some humanities scholars read far too much into the supposed value dh-ish folks ascribe to visualisations >> just a starting point
Need to remain attentive to the core values of humanists - how often the debates we are having are made easier when we map them back to what we do as humanists
[discussions between the three groups remarkably similar ... perhaps says a lot about how many of these discussions have long, rich heritages ... could we do with a little perspective on the debates that have been had?]
Tinkering with visualisation is a less systematic way of working than we humanists are used to [really? - being a humanist is all about reading stuff, pushing it together, enjoying what you do, and mucking about...]

####Horizons (Friday)

####Hans-Jorg Lieder, Berlin State Library - What about Texts?
Not just books. Also palm leaf manuscripts, Rosetta stone, bone fragments ... And newspapers - complex text that contains tacit knowledge about how it is read ...
Librarians' first response to the digital was to photograph card catalogues ... then OPACs ... then digitising text, starting with books.
Books easy (hence why Google digitise them); newspapers, music scores, manuscripts not easy (hence why Google don't digitise them).
Newspapers not built to last. But we want them to last.
Problems with OCR a consequence of both physical form and software.
Enrichment ongoing. Lack of versioning in library repositories, can be difficult to replicate results.
Europeana Newspapers, dates from 1618 to 1955. Search across 10 million newspaper pages.
Everyman resource [hmmmm ... still think they represent an intellectual element of the past as opposed to a direct link to everyday experience, certainly for before the mid-19th century]
Aggregation an important discussion. And there should be more researcher engagement with decisions over digitisation.
Tool in ENP that compares a double-keyed sample of text against OCR - gives researchers a sense of what is possible.
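One way such a comparison could work, as a rough sketch only: this is not the ENP tool, just a character error rate computed from Levenshtein edit distance. The sample strings are invented, and `edit_distance` and `character_error_rate` are hypothetical names.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # drop a reference character
                current[j - 1] + 1,      # add a hypothesis character
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


keyed = "The steamer arrived at noon"  # invented double-keyed transcription
ocr = "Tbe steamer arrivcd at noon"    # invented OCR output for the same line
cer = character_error_rate(keyed, ocr)
print(f"character error rate: {cer:.3f}")
```

A rate like this only describes the sampled pages, so it gives a sense of what is possible rather than a guarantee for the whole collection.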

####Workshop 3: Valorization/reuse of repositories
Alice Dijkstra (NWO, Netherlands)
When someone says something like DH, e-Humanities they may mean a different thing to what you think they do.
Funding structures important - think (when writing a proposal) relevance, clarity, connections outside the academy, abstract mustn't be boring, good CVs, collaborate at early stage.
My bit: slides http://slidesha.re/R71gBz notes: https://gist.github.com/drjwbaker/10422453

Hildelies Balk, KB Lab

Developing on the NYPL, BF, BL models of labs. Focus on prototyping, demos, connecting virtual and physical spaces. Users building on top of our platform.
Example:

Linking, connecting, enhancing, building, hacking.
#Digrep14 Jaap Verheul giving a *tricky* live presentation of @trnslnts's text mining tool. Well, it seems to work... pic.twitter.com/Af4f3jGGFx
— Pim Huijnen (@pimhuijnen) April 11, 2014

Elaine Collins (Director of Global Partnerships), DC Thomson family history, 'Reuse of Digital Repositories'

Family history market. Not a vendor. Fund digitisation themselves. Take on that risk ... researchers not one of their more lucrative stakeholders.
People seem to like to browse the digital collections in situ, in the library that provided the content, where access is freely available.
But what about personal engagement with the data, not currently available [possible?] through existing commercial products.
Partnership and investment (of resources) with the Integrated Census Microdata Project at Essex https://www.essex.ac.uk/history/research/icem/
Protective about the names in census data. Seen as a reasonable compromise to enable research whilst protecting commercial interests.
Increase in requests around 'big data'. But coming from CS not humanists.
Collins: Most requests to use British Newspapaer Archive as big data have come from computer scientists not humanists #digrep14
— Alastair Dunning (@alastairdunning) April 11, 2014
Craig MacDonald, CS, University of Glasgow. Obituary Project http://www.dcs.gla.ac.uk/~craigm/ Part of the Terrier team http://terrier.org/ > mine obituaries, gold for family historians.
Cristianini, AI, Bristol > representing news articles as network of actors.
Trying to translate researcher knowledge into things the (family history) customer needs or finds interesting - these are the sorts of things DCTFH would be interested in.

####Alastair Dunning, Europeana Newspapers, Europeana in a research context
How is one object represented in various ways on the internet?
Often follows something like: catalogue with metadata; digitisation project funded and run somewhere else that created new metadata alongside a digital image (one archival, one presentation); full text, often transcribed; then all the unstructured commentary around this.
Distributed lifecycle.
Europeana 2,300 organisations, 150 aggregators. Data brain.
Moving toward something based on a cloud infrastructure, something shareable. Europeana Cloud @europeana_cloud
Metadata in with definitions of use of content. Third parties then access through API, share and reuse not just data but content.
Development of Europeana as a platform not a portal. Tim Sherratt: "Portals are for visiting, platforms are for building on"
Cloud that taps into search, labs and pro.
Europeana Research will offer researchers access to APIs, downloadable raw data allowing 3rd parties to build their own tools #digrep14
— Owain Roberts (@owainrr) April 11, 2014
"Uploading algorithms and tools instead of downloading the data." On future possibilities for text mining. @alastairdunning #digrep14
— Irene Haslinger (@IHaslinger) April 11, 2014

####Ray Abruzzi, Gale Cengage Learning, 'Commercial Perspectives on Digital Heritage Collections'
Main customer the researcher, the student.
Because of customer annoyance at changes, leave, say ECCO, alone and build new things that sit outside of it and other left-alone collections.
#digrep14 free access to Gale NewsVault and Artemis: Primary Sources  (2 weeks, no sign on needed) http://t.co/klkD3P4gTS
— Ray Abruzzi (@rayabruzzi) April 11, 2014
Artemis 'Term Clusters'. Dynamic discovery tool that data mines the collections.
We, and the licences that were agreed, did not anticipate TDM.
.@rayabruzzi giving great examples of good and bad ways to ask vendors/publishers for data #digrep14
— Harriett Green (@greenharr) April 11, 2014
Gale looking at pushing data to libraries for them to distribute under correct terms. Portico another option they are looking at.

Rens Bod, Amsterdam, 'Digital Utopias: finding deep patterns in Digital Repositories'

Centre for DH
A crash course in DH http://t.co/BLEt8AMNPe #digrep14
— Owain Roberts (@owainrr) April 11, 2014
Humanists are pattern matchers. These can now, in many cases, be automatically derived. And can be very useful in driving new research.
This pattern matching is not something humanists tend to identify with.
Nonetheless it makes sense for them to start to, for some patterns need computation, digital to illuminate.
Distinction perhaps here between deep patterns (so for texts, syntactic trees) and shallow patterns (so for texts, word sequences); or local patterns and universal. Humanists tend to think the former is always so, though perhaps there are some universal laws...
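For the shallow end of that distinction, a minimal sketch of counting word sequences (n-grams). It says nothing about deep, syntactic patterns, which would need a parser; the sample text is invented and `ngrams` is a hypothetical helper name.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Invented sample text, split naively on whitespace.
text = "the old world and the new world and the old order"
tokens = text.split()

bigram_counts = Counter(ngrams(tokens, 2))
for bigram, count in bigram_counts.most_common(3):
    print(" ".join(bigram), count)
```

Repeated sequences such as "the old" surface without any knowledge of grammar, which is what makes this sort of pattern shallow.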
Humanities 3.0: the hermeneutic and critical tradition applied to deep and shallow patterns found. Rens Bod #digrep14
— Irene Haslinger (@IHaslinger) April 11, 2014
Rens Bod from Uni Amsterdam revisiting stylometrics and pointing out flaws in previous humanities computing methodologies #digrep14
— Alastair Dunning (@alastairdunning) April 11, 2014
#digrep14 Syntactic fragments. I'm lost.
— James Baker (@j_w_baker) April 11, 2014
Need to move from describing the patterns we have found in digital stuff to using this kind of work in relation to concrete research questions.

Some admin...

This work is licensed under a Creative Commons Attribution 3.0 Unported License.