@drjwbaker
Last active February 22, 2016 16:36
Andrew Prescott, 'Searching for Dr Johnson', Sussex Humanities Lab Research Seminar, 22 February 2016

Live notes, so an incomplete, partial record of what actually happened.

Tags: dhist

My asides in {}

Stream/Deck:


Talk

Colley listed hits on BL Newspaper Archives to assert rise of 'Magna Carta' as a C18 phenomenon. Can't replicate her figures... And if you search for 'Magna Charta' you get many more. But most obvious explanation for the trend she describes is that C19 OCR is more accurate than C18 OCR. Burney C17-18 searches aren't very valuable.
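{The spelling point can be made concrete with a small sketch. This is not how Colley or Prescott counted: `count_hits` is a hypothetical helper and the pages are invented toy strings, not Burney text. It only shows how far a hit count can move once a variant spelling is included.}

```python
import re

def count_hits(pages, variants):
    """Count pages that contain at least one spelling variant (case-insensitive)."""
    pattern = re.compile("|".join(re.escape(v) for v in variants), re.IGNORECASE)
    return sum(1 for text in pages if pattern.search(text))

# Invented toy pages for illustration only; not taken from any archive.
toy_pages = [
    "the rights secured by Magna Charta",
    "a new edition of MAGNA CARTA",
    "Magna Charta and the Bill of Rights",
    "shipping news from Bristol",
]

carta_only = count_hits(toy_pages, ["Magna Carta"])
both_spellings = count_hits(toy_pages, ["Magna Carta", "Magna Charta"])
print(carta_only, both_spellings)
```

{Any trend line built on the first count alone would reflect spelling conventions as much as usage, before OCR quality is even considered.}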

Trend towards indiscriminate counting of words from search: eg Peter de Bolla, The Architecture of Concepts: The Historical Formation of Human Rights, 91. Uses ECCO, but issues much the same as in Burney. Acknowledges some problems, but 'mere technical glitch' and he ploughs ahead. Doubt different conclusions from tidy data... Seriousness of OCR issues still not yet well enough understood. Nor methodological implications. 'Crack cocaine of the ngram viewer'.

Case study. From genesis (based on access not search) to method (based on search).

{cites Tanner, Munoz et al study of Burney accuracy}

Creation. Type font and quality of microfilm (based on poor quality original objects) both issues for early newspapers. OCR accuracy for significant words (ie excluding stop words) worse than for the overall corpus. Less than evens chance of finding the word you want. Gale concedes the OCR is uncorrected, but Connected Histories doesn't...
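{One way to see why significant words fare worse than the corpus as a whole: they tend to be longer than stop words, and a word is only findable if every character is right. A rough sketch, assuming (simplistically) that character errors are independent; the 0.9 character accuracy is an illustrative figure, not one from the talk or from the Tanner, Munoz et al study.}

```python
def word_survival(char_accuracy, word_length):
    """Chance that every character of a word is recognised correctly,
    assuming independent character errors (a simplifying assumption)."""
    return char_accuracy ** word_length

# Illustrative only: 0.9 is an assumed character accuracy, not a measured one.
for length in (3, 6, 9):
    print(length, round(word_survival(0.9, length), 3))
```

{Under these assumptions a three letter stop word usually survives, while a nine letter word drops below an evens chance of being searchable.}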

Origins of the project NOT to provide searchable versions of the collection, but access to microfilm. Only one set of Burney microfilm at the BL. With no catalogue. And with heavy use the microfilm deteriorated. 1992 BL started working towards digitisation. Experimental project to improve the way that the BL provided access to microfilm; papers by Hazel Podmore. Burney microfilm likely made from used rather than master microfilm. Initially found 54% OCR accuracy (deemed unacceptably low). Even when generated, OCR was anticipated as a means of finding one's way, as opposed to accurate searching (per Colley, de Bolla).

The origins of digital projects can be complex with lots of institutional pressures involved.

For the time being we are stuck. We have to use the Burney as it stands. So we need to think about how bad it is. McGuffie, Samuel Johnson in the British Press, 1749-1784: A Chronological Checklist, is a yardstick we can use to better understand the quality of Burney search. Comparisons confirm poor quality. But less so for certain types of material in the paper: ads (though McGuffie ignores these). Library distinction (an artificial one...) between newspapers and weekly/monthly/quarterly periodicals is problematic for work on the C18.

Basic search for 'Johnson' retrieves less than 10% of the items listed in McGuffie (most of what is retrieved is ads for Johnson's Dictionary; NB: only around half of what is in the McGuffie is in Burney). BUT results suggest fruitful lines of enquiry: ads (because of high level of success; see work of Sid Bransed, Trondheim); fuzzy searches (set to low) find things that McGuffie didn't through hand sorting. Eg, articles on Johnson and Boswell's tour of Scotland, from which we find that the Scottish press seems to be reporting on this based on reports in the London press. Of course, we miss elusive references: use of dashes, mentions as 'Rambling Sam', or as the author of a work.
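{A sketch of the kind of comparison described here: measure how much of a hand-made checklist a search recovers, with exact and with fuzzy matching. Everything in it is an assumption for illustration. The helper names, the 0.8 threshold and the four toy snippets are invented; this is not Burney data, not McGuffie entries, and not the search engine Gale uses. Fuzzy matching here uses Python's standard `difflib`.}

```python
from difflib import SequenceMatcher

def fuzzy_contains(text, term, threshold=0.8):
    """True if any word in text is at least `threshold` similar to term."""
    term = term.lower()
    return any(
        SequenceMatcher(None, word.strip(".,;:'\"()").lower(), term).ratio() >= threshold
        for word in text.split()
    )

def recall(checklist_ids, items, matcher):
    """Share of checklist entries that the matcher finds in the digitised items."""
    found = {i for i in checklist_ids if i in items and matcher(items[i])}
    return len(found) / len(checklist_ids)

# Invented snippets standing in for OCR output; not real newspaper text.
toy_items = {
    "a": "Dr. Johnson's Dictionary, this day is published",
    "b": "Mr. Johnfon and Mr. Boswell arrived at Edinburgh",  # long s read as f
    "c": "Dr. J-----n is said to be preparing",               # name dashed out
    "d": "verses addressed to Rambling Sam",                  # nickname only
}
checklist = ["a", "b", "c", "d"]

exact = recall(checklist, toy_items, lambda t: "johnson" in t.lower())
fuzzy = recall(checklist, toy_items, lambda t: fuzzy_contains(t, "Johnson"))
print(exact, fuzzy)
```

{In this toy case fuzzy matching picks up the OCR misreading that exact search misses, but neither finds the dashed-out name or the nickname, which is the limit noted above.}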

Search is the paradigm of our age. It is the way we cut through mass. But it is neither effective nor consistent. Especially when the intention of the project we are searching was not for search. Illusion is of a scientific process, rather than an iterative one. Still indispensable and our research would be worse without it. Brad Pasanek: articulated his method of browse, reading, and search; desultory reading.

Assumption of digitisation was that it would be a closed approach. But things like TCP have blown that open, showing digitisation to be open ended.


Response (Tim Hitchcock)

Is the half-blindness the Burney has imposed upon us the problem? Is accuracy such a big deal? But lack of technical criticism in the historical profession. Our digitisation recreates a mid-C20 form of culture: male, western, anglo. ECCO based on a mid-C20 LoC obsession with British culture, and thus microfilming to fill the gaps.


Q&A

Gallica a real contrast in terms of public provision, compared to British more commercial model. National silos worsened by digital compared to cosmopolitan nature beforehand ... joining of problematic resources to others amplifies the problem ... this project is about making clear and setting down the limitations (and opportunities!) ... tension between making stuff useable and making it clear what you are using, perhaps irresolvable ... is there a problem at all once we use the right fuzzy searching? ... humanists trying to give their work authority without going through the hard work of critiquing their sources


Some admin...

Creative Commons Licence
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Exceptions: embeds to and from external sources, and direct quotations from speakers
