@drjwbaker
Last active December 19, 2015 13:49
Wednesday DHOxSS 2013 notes
[live notes, so excuse the errors, omissions and personal perspective]
***This work is licensed under a Creative Commons Attribution 3.0 Unported License.***
[live blog] #DHOxSS Kate Lindsay, Re-imagining the First World War, Academic IT Services. How can digital humanities move us beyond the trenches?
@KTdigital. Manager for Engagement | Education Enhancement Academic IT.
Standard narratives of the war have a long history. But, global impact of the war often underplayed, as is gender.
Anniversary provides an opportunity to tell new stories: so not just war poetry!
And yet, Oxford started digitisation in late-90s with Wilfred Owen poems, and then added more
First World War poetry digital archive: oucs.ox.ac.uk/ww1lit
Teachers like digital collections: drafts of poems together on digital platforms challenge idea of final version.
Public submission system for private ww1 collections.
Knew these people were not IT literate, so held 6 roadshow days.
6500 submissions in 12 weeks. Captured stories as well as photographs of letters/objects.
JISC funded Oxford to train other people in this model runcoco.oucs.ox.ac.uk @runcoco
Initially no login. Just contribution to community owned collections.
Took this opportunity to explore how this model could be used for very different collections.
Europeana took the model. Very successful. Over 20,000 objects collected in Germany alone. @europeana1914
Continuing to offer training to Europeana on rapid digitisation. 60k objects in total digitised.
Europeana 1914-1918 allows us to move away from Anglo focus on WW1: objects from all countries together.
Eclectic collections. What do they mean? How can they be meaningfully used?
Comments on the blog, bringing together disparate knowledge and family histories.
Work with schools. Objects as a means of connecting children and heritage.
General public 'knows' an awful lot about WW1. Project generated emails, people correcting descriptions (eg bus ticket).
Public as providing context around the content.
Feed all this into high quality educational tools, in this case an OER.
ww1centenary.oucs.ox.ac.uk
Wordpress base, community blog supporting teaching of WW1.
Very much focused on the cultural stories that surround the conflict.
Audio and video talks uploaded to iTunes U, 60k downloads (open licence, preferring CC BY-NC-SA); resource library.
Most popular areas the visualisations, which mash up data: scraping Wikipedia/media for content and mapping it.
Or maps showing editors of WW1 related wiki pages: show the afterlife of particular battles.
CWGC opened up data in graves.
Snowball effect of the projects back to back, ready made audiences.
Measuring impact from the start helps one project roll into the next.
Moving now from a project model to a consultancy model.
Glenn Roe and Martin Wynne
Close, Distant, and Scalable Reading
Roe
theliterarylink.com/closereading.html > specific examination, meaning of microcosm working us toward understanding the macrocosm.
Foucault alone = the humanist.
BUT it would take 30 lifetimes to read 400k books.
Culturomics (no humanists...) paper published in Science. Ngram spikes, but what do they really mean...
An ngram spike could mean pro- and counter- perspective on a text, thing, idea, person.
Problematic relationship with text itself.
One solution is to stop reading: Distant Reading.
Digitising means we now have data. A crisis! (as everything is in the humanities...)
From research with computers which helps you do your research faster, to new discoveries with new resources.
Big data? What is big? (H not big by SS standards, but hey big enough for us!)
Where it is difficult to think of working on collections as individual texts.
Matt Jockers, Macroanalysis (2013) > if Moretti quantitative, Jockers computational.
Using metadata: eg 'influenced by' tag on wikipedia, to say something about philosophical influence.
Explicit relationships of Republic of Letters is quantitative.
Annales as a base for DH. Certainly the Annales style, so some continuity and connections to scholarly tradition.
So we come to scalable reading: if we have close and distant, how do we get back from the latter to the former?
Or not reading. After all we do this anyway, we don't read everything properly: so why not let computers help us out?
SR helps us insert our work back into the traditions of humanities scholarship.
Distant reading and data-driven analysis to provide useful context.
1) corpus linguistics
2) information retrieval
3) text mining, data viz
Wynne
Scalable reading and corpus linguistics: corpus, concordance, collocation (see John Sinclair, 1991).
Collocation: which words are likely to keep company of other words, and deriving meaning from that.
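The collocation idea can be sketched in a few lines (a toy illustration of the concept only, not how any particular concordancer implements it): count the words falling within a small window either side of each occurrence of a node word.

```python
# Toy collocation counter: which words keep company with a node word?
from collections import Counter

def collocates(tokens, node, window=2):
    """Count words appearing within `window` tokens of each hit of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return counts

text = "the state of the nation and the state of the church".split()
print(collocates(text, "state").most_common(3))
# → [('the', 4), ('of', 2), ('and', 1)]
```

Real tools then rank these raw counts with association measures (MI, t-score) rather than reporting them directly.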
A corpus linguist would today go to the british national corpus to understand (changing) use of a word.
Drilling from the list of words with collocations to the content itself: giving the distant view close context.
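This drilling from word list to context is what a keyword-in-context (KWIC) concordance display does; a minimal sketch of the idea (not the actual implementation of AntConc or any other tool):

```python
# Keyword-in-context (KWIC) sketch: every hit of a word shown with a few
# words of context either side, as a concordancer displays it.
def kwic(tokens, node, context=3):
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - context):i])
            right = " ".join(tokens[i + 1:i + 1 + context])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

text = "new oxford street is not the university of oxford at all".split()
for line in kwic(text, "oxford"):
    print(line)
# new [oxford] street is not
# the university of [oxford] at all
```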
The point is that corpus linguists have been doing scalable reading for some time.
> need to zoom in and out to get a real sense of the data.
Keeping stopwords in co-occurrence measures gives you a sense of the phraseology surrounding a word.
As do words that don't feature in the corpus collocating strongly with a given word.
Eg: newspapers not the cheeriest of sources, therefore words like 'aftermath' tend to collocate with bad events.
Can we test assertions historians make based on close reading and test them against large corpora?
Eg Quentin Skinner (1978) on 'state' not being used in a modern way before the mid-16th century (so government, social control)
Eg Was the word Tudor used by the Tudors? Research says hardly ever.
If you stop at just the numbers, you don't get the whole picture.
Eg rise of the word 'holocaust' related to both 'The Holocaust' and nuclear war.
> 1960s an important turning point for rise of holocaust as 'The Holocaust'
Keith Thomas, The Ends of Life (2010): scepticism around quantification of fuzzy phenomenon.
> 'all I can do is record my impressions after long immersion in the period'
Perceptive review in TLS: last book of this sort, quotes/immersion now more accessible, long close reading not likely to be how new generation works.
Some objections: isn't this just Googling stuff? Isn't this just looking at words?
Yes, but the key is interpretation....
What is and isn't your corpus, and hence what can you claim from it.
What is the tool doing? Can my search be better? Where do I refine from here?
Good viz isn't fixed, it is an iterative process.
How do we connect DH to real past lived experience: scalable ensures this, and can be done quicker at a particular level.
[really good introduction:
combination of theory and explanation of practice, with a sense of how scalable reading builds on traditional scholarship
> build big data / scalable reading into BL programmes? Take new contexts perspective?
> enjoyed focus on concordance as opposed to ngrams]
PM
PhiloLogic. Does not destroy TEI encoding, but TEI not required: TEI-aware (so no Lucene index).
Voyant: close and distant reading; better at single texts than large corpora (though it can handle the latter).
Corrected ECCO database with TEI on top.
http://artfl-project.uchicago.edu/content/ecco-tcp
Exercise 1: When how and why did people write about Oxford in the 18th Century?
Start with frequencies, move to concordances.
4k occurrences, but what does that mean?
Frequency by year reflects boom in printing, not increase in writing about Oxford.
Frequency by year per 10k words shows a more balanced picture.
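The "by 10k" view is simply relative frequency: hits per 10,000 words of corpus, so a year with more printing does not automatically look like a year with more interest in Oxford. A sketch with invented figures:

```python
# Relative frequency sketch: raw hits vs hits per 10,000 corpus words.
# All counts below are invented for illustration.
def per_10k(hits, corpus_words):
    return hits / corpus_words * 10_000

yearly = {1710: (40, 200_000), 1790: (120, 900_000)}  # year: (hits, corpus size)
for year, (hits, size) in sorted(yearly.items()):
    print(year, hits, round(per_10k(hits, size), 2))
# 1710 40 2.0
# 1790 120 1.33
```

Here raw hits triple across the century, but the normalised rate actually falls, exactly the printing-boom distortion noted above.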
Oxford as related to great men, poetry, travel.
'Oxford and Cambridge' a linguistic ordering from the 18th century
'New' and 'Street' as close collocates, noise within the database.
We can't quickly see what is not in it...
Doesn't give a clear sense of how the distribution of texts over time is representative of the change in frequency of printing over the 18th century.
[Oxford works well as an example because it is a place, street, earl et cetera, and therefore makes trainees think about what they are and aren't seeing in the data]
Non-exportable [oh dear...]
Oxford Text Archive
http://ota.ox.ac.uk/
AntConc
http://www.antlab.sci.waseda.ac.jp/software.html
http://www.antlab.sci.waseda.ac.jp/software/README_AntConc3.2.4.pdf
http://research.ncl.ac.uk/decte/toon/assets/docs/AntConc_Guide.pdf
Challenge of comparing across texts: MI scores and T scores (???) are difficult to compare when corpus lengths are different.
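One common definition of the MI score (pointwise mutual information) shows why corpus length matters: the expected co-occurrence count depends on corpus size N, so the same observed counts score differently in corpora of different lengths. A sketch assuming the standard log2(observed/expected) formulation:

```python
import math

# MI (pointwise mutual information) for a node/collocate pair:
# MI = log2(observed / expected), expected = node_freq * colloc_freq / N.
def mi_score(pair_freq, node_freq, colloc_freq, corpus_size):
    expected = node_freq * colloc_freq / corpus_size
    return math.log2(pair_freq / expected)

# Same observed counts, different corpus sizes: the score shifts with N.
print(round(mi_score(30, 100, 500, 1_000_000), 2))   # → 9.23
print(round(mi_score(30, 100, 500, 10_000_000), 2))  # → 12.55
```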
Use with concordances and n-grams (2-, 3-) to drill around a word [with the history titles index]
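The 2- and 3-gram drilling amounts to sliding-window counting over the token stream (illustrative only, not AntConc's implementation):

```python
# Count contiguous n-grams with a sliding window over the token list.
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

text = "a history of england a history of the church".split()
print(ngrams(text, 2).most_common(2))
print(ngrams(text, 3).most_common(1))
```

Recurring 2-grams like 'a history' surface the phraseology around a word that single-word frequency lists hide.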