@drjwbaker
Last active December 19, 2015 16:19
Friday DHOxSS 2013 notes
[live notes, so excuse the errors, omissions and personal perspective]
***This work is licensed under a Creative Commons Attribution 3.0 Unported License.***
[live blog] Lorna Hughes, Digital collections as research infrastructure
University of Wales Chair in Digital Collections, National Library of Wales
Ghosts of the digital humanities past.
Early 1990s dissemination of DH... Floppy disks, CD-ROMs, software, hypertext, huge chunks of hardware, roadshows. And then the WWW was invented.
Warnings since the 1970s that humanists need to engage themselves in 'computer assisted learning' to ensure nuance is captured.
TLTP: 'Courseware' built from central models using double-keying and later OCR to transcribe corpora.
Web allowed research councils to fund projects which gave the public access to digitised resources
- e.g. AHRB/AHRC Resource Enhancement scheme 2000-6 > project-based funding enabled researchers to work on problem-focused methods
- so no sense of common endeavour, no sense of joined-up research creation (in terms of either content or tech)
- work was bespoke, maintained in silos. Some had no digital content at all, just catalogues to printed resources.
- few projects could become part of an integrated humanities research infrastructure: national, baseline history, broad.
- digital projects make it possible to link resources which are physically disparate: few did that.
- lack of use, lack of sustainability. Hence plenty now show 404 errors. University departments not best placed to host e-content.
But this is all hindsight. Hindsight, though, that shows a vicious circle:
Hard to find (poor metadata) > few users > little motivation to maintain and improve > harder to find > ...
JISC gave us a more strategic approach.
JISC e-content programme: real impact of digital projects when they make research faster and easier.
Typically supporting traditional research not new ways of working with or across content.
Development of digital collections in Wales against the backdrop of devolved government.
Digital Wales initiative: drive towards digital services which are digital by default.
The difference is that the Welsh strategy supports digitisation, the creation of digital content: create a digital public sphere in Wales.
Nonetheless, National Library of Wales has chosen to use central funding to digitise, support content, and make it freely available in an enhanced way.
Digitisation of course supports preservation.
Growth in born-digital collections.
Digital projects also supporting change in scholarship, such as Welsh Newspapers Online.
Once you deliver these projects, what do users want? More content!
Welsh Experience of World War One, funded by JISC. Bringing fragmented and inaccessible collections together.
Established a research programme in digital projects in 2011.
Related to the digitisation beast and the need for more stuff, but also a focus on users and how they use content.
Use DH to add value to the collections: what do we do with all the digital stuff?
- Use, share, engage, enrich, sustain, advocate.
"Use digital content to transform scholarship is the absolute foundation of the digital humanities"
Content, methods, tools: putting these together transforms scholarship, enables scholars to ask completely new questions.
This interface between DH and digital collections is in effect an extension of the digital public sphere.
Examples of projects:
Digital data has to be free to be useful.
It also needs to be (and to be able to be) shared, aggregated and linked.
Partners such as TEL and Europeana propel Welsh collections to an international audience.
Thus both promoting the Welsh language (national goal) and enriching the data (library goal).
LIPARM (Linking Parliamentary Records Through Metadata) [now that is a horrible acronym...]
Partners thinking both of existing content and future content.
Place name work to aid resource discovery allows clear engagement.
Wales1900 (cymru1900wales.org) uses Galaxy Zoo to crowdsource place names.
Place names are important for local and language history.
Will become a useful index for existing projects: Welsh Wills Online and Welsh Newspapers Online.
Digital humanities trendier than ever.
At the very least, all scholars use electronic resources that point to analogue resources.
BUT much work still uses the digital as print replicas.
A focus not on theorising but on the use of the digital will get us out of our projectitis and digital silos.
DH is after all practice led, involves engagement in research infrastructure to develop new humanities questions.
QA
Theory vs practice? https://twitter.com/thomasgpadilla/status/355619970782199809
Usage? Most hits from Google and Wikipedia for Welsh Journals, but family history sites too.
Metadata: quality or simplicity? Does general use need data of as high quality as researcher use?
Of course! Nonetheless the benefits of Europeana outweigh the time/effort spent working good metadata into the simple Europeana model.
Lou Burnard, TEI
Goals of the session...
What is a text? What is a document?
What is "markup"?
What is XML markup and how do I use it?
Digital Turn
Humanities about text.
DH about digital tech and techniques that have evolved for manipulating them.
Markup is a basic technology that facilitates integration.
Why? Because the effect of markup is to make explicit how something should be processed.
What's in a text?
A stream of data which we know how to read, because we've learned it from experience.
Imagine having never seen a text before, what are the signals (markup) to tell you how to read it?
- use of white space
- different shapes of letters
- spaces between words
Texts are four dimensional:
- physical presence with visual aspects
- linguistic and structural
- convey real-world information
- associated metadata (which helps us manage large collections for intelligent searching)
Good markup operates in all of these dimensions.
A text is not a document.
A document is something that exists in the world, which we can digitize.
A text is an abstraction, created by or for a community of readers, which we can mark up (and so express one of those readings)
A text is:
- more than a sequence of characters
- more than a sequence of linguistic forms
Markup makes this explicit and available for analysis.
Markup
Descriptive markup allows a machine to do something presentational (as in HTML) or procedural based on the content.
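A minimal illustration of the contrast (my own example, not from the session): the same sentence marked up presentationally, as in HTML, and descriptively, as in TEI.

```xml
<!-- presentational: records only how the words should look -->
<p>The opening of <i>Hamlet</i> is set on the battlements.</p>

<!-- descriptive: records what the words are; rendering is decided later -->
<p>The opening of <title>Hamlet</title> is set on the battlements.</p>
```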
It is a scholarly activity. Not automatic (for the most part).
Markup is editing: it is never neutral, it always involves interpretation and hence can help answer research questions.
Markup involves decisions about what to mark up and what not to: hence not quick or easy.
Involves conscious intellectual decisions we can argue about, discuss.
We could markup:
- structure.
- omissions.
- linguistic features.
- spelling variations.
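A sketch of how several of these might look in TEI (an invented fragment of mine, not an example from the session):

```xml
<!-- structure: the enclosing <p> marks a paragraph -->
<p>
  <!-- spelling variation: original and regularised forms side by side -->
  I write to you from <choice><orig>Caerdydd</orig><reg>Cardiff</reg></choice>,
  <!-- omission: one illegible word in the source -->
  where the <gap reason="illegible" unit="word" quantity="1"/> arrived today.
  <!-- a linguistic feature: a personal name tagged as such -->
  My regards to <persName>Mr Jones</persName>.
</p>
```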
A useful mental exercise
Imagine you have several thousand pages: how do you prioritise?
Then imagine your budget is halved, and think again.
Decisions taken, hard grind delegated (but to humanists ideally), good teams challenge decisions and make iterations.
Oxymoron of consistent collaborative subjective interpretation of the text: TEI schemas allow clear documentation to overcome problems
XML is the grown-up version of HTML, agreed by the computer science community.
DTD (Document Type Definition) is the grammar of the language (what can go together and how)
What is XML and why should you care?
Structured
Extensible
Can be validated (so, start and end tags are all in place and information about what the tags mean is in place; no overlapping Russian dolls)
Platform-, application-, vendor- independent.
Facilitates integration
Nearly all you need to know about XML
<?xml version-"1.0" ?> or hello I'm an XML document, I confirm to XML version 1.0
Element is everything from the start tag to the end tag.
Hierarchic box-like structure which works like a network/(upside-down) tree composed of nodes.
XML contains:
- unicode characters
- elements with optional attributes
- comments
- processing instructions
- entity references
- CDATA marked sections
<seg></seg> is that same as <seg/>
XML must respect ISO standard ISO 10646 (aka Unicode)
BUT non-Unicode characters can be tagged and marked up.
XML Validation.
A schema is a list of the tools you can use.
TEI uses RELAX NG.
But different projects have different requirements, so where do you get a schema from?
TEI offers a semi-automatic procedure where specifications can be combined into a personal schema.
tei-c.org/roma
If we use TEI vocabulary we can ensure our encoding can be combined across France, England, Germany et al.
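For instance, a schema generated from Roma can be associated with a document through an xml-model processing instruction at the top of the file (a sketch; the filename myproject.rng is invented):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="myproject.rng" type="application/xml"
            schematypens="http://relaxng.org/ns/structure/1.0"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <!-- teiHeader and text go here -->
</TEI>
```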
Attributes are optional. They complicate the syntax, but are jolly useful.
XML must tessellate: elements can't go back up or sit out of place (no overlapping).
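For example (my own sketch), the second fragment below is not well-formed because the elements overlap:

```xml
<!-- well-formed: elements nest like Russian dolls -->
<p>She said <q>come <emph>now</emph></q>.</p>

<!-- NOT well-formed: <emph> opens inside <q> but closes outside it -->
<p>She said <q>come <emph>now</q></emph>.</p>
```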
Practical (see handout)
Converting Word docs to XML/TEI (Word is, after all, XML underneath)
http://www.tei-c.org/ege-webclient/
//div[@type='message']/* — looks in all the divs of type 'message' and returns what is within them (done by the /*)
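Against a toy document (invented for illustration), that XPath behaves like this:

```xml
<body>
  <div type="message">
    <head>To all readers</head>
    <p>The library closes early today.</p>
  </div>
  <div type="notice">
    <p>Not a message, so not selected.</p>
  </div>
</body>
<!-- //div[@type='message']/* returns the children of the first div only: -->
<!-- <head>To all readers</head> and <p>The library closes early today.</p> -->
```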
XMLmind
Next steps...
Plucking geo out of TEI, changes in language, recurrent phrases, names
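For instance, place names and coordinates encoded in TEI (a hedged sketch, not from the session) can later be pulled out for mapping:

```xml
<place xml:id="aberystwyth">
  <placeName xml:lang="cy">Aberystwyth</placeName>
  <location>
    <!-- latitude and longitude, ready to be extracted and mapped -->
    <geo>52.415 -4.083</geo>
  </location>
</place>
```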