@drjwbaker
Last active August 29, 2015 14:10
Web Archives as Big Data, 3 December 2014

Live notes, so an incomplete, partial record of what actually happened.

Tags: webarchives

My asides in []


Tobias Blanke, The Promise of Infinite Archives of the Web

Lawrence Goldman (IHR), welcoming comments.

Research infrastructures.

We know web archives will change research. But we now need evidence - and today is about that. -- Annoying habit of defining DH ... But of course the world is digital and the humanities with it. So we should just get over it. -- Infinite archives meaning assumed but not well defined by the academy. -- Today look at web archives from the perspective of the infinite archive. -- Kahle ~ infinite archive locked in a present that always looks new. -- The web is just an ever expanding document space. -- Linked space concern of DH folks. Even at close level, of poems - for example.

DH still obsessed with linking: see markup. -- nowhere better expressed than in Latour's recent work [such as http://www.tommasoventurini.it/web/uploads/tommaso_venturini/TheSocialFabric.pdf ]

Rosenzweig: 'What would a digital archival system designed by historians look like?' - unit of analysis for the library is the page, for the computer scientist the link, what is the unit for the historian? -- because historians invariably interested in the 90%+ of archives not digitised.

What is significance in the age of lots of books? (more than we researchers could all read in a lifetime...) Gregory Crane (Leipzig) -- decisions of significance must be part of the research process [well, they are - at least in good humanities research anyway] -- but relevance in the age of web search algorithms can be a problem. How do we - for example - lift ephemera to a place of prominence?

In the past the filters were archivists; they are now algorithms. How do we trust them? (because we need them...) And yet - of course - archivists have motivations -- Why provenance is important in archives.. see Duff, Accidentally found on purpose, Library Quarterly 2002. -- links give us that and allow us to blow it up in an age of infinite archives.

Hobsbawm on history from below relevant to us - sources become sources in the eye of the beholder; grassroots history requires heavy lifting, investment, and technology, because it is harder.

Challenge for DH to understand better how historians combine evidence, how they reason, how they work.

Prefers the term 'reasoning' to 'digital methods'

We need to go from numbers to stories. In an infinite archive space we need to delegate some authority to computers. But computer programming is non-narrative. So we need to be confident to confront computers with our own perspectives.

Q&A

Q: fair to characterise web as a database? Is it structured enough to be seen as so?

Users will call everything a database whatever it is...

Q: web archives not an archive but a library born with the links. Distinction between a structured archive and a born-digital archive - which can be both structured and unstructured.

These links are hyperlinks, not semantic and conceptual. Re combining structured and unstructured data, looking to our friends in the biomedical community can help - seeing failures of structured data and moving to unstructured for the flexibility.


Research Showcase Panel 1

These are examples of projects awarded bursaries based on their research questions that needed web archives.

Bursary holders presenting their research using historical UK web dataset by @UKWebArchive and @Jisc. #webarchives

— Helen Hockx (@hhockx) December 3, 2014

Rona Cran (UCL), The UK Web Archive and Beat Literature

Late-20th and early-21st centuries. Imagined using the web archives would be the easy part... Wanted to develop an index of useful search terms - but that was impossible... More fruitful have been the technical suggestions.

Beat Literature returned innumerable irrelevant results. Struggled with volume. A process of foraging and sense-making can help make work with these archives more efficient. Analysis time consuming. Lack of curation useful - could make my own guesses, draw my own conclusions. Better surfacing of obscure documents, marginalised groups by flat, under-curated search results.

Saskia Huc-Hepher (Westminster), An ethnosemiotic study of London French habitus displayed in blogs

Creation of London French special collection at UKWA.

http://www.webarchive.org.uk/ukwa/collection/63275098/page/1

Looking for small, thick, qualitative data to overcome data fundamentalism.

Search methods: bottom up - going in cold... unknown, obsolete resources turned up, and identification of new themes; top down - understand terrain, know what you are looking for, use to direct reading, non-equivalence between captures.

[clear that there is little intellectual justification for web archive silos]

@j_w_baker Absolutely. Memento does this to some extent.

— Helen Hockx (@hhockx) December 3, 2014

http://www.mementoweb.org/about/

Too many results from a search always a problem - need for facets within the archive?

@j_w_baker at my last meeting with historians, focus on subsets in our web archive was presented as a given thing. "Of course we need to..."

— Toke Eskildsen (@TokeEskildsen) December 3, 2014

@j_w_baker @WebART12 Indeed. Only, so far, few libraries have understood they should start building web archives in addition to catalogues.

— Herbert (@hvdsomp) December 3, 2014

Edits to blog formats and structures crucial windows into understanding their motivations, feelings, experiences.

Harry Raffal (Hull), MoD Online Development for Strategy and Recruitment

Ministry of Defence websites.

Small meaningful queries create useful results. But as soon as these broaden, problems - and the researcher has to choose how he or she makes that small enough to use.

Where on a web page do we ascribe significance?

Was I overlaying what I expected the sites to be saying onto my reading of them?

Word frequency analysis -- decline in 'careers' and increase in words like 'join' and 'joining'. Change in language.
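A minimal sketch of this kind of word frequency comparison, assuming plain text has already been extracted from captures and grouped by year. The sample texts and the `relative_frequency` helper are invented for illustration; they are not the MoD data or the method used in the talk. Using each term's share of all tokens, rather than raw counts, keeps years with more captured text from dominating.

```python
import re
from collections import Counter

def relative_frequency(text, terms):
    """Return each term's share of all word tokens in text (0.0 if text is empty)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {t: (counts[t] / total if total else 0.0) for t in terms}

# Hypothetical page text per capture year (invented for illustration).
captures = {
    2000: "army careers information careers office careers advice",
    2010: "join the army today joining is simple join us",
}

terms = ["careers", "join", "joining"]
trend = {year: relative_frequency(text, terms) for year, text in captures.items()}
for year in sorted(trend):
    print(year, {t: round(f, 3) for t, f in trend[year].items()})
```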

Recruitment page very important to the army - to such an extent that capture of page when it was down shows a placeholder with a phone number.

Use of Gource viz https://code.google.com/p/gource/

[Nice combination of network analysis and looking at links as a histogram to grasp change and see things that wouldn't be revealed just by reading]

Starting with qual, testing with quant.

Marta Musso (Cambridge), A history of online presence of UK companies

Post-1996 period spread of web... US dominance over the internet. Studying UK companies very much about to what extent they relied on US models, infrastructure.

Recording difference between registration date and website going online. How many went for US domain.

Data partial. Missing many companies. [know your data!]

Web pages within umbrella web pages.

Qualitative surveys to companies about their decision to open websites [mix of methods]

Internet as an advertising laden commercial space was controversial at the time. Particularly as people were paying a lot (dialup!) to get online. [the internet now isn't the same as the internet then]


Research Showcase Panel 2

Helen Taylor (IHR), Do poetry networks exist?

Has the internet made the global local?

Forum identity often not necessarily linked to 'offline' location, name, identity.

Change over time of webpages.

OUPS (Oxford University Poetry Society): theme of trying to get people offline and into a physical setting, or to a book.

Online presences can be precarious but hint at thriving networks offline.

Josh Cowls (OII), 15 years of UK universities on the web

Looking at ac.uk data.

Does UKHE group affiliation have an impact on their linking practices? Null hypothesis, exception being Russell Group - who seem more likely to link to each other (but may be more to do with high profile active research cultures than web strategy)

Over time league table ranking does, however, seem to be an increasingly strong predictor of links between institutions - web reinforcing power structures.

Geography. Is propensity to link inversely correlated to distance? Yes, though triangle around London strong.
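One way to put a number on "inversely correlated" is a Pearson correlation between the distance separating two institutions and the count of links between them. This is only a sketch under that assumption: the distances and link counts below are invented, and the talk did not say which statistic was actually used.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)

# Hypothetical institution pairs: distance apart (km) and links between them.
distance_km = [5, 50, 120, 300, 600]
link_count = [90, 60, 40, 15, 5]

r = pearson(distance_km, link_count)
print(f"r = {r:.2f}")  # a negative r means fewer links as distance grows
```

A negative coefficient alone would not explain the London triangle; that needs looking at specific pairs.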

The web more reflective than transformative of relationships between institutions.

Richard Deswarte (UEA), Exploring Euroscepticism in the UK Web Archive

Why didn't I get the sorts of patterns and trends that I wanted?

Rise of Euroscepticism nicely maps onto the web. Connections between local and central.

Results

  • Frustration. Went with an expectation. Ended up with both too little and too much. Serendipity still important. Is growth seen due to growth in the web, the archive, or the movement? Unpredictable results.

Big data = problem?

  • Access. Maybe searching isn't the best way to understand an archive? The Google model is your enemy here.

Good points by Richard Deswarte about need to rethink search when confronting #webarchives. Need to think beyond simple word search.

— Andrew Prescott (@Ajprescott) December 3, 2014
  • big data is only getting bigger.
  • should we ditch qual for quant...

Sampling

  • led to sampling. Because we usually want more but when we get more we don't know how to handle it...

Do we need to relearn how to be historians in the digital age?

JISC UK web domain dataset archive is available here (as a 19gb zipped file) http://t.co/PsO3tkqEAg #webarchives

— Andrew Prescott (@Ajprescott) December 3, 2014

Webster, Reading Creationism in the Web Archive

A feature of some parts of evangelicalism.

What does creationism look like online? What are the key organisations? How does it interact across borders? Who takes any notice of it?

Caveats (re referring hosts)

  • tricky to benchmark absolute numbers
  • selectivity
  • what does a link mean - it can be positive and negative
  • not looking at growth over time
  • the UKness of the data

Findings - inbound traffic with each other, mainstream media - but little from academic circles (which confirms what creationist groups say about themselves and how academics ignore them)


James Baker [me!], A Place Full of Data

Slides: http://www.slideshare.net/drjwbaker/2014-12-ihrwacbigdatatalk

Notes: https://gist.github.com/drjwbaker/c0799eefc99fe8ad7e3e


Research Showcase Panel 3

Millward (LSHTM), Digital Barriers and the accessible web

Too much stuff! Research terms such as 'SCOPE' or 'MIND' obviously problematic.

Once you start filtering it is impossible to compare one search to another.

What the heck do you actually have once you've built a "human readable" corpus?

Subsetting, say, RNIB to only focus on a subset (ac.uk) reaps useful results.
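A small sketch of that kind of subsetting, assuming search results arrive as a list of URLs. The URLs and the `in_domain` helper are hypothetical. Matching on the parsed hostname, rather than on the raw string, avoids false hits such as `notac.uk` or a path that happens to contain `ac.uk`.

```python
from urllib.parse import urlsplit

def in_domain(url, suffix):
    """True if the URL's host is the given domain suffix or a subdomain of it."""
    host = (urlsplit(url).hostname or "").lower()
    suffix = suffix.lower().lstrip(".")
    return host == suffix or host.endswith("." + suffix)

# Hypothetical search results for 'RNIB' (invented for illustration).
results = [
    "http://www.example.ac.uk/disability/rnib-guidance.html",
    "http://www.example.co.uk/shop/rnib",
    "http://library.example.ac.uk/access",
    "http://notac.uk/page",
]

subset = [url for url in results if in_domain(url, "ac.uk")]
print(subset)
```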

Research questions need to be focused. And browsing about can be just as powerful. How much of search results should be what we want?

Rowan Aust (Holloway), Online response to institutional crisis: the BBC and Jimmy Savile

Savile can never be banished from the archives of the BBC. TV is not just what is on the screen but experience of interacting with that on the screen.

Comparing searches for Savile in the archive with live site to explore potential omissions or changes from the latter.

BBC online does remain an archive of its news output - uncritical pre-scandal Savile profiles remain there, unedited (for example)

TOTP sites seem to have adapted their focus (from presenters to artists) to flatten mention of Savile.

Louis Theroux comments positive about Jimmy have been removed: reputation management.

Chris Fryer, The UK Parliament Web Archive

@C_Fryer

Archiving web pages within 2009 plus official social media. Highly curated. Come and work with us.


Peter Webster, What does the future hold? Archiving the UK web at the BL

UK Web Archive is three archives in one.

  • Open Web (2004-): curated, selective, permissions based.
  • Legal Deposit UK Web Archive (2013-)
  • JISC UK Web Domain set

There are other ways into this content... Derived data http://data.webarchive.org.uk/opendata/

Co-design has driven the project. And users needed search, filters et al ... http://www.webarchive.org.uk/shine ... iterative search that leads towards the creation of a corpus, that itself could then be further queried, iterated, saved, exported.

Working towards functionality around sharing corpora. Clustering using Apache Solr, to make algorithmic chunks to be drilled into.

Want to foreground websites that are no longer live.

More granular integration with Memento - global protocol for connecting web archive content and enabling navigation between them.
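For context, Memento (RFC 7089) works by datetime negotiation: a client asks a TimeGate for a resource as it was at a given moment by sending an `Accept-Datetime` header, and the archived copy it is pointed to carries a `Memento-Datetime` header. The sketch below only builds the request header and sends nothing; the `accept_datetime_header` helper is made up for illustration, and which TimeGate to query is left out as an assumption the reader has to fill in.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

def accept_datetime_header(moment):
    """Build the Accept-Datetime header (an HTTP-date in GMT) for a TimeGate request."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    # Memento expects the datetime expressed in GMT, so convert before formatting.
    as_utc = moment.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(as_utc, usegmt=True)}

# Ask for the state of a page on the day of this event.
headers = accept_datetime_header(datetime(2014, 12, 3, tzinfo=timezone.utc))
print(headers)
```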


Roundtable: ethics of big data research

Ethics isn't something that happens to someone else.

Niels Brugger: The core of the word publication is public, regardless of media type. Ethical issues when people don't know the reach of their publication or when things are published unintentionally. But these are the only issues. It was published.

Kathryn Eccles: Bringing precedents forward important. Historians are good at teasing out what is private in order to explain it better. Digital Panopticon. Are we more sensitive about privacy now we make our work and our working open and open access?

Anne Alexander: We live in a world where 'big' data approaches are more and more embedded in our everyday lives. In this context, ethics must be a process based on a set of principles.


Some admin...

This work is licensed under a Creative Commons Attribution 4.0 International License.

Exceptions: embeds to and from external sources
