@alxjrvs
Forked from sethwoodworth/seths-grant.rst
Last active December 14, 2015 23:29

### How would your project help to further humanities research?

(e.g. it will make more content and data openly available for researchers to use, it will produce novel insights into humanities data through visualisations...) *

GITenberg will improve the quality of public domain book transcriptions by treating books like open source code. The GITenberg project has compiled an archive of 40,000 public domain ebooks. By applying open source software tools and techniques to these texts, we will make the process of collaboratively improving ebook presentation -- from transcription to layout -- easier and more transparent, and the resulting material higher in quality.

Readers and editors will be able to improve the quality of these public domain ebooks by, for example, fixing transcription errors, reformatting style markup, or editing included images.

These tools, widely used in the open source software community, include version tracking, annotation, commenting, metadata, and textual interlinking. They bring texts to life by building a community of readers, editors, and commentators who are connected through the text itself. Bringing modern textual analysis and management tools to ebooks will make it easier for researchers and technologists to work with the texts and to integrate public domain books into other projects.

The authors, editors, and publishers of these books are gone. These books, being in the public domain, belong to everyone. It is our shared responsibility not only to maintain these books but also to appropriate, reinterpret, and incorporate them into new media, new software, and new frontiers.

### Which aspects of your project will utilize open content, open data and/or open source tools and how will they be used? *

GITenberg’s servers currently contain 40,000 public domain books, from Joyce to Shakespeare to Darwin. We will create a separate _git_ repository for each of these books, hosted on the popular social coding site GitHub. Git provides a system for people to make edits and additions, attach ancillary data to their copies of the books, and then submit these as proposed changes to the files hosted on GitHub. These changes can be reviewed, publicly discussed, and accepted or rejected by the community. Git also tracks these changes, creating an open public record of their development over time and attributing the collaborative work to its various participants.

Placing these texts on GitHub and establishing standard means of storing, contributing to, adapting, and analyzing them will build bridges between technologists and the literature, criticism, and humanities communities, inspiring collaborations and yielding new ways of reading and understanding this rich collection of texts.

A software-style "bug tracker" allows for the reporting of errors found by readers, which may be discussed and fixed without necessarily altering the entire document.

### Analysis

In addition to the easily modifiable source texts, this system will automatically regenerate free ebooks (.epub, .pdf, .kindle) when changes are accepted. These files can be used as a basis for digital analysis. For example, we will offer three simple analysis scripts for others to build on:

  • word count (number of words)
  • vocabulary count (frequency of occurrence of each distinct word)
  • reading level (Flesch-Kincaid Grade Level test)
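As a rough illustration of what these three scripts might look like, here is a minimal Python sketch. It is not the project's actual code: the function names, the regular-expression tokenizer, and the vowel-group syllable heuristic are all assumptions made for this example, and it uses only the standard library rather than NLTK. The reading-level function applies the standard Flesch-Kincaid Grade Level formula, 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59.

```python
import re
from collections import Counter

# Hypothetical tokenizer: runs of letters, with an optional internal apostrophe.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return WORD_RE.findall(text.lower())


def word_count(text):
    """Number of word tokens in the text."""
    return len(tokenize(text))


def vocabulary_count(text):
    """Map each distinct word to the number of times it occurs."""
    return Counter(tokenize(text))


def count_syllables(word):
    """Rough heuristic (an assumption): count vowel groups, minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word)))


def flesch_kincaid_grade(text):
    """Flesch-Kincaid Grade Level, using naive sentence and syllable counts."""
    words = tokenize(text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)
```

A production version would likely swap in a proper sentence splitter and a pronunciation dictionary for syllables, since the heuristics here will miscount abbreviations and silent vowels.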

These analyses will be added to a metadata file stored in each book's repository, and the complete analyses will also be made available. With these examples, metadata -- including quantitative analysis of texts -- that would be useful to developers can be generated and distributed. One example of the potential for programmatic textual analysis is a search engine for books that lets searchers sort by reading level.
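To make the metadata idea more concrete, the following Python sketch shows one way per-book analysis results could be bundled into records and sorted by reading level. The JSON layout, the field names, and the helper functions are assumptions for illustration, not a format the project has defined, and the records use placeholder titles and numbers rather than real measurements.

```python
import json


def build_metadata(title, word_count, vocabulary_size, reading_level):
    """Bundle analysis results into a JSON-serialisable record.

    The field names are hypothetical, chosen only for this example.
    """
    return {
        "title": title,
        "word_count": word_count,
        "vocabulary_size": vocabulary_size,
        "reading_level": reading_level,
    }


def sort_by_reading_level(records, descending=False):
    """Return the records ordered by their reading-level score."""
    return sorted(records, key=lambda r: r["reading_level"], reverse=descending)


# Placeholder records for illustration only; the numbers are not measurements.
records = [
    build_metadata("Book A", 1200, 400, 9.1),
    build_metadata("Book B", 800, 250, 4.3),
    build_metadata("Book C", 1500, 520, 6.7),
]

# Print one JSON line per book, easiest reading level first.
for record in sort_by_reading_level(records):
    print(json.dumps(record))
```

Keeping each record as plain JSON would let the metadata live alongside the text in its repository and be read by any tool, though other formats would work equally well.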

### List any open content, open data and/or open source tools you plan to use *

# Open Content: The original source books will come from Project Gutenberg. Future content sources will include archive.org and Google Books. The project will improve this open content and contribute the changes back to the original projects.

# Open Data: By applying to books the same tools used in open software projects, GITenberg allows the data in books to be treated like the data in software projects. This will effectively open this rich collection of texts to the benefits of open data analysis and practices.

# Open Source:

  • The Python library Natural Language Toolkit (NLTK) will be used for textual analysis.
  • We have written an open source Python library to handle Library of Congress Classifications (LCCC). The LCCC library will be distributed as a separate open source project, with documentation, as a by-product of GITenberg.
  • Git, a modern distributed version control system, will be used to store and track changes to each book.

### How would you use the prize money to support your project? *

As proof of concept, 700 example books are already on GitHub. However, work remains to improve interoperability and documentation. This grant would also fund a long-term server home. The project needs at least two months of focused developer time, which we have so far been unable to provide.
