alxjrvs/seths-grant.rst

## seths-grant.rst

      
### How would your project help to further humanities research?


(e.g. it will make more content and data openly available for researchers to use,
it will produce novel insights into humanities data through visualisations...) *
GITenberg will improve the quality of transcriptions of public domain books
by treating books like open source code.
The GITenberg project has compiled an archive of 40,000 public domain ebooks, and
by applying open source software tools and techniques to these texts,
the process of collaboratively improving ebook presentation -- from transcription to layout -- will be
easier, more transparent, and produce higher quality material.
Readers and editors of ebooks will be able to
improve the quality of these public domain ebooks by, for example,
fixing errors made during transcription, reformatting style markup, or editing included images.
The tools, broadly used in the open source software community, include
version tracking, annotations, commenting, metadata and textual interlinking,
bringing texts to life by building a community of readers, editors, and commentators
who are connected through the text itself.
Bringing modern textual analysis and management tools to ebooks will make it easier for
researchers and technologists to work with the texts, and to integrate public domain books into other projects.
The authors, editors, and publishers of these books are gone.
These books, being in the public domain, belong to everyone.
It is our shared responsibility to maintain these books,
but also to appropriate, reinterpret and incorporate these books
into new media, new software, and new frontiers.

### Which aspects of your project will utilize open content, open data and/or open source tools and how will they be used? *

GITenberg’s servers currently contain 40,000 public domain books,
from Joyce to Shakespeare to Darwin.
We will create a separate _git_ repository,
hosted on the popular social coding site GitHub.com, for each of these books.
Git provides a system for people to make edits and additions, and to add ancillary data,
to their copies of the books
and then submit them as proposed changes to the files hosted on GitHub.
These changes can be reviewed, publicly discussed,
and accepted or rejected by the community.
Git also tracks these changes,
creating an open public record of their development over time, and
attributing the collaborative work to its various participants.
Placing these texts on GitHub and establishing standard means of
storing, contributing to, adapting, and analyzing them will build
bridges between technologists and the literature, criticism, and humanities communities,
inspiring collaborations and yielding new ways of reading and understanding this
rich collection of texts.
A software-style "bug tracker" allows for the reporting of errors found by readers,
which may be discussed and fixed without necessarily altering the entire document.
### Analysis
In addition to the easily modifiable source texts,
this system will automatically regenerate free ebooks (.epub, .pdf, .kindle) when changes are accepted.
These files can be used as a basis for digital analysis.
For example, we will offer three simple analysis scripts for others to build on:

+ word count (number of words)
+ vocabulary count (frequency of occurrence of distinct words)
+ reading level (Flesch-Kincaid Grade Level test)

The results of these analyses will be added to a metadata file stored in each book's repository, and
the complete analyses will also be made available.
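
As a rough illustration of the three analyses listed above, the following is a minimal sketch using only the Python standard library. It is not the project's actual script: the function names, the metadata keys, and the vowel-group syllable heuristic are all assumptions made for this example, and a fuller version might use NLTK for tokenization instead of regular expressions.

```python
import json
import re
from collections import Counter

WORD_RE = re.compile(r"[A-Za-z']+")


def count_syllables(word):
    """Rough heuristic (an assumption): one syllable per group of vowels."""
    lowered = word.lower()
    count = len(re.findall(r"[aeiouy]+", lowered))
    # Treat a trailing 'e' as silent ("take"), except in "-le" endings ("table").
    if lowered.endswith("e") and not lowered.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def analyze(text):
    """Return word count, vocabulary statistics, and reading level for a text."""
    words = WORD_RE.findall(text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    vocabulary = Counter(w.lower() for w in words)
    grade = None
    if words and sentences:
        syllables = sum(count_syllables(w) for w in words)
        # Flesch-Kincaid Grade Level formula.
        grade = round(
            0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59,
            2,
        )
    return {
        "word_count": len(words),
        "vocabulary_count": len(vocabulary),
        "most_common": vocabulary.most_common(3),
        "reading_level": grade,
    }


sample = "The cat sat on the mat. The dog ran."
metadata = analyze(sample)
# The JSON form is what could be written to a per-book metadata file.
print(json.dumps(metadata, indent=2))
```

Very short or very simple passages can produce a grade level below zero, so the score is best read as a relative measure across books rather than a literal school grade.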
With these examples, metadata -- including quantitative analysis of texts -- that would be useful to developers can be generated
and distributed.
An example of the potential for programmatic textual analysis might be a search engine for books that
lets searchers sort by reading level.
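
The search engine idea can be sketched in a few lines of Python. This is only an illustration: the record layout, the `reading_level` key, and the function name are hypothetical, and the titles and scores are placeholders rather than measured values.

```python
def books_up_to_grade(records, max_grade):
    """Return the records at or below max_grade, easiest first."""
    readable = [r for r in records if r["reading_level"] <= max_grade]
    return sorted(readable, key=lambda r: r["reading_level"])


# Hypothetical per-book metadata records; titles and scores are placeholders.
catalog = [
    {"title": "Example Book A", "reading_level": 11.2},
    {"title": "Example Book B", "reading_level": 4.8},
    {"title": "Example Book C", "reading_level": 7.5},
]

for record in books_up_to_grade(catalog, max_grade=8):
    print(record["title"], record["reading_level"])
```

Because each book's metadata would live in its own repository, a tool like this would only need to collect those files; it would not have to re-analyze the texts themselves.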

### List any open content, open data and/or open source tools you plan to use *

# Open Content:
The original source books will come from Project Gutenberg. Future
content sources will include archive.org and Google Books.
The project will improve this open content, and redistribute the changes to
the original projects.
# Open Data:
By treating books with the same tools used in open software projects,
GITenberg allows for the data in books to be treated as data in software projects.
This will effectively open this rich collection of texts to the benefits of open data analysis and practices.
# Open Source:
+ uses the Python library Natural Language Toolkit (NLTK) for textual analysis
+ includes an open source Python library that we have written to handle Library of
  Congress Classifications (LCCC).
  The LCCC library will be distributed as a separate open source project,
  with documentation, as a by-product of GITenberg.
+ uses Git, a modern Distributed Version Control System


### How would you use the prize money to support your project? *

As proof of concept, there are already 700 example books on Github.
However, there is still work to be done to improve interoperability.
Documentation also needs to be improved.
A long-term server home would also be funded by this grant.
The project needs at least two months of focused developer time,
which we have thus far been unable to offer.