funkyfuture/proposal_django_xml_tei.md

An XML-TEI-targeted Django app - a proposal

Abstract

This document describes a first iteration of building blocks that are meant to
ease the development of XML-TEI-related applications with the Django web
framework.

Motivation

Put simply:

- Java sucks.
- XQuery sucks.
- Python rules.
- Django is okay.
- XML-TEI is pretty neat.

The first two points refer to technologies that do not meet the needs of
modern web development, yet they seem unavoidable when dealing with
XML-TEI-encoded corpora at the moment.

High-level objectives


- provide an extendible Django application to handle XML-TEI in web apps for
  the scientific community
- a simple default setup that enables developers to provide a satisfying
  web-based view on XML-TEI documents
- create a community that contributes extensions that suit the needs of
  specific scientific branches

Paradigms


- URLs, hashes and XPaths are first-class identifiers
- cache a lot, but keep it simple
- everything may break apart, but is reconstructable from the first-class
  identifiers
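
To illustrate how a hash could serve as a stable identifier, here is a minimal sketch that hashes the canonical (C14N) serialization of an XML fragment, so that serialization differences such as attribute order do not change the identifier. The function name `content_hash` and the choice of SHA-256 are assumptions for illustration; the proposal does not specify how hashes are computed.

```python
import hashlib
import xml.etree.ElementTree as ET


def content_hash(xml_fragment: str) -> str:
    """Return a hex digest of the canonical form of an XML fragment.

    Assumption: SHA-256 over the C14N 2.0 serialization; the proposal
    itself leaves the hashing scheme open.
    """
    canonical = ET.canonicalize(xml_fragment)  # available since Python 3.8
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Attribute order and empty-element syntax do not affect the identifier.
same = content_hash('<pb n="1" facs="#f1"/>') == content_hash('<pb facs="#f1" n="1"></pb>')
```

With such identifiers, cached or indexed data can be rebuilt from the source documents at any time and still be addressed by the same keys.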

Framework discussion

Django is a powerful web framework that allows rapid and consistent design of
web applications within a healthy ecosystem.
It provides major core functionality for the tasks at hand; the rest is
covered by third-party apps.
The framework is well designed, well maintained, well known and has proven to
be extendible in every corner.
Nonetheless, a further iteration should aim to provide framework-agnostic
libraries where appropriate. E.g. the indexes may alternatively be based on
the sqlalchemy_mptt package.
https://djangoproject.com
MPTT (modified preorder tree traversal) is an approach to storing trees in a
relational database; it could be the foundation of an index. See the
discussion below.
https://django-mptt.github.io/django-mptt/
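
As a rough illustration of the MPTT idea, the following sketch numbers the nodes of a small tree in preorder and assigns each node a left and a right bound. All descendants of a node then lie strictly between its bounds, which a relational database can answer with a simple range query. The helper names and the sample tree are made up for this example, and node names are assumed to be unique to keep it short; django-mptt maintains comparable bounds on its models.

```python
from typing import Dict, List, Optional, Tuple

# A tree is a (name, children) pair; names are assumed to be unique here.
TREE = ("TEI", [("teiHeader", []), ("text", [("body", [("p", [])])])])

Bounds = Dict[str, Tuple[int, int]]


def assign_bounds(tree, bounds: Optional[Bounds] = None, counter: int = 1):
    """Walk the tree in preorder and record (left, right) for each node.

    Returns the bounds mapping and the next free counter value.
    """
    if bounds is None:
        bounds = {}
    name, children = tree
    left = counter
    counter += 1
    for child in children:
        _, counter = assign_bounds(child, bounds, counter)
    bounds[name] = (left, counter)
    return bounds, counter + 1


def descendants(bounds: Bounds, name: str) -> List[str]:
    """Everything whose bounds lie strictly inside those of `name`."""
    left, right = bounds[name]
    return sorted(n for n, (l, r) in bounds.items() if left < l and r < right)


bounds, _ = assign_bounds(TREE)
```

In SQL terms, `descendants` corresponds to a `WHERE lft > %s AND rght < %s` filter on an indexed pair of columns, which is what makes subtree lookups cheap.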
Redis is a highly scalable in-memory key-value store with optional persistence
and some invalidation primitives (such as key expiry), which makes it the
current workhorse when it comes to caching.
https://redis.io
Some thoughts on caching:

- do cache as much as possible, but do not endlessly duplicate contents
  - Redis does not deduplicate identical values that are stored under
    different keys, so deduplication has to come from the key scheme, e.g. by
    keying cached content on its content hash
- make it simple for users to develop extensions without worrying about
  caching, but let users fine-tune cache parameters

Building blocks

The following building blocks need to be designed and implemented separately,
although there may be interdependencies.

- ORM models that index the contents of an XML document for quick querying
- template filters that turn XML into aspect-scoped representational models
- template tags that transform such objects into a view
- a REST API that provides clever ways to obtain parts of a collection or a
  document in its raw or a transformed representation
- default views and templates to render a paginated and a chapter-scoped
  representation of a document, including possible digitized representations
  of a source; one view for importing documents from a URL or a client's local
  filesystem

Oh, we could also have syntactic sugar for XPaths, modelled on pathlib.Path:
# XPath is a hypothetical wrapper class; plain strings do not support `//`
path = XPath('.') // 'foo' / a_string_symbol
path = (XPath('.') // 'foo' / a_string_symbol)[1]  # :-(
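
A minimal sketch of how such sugar might be implemented follows. The class name `XPath` and its behaviour are assumptions made for illustration, not an existing library API. Since plain strings do not implement `//`, the expression has to start from a wrapper object, which then overloads `/`, `//` and indexing.

```python
class XPath:
    """Hypothetical XPath builder, loosely modelled on pathlib.Path."""

    def __init__(self, expression: str = ".") -> None:
        self.expression = expression

    def __truediv__(self, step: str) -> "XPath":
        # `/` appends a child step
        return XPath(f"{self.expression}/{step}")

    def __floordiv__(self, step: str) -> "XPath":
        # `//` appends a descendant step
        return XPath(f"{self.expression}//{step}")

    def __getitem__(self, predicate) -> "XPath":
        # The predicate applies to the whole expression built so far,
        # mirroring the grouping of the Python expression.
        return XPath(f"({self.expression})[{predicate}]")

    def __str__(self) -> str:
        return self.expression


path = XPath(".") // "foo" / "bar"
first = path[1]
```

Here `str(path)` yields `.//foo/bar` and `str(first)` yields `(.//foo/bar)[1]`. Because `/` and `//` share the same precedence in Python, steps are appended from left to right; indexing binds more tightly, though, so the parentheses lamented above cannot be avoided.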

Document index / ORM models

PostgreSQL may be the only supported database, in order to leverage its
superior features like full-text search, array fields and hstore.
On the other hand, the mechanics relying on these may all target Redis instead.
Here are some illustrative stubs:
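from typing import List

from django.conf import settings
from django.contrib.postgres.fields import HStoreField
from django.db.models import (
    CASCADE, CharField, DateTimeField, ForeignKey, ManyToManyField, Model,
)
from mptt.models import MPTTModel, TreeForeignKey

# Cache and NotCached stand for a caching layer that is yet to be designed.


class Document(Model):
    source = CharField(db_index=True)  # or rather a URLField !?
    collection = ManyToManyField('Collection')

    def __getitem__(self, xpath):
        return self.recent_version[xpath]

    @property
    def url(self) -> str:
        pass

    @property
    def recent_version(self) -> 'DocumentVersion':
        pass

    @property
    def versions(self) -> List['DocumentVersion']:
        pass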


class DocumentVersion(Model):
    document = ForeignKey('Document', on_delete=CASCADE)
    root_element = TreeForeignKey('Element', on_delete=CASCADE)
    mod_time = DateTimeField()

    def __getitem__(self, xpath) -> 'Element':
        return self.root_element[xpath]

    @property
    def content_hash(self) -> str:
        return self.root_element.content_hash

    @classmethod
    def from_file(cls, path) -> None:
        pass


class Element(MPTTModel):
    parent = TreeForeignKey('self', on_delete=CASCADE, null=True, blank=True,
                            related_name='children', db_index=True)
    tag = CharField(blank=True, db_index=True)  # empty for text nodes
    # alternatively this might be an Attribute model that is referred to through m2m relations
    attributes = HStoreField()
    content_hash = CharField(editable=False, unique=True)  # unique implies an index; primary_key=True ?!

    xpath_cache = Cache(namespace='elements.xpath', ttl=settings.…)
    text_cache = Cache(namespace='elements.text', ttl=settings.…, max_size=settings.…)
    stripped_text_cache = Cache(namespace='elements.stripped_text', ttl=settings.…, max_size=settings.…)

    def __getitem__(self, xpath) -> 'Element':
        key = (self.id, xpath)
        try:
            target_id = self.xpath_cache.get(key)
        except NotCached:
            target_id = self._evaluate_xpath(xpath)
            self.xpath_cache.set(key, target_id)
        # managers are not accessible through model instances
        return type(self).objects.get(id=target_id)

    @property
    def text(self) -> str:
        try:
            text = self.text_cache.get(self.id)
        except NotCached:
            if self.tag:
                # each attribute is rendered with a leading space and a quoted value
                attributes = ''.join(
                    ' {}="{}"'.format(k, v) for k, v in self.attributes.items()
                )
                text = '<{tag}{attributes}>{contents}</{tag}>'.format(
                    tag=self.tag,
                    attributes=attributes,
                    contents=''.join(child.text for child in self.get_children()),
                )
            else:
                # text nodes keep their content in the attributes
                text = self.attributes['text']
            self.text_cache.set(self.id, text)
        return text

    def __str__(self) -> str:
        return self.text

    @property
    def stripped_text(self) -> str:
        …
        if self.tag:
            return ''.join(child.stripped_text for child in self.get_children())
        else:
            return self.text
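
The stubs above rely on a `Cache` class and a `NotCached` exception that are not defined anywhere in this proposal. The following in-memory sketch shows the interface they would need to offer (`get` raising `NotCached`, `set`, a time-to-live and a size limit). It is only a stand-in to make the contract concrete; the proposal intends Redis as the backend, and the eviction strategy chosen here (oldest entry first) is an assumption.

```python
import time
from collections import OrderedDict
from typing import Any, Hashable, Optional


class NotCached(KeyError):
    """Raised when a key is missing or its entry has expired."""


class Cache:
    """In-memory stand-in for the cache interface used by the model stubs."""

    def __init__(self, namespace: str, ttl: Optional[float] = None,
                 max_size: Optional[int] = None) -> None:
        self.namespace = namespace
        self.ttl = ttl            # seconds; None means entries never expire
        self.max_size = max_size  # None means unbounded
        self._entries: "OrderedDict[Hashable, tuple]" = OrderedDict()

    def get(self, key: Hashable) -> Any:
        try:
            expires_at, value = self._entries[key]
        except KeyError:
            raise NotCached(key) from None
        if expires_at is not None and expires_at <= time.monotonic():
            del self._entries[key]
            raise NotCached(key)
        return value

    def set(self, key: Hashable, value: Any) -> None:
        expires_at = None if self.ttl is None else time.monotonic() + self.ttl
        self._entries[key] = (expires_at, value)
        self._entries.move_to_end(key)
        if self.max_size is not None and len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # evict the oldest entry


xpath_cache = Cache(namespace="elements.xpath", max_size=2)
xpath_cache.set((1, ".//p"), 42)
```

Raising an exception on a miss, rather than returning `None`, keeps `None` available as a legitimate cached value.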
XML documents are not trees of elements only: they contain text in between
the element nodes (mixed content), and this text must also be stored in the
database representation. The current idea is to store such text as instances
with an empty tag. The logic of these instances differs from that of proper
element nodes.
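
The empty-tag idea can be tried out without a database. The sketch below converts a parsed fragment into a plain node tree in which every piece of text becomes a node with an empty tag and a `text` entry in its attributes, as in the `Element` stub. The `Node` class and the function names are made up for this illustration; `xml.etree.ElementTree` is used only to keep it dependency-free.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    tag: str  # empty string marks a text node
    attributes: Dict[str, str] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)


def to_nodes(element: ET.Element) -> Node:
    """Convert an element, turning text and tails into empty-tag nodes."""
    node = Node(tag=element.tag, attributes=dict(element.attrib))
    if element.text:
        node.children.append(Node(tag="", attributes={"text": element.text}))
    for child in element:
        node.children.append(to_nodes(child))
        if child.tail:  # text that follows the child inside its parent
            node.children.append(Node(tag="", attributes={"text": child.tail}))
    return node


def stripped_text(node: Node) -> str:
    """Concatenate the text of all text nodes below `node`."""
    if node.tag:
        return "".join(stripped_text(child) for child in node.children)
    return node.attributes["text"]


root = to_nodes(ET.fromstring('<p>An <hi rend="italic">old</hi> book.</p>'))
```

In this example the `p` node ends up with three children: a text node, the `hi` element and another text node, so document order is preserved by the sibling order alone.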
Storing all this in a relational database rather than just using lxml for
querying is motivated by the fact that indexes in a relational database are
persistent and can be shared by multiple application instances.
On the other hand, it is another rather complex layer in the stack; dropping
the MPTT-modelled part and relying solely on lxml and a proper cache would
reduce the implementation and maintenance effort.
This should also be weighed against the elegance that can be achieved in
users' implementations of representational models (see below).
Full-text search that correlates results spanning multiple elements with the
containing element will be a hard nut to crack.

Template filters and tags

Let's have a look at a simple template example:
{% load dta_book_filters book_tags %}

{% with book=document|dta_book %}
  <h1>{{ book.authors }}: {{ book.title }}</h1>
  <h2>{{ page_name }}/{{ book.pages_count }}</h2>
  <div>
    {% with page=book|book_page:page_name %}
      {% pageview page mode="facsimile" %}
    {% endwith %}
  </div>
{% endwith %}
This is even simpler:
{% load dta_book_filters book_tags %}

{% toc document|dta_book %}
Who's mentioned in a doc and has a d-nb.de record?
{% load dta_indexes %}

{% with referenced_people=document|dta_people:"d-nb.de" %}
  <ul>
  {% for person in referenced_people %}
    <li><a href="{{ person.dnb_url }}">{{ person.name }}</a></li>
  {% endfor %}
  </ul>
{% endwith %}
Filters consume a document, its nodes or a representational model and transform
it into a representational model instance. They aim to provide access to an
aspect-focused, simplified subset of the contents. A module of filters should
implement all features of a TEI schema.
Representational model instances have lazily evaluated, cached properties.
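
A representational model with lazily evaluated, cached properties might look roughly like the sketch below. The class `Book`, its properties and the XPath expressions are assumptions for illustration; only the TEI namespace and the `titleStmt`, `title` and `author` element names come from the TEI guidelines. The standard library's ElementTree is used here to keep the sketch self-contained.

```python
import xml.etree.ElementTree as ET
from functools import cached_property
from typing import List

TEI = {"tei": "http://www.tei-c.org/ns/1.0"}


class Book:
    """Hypothetical representational model focused on bibliographic aspects."""

    def __init__(self, root: ET.Element) -> None:
        self.root = root

    @cached_property
    def title(self) -> str:
        # evaluated on first access, then stored on the instance
        element = self.root.find(".//tei:titleStmt/tei:title", TEI)
        return "" if element is None else "".join(element.itertext()).strip()

    @cached_property
    def authors(self) -> List[str]:
        found = self.root.findall(".//tei:titleStmt/tei:author", TEI)
        return ["".join(element.itertext()).strip() for element in found]


def dta_book(document_root: ET.Element) -> Book:
    """What a `dta_book` filter function could boil down to."""
    return Book(document_root)


SAMPLE = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader><fileDesc><titleStmt>'
    "<title>Faust</title><author>Goethe</author>"
    "</titleStmt></fileDesc></teiHeader></TEI>"
)
book = dta_book(ET.fromstring(SAMPLE))
```

In a Django app, `dta_book` would be registered as a filter on a `django.template.Library` instance; that part is omitted here so the sketch runs without Django.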
Tags consume representational model instances and other arguments to render
predefined views. Django's template and page caching should do the job here;
no extra effort is required. Tags will mainly be implemented per project.

Further considerations

Some noteworthy thoughts on details:

- the import / parsing of documents should be executed in a background job in
  order not to block the application with lengthy startup times on large
  collections
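
As a rough sketch of the background-job idea, the snippet below hands the parsing of a document to a worker thread and returns immediately with a future. The function names are made up, and a thread pool from the standard library is used only to keep the example self-contained; in a deployed Django project a dedicated task queue would probably be the more robust choice.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import Future, ThreadPoolExecutor

# A small pool of workers that parse documents outside the request cycle.
_executor = ThreadPoolExecutor(max_workers=2)


def parse_document(source: str) -> str:
    """Parse an XML string and return its root tag.

    Stand-in for the real work of parsing and indexing a document.
    """
    return ET.fromstring(source).tag


def schedule_import(source: str) -> Future:
    """Queue a document for import without blocking the caller."""
    return _executor.submit(parse_document, source)


job = schedule_import("<TEI><text/></TEI>")
```

The caller can carry on serving requests and check `job.done()` or call `job.result()` later.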

Outlook

The following features are way beyond the scope of this document and of an
initial development stage, and they require user and development feedback, yet
they should be kept in mind:

- a detailed plan for versatile, deduplicating versioning
- a REST API that allows modification of documents through a primitive set of
  directives

Roadmap


- post this, gather feedback
- …
- release a final alpha, including:
  - an example project that lets users host and view the DTA Kernkorpus with

docker-compose up -d
xdg-open http://localhost:8000