@funkyfuture
Last active March 3, 2017 16:14

An XML-TEI-targeted Django app - a proposal

Abstract

This document describes the approach for a first iteration of building blocks that make it easy to develop XML-TEI-related applications with the Django web framework.

Motivation

Simply elaborated:

  • Java sucks.
  • XQuery sucks.
  • Python rules.
  • Django is okay.
  • XML-TEI is pretty neat.

The first two points refer to technologies that do not fulfil the needs of modern web development, yet they currently seem unavoidable when dealing with XML-TEI encoded corpora.

High-level objectives

  • provide an extensible Django application to handle XML-TEI in webapps for the scientific community
  • a simple default setup that enables developers to provide a satisfying web-based view on XML-TEI documents
  • create a community that contributes extensions that suit the needs of specific scientific branches

Paradigms

  • URLs, hashes and XPaths are first class identifiers
  • cache a lot, but keep it simple
  • everything may break apart, but is reconstructable from the first class identifiers

Framework discussion

Django is a powerful web framework that allows rapid and consistent design of web applications within a healthy ecosystem. It provides the major core functionality for the tasks at hand; other needs are covered by third-party apps. The framework is well designed, well maintained, well known and has proven to be extensible in every corner. Nonetheless, a further iteration should aim to provide framework-agnostic libraries where appropriate. E.g. the indexes may alternatively be based on the sqlalchemy_mptt package.

https://djangoproject.com

MPTT (modified preorder tree traversal) is an approach to store trees in a relational database. This could be the foundation of an index. See discussion below.

https://django-mptt.github.io/django-mptt/

Redis is a highly scalable in-memory key-value store with optional persistence and key expiration, and is thus the current workhorse when it comes to caching.

https://redis.io

Some thoughts on caching:

  • do cache as much as possible, but do not endlessly duplicate contents
    • note that Redis does not deduplicate identical values stored under different keys by itself; keying cached content by its content hash would achieve that deduplication
  • make it simple for users to develop extensions without worrying about caching, but let users fine-tune cache-parameters
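One way to avoid duplicating identical content in a key-value store is to derive the key from the content itself. The following is a minimal sketch of that idea, not a proposed implementation: `ContentStore` is a hypothetical in-memory stand-in for Redis, and the choice of SHA-256 is an assumption.

```python
import hashlib
from typing import Dict


class ContentStore:
    """Hypothetical in-memory stand-in for a key-value store such as Redis."""

    def __init__(self) -> None:
        self._blobs: Dict[str, str] = {}

    def put(self, content: str) -> str:
        # The key is derived from the content, so identical content
        # always maps to the same single entry.
        key = hashlib.sha256(content.encode('utf-8')).hexdigest()
        self._blobs.setdefault(key, content)
        return key

    def get(self, key: str) -> str:
        return self._blobs[key]

    def __len__(self) -> int:
        return len(self._blobs)


store = ContentStore()
first = store.put('<p>same paragraph</p>')
second = store.put('<p>same paragraph</p>')
other = store.put('<p>another paragraph</p>')
print(first == second, len(store))
```

Anything that refers to the content (an element, a rendered fragment) would then only hold the hash, which also fits the paradigm that hashes are first class identifiers.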

Building blocks

The following building blocks need to be designed and implemented separately, while there may be interdependencies.

  • ORM models that index the contents of an XML document for quick querying
  • template filters that turn XML into aspect-scoped representational models
  • template tags that transform such objects into a view
  • a REST API that provides clever ways to obtain parts of a collection or a document in its raw or a transformed representation
  • default views and templates to render a paginated and a chapter-scoped representation of a document including possible digitized representations of a source; one view for importing documents from a URL or a client's local filesystem

oh, we could also have syntactic sugar for XPaths like pathlib.Path:

path = '.' // 'foo' / a_string_symbol
path = ('.' // 'foo' / a_string_symbol)[1]  # :-(
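Plain strings do not support these operators, so the sugar needs a small wrapper class. The following is a minimal sketch of how that might look; the class name `XPath` and its behaviour are assumptions for illustration and not part of any existing library.

```python
class XPath:
    """Hypothetical builder that composes XPath expressions with operators."""

    def __init__(self, expression: str = '.') -> None:
        self.expression = expression

    def __truediv__(self, step: str) -> 'XPath':
        # path / 'foo' appends a child step
        return XPath('{}/{}'.format(self.expression, step))

    def __floordiv__(self, step: str) -> 'XPath':
        # path // 'foo' appends a descendant step
        return XPath('{}//{}'.format(self.expression, step))

    def __getitem__(self, predicate) -> 'XPath':
        # path[1] or path['@n="2"'] appends a predicate to the last step
        return XPath('{}[{}]'.format(self.expression, predicate))

    def __str__(self) -> str:
        return self.expression


a_string_symbol = 'bar'
path = XPath('.') // 'foo' / a_string_symbol
indexed = (XPath('.') // 'foo' / a_string_symbol)[1]
print(path)     # .//foo/bar
print(indexed)  # .//foo/bar[1]
```

The parentheses in the second assignment are still required, because Python's subscription binds tighter than `/` and `//`. Note also that this sketch appends the predicate to the last step, which in XPath is not the same as `(.//foo/bar)[1]`.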

Document index / ORM models

The only supported database may be Postgres, to leverage its superior features like full-text search, array fields and hstore fields. On the other hand, the mechanics relying on these could all target Redis instead.

Here are some illustrative stubs:

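from typing import List

from django.conf import settings
from django.contrib.postgres.fields import HStoreField
from django.db.models import CASCADE, CharField, DateTimeField, ForeignKey, ManyToManyField, Model
from mptt.models import MPTTModel, TreeForeignKey


class Document(Model):
    source = CharField(db_index=True)  # or rather a URLField !?
    collection = ManyToManyField('Collection')

    def __getitem__(self, xpath):
        return self.recent_version[xpath]

    @property
    def url(self) -> str:
        pass

    @property
    def recent_version(self) -> 'DocumentVersion':
        pass

    @property
    def versions(self) -> List['DocumentVersion']:
        pass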


class DocumentVersion(Model):
    document = ForeignKey('Document', on_delete=CASCADE)
    root_element = TreeForeignKey('Element', on_delete=CASCADE)
    mod_time = DateTimeField()

    def __getitem__(self, xpath) -> 'Element':
        return self.root_element[xpath]

    @property
    def content_hash(self) -> str:
        return self.root_element.content_hash

    @classmethod
    def from_file(cls, path) -> None:
        pass


class Element(MPTTModel):
    parent = TreeForeignKey('self', null=True, blank=True, related_name='children',
                            db_index=True, on_delete=CASCADE)
    tag = CharField(blank=True, db_index=True)  # empty for text nodes
    attributes = HStoreField()  # alternatively this might be an Attribute model that is referred to through m2m-relations
    content_hash = CharField(editable=False, unique=True)  # unique implies an index; primary_key=True ?!

    # Cache and NotCached stand for a cache wrapper that is yet to be designed
    xpath_cache = Cache(namespace='elements.xpath', ttl=settings.…)
    text_cache = Cache(namespace='elements.text', ttl=settings.…, max_size=settings.…)
    stripped_text_cache = Cache(namespace='elements.stripped_text', ttl=settings.…, max_size=settings.…)

    def __getitem__(self, xpath) -> 'Element':
        key = (self.id, xpath)
        try:
            target_id = self.xpath_cache.get(key)
        except NotCached:
            target_id = self._evaluate_xpath(xpath)
            self.xpath_cache.set(key, target_id)
        # managers are not accessible via instances, hence type(self)
        return type(self).objects.get(id=target_id)

    @property
    def text(self) -> str:
        try:
            text = self.text_cache.get(self.id)
        except NotCached:
            if self.tag:
                text = '<{tag} {attributes}>{contents}</{tag}>'.format(
                    tag=self.tag,
                    attributes=' '.join('{}="{}"'.format(k, v) for k, v in self.attributes.items()),
                    contents=''.join(child.text for child in self.get_children()),
                )
            else:
                # text nodes keep their content in the attributes
                text = self.attributes['text']
            self.text_cache.set(self.id, text)
        return text

    def __str__(self) -> str:
        return self.text

    @property
    def stripped_text(self) -> str:
        …
        if self.tag:
            return ''.join(child.stripped_text for child in self.get_children())
        else:
            return self.text

XML documents are not trees of elements alone: they contain arbitrary text in between element nodes, and this text must also be stored in the database representation. The current idea is to store such text as nodes with an empty tag. The logic of such instances differs from that of proper element nodes.
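The empty-tag idea can be illustrated without a database. The following sketch converts a parsed document into plain node tuples; `Node` and `to_nodes` are hypothetical names, and the standard library's ElementTree is used here only as a stand-in for lxml.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, NamedTuple


class Node(NamedTuple):
    tag: str                    # '' marks a text node
    attributes: Dict[str, str]  # text nodes keep their content under 'text'
    children: List['Node']


def to_nodes(element: ET.Element) -> Node:
    children = []
    if element.text:
        # text before the first child element
        children.append(Node('', {'text': element.text}, []))
    for child in element:
        children.append(to_nodes(child))
        if child.tail:
            # text following a child element belongs to the parent
            children.append(Node('', {'text': child.tail}, []))
    return Node(element.tag, dict(element.attrib), children)


root = to_nodes(ET.fromstring('<p>Hello <hi rend="bold">world</hi>!</p>'))
tags = [child.tag for child in root.children]
print(tags)  # ['', 'hi', '']
```

Every piece of text becomes a sibling in document order, so concatenating the children reproduces the original content.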

Storing all this in a relational database, and not just using lxml for querying, is motivated by the fact that indexes in a relational database are persistent and can be shared by multiple application instances. On the other hand, it is another rather complex layer in the stack; dropping the MPTT-modeled part and solely relying on lxml and a proper cache would reduce implementation and maintenance effort. This should also be considered with regard to the elegance that can be achieved in users' implementations of representational models (see below). Full-text search that correlates results spanning multiple elements with the containing element will be a hard nut to crack.
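To make the alternative more tangible, here is a rough sketch of the variant without a relational index: the document is parsed once and query results are cached under the pair of content hash and XPath. `CachedDocument` is a hypothetical name, the dictionary stands in for a shared cache such as Redis, and ElementTree with its limited XPath subset stands in for lxml.

```python
import hashlib
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple


class CachedDocument:
    """Hypothetical wrapper that caches XPath results by content hash."""

    # shared by all instances; a real setup would rather use Redis
    _results: Dict[Tuple[str, str], List[str]] = {}

    def __init__(self, xml_text: str) -> None:
        self.content_hash = hashlib.sha256(xml_text.encode('utf-8')).hexdigest()
        self._root = ET.fromstring(xml_text)
        self.evaluations = 0  # counts cache misses, for illustration only

    def query(self, xpath: str) -> List[str]:
        key = (self.content_hash, xpath)
        if key not in self._results:
            self.evaluations += 1
            # store the text content of each match, not the element objects
            self._results[key] = [
                ''.join(element.itertext()) for element in self._root.findall(xpath)
            ]
        return self._results[key]


doc = CachedDocument('<text><l n="1">first line</l><l n="2">second line</l></text>')
lines = doc.query('.//l')
lines_again = doc.query('.//l')
print(lines, doc.evaluations)
```

Because the key contains the content hash, a changed document never hits stale entries, and the cache can be rebuilt at any time from the source.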

Template filters and tags

Let's have a look at a simple template example:

{% load dta_book_filters book_tags %}

{% with book=document|dta_book %}
  <h1>{{ book.authors }}: {{ book.title }}</h1>
  <h2>{{ page_name }}/{{ book.pages_count }}</h2>
  <div>
    {% with page=book|book_page:page_name %}
      {% pageview page mode="facsimile" %}
    {% endwith %}
  </div>
{% endwith %}

This is even simpler:

{% load dta_book_filters book_tags %}

{% toc document|dta_book %}

Who's mentioned in a doc and has a d-nb.de record?

{% load dta_indexes %}

{% with referenced_people=document|dta_people:"d-nb.de" %}
  <ul>
  {% for person in referenced_people %}
    <li><a href="{{ person.dnb_url }}">{{ person.name }}</a></li>
  {% endfor %}
  </ul>
{% endwith %}

Filters consume documents, their nodes or a representational model and transform them into a representational model instance. They aim to provide access to an aspect-focused, simplified subset of contents. A module of filters should implement all features of a TEI schema. Representational model instances have lazily evaluated, cached properties. Tags consume representational model instances and other arguments to render predefined views. Django's template and page caching should do the job here, no extra effort required. Tags will mainly be implemented per project.
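A representational model with lazily evaluated, cached properties might look like the following sketch. The names `Book` and `book` and the chosen TEI elements are assumptions for illustration; ElementTree stands in for lxml, and `functools.cached_property` requires Python 3.8 or later. In a Django app the function would additionally be registered as a filter on a `django.template.Library` instance.

```python
import xml.etree.ElementTree as ET
from functools import cached_property
from typing import List

TEI = '{http://www.tei-c.org/ns/1.0}'  # the TEI namespace in ElementTree notation


class Book:
    """Hypothetical representational model focused on bibliographic aspects."""

    def __init__(self, root: ET.Element) -> None:
        self._root = root

    @cached_property
    def title(self) -> str:
        # evaluated on first access only, then stored on the instance
        return self._root.findtext('.//{0}titleStmt/{0}title'.format(TEI), default='')

    @cached_property
    def authors(self) -> List[str]:
        return [author.text or '' for author in self._root.iter(TEI + 'author')]

    @cached_property
    def pages_count(self) -> int:
        # every <pb/> (page beginning) marks the start of a page
        return sum(1 for _ in self._root.iter(TEI + 'pb'))


def book(root: ET.Element) -> Book:
    """The filter: turns a document root into a representational model."""
    return Book(root)


SAMPLE = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader><fileDesc><titleStmt>'
    '<title>Faust</title><author>Goethe</author>'
    '</titleStmt></fileDesc></teiHeader>'
    '<text><body><pb/><p>first</p><pb/><p>second</p></body></text></TEI>'
)
model = book(ET.fromstring(SAMPLE))
print(model.title, model.authors, model.pages_count)
```

A template then only touches the properties it needs, and nothing is computed for aspects a view does not render.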

Further considerations

Some noteworthy thoughts on details:

  • the import / parsing of documents should be executed in a background job in order to not block the application with lengthy startup times on large collections
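The principle of handing the import over to a background worker can be sketched with the standard library alone. All names below are hypothetical, and a thread pool is used only to keep the example self-contained; a Django deployment would more likely rely on a dedicated task queue.

```python
import hashlib
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Dict

executor = ThreadPoolExecutor(max_workers=2)
index: Dict[str, str] = {}  # hypothetical stand-in for the document index


def import_document(source: str, xml_text: str) -> str:
    """Index one document and return its content hash."""
    content_hash = hashlib.sha256(xml_text.encode('utf-8')).hexdigest()
    index[source] = content_hash
    return content_hash


def schedule_import(source: str, xml_text: str) -> Future:
    # returns immediately; the work happens on a worker thread
    return executor.submit(import_document, source, xml_text)


future = schedule_import('http://example.org/doc.xml', '<TEI/>')
result = future.result(timeout=5)  # a view would not wait like this
executor.shutdown()
print(index)
```

The application can start serving requests right away, while documents become available as soon as their import has finished.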

Outlook

The following features are way beyond the scope of this document and of an initial development stage, and they require user and development feedback. Yet they should be kept in mind:

  • a detailed plan on versatile versioning that is deduplicating
  • a REST API that allows modification of documents through a primitive set of directives

Roadmap

  • post this, gather feedback
  • release a final alpha, including:
    • an example project that lets users host and view the DTA Kernkorpus with
      • docker-compose up -d
      • xdg-open http://localhost:8000