This document describes the approach for a first iteration of building blocks that make it easy to develop XML-TEI related applications with the Django web framework.
Put simply:
- Java sucks.
- XQuery sucks.
- Python rules.
- Django is okay.
- XML-TEI is pretty neat.
The first two points refer to technologies that do not fulfil the needs of modern web development; still, they seem unavoidable at the moment when dealing with XML-TEI encoded corpora.
- provide an extensible Django application to handle XML-TEI in web apps for the scientific community
- provide a simple default setup that enables developers to offer a satisfying web-based view on XML-TEI documents
- create a community that contributes extensions that suit the needs of specific scientific branches
- URLs, hashes and XPaths are first class identifiers
- cache a lot, but keep it simple
- everything may break apart, but is reconstructable from the first class identifiers
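As a rough illustration of the last principle, the sketch below derives a content hash from a canonicalized element, so that a URL, an XPath and a hash together are enough to re-fetch and verify a fragment. The function names, the choice of SHA-256 and the use of C14N canonicalization are assumptions made for this example, not decisions of this proposal; the standard library's ElementTree stands in for a full XPath engine.

```python
import hashlib
import xml.etree.ElementTree as ET


def content_hash(element: ET.Element) -> str:
    """Hash the canonical (C14N 2.0) serialization of an element."""
    serialized = ET.tostring(element, encoding="unicode")
    canonical = ET.canonicalize(serialized)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def resolve(root: ET.Element, xpath: str) -> ET.Element:
    """Look a fragment up again by its XPath (ElementTree's limited subset)."""
    return root.find(xpath)


# Two serializations of the same content; only the attribute order differs.
doc_a = ET.fromstring('<text><p n="1" rend="i">Hello</p></text>')
doc_b = ET.fromstring('<text><p rend="i" n="1">Hello</p></text>')

hash_a = content_hash(resolve(doc_a, "./p"))
hash_b = content_hash(resolve(doc_b, "./p"))
```

Because canonicalization normalizes attribute order, both fragments end up with the same hash, which is what makes the hash usable as an identifier for content rather than for a particular serialization.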
Django is a powerful web framework that allows rapid and consistent design of
web applications within a healthy ecosystem.
For the tasks at hand it provides the major core functionality; the rest is
covered by third-party apps.
The framework is well designed, well maintained, well known and has proven to
be extensible in every corner.
Nonetheless, a further iteration should aim to provide framework-agnostic
libraries where appropriate. E.g. the indexes may alternatively be based on
the sqlalchemy_mptt package.
mptt (modified preorder tree traversal) is an approach to storing trees in a
relational database; this could be the foundation of an index. See the
discussion below.
https://django-mptt.github.io/django-mptt/
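To illustrate the mptt idea itself, independent of any package: every node gets a left and a right number during one depth-first walk, and all descendants of a node are then exactly the rows whose numbers lie inside the node's interval. The following is a minimal in-memory sketch; the `Node` class and the field names `lft` and `rght` are illustrative choices for this example.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    tag: str
    children: List["Node"] = field(default_factory=list)
    lft: Optional[int] = None
    rght: Optional[int] = None


def number_tree(node: Node, counter: int = 1) -> int:
    """Assign lft/rght in one depth-first walk; returns the next free number."""
    node.lft = counter
    counter += 1
    for child in node.children:
        counter = number_tree(child, counter)
    node.rght = counter
    return counter + 1


def flatten(node: Node) -> Iterator[Node]:
    """Yield all nodes in document order, like rows of a table."""
    yield node
    for child in node.children:
        yield from flatten(child)


def descendants(rows: List[Node], node: Node) -> List[str]:
    """A plain range filter; in SQL this is a single WHERE clause."""
    return [row.tag for row in rows if node.lft < row.lft and row.rght < node.rght]


text = Node("text", [Node("body", [Node("p"), Node("p")])])
tei = Node("TEI", [Node("teiHeader"), text])
number_tree(tei)
rows = list(flatten(tei))
below_text = descendants(rows, text)
```

The point of the numbering is that subtree queries need no recursion once the numbers exist; the price is that inserting a node requires renumbering parts of the tree.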
Redis is a highly scalable, persistent in-memory key-value store that offers some invalidation primitives (such as key expiry) and is thus the current workhorse when it comes to caching.
Some thoughts on caching:
- do cache as much as possible, but do not endlessly duplicate contents
- Redis does not deduplicate identical values stored under different keys by itself; keying cached content by its hash would avoid holding identical content in memory more than once
- make it simple for users to develop extensions without worrying about caching, but let users fine-tune cache-parameters
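Deduplication of cached content can be made explicit by addressing values through their hash. The sketch below shows the idea with two plain dicts standing in for two Redis key namespaces; the class and method names are hypothetical, and expiry and eviction are deliberately left out.

```python
import hashlib
from typing import Dict


class ContentAddressedCache:
    """Stores each distinct value once; logical keys only hold a digest."""

    def __init__(self) -> None:
        self._digests: Dict[str, str] = {}  # logical key -> digest
        self._blobs: Dict[str, str] = {}    # digest -> content

    def set(self, key: str, value: str) -> None:
        digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
        self._digests[key] = digest
        # setdefault keeps the existing blob if the content is already known
        self._blobs.setdefault(digest, value)

    def get(self, key: str) -> str:
        return self._blobs[self._digests[key]]

    def stored_blobs(self) -> int:
        return len(self._blobs)


cache = ContentAddressedCache()
cache.set("doc1:/TEI/text/body/p[1]", "<p>Lorem ipsum</p>")
cache.set("doc2:/TEI/text/body/p[7]", "<p>Lorem ipsum</p>")  # same content, other key
```

Both keys resolve to the same content, yet only one copy is held. An extension author would only ever see `get` and `set`, which fits the goal of hiding caching details by default.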
The following building blocks need to be designed and implemented separately, though there may be interdependencies.
- ORM models that index the contents of an XML document for quick querying
- template filters that turn XML into aspect-scoped representational models
- template tags that transform such objects into a view
- a REST API that provides clever ways to obtain parts of a collection or a document in its raw or a transformed representation
- default views and templates to render a paginated and a chapter-scoped representation of a document, including possible digitized representations of a source; one view for importing documents from a URL or a client's local filesystem
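We could also have syntactic sugar for XPaths, modelled on pathlib.Path:
    # XPath is a hypothetical wrapper class; plain strings do not support / and //
    path = XPath('.') // 'foo' / a_string_symbol
    path = (XPath('.') // 'foo' / a_string_symbol)[1]  # :-( indexing needs parentheses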
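Such sugar needs a wrapper object on the left-hand side, since Python strings implement neither `/` nor `//`. A minimal sketch of a hypothetical `XPath` class (the name and its behaviour are assumptions for illustration) that builds the expression string through operator overloading:

```python
class XPath:
    """Builds an XPath expression string from Python operators."""

    def __init__(self, expression: str = ".") -> None:
        self.expression = expression

    def __truediv__(self, step: str) -> "XPath":
        # XPath('.') / 'foo'  ->  ./foo  (child step)
        return XPath(f"{self.expression}/{step}")

    def __floordiv__(self, step: str) -> "XPath":
        # XPath('.') // 'foo'  ->  .//foo  (descendant step)
        return XPath(f"{self.expression}//{step}")

    def __getitem__(self, predicate) -> "XPath":
        # Appends a predicate; note that in XPath it binds to the last step only.
        return XPath(f"{self.expression}[{predicate}]")

    def __str__(self) -> str:
        return self.expression


a_string_symbol = "bar"
path = XPath(".") // "foo" / a_string_symbol
indexed = (XPath(".") // "foo" / a_string_symbol)[1]
```

`/` and `//` share the same precedence and associate to the left, so chained steps read in document order. Subscription binds tighter than both, which is why the indexed variant cannot do without the parentheses.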
The only supported database may be Postgres, to leverage its superior features like full-text search, arrays and hstore. On the other hand, the mechanics relying on these may all target Redis.
Here are some illustrative stubs:
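    from typing import List

    from django.conf import settings
    from django.contrib.postgres.fields import HStoreField
    from django.db.models import (CASCADE, CharField, DateTimeField, ForeignKey,
                                  ManyToManyField, Model)
    from mptt.models import MPTTModel, TreeForeignKey

    class Document(Model):
        # max_length may be omitted on PostgreSQL (Django 4.2+)
        source = CharField(db_index=True)  # or rather a URLField !?
        collection = ManyToManyField('Collection')

        def __getitem__(self, xpath):
            return self.recent_version[xpath]

        @property
        def url(self) -> str:
            pass

        @property
        def recent_version(self) -> 'DocumentVersion':
            pass

        @property
        def versions(self) -> List['DocumentVersion']:
            pass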
    class DocumentVersion(Model):
        # on_delete=CASCADE is an assumption; the deletion policy is undecided
        document = ForeignKey('Document', on_delete=CASCADE)
        root_element = TreeForeignKey('Element', on_delete=CASCADE)
        mod_time = DateTimeField()

        def __getitem__(self, xpath) -> 'Element':
            return self.root_element[xpath]

        @property
        def content_hash(self) -> str:
            return self.root_element.content_hash

        @classmethod
        def from_file(cls, path) -> None:
            pass
    class Element(MPTTModel):
        parent = TreeForeignKey('self', null=True, blank=True, on_delete=CASCADE,
                                related_name='children', db_index=True)
        tag = CharField(blank=True, db_index=True)
        # alternatively this might be an Attribute model referred to through m2m relations
        attributes = HStoreField()
        content_hash = CharField(editable=False, unique=True)  # primary_key=True ?!

        # Cache and NotCached belong to the caching layer that is yet to be designed
        xpath_cache = Cache(namespace='elements.xpath', ttl=settings.…)
        text_cache = Cache(namespace='elements.text', ttl=settings.…, max_size=settings.…)
        stripped_text_cache = Cache(namespace='elements.stripped_text', ttl=settings.…, max_size=settings.…)

        def __getitem__(self, xpath) -> 'Element':
            key = (self.id, xpath)
            try:
                target_id = self.xpath_cache.get(key)
            except NotCached:
                target_id = self._evaluate_xpath(xpath)
                self.xpath_cache.set(key, target_id)
            # managers are not accessible via instances, hence the class lookup
            return type(self).objects.get(id=target_id)

        @property
        def text(self) -> str:
            try:
                text = self.text_cache.get(self.id)
            except NotCached:
                if self.tag:
                    text = '<{tag} {attributes}>{contents}</{tag}>'.format(
                        tag=self.tag,
                        attributes=' '.join('{}="{}"'.format(k, v) for k, v in self.attributes.items()),
                        contents=''.join(child.text for child in self.get_children())
                    )
                else:
                    # text nodes have an empty tag and keep their content as an attribute
                    text = self.attributes['text']
                self.text_cache.set(self.id, text)
            return text

        def __str__(self) -> str:
            return self.text

        @property
        def stripped_text(self) -> str:
            …
            if self.tag:
                return ''.join(child.stripped_text for child in self.get_children())
            else:
                return self.text
XML documents aren't exactly trees of elements, as they contain arbitrary extra content (text) in between element nodes, and this content must also be stored in the database representation. The current idea is to store such text as nodes with an empty tag. The logic of these instances differs from that of proper element nodes.
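How mixed content could be flattened into such rows is sketched below with the standard library's ElementTree, where text before the first child lives in `.text` and text after a child lives in that child's `.tail`. The row layout `(id, parent_id, tag, attributes)` and the function name are assumptions for this example; the `'text'` attribute key follows the stubs above.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

Row = Tuple[int, Optional[int], str, dict]  # (id, parent_id, tag, attributes)


def to_rows(element: ET.Element, rows: List[Row], parent_id: Optional[int] = None) -> None:
    """Flatten an element into rows; text becomes a row with an empty tag."""
    node_id = len(rows)
    rows.append((node_id, parent_id, element.tag, dict(element.attrib)))
    if element.text:
        rows.append((len(rows), node_id, "", {"text": element.text}))
    for child in element:
        to_rows(child, rows, node_id)
        if child.tail:
            # text following a child belongs to the parent element
            rows.append((len(rows), node_id, "", {"text": child.tail}))


rows: List[Row] = []
to_rows(ET.fromstring('<p n="1">Hello <hi>brave</hi> world</p>'), rows)
tags = [row[2] for row in rows]
full_text = "".join(row[3]["text"] for row in rows if row[2] == "")
```

Since rows are appended in document order, concatenating the empty-tag rows restores the running text, and the parent ids keep every text fragment attached to the element it appeared in.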
Storing all this in a relational database and not just using lxml for querying
is motivated by the fact that such indexes in a relational database are
persistent and can be shared by multiple application instances.
On the other hand, it's another rather complex layer in the stack; dropping
the mptt-modelled part and relying solely on lxml and a proper cache would
reduce implementation and maintenance effort.
This should also be considered with regard to the elegance that can be achieved
in users' implementations of representational models (see below).
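The lighter alternative, parsing on demand and caching query results, might look roughly like the following. This is a sketch under assumptions: the standard library's ElementTree stands in for lxml (which offers full XPath 1.0), a module-level dict stands in for a shared document store, and `functools.lru_cache` stands in for a proper cache. All function names are hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET
from functools import lru_cache
from typing import Dict, Tuple

_sources: Dict[str, str] = {}  # content hash -> raw XML


def register(xml_source: str) -> str:
    """Store a document under its content hash and return that hash."""
    digest = hashlib.sha256(xml_source.encode("utf-8")).hexdigest()
    _sources[digest] = xml_source
    return digest


@lru_cache(maxsize=32)
def parsed(digest: str) -> ET.Element:
    """Parse a document at most once per cache lifetime."""
    return ET.fromstring(_sources[digest])


@lru_cache(maxsize=1024)
def query_text(digest: str, xpath: str) -> Tuple[str, ...]:
    """Text content of all matches; the result is cached per (hash, xpath)."""
    return tuple("".join(el.itertext()) for el in parsed(digest).findall(xpath))


digest = register("<body><p>one</p><p>two</p></body>")
first = query_text(digest, "./p")
second = query_text(digest, "./p")  # served from the cache
```

Because the key contains the content hash, a cached result can never go stale: changed content yields a new hash and therefore a new cache entry.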
Full-text search that correlates results spanning multiple elements with
the containing element will be a hard nut to crack.
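To make the problem concrete, here is a naive in-memory sketch: it records which slice of the concatenated text each element covers and then picks the innermost element that fully contains a match. It uses the standard library's ElementTree and plain substring search; the function names are hypothetical, and a database-backed implementation would need a different strategy.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

Span = Tuple[int, int, ET.Element]  # (start offset, end offset, element)


def index_spans(element: ET.Element, spans: List[Span], offset: int = 0) -> int:
    """Record which slice of the concatenated text each element covers."""
    start = offset
    offset += len(element.text or "")
    for child in element:
        offset = index_spans(child, spans, offset)
        offset += len(child.tail or "")
    spans.append((start, offset, element))
    return offset


def smallest_container(root: ET.Element, needle: str) -> Optional[ET.Element]:
    """Return the innermost element whose text fully contains the first match."""
    spans: List[Span] = []
    index_spans(root, spans)
    position = "".join(root.itertext()).find(needle)
    if position < 0:
        return None
    end = position + len(needle)
    covering = [span for span in spans if span[0] <= position and end <= span[1]]
    # Spans are collected children-first, so ties resolve to the inner element.
    return min(covering, key=lambda span: span[1] - span[0])[2]


root = ET.fromstring("<body><p>The <hi>quick</hi> brown fox</p><p>jumps</p></body>")
within_hi = smallest_container(root, "quick")
across_hi = smallest_container(root, "quick brown")
across_paragraphs = smallest_container(root, "foxjumps")
```

A match inside one element resolves to that element, while a match that crosses an element boundary resolves to the nearest common ancestor.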
Let's have a look at a simple template example:
    {% load dta_book_filters book_tags %}
    {% with book=document|dta_book %}
    <h1>{{ book.authors }}: {{ book.title }}</h1>
    <h2>{{ page_name }}/{{ book.pages_count }}</h2>
    <div>
        {% with page=book|book_page:page_name %}
        {% pageview page mode="facsimile" %}
        {% endwith %}
    </div>
    {% endwith %}
This is even simpler:
    {% load dta_book_filters book_tags %}
    {% toc document|dta_book %}
Who's mentioned in a doc and has a d-nb.de record?
    {% load dta_indexes %}
    {% with referenced_people=document|dta_people:"d-nb.de" %}
    <ul>
        {% for person in referenced_people %}
        <li><a href="{{ person.dnb_url }}">{{ person.name }}</a></li>
        {% endfor %}
    </ul>
    {% endwith %}
Filters consume documents, their nodes or a representational model and transform them into a representational model instance. They aim to provide access to an aspect-focused, simplified subset of the contents. A module of filters should implement all features of a TEI schema. Representational model instances have lazily evaluated, cached properties.
Tags consume representational model instances and other arguments to render predefined views. Django's template and page caching should do the job here; no extra effort is required. Tags will mainly be implemented per project.
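A representational model with lazily evaluated, cached properties could be sketched as follows. The `Book` class, its XPath expressions and the sample document are assumptions for illustration; only the names `dta_book`, `title`, `authors` and `pages_count` are taken from the template examples above. The standard library's ElementTree stands in for the document model.

```python
import xml.etree.ElementTree as ET
from functools import cached_property
from typing import List

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


class Book:
    """Aspect-scoped view on a TEI document; each property is computed once."""

    def __init__(self, root: ET.Element) -> None:
        self.root = root

    @cached_property
    def title(self) -> str:
        return self.root.findtext(".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS)

    @cached_property
    def authors(self) -> List[str]:
        return [a.text or "" for a in self.root.findall(".//tei:titleStmt/tei:author", TEI_NS)]

    @cached_property
    def pages_count(self) -> int:
        # counts page break milestones
        return len(self.root.findall(".//tei:pb", TEI_NS))


def dta_book(document_root: ET.Element) -> Book:
    """The filter itself: document in, representational model out."""
    return Book(document_root)


SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt>
    <title>Faust</title><author>Goethe</author>
  </titleStmt></fileDesc></teiHeader>
  <text><body><pb n="1"/><p>Habe nun, ach!</p><pb n="2"/></body></text>
</TEI>"""
book = dta_book(ET.fromstring(SAMPLE))
```

In a Django project, `dta_book` would additionally be registered as a template filter through a `django.template.Library` instance; that step is omitted here to keep the sketch free of framework dependencies.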
Some noteworthy thoughts on details:
- the import / parsing of documents should be executed in a background job in order to not block the application with lengthy startup times on large collections
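The principle of handing the import off to a background worker can be shown with the standard library alone. This is only a sketch: a real deployment would more likely use a dedicated task queue, and the function names as well as the returned element count are assumptions for this example.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import Future, ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=2)


def import_document(xml_source: str) -> int:
    """Parse a document and return the number of elements that were indexed."""
    root = ET.fromstring(xml_source)
    return sum(1 for _ in root.iter())


def schedule_import(xml_source: str) -> Future:
    """Return immediately; the parsing happens on a worker thread."""
    return executor.submit(import_document, xml_source)


future = schedule_import("<TEI><teiHeader/><text/></TEI>")
```

The caller gets a `Future` back right away and can poll it or collect the result later with `future.result()`, so a request handler is never blocked by a lengthy import.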
The following features are way beyond the scope of this document and of an initial development stage, and they require user and developer feedback; yet they should be kept in mind:
- a detailed plan on versatile versioning that is deduplicating
- a REST API that allows modification of documents through a primitive set of directives
- post this, gather feedback
- …
- release a final alpha, including:
  - an example project that lets users host and view the DTA Kernkorpus with
      docker-compose up -d
      xdg-open http://localhost:8000