@abelsromero
Created August 17, 2015 11:22

Document meta-model

We deal with documents every day, sometimes in ways we are not aware of. Sharing ideas on a persistent medium was one of the first things that made humankind what it is. From the Altamira cave paintings in Spain to modern Wikipedia, documents are about sharing knowledge.

So it should be safe to assume that we are familiar with them and know how to deal with them. I mean, a document is just a few words and images in a linear structure. Once it’s finished, just press btn:[Save] and btn:[Send].

Well, this is not true, and this article talks about the complexity of dealing with groups of documents, especially in digital systems. In this article you will find an example model to help analyse document repositories.

What is a document

Pretty much everyone has an idea of what a document is. Depending on their daily activities, different people may describe them as:

  1. A piece of paper with some text.

  2. That filled form I had to provide to get my taxes back.

  3. An article on Wikipedia, a blog post... does a tweet count as a document, or is it too short?

  4. My college degree.

  5. Simple, a Word file.

Pretty much all of them hit the mark, and we can all recognize a document when it is put in front of us. But a document is something far more complex than that. To begin with, a document is not only the content itself: it also includes the metadata associated with it, for instance the name of the file on the filesystem or the URL pointing to it.

Figure 1. Metadata is also part of a document

With just a little thought, one may find several ways to classify documents depending on many characteristics:

  1. Is it self-contained or links other elements?

  2. Does the document have more than one version?

  3. In what format is it stored (MS Word, plain text, PNG, etc.)? Can it be transformed into other formats? [1]

  4. What kind of metadata is included?

  5. Is it structured, like XML, or is it free format? [2] And so on.

And that only covers the features involved in the construction, presentation and basic storage of a document. If you go into other aspects provided by content management tools, like security, preservation or language support, this becomes even more complex. For the sake of this analysis we can define the following model:

For simplicity, let’s assume that the document entity is an aggregator of the different versions of a document. This means that the INCLUDES relationship is just a dependency relationship: we are not discussing here the different types of inclusion, or whether a document may include some specific version or even a rendition (we will hint at that later).
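As a rough sketch, the model can also be read as a Cypher pattern. The labels and relationship types below are taken from the data set that follows; the query itself is only an illustration of how the pieces connect, and it assumes that the data set has already been loaded.

```cypher
// Walk the model for every document: its versions, plus the content,
// metadata and outputs attached to each version.
// INCLUDES (Document to Document) is left out here on purpose.
MATCH (d:Document)-[:HAS]->(v:Version)
OPTIONAL MATCH (v)-[:IS_MADE_OF]->(c:Content)
OPTIONAL MATCH (v)-[:LINKS]->(m:Metadata)
OPTIONAL MATCH (v)-[:GENERATES]->(o:Output)
RETURN d.name AS document, v.name AS version,
       c.name AS content, m.title AS title,
       collect(o.name) AS outputs
ORDER BY document, version
```

The OPTIONAL MATCH clauses keep versions in the result even when they have no content or generate no output.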

Let’s play with the data

Data set

To test the model, here is some data. To save you from reading through it, the data set represents the following scenario.

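The scenario revolves around 3 documents (the data set also creates three supporting ones: 'User manual', 'Code examples' and 'Offline User manual'):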

  • A book: it has 2 versions (Draft and Final), each one with its own metadata and original content. However, only the 2nd version generates an output, in PDF format.

  • A possible user manual inspired by GitHub READMEs. This manual has 3 versions, and in the third one a Contributor’s guide was included (see the earlier note on what including means in this model). Also, until the 3rd version the manual was only published as HTML; from that version on, PDF was added to the pipeline.

  • Contributor’s guide included by the 2nd document. This document is recent, so in the data set it has a single version, with HTML output.

Initializations
CREATE (html:Output {name:'Html'})
CREATE (pdf:Output {name:'PDF'})

//
// Document 1 - Book - Anathem by Neal Stephenson
//
CREATE (doc1:Document {id:'1', name: 'Anathem (Book)'})

// HAS 2 versions
CREATE (version11:Version {name:'ver.1'})
CREATE (version12:Version {name:'ver.2'})
CREATE (doc1)-[:HAS]->(version11)
CREATE (doc1)-[:HAS]->(version12)

// Each version has its own source content, but only version 2 GENERATES 1 PDF Output
CREATE (content11:Content {name:'anathem_draft.doc', mimetype:'application/msword'})
CREATE (content12:Content {name:'anathem_final.doc', mimetype:'application/msword'})
CREATE (version11)-[:IS_MADE_OF]->(content11)
CREATE (version12)-[:IS_MADE_OF]->(content12)
CREATE (version12)-[:GENERATES]->(pdf)

// Each version has its own associated metadata (LINKS)
CREATE (metadata11:Metadata {title:'Anathem (DRAFT)', author:'Neal Stephenson', created:'2006-01-01'})
CREATE (metadata12:Metadata {title:'Anathem', author:'Neal Stephenson', created:'2008-09-09'})
CREATE (version11)-[:LINKS]->(metadata11)
CREATE (version12)-[:LINKS]->(metadata12)

//
// Document 2 - GitHub project's README
//
CREATE (doc2:Document {id:'2', name: 'GitHub Project\'s README'})

// HAS 3 versions
CREATE (version21:Version {name:'ver.1'})
CREATE (version22:Version {name:'ver.2'})
CREATE (version23:Version {name:'ver.3'})
CREATE (doc2)-[:HAS]->(version21)
CREATE (doc2)-[:HAS]->(version22)
CREATE (doc2)-[:HAS]->(version23)

// Each version has its own source content; we store the output of each version so people can check previous versions
CREATE (content21:Content {name:'README.adoc', mimetype:'text/x-asciidoc'})
CREATE (content22:Content {name:'README.adoc', mimetype:'text/x-asciidoc'})
CREATE (content23:Content {name:'README.adoc', mimetype:'text/x-asciidoc'})
CREATE (version21)-[:IS_MADE_OF]->(content21)
CREATE (version22)-[:IS_MADE_OF]->(content22)
CREATE (version23)-[:IS_MADE_OF]->(content23)
CREATE (version21)-[:GENERATES]->(html)
CREATE (version22)-[:GENERATES]->(html)
CREATE (version23)-[:GENERATES]->(html)
// In version 3 we added PDF output
CREATE (version23)-[:GENERATES]->(pdf)

// Each version has its own associated metadata (LINKS)
CREATE (metadata21:Metadata {title:'Reference documentation', author:'userX', created:'2013-12-01'})
CREATE (metadata22:Metadata {title:'Reference documentation', author:'userY', created:'2015-01-10'})
CREATE (metadata23:Metadata {title:'Reference documentation & Contributor\'s guide', author:'userZ', created:'2015-07-06'})
CREATE (version21)-[:LINKS]->(metadata21)
CREATE (version22)-[:LINKS]->(metadata22)
CREATE (version23)-[:LINKS]->(metadata23)

// Document 3 is used by 2
CREATE (doc3:Document {id:'3', name: 'Contributors\' guide'})
CREATE (version31:Version {name:'ver.1'})
CREATE (doc3)-[:HAS]->(version31)
// This document has only 1 version, which generates HTML output
CREATE (content31:Content {name:'contributors.adoc', mimetype:'text/x-asciidoc'})
CREATE (version31)-[:IS_MADE_OF]->(content31)
CREATE (metadata31:Metadata {title:'Contributors\' guide', author:'userZ', created:'2015-07-04'})
CREATE (version31)-[:LINKS]->(metadata31)
CREATE (version31)-[:GENERATES]->(html)

// Document 4 is used by 2 and 6
CREATE (doc4:Document {id:'4', name: 'User manual'})
CREATE (version41:Version {name:'ver.1'})
CREATE (version42:Version {name:'ver.2'})
CREATE (doc4)-[:HAS]->(version41)
CREATE (doc4)-[:HAS]->(version42)
CREATE (content41:Content {name:'manual.adoc', mimetype:'text/x-asciidoc'})
CREATE (content42:Content {name:'manual.adoc', mimetype:'text/x-asciidoc'})
CREATE (version41)-[:IS_MADE_OF]->(content41)
CREATE (version42)-[:IS_MADE_OF]->(content42)
// Variable names must match the ones used in the LINKS statements below
CREATE (metadata41:Metadata {title:'User manual', author:'userZ', created:'2015-06-01'})
CREATE (metadata42:Metadata {title:'User manual', author:'userZ', created:'2015-07-04'})
CREATE (version41)-[:LINKS]->(metadata41)
CREATE (version42)-[:LINKS]->(metadata42)
CREATE (version42)-[:GENERATES]->(html)

// Document 5 is used by 4
CREATE (doc5:Document {id:'5', name: 'Code examples'})
CREATE (version51:Version {name:'ver.1'})
CREATE (doc5)-[:HAS]->(version51)
CREATE (content51:Content {name:'examples.adoc', mimetype:'text/x-asciidoc'})
CREATE (version51)-[:IS_MADE_OF]->(content51)
CREATE (metadata51:Metadata {title:'Code examples', author:'userY', created:'2015-06-20'})
CREATE (version51)-[:LINKS]->(metadata51)
CREATE (version51)-[:GENERATES]->(html)

// Document 6 does not have content, only aggregates 4 in a PDF
CREATE (doc6:Document {id:'6', name: 'Offline User manual'})
CREATE (version61:Version {name:'ver.1'})
CREATE (doc6)-[:HAS]->(version61)
CREATE (metadata61:Metadata {title:'Offline User manual', author:'travisUser', created:'2015-08-03'})
CREATE (version61)-[:LINKS]->(metadata61)
CREATE (version61)-[:GENERATES]->(pdf)
CREATE (version61)-[:GENERATES]->(html)

// Finally, the INCLUDES relationships: 2 includes 3 and 4, 4 includes 5, 6 includes 4
CREATE (doc2)-[:INCLUDES]->(doc3)
CREATE (doc2)-[:INCLUDES]->(doc4)

CREATE (doc4)-[:INCLUDES]->(doc5)

CREATE (doc6)-[:INCLUDES]->(doc4)

Here is the resulting graph.
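If you prefer numbers to pictures, a quick sanity check after loading the data is to count the nodes per label. This is only a sketch, and it assumes the statements above were run against an otherwise empty database.

```cypher
// Count how many nodes carry each label.
// Nodes without any label do not appear in the result.
MATCH (n)
UNWIND labels(n) AS label
RETURN label, count(*) AS total
ORDER BY label
```

With the data set above, the Document row should read 6 and the Output row 2.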

You’ll also find a friendlier view of the model here.

Some use cases

Imagine you are responsible for maintaining a repository like the one presented. You’ll certainly want to obtain some statistics about the documents in order to optimize some tasks.

Let’s begin with some simple examples.

  • How about finding out how many outputs of each kind we have to build?

MATCH (v)-[:GENERATES]->(n) RETURN n.name,count(n)

This allows you to plan your generation pipelines: to see what can be run in parallel and what might require more resources.
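A slightly more detailed variant, offered as a sketch rather than as the definitive query, also lists which document versions sit behind each output. It relies only on labels and properties that the data set defines.

```cypher
// For each output format, count the versions that generate it
// and list them as "<document name> <version name>".
MATCH (d:Document)-[:HAS]->(v:Version)-[:GENERATES]->(o:Output)
RETURN o.name AS output,
       count(v) AS builds,
       collect(d.name + ' ' + v.name) AS versions
ORDER BY builds DESC
```

On the sample data this reports 7 builds for Html and 3 for PDF, which already hints at where most of the pipeline time goes.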

  • How about seeing dependencies?

This will help you sort your documents and detect possible bottlenecks. It is useful not only for output generation: it also gives you a graphical representation of highly used documents, which are probably revised by many people.

MATCH (doc:Document)-[:INCLUDES]->(d) RETURN doc, d

If you have a document included by, let’s say, 50 other documents, it would be a good idea to implement some caching to avoid the cost of retrieving it every time.

Of course, this depends on what kind of system you use to store them.
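To find those heavily included documents without inspecting the graph by eye, one option is to count the incoming INCLUDES relationships of each document. The column names `document` and `includedBy` are just illustrative aliases.

```cypher
// Rank documents by the number of documents that directly include them.
MATCH (parent:Document)-[:INCLUDES]->(d:Document)
RETURN d.name AS document, count(parent) AS includedBy
ORDER BY includedBy DESC
```

In the sample data 'User manual' comes out on top, as it is included by 2 documents; the other included documents have a single parent each.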

Tip

Even when using a VCS, it’s not a good idea to have documents that can be edited by many people. This may cause a large number of edit conflicts.

Consider using each document to describe a concept and not build big documents with many sections.

Now for something more complex,
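One candidate, sketched here as an assumption about what a more complex question could look like: if 'Code examples' (id '5') changes, which documents are affected, directly or through a chain of inclusions? A variable-length path follows INCLUDES across any number of hops.

```cypher
// Find every document that includes document 5, directly or transitively,
// together with the shortest inclusion distance to it.
MATCH path = (affected:Document)-[:INCLUDES*1..]->(changed:Document {id: '5'})
RETURN affected.name AS affected, min(length(path)) AS distance
ORDER BY distance, affected
```

With the sample data this returns 'User manual' at distance 1, followed by the GitHub project's README and 'Offline User manual' at distance 2, so a change in the examples means rebuilding all three.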

Final notes

To end this, just a few words on why using a graph database offers advantages for this kind of analysis, especially when dealing with include relationships.

The example model is certainly not too complex, but keep in mind that it is a simplification. It could be argued that, with little effort, it could be built on top of a relational database.

However, a real model becomes much more complex and, above all, much more interconnected. A real model should take into consideration that:

  1. Metadata is dynamic, with different sets of documents having different sets of properties.

  2. The kinds of dependency relationships amongst documents can vary a lot. Inclusion is one example, but how about links? Quotations? References? Each of them requires different treatment, and this is where a graph database shines.

  3. The number of dependencies can become extremely high. Think of Wikipedia as an example: modelling that number of relationships would require some extra work to deal with them efficiently in a conventional database.
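To illustrate the second point, here is a minimal sketch of adding another kind of dependency. The REFERENCES relationship type, and the choice of which documents it connects, are hypothetical and not part of the data set above.

```cypher
// Hypothetical: 'Code examples' (5) references the 'Contributors' guide' (3).
// REFERENCES is an illustrative relationship type, not part of the original model.
MATCH (examples:Document {id: '5'}), (guide:Document {id: '3'})
CREATE (examples)-[:REFERENCES]->(guide)
WITH count(*) AS created
// Both kinds of dependency can then be queried together or apart.
MATCH (a:Document)-[r:INCLUDES|REFERENCES]->(b:Document)
RETURN type(r) AS dependency, count(r) AS total
```

No schema change is needed to introduce the new relationship type, and existing INCLUDES queries keep working untouched.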

Also the felsibility

A more realistic model

ASSOCIATE


1. This transformation process is usually also referred to as rendition.
2. Obviously all document formats have a formal structure, but this question is asked from the point of view of the final presentation.