BLAHmuc2016 submission

RichAnnotator - Annotation Tool For Rich Web Documents

Authors: Nikola Milosevic, Goran Nenadic


Most current annotation schemas and tools support annotating entities and relationships between words or phrases in raw textual documents. However, most documents on the web are rich documents, usually presented in HTML or a specialized XML format and then interpreted by web browsers. Vast amounts of information are presented in tables, figures and other rich text elements, and this information is lost when only the text is annotated. For example, in clinical trial publications, information about experimental settings, results and adverse events is usually presented in tables. Many such publications are accessible via PubMed Central in rich XML/HTML format, but gathering and semantically enriching information from elements such as tables is not currently possible.


We aim to develop a web-based annotation tool that will allow users to annotate rich web documents and save the annotations both internally (in a database) and externally as linked data, with support for RDF and JSON-LD export. As the base annotation scheme we will use the PubAnnotation scheme, since it provides the necessary annotation features and attributes (denotations, relations, modifications).


We are developing a web dashboard that detects the position of the selected text on the page and generates an XPath for the selection. This can be achieved with JavaScript, as was done in XPathGenerator. The user can then choose the type of annotation and the attributes to apply. We will use the Apache Jena library for linked data export.
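The XPath generation step can also be illustrated outside the browser. The following is a minimal sketch, written in Java on the standard org.w3c.dom API rather than in browser JavaScript, of one possible way to derive a positional XPath for an element: walk up to the root and count same-named preceding siblings at each level. The class name XPathBuilder, its methods and the sample table are hypothetical and are not taken from XPathGenerator or from the planned tool.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical helper: builds a positional XPath for a DOM element. */
class XPathBuilder {

    /** Returns an absolute path such as /table[1]/tr[2]/td[2] for the given element. */
    static String xpathOf(Node node) {
        StringBuilder path = new StringBuilder();
        // Walk from the element up to (but not including) the document node.
        for (Node n = node; n != null && n.getNodeType() == Node.ELEMENT_NODE; n = n.getParentNode()) {
            int index = 1; // XPath positions are 1-based
            for (Node s = n.getPreviousSibling(); s != null; s = s.getPreviousSibling()) {
                if (s.getNodeType() == Node.ELEMENT_NODE && s.getNodeName().equals(n.getNodeName())) {
                    index++;
                }
            }
            path.insert(0, "/" + n.getNodeName() + "[" + index + "]");
        }
        return path.toString();
    }

    /** Parses an XML string into a DOM document (checked exceptions are wrapped). */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical 2x2 table, written without whitespace between elements.
        Document doc = parse("<table><tr><td>a</td><td>b</td></tr>"
                + "<tr><td>c</td><td>d</td></tr></table>");
        // Stand-in for the element a user selected: last cell of the last row.
        Node selected = doc.getDocumentElement().getLastChild().getLastChild();
        System.out.println(xpathOf(selected)); // /table[1]/tr[2]/td[2]
    }
}
```

The generated path always carries an explicit position, so /table[1]/tr[2]/td[2] selects the same cell as the shorter /table/tr[2]/td[2] when the table is the root element.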

Practical example

An example of a short text annotated using PubAnnotation looks like this:

   {
      "text": "IRF-4 expression in CML may be induced by IFN-α therapy",
      "denotations": [
         {"id": "T1", "span": {"begin": 0, "end": 5}, "obj": "Protein"},
         {"id": "T2", "span": {"begin": 42, "end": 47}, "obj": "Protein"},
         {"id": "E1", "span": {"begin": 6, "end": 16}, "obj": "Expression"},
         {"id": "E2", "span": {"begin": 31, "end": 38}, "obj": "Regulation"}
      ],
      "relations": [
         {"id": "R1", "subj": "T1", "pred": "themeOf", "obj": "E1"},
         {"id": "R2", "subj": "E1", "pred": "themeOf", "obj": "E2"},
         {"id": "R3", "subj": "T2", "pred": "causeOf", "obj": "E2"}
      ],
      "modifications": [
         {"id": "M1", "pred": "Speculation", "obj": "E2"}
      ]
   }
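To make the meaning of span concrete, the sketch below recovers the text covered by each denotation in the example above. It assumes that begin and end are character offsets with an exclusive end, which is what the example values imply (0 to 5 covers the five characters of "IRF-4"). The class SpanCheck and its method are hypothetical names introduced only for illustration.

```java
/** Hypothetical helper: recovers the text covered by a PubAnnotation-style span. */
class SpanCheck {

    /** Returns the characters in [begin, end), assuming an exclusive end offset. */
    static String covered(String text, int begin, int end) {
        if (begin < 0 || end > text.length() || begin > end) {
            throw new IllegalArgumentException("span out of range: " + begin + "-" + end);
        }
        return text.substring(begin, end);
    }

    public static void main(String[] args) {
        // \u03b1 is the Greek letter alpha, escaped to avoid source-encoding issues.
        String text = "IRF-4 expression in CML may be induced by IFN-\u03b1 therapy";
        System.out.println("T1: " + covered(text, 0, 5));   // the first Protein
        System.out.println("T2: " + covered(text, 42, 47)); // the second Protein
        System.out.println("E1: " + covered(text, 6, 16));  // the Expression trigger
        System.out.println("E2: " + covered(text, 31, 38)); // the Regulation trigger
    }
}
```

Under that assumption the four spans cover "IRF-4", "IFN-α", "expression" and "induced" respectively.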

The proposed annotation schema for semi-structured XML documents differs from the PubAnnotation schema in two ways: it uses an xml parameter instead of text, and an xpath parameter instead of span. XPath can locate chunks of text, rich document elements, and even chunks of text inside rich elements such as tables or figures. This lets the user annotate textual parts of the document as well as tables, figures and other rich elements.

An example of a table annotated with the proposed annotation schema is presented below:

   {
      "xml": "<table>...</table>",
      "denotations": [
         {"id": "T1", "xpath": "/table/tr[1]/td[1]", "obj": "Header"},
         {"id": "T2", "xpath": "/table/tr[1]/td[1]", "obj": "Stub"},
         {"id": "T3", "xpath": "/table/tr[1]/td[2]", "obj": "Header"},
         {"id": "T4", "xpath": "/table/tr[2]/td[1]", "obj": "Stub"},
         {"id": "T5", "xpath": "/table/tr[2]/td[2]", "obj": "Data"},
         {"id": "T6", "xpath": "substring(/table/tr[1]/td[1],2,5)", "obj": "substringEx"}
      ],
      "relations": [
         {"id": "R1", "subj": "T4", "pred": "dataOfHeader", "obj": "T1"},
         {"id": "R2", "subj": "T5", "pred": "dataOfHeader", "obj": "T3"},
         {"id": "R3", "subj": "T5", "pred": "dataOfStub", "obj": "T4"}
      ]
   }
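As an illustration of how such xpath denotations could be resolved back to content, the sketch below evaluates two of the expressions from the example with the standard javax.xml.xpath API. The table content is not given in the example, so the 2x2 table used here is entirely hypothetical, as are the class name XPathResolver and its method; this is one possible approach, not the tool's implementation.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

/** Hypothetical helper: resolves an xpath denotation against an XML fragment. */
class XPathResolver {

    /** Evaluates the XPath 1.0 expression and returns its string value. */
    static String resolve(String xml, String xpath) {
        try {
            return XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("cannot evaluate " + xpath, e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical table; the real table content is not part of the example.
        String xml = "<table>"
                + "<tr><td>Adverse event</td><td>Placebo</td></tr>"
                + "<tr><td>Headache</td><td>12</td></tr>"
                + "</table>";
        // T5: a whole data cell.
        System.out.println(resolve(xml, "/table/tr[2]/td[2]"));                // 12
        // T6: five characters starting at the second character of a cell.
        System.out.println(resolve(xml, "substring(/table/tr[1]/td[1],2,5)")); // dvers
    }
}
```

Note that the XPath substring function is 1-based and takes a length as its third argument, unlike the 0-based begin and end offsets of a PubAnnotation span.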

The dashboard will be developed using Java (JSP), JavaScript and MySQL. At a later stage, the system will integrate with popular NER tools so that documents can be pre-annotated automatically before control is handed to the curator.

