Authors: Nikola Milosevic, Goran Nenadic
Most current annotation schemas and tools support annotating entities, and the relationships between them, over words or phrases in raw text documents. However, most documents on the web are rich documents, usually encoded in HTML or a specialized XML format and rendered by web browsers. Vast amounts of information are presented in tables, figures and other rich elements, and this information is lost when only the text is annotated. For example, in clinical trial publications, information about experimental settings, results and adverse events is usually presented in tables. Many such publications are accessible via PubMed Central in rich XML/HTML format, but gathering and semantically enriching information from elements such as tables is not currently possible with these tools.
We aim to develop a web-based annotation tool that allows users to annotate rich web documents and save annotations both internally (in a database) and externally as linked data (with support for RDF and JSON-LD export). As the base annotation scheme we will use the PubAnnotation scheme, since it provides the necessary annotation features and attributes (denotations, relations, modifications).
We are developing a web dashboard that detects the position of the selected text on the page and generates an XPath for the selection. This can be achieved with JavaScript, as was done in XPathGenerator. The user can then choose the type of annotation and the attributes to apply. We will use the Apache Jena library for linked data export.
An example of a short text annotated using PubAnnotation looks like this:
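One common way to derive an XPath for a selected element is to walk from that element up to the document root, recording each tag together with its position among same-named siblings. The sketch below illustrates the idea in Python on a parsed XML tree; it is not the dashboard's JavaScript code, the function name `xpath_of` is hypothetical, and it handles whole elements only, not text offsets inside an element.

```python
import xml.etree.ElementTree as ET


def xpath_of(root, target):
    """Return a positional XPath such as /table/tr[2]/td[1] for `target`."""
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        parent = parent_of[node]
        # XPath positions are 1-based and count only siblings with the same tag.
        same_tag = [sib for sib in parent if sib.tag == node.tag]
        position = next(i for i, sib in enumerate(same_tag, 1) if sib is node)
        steps.append(f"{node.tag}[{position}]")
        node = parent
    steps.append(root.tag)
    return "/" + "/".join(reversed(steps))


if __name__ == "__main__":
    table = ET.fromstring(
        "<table>"
        "<tr><td>parameter</td><td>number</td></tr>"
        "<tr><td>male/female</td><td>15/18</td></tr>"
        "</table>"
    )
    selected = table[1][0]  # first cell of the second row
    print(xpath_of(table, selected))
```

In a browser the same walk would presumably follow `parentNode` from the node under the selection, which is an assumption about the implementation rather than something stated here.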
{
"text": "IRF-4 expression in CML may be induced by IFN-α therapy",
"denotations": [
{"id": "T1", "span": {"begin": 0, "end": 5}, "obj": "Protein"},
{"id": "T2", "span": {"begin": 42, "end": 47}, "obj": "Protein"},
{"id": "E1", "span": {"begin": 6, "end": 16}, "obj": "Expression"},
{"id": "E2", "span": {"begin": 31, "end": 38}, "obj": "Regulation"}
],
"relations": [
{"id": "R1", "subj": "T1", "pred": "themeOf", "obj": "E1"},
{"id": "R2", "subj": "E1", "pred": "themeOf", "obj": "E2"},
{"id": "R3", "subj": "T2", "pred": "causeOf", "obj": "E2"}
],
"modifications": [
{"id": "M1", "pred": "Speculation", "obj": "E2"}
]
}
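In this example each span is a pair of character offsets into text, with begin inclusive and end exclusive (IRF-4 occupies offsets 0 to 5). The following Python sketch recovers the text covered by each denotation; the helper name `covered_text` is illustrative and not part of PubAnnotation.

```python
import json

# The annotation from the example above, reduced to text and denotations.
ANNOTATION = json.loads("""
{
  "text": "IRF-4 expression in CML may be induced by IFN-α therapy",
  "denotations": [
    {"id": "T1", "span": {"begin": 0, "end": 5}, "obj": "Protein"},
    {"id": "T2", "span": {"begin": 42, "end": 47}, "obj": "Protein"},
    {"id": "E1", "span": {"begin": 6, "end": 16}, "obj": "Expression"},
    {"id": "E2", "span": {"begin": 31, "end": 38}, "obj": "Regulation"}
  ]
}
""")


def covered_text(annotation):
    """Map each denotation id to the substring that its span covers."""
    text = annotation["text"]
    return {
        d["id"]: text[d["span"]["begin"]:d["span"]["end"]]
        for d in annotation["denotations"]
    }


if __name__ == "__main__":
    for den_id, chunk in covered_text(ANNOTATION).items():
        print(den_id, repr(chunk))
```

A check of this kind is a cheap way to catch offsets that drift after the source text is edited.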
The proposed annotation schema for semi-structured XML documents differs from the PubAnnotation schema in that it uses an xml parameter instead of text and an xpath parameter instead of span. XPath can locate chunks of text, rich document elements, and even chunks of text inside rich elements such as tables or figures. This lets the user annotate textual parts of the document as well as tables, figures and other rich elements.
An example of a table annotated with the proposed annotation schema is presented below:
{
"xml": "<table><tr><td>parameter</td><td>number</td></tr><tr><td>male/female</td><td>15/18</td></tr></table>",
"denotations": [
{"id": "T1", "xpath": "/table/tr[1]/td[1]", "obj": "Header"},
{"id": "T2", "xpath": "/table/tr[1]/td[1]", "obj": "Stub"},
{"id": "T3", "xpath": "/table/tr[1]/td[2]", "obj": "Header"},
{"id": "T4", "xpath": "/table/tr[2]/td[1]", "obj": "Stub"},
{"id": "T5", "xpath": "/table/tr[2]/td[2]", "obj": "Data"},
{"id": "T6", "xpath": "substring(/table/tr[1]/td[1],2,5)", "obj": "substringEx"}
],
"relations": [
{"id": "R1", "subj": "T4", "pred": "dataOfHeader", "obj": "T1"},
{"id": "R2", "subj": "T5", "pred": "dataOfHeader", "obj": "T3"},
{"id": "R3", "subj": "T5", "pred": "dataOfStub", "obj": "T4"}
]
}
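To show how the xpath values point back into the document, the Python sketch below resolves a denotation's expression to the text it selects. It rests on several assumptions: the standard-library ElementTree supports only a subset of XPath, so absolute paths are evaluated under a throwaway wrapper element and the substring(path, start, length) form is handled by hand; a full XPath 1.0 engine would evaluate these expressions directly. The names `node_text` and `resolve` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Matches the substring(path, start, length) form used by denotation T6.
SUBSTRING = re.compile(r"^substring\((.+),\s*(\d+),\s*(\d+)\)$")


def node_text(root, path):
    """Text content of the first element matching an absolute positional path."""
    # ElementTree only accepts relative paths, so hang the document root
    # under a temporary wrapper element and search from there.
    wrapper = ET.Element("wrapper")
    wrapper.append(root)
    element = wrapper.find("." + path)
    if element is None:
        raise ValueError(f"no element matches {path}")
    return "".join(element.itertext())


def resolve(root, xpath):
    """Return the text that a denotation's xpath value selects."""
    match = SUBSTRING.match(xpath)
    if match is None:
        return node_text(root, xpath)
    path = match.group(1).strip()
    start, length = int(match.group(2)), int(match.group(3))
    # XPath substring() counts characters from 1; Python slices count from 0.
    return node_text(root, path)[start - 1:start - 1 + length]


if __name__ == "__main__":
    table = ET.fromstring(
        "<table>"
        "<tr><td>parameter</td><td>number</td></tr>"
        "<tr><td>male/female</td><td>15/18</td></tr>"
        "</table>"
    )
    for xpath in ("/table/tr[1]/td[1]",
                  "/table/tr[2]/td[2]",
                  "substring(/table/tr[1]/td[1],2,5)"):
        print(xpath, "->", repr(resolve(table, xpath)))
```

Only integer start and length arguments are covered here; the rounding rules that XPath defines for non-integer arguments are out of scope for this sketch.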
The dashboard will be developed using Java (JSP), JavaScript and MySQL. At a later stage, the system will provide automatic integration with popular NER tools, which will pre-annotate documents before handing control to the curator.