@cwarny
Created March 30, 2015 15:10
cant-think-of-a-title

The goal is to create a tool that enables users to generate structured knowledge from unstructured data by interacting with that data through a GUI. Here are the features we would like to support:

  1. Tagging content
    • This in turn will update a predictive model that will suggest other documents to be tagged similarly (to be later confirmed or rejected by users)
  2. Extracting entities by highlighting bits of text
  3. Defining simple semantic relationships between entities by drag and drop
  4. Defining ontologies (data models), i.e. rules and relationships that we expect among a set of entities
    • This in turn will enable the system to automatically generate inferences, and therefore new insights

In this meeting, I want to focus on 2, 3 and 4.

A key feature here is that we would like the users themselves to be able to build these things, rather than having them generated automatically by NLP or by parsing Wikipedia. The users should be able to create the resources ("resource" as in RDF), the links and the models. For a while I thought that we could skip the human input and automatically generate a structured knowledge base from a corpus of text documents by processing natural language, extracting subject-verb-complement patterns and feeding them into a reasoning RDF store. But that is probably utopian. Instead, the key to this project would be to team up humans and machines.

The goal is to enable users to do that through a highly intuitive interface that does not require extensive knowledge of the art of ontology-making. This means that we probably won't be able to leverage all the subtleties of RDF, OWL and inference engines, and will use only the basics of these frameworks.

Some key points I'd like to address:

  • What is a good, intuitive GUI for data modeling and creating OWL-like "inference rules"? Does it make sense to try to create our own, simplified ontology editor? Does SAS already have an ontology editor?
  • What is a good, scalable store for triple data that can be queried fast?
    • Sesame?
    • Jena?
  • Do these triplestores come with an inference engine?
  • Do inference engines actually create and store inferred triples alongside asserted triples, or do they merely check system consistency?
  • Can you feed a triplestore both RDF data and OWL/RDFS metadata/models and have it automatically do the inferencing and generate new triples by itself?
  • Can graph databases (as opposed to SPARQL endpoints) be used to store and query semantic data?
  • Do triplestores support "reification" and blank nodes?
    • Since this knowledge base would be built from a corpus of documents, many triples will be derived from specific documents in the corpus. What is a good way of relating a specific triple to a specific document? Would reification be a good strategy, as in "document X says [reified triple]"?
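
To make the reification question concrete, here is a minimal sketch in plain Python (no triplestore involved) of what "document X says [reified triple]" could look like. The `rdf:Statement`, `rdf:subject`, `rdf:predicate` and `rdf:object` terms come from the RDF reification vocabulary; the `ex:says` predicate, the `ex:` names and both helper functions are assumptions made up for illustration. Whether this is a good strategy in practice is exactly the open question above.

```python
def reify(statement_id, triple, source_document):
    """Describe a triple as a resource and attach it to the document that states it."""
    s, p, o = triple
    return {
        (statement_id, "rdf:type", "rdf:Statement"),
        (statement_id, "rdf:subject", s),
        (statement_id, "rdf:predicate", p),
        (statement_id, "rdf:object", o),
        # "ex:says" is a hypothetical predicate linking a document to a statement.
        (source_document, "ex:says", statement_id),
    }


def triples_stated_by(store, document):
    """Rebuild the original triples that a given document is recorded as stating."""
    found = []
    for subj, pred, stmt in store:
        if subj == document and pred == "ex:says":
            parts = {p: o for s, p, o in store if s == stmt}
            found.append(
                (parts["rdf:subject"], parts["rdf:predicate"], parts["rdf:object"])
            )
    return sorted(found)


# "_:stmt1" is written like a blank node: the statement has no URI of its own.
store = reify("_:stmt1", ("ex:A", "ex:isMarriedTo", "ex:B"), "ex:documentX")
stated = triples_stated_by(store, "ex:documentX")
```

Note the cost: one asserted fact becomes five triples, which is part of what would need to be weighed when choosing a store.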

FAQ

This FAQ is for those not familiar with basic semantic web concepts.

  • What is RDF?

Resource Description Framework. It is a web standard for uniquely identifying resources (the real-world things or entities one wants to talk about) and for describing them. Basically, you talk about things through subject-predicate-object triples.
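
As a rough illustration of the triple idea, the sketch below models a few facts as plain Python tuples and looks them up by pattern. The `http://example.org/` URIs, the facts themselves and the `match` helper are all made up for this example; a real RDF store would offer much more than this.

```python
# Hypothetical namespace; every resource gets a unique identifier under it.
EX = "http://example.org/"

# Each fact is one (subject, predicate, object) triple.
triples = {
    (EX + "Alice", EX + "isMarriedTo", EX + "Bob"),
    (EX + "Alice", EX + "worksFor", EX + "Acme"),
    (EX + "Bob", EX + "livesIn", EX + "Raleigh"),
}


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Everything the data says about Alice.
about_alice = match(triples, s=EX + "Alice")
```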

  • What do you mean by "inference rules" and "ontology"?

An inference rule lets the system deduce facts it was not explicitly told. For instance, if I tell the system that A is married to B, it can safely deduce that B is married to A. It is through an ontology that you define these kinds of rules. In this case, the ontology defines the "isMarriedTo" relationship as symmetric.
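
The symmetric rule can be sketched in a few lines of plain Python. This is only a toy illustration of the idea, not how an OWL reasoner is implemented; the `infer_symmetric` helper and the `worksFor` fact are assumptions added for the example.

```python
def infer_symmetric(triples, symmetric_predicates):
    """Return the triples implied by symmetry that are not already asserted."""
    inferred = set()
    for s, p, o in triples:
        if p in symmetric_predicates and (o, p, s) not in triples:
            inferred.add((o, p, s))
    return inferred


# Asserted facts: only one direction of the marriage is stated.
asserted = {("A", "isMarriedTo", "B"), ("A", "worksFor", "C")}

# The "ontology" here is just the set of predicates declared symmetric.
inferred = infer_symmetric(asserted, {"isMarriedTo"})
```

Only the marriage is mirrored; `worksFor` is left alone because it was not declared symmetric.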

Another example: I'm on the website of some clothing store, searching for that orange shirt I remember seeing in their catalogue. But when I search for "orange" under the category "Shirts", nothing comes up. Then I realize that if I select the subcategory "Henleys" under "Shirts", I find it. That wouldn't happen if our ontology had specified that "Henleys" is a subclass of "Shirts". The system would then have been able to automatically infer that the orange henley is also an orange shirt.

There exist different frameworks to represent this kind of "metadata". Among the most popular are RDFS (not the same thing as RDF) and OWL.
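
A minimal sketch of the subclass inference, again in plain Python rather than a real reasoner. The `rdf:type` and `rdfs:subClassOf` names mirror the RDF/RDFS vocabulary; `item42`, `hasColor`, the extra "Clothing" class and the `infer_types` helper are hypothetical names added for illustration.

```python
TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def infer_types(triples):
    """Add (x, type, Super) whenever x has type Sub and Sub is a subclass of Super."""
    facts = set(triples)
    while True:
        new = {
            (s, TYPE, parent)
            for s, p, cls in facts if p == TYPE
            for child, p2, parent in facts
            if p2 == SUBCLASS_OF and child == cls
        }
        if new <= facts:  # nothing new was derived, so we are done
            return facts
        facts |= new


catalogue = {
    ("Henleys", SUBCLASS_OF, "Shirts"),
    ("Shirts", SUBCLASS_OF, "Clothing"),
    ("item42", TYPE, "Henleys"),
    ("item42", "hasColor", "orange"),
}

facts = infer_types(catalogue)

# Searching for "orange" under "Shirts" now finds the henley.
orange_shirts = sorted(
    s for s, p, o in facts
    if p == TYPE and o == "Shirts" and (s, "hasColor", "orange") in facts
)
```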
