
Research notes

Timestamping

We can use the NOW() function to timestamp the schema. Later we can define a maximum age for the schema; once a part of the schema is older than that, we discard it and retrieve it again.

The NOW() function does not give timezone info, but it should be consistent between calls on the same SPARQL endpoint, so this should not be an issue.
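
As a minimal sketch, the timestamp could be attached to each observation like this; se:retrievedAt is a hypothetical property name (nothing fixed yet), while the se: prefix and endpoint URI match the query further below:

PREFIX se: <http://skodapetr.eu/ontology/sparql-endpoint/>
CONSTRUCT {
  [] se:endpointUri "https://dev.nkod.opendata.cz/sparql" ;
    se:retrievedAt ?now .
} WHERE {
  # NOW() returns the same value for all calls within one query execution
  BIND(NOW() AS ?now)
}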

Q: should we expect the schema to change, or are the datasets expected to be static? A: new data could be added, but the schema probably won't change often

Subclasses

Getting all subclasses of a class:

PREFIX se: <http://skodapetr.eu/ontology/sparql-endpoint/>
CONSTRUCT {
  [] se:class ?Class ;
    se:superclass ?Superclass;
    se:endpointUri "https://dev.nkod.opendata.cz/sparql";
    se:numberOfInstances ?numberOfInstances .
} WHERE {
  {
    SELECT ?Class ?Superclass (COUNT(?resource) AS ?numberOfInstances)
    WHERE {
      ?resource a ?Class.
      ?Class a ?Superclass.
    }
    GROUP BY ?Class ?Superclass
  }
}
LIMIT 10

We could make this transitive as well, but it seems like a pretty uncommon scenario. Most of the time the superclass is just owl#Class or rdf-schema#Class, so we should probably exclude these generic Class resources.
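
A sketch of the transitive variant, assuming the hierarchy is expressed with rdfs:subClassOf (the query above only looks at the direct type of each class); the + path operator walks the subclass chain to any depth:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?Class ?Superclass
WHERE {
  # only classes that actually have instances
  ?resource a ?Class .
  # one or more rdfs:subClassOf steps
  ?Class rdfs:subClassOf+ ?Superclass .
}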

TODO: add data about which endpoints this appears in and how many occurrences there are

TODO: make list of live SPARQL endpoints we can use for our research

Counting instances

There is a problem with simply calling COUNT(): on large datasets it will be unbearably slow. Some SPARQL endpoints will even terminate our queries before the counting finishes.

Proposed solution: instead of extracting classes and the count at the same time, first run a query extracting a list of all classes:

SELECT DISTINCT ?Class
WHERE {
    ?something a ?Class.
}

Afterwards, we can individually query the endpoint for the number of instances for specific classes like so:

PREFIX schema: <http://schema.org/>
SELECT (COUNT(*) as ?numInstances)
WHERE {
  SELECT ?something
  WHERE {
    ?something a schema:Article.
  }
  LIMIT 50
}

If the returned count is equal to the set limit, we can note it in the schema as 50+, and later we can query the count again with an increased limit. This way we can quickly find which classes have very few instances.
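
For a class recorded as 50+, the follow-up is the same counting query with a raised limit; the factor used here is arbitrary:

PREFIX schema: <http://schema.org/>
SELECT (COUNT(*) as ?numInstances)
WHERE {
  SELECT ?something
  WHERE {
    ?something a schema:Article.
  }
  # raised from 50; if the count still hits the limit, record 500+ and retry later
  LIMIT 500
}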

TODO: find out where the counting takes too long, prepare the data

Note: Wikidata is pretty large; maybe most datasets are small enough that we do not need this approach

Multi-properties and multi-relations

For each property/relation, we should find out whether it can appear multiple times on the same resource. We should also consider showing a maximum count, perhaps only when a consistent upper limit can be found in the data.
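
A sketch of one way to detect this for a single property (schema:author is just a placeholder predicate): count the values per subject and take the maximum; anything above 1 makes it a multi-property.

PREFIX schema: <http://schema.org/>
SELECT (MAX(?valueCount) AS ?maxCount)
WHERE {
  # number of values of the property per subject
  SELECT ?subject (COUNT(?value) AS ?valueCount)
  WHERE {
    ?subject schema:author ?value .
  }
  GROUP BY ?subject
}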

Ranges of properties

We should keep track of which different types the values of each property can take, and how common each type is.
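
A sketch for object-valued properties (again with schema:author as a placeholder predicate): group the values by their class and count how often each class occurs. For literal values, DATATYPE(?value) could be grouped on instead.

PREFIX schema: <http://schema.org/>
SELECT ?rangeClass (COUNT(?value) AS ?occurrences)
WHERE {
  ?subject schema:author ?value .
  # class of the value, i.e. a candidate range of the property
  ?value a ?rangeClass .
}
GROUP BY ?rangeClass
ORDER BY DESC(?occurrences)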

Culling low count classes

We definitely want to prune some "unimportant" classes and properties from the schema. But we cannot simply drop low-count classes, since that could throw away important things in smaller datasets.

Suggestion: come up with an 'importance' value for each schema element, ranging from 1 (most important) to 0 (least important). We can calculate this value from factors like the number of instances, the number of properties, the number of instances of connected classes, etc.
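
One possible shape for such a score, as a rough sketch; the weights w_i and the smoothing constant k are free parameters to tune, not something given by the data:

\mathrm{importance}(C) = \sum_i w_i \, f\big(x_i(C)\big), \qquad f(x) = \frac{x}{x + k}, \qquad \sum_i w_i = 1, \; w_i \ge 0

Here x_i(C) are the factors above (instance count, property count, instances of connected classes), and f maps raw counts into [0, 1), so the resulting importance also stays between 0 and 1.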

Note: this is not part of the schema itself; it is postprocessing on top of the schema

Suggestion: can we check whether we see the same connected component across different endpoints? E.g. Virtuoso classes (maybe out of scope)

Postprocessing

My suggestion is to add postprocessing ad hoc as we come across endpoints that require it; this includes things like removing Virtuoso-specific instances. We should not worry about it from the start and just add rules on the fly as required.
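
As an illustration of such an ad-hoc rule, the class-listing query could filter out Virtuoso's internal vocabulary; the virtrdf namespace is Virtuoso-specific, and treating it as the only thing worth removing is just an assumption:

SELECT DISTINCT ?Class
WHERE {
  ?something a ?Class.
  # drop classes from Virtuoso's internal virtrdf namespace
  FILTER(!STRSTARTS(STR(?Class), "http://www.openlinksw.com/schemas/virtrdf#"))
}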

SHACL Annotations

Q: What is the benefit of producing SHACL-like annotations, as opposed to a set of classes exposed by an npm package?

A: it is easier for more people to understand, and some tools already exist that generate a GraphQL schema from SHACL

Step 1: make observations on the endpoint and save them (i.e. SPARQL results saved as JSON)
Step 2: take the observations and make the model
Step 3: take the model and make SHACL

Note: currently it can be multiple steps/scripts

In other words, SHACL is nice to have and will make the tool stronger

TODO: try to describe our schema in SHACL (although we can express more than SHACL can; SHACL is not good for fuzzy data)
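
A rough sketch of what step 3 could look like as a single CONSTRUCT over the saved observations, assuming they use the se: vocabulary from the queries above and standard SHACL terms; whether shapes stay blank nodes or get stable IRIs is left open:

PREFIX se: <http://skodapetr.eu/ontology/sparql-endpoint/>
PREFIX sh: <http://www.w3.org/ns/shacl#>
CONSTRUCT {
  # one node shape per observed class
  [] a sh:NodeShape ;
    sh:targetClass ?Class .
} WHERE {
  ?observation se:class ?Class .
}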

https://2022.eswc-conferences.org/call-for-posters-and-demos/ DEADLINE March 7; it should be up and running by March 3 or 4
