We can use the NOW() function to timestamp the schema. Later we could define a maximum age for the schema; once a part of the schema exceeds it, we discard that part and fetch it again. The NOW() function does not give timezone info, but it should be consistent between calls on the same SPARQL endpoint, so this should not be an issue.
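The max-age idea could later look something like this; a minimal Python sketch, where the 30-day maximum age and the is_stale name are assumptions, not decisions:

```python
from datetime import datetime, timedelta

# Hypothetical maximum age before a schema part is discarded and re-fetched.
MAX_SCHEMA_AGE = timedelta(days=30)

def is_stale(observed_at: datetime, now: datetime,
             max_age: timedelta = MAX_SCHEMA_AGE) -> bool:
    """Return True if a schema part timestamped via NOW() should be re-fetched.

    Both timestamps are assumed to come from the same SPARQL endpoint,
    so they are mutually consistent even without timezone info.
    """
    return now - observed_at > max_age
```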
Q: Could the schema be expected to change? Or are the datasets expected to be static?
A: New data could be added, but the schema probably won't change often.
Getting all subclasses of a class:
PREFIX se: <http://skodapetr.eu/ontology/sparql-endpoint/>
CONSTRUCT {
  [] se:class ?Class ;
     se:superclass ?Superclass ;
     se:endpointUri "https://dev.nkod.opendata.cz/sparql" ;
     se:numberOfInstances ?numberOfInstances .
} WHERE {
  {
    SELECT ?Class ?Superclass (COUNT(?resource) AS ?numberOfInstances)
    WHERE {
      ?resource a ?Class .
      ?Class a ?Superclass .
    }
    GROUP BY ?Class ?Superclass
  }
}
LIMIT 10
We can make this transitive as well. However, it seems like a pretty uncommon scenario: most of the time, the superclass is just owl#Class or rdf-schema#Class. We should probably exclude things named Class.
TODO: add data about which endpoints and occurrences
TODO: make list of live SPARQL endpoints we can use for our research
There is a problem with simply calling COUNT(): on large datasets, it will be unbearably slow. Some SPARQL endpoints will even terminate our queries before they finish because of the counting.
Proposed solution: instead of extracting classes and their counts at the same time, first run a query extracting a list of all classes:
SELECT DISTINCT ?Class
WHERE {
  ?something a ?Class .
}
Afterwards, we can individually query the endpoint for the number of instances of specific classes like so:
PREFIX schema: <http://schema.org/>
SELECT (COUNT(*) AS ?numInstances)
WHERE {
  SELECT ?something
  WHERE {
    ?something a schema:Article .
  }
  LIMIT 50
}
If the returned count is equal to the set limit, we can record it in the schema as 50+, and later query the count again with an increased limit. This way we can quickly find which classes have very few instances.
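The escalating-limit idea could look like this in the tool; a minimal Python sketch where run_count_query stands in for sending the nested SELECT/LIMIT query above to the endpoint, and the function name, starting limit, and cutoff are assumptions:

```python
def count_with_limit(run_count_query, limit=50, max_limit=10_000):
    """Estimate instance counts without an expensive full COUNT().

    `run_count_query(limit)` is a stand-in for running the nested
    SELECT/LIMIT count query on the endpoint. Returns an int when the
    count is exact, or "N+" when even the largest limit was hit.
    """
    n = run_count_query(limit)
    while n == limit and limit < max_limit:
        # The limit was hit, so the count is a lower bound; retry larger.
        limit = min(limit * 10, max_limit)
        n = run_count_query(limit)
    return n if n < limit else f"{n}+"
```

A fake endpoint that has T instances simply returns min(T, limit), which is enough to exercise the logic without network access.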
TODO: find out where the counting takes too long, prepare the data
Note: Wikidata is pretty large, but maybe most datasets are small enough that we do not have to use this.
For each property/relation, we should find out whether it can appear multiple times on the same resource. We should also consider whether to show a max count; perhaps we could show one when a consistent upper limit can be found in the data.
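For illustration, the repeatability and max-count check could be computed from fetched triples; a minimal Python sketch over an in-memory list of (subject, predicate, object) tuples, where the function name and return shape are assumptions:

```python
from collections import Counter

def property_cardinality(triples, prop):
    """How often `prop` appears per subject in the observed data.

    `triples` is a list of (s, p, o) tuples standing in for query results.
    Returns (is_repeatable, max_count): whether the property ever occurs
    more than once on one subject, and the largest observed count.
    """
    per_subject = Counter(s for s, p, _ in triples if p == prop)
    max_count = max(per_subject.values(), default=0)
    return max_count > 1, max_count
```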
We should keep track of which different types properties can take and how common they are.
We definitely want to prune some "unimportant" classes and properties from the schema. But we cannot simply ignore low-count classes since this could ignore important things in smaller datasets.
Suggestion: come up with an 'importance' value for each schema element, ranging from 1 (most important) to 0 (least important). We can calculate this value from factors like the number of instances, the number of properties, the number of connected instances, etc.
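One possible shape for such a score; a toy Python sketch where the choice of factors, the weights, and the log-squashing are purely illustrative assumptions:

```python
import math

def importance(num_instances, num_properties, num_connected,
               weights=(0.5, 0.3, 0.2)):
    """Toy importance score in [0, 1], where higher means more important.

    Each raw factor is squashed into [0, 1) so that huge classes do not
    dominate, then combined as a weighted sum (weights sum to 1).
    """
    def squash(x):
        # log1p keeps growth sublinear; 0 maps to 0, large x approaches 1.
        return 1.0 - 1.0 / (1.0 + math.log1p(x))
    factors = (num_instances, num_properties, num_connected)
    return sum(w * squash(f) for w, f in zip(weights, factors))
```

Because each factor is squashed independently, a class with few instances but many properties and connections can still score well, which addresses the small-dataset concern above.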
Note: this is not part of the schema itself; it is postprocessing on top of the schema.
Suggestion: check whether we can see the same connectivity component across different endpoints, e.g. Virtuoso classes (maybe out of scope).
My suggestion is to add this ad hoc as we come across endpoints that require it; this includes things like removing Virtuoso-specific instances. We should ignore it at the start and just add it on the fly as required.
Q: What is the benefit of producing SHACL-like annotations, as opposed to a set of classes exposed by an npm package?
A: It is easier to understand for more people, and some tools already exist that generate a GraphQL schema from SHACL.
Step 1: make observations against the endpoint and save them (i.e. SPARQL results saved as JSON).
Step 2: take the observations and build the model.
Step 3: take the model and generate SHACL.
Note: currently this can be multiple steps/scripts.
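The steps above could be wired up as separate functions (or scripts); a minimal Python sketch with heavily stubbed bodies, where all names and the Turtle-ish output are assumptions:

```python
import json

def make_observations(results):
    """Step 1: persist raw SPARQL results (here: just serialize to JSON)."""
    return json.dumps(results)

def build_model(observations_json):
    """Step 2: derive a schema model from the saved observations.

    Stubbed: treats the top-level keys as discovered class IRIs.
    """
    results = json.loads(observations_json)
    return {"classes": sorted(results)}

def to_shacl(model):
    """Step 3: emit SHACL shapes for the model (stubbed as Turtle-ish text)."""
    lines = [f"<{c}Shape> a sh:NodeShape ; sh:targetClass <{c}> ."
             for c in model["classes"]]
    return "\n".join(lines)
```

Keeping the steps as separate functions matches the note above: each one can live in its own script, with JSON files as the hand-off format between them.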
e.g. SHACL is nice to have and will make the tool stronger
TODO: try to describe our schema in SHACL (although we can express more than SHACL allows; SHACL is not good for fuzzy data)
https://2022.eswc-conferences.org/call-for-posters-and-demos/ DEADLINE March 7; it should be up and running by March 3 or 4.