@knbknb
Last active May 2, 2024 06:37
helmholtz-knowledge-graph (use in VSCode with Sparqlbook Extension)
[
{
"kind": 1,
"language": "markdown",
"value": "# Queries to explore a dataset: Helmholtz-KnowledgeGraph\n\n> Source: [bobdc.com](https://www.bobdc.com/blog/exploringadataset/) - Bob DuCharme's blog\n\nI recently worked on a project where we had a huge amount of RDF and no clue what was in there apart from what we saw by looking at random triples.\nI developed a few SPARQL queries to give us a better idea of the dataset’s content and structure and these queries are generic enough that I thought that they could be useful to other people.\n\nI’ve written about other exploratory queries before.\nIn Exploring a SPARQL Endpoint I wrote about queries that look for the use of common vocabularies that might be used at a particular endpoint, and how getting a few clues led me to additional related queries.\nThat blog post also mentioned the “Exploring the Data” section of my book Learning SPARQL, which has other general useful queries.\n\nYou can see those listed in the book’s table of contents; they often assume that some sort of schema or ontology is in use.\nA great thing about SPARQL and RDF, though, is that with no knowledge of a schema or any other clues about a dataset’s contents, simple queries can still let you explore that dataset to see what’s there.\nToday’s exploratory queries were not included among those that I described above.\n\nExample output for each query uses the [Helmholtz Knowledge Graph](https://sparql.unhide.helmholtz-metadaten.de/).\n\n##### How many triples does this dataset have in all?\n\n\n",
"metadata": {}
},
{
"kind": 2,
"language": "sparql",
"value": "SELECT (COUNT(*) AS ?tripleCount) WHERE { ?s ?p ?o }",
"metadata": {}
},
{
"kind": 1,
"language": "markdown",
"value": "Definitely a hall-of-fame classic query.\nHere is the result for the Helmholtz Knowledge Graph at the time of writing (2024-05-01):\n\n| tripleCount |\n|-------------|\n| 71786350 |\n",
"metadata": {}
},
{
"kind": 1,
"language": "markdown",
"value": "\n##### Show all the types being used\n\nNever mind whether any types were declared; how many types are used? List them, but don’t repeat any.\n\n",
"metadata": {}
},
{
"kind": 2,
"language": "sparql",
"value": "SELECT DISTINCT ?type WHERE { ?s a ?type } LIMIT 40",
"metadata": {}
},
{
"kind": 1,
"language": "markdown",
"value": "The result with the Helmholtz Knowledge Graph is:\n\n| type |\n|--------------------------------------------------------|\n| rdf:Property |\n| owl:Class |\n| owl:Ontology |\n| rdfs:Class |\n| owl:OntologyProperty |\n| owl:AnnotationProperty |\n| sd:Service |\n| schema:Person |\n| schema:SoftwareSourceCode |\n| schema:ActionStatusType |\n\nand many more\n",
"metadata": {}
},
{
"kind": 1,
"language": "markdown",
"value": "\n##### Count instances per type\n\nOf the types that the previous query found being used, how many instances of each are there? This is useful when you are prioritizing what you’re going to do with the data.\n\n",
"metadata": {}
},
{
"kind": 2,
"language": "sparql",
"value": "SELECT ?type (COUNT (?s) AS ?instanceCount) \nWHERE { ?s a ?type . \n} GROUP BY ?type\nORDER BY DESC(?instanceCount)",
"metadata": {}
},
{
"kind": 1,
"language": "markdown",
"value": "The result:\n\n| type | instanceCount |\n|-------| -------------------------------------------|\n| schema:PropertyValue | 3871436 |\n| schema:DefinedTermSet | 3009404 |\n| schema:Person | 2532126 |\n| schema:DefinedTerm | 1939435 |\n| schema:Organization | 1058310 |\n| schema:DataDownload | 787610 |\n| schema:Dataset | 463432 |\n| schema:Place | 463349 |\n| schema:DataCatalog | 407405 |\n| schema:QuantitativeValue | 394792 |\n| schema:GeoCoordinates | ...|\n\nand many more\n",
"metadata": {}
},
{
"kind": 1,
"language": "markdown",
"value": "\n##### Count the properties that each type uses\n\nOf the types that were found above, how many different properties does each use?\n\n",
"metadata": {}
},
{
"kind": 2,
"language": "sparql",
"value": "# Unrestricted version, kept commented out:\n# SELECT DISTINCT ?type (COUNT(DISTINCT ?p) AS ?c)\n# WHERE { ?s a ?type .\n#         ?s ?p ?o .\n# }\n# GROUP BY ?type\n\n# Version restricted to a few types\nSELECT ?type (COUNT(DISTINCT ?p) AS ?c)\nWHERE {\n  VALUES ?type {\n    <http://schema.org/WebApplication>\n    <http://schema.org/WebAPI>\n    <https://w3id.org/software-types#DesktopApplication>\n    <https://w3id.org/software-types#SoftwareLibrary>\n  }\n  ?s a ?type .\n  ?s ?p ?o .\n}\nGROUP BY ?type\nORDER BY DESC(?c)",
"metadata": {}
},
{
"kind": 1,
"language": "markdown",
"value": "##### Number of properties used in the Helmholtz Knowledge Graph, by type:\n\n| type | c |\n|------|---|\n| <https://w3id.org/software-types#DesktopApplication> | 4 |\n| schema:WebApplication | 3 |\n| <https://w3id.org/software-types#SoftwareLibrary> | 2 |\n| schema:WebAPI | 1 |\n",
"metadata": {}
},
{
"kind": 1,
"language": "markdown",
"value": "##### List properties per type\n\nThe next query will show us why the XXXX class uses so many properties.\n\nWhat are these properties that each type uses? This is also useful for prioritization.\nNote the similarities with and differences from the previous query.\n\n",
"metadata": {}
},
{
"kind": 2,
"language": "sparql",
"value": "SELECT DISTINCT ?type ?property \nWHERE { ?s a ?type . \n?s ?property ?o . \n} \nORDER BY ?type ?property\nLIMIT 10",
"metadata": {}
},
{
"kind": 1,
"language": "markdown",
"value": "The following is an excerpt from the middle of this query’s result on the Beatles dataset used in the original blog post, with <http://learningsparql.com/ns/schema/Song> reduced to s:Song to make it all fit better here.\nThis sample shows that all the different instruments, with all their different spellings, were properties of each song.\n(Read more about how that worked in my SPARQL queries of Beatles recording sessions blog post.)\n\n| Entity | Property |\n| --- | --- |\n| s:Song | <http://learningsparql.com/ns/instrument/guiro> |\n| s:Song | <http://learningsparql.com/ns/instrument/guitar> |\n| s:Song | <http://learningsparql.com/ns/instrument/handbell> |\n| s:Song | <http://learningsparql.com/ns/instrument/handclaps> |\n| s:Song | <http://learningsparql.com/ns/instrument/harmonica> |\n| s:Song | <http://learningsparql.com/ns/instrument/harmonium> |\n| s:Song | <http://learningsparql.com/ns/instrument/harmonyvocals> |\n",
"metadata": {}
},
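{
"kind": 1,
"language": "markdown",
"value": "On a graph of this size, the per-type property listing above returns only the first ten rows in type order. As a minimal sketch, and not part of the original post, the same listing can be narrowed to the four software types from the property-count query by reusing its `VALUES` block. The choice of types is an assumption carried over from that cell; swap in whichever types you are prioritizing.\n\n",
"metadata": {}
},
{
"kind": 2,
"language": "sparql",
"value": "# Sketch: list the properties used by a hand-picked set of types.\n# The type list is an assumption copied from the property-count query.\nSELECT DISTINCT ?type ?property\nWHERE {\n  VALUES ?type {\n    <http://schema.org/WebApplication>\n    <http://schema.org/WebAPI>\n    <https://w3id.org/software-types#DesktopApplication>\n    <https://w3id.org/software-types#SoftwareLibrary>\n  }\n  ?s a ?type .\n  ?s ?property ?o .\n}\nORDER BY ?type ?property",
"metadata": {}
},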
{
"kind": 1,
"language": "markdown",
"value": "#### Have a query create a schema for this schemaless data\n\nConsider that:\n\n- The dataset has no schema but we found types being used\n- We found properties associated with these types\n- Schemas are themselves datasets of triples\n- SPARQL lets you create triples\n\nThis all adds up to the ability to create a schema where there isn’t any.\nIn fact, we can do it with a slight variation on the last query:\n\n",
"metadata": {}
},
{
"kind": 2,
"language": "sparql",
"value": "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> \nPREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n\nCONSTRUCT { \n ?type a rdfs:Class .\n ?property a rdf:Property .\n} \nWHERE { \n ?s a ?type .\n ?s ?property ?o .\n}",
"metadata": {}
},
{
"kind": 1,
"language": "markdown",
"value": "We could go a little further by having the schema use the `rdfs:domain` and `rdfs:range` properties to associate the declared properties with the classes that the query found them with:",
"metadata": {}
},
{
"kind": 2,
"language": "sparql",
"value": "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\nPREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n\nCONSTRUCT {\n ?type a rdfs:Class .\n ?property a rdf:Property .\n ?property rdfs:domain ?type .\n ?property rdfs:range ?otype . \n}\nWHERE {\n ?s a ?type .\n ?s ?property ?o .\n OPTIONAL { ?o a ?otype }\n}",
"metadata": {}
},
{
"kind": 1,
"language": "markdown",
"value": "Along with the schema triples you see above, this new version adds triples like these:\n\n```turtle\ni:banjo rdf:type rdf:Property ;\n rdfs:domain s:Song ;\n rdfs:range s:Musician .\n```",
"metadata": {}
},
{
"kind": 1,
"language": "markdown",
"value": "It also gives the `rdfs:label` property `rdfs:domain` values of `s:Instrument`, `s:Musician`, and `s:Song`, which isn’t quite right; as the RDFS spec tells us,\n\n> “[t]he `rdfs:domain` of `rdfs:label` is `rdfs:Resource`”.\n\nThe spec also tells us that\n\n> “the resources denoted by subjects of triples with predicate `P` are instances of all the classes stated by the `rdfs:domain` properties”,\n\nwhich in the case of my example means that every instance with an `rdfs:label` property is an instrument, a musician, and a song.\n\nWe clearly don’t want to say that, but if you are creating a schema for a dataset that lacks one, CONSTRUCT queries like this can give you a big head start. Just run one or the other with the dataset and then edit the schema that it creates as you see fit.",
"metadata": {}
}
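,
{
"kind": 1,
"language": "markdown",
"value": "One way to reduce the manual editing caused by the `rdfs:label` problem described above is to keep vocabulary-level properties out of the generated schema. The following is a minimal sketch under stated assumptions, not part of the original post: it skips every property whose IRI starts with the RDF or RDFS namespace (which also drops `rdf:type` itself), and it reuses the four software types from the earlier `VALUES` cell so the query stays small. Both the filter and the type list are illustrative choices that you would adjust to your own data.\n\n",
"metadata": {}
},
{
"kind": 2,
"language": "sparql",
"value": "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\nPREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n\n# Sketch: generate schema triples, ignoring RDF/RDFS vocabulary properties.\nCONSTRUCT {\n  ?type a rdfs:Class .\n  ?property a rdf:Property .\n  ?property rdfs:domain ?type .\n  ?property rdfs:range ?otype .\n}\nWHERE {\n  # Assumption: same hand-picked types as in the property-count query.\n  VALUES ?type {\n    <http://schema.org/WebApplication>\n    <http://schema.org/WebAPI>\n    <https://w3id.org/software-types#DesktopApplication>\n    <https://w3id.org/software-types#SoftwareLibrary>\n  }\n  ?s a ?type .\n  ?s ?property ?o .\n  # STR(rdf:) and STR(rdfs:) expand to the namespace IRIs declared above.\n  FILTER(!STRSTARTS(STR(?property), STR(rdf:)) &&\n         !STRSTARTS(STR(?property), STR(rdfs:)))\n  # Range triples are only produced when the object has a type.\n  OPTIONAL { ?o a ?otype }\n}",
"metadata": {}
}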
]
# this query creates a resultset that contains URLs in the ?s column
# and the name of the software source code in the ?value column
# endpoint (webinterface): https://sparql.unhide.helmholtz-metadaten.de/
PREFIX schema: <http://schema.org/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?s ?value WHERE {
  VALUES ?type {
    schema:SoftwareSourceCode
  }
  ?s rdf:type ?type ;
     schema:name ?value .
  FILTER(CONTAINS(STR(?s), "https://git.gfz-potsdam.de/icdp-osg/"))
}