Vladimir Alexiev VladimirAlexiev
@VladimirAlexiev
VladimirAlexiev / CHIN-restructure.md
Last active March 17, 2024 19:04
Examples of complex SPARQL queries that I've written

Prefixes

The query begins by defining all the prefixes it uses, including those for individuals such as nom:, nomBib:, nomLang:.

The ontology nomo.ttl defines an even wider set of prefixes. When loaded into GraphDB, these become repository namespaces and are used in result display and export, which is very useful for the end user.

Output

all: sparql-anything-test-xml.ttl sparql-anything-test-html.ttl

sparql-anything-test-xml.ttl: sparql-anything.sparql test.xml
	sparql-anything.bat -q sparql-anything.sparql -v type=application/xml -v file=test.xml > sparql-anything-test-xml.ttl

sparql-anything-test-html.ttl: sparql-anything.sparql test.xml
	sparql-anything.bat -q sparql-anything.sparql -v type=text/html -v file=test.xml > sparql-anything-test-html.ttl
VladimirAlexiev / DIGIN10-30-LV1_EQ-fixed.jsonld
Last active December 26, 2023 09:40
Improved CIM JSON-LD Representation
{
"@context": {
"rdf" : "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
"cim" : "http://ucaiug.org/ns/CIM#",
"eu" : "http://iec.ch/TC57/CIM100-European#",
"dct" : "http://purl.org/dc/terms/",
"dcat" : "http://www.w3.org/ns/dcat#",
"prov" : "http://www.w3.org/ns/prov#",
"xsd" : "http://www.w3.org/2001/XMLSchema#",
"cim:Bay.VoltageLevel" : {"@type" : "@id"},
VladimirAlexiev / README.md
Last active November 22, 2023 15:23
Migrating J. Paul Getty Museum Agent ID from P2432 to P12040

https://www.wikidata.org/wiki/Property:P12040

  • Renamed P2432 to "J. Paul Getty Museum agent DOR ID (old)"
  • New prop P12040 "J. Paul Getty Museum agent ID"

The old IDs (e.g. https://www.getty.edu/art/collection/artists/377) redirect to the new IDs (e.g. https://www.getty.edu/art/collection/person/103JV9). These pages include human-readable info and, at the bottom, an "APIs & other identifiers" section that lists:

  • Permalink: the new prop
  • DOR ID (internal digital object repository): the old prop. WD has 1054 values (about 9% of total)
epo:AccessTerm
epo:AcquiringCentralPurchasingBody
epo:AgentInRole
epo:AwardCriterion
epo:AwardDecision
epo:AwardEvaluationTerm
epo:Awarder
epo:AwardingCentralPurchasingBody
epo:BudgetProvider
epo:Business
VladimirAlexiev / README.org
Created September 23, 2022 11:11
CrunchBase permalinks including uppercase

Most CB permalinks are lowercase, but a tiny percentage (272 of 2,050,775, about 0.013%) include uppercase letters:

grep "[A-Z]" permalink.txt|sort>permalink-uppercase.txt
wc -l permalink.txt permalink-uppercase.txt
 2050775 permalink.txt
     272 permalink-uppercase.txt

I attach the file so it can be added as exceptions to Wikidata.
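The same filter can be sketched in Python, for cases where grep is not at hand. This is only an illustration of the idea above, not the command actually used: the function name and the sample permalinks are hypothetical, and Python's default sort is by code point, which may order lines differently from a locale-aware shell sort.

```python
import re

# Matches any ASCII uppercase letter, like grep "[A-Z]" in the C locale.
UPPERCASE = re.compile(r"[A-Z]")


def permalinks_with_uppercase(permalinks):
    """Return the sorted permalinks that contain at least one uppercase letter."""
    return sorted(p for p in permalinks if UPPERCASE.search(p))


# Hypothetical sample values, not taken from the real permalink.txt.
sample = ["acme-corp", "Foo-bar", "baz", "quX"]
exceptions = permalinks_with_uppercase(sample)

# Counterpart of the wc -l step: total count and exception count.
print(len(sample), len(exceptions))
```

To run it on the real data, the list could be read with `open("permalink.txt").read().splitlines()`, assuming one permalink per line.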

VladimirAlexiev / README.md
Last active August 16, 2022 14:14
Representing CrunchBase IPOs in FIBO

CrunchBase has a table ipos with the following fields (see ipos-sample.csv):

field      type    descr
uuid       string  Unique identifier
name       string  Entity name
permalink  string  Suffix of url. Sometimes changes
VladimirAlexiev / README.md
Last active August 8, 2022 13:42
Converting FIBO from RDF to Turtle

  • Download FIBO 2022Q2 from https://github.com/edmcouncil/fibo/releases
  • Expand to folder 2022Q2-rdf (291 .rdf files in total)
  • Run rdf2ttl.sh:
    • Creates then runs mkdirs.sh, which replicates the directory structure from 2022Q2-rdf to 2022Q2-ttl
    • Creates then runs rdf2ttl-one.sh, which converts each .rdf file to a respective .ttl using Jena Riot
    • Collects and deduplicates all prefixes (@prefix lines) to prefixes.ttl (385 prefixes in total). This file is very useful until you can copy prefixed names directly from the FIBO Viewer (edmcouncil/onto-viewer#270)
  • Finally, checks that no prefix and no namespace is used twice
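The final check (no prefix and no namespace used twice) can be sketched as follows. rdf2ttl.sh itself is not shown here, so this is not the script's actual logic, just one possible way to do the check in Python. The function name, the regular expression, and the example.org sample lines are assumptions made for illustration, and the regex only handles simple one-line @prefix declarations.

```python
import re
from collections import defaultdict

# One-line Turtle prefix declaration: @prefix name: <namespace> .
PREFIX_RE = re.compile(r"^\s*@prefix\s+([^:\s]*):\s*<([^>]*)>\s*\.\s*$")


def find_prefix_conflicts(lines):
    """Return (prefixes bound to several namespaces, namespaces bound to several prefixes)."""
    pairs = set()
    for line in lines:
        match = PREFIX_RE.match(line)
        if match:
            # The set deduplicates identical declarations.
            pairs.add((match.group(1), match.group(2)))

    by_prefix = defaultdict(set)
    by_namespace = defaultdict(set)
    for prefix, namespace in pairs:
        by_prefix[prefix].add(namespace)
        by_namespace[namespace].add(prefix)

    dup_prefixes = {p: sorted(n) for p, n in by_prefix.items() if len(n) > 1}
    dup_namespaces = {n: sorted(p) for n, p in by_namespace.items() if len(p) > 1}
    return dup_prefixes, dup_namespaces


# Hypothetical sample: "owl" is repeated harmlessly, "ex" is bound to two
# namespaces, and one namespace is bound to both "ex" and "other".
sample = [
    "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
    "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
    "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
    "@prefix ex: <http://example.org/a/> .",
    "@prefix ex: <http://example.org/b/> .",
    "@prefix other: <http://example.org/a/> .",
]
dup_prefixes, dup_namespaces = find_prefix_conflicts(sample)
print(dup_prefixes)
print(dup_namespaces)
```

Both dictionaries being empty would mean every prefix maps to exactly one namespace and vice versa.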
@prefix nomo: <https://nomenclature.info/nom/ontology/> .
@prefix nomShape: <https://nomenclature.info/nom/shape/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix nom: <https://nomenclature.info/nom/> .

Crunchbase Challenge

Here's a challenge to the KG Construction CG:

  • Take Crunchbase: 10.5M rows, across 18 tables, served as CSV, updated daily.
  • The data of some nodes comes from multiple tables (e.g. Organization from organizations, org_parents, org_descriptions)
  • RDFize and store the total dataset in under 1-2 hours
    • Using the approach described here, GraphDB 9.11 with OntoRefine takes 76-119 minutes (1.3-2 hours) depending on hardware to produce and load 138M triples (19-30k triples per second)
  • Update the data daily, replacing the data of recently updated rows.
    • Using the approach described here, it takes about 15 minutes to update all of Crunchbase
  • Do it with your favorite RDFization toolkit, and preferably do it declaratively
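For anyone comparing their own toolkit against these figures, the throughput range quoted above follows directly from the triple count and the two run times. A minimal sketch of that arithmetic, using only the numbers stated in the challenge (the function name is just for illustration):

```python
TRIPLES = 138_000_000  # triples produced and loaded, as stated above


def triples_per_second(triples, minutes):
    """Average throughput for a run that takes the given number of minutes."""
    return triples / (minutes * 60)


fast = triples_per_second(TRIPLES, 76)   # fastest reported run
slow = triples_per_second(TRIPLES, 119)  # slowest reported run

print(round(slow), round(fast))
```

The two values come out at roughly 19k and 30k triples per second, matching the quoted range.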