Vladimir Alexiev VladimirAlexiev

This is a fixed representation of sec 7 Real ECLASS Content Example from ECLASS Serialization as RDF, Part 1, ECLASS Technical Specification 110, 22 April 2024.

I collected all text, added comments into puml:label, and had to make some fixes (marked with FIXED):

  • Fixed 8 prefixes from eclass: (used for ECLASS "metadata" terms) to eclass13-0: (used for ECLASS content terms)
  • Elided (commented out) 4 statements because they don't add clarity
  • Changed or added 10 URLs to make the whole graph connected

Then I used the rdfpuml tool to make a diagram.

VladimirAlexiev / CHIN-restructure.md
Last active March 17, 2024 19:04
Examples of complex SPARQL queries that I've written

Prefixes

At the beginning the query defines all prefixes it uses, including for individuals like nom:, nomBib:, nomLang:.

The ontology nomo.ttl defines an even wider set of prefixes: when loaded to GraphDB, these become repository namespaces, so they are used in result display and export, which is very useful for the end-user.

Output

all: sparql-anything-test-xml.ttl sparql-anything-test-html.ttl

sparql-anything-test-xml.ttl: sparql-anything.sparql test.xml
	sparql-anything.bat -q sparql-anything.sparql -v type=application/xml -v file=test.xml > sparql-anything-test-xml.ttl

sparql-anything-test-html.ttl: sparql-anything.sparql test.xml
	sparql-anything.bat -q sparql-anything.sparql -v type=text/html -v file=test.xml > sparql-anything-test-html.ttl
VladimirAlexiev / DIGIN10-30-LV1_EQ-fixed.jsonld
Last active December 26, 2023 09:40
Improved CIM JSON-LD Representation
{
  "@context": {
    "rdf"  : "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "cim"  : "http://ucaiug.org/ns/CIM#",
    "eu"   : "http://iec.ch/TC57/CIM100-European#",
    "dct"  : "http://purl.org/dc/terms/",
    "dcat" : "http://www.w3.org/ns/dcat#",
    "prov" : "http://www.w3.org/ns/prov#",
    "xsd"  : "http://www.w3.org/2001/XMLSchema#",
    "cim:Bay.VoltageLevel" : {"@type" : "@id"},
VladimirAlexiev / README.md
Last active November 22, 2023 15:23
Migrating J. Paul Getty Museum Agent ID from P2432 to P12040

https://www.wikidata.org/wiki/Property:P12040

  • Renamed P2432 to "J. Paul Getty Museum agent DOR ID (old)"
  • New prop P12040 "J. Paul Getty Museum agent ID"

Old-ID URLs such as https://www.getty.edu/art/collection/artists/377 redirect to new-ID URLs such as https://www.getty.edu/art/collection/person/103JV9. These pages include human-readable info and an "APIs & other identifiers" section at the bottom that lists:

  • Permalink: the new prop
  • DOR ID (internal digital object repository): the old prop. WD has 1054 values (about 9% of total)
VladimirAlexiev / Berkshire-describe.ttl
Last active May 17, 2023 08:14
Instance diagram about Berkshire Hathaway GLEIF L1 data from #data_world DESCRIBE query, see https://twitter.com/valexiev1/status/1298655567389564928
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix fn: <http://www.w3.org/2005/xpath-functions#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix gleif-L1: <https://www.gleif.org/ontology/L1/> .
@prefix gleif-L2: <https://www.gleif.org/ontology/L2/> .
@prefix gleif-base: <https://www.gleif.org/ontology/Base/> .
@prefix gleif-data-L1: <https://linked.opendata.gleif.org/L1/> .
@prefix gleif-data-L2: <https://linked.opendata.gleif.org/L2/> .
@prefix gleif-data-ra: <https://linked.opendata.gleif.org/RegistrationAuthority/> .
VladimirAlexiev / README.md
Created March 30, 2017 06:54
How to use Google Sheets to Manage Wikidata Coreferencing

A previous post How to Add Museum IDs to Wikidata explained how to use SPARQL to find missing data on Wikidata (Getty Museum IDs), how to create such values (from museum webpage URLs) and how to format them properly for QuickStatements.

Here I explain how to use Google Sheets to manage a more advanced task. The sheet AAT-Wikidata matches about 3k AAT concepts to Wikipedia, WordNet30 and BabelNet (it restored an old mapping to WordNet, retrieved it from BabelNet, mapped to Wikipedia).

  • For each row, it uses the following Google sheet formula (column C) to query the Wikipedia API and get the corresponding Wikidata ID (wikibase_item); split on two lines for readability:
=ImportXml(concat("https://en.wikipedia.o
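The same per-row lookup can be sketched outside the spreadsheet. This is a minimal Python sketch, not the sheet's actual formula (which is truncated above): it assumes the MediaWiki Action API `pageprops` query with `ppprop=wikibase_item` and XML output, and the function names and the sample response are hypothetical, for illustration only.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# English Wikipedia endpoint of the MediaWiki Action API
API = "https://en.wikipedia.org/w/api.php"


def build_lookup_url(title: str) -> str:
    """Build a query URL asking for the Wikidata ID (wikibase_item) of a page title."""
    params = {
        "action": "query",
        "prop": "pageprops",
        "ppprop": "wikibase_item",
        "redirects": "1",   # follow Wikipedia redirects to the target page
        "format": "xml",
        "titles": title,
    }
    return API + "?" + urllib.parse.urlencode(params)


def extract_wikibase_item(xml_text: str):
    """Return the wikibase_item attribute from an API response, or None if absent."""
    root = ET.fromstring(xml_text)
    node = root.find(".//pageprops")
    return node.get("wikibase_item") if node is not None else None


# Illustrative response, trimmed to the elements this sketch reads (an assumption
# about the response shape, not a captured API reply).
SAMPLE = (
    '<api><query><pages><page title="Douglas Adams">'
    '<pageprops wikibase_item="Q42"/>'
    "</page></pages></query></api>"
)

if __name__ == "__main__":
    print(build_lookup_url("Douglas Adams"))
    print(extract_wikibase_item(SAMPLE))
```

Fetching the URL (for example with `urllib.request.urlopen`) and passing the response body to `extract_wikibase_item` would reproduce what the sheet does per row; the network call is left out so the sketch runs offline.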
epo:AccessTerm
epo:AcquiringCentralPurchasingBody
epo:AgentInRole
epo:AwardCriterion
epo:AwardDecision
epo:AwardEvaluationTerm
epo:Awarder
epo:AwardingCentralPurchasingBody
epo:BudgetProvider
epo:Business
VladimirAlexiev / README.org
Created September 23, 2022 11:11
CrunchBase permalinks including uppercase

Most CB permalinks are lowercase, but a tiny percentage include uppercase letters:

grep "[A-Z]" permalink.txt|sort>permalink-uppercase.txt
wc -l permalink.txt permalink-uppercase.txt
 2050775 permalink.txt
     272 permalink-uppercase.txt

I attach the file so it can be added as exceptions to Wikidata.
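For readers without `grep` at hand, the same filter can be sketched in Python. This is an illustrative equivalent of the command above, not part of the original workflow; the function name and the sample permalinks are hypothetical. Note that Python sorts by code point, so the order may differ from `sort` under some locales.

```python
import re

# Same pattern as the grep call: any ASCII uppercase letter
UPPER = re.compile(r"[A-Z]")


def uppercase_permalinks(lines):
    """Return, sorted, the permalinks that contain at least one uppercase letter."""
    cleaned = (line.rstrip("\n") for line in lines)
    return sorted(p for p in cleaned if UPPER.search(p))


if __name__ == "__main__":
    # Hypothetical sample; in practice, pass an open file such as permalink.txt
    sample = ["acme-corp\n", "Example-Org\n", "foo-Bar\n", "plain\n"]
    for permalink in uppercase_permalinks(sample):
        print(permalink)
```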

Crunchbase Challenge

Here's a challenge to the KG Construction CG:

  • Take Crunchbase: 10.5M rows, across 18 tables, served as CSV, updated daily.
  • The data of some nodes comes from multiple tables (eg Organization from organizations, org_parents, org_descriptions)
  • RDFize and store the total dataset in under 1-2 hours
    • Using the approach described here, GraphDB 9.11 with OntoRefine takes 76-119 minutes (1.3-2 hours) depending on hardware to produce and load 138M triples (19-30k triples per second)
  • Update the data daily, replacing the data of recently updated rows.
    • Using the approach described here, it takes about 15 minutes to update all of Crunchbase
  • Do it with your favorite RDFization toolkit, and preferably do it declaratively
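The throughput figures quoted in the list can be checked with a few lines of arithmetic. This sketch only uses the numbers stated above (138M triples, 76-119 minutes); the function name is hypothetical.

```python
def triples_per_second(triples: int, minutes: float) -> float:
    """Average load rate for a given triple count and wall-clock duration."""
    return triples / (minutes * 60)


TRIPLES = 138_000_000              # triples produced and loaded
FAST_MIN, SLOW_MIN = 76, 119       # reported duration range in minutes

fast = triples_per_second(TRIPLES, FAST_MIN)
slow = triples_per_second(TRIPLES, SLOW_MIN)

if __name__ == "__main__":
    print(f"{FAST_MIN / 60:.1f}-{SLOW_MIN / 60:.1f} hours")
    print(f"{slow / 1000:.0f}-{fast / 1000:.0f}k triples per second")
```

Any competing toolkit can be compared on the same basis by plugging in its own triple count and duration.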