Skip to content

Instantly share code, notes, and snippets.

@yawnston
Last active November 13, 2022 16:02
Show Gist options
  • Save yawnston/ff8b0b920e902e779ef78f45ca48b76c to your computer and use it in GitHub Desktop.
Save yawnston/ff8b0b920e902e779ef78f45ca48b76c to your computer and use it in GitHub Desktop.
master-thesis-notes
  • behem implementace : kdyz narazim na vice variant reseni, napsat si bohate poznamky -> z toho udelat neco do finalniho textu
  • do diplomky je zajimava i diskuse -> kdyz mam 5 napadu, vybrat ten nejlepsi a probrat proc je nejlepsi
  • swaggerui/vaadin
  • client-server
  1. v dashboardu mit moznost pridatat dalsi DB systemy
  2. read only pracovat se semantickou kategorii (klikat dotazy)
  • MM-cat
  1. konstrukce dotazu nebo textova forma (overit aby SPARQL se dal pouzivat na semantickou kategorii)
  • vyhoda kategorie = skladani cesty -> misto slozitych cest lze dlouhe sipky
  1. vykonat dotaz -> vizualizace vysledku -> reprezentace dotazu, policko na to v jake reprezentaci to vratit explain -> ukazat mozne strategie dotazu a jejich casovou narocnost (DB + spojovani) -> vizualizace (graf/chart/strom) i vyber strategie! -> 3 moznost: execute, explain, explain and execute (volba strategie), moznost cachovani strategii, uzivatel muze rict ze chce pouzit nacachovanou strategii u opakovaneho dotazu (duvody: probiha v DB dlouha transakce a dlouho by se cekalo na vysledek - to ale nebudeme resit, pouze pocitame ceny strategii a vizualizujeme to)

2 moznosti u schema - explicit schema (databaze vrati na dotaz) nebo data first schema later (db schema odvodi)

  • SPARQL ID mame unikatni z te kategorie -> lze pouit misto IRI

  • duraz na ty execution plany -> jak se budou vizualizovat, jaka data o nich budeme ukazovat

  • vymyslet jaka data v tech databazich dostaneme z tech databazich v explainu, pripadne kdyz neco chybi tak jak to dopocitat, jak to pak pouzit ve vizualizaci

  • 2 veci na benchmarkovani - jak dobry je odhad optimizeru proti realite, a performance systemu jako celku (individualni systemy vs celek)

https://dspace.cuni.cz/handle/20.500.11956/173476

Frontend possibilities

Dashboarding

  • SwaggerUI is only a tool for creating simple UI from OpenAPI spec
  • Vaadin -> full stack Java, not easy to decouple BE&FE
  • React/Vue/Angular - vue used by MM-cat
  • Python

SPARQL for categorical model

Decisions to be made:

  • Notation of variables
    • I prefer to keep SPARQL ?varname notation
  • Notation of access paths
    • Which direction do we use? I prefer to reverse the direction as it makes visual sense
    • What notation to use? I show N where N is an ID, using / to chain paths
      • <N> is also an option but I don't think it's required, it's not ambiguous between numbers and IDs because of position where it can be used
      • / is used in SPARQL for property paths but we can use another char like -
    • Maybe something which can be compatible with libraries to parse SPARQL
      • Actually we can get ANTLR grammar for SPARQL and generate a client
        • This way we can make changes to the grammar also

Simple query

Get order number, contact type and contact value for all orders of Alice.

SELECT ?orderNumber ?contactType ?contactValue
WHERE {
  ?x 24/21/3 "Alice" ;
     25 ?orderNumber ;
     27/29/31 ?contactType ;
     27/33 ?contactValue .
}

Querying property

Is this useful? (also, construction for ID of object I don't think is useful, like ?x ID 121)

Here there can be ambiguity between number and ID, how do we solve it? Maybe we can "color" variables based on whether they are relationships or not, and based on that we evaluate the filter.

SELECT ?propertyId ?object
WHERE {
  ?x 47 42 ;
     ?propertyId ?object .
  FILTER (?propertyId != 47)
}

Optional patterns

WHERE {
  ?product 47 ?productId .
  OPTIONAL { ?product 57/59 ?bookLength }
}

Advanced paths

We can use things like SPARQL alternative paths and group paths (for precedence):

SELECT ?productId ?publisherOrProductName
WHERE {
  ?product 47 ?productId ;
           (49 | 57/61/63/65) ?publisherOrProductName .
}

As for inverse paths, we have two options:

  • Use the SPARQL notion of ^49 being reverse traversal of 49
  • Somehow use the bidirectional edges generated in the schema category instead (how?)

Multiple occurrences of a path like repeating I don't think are useful, since the schema category strongly doesn't really allow patterns like that (verify this?)

-> what did friends of friends order? -> we want to use this

Other things

  • UNION is extended naturally
  • Can we have multiple objects for one subject + predicate tuple in the category? How do we tell?
  • Primitives can be extended from SPARQL (excluding lang strings and IRIs), also funcitions for filtering etc can be extended from SPARQL, including arithmetic and equality operators etc
  • We can specify variables to return, but how does that translate to the result that is returned?
    • Instead of SELECT we can use CONSTRUCT to create a graph pattern to return -> use SELECT but with CONSTRUCT syntax
  • We should get some specific examples of what we want to show in the demo

---

  • Jak vypada explain ve vybranych systemech? e.g. mongo, postgre -> co vraci jejich explain
  • -> jak vypada nas explain? co v nem je atd
  • Definovat konkretni example do dema
  • UniBench -> inspirace dotazu
  • https://github.com/HY-UDBMS/UniBench
    • -> Postgre (relace + JSON dokumenty) + Mongo (JSON) -> redundance dat
  • da se pro odhad ceny kombinace udelat rychle trivialni dotazy (e.g. kolik zakazniku se jmenuje Karel atd)
    • pokud tohle nejde udelat nejak jednoduse, tak to zatim neresit a nechat tam zatim otaznik, potom muzeme pridat nejake heuristiky
  • kdyz uz budu zjistovat co ty explainy vraci -> udelat si srovnavaci tabulku co ty systemy v explainu umi a co neumi (-> do diplomky)

Query planner information

PostgreSQL

We have execution estimates (startup cost and total cost), estimates for num rows returned and estimated width.
We get this with:

EXPLAIN (FORMAT YAML) SELECT * FROM foo WHERE i='4';

          QUERY PLAN
-------------------------------
 - Plan:                      +
     Node Type: "Index Scan"  +
     Scan Direction: "Forward"+
     Index Name: "fi"         +
     Relation Name: "foo"     +
     Alias: "foo"             +
     Startup Cost: 0.00       +
     Total Cost: 5.98         +
     Plan Rows: 1             +
     Plan Width: 4            +
     Index Cond: "(i = 4)"    
(1 row)

MongoDB

Mongo has 3 explain modes:

queryPlanner mode (default)

Only runs the query planner, returns information about the query plan but unfortunately does not return estimated num rows returned or cost.

db.theaters.explain().find({ theaterId: { $gt: 1020 } });
{
  "queryPlanner" : {
    "plannerVersion" : 1,
    "namespace" : "sample_mflix.theaters",
    "indexFilterSet" : false,
    "parsedQuery" : {
      "theaterId" : {
        "$gt" : 1020
      }
    },
    "winningPlan" : {
      "stage" : "FETCH",
      "inputStage" : {
        "stage" : "IXSCAN",
        "keyPattern" : {
          "theaterId" : 1
        },
        "indexName" : "theaterId_1",
        "isMultiKey" : false,
        "multiKeyPaths" : {
          "theaterId" : [ ]
        },
        "isUnique" : true,
        "isSparse" : false,
        "isPartial" : false,
        "indexVersion" : 2,
        "direction" : "forward",
        "indexBounds" : {
          "theaterId" : [
            "(1020.0, inf.0]"
          ]
        }
      }
    },
    "rejectedPlans" : [ ]
  }
}

executionStats mode

This mode chooses the winning plan and returns data about its execution. It returns timing information and number of documents/keys touched. The works value represents work units, i.e. abstract execution cost.

> db.theaters.explain({verbosity: "executionStats"}).find({theaterId: {$gt: 1020}});
{
  "queryPlanner" : {
    "plannerVersion" : 1,
    "namespace" : "sample_mflix.theaters",
    "indexFilterSet" : false,
    "parsedQuery" : {
      "theaterId" : {
        "$gt" : 1020
      }
    },
    "winningPlan" : {
      "stage" : "FETCH",
      "inputStage" : {
        "stage" : "IXSCAN",
        "keyPattern" : {
          "theaterId" : 1
        },
        "indexName" : "theaterId_1",
        "isMultiKey" : false,
        "multiKeyPaths" : {
          "theaterId" : [ ]
        },
        "isUnique" : true,
        "isSparse" : false,
        "isPartial" : false,
        "indexVersion" : 2,
        "direction" : "forward",
        "indexBounds" : {
          "theaterId" : [
            "(1020.0, inf.0]"
          ]
        }
      }
    },
    "rejectedPlans" : [ ]
  },
  "executionStats" : {
    "executionSuccess" : true,
    "nReturned" : 861,
    "executionTimeMillis" : 1,
    "totalKeysExamined" : 861,
    "totalDocsExamined" : 861,
    "executionStages" : {
      "stage" : "FETCH",
      "nReturned" : 861,
      "executionTimeMillisEstimate" : 0,
      "works" : 862,
      "advanced" : 861,
      "needTime" : 0,
      "needYield" : 0,
      "saveState" : 6,
      "restoreState" : 6,
      "isEOF" : 1,
      "invalidates" : 0,
      "docsExamined" : 861,
      "alreadyHasObj" : 0,
      "inputStage" : {
        "stage" : "IXSCAN",
        "nReturned" : 861,
        "executionTimeMillisEstimate" : 0,
        "works" : 862,
        "advanced" : 861,
        "needTime" : 0,
        "needYield" : 0,
        "saveState" : 6,
        "restoreState" : 6,
        "isEOF" : 1,
        "invalidates" : 0,
        "keyPattern" : {
          "theaterId" : 1
        },
        "indexName" : "theaterId_1",
        "isMultiKey" : false,
        "multiKeyPaths" : {
          "theaterId" : [ ]
        },
        "isUnique" : true,
        "isSparse" : false,
        "isPartial" : false,
        "indexVersion" : 2,
        "direction" : "forward",
        "indexBounds" : {
          "theaterId" : [
            "(1020.0, inf.0]"
          ]
        },
        "keysExamined" : 861,
        "seeks" : 1,
        "dupsTested" : 0,
        "dupsDropped" : 0,
        "seenInvalidated" : 0
      }
    },
    "allPlansExecution" : [ ]
  }
}

It's possible that we could somehow get the information about already cached plans from the plan cache. https://www.slideshare.net/mongodb/reading-the-explain-output https://www.mongodb.com/docs/manual/reference/method/db.collection.getPlanCache/#mongodb-method-db.collection.getPlanCache

Neo4j

We have the execution plan with estimated number of rows for the stages:

{
  "query": {
    "text": "EXPLAIN MATCH (tom {name: \"Tom Hanks\"}) RETURN tom",
    "parameters": {}
  },
  "queryType": "r",
  "counters": {
    "_stats": {
      "nodesCreated": 0,
      "nodesDeleted": 0,
      "relationshipsCreated": 0,
      "relationshipsDeleted": 0,
      "propertiesSet": 0,
      "labelsAdded": 0,
      "labelsRemoved": 0,
      "indexesAdded": 0,
      "indexesRemoved": 0,
      "constraintsAdded": 0,
      "constraintsRemoved": 0
    },
    "_systemUpdates": 0
  },
  "updateStatistics": {
    "_stats": {
      "nodesCreated": 0,
      "nodesDeleted": 0,
      "relationshipsCreated": 0,
      "relationshipsDeleted": 0,
      "propertiesSet": 0,
      "labelsAdded": 0,
      "labelsRemoved": 0,
      "indexesAdded": 0,
      "indexesRemoved": 0,
      "constraintsAdded": 0,
      "constraintsRemoved": 0
    },
    "_systemUpdates": 0
  },
  "plan": {
    "operatorType": "ProduceResults@neo4j",
    "identifiers": [
      "tom"
    ],
    "arguments": {
      "planner-impl": "IDP",
      "Details": "tom",
      "PipelineInfo": "Fused in Pipeline 0",
      "planner-version": "4.4",
      "runtime-version": "4.4",
      "runtime": "PIPELINED",
      "runtime-impl": "PIPELINED",
      "version": "CYPHER 4.4",
      "EstimatedRows": 17,
      "planner": "COST"
    },
    "children": [
      {
        "operatorType": "Filter@neo4j",
        "identifiers": [
          "tom"
        ],
        "arguments": {
          "Details": "tom.name = $autostring_0",
          "EstimatedRows": 17,
          "PipelineInfo": "Fused in Pipeline 0"
        },
        "children": [
          {
            "operatorType": "AllNodesScan@neo4j",
            "identifiers": [
              "tom"
            ],
            "arguments": {
              "Details": "tom",
              "EstimatedRows": 340,
              "PipelineInfo": "Fused in Pipeline 0"
            },
            "children": []
          }
        ]
      }
    ]
  },
  "profile": false,
  "notifications": [],
  "server": {
    "address": "localhost:7687",
    "version": "Neo4j/4.4.5",
    "agent": "Neo4j/4.4.5",
    "protocolVersion": 4.4
  },
  "resultConsumedAfter": {
    "low": 0,
    "high": 0
  },
  "resultAvailableAfter": {
    "low": 0,
    "high": 0
  },
  "database": {
    "name": "neo4j"
  }
}
  • Podbarveni grafu -> lze udelat obarveni vrcholu v komponente na kterou jsem kliknul
  • Udelat paletu konstruktu z jazyka, kliknu do grafu -> otevre se mi paleta a z ni vyberu (filter, sjednoceni kde musim kliknout na dalsi vrchol atd) http://nosql.ms.mff.cuni.cz/mmcat/schema

Call s Jachymem:

  • related work - evoluce queries
  • Neo4j tam melo byt ale neni to tam naimplementovany
  • Postgre - vsechno string zatim (jine neuvazujeme), Mongo je slozitejsi
  • Wrappery - pracujeme s abstraktnima v algoritmech
  • Jak pouziju wrappery -> treba "relacni databaze" atd -> abstraktni wrapper jeste mezi "databaze" a "Postgre"
  • PROBRAT: Querying language -> 2 moznosti:
    • V SELECTove casti umoznime jen vzit podmnozinu schematu -> instancni kategorie
    • Umoznime i projekci -> vytvorit novou schematickou kategorii a pro ni instanci? Nebo jak? -> Budeme vubec mit neco jako je instancni kategorie? Nebo proste vratime jako JSON/atd a hotovo?
  • Moznost queryovat instancni kategorii -> to ale nemuzu protoze bych nacital vsechny data

31oct:

  • do diplomky i napsat tento postup navrhu jazyka, t.j. vezmu SPARQL grammar antlr -> upravim -> generuju parser

SELECT { ?person ?person2 . ?person ?person3 . }

-> JSON (pseudo): { superpritel { person2 } superpritel2 { person3 } }

-> vysledek bude instancni kategorie pro schema indukovane SELECT klauzuli

  • pri vyberu formatu vysledku si uzivatel vybere format (napr JSON -> musi vybrat i koren kde v grafu se ma JSON zakorenit)
  • potreba nove syntaxe pro definici novych morfismu
  • requirements: cistost, jednoduchost
  • v diplomce popsat 2 moznosti : mapovani primo do syntaxe jazyka, nebo mapovani mimo dle reprezentace dat
  • pri navrhu a v textu pocitat s grafovym modelem a Neo4j, ale implementace staci PostgreSQL a MongoDB -> vlastne to i staci, protoze tyhle 2 modely toho maji vetsinu (mame aggregate-ignorant i aggregate-oriented)
  • nebat se v textu napsat "tohle je tezky, je to out of scope diplomky, nikdo ho jeste neresil"

Do ctvrtka 10.11.:

Vybrat pro každého zákazníka nejdražší předmět, který si objednal. Zajímá mě strom v SELECT Vrátit např. { name, surname, theMostExpensiveItem { name, price}} From Me to Everyone 08:13 PM SELECT { ?customer ?item . ?customer ?name . ?item ?itemName . ?item ?itemPrice . } WHERE { ?customer 12 ?order . ?order 42 ?item . # ?item je nejdrazsi vec co si zakaznik objednal }

  • poslat i strukturu dotazu

Vrátit seznam produktů, které mají nižší cenu než 150 - pro každého zákazníka. Pro každého zákazníka... From Me to Everyone 08:19 PM

SELECT {
  ?customer cheapItems _:cheapItems ;
            name       ?name        ;
            surname    ?surname     .

  _:cheapItems item ?item .

  ?item name  ?itemName  ;
        price ?itemPrice .
}
WHERE {
  ?customer -12/14 ?order .
  ?order -42/40 ?item .
  ?item 56 ?itemName ;
        57 ?itemPrice .
  FILTER (?itemPrice <= 150)
}

EXPLANATION:

  • WHERE clause does pattern matching on the schema category - generates solutions matching this pattern (using triples subject predicate object .)
    • ?customer is a variable named customer
    • 56 is the ID of a morphism, in this case the morphism from Product to Name. Morphisms prefixed with - are duals, i.e. opposite direction.
      • Note: I arbitrarily assigned IDs to the morphisms from the categorical approach PDF from Pavel Koupil.
    • We can form paths using morphisms: -12/14 means to traverse the dual of 12, and then traverse 14.
    • ; is syntactic sugar for having multiple triples with the same subject, equivalent to simply repeating the subject again in the next triple.
    • FILTER removes some solutions from consideration. Only solutions matching the FILTER are returned.
  • SELECT clause describes the returned data - it implicitly defines a schema category, and returned data will be instances of this category.
    • Using an alphanumeric string in the predicate position in SELECT defines a new morphism (and implicitly its dual also).
    • Using bound variables will simply substitute that variable's data.
    • _:cheapItems is syntax for a blank node named cheapItems - this is necessary in order to add new objects to the returned schema, otherwise we could only use bound variables.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment