yawnston/master-thesis-notes.md

## master-thesis-notes.md

      
    Raw
  

              master-thesis-notes.md
            
          
behem implementace : kdyz narazim na vice variant reseni, napsat si bohate poznamky -> z toho udelat neco do finalniho textu
do diplomky je zajimava i diskuse -> kdyz mam 5 napadu, vybrat ten nejlepsi a probrat proc je nejlepsi
swaggerui/vaadin
client-server


v dashboardu mit moznost pridatat dalsi DB systemy
read only pracovat se semantickou kategorii (klikat dotazy)


MM-cat


konstrukce dotazu nebo textova forma (overit aby SPARQL se dal pouzivat na semantickou kategorii)


vyhoda kategorie = skladani cesty -> misto slozitych cest lze dlouhe sipky


vykonat dotaz -> vizualizace vysledku -> reprezentace dotazu, policko na to v jake reprezentaci to vratit
explain -> ukazat mozne strategie dotazu a jejich casovou narocnost (DB + spojovani) -> vizualizace (graf/chart/strom)
i vyber strategie! -> 3 moznost: execute, explain, explain and execute (volba strategie), moznost cachovani strategii, uzivatel muze rict ze chce pouzit nacachovanou strategii u opakovaneho dotazu (duvody: probiha v DB dlouha transakce a dlouho by se cekalo na vysledek - to ale nebudeme resit, pouze pocitame ceny strategii a vizualizujeme to)


postgre (jen relacni cast), mongo, neo4j
https://gitlab.mff.cuni.cz/contosp/evolution-management
http://nosql.ms.mff.cuni.cz/mmcat
http://nosql.ms.mff.cuni.cz/mmcat-docs/
https://github.com/dbishagen/darwin
SwaggerUI + CoreUI template
v emailu najit od pavla koupila navrh dotazu (to odpovida vice mene sparqlu)
definovat technologie, architekturu, API
vytvorit dummy UI, dummy implementaci s hardcoded datama a schematem (vaadin vs swagger, nebo obecne neco co umi delat dashboardy, na vaadin je potreba licence na vizualizaci grafu)
overit SPARQL nad kategorii, viz navrh dotazu od pavla koupila

2 moznosti u schema - explicit schema (databaze vrati na dotaz) nebo data first schema later (db schema odvodi)


SPARQL ID mame unikatni z te kategorie -> lze pouit misto IRI


duraz na ty execution plany -> jak se budou vizualizovat, jaka data o nich budeme ukazovat


vymyslet jaka data v tech databazich dostaneme z tech databazich v explainu, pripadne kdyz neco chybi tak jak to dopocitat, jak to pak pouzit ve vizualizaci


2 veci na benchmarkovani - jak dobry je odhad optimizeru proti realite, a performance systemu jako celku (individualni systemy vs celek)


https://dspace.cuni.cz/handle/20.500.11956/173476
Frontend possibilities

Dashboarding


SwaggerUI is only a tool for creating simple UI from OpenAPI spec
Vaadin -> full stack Java, not easy to decouple BE&FE
React/Vue/Angular - vue used by MM-cat
Python

provides dashboarding libraries for data scientists for fast prototyping
both BE and FE in Python -> unified code base
Dash

not everything provided by the free version, only some things opensource
similar to Streamlit in its purpose


Streamlit

fully opensource
possible to create custom components
very fast and easy prototyping
chosen solution
https://github.com/null-jones/streamlit-plotly-events
problems with interactive graph -> how to do it?
https://discuss.streamlit.io/t/is-it-possible-to-process-click-events-on-charts/1258/3


SPARQL for categorical model

Decisions to be made:

Notation of variables

I prefer to keep SPARQL ?varname notation


Notation of access paths

Which direction do we use? I prefer to reverse the direction as it makes visual sense
What notation to use? I show N where N is an ID, using / to chain paths

<N> is also an option but I don't think it's required, it's not ambiguous between numbers and IDs because of position where it can be used
/ is used in SPARQL for property paths but we can use another char like -


Maybe something which can be compatible with libraries to parse SPARQL

Actually we can get ANTLR grammar for SPARQL and generate a client

This way we can make changes to the grammar also


Simple query

Get order number, contact type and contact value for all orders of Alice.
SELECT ?orderNumber ?contactType ?contactValue
WHERE {
  ?x 24/21/3 "Alice" ;
     25 ?orderNumber ;
     27/29/31 ?contactType ;
     27/33 ?contactValue .
}

Querying property

Is this useful?
(also, construction for ID of object I don't think is useful, like ?x ID 121)
Here there can be ambiguity between number and ID, how do we solve it? Maybe we can "color" variables based on whether they are relationships or not, and based on that we evaluate the filter.
SELECT ?propertyId ?object
WHERE {
  ?x 47 42 ;
     ?propertyId ?object .
  FILTER (?propertyId != 47)
}

Optional patterns

WHERE {
  ?product 47 ?productId .
  OPTIONAL { ?product 57/59 ?bookLength }
}

Advanced paths

We can use things like SPARQL alternative paths and group paths (for precedence):
SELECT ?productId ?publisherOrProductName
WHERE {
  ?product 47 ?productId ;
           (49 | 57/61/63/65) ?publisherOrProductName .
}

As for inverse paths, we have two options:

Use the SPARQL notion of ^49 being reverse traversal of 49
Somehow use the bidirectional edges generated in the schema category instead (how?)

Multiple occurrences of a path like repeating I don't think are useful, since the schema category strongly doesn't really allow patterns like that (verify this?)
-> what did friends of friends order? -> we want to use this
Other things


UNION is extended naturally
Can we have multiple objects for one subject + predicate tuple in the category? How do we tell?
Primitives can be extended from SPARQL (excluding lang strings and IRIs), also funcitions for filtering etc can be extended from SPARQL, including arithmetic and equality operators etc
We can specify variables to return, but how does that translate to the result that is returned?

Instead of SELECT we can use CONSTRUCT to create a graph pattern to return
-> use SELECT but with CONSTRUCT syntax


We should get some specific examples of what we want to show in the demo

---


Jak vypada explain ve vybranych systemech? e.g. mongo, postgre -> co vraci jejich explain
-> jak vypada nas explain? co v nem je atd
Definovat konkretni example do dema
UniBench -> inspirace dotazu
https://github.com/HY-UDBMS/UniBench

-> Postgre (relace + JSON dokumenty) + Mongo (JSON) -> redundance dat


da se pro odhad ceny kombinace udelat rychle trivialni dotazy (e.g. kolik zakazniku se jmenuje Karel atd)

pokud tohle nejde udelat nejak jednoduse, tak to zatim neresit a nechat tam zatim otaznik, potom muzeme pridat nejake heuristiky


kdyz uz budu zjistovat co ty explainy vraci -> udelat si srovnavaci tabulku co ty systemy v explainu umi a co neumi (-> do diplomky)

Query planner information

PostgreSQL

We have execution estimates (startup cost and total cost), estimates for num rows returned and estimated width.

We get this with:
EXPLAIN (FORMAT YAML) SELECT * FROM foo WHERE i='4';

          QUERY PLAN
-------------------------------
 - Plan:                      +
     Node Type: "Index Scan"  +
     Scan Direction: "Forward"+
     Index Name: "fi"         +
     Relation Name: "foo"     +
     Alias: "foo"             +
     Startup Cost: 0.00       +
     Total Cost: 5.98         +
     Plan Rows: 1             +
     Plan Width: 4            +
     Index Cond: "(i = 4)"    
(1 row)
MongoDB

Mongo has 3 explain modes:
queryPlanner mode (default)

Only runs the query planner, returns information about the query plan but unfortunately does not return estimated num rows returned or cost.
db.theaters.explain().find({ theaterId: { $gt: 1020 } });
{
  "queryPlanner" : {
    "plannerVersion" : 1,
    "namespace" : "sample_mflix.theaters",
    "indexFilterSet" : false,
    "parsedQuery" : {
      "theaterId" : {
        "$gt" : 1020
      }
    },
    "winningPlan" : {
      "stage" : "FETCH",
      "inputStage" : {
        "stage" : "IXSCAN",
        "keyPattern" : {
          "theaterId" : 1
        },
        "indexName" : "theaterId_1",
        "isMultiKey" : false,
        "multiKeyPaths" : {
          "theaterId" : [ ]
        },
        "isUnique" : true,
        "isSparse" : false,
        "isPartial" : false,
        "indexVersion" : 2,
        "direction" : "forward",
        "indexBounds" : {
          "theaterId" : [
            "(1020.0, inf.0]"
          ]
        }
      }
    },
    "rejectedPlans" : [ ]
  }
}
executionStats mode

This mode chooses the winning plan and returns data about its execution.
It returns timing information and number of documents/keys touched.
The works value represents work units, i.e. abstract execution cost.
> db.theaters.explain({verbosity: "executionStats"}).find({theaterId: {$gt: 1020}});
{
  "queryPlanner" : {
    "plannerVersion" : 1,
    "namespace" : "sample_mflix.theaters",
    "indexFilterSet" : false,
    "parsedQuery" : {
      "theaterId" : {
        "$gt" : 1020
      }
    },
    "winningPlan" : {
      "stage" : "FETCH",
      "inputStage" : {
        "stage" : "IXSCAN",
        "keyPattern" : {
          "theaterId" : 1
        },
        "indexName" : "theaterId_1",
        "isMultiKey" : false,
        "multiKeyPaths" : {
          "theaterId" : [ ]
        },
        "isUnique" : true,
        "isSparse" : false,
        "isPartial" : false,
        "indexVersion" : 2,
        "direction" : "forward",
        "indexBounds" : {
          "theaterId" : [
            "(1020.0, inf.0]"
          ]
        }
      }
    },
    "rejectedPlans" : [ ]
  },
  "executionStats" : {
    "executionSuccess" : true,
    "nReturned" : 861,
    "executionTimeMillis" : 1,
    "totalKeysExamined" : 861,
    "totalDocsExamined" : 861,
    "executionStages" : {
      "stage" : "FETCH",
      "nReturned" : 861,
      "executionTimeMillisEstimate" : 0,
      "works" : 862,
      "advanced" : 861,
      "needTime" : 0,
      "needYield" : 0,
      "saveState" : 6,
      "restoreState" : 6,
      "isEOF" : 1,
      "invalidates" : 0,
      "docsExamined" : 861,
      "alreadyHasObj" : 0,
      "inputStage" : {
        "stage" : "IXSCAN",
        "nReturned" : 861,
        "executionTimeMillisEstimate" : 0,
        "works" : 862,
        "advanced" : 861,
        "needTime" : 0,
        "needYield" : 0,
        "saveState" : 6,
        "restoreState" : 6,
        "isEOF" : 1,
        "invalidates" : 0,
        "keyPattern" : {
          "theaterId" : 1
        },
        "indexName" : "theaterId_1",
        "isMultiKey" : false,
        "multiKeyPaths" : {
          "theaterId" : [ ]
        },
        "isUnique" : true,
        "isSparse" : false,
        "isPartial" : false,
        "indexVersion" : 2,
        "direction" : "forward",
        "indexBounds" : {
          "theaterId" : [
            "(1020.0, inf.0]"
          ]
        },
        "keysExamined" : 861,
        "seeks" : 1,
        "dupsTested" : 0,
        "dupsDropped" : 0,
        "seenInvalidated" : 0
      }
    },
    "allPlansExecution" : [ ]
  }
}
It's possible that we could somehow get the information about already cached plans from the plan cache.
https://www.slideshare.net/mongodb/reading-the-explain-output
https://www.mongodb.com/docs/manual/reference/method/db.collection.getPlanCache/#mongodb-method-db.collection.getPlanCache
Neo4j

We have the execution plan with estimated number of rows for the stages:
{
  "query": {
    "text": "EXPLAIN MATCH (tom {name: \"Tom Hanks\"}) RETURN tom",
    "parameters": {}
  },
  "queryType": "r",
  "counters": {
    "_stats": {
      "nodesCreated": 0,
      "nodesDeleted": 0,
      "relationshipsCreated": 0,
      "relationshipsDeleted": 0,
      "propertiesSet": 0,
      "labelsAdded": 0,
      "labelsRemoved": 0,
      "indexesAdded": 0,
      "indexesRemoved": 0,
      "constraintsAdded": 0,
      "constraintsRemoved": 0
    },
    "_systemUpdates": 0
  },
  "updateStatistics": {
    "_stats": {
      "nodesCreated": 0,
      "nodesDeleted": 0,
      "relationshipsCreated": 0,
      "relationshipsDeleted": 0,
      "propertiesSet": 0,
      "labelsAdded": 0,
      "labelsRemoved": 0,
      "indexesAdded": 0,
      "indexesRemoved": 0,
      "constraintsAdded": 0,
      "constraintsRemoved": 0
    },
    "_systemUpdates": 0
  },
  "plan": {
    "operatorType": "ProduceResults@neo4j",
    "identifiers": [
      "tom"
    ],
    "arguments": {
      "planner-impl": "IDP",
      "Details": "tom",
      "PipelineInfo": "Fused in Pipeline 0",
      "planner-version": "4.4",
      "runtime-version": "4.4",
      "runtime": "PIPELINED",
      "runtime-impl": "PIPELINED",
      "version": "CYPHER 4.4",
      "EstimatedRows": 17,
      "planner": "COST"
    },
    "children": [
      {
        "operatorType": "Filter@neo4j",
        "identifiers": [
          "tom"
        ],
        "arguments": {
          "Details": "tom.name = $autostring_0",
          "EstimatedRows": 17,
          "PipelineInfo": "Fused in Pipeline 0"
        },
        "children": [
          {
            "operatorType": "AllNodesScan@neo4j",
            "identifiers": [
              "tom"
            ],
            "arguments": {
              "Details": "tom",
              "EstimatedRows": 340,
              "PipelineInfo": "Fused in Pipeline 0"
            },
            "children": []
          }
        ]
      }
    ]
  },
  "profile": false,
  "notifications": [],
  "server": {
    "address": "localhost:7687",
    "version": "Neo4j/4.4.5",
    "agent": "Neo4j/4.4.5",
    "protocolVersion": 4.4
  },
  "resultConsumedAfter": {
    "low": 0,
    "high": 0
  },
  "resultAvailableAfter": {
    "low": 0,
    "high": 0
  },
  "database": {
    "name": "neo4j"
  }
}

Podbarveni grafu -> lze udelat obarveni vrcholu v komponente na kterou jsem kliknul
Udelat paletu konstruktu z jazyka, kliknu do grafu -> otevre se mi paleta a z ni vyberu (filter, sjednoceni kde musim kliknout na dalsi vrchol atd)
http://nosql.ms.mff.cuni.cz/mmcat/schema

Call s Jachymem:

related work - evoluce queries
Neo4j tam melo byt ale neni to tam naimplementovany
Postgre - vsechno string zatim (jine neuvazujeme), Mongo je slozitejsi
Wrappery - pracujeme s abstraktnima v algoritmech
Jak pouziju wrappery -> treba "relacni databaze" atd -> abstraktni wrapper jeste mezi "databaze" a "Postgre"
PROBRAT: Querying language -> 2 moznosti:

V SELECTove casti umoznime jen vzit podmnozinu schematu -> instancni kategorie
Umoznime i projekci -> vytvorit novou schematickou kategorii a pro ni instanci? Nebo jak?
-> Budeme vubec mit neco jako je instancni kategorie? Nebo proste vratime jako JSON/atd a hotovo?


Moznost queryovat instancni kategorii -> to ale nemuzu protoze bych nacital vsechny data

31oct:

do diplomky i napsat tento postup navrhu jazyka, t.j. vezmu SPARQL grammar antlr -> upravim -> generuju parser

SELECT {
?person  ?person2 .
?person  ?person3 .
}
-> JSON (pseudo):
{
superpritel {
person2
}
superpritel2 {
person3
}
}
-> vysledek bude instancni kategorie pro schema indukovane SELECT klauzuli

pri vyberu formatu vysledku si uzivatel vybere format (napr JSON -> musi vybrat i koren kde v grafu se ma JSON zakorenit)
potreba nove syntaxe pro definici novych morfismu
requirements: cistost, jednoduchost
v diplomce popsat 2 moznosti : mapovani primo do syntaxe jazyka, nebo mapovani mimo dle reprezentace dat
pri navrhu a v textu pocitat s grafovym modelem a Neo4j, ale implementace staci PostgreSQL a MongoDB
-> vlastne to i staci, protoze tyhle 2 modely toho maji vetsinu (mame aggregate-ignorant i aggregate-oriented)
nebat se v textu napsat "tohle je tezky, je to out of scope diplomky, nikdo ho jeste neresil"

Do ctvrtka 10.11.:
Vybrat pro každého zákazníka nejdražší předmět, který si objednal. Zajímá mě strom v SELECT
Vrátit např. { name, surname, theMostExpensiveItem { name, price}}
From Me to Everyone 08:13 PM
SELECT {
?customer  ?item .
?customer  ?name .
?item  ?itemName .
?item  ?itemPrice .
}
WHERE {
?customer 12 ?order .
?order 42 ?item .
# ?item je nejdrazsi vec co si zakaznik objednal
}

poslat i strukturu dotazu

Vrátit seznam produktů, které mají nižší cenu než 150 - pro každého zákazníka.
Pro každého zákazníka...
From Me to Everyone 08:19 PM
SELECT {
  ?customer cheapItems _:cheapItems ;
            name       ?name        ;
            surname    ?surname     .

  _:cheapItems item ?item .

  ?item name  ?itemName  ;
        price ?itemPrice .
}
WHERE {
  ?customer -12/14 ?order .
  ?order -42/40 ?item .
  ?item 56 ?itemName ;
        57 ?itemPrice .
  FILTER (?itemPrice <= 150)
}

EXPLANATION:

WHERE clause does pattern matching on the schema category - generates solutions matching this pattern (using triples subject predicate object .)

?customer is a variable named customer
56 is the ID of a morphism, in this case the morphism from Product to Name. Morphisms prefixed with - are duals, i.e. opposite direction.

Note: I arbitrarily assigned IDs to the morphisms from the categorical approach PDF from Pavel Koupil.


We can form paths using morphisms: -12/14 means to traverse the dual of 12, and then traverse 14.
; is syntactic sugar for having multiple triples with the same subject, equivalent to simply repeating the subject again in the next triple.
FILTER removes some solutions from consideration. Only solutions matching the FILTER are returned.


SELECT clause describes the returned data - it implicitly defines a schema category, and returned data will be instances of this category.

Using an alphanumeric string in the predicate position in SELECT defines a new morphism (and implicitly its dual also).
Using bound variables will simply substitute that variable's data.
_:cheapItems is syntax for a blank node named cheapItems - this is necessary in order to add new objects to the returned schema, otherwise we could only use bound variables.

In this case, because _:cheapItems is first defined in the object position, a new instance will be created for each ?customer instance.
If _:cheapItems was first defined in the subject position, there would be a single object shared between all customers.
https://gist.github.com/yawnston/298534cb5127da48471eb0ec7af56f1c