###
# The result below shows the sum of the population of all cities with a population greater than 1,000,000.
###
{
  Local {
    Get(where:{
      operands: [{
        path: ["Things", "City", "population"],
        operator: GreaterThan
        valueInt: 1000000
      }]
    },{
      group:{
        operands: [{
          path: ["Things", "City", "population"],
          aggregate: SUM # other options: COUNT, MAX, MIN, AVG
        }]
      }
    }) {
      Things {
        City {
          population
        }
      }
    }
  }
}
@bobvanluijt, can you give an example result? What might the result be if you are asking for 'name' and 'population' of a city, and you group by the sum of the population?
Sure @laura-ham,
Assume our DB has:
- city-a with a population of 100.000
- city-b with a population of 100.001
- city-c with a population of 200.000
Option 1
The query below will result in:
- population = 300.001
{
  Local {
    Get(where:{
      operands: [{
        path: ["Things", "City", "population"],
        operator: GreaterThan
        valueInt: 100000
      }]
    },{
      group:{
        operands: [{
          path: ["Things", "City", "population"],
          aggregate: SUM # other options: COUNT, MAX, MIN, AVG
        }]
      }
    }) {
      Things {
        City {
          population
        }
      }
    }
  }
}
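To make the expected number concrete, here is a small Python sketch of how Option 1 could be evaluated against the three sample cities. It is not Weaviate code: the `CITIES` list, the `sum_population` helper and the `THRESHOLD` of 100,000 are assumptions, with the threshold chosen so that city-a is excluded, as the listed results imply.

```python
# Hypothetical in-memory stand-in for the sample DB (not a Weaviate API).
CITIES = [
    {"name": "city-a", "population": 100_000},
    {"name": "city-b", "population": 100_001},
    {"name": "city-c", "population": 200_000},
]
THRESHOLD = 100_000  # assumed value for the GreaterThan filter


def sum_population(cities, threshold):
    """Apply the where-filter first, then SUM the remaining populations."""
    return sum(c["population"] for c in cities if c["population"] > threshold)


option_1_total = sum_population(CITIES, THRESHOLD)
print(option_1_total)  # 300001 (city-b + city-c)
```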
Option 2
The query below will result in:
- name: city-b
- population: 100.001
- name: city-c
- population: 200.000
{
  Local {
    Get(where:{
      operands: [{
        path: ["Things", "City", "population"],
        operator: GreaterThan
        valueInt: 100000
      }]
    },{
      group:{
        operands: [{
          path: ["Things", "City", "population"],
          aggregate: SUM # other options: COUNT, MAX, MIN, AVG
        }]
      }
    }) {
      Things {
        City {
          name
          population
        }
      }
    }
  }
}
Option 3
The query below will result in:
- name = 2
{
  Local {
    Get(where:{
      operands: [{
        path: ["Things", "City", "population"],
        operator: GreaterThan
        valueInt: 100000
      }]
    },{
      group:{
        operands: [{
          path: ["Things", "City", "population"],
          aggregate: COUNT # other options: SUM, MAX, MIN, AVG
        }]
      }
    }) {
      Things {
        City {
          name
        }
      }
    }
  }
}
Option 4
The query below will result in:
- population = 2
{
  Local {
    Get(where:{
      operands: [{
        path: ["Things", "City", "population"],
        operator: GreaterThan
        valueInt: 100000
      }]
    },{
      group:{
        operands: [{
          path: ["Things", "City", "population"],
          aggregate: COUNT # other options: SUM, MAX, MIN, AVG
        }]
      }
    }) {
      Things {
        City {
          population
        }
      }
    }
  }
}
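Options 3 and 4 both come out at 2 because COUNT only counts the cities that pass the filter, whichever property is selected. A minimal Python sketch of that reading, where the `CITIES` list, the `count_property` helper and the 100,000 threshold are all assumptions for illustration:

```python
# Hypothetical in-memory stand-in for the sample DB (not a Weaviate API).
CITIES = [
    {"name": "city-a", "population": 100_000},
    {"name": "city-b", "population": 100_001},
    {"name": "city-c", "population": 200_000},
]
THRESHOLD = 100_000  # assumed value for the GreaterThan filter


def count_property(cities, threshold, prop):
    """Count the filtered cities that carry a non-null value for `prop`."""
    matching = [c for c in cities if c["population"] > threshold]
    return sum(1 for c in matching if c.get(prop) is not None)


name_count = count_property(CITIES, THRESHOLD, "name")              # Option 3
population_count = count_property(CITIES, THRESHOLD, "population")  # Option 4
print(name_count, population_count)  # 2 2
```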
Option 5
Change in the DB:
- city-a with a population of 100.000
- city-b with a population of 100.001
- city-c with a population of 200.000
- city-c with a population of 100.000 (there are two city-c's)
The query below will result in:
- name: city-b
- population: 100.001
- name: city-c
- population: 300.000 ⬅️
{
  Local {
    Get(where:{
      operands: [{
        path: ["Things", "City", "population"],
        operator: GreaterThan
        valueInt: 100000
      }]
    },{
      group:{
        operands: [{
          path: ["Things", "City", "population"],
          aggregate: SUM # other options: COUNT, MAX, MIN, AVG
        }]
      }
    }) {
      Things {
        City {
          name
          population
        }
      }
    }
  }
}
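The Option 5 result depends on whether the filter runs before or after the grouping. One reading that reproduces the listed numbers is to sum per name first and apply the threshold to the totals (similar to SQL's HAVING). The sketch below illustrates that reading only; `CITIES`, `sum_by_name` and the 100,000 threshold are assumptions, not Weaviate behaviour.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for the changed DB (not a Weaviate API).
CITIES = [
    {"name": "city-a", "population": 100_000},
    {"name": "city-b", "population": 100_001},
    {"name": "city-c", "population": 200_000},
    {"name": "city-c", "population": 100_000},
]
THRESHOLD = 100_000  # assumed value for the GreaterThan filter


def sum_by_name(cities, threshold):
    """Group by name, SUM each group, then keep totals above the threshold."""
    totals = defaultdict(int)
    for city in cities:
        totals[city["name"]] += city["population"]
    return {name: total for name, total in totals.items() if total > threshold}


option_5 = sum_by_name(CITIES, THRESHOLD)
print(option_5)  # {'city-b': 100001, 'city-c': 300000}
```

If the filter were applied to individual cities before grouping instead, the second city-c (100.000) would be dropped and city-c would total 200.000, so the order of the two steps is worth pinning down in the spec.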
To me these feel like 'statistical' functions, and not 'Get' functions.
I'd argue that a separate 'Aggregated' or 'Statistics' field under Local would do wonders for keeping the 'Get' function simple.
{
  Local {
    Aggregated() {
      Things { City { .... } }
    }
    # or
    Statistics(...) {
      Things { City { .... } }
    }
  }
}
I imagine that such a field could be translated to Network queries too.
I find it hard to understand what these different aggregations are supposed to do, based on the GraphQL query.
I believe that we should distinguish between simple counts with conditions, and more complex operations like groupBy.
Each different function/operation should ideally correspond to a field below 'Aggregated' or 'Statistics'.
This will make it very simple for end users to start to do some operations.
Initial impressions count, and an initial exposure to simple statistics is very good for the demo-ability of Weaviate.
Simple sum
{
  Local {
    Statistics(where: { ... }) {
      Sum {
        Things {
          City {
            population
          }
        }
      }
    }
  }
}
Output
{ "Local": { "Statistics": { "Sum": { "Things": { "City": { "population": 42 } } } } } }
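The response mirrors the nesting of the query, with the aggregated number in place of the selected property. A rough Python sketch of building that shape; the `statistics_sum_response` helper and the two sample populations are hypothetical and only chosen to land on the 42 used above.

```python
import json


def statistics_sum_response(cities, class_name, prop):
    """Sum `prop` over all cities and wrap it in the proposed response shape."""
    total = sum(c[prop] for c in cities)
    return {"Local": {"Statistics": {"Sum": {"Things": {class_name: {prop: total}}}}}}


# Made-up data for illustration only.
cities = [{"population": 40}, {"population": 2}]
response = statistics_sum_response(cities, "City", "population")
print(json.dumps(response))
```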
95% percentile
{
  Local {
    Statistics(where: { ... }) {
      # Compute 95% range of data.
      Percentile(from: 0.025, to: 0.975) {
        Things {
          City {
            population
          }
        }
      }
    }
  }
}
Output
{ "Local": { "Statistics": { "Percentile": { "Things": { "City": { "population": {
  "min": 1000,
  "max": 2000
} } } } } } }
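A central 95% range runs from the 2.5th to the 97.5th percentile. The following Python sketch shows one way such a `min`/`max` pair could be computed, using linear interpolation between neighbouring values. The helper names and the sample data are assumptions, and a real backend may use a different percentile definition.

```python
def percentile(sorted_values, q):
    """Linear-interpolation percentile for q in [0, 1] on pre-sorted data."""
    if not sorted_values:
        raise ValueError("percentile of empty data")
    pos = q * (len(sorted_values) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] + frac * (sorted_values[upper] - sorted_values[lower])


def percentile_range(values, lower_q, upper_q):
    """Return the value range between two percentiles as min/max."""
    ordered = sorted(values)
    return {"min": percentile(ordered, lower_q), "max": percentile(ordered, upper_q)}


# Made-up populations: 0, 10, 20, ..., 1000 (101 values).
populations = list(range(0, 1001, 10))
result = percentile_range(populations, 0.025, 0.975)
print(result)
```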
Group By
{
  Local {
    Statistics(where: { ... }) {
      GroupBy() {
        Things {
          City {
            country @groupBy(fn: GROUP_BY)
            totalPopulation: population @groupBy(fn: SUM)
            smallestCity: population @groupBy(fn: MIN)
            biggestCity: population @groupBy(fn: MAX)
          }
        }
      }
    }
  }
}
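As a sketch of what this GroupBy query is meant to return, the Python below groups cities by `country` and computes the three aliased aggregates per group. The sample cities and the `group_by_country` helper are made up for illustration, and the flat list of rows is only one possible response shape.

```python
from collections import defaultdict

# Made-up data for illustration only.
CITIES = [
    {"country": "A", "population": 30},
    {"country": "A", "population": 10},
    {"country": "B", "population": 5},
]


def group_by_country(cities):
    """Group populations by country and compute SUM, MIN and MAX per group."""
    groups = defaultdict(list)
    for city in cities:
        groups[city["country"]].append(city["population"])
    return [
        {
            "country": country,
            "totalPopulation": sum(populations),
            "smallestCity": min(populations),
            "biggestCity": max(populations),
        }
        for country, populations in groups.items()
    ]


rows = group_by_country(CITIES)
print(rows)
```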
An extra advantage of doing this is that you'll be able to clearly defend that these are different functions, with different pricing, than just slurping the data out of Weaviate or a network.
That indeed sounds reasonable. I'm not necessarily in favor of one or the other, but syntactically it might absolutely be preferable to introduce a Stats{} function.
Naming-wise, maybe Aggregate{} would suit better. Any thoughts @laura-ham and @moretea?
It would also be possible to add all aggregation functions as GQL-functions.
{
  Local {
    Aggregate(where: { ... }) { # or Stats...
      Sum{}
      Percentile{}
      Count{}
      Average{}
      Maximum{}
      Median{}
      Minimum{}
      Mode{}
      GroupBy() {} # Would be used for more complex group by functions.
    }
  }
}
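To illustrate why the single-value functions look simple, here is a Python sketch that maps most of the proposed field names onto ordinary aggregate functions from the standard library. The `AGGREGATES` table and the `aggregate` helper are hypothetical, and say nothing about how hard the same mapping is in Gremlin. Percentile and GroupBy are left out because they need extra arguments.

```python
import statistics

# Hypothetical mapping from proposed field names to plain aggregate functions.
AGGREGATES = {
    "Sum": sum,
    "Count": len,
    "Average": statistics.mean,
    "Maximum": max,
    "Median": statistics.median,
    "Minimum": min,
    "Mode": statistics.mode,
}


def aggregate(values, names):
    """Compute each requested aggregate over the same list of values."""
    return {name: AGGREGATES[name](values) for name in names}


# Made-up populations for illustration only.
populations = [100_000, 100_001, 200_000, 100_000]
stats = aggregate(populations, ["Sum", "Count", "Maximum", "Minimum", "Mode"])
print(stats)
```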
@moretea would it be fair to say that splitting these aggregate functions (except for GroupBy()) would be relatively easy to implement?
Maybe they are simple to implement. I would expect so, based on my experience with SQL. However, Gremlin is not as well rounded, and I did not research this yet for Gremlin.
info and types: https://en.wikipedia.org/wiki/Aggregate_function