Skip to content

Instantly share code, notes, and snippets.

@kylebrandt
Last active February 14, 2020 17:47
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save kylebrandt/b17ff133127b88165fd22a65a0424e81 to your computer and use it in GitHub Desktop.
Save kylebrandt/b17ff133127b88165fd22a65a0424e81 to your computer and use it in GitHub Desktop.
wip wide/tall df graphing issue

Support Graphing of Long (a.k.a. Tall) and Wide Format Time Series Dataframes

Use Case

Currently our SQL plugins derive things from the special names of the columns. Instead we can use more conventional Long and Wide formats and derive the format from the schema. This will result in a more conventional SQL experience for users. The amount of code plugin authors need to create will be reduced.

Taking a MySQL Example:

SELECT
  $__timeGroup(createdAt,'$summarize') as time_sec,
  avg(value) as value,
  hostname as metric
FROM 
  grafana_metric
WHERE
  $__timeFilter(createdAt) AND
  measurement = 'logins.count' AND
  hostname IN($host)
GROUP BY 1, 3
ORDER BY 1

Which becomes:

SELECT
  UNIX_TIMESTAMP(createdAt) DIV 60 * 60 as time_sec,
  avg(value) as value,
  hostname as metric
FROM 
  grafana_metric
WHERE
  createdAt BETWEEN FROM_UNIXTIME(1580908377) AND FROM_UNIXTIME(1580985308) AND
  measurement = 'logins.count' AND
  hostname IN('10.1.100.1','10.1.100.10','server1','server2')
GROUP BY 1, 3
ORDER BY 1

Produces Long Format Data:

"Time";"Metric";"Value"
"2020-02-06 10:35:00";"server2";"445.36"
"2020-02-06 10:35:00";"server1";"445.85"
"2020-02-06 10:35:00";"10.1.100.10";"445.64"
"2020-02-06 10:35:00";"10.1.100.1";"445.59"
"2020-02-06 10:34:00";"server2";"446.28"
"2020-02-06 10:34:00";"server1";"446.25"
"2020-02-06 10:34:00";"10.1.100.10";"446.56"
"2020-02-06 10:34:00";"10.1.100.1";"446.72"

In order to minimally impact existing queries, we can update timeGroup to select time column. Users will still have to update when not using macros. Will also need to make sure we can always select into time objects from our all of our sql datasources to have a proper schema.

So the SQL query becomes the following (having sql create time):

SELECT
  FROM_UNIXTIME(UNIX_TIMESTAMP(createdAt) DIV 60 * 60) as time_sec,
  avg(value) as value,
  hostname as metric
FROM 
  grafana_metric
WHERE
  createdAt BETWEEN FROM_UNIXTIME(1580908377) AND FROM_UNIXTIME(1580985308) AND
  measurement = 'logins.count' AND
  hostname IN('10.1.100.1','10.1.100.10','server1','server2')
GROUP BY 1, 3
ORDER BY 1

Distinguishing Aspects Formats

  • Both:
    • Have Time column
    • Have one or more Numeric Field
  • Long
    • Will have String Fields (in Result)
    • Time will be repeated (Not detectable in schema, may be sparse)
    • Have no Labels on the Dataframe's Fields.
  • Wide
    • Will Not have string Fields
    • Time (should) not be repeated

How Long Format Time Series format is interpreted

Reference from azure data explorer datasource: https://github.com/grafana/azure-data-explorer-datasource#time-series-queries

  • Has one Time column.
  • Number type columns are "Metrics". The name of the column is the name of the metric.
  • String type columns are Key/Value pair identifiers (a.k.a labels, tags). The column name is the identifer key, the value in each row is identifier Value.
  • Each unique time series in the response is the unique combination of metric and Key/Value pair identifiers.

How Wide Format Time Series format is interpret

  • Has one Time column.
  • Has one or more Number value columns. The Unique name of the column is the Metric.
  • If the column has a Labels property, they become Key/Value pair identifiers. And the Name does not have to be unique if Name+Labels is unique.

Misc notes and Reference

  • Alerting and GEL will need same interpretation ability.
  • There are some other types to support. Both those involve nested arrays most likely and dataframe does have those yet.
  • Wide vs Long is determined by format of query result, not format of data in database.
  • There may be cases where there is a second time column, and maybe that should be treated as a graph-able value.
  • We should not force these schemas to always be time series. Assumption is the context is one of time series.
  • A reference on wide vs long: https://docs.timescale.com/latest/introduction/data-model
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment