Apiman metrics can collect arbitrary headers (and query params) as per user configuration (once the relevant PR is merged, anyway!).
There are also a bunch of standard metrics that are always collected, so those are simple (i.e. always non-null columns)
For example, let's imagine the user decides they want to capture the X-Request-Id
header in the request and response, and X-Foo
in only the request.
In standard InfluxDB/QuestDB you can just dynamically add and remove columns for each one with some reasonable name.
Now they get something like:
headers.request.foo | headers.request.x-request-id | headers.response.x-request-id | lots of other fixed columns |
---|---|---|---|
Hello | Bazzy | Foo | (not important for this) |
Hello2 | Something | Great! | ... |
At some future point, the user decides they want to stop capturing x-request-id
, but continue to capture x-foo
.
Now (AIUI) for future requests you end up in the database with...
headers.request.foo | headers.request.x-request-id | headers.response.x-request-id | lots of other fixed columns |
---|---|---|---|
Hello3 | NULL | NULL | |
Hello4 | NULL | NULL |
Another approach -- although it appears it is not supported from what I can understand from the QuestDB docs -- is that you are allowed to have a list of tuples as a value in a column; it is a kind of correlated sub-document, or reference to another table (not sure what the correct name is in QuestDB parlance). I don't know if this is allowed in InfluxDB either - but wanted to know if it's allowed in QuestDB and if it's a good idea.
So you end up with something like ....
headers.request | headers.response | lots of other fixed columns |
---|---|---|
[(x-request-id, Bazzy), (x-foo, Hello)] |
[(x-request-id, Foo)] |
(not important for this |
And when you decide to stop capturing X-Request-Id
in the future, it's just something like this.
headers.request | headers.response | lots of other fixed columns |
---|---|---|
[(x-foo, Hello)] |
[] |
(not important for this) |
Which avoids all of the null columns.
From the perspective of Apiman writing into the database, both are equally easy -- but I wanted to know whether one particular approach or another is better for users of that data (and the DB itself) by avoiding all of the null columns when dynamically growing and shrinking the fields captured.
(excluded types, etc, to keep noise down)
hi, thanks for this write-up, it explains it very clearly!
I'm afraid there is no good answer at the moment.
In the first case, QuestDB will behave as you described with Influx: When your columns change in time then you will accumulate "null baggage". QuestDB will create column files which will contain mostly nulls.
The workaround with tuples does not sound practical: QuestDB does not have a concept of tuples. You could store string tuples as regular strings, but QuestDB would have no visibility into it. That would make further data processing more complicated as it would force you to use various string functions to get individual headers back. So that's not great either.
I see 2 possible solutions in QuestDB:
Partition-Local ColumnTops: When a new column is added then we save a current
RowId
into a table metadata. This tells us that all rows before that RowId have this column set to null and we don't have to physically recreate column files for all past rows. This behaviour is already implemented. Perhaps we could extend it further: When a new (time-)partition is created then we could postpone the creation of a columnar file until a first non-null value is inserted into a given column. This way we wouldn't need to actively write null values until a row with a non-null value is inserted into a partition. I can imagine complications with indexes, out-of-order insertions, etc. But it theory, it should be possible. The question is how much cost & risk it is vs. benefits.Data compression: If a column file contains just nulls then it should be pretty trivial to compress it. Data compression is already being discussed. So that could be a solution too.