Skip to content

Instantly share code, notes, and snippets.

@msavy
Last active June 22, 2022 17:01
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save msavy/252da38147baae5bafda897a5d76e246 to your computer and use it in GitHub Desktop.
Save msavy/252da38147baae5bafda897a5d76e246 to your computer and use it in GitHub Desktop.

Apiman metrics can collect arbitrary headers (and query params) as per user configuration (once the relevant PR is merged, anyway!).

There are also a bunch of standard metrics that are always collected, so those are simple (i.e. always non-null columns)

For example, let's imagine the user decides they want to capture the X-Request-Id header in the request and response, and X-Foo in only the request.

In standard InfluxDB/QuestDB you can just dynamically add and remove columns for each one with some reasonable name.

Now they get something like:

headers.request.foo headers.request.x-request-id headers.response.x-request-id lots of other fixed columns
Hello Bazzy Foo (not important for this)
Hello2 Something Great! ...

At some future point, the user decides they want to stop capturing x-request-id, but continue to capture x-foo.

Now (AIUI) for future requests you end up in the database with...

headers.request.foo headers.request.x-request-id headers.response.x-request-id lots of other fixed columns
Hello3 NULL NULL
Hello4 NULL NULL

Another approach -- although it appears it is not supported from what I can understand from the QuestDB docs -- is that you are allowed to have a list of tuples as a value in a column; it is a kind of correlated sub-document, or reference to another table (not sure what the correct name is in QuestDB parlance). I don't know if this is allowed in InfluxDB either - but wanted to know if it's allowed in QuestDB and if it's a good idea.

So you end up with something like ....

headers.request headers.response lots of other fixed columns
[(x-request-id, Bazzy), (x-foo, Hello)] [(x-request-id, Foo)] (not important for this

And when you decide to stop capturing X-Request-Id in the future, it's just something like this.

headers.request headers.response lots of other fixed columns
[(x-foo, Hello)] [] (not important for this)

Which avoids all of the null columns.

From the perspective of Apiman writing into the database, both are equally easy -- but I wanted to know whether one particular approach or another is better for users of that data (and the DB itself) by avoiding all of the null columns when dynamically growing and shrinking the fields captured.

(excluded types, etc, to keep noise down)

@jerrinot
Copy link

hi, thanks for this write-up, it explains it very clearly!

I'm afraid there is no good answer at the moment.

In the first case, QuestDB will behave as you described with Influx: When your columns change in time then you will accumulate "null baggage". QuestDB will create column files which will contain mostly nulls.

The workaround with tuples does not sound practical: QuestDB does not have a concept of tuples. You could store string tuples as regular strings, but QuestDB would have no visibility into it. That would make further data processing more complicated as it would force you to use various string functions to get individual headers back. So that's not great either.

I see 2 possible solutions in QuestDB:

  1. Partition-Local ColumnTops: When a new column is added then we save a current RowId into a table metadata. This tells us that all rows before that RowId have this column set to null and we don't have to physically recreate column files for all past rows. This behaviour is already implemented. Perhaps we could extend it further: When a new (time-)partition is created then we could postpone the creation of a columnar file until a first non-null value is inserted into a given column. This way we wouldn't need to actively write null values until a row with a non-null value is inserted into a partition. I can imagine complications with indexes, out-of-order insertions, etc. But it theory, it should be possible. The question is how much cost & risk it is vs. benefits.

  2. Data compression: If a column file contains just nulls then it should be pretty trivial to compress it. Data compression is already being discussed. So that could be a solution too.

@msavy
Copy link
Author

msavy commented Jun 22, 2022

Thanks so much for this. Sounds like just accepting the NULL baggage (great phrase) is the way to go for now, with some optimisations coming down the pipeline for that.

In Elasticsearch people like to do stuff such as "header.*.x-correlation.id="

@bluestreak01
Copy link

What about deformalizing storage altogether? E.g. store something like:

header, value, timestamp

?

You can lean on JIT to search for things quickly and minimize storage.

@msavy
Copy link
Author

msavy commented Jun 22, 2022

@bluestreak01 Is your idea that the headers are stored in a separate table and referenced by a common timestamp/ID. Or that we repeat the other 25 or so standard columns of data until we've covered every header in a kind of cartesian product?

e.g.

# request header x-foo=hello (end of line)
apiman_metrics,generator=apiman-gateway,apiOrgId="TestOrg",apiId="TestApi",apiVersion="1.0",planId="Gold",clientOrgId="TestOrg",clientId="TestClient",clientVersion="1.0",contractId="12345",user="user1" requestStart=1455096600000,requestEnd=1455096600300,apiStart=1455096600000,apiEnd=1455096600300,url="http://localhost:8080/test/1",resource="/test/1",method="GET",responseCode=200,responseMessage="OK",failureCode=0,error=false,header="x-foo",headerType="request",value="hello" 1655916938120


# request header x-correlation-id=bazzy  (end of line)
apiman_metrics,generator=apiman-gateway,apiOrgId="TestOrg",apiId="TestApi",apiVersion="1.0",planId="Gold",clientOrgId="TestOrg",clientId="TestClient",clientVersion="1.0",contractId="12345",user="user1" requestStart=1455096600000,requestEnd=1455096600300,apiStart=1455096600000,apiEnd=1455096600300,url="http://localhost:8080/test/1",resource="/test/1",method="GET",responseCode=200,responseMessage="OK",failureCode=0,error=false,header="x-correlation-id",headerType="request",value="bazzy" 1655916938120


# response header x-correlation-id=foo  (end of line)
apiman_metrics,generator=apiman-gateway,apiOrgId="TestOrg",apiId="TestApi",apiVersion="1.0",planId="Gold",clientOrgId="TestOrg",clientId="TestClient",clientVersion="1.0",contractId="12345",user="user1" requestStart=1455096600000,requestEnd=1455096600300,apiStart=1455096600000,apiEnd=1455096600300,url="http://localhost:8080/test/1",resource="/test/1",method="GET",responseCode=200,responseMessage="OK",failureCode=0,error=false,header="x-correlation-id",headerType="response",value="foo" 1655916938120

etc, etc

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment