@wosc
Created July 8, 2021 07:49

Creating opentelemetry tracing data from an haproxy access log

Approach

Unfortunately, otelcol currently has no receiver for logfiles that produces tracing data. There is log tailing functionality, and e.g. a fluent forward protocol receiver, but they all create "log" data, not tracing data. I've seen this proof of concept to implement a forward receiver that creates tracing data, but it seems to have no traction and no relation to the upstream project at all (not even a shared git history).

Thus, we've settled on this approach:

  • Format the haproxy log as JSON, using the zipkin format
  • Read the log with fluentbit and send it via HTTP to the opentelemetry collector's zipkin receiver

As far as I can tell, this is the only combination of fluentbit output and otelcol receiver that matches. The only sensible fluentbit output for our purpose is HTTP, and the only otelcol receivers for plaintext over HTTP I can see are otlp and zipkin (everything else is binary: grpc/thrift/protobuf/etc). The fluentbit HTTP JSON format is fixed to a "list of records", [{}, {}], which luckily is exactly what zipkin expects. (otlp json is {"resourceSpans": [{},{}]}, i.e. a toplevel dict and not a list, see here and here.)
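To illustrate the two shapes (span fields elided, values invented): zipkin v2 accepts a toplevel JSON list,

  [{"traceId": "...", "id": "...", "name": "...", "timestamp": 1625730549123000, "duration": 37000}, {...}]

which is exactly the shape fluentbit's HTTP output produces, while otlp json wraps everything in a toplevel object,

  {"resourceSpans": [{...}, {...}]}

which fluentbit cannot produce.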

zipkin format implementation notes

timestamp

The fluentbit HTTP output can format the timestamp as either iso8601 or double, but the zipkin protocol requires an int, which matches neither of those (and otelcol rejects them, I've tried). Thus we point fluentbit at a separate timestamp field (we called it fluentbit) and do not output the resulting fluentbit-internal timestamp at all (json_date_key false). Instead we generate and format a timestamp field which fluentbit just passes through, like any other data field.
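For illustration, a log line (as produced by the haproxy log-format below, values invented) then starts like this:

  {"timestamp":1625730549123,"fluentbit":"08/Jul/2021:07:49:09.123", ...}

fluentbit parses the fluentbit field as its internal time (Time_Key fluentbit in the parser below) and never outputs it (json_date_key false), while the timestamp field travels through untouched.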

For this timestamp field, zipkin requires microseconds since epoch, but haproxy offers milliseconds (via %Ts%ms). The haproxy log-format placeholders are not delimited, so we cannot simply append zeroes as %Ts%ms000. Sample fetch expressions are delimited (%[foo]), but they don't provide the request start time, which is the timestamp we need. We also cannot use the JSON number formatting hack (%Ts%ms.e3), as the zipkin protocol requires an int, not a float (and otelcol rejects it). Thus we use a fluentbit lua filter to multiply by 1000, which seems to be the only way in fluentbit to modify a record value.
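The filter itself is tiny (it reappears as a comment in the fluentbit config below):

  -- haproxy logged %Ts%ms, i.e. milliseconds since epoch;
  -- zipkin wants microseconds, so scale by 1000.
  function add_microseconds(tag, timestamp, record)
      record["timestamp"] = record["timestamp"] * 1000
      -- return code 1 means: the record was modified, keep it
      return 1, timestamp, record
  end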

duration

zipkin requires microseconds, but haproxy's "total request time" %Ta is in milliseconds. Interestingly enough, here the JSON hack works (%Ta.e3); I'm not sure why otelcol accepts this float as an int, but I sure am happy, since this is way less cumbersome than the lua route.
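A worked example: a request with %Ta = 37 (milliseconds) renders as "duration":37.e3, which a lenient JSON parser reads as the float 37e3 = 37000, exactly the microsecond value zipkin wants.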

Head-based sampling

This concrete setup assumes that haproxy is not the first system in the request chain. It thus relies on the w3c traceparent header to propagate both the trace-id and the sampling decision (head-based sampling). Doing it this way is of course not at all necessary, you could just as well generate the trace-id in haproxy; I mention it only to make the context clear.

We (ab)use the fact that the trace-flags field is currently specified as exactly "01" or "00" for "should record" or "should not", and write it directly into the sampling.priority span attribute, where the otelcol probabilistic_sampler processor expects it (and it recognizes int/float/string values there, so this works directly).
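For example, an upstream request that was sampled arrives with a header like this (ids taken from the w3c spec's example):

  traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

The regsub expressions in the haproxy config below split this into txn.trace_id, txn.parent_id and txn.trace_flags, and the trailing "01" ends up as the sampling.priority tag.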

Also related to head-based sampling: there is no standardization yet for specifying the sampling rate. We're using https://honeycomb.io/ as storage, which uses a custom span attribute SampleRate (case-sensitive) to transmit it (see e.g. here). So we are (ab?)using the w3c tracestate header to propagate this (because opentelemetry implementations already propagate that header for us), combined with custom code/config in each system to parse the header and create the span attribute.
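In our case the header looks like this (the zon vendor key is ours, the value is the samplerate set in the VCL below):

  tracestate: zon=SampleRate:20

i.e. each recorded trace stands for 20 real requests; haproxy extracts the number into txn.samplerate and logs it as the SampleRate tag.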

# For completeness, this is our head-based sampling implementation, which runs on the Fastly CDN.
sub set_traceparent {
  declare local var.trace_id STRING;
  declare local var.span_id STRING;
  declare local var.trace_flags STRING;
  set var.trace_id = randomstr(32, "1234567890abcdef");
  set var.span_id = randomstr(16, "1234567890abcdef");
  set var.trace_flags = "00";
  # Head-based sampling.
  # 16**32 is too big for ints, but assuming each character is equally likely,
  # we can just as well look at only e.g. the last two characters to apply the samplerate.
  declare local var.samplerate INTEGER;
  set var.samplerate = 20;
  declare local var.trace_hash INTEGER;
  set var.trace_hash = std.strtol(substr(var.trace_id, -2), 16);
  set var.trace_hash %= var.samplerate;
  if (var.trace_hash == 0) {
    set var.trace_flags = "01";
  }
  set req.http.traceparent = "00-" var.trace_id "-" var.span_id "-" var.trace_flags;
  set req.http.tracestate = "zon=SampleRate:" var.samplerate;
}

sub vcl_recv {
  # Only create a traceparent at the edge, on the first pass.
  if (fastly.ff.visits_this_service == 0 && req.restarts == 0) {
    call set_traceparent;
  }
}
# fluentbit insists on splitting its configuration into different files;
# for brevity, I've inlined the snippets here as comments instead.
[SERVICE]
    daemon Off
    flush 5
    log_level info
    parsers_file /etc/td-agent-bit/zon-parsers.conf
    # [PARSER]
    #     Name zon-haproxy-json
    #     Format json
    #     Time_Key fluentbit
    #     Time_Format %d/%b/%Y:%H:%M:%S.%L

[INPUT]
    Name tail
    Path /var/log/haproxy.log
    Parser zon-haproxy-json
    Tag zon-haproxy
    DB /var/lib/fluent-bit-haproxy.sqlite
    # Sync is expensive, and we don't care that much about the logfile location
    DB.Sync Off
    Mem_Buf_Limit 200MB

[FILTER]
    Name throttle
    Match *
    Rate 800
    Window 300
    Interval 1s

[FILTER]
    Name lua
    Match zon-haproxy
    Script /etc/td-agent-bit/haproxy.lua
    Call add_microseconds
    # function add_microseconds(tag, timestamp, record)
    #     record["timestamp"] = record["timestamp"] * 1000
    #     return 1, timestamp, record
    # end

[OUTPUT]
    Name http
    Match zon-haproxy
    host localhost
    port 9411
    uri /api/v2/spans
    format json
    json_date_key false
# haproxy configuration
global
    log 127.0.0.1:514 len 8192 local0 info
    # Note: For the simplest setup, point haproxy directly to a fluentbit syslog input <https://docs.fluentbit.io/manual/v/1.7/pipeline/inputs/syslog>
    # If you have a more classic setup that uses e.g. rsyslogd to write logfiles (which fluentbit then tails),
    # you have to remove the typical syslog prefix `Jul 8 06:25:01 myhostname haproxy[1477]:` and write only JSON lines to the file.
    # On Ubuntu, this can be done via an /etc/rsyslog.d/49-haproxy.conf like this:
    #   module(load="imudp")
    #   input(type="imudp" port="514")
    #   template(name="onlymsg" type="string" string="%msg%\n")
    #   local0.* action(type="omfile" file="/var/log/haproxy.log" template="onlymsg")
    #   & ~

defaults
    mode http
    option httplog
    log-format '{"timestamp":%Ts%ms,"fluentbit":"%tr","duration":%Ta.e3,"tags":{"SampleRate":"%[var(txn.samplerate)]","sampling.priority":"%[var(txn.trace_flags)]","haproxy.backend_server":"%s","haproxy.tq":"%TR","haproxy.tw":"%Tw","haproxy.tc":"%Tc","haproxy.tr":"%Tr","haproxy.tt":"%Ta","haproxy.t_state":"%tsc","haproxy.actconn":"%ac","haproxy.feconn":"%fc","haproxy.beconn":"%bc","haproxy.srv_conn":"%sc","haproxy.retries":"%rc","haproxy.srv_queue":"%sq","haproxy.backend_queue":"%bq","http.host":"%[capture.req.hdr(0),json(utf8s)]","http.user_agent":"%[capture.req.hdr(1),json(utf8s)]","http.referrer":"%[capture.req.hdr(2),json(utf8s)]","http.client_ip":"%ci","http.url":"%HU","http.method":"%HM","http.status_code":"%ST","http.request_content_length":"%U","http.response_content_length":"%B","service.instance.id":"myhostname.example.com"},"localEndpoint":{"serviceName":"haproxy"},"traceId":"%[var(txn.trace_id)]","parentId":"%[var(txn.parent_id)]","id":"%[var(txn.span_id)]","kind":"SERVER"}'
    # The ordering of these captures is relevant; the log-format and the rules below reference them by id.
    capture request header Host len 30 # id 0
    capture request header User-Agent len 255 # id 1
    capture request header Referer len 255 # id 2
    capture request header traceparent len 255 # id 3
    capture request header tracestate len 255 # id 4
    # Parse w3c traceparent
    http-request set-var(txn.trace_id) capture.req.hdr(3),regsub(\"^00-([^-]+)-([^-]+)-([0-9]+)$\",\"\1\")
    http-request set-var(txn.parent_id) capture.req.hdr(3),regsub(\"^00-([^-]+)-([^-]+)-([0-9]+)$\",\"\2\")
    http-request set-var(txn.trace_flags) capture.req.hdr(3),regsub(\"^00-([^-]+)-([^-]+)-([0-9]+)$\",\"\3\")
    # Parse custom w3c tracestate (see README)
    http-request set-var(txn.samplerate) capture.req.hdr(4),regsub(\"^.*zon=SampleRate:([0-9.]+).*$\",\"\1\")
    # Propagate traceparent with our own span_id.
    # Kludgy way of using only 16 chars of a 32-char uuid
    http-request set-var(txn.span_id) uuid(4),regsub(\"^([^-]+)-([^-]+)-([^-]+)-([^-]+)-([^-]+)$\",\"\1\2\3\")
    http-request set-header traceparent 00-%[var(txn.trace_id)]-%[var(txn.span_id)]-%[var(txn.trace_flags)]
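Putting it all together, a rendered log line might look like this (abbreviated, values invented):

  {"timestamp":1625730549123,"fluentbit":"08/Jul/2021:07:49:09.123","duration":37.e3,"tags":{"SampleRate":"20","sampling.priority":"01","http.status_code":"200", ...},"localEndpoint":{"serviceName":"haproxy"},"traceId":"4bf92f3577b34da6a3ce929d0e0e4736","parentId":"00f067aa0ba902b7","id":"a1b2c3d4e5f60718","kind":"SERVER"}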
# opentelemetry collector configuration
service:
  pipelines:
    traces:
      receivers: [zipkin]
      processors: [probabilistic_sampler]
      exporters: [otlp]

receivers:
  zipkin: # port 9411

processors:
  probabilistic_sampler:
    # Only evaluates the `sampling.priority` span attribute here; we rely on traceparent flags for sampling (see README)
    sampling_percentage: 100

exporters:
  otlp:
    endpoint: "api.honeycomb.io:443"
    headers:
      "x-honeycomb-team": "YOUR-API-KEY"
      "x-honeycomb-dataset": "YOUR-DATASET-NAME"
@elbaldfun commented:

Hi wosc, I am very happy to see your great solution. I ran into the same question: how to collect request tracing info for requests that pass through haproxy. I found the Fastly VCL in your solution; could you explain it? How is it used with a service, and where should the config file be placed? Thanks a lot.