Debugging the RSW with `.otlp.jsondn.gz` files

This is a short guide on working with the .otlp.jsondn.gz files you'll find in the artifacts directory of a TeamCity run after the TestWorkload test has run.

Tools

  1. DuckDB
  2. Go

Querying the Data

These files are produced by the OTLPFileExporter; its documentation will tell you all about the format.

At the time of writing, there's no off-the-shelf tooling that works with OTel's file protocol. You can technically use zcat and curl to send these archives into an OTLP processor, but the setup work vastly outweighs its usefulness. Your best bet is Jaeger's OTLP endpoint. However, OTLP introduced a breaking change to its .proto specs at some point: the binary encodings are backwards compatible, but the JSON format had a key renamed, so you'll have to do some post-processing to get this data into Jaeger. The most practical route is to use the version of the OTel SDK vendored in CRDB to unmarshal the JSON, re-serialize it as binary protobuf, and send it to Jaeger's gRPC OTLP endpoint. I don't have a script for doing so because I never found it to be useful.
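If you ever do need it, a rough, untested sketch might look like the following. It assumes Jaeger's OTLP gRPC receiver is listening on localhost:4317, that each line of the file is one JSON-encoded span batch with a top-level resource_spans key, and that the only JSON key affected by the rename is instrumentation_library_spans -> scope_spans; adjust as needed.

package main

import (
	"bufio"
	"bytes"
	"compress/gzip"
	"context"
	"log"
	"os"

	coltracepb "go.opentelemetry.io/proto/otlp/collector/trace/v1"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/protobuf/encoding/protojson"
)

func main() {
	// Usage: otlp-to-jaeger <file.otlp.jsondn.gz>
	f, err := os.Open(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	gz, err := gzip.NewReader(f)
	if err != nil {
		log.Fatal(err)
	}
	defer gz.Close()

	// localhost:4317 is an assumption; point this at Jaeger's OTLP gRPC port.
	conn, err := grpc.Dial("localhost:4317", grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	client := coltracepb.NewTraceServiceClient(conn)

	scanner := bufio.NewScanner(gz)
	scanner.Buffer(make([]byte, 0, 1024*1024), 16*1024*1024) // span batches can be long lines
	for scanner.Scan() {
		// Paper over the OTLP JSON breaking change by renaming the old key.
		line := bytes.ReplaceAll(scanner.Bytes(),
			[]byte(`"instrumentation_library_spans"`), []byte(`"scope_spans"`))

		var req coltracepb.ExportTraceServiceRequest
		if err := (protojson.UnmarshalOptions{DiscardUnknown: true}).Unmarshal(line, &req); err != nil {
			log.Fatal(err)
		}
		// The gRPC client re-serializes the request as binary protobuf,
		// which Jaeger's OTLP receiver understands.
		if _, err := client.Export(context.Background(), &req); err != nil {
			log.Fatal(err)
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}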

The OTLP format is heavily nested, as it's meant to be sent in batches over the wire, which makes it annoying to work with. otlp-duck.fish is a fish script that unnests the data into a single spans table, which is much nicer to poke around in. DuckDB has built-in support for decompressing and reading gzipped ndjson files; just make sure you're using a recent version. I chose DuckDB because it's fast, supports ephemeral databases, and handles nested data better than anything else I've tried. If it's not your jam, you can just as easily pipe the data into anything else with zcat and/or jq.
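Once the table is loaded, queries along these lines are a reasonable starting point (illustrative only; the shape of the error and attributes columns comes from the otlp-duck.fish script below, and whether error-free spans carry an empty string or NULL depends on the exporter, so the filter hedges for both):

-- How often does each operation show up, and how many of them errored?
SELECT name, count(*) AS total,
       count(*) FILTER (WHERE error IS NOT NULL AND error <> '') AS errored
FROM spans
GROUP BY name
ORDER BY total DESC;

-- Pull the statement text and error message for the failing spans.
SELECT start, attributes->>'sql' AS stmt, error
FROM spans
WHERE error IS NOT NULL AND error <> ''
ORDER BY start;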

SQL Replay

Next we come to the Go script in this gist. It's an attempt to automatically reproduce errors discovered by the RSW. It currently uses the PID attribute to group queries by connection; as noted by Rafi, this is incorrect and another attribute will have to be added.

To start, you'll want to get DuckDB up and running with your data loaded into the spans table using the otlp-duck.fish script.
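Assuming you've saved the function at the bottom of this gist as otlp-duck.fish, that looks something like this (the artifact path is just an example):

source ./otlp-duck.fish
otlp-duck ./artifacts/trace.otlp.jsondn.gz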

From there, we'll make a CSV "script" that will be replayed.

I've been using this line of SQL to produce said script. It can probably be improved.

SELECT attributes->'pid', "end", regexp_replace(attributes->>'sql', '\s+', ' ', 'gs'), [x->'Value'->>'StringValue' for x in (attributes->'args'->'values')::JSON[]], error FROM spans

By default the DuckDB shell uses the duckbox output format and truncates results past 40 rows. You can use these "dot" commands to dump the results to a CSV file instead:

.mode csv
.out script.csv

Great! We now have a script to try to replay. You'll need two terminal panes next: one to run the Go script and the other to run a cockroach demo cluster.

Here's the command for running a demo cluster:

 ./cockroach-short demo --listening-url-file ./demo-url --nodes 3 --demo-locality=region=us-east1,az=1:region=us-east2,az=1:region=us-east3,az=1 --empty

I've been using demo because it can emulate the multi-region setup that TestWorkload uses. --listening-url-file just writes the connection URL to a file, which makes scripting easier. --empty does what it sounds like and skips loading the demo data; again, this is what TestWorkload expects.

Before running the replay script, we'll have to make the correct database:

CREATE DATABASE schemachange;

And finally, let's run our replay script (assuming you've go install'ed it):

sql-replay script.csv (cat demo-url | sed 's/defaultdb/schemachange/')

Note: sed is used to ensure that we connect to the schemachange database instead of defaultdb, which can otherwise cause errors during the replay.
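If you're in bash or zsh rather than fish, the equivalent command substitution would be something like:

sql-replay script.csv "$(sed 's/defaultdb/schemachange/' demo-url)"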

As discussed, this doesn't work great right now. There's a lot of non-determinism within the workload and the connection grouping isn't perfect either, though it gets further than you'd expect most of the time. With a bit of work and some leasing hackery to control schemachange jobs, we can probably get repros to work as expected!

package main

import (
	"bytes"
	"context"
	"encoding/csv"
	"encoding/json"
	"errors"
	"io"
	"log"
	"os"
	"regexp"
	"strings"
	"time"

	"github.com/jackc/pgx/v5"
	"github.com/jackc/pgx/v5/pgconn"
)

func Must(err error) {
	if err != nil {
		panic(err)
	}
}

func MustT[T any](r T, err error) T {
	Must(err)
	return r
}

// connections maps a connection ID (currently the pid attribute from the
// trace) to a live pgx connection, so statements from the same "session"
// replay on the same connection.
var connections = map[string]*pgx.Conn{}

func getConn(ctx context.Context, id string) *pgx.Conn {
	if conn, ok := connections[id]; ok {
		return conn
	}
	conn := MustT(pgx.Connect(ctx, os.Args[2]))
	connections[id] = conn
	return conn
}

func main() {
	ctx := context.Background()

	// Usage: sql-replay <script.csv> <connection-url>
	script := MustT(os.ReadFile(os.Args[1]))
	reader := csv.NewReader(bytes.NewReader(script))

	// Eat header.
	_, _ = reader.Read()

	var lastTS time.Time
	for i := 0; ; i++ {
		record, err := reader.Read()
		if err == io.EOF {
			return
		}
		Must(err)

		connID := record[0]
		ts := MustT(time.Parse("2006-01-02 15:04:05.999999", record[1]))
		sql := record[2]
		expectedErr := record[4]

		// The args column may contain a Go-formatted []string{...};
		// massage it into JSON before unmarshalling.
		var args []any
		if record[3] != "" {
			x := regexp.MustCompile(`\[\]string\{([^}]+)\}`).ReplaceAllString(record[3], "[$1]")
			Must(json.Unmarshal([]byte(x), &args))
		}

		conn := getConn(ctx, connID)

		// Preserve the original pacing between statements.
		if !lastTS.IsZero() {
			time.Sleep(ts.Sub(lastTS))
		}
		lastTS = ts

		log.Printf("%q: %q %v", connID, sql, args)

		rows, err := conn.Query(ctx, sql, args...)
		var out []map[string]any
		if err == nil {
			out, err = pgx.CollectRows(rows, pgx.RowToMap)
		}

		if err != nil {
			if expectedErr == "" {
				panic(err)
			}
			// Retry errors (SQLSTATE 40001) are inherently racy, so any 40001
			// counts as a match if the original run also hit one.
			var pgErr *pgconn.PgError
			errorsMatch := err.Error() == expectedErr
			isRetryErr := errors.As(err, &pgErr) && pgErr.Code == "40001" && strings.HasSuffix(expectedErr, "(SQLSTATE 40001)")
			if errorsMatch || isRetryErr {
				log.Printf("%s: %#v", connID, err)
			} else {
				log.Printf("%q: UNEXPECTED ERROR", connID)
				log.Printf("EXPECTED: %s", expectedErr)
				log.Printf("GOT: %#v", err)
				return
			}
		} else {
			if expectedErr != "" {
				log.Printf("%q: Expected %q got no error", connID, expectedErr)
				return
			}
			// jsonified := MustT(json.MarshalIndent(out, "", "\t"))
			jsonified := MustT(json.Marshal(out))
			log.Printf("%q: %s", connID, jsonified)
		}
	}
}
# Unnest an .otlp.jsondn.gz artifact into a single `spans` table and drop
# into a DuckDB shell.
function otlp-duck
    set filename $argv[1]
    set script "CREATE OR REPLACE TABLE spans AS
    WITH
        unwrapped_spans AS (
            SELECT unnest(spans.spans, recursive := true) FROM (SELECT unnest(instrumentation_library_spans) AS spans FROM ( SELECT unnest(resource_spans, recursive:=true) FROM read_ndjson_auto('$filename')))
        ), span_attrs AS (
            SELECT
                span_id,
                json_group_object(key, COALESCE(value->'StringValue', value->'BoolValue', value->'IntValue', value->'ArrayValue')) as attributes,
            FROM (SELECT span_id, key, Value::JSON as value FROM (SELECT span_id, unnest(attributes, max_depth:=3) FROM unwrapped_spans))
            GROUP BY span_id
        )
    SELECT
        trace_id,
        unwrapped_spans.span_id,
        parent_span_id,
        name,
        make_timestamp(start_time_unix_nano//1000) as start,
        make_timestamp(end_time_unix_nano//1000) as end,
        message as error,
        span_attrs.attributes,
    FROM unwrapped_spans JOIN span_attrs ON span_attrs.span_id = unwrapped_spans.span_id;"

    duckdb -init (echo $script | psub)
end