@jmacd
Last active November 2, 2022 16:00
Proposal: Open-source jmacd's experimental Go logging library

TL;DR jmacd left tens of thousands of lines of a hobby project sitting in /.../experimental/users/jmacd/logging/sloggo or some such path. Speaking as an open-source telemetry expert, this would be valuable to release as an experimental library, which jmacd is willing to demonstrate.

During the years 2014-2015 I had a "20%" project that did several interesting things. I was at the time working with Google Cloud Logging and Error Reporting teams, and I had previously been working in the base C++ logging library.

My efforts to promote structured logging inside Google during the years 2011-2013 had been an utter failure, and I had begun to think I understood why. Structured logging is not very useful by itself. Logs are too expensive to collect at scale, which is why we sample traces and aggregate metrics. Structured logs are potentially less expensive than text-formatted logs, but not enough to matter if you're still trying to collect them all. Sampling individual log messages doesn't help.

We had clear existence proofs of useful systems that were mostly constructed of logging events, but they didn't record 100%. The C++ /requestz servlet, which I had owned and maintained since Dapper, was the system I was studying. /requestz was a widely-used tracing interface and RPC debugging tool, and my goal with this logging library was to be able to write a query ("view") that would operate over log events and compile into a requestz implementation.

Formal influences

Systems like Fay were talking about injecting code to configure instrumentation.

Pivot tracing was published after I had done a lot of this work, but it explained what I was doing with this /requestz plan. You can see SQL-like queries that turn log events into metrics and spans and something about a HAPPENS BEFORE relation, which is simple to maintain inside a single process.

The idea was to implement a compiler from query into a /requestz-like servlet, which would execute the query but apply sampling to limit collection costs. You'd write one query to find the start of a span, one query to find the end of the span, join them and calculate latency, then group by latency bucket; meanwhile you'd join with events in the context.

Google's "Yedalog" system is a very large-scale system with capabilities like this. I wrote a query evaluator inspired by (but much more brute-force than) Magic sets and other strange ways to implement logic programs (Ullman, 1986) here.

Learning Go

Before joining Google I had wished to some day need to write a SQL parser and use it for something. The Go team had opened up "self-readability" to Googlers and I decided to learn Go by writing a SQL-like parser (at the outset of this project). When I had finished my draft, I was to file a ticket asking for a Go-team reviewer. Because I had used go-yacc to generate the grammar, which no one else wanted to touch, Rob Pike was assigned as my readability reviewer. (I would later pitch Pike on the assembled logging package, and he rejected this kind of complexity.)

I began writing a structured logging library after finishing the SQL parser, but I didn't begin connecting them until later. Into this library I distilled some of the findings from my work in the base C++ library, but I was learning Go in the process.

Structured log references

Census had a logging primitive that could record a few words of data in approximately 50 nanoseconds; the key was that it recorded only the variables, nothing of the static content. I wanted a logging format that could support this kind of performance and compression.

As an example, consider logging a stacktrace. Capturing a stacktrace from a typical runtime is roughly 10x cheaper than formatting the same information as text. My goal was that after the frames of a stacktrace had been logged once, each frame could be logged as a backward reference in the log.
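The frame-interning idea can be sketched in Go as a dictionary keyed by program counter; all names here are hypothetical, not the original library's API:

```go
package main

import (
	"fmt"
	"runtime"
)

// frameDict interns stack frames: the first time a program counter is
// seen, the full frame is recorded as a definition; afterwards only a
// small integer reference is emitted into the log.
type frameDict struct {
	refs map[uintptr]int
	defs []runtime.Frame
}

func newFrameDict() *frameDict {
	return &frameDict{refs: map[uintptr]int{}}
}

// internStack captures the caller's stack and returns one backward
// reference per frame, defining any frame not seen before.
func (d *frameDict) internStack() []int {
	pcs := make([]uintptr, 32)
	n := runtime.Callers(2, pcs) // skip Callers and internStack itself
	frames := runtime.CallersFrames(pcs[:n])
	var out []int
	for {
		f, more := frames.Next()
		ref, ok := d.refs[f.PC]
		if !ok {
			ref = len(d.defs)
			d.refs[f.PC] = ref
			d.defs = append(d.defs, f) // first use: log the definition
		}
		out = append(out, ref)
		if !more {
			break
		}
	}
	return out
}

func main() {
	d := newFrameDict()
	first := d.internStack()
	second := d.internStack() // frames shared with the first capture resolve to existing references
	fmt.Println(len(first) > 0, len(second) > 0)
}
```

On the second capture, frames above the call site are already in the dictionary, so only the changed leaf frame needs a new definition.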

Pattern extraction: printf() vs print()

From the beginning Go supported three variations on a print statement, Print(), Println(), and Printf(). I would prefer that logging APIs have a single primitive and that static messages be extracted from the source. I'd rather write Print("text", var1, "text", var2) than Printf("text %v text %v", var1, var2). Both logically have the same message format, but in Go there's a loss of information when using the non-Printf form.

This logging library would be able to capture and record the message format for Print statements. The per-event cost of a log statement is reduced to the cost of recording the call-site variables and backward reference to the statement definition.

The package includes tooling to extract log message format strings and generate a dictionary of statements at compile time.
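The division of labor between the compile-time dictionary and the runtime event might look like the following sketch; the `statement`, `event`, and `emit` names are hypothetical stand-ins for whatever the tooling would generate:

```go
package main

import "fmt"

// statement is one entry in the compile-time dictionary a tool could
// generate by scanning source for Print-style calls: the static text
// fragments, with gaps where the variables go.
type statement struct {
	id     int
	format []string
}

// event is what actually lands in the log: a backward reference to the
// statement plus only the call-site variables.
type event struct {
	stmt int
	args []any
}

var dict = []statement{
	{id: 0, format: []string{"connected to ", " in ", ""}},
}

func emit(stmtID int, args ...any) event {
	return event{stmt: stmtID, args: args}
}

// render re-interleaves static fragments and arguments, e.g. in a
// log browser, reconstructing the full message text.
func render(e event) string {
	s := dict[e.stmt]
	out := ""
	for i, frag := range s.format {
		out += frag
		if i < len(e.args) {
			out += fmt.Sprint(e.args[i])
		}
	}
	return out
}

func main() {
	e := emit(0, "db-7", "12ms")
	fmt.Println(render(e)) // connected to db-7 in 12ms
}
```

The event itself carries only `(0, "db-7", "12ms")`; everything static lives once in the dictionary.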

Short-circuit: safety mechanisms

There are several safety mechanisms built into the library worth mentioning, since most Go debugging libraries are a bit dangerous in this regard. I set the following requirements.

Requirement: Logging libraries must have a way to cheaply disable individual logging statements. For example, the library has to decide whether the statement will log before it captures a stacktrace, since the stacktrace is expensive to compute.
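The cheap-disable requirement amounts to checking a per-statement switch before any expensive work happens; a minimal sketch, with hypothetical names:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// stmtSwitch is a per-statement enable flag checked before any
// expensive work (stacktrace capture, formatting, allocation).
type stmtSwitch struct{ on atomic.Bool }

// Log takes the expensive work as a closure so that a disabled
// statement costs only one atomic load.
func (s *stmtSwitch) Log(expensive func() string) {
	if !s.on.Load() {
		return // disabled: the closure never runs
	}
	fmt.Println(expensive())
}

func main() {
	var s stmtSwitch
	calls := 0
	s.Log(func() string { calls++; return "captured" }) // disabled
	s.on.Store(true)
	s.Log(func() string { calls++; return "captured" }) // enabled
	fmt.Println(calls) // 1
}
```

Passing a closure (rather than eager arguments) is what keeps the disabled path cheap: the stacktrace is only computed after the decision to log.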

Requirement: Logging libraries must not deadlock themselves, for example when a synchronous log consumer calls back into the logging interface. This is tricky to prevent in Go (it can be detected when capturing the stacktrace).

Requirement: Logging libraries must not deadlock themselves, when for example a method used to format an object encounters infinite recursion. As a test for any logging library: what happens when you try to log a cyclic data structure? It's safe and correct in this library with the use of (intra-) log references.
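The cyclic-structure case can be handled the same way as stack frames: replace an already-seen object with an intra-log backward reference. A hedged sketch of the idea (the `node` and `encode` names are illustrative):

```go
package main

import "fmt"

type node struct {
	name string
	next *node
}

// encode walks a value, replacing any already-seen object with a
// backward reference instead of recursing forever.
func encode(n *node, seen map[*node]int, out *[]string) int {
	if ref, ok := seen[n]; ok {
		return ref // intra-log reference breaks the cycle
	}
	ref := len(*out)
	seen[n] = ref
	*out = append(*out, "") // reserve this entry's slot
	entry := fmt.Sprintf("#%d name=%s", ref, n.name)
	if n.next != nil {
		entry += fmt.Sprintf(" next=#%d", encode(n.next, seen, out))
	}
	(*out)[ref] = entry
	return ref
}

func main() {
	a := &node{name: "a"}
	b := &node{name: "b", next: a}
	a.next = b // cycle: a -> b -> a
	var out []string
	encode(a, map[*node]int{}, &out)
	for _, e := range out {
		fmt.Println(e)
	}
	// #0 name=a next=#1
	// #1 name=b next=#0
}
```

Logging the cycle terminates with two entries that reference each other, rather than recursing until the stack overflows.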

Log yourself: object formatting

A structured logging API should have a way to encode structured data, including with annotations. Similar to MarshalJSON for JSON encoding of custom types, the library supports a custom logging method for types to define how they are logged. This is just a recursive form of the logging API, with support for adding key-value attributes to the data.

A log statement like

  log.With(...).Log("message: ", property.Value(variable))

places a variable into the log, associated with property, which is a defined attribute key. The type of variable might define a logging method:

func (o *Object) Log(logger Logger) {
   logger.Log("an object named %v with %v", o.name, otherProperty.Value(o.value))
}

which defines the value in terms of a recursive log statement. In the example here, the encoded event would include associations for property and otherProperty.

Labels, attributes, and structured objects

In OTel today we are using "label" and "attribute" as synonyms, but I was looking for two concepts and I chose to use "label" and "association", both of which are attributes in the OTel sense. The labels of a log event are attributes that identify and describe the event itself, whereas the associations are attributes describing the data contained within the log.

Labels are a unique map of properties about the log event (source location, severity level, ...), whereas associations are a list of key-values associated with the log payload (and stored "to the side"). The structure of a log event output by the library:

  1. Timestamp: obvious meaning
  2. Sequence: an ascending event counter
  3. Message: the message format of the event
  4. Labels: list of (attribute-key, value) about the event itself
  5. Args: list of variables with the event
  6. Associations: list of (attribute-key, value) about the args
  7. Definitions: first-encoded value of entities in the log.
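That layout could be rendered in Go roughly as follows; the struct and field names are my hypothetical reconstruction, mirroring the list rather than the original library:

```go
package main

import (
	"fmt"
	"time"
)

// KeyValue is one (attribute-key, value) pair.
type KeyValue struct {
	Key   string
	Value any
}

// Event mirrors the seven-part log event structure described above.
type Event struct {
	Timestamp    time.Time
	Sequence     uint64     // ascending per-process event counter
	Message      int        // backward reference to the statement definition
	Labels       []KeyValue // about the event itself
	Args         []any      // call-site variables
	Associations []KeyValue // about the args
	Definitions  []any      // first-encoded entities, emitted before first use
}

func main() {
	e := Event{
		Timestamp: time.Now(),
		Sequence:  1,
		Labels:    []KeyValue{{Key: "log.severity", Value: "INFO"}},
	}
	fmt.Println(e.Sequence, e.Labels[0].Key)
}
```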

Log references could be used in any of the labels or associations. For example:

  • a label value refers back to a severity-level definition (about the event)
  • a label value contains a list of references to stack frames (about the event)
  • an association value refers to the 1st argument (e.g., to note a corrupt shard)
  • an association value refers to the account/username path of the 2nd argument (e.g., to note PII).

Namespaces

Attribute (label and association) keys are universal. Some namespaces are reserved for the system, like a log. prefix for conventional logging attributes and an rpc. prefix for conventional RPC attributes.

Entities

I've described references within a log, but not how to create entities in the log. Entities are identified by a single attribute key-value.

I found two categories of entity: named and anonymous. Named entities can be global and can be instantiated by multiple parties (with conflicting definitions!).

Anonymous entities are created in a specific log and are defined once.

A log severity level is an example of a named entity. The entity namespace and name in this case may be represented as log.severity=INFO for INFO severity. This entity may be used as a label on a log event, relating it to the definition of INFO that was given when the entity was created. Now a log event can be marked "INFO", where INFO is just an entity defined earlier in the log.

Self-describing format

The format is self-describing; only the log.relation entity is built in, from which new relations are created. For example, there is an entry for log.relation=log.severity explaining what log severity is. Each entity definition is logged before it is first used.

This means the initial log event in a log will bootstrap itself, providing all its definitions.
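The define-before-first-use discipline for entities can be sketched as a small sink that remembers what it has already defined; `entity` and `logSink` are hypothetical names:

```go
package main

import "fmt"

// entity is identified by a single attribute key-value pair.
type entity struct {
	key, value string
}

// logSink emits each entity's definition before its first use, so
// every log is self-describing from its initial event onward.
type logSink struct {
	defined map[entity]bool
}

func (s *logSink) use(e entity) {
	if !s.defined[e] {
		fmt.Printf("define %s=%s\n", e.key, e.value) // first use only
		s.defined[e] = true
	}
	fmt.Printf("label %s=%s\n", e.key, e.value)
}

func main() {
	s := &logSink{defined: map[entity]bool{}}
	info := entity{"log.severity", "INFO"}
	s.use(info) // emits the definition, then the label
	s.use(info) // already defined: just the label
}
```

A fresh reader of the log stream encounters `define log.severity=INFO` before any event labeled with it, which is what makes the bootstrap work.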

Log identification

The system uses a semantic convention for process identity. The log.process entity must be defined and used as a label on every log event emitted by a process. The process is required to manage the sequence number of its logs, so that we can determine HAPPENS BEFORE within a single process log.

A log process is an anonymous entity. It's defined by its attributes, which are logged when it is created. So the initial log event contains a process definition, and every event is labeled by that process entity.

Applications

I had built three demos to explore the benefits of all this complexity.

Structured log browser

This application is written using Go HTML templates. I wanted to be able to annotate data in the logs API as described above, and then write corresponding templates to display the data. I had written example templates to structure individual log messages and other known schemas, like the stacktrace.

I had been working with the ACRA library to report Android crashes to Cloud Logging, and considered it a good example of a structured log. That format includes a screenshot. I wanted to be able to log a screenshot and have it display in the logs browser. There's a template that recognizes a content-type association and uses it to correctly tag image data in the browser.

This browser correctly expands log references (until they become cyclic), built on a stateful reader which remembers the definitions it has seen.

The logs browser has two modes, one chronological and one in the frequency domain (like ecatcher). Click to add a filter. Pivot between chronological and frequency displays.

RPC logging conventions

When a client connects to a server, it sends its peer a "global" log reference which is a reference to its log event qualified with the unique process identity. A logs receiver that receives the logs from both processes will be able to link those references.

A structured log browser could follow the logging context from the client to the server and back. This may sound like joining logs by a trace_id--the outcome is similar--but the information density is higher in the log. The log reference exchanged between RPC peers refers to the point in the log where each transacted with the other; you could use that log reference to discover details about the process that were recorded once when the process started (instead of once per span or trace record).
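A "global" log reference is then just a local log position qualified by process identity; a minimal hypothetical shape:

```go
package main

import "fmt"

// GlobalRef qualifies a position in one process's log with that
// process's unique identity, making the reference meaningful to a
// peer (and to a combined logs receiver). Names are illustrative.
type GlobalRef struct {
	Process  string // unique process entity, e.g. minted at startup
	Sequence uint64 // position in that process's log
}

func main() {
	// The client includes its reference in the RPC; the server logs
	// it, so a receiver holding both logs can link them at this point.
	clientRef := GlobalRef{Process: "proc-1f3a", Sequence: 4012}
	fmt.Printf("%s@%d\n", clientRef.Process, clientRef.Sequence)
}
```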

/requestz proof-of-concept

I was not finished with this.

I had compiled the SQL expression into a logic program which operated over logs, and I had unit-tested the SQL parser and the logic expressions. I had written a test program that would re-process a log file and apply the logic program.

I don't remember exactly what was working as I departed Google. I seem to recall an expression like this working:

  SELECT end.time - start.time AS latency
  FROM * AS start, * AS end
  WHERE start.ctxid == end.ctxid AND start HAPPENS BEFORE end;

Note that start.ctxid == end.ctxid has a special relationship with HAPPENS BEFORE, the same one exploited by /requestz. The "ctxid" and HAPPENS BEFORE expressions disappear because the query is evaluated in a sequential context, and sampling can be added to this query in the same way /requestz does.
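Because the log is processed in order, that query reduces to a single pass with a map keyed by ctxid; a hedged Go sketch of the evaluation (the event shape is invented for illustration):

```go
package main

import "fmt"

// event is a minimal stand-in for a parsed log record.
type event struct {
	ctxid string
	time  int64 // arbitrary time units
	kind  string
}

// latencies evaluates "end.time - start.time" over a sequential log.
// HAPPENS BEFORE and the ctxid join disappear into a map lookup
// because events arrive in order within a single process log.
func latencies(events []event) map[string]int64 {
	starts := map[string]int64{}
	out := map[string]int64{}
	for _, e := range events {
		switch e.kind {
		case "start":
			starts[e.ctxid] = e.time
		case "end":
			if t, ok := starts[e.ctxid]; ok {
				out[e.ctxid] = e.time - t
			}
		}
	}
	return out
}

func main() {
	events := []event{
		{"req-1", 10, "start"},
		{"req-2", 12, "start"},
		{"req-1", 25, "end"},
	}
	fmt.Println(latencies(events)["req-1"]) // 15
}
```

Sampling can be added at the point where a completed (start, end) pair is produced, which is where /requestz limited its collection costs.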

Why is this interesting to OpenTelemetry?

Tigran Najaryan has proposed a data model for logging about Resources.

It proposes that you could gather a series of ordered log events with simple semantics and produce a new signal from them. I want to ask some of the questions that arise naturally to me: can we define spans as a collection of event logs?

The thing that /requestz did really well, that Dapper never could, is deal with indefinite spans. If we can formalize a model for deriving Resources from log events, can we derive a model for monitoring indefinite-lifetime spans? (We should be able to define a span with a cutoff of 1 minute, for example, with a slight change of the query.)

Can we use the same model as a raw format for metrics, as opposed to adding raw metric data points to OTLP? I am not sure of the answer, but I'd like to be able to point at this work when asking the question. In that way, I see this work as similar to and connected with the prototype for Resource Scope that I did. It went nowhere but I've referred to it a hundred times and still there's hope.
