@jmacd
Last active November 2, 2022 16:00
Proposal: Open-source jmacd's experimental Go logging library

TL;DR jmacd left tens of thousands of lines of a hobby project sitting in /.../experimental/users/jmacd/logging/sloggo or some such path. Speaking as an open-source telemetry expert, this would be valuable to release as an experimental library, which jmacd is willing to demonstrate.

During the years 2014-2015 I had a "20%" project that did several interesting things. I was at the time working with Google Cloud Logging and Error Reporting teams, and I had previously been working in the base C++ logging library.

My efforts to promote structured logging inside Google during the years 2011-2013 had been an utter failure, and I had begun to think I understood why. Structured logging is not very useful by itself. Logs are too expensive to collect at scale, which is why we sample traces and aggregate metrics. Structured logs are potentially less expensive than text-formatted logs, but not enough to matter if you're still trying to collect them all. Sampling individual log messages doesn't help.

We had clear existence proofs of useful systems that were mostly constructed of logging events, but they didn't record 100%. The C++ /requestz servlet, which I had owned and maintained since Dapper, was the system I was studying. /requestz was a widely-used tracing interface and RPC debugging tool, and my goal with this logging library was to be able to write a query ("view") that would operate over log events and compile into a requestz implementation.

Formal influences

Systems like Fay were talking about injecting code to configure instrumentation.

Pivot tracing was published after I had done a lot of this work, but it explained what I was doing with this /requestz plan. You can see SQL-like queries that turn log events into metrics and spans and something about a HAPPENS BEFORE relation, which is simple to maintain inside a single process.

The idea was to implement a compiler from query into a /requestz-like servlet, which would execute the query but apply sampling to limit collection costs. You'd write one query to find the start of a span, one query to find the end of the span, join them and calculate latency, then group by latency bucket; meanwhile you'd join with events in the context.

Google's "Yedalog" system is a very large-scale system with capabilities like this. I wrote a query evaluator inspired by (but much more brute-force than) Magic sets and other strange ways to implement logic programs (Ullman, 1986) here.

Learning Go

Before joining Google I had wished to some day need to write a SQL parser and use it for something. The Go team had opened up "self-readability" to Googlers and I decided to learn Go by writing a SQL-like parser (at the outset of this project). When I had finished my draft, I was to file a ticket asking for a Go-team reviewer. Because I had used go-yacc to generate the grammar, which no one else wanted to touch, Rob Pike was assigned as my readability reviewer. (I would later pitch Pike on the assembled logging package, and he rejected this kind of complexity.)

I began writing a structured logging library after finishing the SQL parser, but I didn't begin connecting them until later. Into this library I distilled some of the findings from my work in the base C++ library, but I was learning Go in the process.

Structured log references

Census had a logging primitive that could record a few words of data in approximately 50 nanoseconds; the key was that it recorded only the variables, nothing of the static content. I wanted a logging format that could support this kind of performance and compression.

As an example, consider logging a stacktrace. Capturing a stacktrace from a typical runtime is roughly 10x cheaper than formatting the same information as text. My goal was that after the frames of a stacktrace had been logged once, each frame could be logged as a backward reference in the log.
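The frame-interning idea can be sketched in Go as a dictionary keyed by program counter; all names here are hypothetical, not the original library's API:

```go
package main

import (
	"fmt"
	"runtime"
)

// frameDict interns stack frames: the first time a program counter is
// seen, the full frame is recorded as a definition; afterwards only a
// small integer reference is emitted into the log.
type frameDict struct {
	refs map[uintptr]int
	defs []runtime.Frame
}

func newFrameDict() *frameDict {
	return &frameDict{refs: map[uintptr]int{}}
}

// internStack captures the caller's stack and returns one backward
// reference per frame, defining any frame not seen before.
func (d *frameDict) internStack() []int {
	pcs := make([]uintptr, 32)
	n := runtime.Callers(2, pcs) // skip Callers and internStack itself
	frames := runtime.CallersFrames(pcs[:n])
	var out []int
	for {
		f, more := frames.Next()
		ref, ok := d.refs[f.PC]
		if !ok {
			ref = len(d.defs)
			d.refs[f.PC] = ref
			d.defs = append(d.defs, f) // first use: log the definition
		}
		out = append(out, ref)
		if !more {
			break
		}
	}
	return out
}

func main() {
	d := newFrameDict()
	first := d.internStack()
	second := d.internStack() // frames shared with the first capture resolve to existing references
	fmt.Println(len(first) > 0, len(second) > 0)
}
```

On the second capture, frames above the call site are already in the dictionary, so only the changed leaf frame needs a new definition.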

Pattern extraction: printf() vs print()

From the beginning Go supported three variations on a print statement, Print(), Println(), and Printf(). I would prefer that logging APIs have a single primitive and that static messages be extracted from the source. I'd rather write Print("text", var1, "text", var2) than Printf("text %v text %v", var1, var2). Both logically have the same message format, but in Go there's a loss of information when using the non-Printf form.

This logging library would be able to capture and record the message format for Print statements. The per-event cost of a log statement is reduced to the cost of recording the call-site variables and backward reference to the statement definition.

The package includes tooling to extract log message format strings and generate a dictionary of statements at compile time.
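The division of labor between the compile-time dictionary and the runtime event might look like the following sketch; the `statement`, `event`, and `emit` names are hypothetical stand-ins for whatever the tooling would generate:

```go
package main

import "fmt"

// statement is one entry in the compile-time dictionary a tool could
// generate by scanning source for Print-style calls: the static text
// fragments, with gaps where the variables go.
type statement struct {
	id     int
	format []string
}

// event is what actually lands in the log: a backward reference to the
// statement plus only the call-site variables.
type event struct {
	stmt int
	args []any
}

var dict = []statement{
	{id: 0, format: []string{"connected to ", " in ", ""}},
}

func emit(stmtID int, args ...any) event {
	return event{stmt: stmtID, args: args}
}

// render re-interleaves static fragments and arguments, e.g. in a
// log browser, reconstructing the full message text.
func render(e event) string {
	s := dict[e.stmt]
	out := ""
	for i, frag := range s.format {
		out += frag
		if i < len(e.args) {
			out += fmt.Sprint(e.args[i])
		}
	}
	return out
}

func main() {
	e := emit(0, "db-7", "12ms")
	fmt.Println(render(e)) // connected to db-7 in 12ms
}
```

The event itself carries only `(0, "db-7", "12ms")`; everything static lives once in the dictionary.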

Short-circuit: safety mechanisms

There are several safety mechanisms built into the library worth mentioning, since most Go debugging libraries are a bit dangerous in this regard. I set the following requirements.

Requirement: Logging libraries must have a way to cheaply disable individual logging statements. For example, the library has to decide whether the statement will log before it captures a stacktrace, since the stacktrace is expensive to compute.
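The cheap-disable requirement amounts to checking a per-statement switch before any expensive work happens; a minimal sketch, with hypothetical names:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// stmtSwitch is a per-statement enable flag checked before any
// expensive work (stacktrace capture, formatting, allocation).
type stmtSwitch struct{ on atomic.Bool }

// Log takes the expensive work as a closure so that a disabled
// statement costs only one atomic load.
func (s *stmtSwitch) Log(expensive func() string) {
	if !s.on.Load() {
		return // disabled: the closure never runs
	}
	fmt.Println(expensive())
}

func main() {
	var s stmtSwitch
	calls := 0
	s.Log(func() string { calls++; return "captured" }) // disabled
	s.on.Store(true)
	s.Log(func() string { calls++; return "captured" }) // enabled
	fmt.Println(calls) // 1
}
```

Passing a closure (rather than eager arguments) is what keeps the disabled path cheap: the stacktrace is only computed after the decision to log.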

Requirement: Logging libraries must not deadlock themselves, for example when a synchronous log consumer calls back into the logging interface. This is tricky to prevent in Go (it can be detected when capturing the stacktrace).

Requirement: Logging libraries must not deadlock themselves, when for example a method used to format an object encounters infinite recursion. As a test for any logging library: what happens when you try to log a cyclic data structure? It's safe and correct in this library with the use of (intra-) log references.
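The cyclic-structure case can be handled the same way as stack frames: replace an already-seen object with an intra-log backward reference. A hedged sketch of the idea (the `node` and `encode` names are illustrative):

```go
package main

import "fmt"

type node struct {
	name string
	next *node
}

// encode walks a value, replacing any already-seen object with a
// backward reference instead of recursing forever.
func encode(n *node, seen map[*node]int, out *[]string) int {
	if ref, ok := seen[n]; ok {
		return ref // intra-log reference breaks the cycle
	}
	ref := len(*out)
	seen[n] = ref
	*out = append(*out, "") // reserve this entry's slot
	entry := fmt.Sprintf("#%d name=%s", ref, n.name)
	if n.next != nil {
		entry += fmt.Sprintf(" next=#%d", encode(n.next, seen, out))
	}
	(*out)[ref] = entry
	return ref
}

func main() {
	a := &node{name: "a"}
	b := &node{name: "b", next: a}
	a.next = b // cycle: a -> b -> a
	var out []string
	encode(a, map[*node]int{}, &out)
	for _, e := range out {
		fmt.Println(e)
	}
	// #0 name=a next=#1
	// #1 name=b next=#0
}
```

Logging the cycle terminates with two entries that reference each other, rather than recursing until the stack overflows.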

Log yourself: object formatting

A structured logging API should have a way to encode structured data, including with annotations. Similar to MarshalJSON for JSON encoding of custom types, the library supports a custom logging method for types to define how they are logged. This is just a recursive form of the logging API, with support for adding key-value attributes to the data.

A log statement like

  log.With(...).Log("message: ", property.Value(variable))

places a variable into the log, associated with property, which is a defined attribute key. The type of variable might define a logging method:

func (o *Object) Log(logger Logger) {
   logger.Log("an object named %v with %v", o.name, otherProperty.Value(o.value))
}

which defines the value in terms of a recursive log statement. In the example here, the encoded event would include associations for property and otherProperty.

Labels, attributes, and structured objects

In OTel today we are using "label" and "attribute" as synonyms, but I was looking for two concepts and I chose to use "label" and "association", both of which are attributes in the OTel sense. The labels of a log event are attributes that identify and describe the event itself, whereas the associations are attributes describing the data contained within the log.

Labels are a unique map of properties about the log event (source location, severity level, ...), whereas associations are a list of key-values associated with the log payload (and stored "to the side"). The structure of a log event output by the library:

  1. Timestamp: obvious meaning
  2. Sequence: an ascending event counter
  3. Message: the message format of the event
  4. Labels: list of (attribute-key, value) about the event itself
  5. Args: list of variables with the event
  6. Associations: list of (attribute-key, value) about the args
  7. Definitions: first-encoded value of entities in the log.
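That layout could be rendered in Go roughly as follows; the struct and field names are my hypothetical reconstruction, mirroring the list rather than the original library:

```go
package main

import (
	"fmt"
	"time"
)

// KeyValue is one (attribute-key, value) pair.
type KeyValue struct {
	Key   string
	Value any
}

// Event mirrors the seven-part log event structure described above.
type Event struct {
	Timestamp    time.Time
	Sequence     uint64     // ascending per-process event counter
	Message      int        // backward reference to the statement definition
	Labels       []KeyValue // about the event itself
	Args         []any      // call-site variables
	Associations []KeyValue // about the args
	Definitions  []any      // first-encoded entities, emitted before first use
}

func main() {
	e := Event{
		Timestamp: time.Now(),
		Sequence:  1,
		Labels:    []KeyValue{{Key: "log.severity", Value: "INFO"}},
	}
	fmt.Println(e.Sequence, e.Labels[0].Key)
}
```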

Log references could be used in any of the labels or associations. For example:

  • a label value refers back to a severity-level definition (about the event)
  • a label value contains a list of references to stack frames (about the event)
  • an association value refers to the 1st argument (e.g., to note a corrupt shard)
  • an association value refers to the account/username path of the 2nd argument (e.g., to note PII).

Namespaces

Attribute (label and association) keys are universal. Some namespaces are reserved for the system, like a log. prefix for conventional logging attributes and an rpc. prefix for conventional RPC attributes.

Entities

I've described references within a log, but not how to create entities in the log. Entities are identified by a single attribute key-value.

I found two categories of entity: named and anonymous. Named entities can be global and can be instantiated by multiple parties (with conflicting definitions!).

Anonymous entities are created in a specific log and are defined once.

A log severity level is an example of a named entity. The entity namespace and name in this case may be represented as log.severity=INFO for INFO severity. This entity may be used as a label on a log event, relating it to the definition of INFO that was given when the entity was created. Now a log event can be marked "INFO", where INFO is just an entity defined earlier in the log.

Self-describing format

The format is self-describing; only the log.relation entity is built in, from which new relations are created. For example, there is an entry for log.relation=log.severity explaining what log severity is. Each entity definition is logged before it is first used.

This means the initial log event in a log will bootstrap itself, providing all its definitions.
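The define-before-first-use discipline for entities can be sketched as a small sink that remembers what it has already defined; `entity` and `logSink` are hypothetical names:

```go
package main

import "fmt"

// entity is identified by a single attribute key-value pair.
type entity struct {
	key, value string
}

// logSink emits each entity's definition before its first use, so
// every log is self-describing from its initial event onward.
type logSink struct {
	defined map[entity]bool
}

func (s *logSink) use(e entity) {
	if !s.defined[e] {
		fmt.Printf("define %s=%s\n", e.key, e.value) // first use only
		s.defined[e] = true
	}
	fmt.Printf("label %s=%s\n", e.key, e.value)
}

func main() {
	s := &logSink{defined: map[entity]bool{}}
	info := entity{"log.severity", "INFO"}
	s.use(info) // emits the definition, then the label
	s.use(info) // already defined: just the label
}
```

A fresh reader of the log stream encounters `define log.severity=INFO` before any event labeled with it, which is what makes the bootstrap work.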

Log identification

The system uses a semantic convention for process identity. The log.process entity must be defined and used as a label on every log event emitted by a process. The process is required to manage the sequence number of its logs, so that we can determine HAPPENS BEFORE within a single process log.

A log process is an anonymous entity. It's defined by its attributes, which are logged when it is created. So the initial log event contains a process definition, and every event is labeled by that process entity.

Applications

I had built three demos to explore the benefits of all this complexity.

Structured log browser

This application is written using Go HTML templates. I wanted to be able to annotate data in the logs API as described above, and then write corresponding templates to display the data. I had written example templates to structure individual log messages and other known schemas, like the stacktrace.

I had been working with the ACRA library to report Android crashes to Cloud Logging, and considered it a good example of a structured log. That format includes a screenshot. I wanted to be able to log a screenshot and have it display in the logs browser. There's a template that recognizes a content-type association and uses it to correctly tag image data in the browser.

This browser correctly expands log references (until they become cyclic), built on a stateful reader which remembers the definitions it has seen.

The logs browser has two modes, one chronological and one in the frequency domain (like ecatcher). Click to add a filter. Pivot between chronological and frequency displays.

RPC logging conventions

When a client connects to a server, it sends its peer a "global" log reference which is a reference to its log event qualified with the unique process identity. A logs receiver that receives the logs from both processes will be able to link those references.

A structured log browser could follow the logging context from the client to the server and back. This may sound like joining logs by a trace_id--the outcome is similar--but the information density is higher in the log. The log reference exchanged between RPC peers refers to the point in the log where each transacted with the other; you could use that log reference to discover details about the process that were recorded once when the process started (instead of once per span or trace record).
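A "global" log reference is then just a local log position qualified by process identity; a minimal hypothetical shape:

```go
package main

import "fmt"

// GlobalRef qualifies a position in one process's log with that
// process's unique identity, making the reference meaningful to a
// peer (and to a combined logs receiver). Names are illustrative.
type GlobalRef struct {
	Process  string // unique process entity, e.g. minted at startup
	Sequence uint64 // position in that process's log
}

func main() {
	// The client includes its reference in the RPC; the server logs
	// it, so a receiver holding both logs can link them at this point.
	clientRef := GlobalRef{Process: "proc-1f3a", Sequence: 4012}
	fmt.Printf("%s@%d\n", clientRef.Process, clientRef.Sequence)
}
```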

/requestz proof-of-concept

I was not finished with this.

I had compiled the SQL expression into a logic program which operated over logs, and I had unit-tested the SQL parser and the logic expressions. I had written a test program that would re-process a log file and apply the logic program.

I don't remember exactly what was working as I departed Google. I seem to recall an expression like this working:

  SELECT end.time - start.time AS latency
  FROM * AS start, * AS end
  WHERE start.ctxid == end.ctxid AND start HAPPENS BEFORE end;

Note that start.ctxid == end.ctxid has a special relationship with HAPPENS BEFORE, the same one exploited by /requestz. The "ctxid" and HAPPENS BEFORE expressions disappear because the query is evaluated in a sequential context, and sampling can be added to this query in the same way /requestz does.
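Because the log is processed in order, that query reduces to a single pass with a map keyed by ctxid; a hedged Go sketch of the evaluation (the event shape is invented for illustration):

```go
package main

import "fmt"

// event is a minimal stand-in for a parsed log record.
type event struct {
	ctxid string
	time  int64 // arbitrary time units
	kind  string
}

// latencies evaluates "end.time - start.time" over a sequential log.
// HAPPENS BEFORE and the ctxid join disappear into a map lookup
// because events arrive in order within a single process log.
func latencies(events []event) map[string]int64 {
	starts := map[string]int64{}
	out := map[string]int64{}
	for _, e := range events {
		switch e.kind {
		case "start":
			starts[e.ctxid] = e.time
		case "end":
			if t, ok := starts[e.ctxid]; ok {
				out[e.ctxid] = e.time - t
			}
		}
	}
	return out
}

func main() {
	events := []event{
		{"req-1", 10, "start"},
		{"req-2", 12, "start"},
		{"req-1", 25, "end"},
	}
	fmt.Println(latencies(events)["req-1"]) // 15
}
```

Sampling can be added at the point where a completed (start, end) pair is produced, which is where /requestz limited its collection costs.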

Why is this interesting to OpenTelemetry?

Tigran Najaryan has proposed a data model for logging about Resources.

It proposes that you could gather a series of ordered log events with simple semantics and produce a new signal from them. I want to ask some of the questions that arise naturally to me: can we define spans as a collection of event logs?

The thing that /requestz did really well, that Dapper never could, is deal with indefinite spans. If we can formalize a model for deriving Resources from log events, can we derive a model for monitoring indefinite-lifetime spans? (We should be able to define a span with a cutoff of 1 minute, for example, with a slight change of the query.)

Can we use the same model as a raw format for metrics, as opposed to adding raw metric data points to OTLP? I am not sure of the answer, but I'd like to be able to point at this work when asking the question. In that way, I see this work as similar to and connected with the prototype for Resource Scope that I did. It went nowhere but I've referred to it a hundred times and still there's hope.
