Brian's Software Principles

This is a rough sketch of some of my generally-held principles for writing software and designing/implementing software systems. YMMV, and so may mine. I reserve the right to deny ever having said any of this.

Like all the best principles, these are only meant to get you thinking, not as hard and fast rules.


Thought-Terminating Clichés

A thought-terminating cliché is something we've heard people say that sounds good, so we repeat it. In fact, it sounds so good that we don't feel we have to think about it. So we don't. And that's dangerous.

YAGNI - You Ain't Gonna Need It

I usually hear this as a reason for only doing what's needed right now, and for not designing a flexible system. There's some inherent sense in not building things you won't need, which is why people latch onto this and repeat it. The tricky part is knowing what you are and aren't going to need. That's where evolutionary design comes in. You don't have to imagine every possible future and build every possible feature, but designing your system in such a way that it can change to accommodate whatever the future has in store is a good thing.

A less obvious but important point related to this expression is that trying to build things before they're needed increases the chances that you'll build them wrong; you just won't have all the information about what's actually going to be needed. I agree with this, but again, it's not a reason to create a rigid design that can't easily be changed. In fact, it's a very good reason not to do that.

KISS - Keep It Simple, Stupid

I have nothing against simplicity, and over-engineering is something a lot of us are prone to. But it's important to recognize when problems are actually complex and not try to solve a complex problem with a simple solution. Maybe a better principle than "Keep it simple" is "Make it exactly as simple as it can be, and no simpler".

DRY - Don't Repeat Yourself

This is an oft-quoted but rarely understood expression. Most people think it means that repeated code is bad; if you have the same lines of code in multiple places in your project, pull them into a function. If you have the same function in multiple projects, pull it out into a library.

That's not what DRY is meant to address.

This principle comes from the book The Pragmatic Programmer, where it's defined as:

Every piece of knowledge must have a single, unambiguous, authoritative representation within a system.

So what this is really getting at is having a single source of truth. As in, don't have two different subsystems that are independently responsible for determining the same value.

Example: how do we generate the URL for a user's profile page from their username? There can only be one correct answer, so having multiple bits of code that independently generate this URL is fragile.
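As a minimal sketch (the function name and URL shape here are made up for illustration):

```ts
// profileUrl.ts: the one place that knows how a profile URL is built.
// Every other piece of code imports this instead of building the URL itself.
export function profileUrl(username: string): string {
  return `/users/${encodeURIComponent(username)}/profile`;
}
```

If the URL scheme ever changes, there's exactly one function to update.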

This applies at higher levels as well. Example: how do we store users in the database? There should be one module that knows this (i.e., the schema), and it should be the only thing that touches users in the database: it's the single source of truth, and everything else uses that module as an interface.
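Sketched out (the interface and field names are hypothetical), that module might expose something like:

```ts
// users.ts: the only module that knows how users are stored.
export interface User {
  id: string;
  username: string;
  email: string;
}

// Everything else depends on this interface, never on the users table itself.
export interface UserStore {
  getById(id: string): Promise<User | undefined>;
  getByUsername(username: string): Promise<User | undefined>;
  create(user: Omit<User, "id">): Promise<User>;
}
```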

DRY doesn't care about summing the values in an array, or converting a CamelCase string to daisy-case, or performing a map-reduce on a collection.

Error Handling TL;DR

  • Don't be a hero (only handle errors where you can actually do something about it).
  • Don't make a bad situation worse.
  • Provide context.

Error Handling

Don't Be a Hero

I might also phrase this principle as "don't be afraid to rethrow".

The idea here is that you shouldn't be trying to handle every error. I realize this may sound like sacrilege to some, as it did to me at first. But it's important to understand the nuance of this principle.

This doesn't mean you shouldn't code defensively: you should.

What it means is that you should only handle errors that you actually know how to handle.

Say you're writing a REST service that has to fetch information from an upstream service. If that upstream service returns a 500, you could employ a retry strategy with exponential backoff and jitter to try to wait out whatever trouble the upstream service is having.

But that may not be what your user wants.

Maybe they don't care that much about this response, and they don't want to wait around for your retries.

Maybe you wait so long for a successful retry that your user's request times out anyway, so they're going to retry. Now you've got one thread waiting for the upstream service and another incoming request: you're going to hit the upstream service again and possibly make a bad situation worse. Multiply that by however many users you have trying and retrying to get through to you.

So what errors should you handle? Again: only handle errors if you can actually handle them. Is there something the system can do to actually remediate the exceptional case? Great! Have your system do that so you don't have to get woken up to do it. Can your system reach upstream and fix the issues happening on another server? No? Then you probably shouldn't be handling that error.

For anything else, there's a good chance you still want to catch the error, but only so you can rethrow it with additional context. No user wants to see a naked "ENOENT" come back from a REST API, and no developer wants to see a "Host Unreachable" error in their logs with no context about what host was being reached or why. More on providing useful errors below.
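A minimal sketch of that catch-and-rethrow pattern (the function and path are hypothetical, and this assumes a runtime that supports ES2022's Error cause option, e.g. Node 16.9+):

```ts
import { readFile } from "node:fs/promises";

// We can't "handle" a missing file here, but we can rethrow with
// context instead of letting a naked ENOENT bubble up.
async function loadUserConfig(path: string): Promise<unknown> {
  try {
    return JSON.parse(await readFile(path, "utf8"));
  } catch (err) {
    throw new Error(`Failed to load user config from ${path}`, { cause: err });
  }
}
```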

To retry or not to retry

Note that retries are sort of a gray area here. As a general guideline: if your system is doing stream processing or responding to synchronous requests, you probably don't want to do retries. If you're doing batch processing or working on asynchronous requests, retries are probably appropriate (keeping in mind any timeouts those jobs may have).
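For the cases where retries are appropriate, here's a minimal sketch of exponential backoff with full jitter (the function name and default values are illustrative):

```ts
async function withRetries<T>(
  task: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 100
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await task();
    } catch (err) {
      if (attempt >= maxAttempts) throw err;
      // Full jitter: sleep a random duration in [0, base * 2^attempt).
      const delayMs = Math.random() * baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```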

Useful Errors

It's 3AM and you're roused from your bed because the system crashed. You stumble to your computer, fumble with your password, and pull up the logs.

"Error: A runtime error occurred"

When is it most important to have detailed information about your system? When an error occurs.

TK

Logging TL;DR

  • Logs should be structured text (e.g., JSON lines).
  • Log messages should be static strings, using other fields to provide useful information.
  • Use a correlation ID in a dedicated field to tie logs together.
  • Refine correlation IDs as appropriate by appending to them.

Logging

Logs are for machines first, humans second. Logs have two important purposes: 1) monitoring; 2) debugging. Monitoring is an all-the-time thing whereas (hopefully) debugging is comparatively rare, hence logs are for monitoring first, and debugging second. That said, the same principles make your logs better for both purposes! To be used for monitoring or debugging, logs need to be:

  1. understandable
  2. specific
  3. informative

Understandable means that both the machine and the human can make sense of the data in the log entry. In other words, it needs to be unambiguously parseable. String interpolation or concatenation is the wrong answer here because "username=" + username works fine until the username has a space in it, and "username=\"%s\"" works ok until the username has a quote in it, etc.

Structured text is the name of the game. JSON-lines has broad support and it's reasonably human readable.

In general, a log entry should look like this:

```ts
{
  timestamp: Date;
  message: string;
  details: {};
}
```
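A minimal sketch of a logger that emits that shape as JSON lines (the log function and its signature are illustrative, not a real library API):

```ts
function log(message: string, details: Record<string, unknown> = {}): void {
  // One JSON object per line: unambiguous for machines, still readable for humans.
  console.log(
    JSON.stringify({ timestamp: new Date().toISOString(), message, details })
  );
}

// The username stays unambiguously parseable even with spaces or quotes in it.
log("User logged in", { username: 'annie "oakley" smith' });
```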

For specific, we're talking precisely about the message field. Specific means that each log entry is identifiable as a specific event. In short, this means that every line of code where your logger is called should pass a different string for the message. While it should be a clear message that a human will understand, it may be useful to think of it more as a label than a full log message: if you search for this string in your logs, every result should be an entry generated by that one line of code. This also means that someone looking at the line of code should know exactly what string to search for, which means it should generally be an inline string literal, or potentially a constant whose value is a string literal.

Informative means that once the reader (human or machine) has identified the log entry they care about (specific) and made sense of the data (understandable), that data is actually useful. That's where the details field comes in: any additional pieces of information that are likely to be relevant to the log entry should be included here. Think of message as providing the context, and details providing the...details.

Correlation IDs

Logs tend to be interleaved; multiple instances, multiple processes, multiple threads, all spitting out concurrent logs for different tasks and they all get mixed together. It's important to know which logs go together. This is where correlation IDs come in.

A correlation ID is just a string that's included in related logs. For instance, if your system is processing user requests, you can generate a new unique request ID as soon as the request is received: any logs related to that request should include that request ID. If you're processing rows from a database, the row's unique ID makes a good correlation ID. If you're pulling jobs off a queue, the queue item's ID should work.

If there is no good intrinsic identifier, just generate one. A random v4 UUID or a NanoID is usually a reasonable choice.

Correlation IDs can also be refined by appending to them. For instance, if you're processing a user request and using requestId as the correlation ID, then when you move on to generating the response you might want to use requestId + ".resp" as the correlation ID. To get all the logs related to this request, you can search for entries that have a correlation ID starting with the request ID. If you only care about the logs related to generating the response, you can refine your search appropriately.
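Putting those together, a minimal sketch (reusing the illustrative log function from above; the .resp suffix is just an example):

```ts
import { randomUUID } from "node:crypto";

// No intrinsic ID available, so generate one.
const corrId = randomUUID();
log("Request received", { corrId, corrIdFrom: "generated" });

// Later, refine the correlation ID for the response-generation phase.
const respCorrId = `${corrId}.resp`;
log("Response generation started", { corrId: respCorrId });
```

Searching for entries whose correlation ID starts with corrId finds everything related to the request; searching for the .resp-suffixed ID narrows it to the response.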

Another nice thing about using correlation IDs is that you don't have to keep repeating information all the time. For instance, you don't need to include the user ID in every log; just log it once and you'll know that all correlated logs share the same user ID (though in some cases you may want to repeat information anyway, so a developer (or search query) doesn't need to go looking for it).

For logging purposes, you should pick a dedicated field name for the correlation ID, something like corr_id. That way when you're looking at a log entry with a user_id and a request_id and a job_id and they're all UUIDs, you don't need to guess at which one is being used to correlate logs.

Additionally, it may be helpful to add a field like corr_id_from to indicate where the correlation ID came from (e.g., "request ID", "row ID", "generated"). This is probably only necessary for the first log entry.

Logging Examples

Here are some bad log statements:

  1. BAD: log("Received ${httpMethod} request to ${path}.", { corrId, headers, body })
    1. The log message is dynamic, so it's going to be hard to find this in the logs.
    2. The httpMethod and path are included in the log entry, but they need to be parsed from the non-structured part of the entry, which is just more work for a developer writing a log search to do, and more work for the machine to do while searching your logs.
    3. We have a correlation ID, but it might be useful to know what intrinsic value that correlation ID came from.
    4. BETTER: log("Request received", { corrId: corrIdFrom: "requestId", httpMethod, path, headers, body })
  2. BAD: log({ corrId, corrIdFrom: "requestId", httpMethod, path, headers, body })
    1. There is no log message (i.e., no "tag"), so not only is it going to be very hard to search for these log entries, but if you see them in the logs you probably won't have any idea what they mean.
    2. BETTER: log("Request received", { requestId, httpMethod, path, headers, body })
  3. BAD: log("Response sent", { statusCode, headers, body })
    1. There's no correlation ID, so we have no idea what the request was that generated this response.
    2. BETTER: log("Response sent", { corrId, statusCode, headers, body })
  4. BAD: log("Unknown exception caught: " + error.message)
    1. We have no context for where this exception was caught, or what we were trying to do when it was thrown.
    2. Error and exception objects often have additional useful properties beyond just their message, but these haven't been included in the log.
    3. The error message itself needs to be parsed out of the unstructured log message.
    4. BETTER: log("ERROR Unknown exception caught while processing request", { corrId, error })

Service Interfaces TL;DR

  • When results are pulled out of a service, provide an HTTP interface.
  • When results are pushed out of a service, use pub-sub for the output interface.
  • Queues are great inside your system, they are not good interfaces.

Service Interfaces

Queues and What They're Good For

I don't like queues (as distinguished from topics) for the interface between services. They produce too much coupling.

A queue implies a single class of consumer because each message is only processed once: if you have multiple things to do with each message, you have to either have a single consumer doing all the things, or you need to duplicate your message into multiple queues. Having multiple queues means the producer is doing more work, and (more importantly) the producer needs to be aware of all the different classes of consumer because each needs to have its own queue (hence tight coupling). Adding a new class of consumer means adding a new queue, which means changing the producer to be aware of the new queue.

Who owns the queue? As in, who is responsible if a queue is backed up? It should be the consumer, because they're the ones who are supposed to be draining the queue. But if the queue actually fills up, this affects the producer as well, because trying to add more messages will fail. This is often used as a mechanism for backpressure, but when there are multiple consumers they're all throttled by the slowest one. Again with the coupling.

A publish-subscribe model (aka "pub-sub", aka "a topic") is generally preferred as the interface between services. In this model, the producer is only responsible for delivering the message to the topic (i.e., "publishing" or "broadcasting" the message), and any consumers that care about it are responsible for subscribing to that topic so that they will be notified of all new messages.

All that said, a queue is a great subscription target. When a consumer needs to do at-least-once processing, they probably don't want to try to process every message as it arrives from the topic. Set up your subscription to copy the message into a queue, and a separate process can then handle messages from that queue (see the sketch after this list). This allows your producer to be decoupled from the consumers via the pub-sub model but still gives your consumers all the benefits of a queue:

  1. You can handle bursts of messages, as long as your processing rate is at least as fast as the average message rate.
  2. You have persistence, so you can retry messages if something goes wrong during processing.
  3. You can keep messages in order if that's important.
  4. You get great visibility into how the consumer is doing.
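Here's a minimal in-memory sketch of that shape (a real system would use a broker, e.g. an SNS topic fanning out to SQS queues; all names here are illustrative):

```ts
type Handler<T> = (message: T) => void;

class Topic<T> {
  private subscribers: Handler<T>[] = [];
  subscribe(handler: Handler<T>): void {
    this.subscribers.push(handler);
  }
  publish(message: T): void {
    // The producer only knows about the topic, never about the consumers.
    for (const deliver of this.subscribers) deliver(message);
  }
}

// Each consumer owns its own queue and drains it at its own pace.
const topic = new Topic<string>();
const emailQueue: string[] = [];
const auditQueue: string[] = [];
topic.subscribe((msg) => emailQueue.push(msg));
topic.subscribe((msg) => auditQueue.push(msg));

topic.publish("user-created:42"); // lands in both queues independently
```

Adding a new class of consumer is just one more subscribe call; the producer doesn't change at all.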

But I Only Have One Consumer

For now. Building your system based on that assumption is needlessly limiting.

When Should You Use a Queue as an Interface?

Queues as an interface give you one thing that topics don't: backpressure. Backpressure is how a consumer tells a producer that it needs to slow down. A simple backpressure mechanism is to set a high-water mark on the queue: once the queue reaches that mark, the producer pauses until it drains back below it.
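A minimal sketch of that high-water-mark mechanism (the class, the polling approach, and the thresholds are all illustrative):

```ts
class BackpressureQueue<T> {
  private items: T[] = [];
  constructor(private readonly highWaterMark: number) {}

  // Producer side: wait while the queue is at or above the high-water mark.
  async push(item: T): Promise<void> {
    while (this.items.length >= this.highWaterMark) {
      await new Promise((resolve) => setTimeout(resolve, 50));
    }
    this.items.push(item);
  }

  // Consumer side: draining below the mark unblocks the producer.
  pull(): T | undefined {
    return this.items.shift();
  }
}
```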

As mentioned above, backpressure is a form of coupling between producer and consumer, as well as between consumer and other consumers: if the producer is slowing down to allow one consumer to catch up, all consumers are affected. If you need backpressure, your services are coupled: acknowledge that and deal with it.
