
matthiasr / opsy.md
Last active April 13, 2024 19:44
My company is going through a shift to more agile processes. How does work come into ops teams?

In my experience, there are three major sources of work for an opsy team:

  • Keeping the lights on – updates, maintenance, keeping entropy at bay. You can barely control how much of this there is, but you can control how much work it takes to deal with, through
  • Work to improve your own situation: automating stuff, architecting better systems, evolving processes, getting dev teams to do things in more ops-friendly ways. Both of these compete for focus with
  • Support work, that is, helping product teams with everything they need to build product. This is almost always interrupt-driven: you can try to get wind of these needs ahead of time, but if you insist on exact requests in time for your own sprint planning, you instantly at least double the lead time on everything.

Things I have seen work to deal with this:

  • At an appropriate granularity of organization (by default, per team), designate a rotating first responder. This is the person who takes in short-term requests on whatever communications channels they arrive on.
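The rotation itself can be dead simple; here is a minimal sketch of a weekly first-responder rotation (the team list and the ISO-week scheme are my own illustration, not from any particular team):

```python
from datetime import date

def first_responder(team, day):
    """Rotate weekly through an ordered team list, by ISO week number.

    Note: naive across year boundaries (the ISO week resets), so the
    rotation can skip a person at New Year -- fine for a sketch.
    """
    week = day.isocalendar()[1]
    return team[week % len(team)]

team = ["ana", "ben", "chris"]
print(first_responder(team, date(2024, 1, 8)))  # ISO week 2 -> chris
```

Anything fancier (holidays, swaps, fairness across years) is usually better handled by the on-call scheduling tool you already have.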
matthiasr / repos.md
Created August 23, 2023 20:47
On monorepos and many repositories

With the caveat that this is based on a sample size of 2, I would lean towards one repository per deliverable thing – one app, one service, or service with worker components.

Tooling support for monorepos is not great. You can make do, but you end up either running Bazel or wishing you were running Bazel. The usual hosted CI feels geared towards one repo per use case, GitHub gets creaky with large amounts of code and change, and IDEs start to struggle.

In a monorepo, it's easy to share code, but hard to keep a handle on ownership of that shared code. With separate repositories it's much more legible to management when a core piece of code is unowned and unmaintained, so it's easier to get clear funding for this.

Versioned releases of shared frameworks, and generally smaller repositories, make it a lot easier to see what's in a deploy. Consequently they make deployment and rollback a lot safer. You pay a price for that though: either you live with a huge amount of drift in those core frameworks, making it nearly im

matthiasr / index.md
Created May 11, 2023 09:32
Configuring Format on Save for Terraform in VS Code

If VS Code does not format Terraform code on save, but does format correctly when asked explicitly (Command Palette → Format Document), add this to the settings JSON:

    "[terraform]": {
        "editor.defaultFormatter": "hashicorp.terraform",
        "editor.formatOnSave": true
    }


matthiasr / gpg_wsl2.md
Last active February 10, 2023 10:45
GPG signing with full gpg-agent support in WSL2: the easy way

Problem statement

Signing with GPG in the Windows Subsystem for Linux (WSL2) does not work smoothly out of the box. Notably, when using a TTY-based pinentry, signing in Visual Studio Code does not work at all.

Solution

  1. Install Gpg4Win: winget install -e GnuPG.Gpg4win or download and install manually
  2. Start Kleopatra and generate or import keys
  3. Insert links to gpg.exe inside of WSL:
matthiasr / 00_strategy.md
Created October 16, 2022 19:07
Strategies for system design interviews

First of all, read about NALSD (Non-Abstract Large System Design). Not every company leans that hard into the scaling aspect, but even if they don't, having it in the back of your head helps.

If they give you any numbers on scale at all, write them down. You will need them shortly. For online interviews, have something you can draw into. If something is provided in the invite, play around with it ahead of time to make sure you can draw boxes of different shapes and connect them with arrows. If not, have an empty canvas ready in your online drawing tool of choice.
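Turning those scale numbers into rates is usually the first back-of-envelope calculation; a quick sketch with made-up numbers:

```python
# Back-of-envelope: turn daily volume into request rates.
# The numbers are hypothetical, just to show the arithmetic.
daily_requests = 100_000_000  # say they tell you 100M requests/day
seconds_per_day = 86_400

avg_qps = daily_requests / seconds_per_day
peak_qps = avg_qps * 3        # common rule of thumb: peak ~ 3x average

print(round(avg_qps))   # ~1157 requests/second on average
print(round(peak_qps))  # ~3472 requests/second at peak
```

Having these numbers written down early lets you sanity-check every later choice (one database? a cache? how many instances?) against them.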

Figure out the general shape of the problem. Is it a request-response type system (we want a website that does X), a stream type system (we get events and process them as they come), a batch type system (we have a pile of data and want to get through it)?

Clarify anything you are unsure about – this is about communication as much as system design. Listen for what they want you to do – ask them about every underspecified detail, or make reasonable assumptions and state them as you go.

matthiasr / 20_percent.md
Created September 14, 2022 18:42
On 20% / self allocated time

20% time is a buffer against many kinds of process failure. If your planning is perfect – if you can always exactly predict what the costs and benefits of future work will be, and how long everything takes – you don't need it. If your process is perfect, you can perfectly line up every tooling improvement, every product experiment, every tech-debt repayment, every personal-development investment, at 100% utilization.

But you can't. Sometimes you can't really formulate how something will improve productivity in the future, what feature your product has been missing, or which blog post is the one each engineer must read. 20% time acknowledges that you're only going to be right about 80% of the time on all this – and that there is value in having slack for the other 20%.

This time becomes a buffer for planning emergencies (because let's be real, sometimes planned work spills over into it), for exploration that doesn't make it into a roadmap until the exploration has already happened, for productivity improvements that yo

matthiasr / event_reporting.md
Created July 18, 2022 16:35
The website says I shouldn't use Prometheus for billing, what should I do?

That's a whole system design interview question right there 😉

Fundamentally, if you need that level of detail and fidelity, you compromise on timeliness. You (somehow) record an event for every message sent, then tally those up on a daily or monthly basis.

It depends on the system and the details of your requirements. I would start by looking at how messages are being sent to begin with. For example, if there's a job in a database, you might be able to count the jobs marked as completed right there.
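As a sketch of that approach, counting completed jobs per customer and day straight out of the database (the schema and names here are invented for illustration):

```python
import sqlite3

# Sketch: when sends are rows in a job table, billing can be a plain query.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE jobs (id INTEGER, customer TEXT, status TEXT, day TEXT)"
)
db.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?, ?)",
    [
        (1, "acme",    "completed", "2024-05-01"),
        (2, "acme",    "completed", "2024-05-01"),
        (3, "acme",    "failed",    "2024-05-01"),  # failed: not billed
        (4, "initech", "completed", "2024-05-02"),
    ],
)
rows = db.execute(
    "SELECT customer, day, COUNT(*) FROM jobs "
    "WHERE status = 'completed' GROUP BY customer, day ORDER BY customer"
).fetchall()
print(rows)  # [('acme', '2024-05-01', 2), ('initech', '2024-05-02', 1)]
```

The appeal is that the billing count comes from the same source of truth as the sends themselves, so there is no separate event stream to keep consistent.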

If the message system is fed by a Kafka topic, and you can get away with billing for messages attempted to send, you can count those from the source topic.

You can also explicitly record events, say into another Kafka topic or a database table. You will need to make some choices about CAP in that case: if recording the event fails, do you want to send anyway? What if you're not sure recording it worked?
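One way to keep that choice from being made accidentally is to decide the failure behavior in exactly one place. Here is a sketch that picks one side of the trade-off – refuse to send if we can't record – with names and shapes of my own invention:

```python
# Sketch of one CAP choice: bill-before-send. If we cannot record the
# billing event, we refuse to send rather than send unbilled messages.
def send_with_billing(message, record_event, send):
    try:
        record_event(message)
    except Exception:
        return False  # billing store unavailable: do not send
    send(message)
    return True

# Usage with in-memory fakes standing in for the billing store and sender:
events, sent = [], []
ok = send_with_billing("hello", events.append, sent.append)
print(ok, events, sent)  # True ['hello'] ['hello']
```

The opposite choice (send anyway, reconcile billing later) is just as valid; the point is that whichever you pick should be an explicit line of code, not an accident of exception handling.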

Kubernetes defaults to scaling by CPU usage, because that is what is always available. However, this is not a great metric to scale on. Most backend services are not CPU-bound – they mostly wait for responses from other services, caches, or databases. If one of those gets slow, or worse, talking to one of those gets slow, CPU-based scaling will tear down resources rather than scaling out, because all it sees is "idle" instances. This is especially bad if the contended resource is concurrency on those network requests. If many requests are waiting to check out a connection from the database connection pool, scaling down is what you want the least.

My favorite metric to scale on is how many requests you have currently ongoing (per instance). There's a relationship between latency, request rate, and this number of ongoing requests, which means that, say, a 20% increase in latency at the same request rate, or a 20% increase in requests at the same latency, both result in a 20% increase in ongoing requests.
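That relationship is Little's law: average in-flight requests ≈ request rate × average latency. A quick check of the arithmetic:

```python
# Little's law: average in-flight requests = request rate * average latency.
def in_flight(rate_per_s, latency_ms):
    return rate_per_s * latency_ms / 1000

base   = in_flight(100, 200)  # 100 req/s at 200 ms -> 20 in flight
slower = in_flight(100, 240)  # +20% latency        -> 24 in flight
busier = in_flight(120, 200)  # +20% traffic        -> 24 in flight
print(base, slower, busier)   # 20.0 24.0 24.0
```

Either way the instance gets "fuller", so scaling on in-flight requests reacts to slow dependencies and to traffic growth alike – exactly the cases where CPU-based scaling points the wrong way.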

matthiasr / 01.md
Created May 19, 2022 07:50
What makes operating databases more complex than stateless services?

To me, the fundamental difficulty in managing databases is the amount of state they have, the time it takes to move that state around, and the difficulty in keeping it in sync.

Whether it's Cassandra, MySQL, or PostgreSQL, bringing up a new instance takes time, orders of magnitude more than replacing some stateless service. Network mountable volumes help, because that state mostly lives on whatever provides that, but you still need to account for "moving" it into the cache.

Additionally, you necessarily have shared state that you cannot ever reset. A lot of the usefulness of "cattle" servers is that you have a clear way to reset them to a known good state. In most cases you cannot do that with your data.

Some of the mechanics of a cattle management system like Kubernetes still work but they're on timelines that give the cluster operators headaches.

matthiasr / questions.md
Created December 3, 2021 07:38
Someone asked: How can I be less annoying to collaborate with?

A great technique that a manager set me on the path towards, and that I have learned over the last years, is to ask many good questions. This takes practice. A good question pushes the other person towards an insight, but lets them supply their own experiences, and gives them the satisfaction of figuring something out instead of being told what to do.

For example, instead of stating "this container needs a readiness probe", you could ask "what would a good readiness probe be for this container?", or even more open "how can we make sure this gets taken out of traffic when it breaks? how can we detect that it's broken?"

I found the questioning approaches from Resilience Engineering/LFI helpful, especially to never ask why – it's always an accusation. Better questions are How and What. Sometimes it takes me a few iterations in my head to go from "Why" to "What b