Simple First Engineering

Always Start Simple (ASS)

This is why I'm grumpy and sulking in the (metaverse) corner:

  1. Businesses solve problems via solutions they sell on the open market
  2. When it's a software based solution, it's really easy to get things right
  3. Everything is simple, however everyone overcomplicates it
  4. It's entirely possible businesses choose certain technology stacks because they feel that by not choosing them they're unable to attract the latest and greatest talent
  5. And the labour market also feels that by not choosing to learn and understand the latest and greatest technologies, they become unhirable
  6. An insidious cycle is produced that means everyone is building solutions, and everyone is learning how to build them, based purely on a fear of missing out (FOMO), and not on the idea that less is more.

So I want to talk about this, but to be clear - I'm an operations guy. I've done software engineering too, and I see the decline there as well, so this will mostly be about operations, with a bit of software in there too.

This entire piece is broken down into the problem I think exists, what I believe my solution is, and then a big breakdown of some rules (broken into specific problem domains) I try to follow when innovating a new solution for a client.

Just a quick warning: I do absolutely think Docker and Kubernetes (k8s) are simply not required; in most cases they're never needed. If you're sick of anti-Docker, anti-k8s content, then feel free to look away.

"Solve the problem with as few moving parts as possible. In fact, engineer and build things that junior engineers can understand, maintain and scale." - Me.

The Problem

Look at this coffee machine:

The "Over-Kill T-1000" Coffee Machine

It's a tablet that you interact with to get a machine to make you coffee. It breaks down a lot, and when it does, you have no other choice but to find a working machine or go buy a coffee.

The same applies to those Zip tap things. You push a button and get boiling water. The other button gives you chilled water. My wife was telling me how the one at her place of work broke. Because they're a proprietary product, it took three months to get it fixed. Luckily, "[a colleague] pulled a kettle from the cupboard and we could boil water again"... boiling water is now locked behind proprietary, complex boilers and taps so that it's slightly more convenient and looks neat (and maybe they're slightly safer, too?)

I recently added smart LEDs to my home and now use Siri to control them. It's great when it works, but when it doesn't, it's a pain in the arse. I've overcomplicated the concept of turning on a light and dimming it.

We're doing the same with our operational practices...

  1. We're losing sight of the fundamentals, and eventually no one will know how anything works except a few massive corporations who own all the resources and, most importantly, the highly proficient (deep) knowledge workers
  2. Eventually, development and operational departments will simply be consuming their APIs and they will know nothing else
  3. Docker has made ops lazy, and Kubernetes is just even worse
  4. Abstractions upon abstractions, when the OS has been there for decades and has been very easy to maintain and secure.
  5. And operating systems today, especially with systemd (love it or hate it - I used to hate it), are very capable of creating security boundaries between applications (see the unit sketch after this list)
  6. You can also use systemd with plain, non-privileged user accounts to enable, start, and operate services without ever touching the root user
  7. Ansible (and other CAC tools) make configuring the OS simple - really simple
  8. Packer makes baking images really, really simple, for every Cloud provider you could care about
  9. AppArmor and SELinux are simple - you just have to learn the tools
  10. Ultimately, it's extremely easy to take code written in any language, using any set of libs/dependencies, frameworks, etc., and get it onto a disk and served on the network in a few steps. You don't need containers; you don't need K8s
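
To make points 5 and 6 concrete, here's a minimal sketch of a hardened systemd unit. The directives are real systemd options; the service name, user, and paths are assumptions for illustration.

```ini
# /etc/systemd/system/webapp.service -- hypothetical example service
[Unit]
Description=Example web application

[Service]
# Run as a plain, non-privileged account
User=webapp
ExecStart=/opt/webapp/bin/webapp
# The process (and its children) can never gain new privileges
NoNewPrivileges=yes
# Mount the whole file system read-only for this service (except /dev, /proc, /sys)
ProtectSystem=strict
# Hide /home entirely
ProtectHome=yes
# Give the service its own private /tmp
PrivateTmp=yes
# The only path it is allowed to write to
ReadWritePaths=/var/lib/webapp
# Drop every Linux capability
CapabilityBoundingSet=

[Install]
WantedBy=multi-user.target
```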

And some very well respected experts in our field have some interesting things to say about software engineering today, which now extends into operations, too:

  1. "Preventing the Collapse of Civilization / Jonathan Blow (Thekla, Inc)": https://www.youtube.com/watch?v=ZSRHeXYDLko
  2. "The Thirty Million Line Problem": https://www.youtube.com/watch?v=kZRE7HIO3vk
    1. My favourite part is about how it shouldn't take 56,000,000 lines of code to deliver a text file to an end user
  3. And both Jonathan and Casey aren't even addressing the same problem that we're seeing in operations... that's just the software development cycle. It's getting bad in ops, too.
  4. We've forgotten how to choose boring software stacks: https://mcfunley.com/choose-boring-technology

The Solution (IMO)

This is a solution to the operations (ops) problem, and a bit of the software problem. You can pick and choose what works for you, from zero to everything. It's up to you. But as a contractor I move around a lot, so I believe I've experienced enough situations, problem domains both small and large, and solutions to all of this to confidently say: we've gone too far, made things too complicated, and the result is bloat and abstractions that are making people dumb and our future generations completely ignorant.

Some quotes from Hacker News that I liked on this subject:

"Simple solutions are also easier to refactor into more complex solutions when the complexity becomes necessary. Going the other way is much harder." - https://news.ycombinator.com/item?id=30770305

"When I was younger, I always thought the old guys pushing boring solutions just didn't want to learn new things. Now I'm starting to realize that after several decades of experience, they simply got burned enough times to learn a thing or two [and] had developed much better BS-detectors than 20-something me." - https://news.ycombinator.com/item?id=30769004

So my thoughts on this are:

  1. Serverless is likely a good option for your (probably) web based application or service...

And yes, I know, Serverless is likely being made available via Docker and probably Kubernetes, but that's fine as you've got literally nothing to manage but the code and the account/bill with the vendor. Or put another way - it's perfectly acceptable to pay a barista to make you a coffee instead of making it yourself at home, but if you do choose to make your own at home, you don't need a $15,000 industrial coffee making setup, you need a decent Moka Pot that will cost you sub $50 and some coffee grinds.

  2. PaaS is also the next best bet, downwards from Serverless. From there, it's IaaS, not managed K8s.

  3. But assuming Serverless and PaaS are not an option...

  4. Avoid Docker and Kubernetes. They're just tools, sure, but they're also overcomplicated abstractions that give you the false idea that you've achieved something you wouldn't otherwise have been able to do (run some software)

    "You cannot operate at scale without Docker or K8s. It's impossible!"

This is simply false...

"My company runs without containers. We process petabytes of data monthly, thousands of CPU cores, hundreds of different types of data pipelines running continously, etc etc. Definitely a distributed system with lots of applications and databases.
We use Nix for reproducible builds and deployments. Containers only give reproducible deployments, not builds, so they would be a step down. The reason that's important is that it frees us from troubleshooting "works on my machine" issues, or from someone pushing an update somewhere and breaking our build. That's not important to everyone if they have few dependencies that don't change often, but for an internet company, the trend is accelerating towards bigger and more complex dependency graphs.

Kubernetes has mostly focused on stateless applications so far. That's the easy part! The hard part is managing databases. We don't use Kubernetes, but there's little attraction because it would be addressing something that's already effortless for us to manage.

What works for us is to do the simplest thing that works, then iterate. I remember being really intimidated about all the big data technologies coming out a decade ago, thinking they are so complex that they must know what they're doing! But I'd so often dive in to understand the details and be disillusioned about how much complexity there is for relatively little benefit. I was in a sort of paralysis of what we'd do after we outgrew postgresql, and never found a good answer. Here we are years later, with a dozen+ postgresql databases, some measuring up to 30 terabytes each, and it's still the best solution for us.

Perhaps I've read too far into the intent of the question, but maybe you can afford to drop the research project into containers and kubernetes, and do something simple that works for now, and get back to focusing on product?"

(https://news.ycombinator.com/item?id=30768244)

And this:

"I worked at WhatsApp, prior to moving to Facebook infra, we had some jails for specific things, but mostly ran without containers.
Stack looked like:

FreeBSD on bare metal servers (host service provided a base image, our shell script would fetch source, apply patches, install a small handful of dependencies, make world, manage system users, etc)

OTP/BEAM (Erlang) installed via rsync etc from build machine

Application code rsynced and started via Makefile scripts

Not a whole lot else. Lighttpd and php for www. Jails for stud (a tls terminator, popular fork is called hitch) and ffmpeg (until end to end encrypted media made server transcoding unpossible).

No virtualized servers (I ran a freebsd vm on my laptop for dev work, though).

When WA moved to Facebook infra, it made sense to use their deployment methodology for the base system (Linux containers), for organizational reasons. There was no consideration for which methodology was technically superior; both are sufficient, but running a very different methodology inside a system that was designed for everyone to use one methodology is a recipie for operational headaches and difficulty getting things diagnosed and fixed as it's so tempting to jump to the conclusion that any problem found on a different setup is because of the difference and not a latent problem. We had enough differences without requiring a different OS."

(https://news.ycombinator.com/item?id=30768485)

But I will say that this is a very interesting comment on the above:

"Containers are just tar.gz files, you know? The whole layers thing it’s just an optimization. You can actually very simply run those tar.gz files without docker involved, just cgroups. But then you’ll have to write some daemon scripts to start, stop, restart, etc

Follow this path and soon you’ll have a (worst) custom docker. Try to create a network out of those containers and soon a (worst) SDN network appears.

Try to expand that to optimal node usage and soon a (worst) Kubernetes appears.

My point here is: it’s just software packaged with their dependencies. The rest seems inconsequential, but it’s actually the hard part."

(https://news.ycombinator.com/item?id=30768921)
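
To make that comment concrete: a container image really is just a root filesystem in a tarball, and the OS can already run it for you. A rough sketch, assuming a hypothetical rootfs tarball and binary:

```sh
# Unpack the "image"
mkdir -p /srv/myapp
tar -xzf myapp-rootfs.tar.gz -C /srv/myapp

# Boot it as an isolated process tree with systemd-nspawn
# (namespaces and cgroups, no Docker daemon involved)
systemd-nspawn -D /srv/myapp /usr/bin/myapp
```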

The above comment is clearly indicative of the fact that Docker and Kubernetes are roughly what you'd end up with if you tried to rebuild them yourself on top of the Linux kernel primitives they're utilising... but...

The fact that others are running complex, distributed software solutions at petabyte scale, with thousands of CPU cores, tells me that Docker and Kubernetes aren't actually required to operate at that scale.

You can do none of that and still:

  1. Run the software, probably in a virtual environment
  2. Use the networking stack the OS provides directly
  3. Use systemd functionality to lock down the process
  4. Use systemd as a non-privileged user to run everything, without ever needing root/sudo (see the workflow sketch after this list)
  5. Use AppArmor or SELinux to really lock down the system
  6. CIS harden the base image everything is using
  7. And still end up with the same results... a running process.
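
As a rough sketch of point 4, this is the entire workflow for running a service under systemd as a non-privileged user. The unit name is hypothetical; the commands are standard systemd/loginctl.

```sh
# Units live in the user's own config directory - no root, no /etc
mkdir -p ~/.config/systemd/user
cp webapp.service ~/.config/systemd/user/

# Allow this user's services to keep running while they're logged out
# (on some distros this may prompt for authentication)
loginctl enable-linger "$USER"

# Manage it exactly like a system service, just with --user
systemctl --user daemon-reload
systemctl --user enable --now webapp.service
journalctl --user -u webapp.service -f
```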

The SFE Rules

Now I want to break down the Simple First Engineering (SFE) rules that have been written to help businesses decide on their technology choices.

Technology Selection

  1. Pick a single language and use it across the entire organisation
  2. Pick a single framework and use it across the entire organisation
  3. Pick a single CI/CD stack and use it across the entire organisation
  4. Pick a particular testing methodology and use it across the entire organisation
  5. Pick a particular GitOps methodology and use it across the entire organisation
  6. Use Linux on workstations. Windows is not a usable operating system; macOS is slow and getting slower

Project Management & Documentation

  1. Use OKRs
  2. Use DevOps methodologies to deliver OKRs
  3. Find an Agile workflow that works and stick to it
    1. And improve it over time, tweaking as need be
  4. Use static documentation - do not use a wiki or Confluence
    1. Use MkDocs (a minimal config sketch follows this list)
    2. Keep the docs inside the repository
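
A minimal sketch of what that looks like in practice; the site name, pages, and theme are assumptions (mkdocs-material is a separate package):

```yaml
# mkdocs.yml - lives at the root of the repository, next to a docs/ directory
site_name: Example Platform Docs
theme:
  name: material
nav:
  - Home: index.md
  - Runbooks: runbooks.md
  - Architecture: architecture.md
```

Running `mkdocs build` in CI produces a static site you can publish anywhere, so the docs version alongside the code.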

Writing Software

  1. Ask for as little data as you can from customers
  2. Transmit that data across as few wires/networks as you can
  3. Process it via as few lines of code as you can
  4. Security scan, secure, and protect as many of those lines of code as you can
  5. Regression test as many of those lines of code as you can
  6. Store the (original or transmuted) data in as few places as you can
  7. Get the (original or transmuted) data to that storage via as few networks as you can
  8. Create as few copies of the (original or transmuted) data as you can
  9. Create encrypted backups of that (original or transmuted) data...
    1. 3 in total;
    2. at least 2 across different storage mediums
    3. with one being remote/off-network
  10. Regularly check the data can be retrieved and utilised as securely as possible, whenever you can
  11. Use as little (original or transmuted) data as possible for the process/person that needs the data
  12. Delete as much (original or transmuted) data as you can, as soon as you can
  13. Audit every single interaction with the (original or transmuted) data
  14. Audit every single interaction with the code that captures, processes or otherwise interacts with the (original or transmuted) data
  15. Write software that is Cloud aware and can write logs and metrics to S3 directly (see the sketch after this list)
  16. Write software that executes as close to the hardware as possible
    1. This generally means Go, Rust, ...
    2. But Python is awesome too
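
For point 15, a rough sketch in Python of what "Cloud aware" can mean, assuming boto3 and a hypothetical bucket name; in practice you'd batch events rather than write one object per log line:

```python
import json
import time

import boto3

s3 = boto3.client("s3")


def ship_event(event: dict, bucket: str = "example-app-logs") -> None:
    """Write a structured log/metric event straight to S3 as a JSON object."""
    # Partition keys by date so life-cycle policies and JIT queries stay cheap
    key = f"events/{time.strftime('%Y/%m/%d')}/{int(time.time() * 1000)}.json"
    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(event).encode("utf-8"))


ship_event({"level": "error", "msg": "payment failed", "order_id": "hypothetical-123"})
```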

Systems Engineering

  1. Pick a particular Cloud provider and stick with it
  2. Avoid multi Cloud and its false promises
  3. Use Infrastructure As Code
    1. Specifically, Terraform
    2. Avoid CDKs like Terraform CDK, AWS CDK, and Pulumi
  4. Use Configuration As Code
    1. Specifically, Ansible in pull mode (see the sketch after this list)
    2. Avoid creating complex clustered solutions like Salt, Puppet, Chef
  5. Use Images As Code
    1. Specifically, Packer
    2. Maintain baseline images
  6. Only use Linux, not Windows Server
    1. This pairs well with your use of Linux workstations
  7. Use as few abstractions as possible to deliver software to end users
    1. Kubernetes is an abstraction. It's too complicated for what it (eventually) does: run a process
    2. Docker is an overly complex abstraction. You don't need it, except in a few cases
      1. If you're running Linux locally, it's likely not needed at all, but it's a good tool for running esoteric services like databases, etc.
  8. Learn to use the operating system
  9. Allow as few writes to operating system filesystems as possible
    1. Make the file system immutable, if possible
  10. Allow as few permissions to manipulate the operating systems as possible
    1. Make the entire system immutable, if possible
  11. Produce immutable artifacts and deploy those
    1. Build an image and deploy it
  12. Roll forward whenever possible
  13. Logs and metrics should never touch the local filesystem
  14. Logs and metrics should cross as few networks as possible
  15. Logs and metrics should be stored in "dumb" storage first, in their raw state
    1. S3, for example
    2. Logs and metrics should be ingested from "dumb" storage, analysed, and acted upon
      1. Such as processing the event and alerting when an error is found
    3. Use life-cycle policies to migrate old logs and metrics to cheaper storage options
    4. Do not try to maintain live clusters that try to keep 100s of GBs of logs indexed and searchable
    5. Build JIT solutions for looking at logs as and when you need them
  16. Do not use Kubernetes until scaling becomes a pain point
  17. Do not use Docker to package and ship software if you're not using Kubernetes
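
As mentioned in rule 4, here's a sketch of Ansible in pull mode, where each host fetches and applies its own configuration from git on a schedule; the repository URL and playbook name are assumptions:

```sh
# Run once by hand (or from cloud-init on first boot)
ansible-pull \
  --url https://git.example.com/ops/config.git \
  --directory /var/lib/ansible-pull \
  local.yml

# Then schedule it, e.g. a cron entry so every host converges itself every 30 minutes
# */30 * * * * root ansible-pull --url https://git.example.com/ops/config.git local.yml
```

No central Ansible server, no cluster to babysit: the hosts pull, the git repository is the source of truth.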

SecOps

  1. Do use SAST and DAST tools to scan all code, always
  2. Use Open Policy Agent (OPA) wherever you can, for everything you can (see the policy sketch after this list)
  3. Do shift Terraform, Ansible, and other operational tools into the CI/CD pipeline
  4. Do not allow engineers to run Terraform, Ansible, and other operational tools from their local CLIs
    1. The only case this is permissible is when you're setting everything up for the first time
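
A minimal sketch of the OPA idea, written in classic conftest style against Terraform plan JSON; the package layout, attribute names, and message are illustrative assumptions:

```rego
package main

# Deny any S3 public-access block that still allows public ACLs
deny[msg] {
  resource := input.resource_changes[_]
  resource.type == "aws_s3_bucket_public_access_block"
  resource.change.after.block_public_acls == false
  msg := sprintf("%s must block public ACLs", [resource.address])
}
```

Run it in the pipeline (e.g. `conftest test tfplan.json`) so policy gates the same Terraform runs the pipeline already owns.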

...
