
Common architecture issues make patching "HARD"

An attempt to explain where exactly we can improve things.
From someone with hope, but without even the smallest grain of illusion.

I could just say "well, it was US who carried VMS to its grave!"
It wouldn't help much. We're definitely looking at long-solved problems that are no longer solved, due to design limitations in our OS and application integrations. The question is how to get around that.

SW Issue forecast missing.

Gather the history of security issues for each component you use. This has to be done during the design phase. The people who patch will just see a WAR file and nothing else. They don't have budgeted time to tear it apart each time, and they were probably handed it by someone who also doesn't know what's inside.

a great example: libxml

  • has had 10+ issues, so you can estimate 1/year
  • you also learn that oscap oval eval (which uses libxml to validate downloaded XML) should not be used unless the XML input was scanned for malicious content first.

For newer components, at least get an average for components in that language (e.g. typical Golang projects). You can match it with the security history of similar code.
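
A minimal sketch of what such a forecast could look like; the component names and issue counts below are placeholders, not real CVE data:

# Rough per-component security-issue forecast from public CVE history.
# The component names and counts are placeholders; fill in real data for
# whatever actually ships inside the WAR file.

components = {
    # name: (known security issues, years of history)
    "libxml2":       (10, 10),
    "struts":        (8, 6),
    "some-go-thing": (2, 2),   # young project: fall back to a language-typical rate
}

total_rate = 0.0
for name, (issues, years) in components.items():
    rate = issues / years
    total_rate += rate
    print(f"{name:15s} ~{rate:.1f} security issues/year")

print(f"{'stack total':15s} ~{total_rate:.1f} out-of-band security patches/year to plan for")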

Everything will break

Assume that every critical part, and every part that processes input, will have a major issue every 3-5 years!
This is mandatory for anything that is security related. (Heartbleed: 4 years ago; the Debian OpenSSL disaster: 9 years;
ssh privsep issues: maybe 10, if we're lucky, and let's not mention protocol v1.)

Keep in mind that features make this worse:
Enabling PFS also turned out to be the most likely reason to have an OpenSSL recent enough to be affected by Heartbleed.

Don't assume there are no issues with more modern languages:

The nicely validated OCaml xenstore backend had a security issue within less than 5 years. (And that stuff gets looked at a lot.)

No maintenance effort forecast:

If you have 20 software components, assume all 20 will need to be patched more than once, and that's not counting regular updates, only out-of-band security patches.
It will also not happen at the same time for each component...
If you want to do better than EQUIHOAX, you'll need a bi-weekly or weekly patch scheme.
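
A back-of-the-envelope version of that maintenance forecast; all numbers are assumptions (20 components, the 3-5 year major-issue rule from above, roughly one routine security fix per component per year):

# Back-of-the-envelope maintenance forecast. All numbers are assumptions:
# 20 components, a major issue every ~4 years each (the 3-5 year rule above),
# plus roughly one routine security fix per component per year.

n_components = 20
years_between_major_issues = 4
routine_fixes_per_component_year = 1

major_per_year = n_components / years_between_major_issues
routine_per_year = n_components * routine_fixes_per_component_year
patch_windows_per_year = 26            # bi-weekly patch scheme

print(f"expected major security issues/year:   {major_per_year:.0f}")
print(f"expected routine security fixes/year:  {routine_per_year:.0f}")
print(f"fixes to handle per bi-weekly window:  {(major_per_year + routine_per_year) / patch_windows_per_year:.1f}")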

Yes, I know Docker exists. I also know there is a negligible speed difference
between restarting a very large Java app server by re-fetching a
new container from the registry and just restarting it inside WLS.
It's likely faster on WLS, and in either case we're talking anywhere
between 2 and 25 minutes for large stuff.
I've run and patched more of those than I ever thought possible.

You don't need to give the numbers, but you need to list the stuff that needs to be done.
If you don't, management will understand nothing needs to be done.
They will assign resources accordingly (i.e. none), and the ops teams will later fight constantly to change this, but never be able to, because management was told at the start that nothing is needed.
And Ops will never know why they're hitting a wall.

Availability impacts:

Even if patching takes only a few minutes, you suddenly need three application servers for HA where your application needed two:

you will be in maintenance this much:

25 weeks * 5 minutes * 2 nodes (with rolling restarts)

That's roughly 99.95% of the year with both nodes up.
Any outage during the remaining window takes everything down, because only one node is left.
The example company had a revenue of $6.64bn in 2016.
In their business, the outage cost would be almost the complete revenue for the outage period; ignoring peak times, that's around $750k per hour.
That also ignores the suddenly idling 9,500 employees, minus the 3-4 handling the outage.
(You could definitely add another $100k lost in wages.)
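
The same arithmetic spelled out, so the assumptions (25 patch windows, 5 minutes per node, the $6.64bn revenue figure) are explicit and easy to swap out:

# Availability and outage-cost arithmetic from above.

MINUTES_PER_YEAR = 365 * 24 * 60        # 525,600

patch_windows = 25                      # roughly bi-weekly patching
minutes_per_node = 5
nodes = 2

degraded_minutes = patch_windows * minutes_per_node * nodes   # rolling restarts: one node down at a time
both_nodes_up = 1 - degraded_minutes / MINUTES_PER_YEAR
print(f"degraded (only one node up): {degraded_minutes} min/year")
print(f"time with both nodes up:     {both_nodes_up:.4%}")    # ~99.95%

annual_revenue = 6.64e9                 # the 2016 revenue figure used above
revenue_per_hour = annual_revenue / (365 * 24)
print(f"revenue at risk per outage hour: ${revenue_per_hour:,.0f}")   # ~$750k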

Does it pay:

Almost $1m seems enough to pay a few devs to build new packages, right? You would think!

Except there's the management view:
thinking that avoiding potential outages saves money is naive.
Yes, outages cost money, and no, you don't get free money just from avoiding them. You get exactly nothing.

That, put very simply, is how much management will at first assign for 'patching'.
So right now, we have $0 for this. Let's see if we can change it.

What happens with those numbers:

It is the responsibility of the designer to come up with estimates of how much effort it will take to run the thing!
Those numbers need to be included, and if they are not, the allotted budget will be something like "3 default web app servers at 24x7"; grabbing a number from my random prices box, I'd say $3k for default mass operations for a year.

Each time someone understates a risk to push a project, THIS is what will happen.
Of course, if customer A sees vendors B and C, out of which C is the one that states $150k of maintenance because a dev and a sec team should be at hand for 0.3 resource units, the customer will buy from vendor B, who simply lied.

To that end, only basic standards/regulation can help. Liability alone will not: the vendor can average the penalties over all sales,
and company management just needs to plausibly "forget" to ask about this.
Regulation does help; it has had this kind of bullshit covered for decades.

Monitoring untested

  • Make sure your test plans include testing the monitoring interfaces
    (I know an /api/status URL in a Ruby app that will hang eternally on backend errors... Reported, not fixed. :-))
    Don't call it 'status' if it can't deliver a status under all circumstances.
    This is at least the second most important interface you've got.
    If you take a high-level (KPI) business view, it is the most important.
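
A hypothetical sketch of the obvious fix for that kind of endpoint: wrap the backend probe in a hard deadline so the status call degrades to "unhealthy" instead of hanging (function names here are made up):

# A status endpoint must answer even when the backend it reports on is wedged.
import concurrent.futures
import time

def check_backend():
    """Placeholder for the real backend probe (DB ping, queue depth, ...)."""
    time.sleep(60)                      # simulate a wedged backend
    return {"backend": "ok"}

_pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)

def status(timeout_s=2.0):
    """What /api/status would return: never blocks longer than timeout_s."""
    future = _pool.submit(check_backend)
    try:
        return {"status": "ok", **future.result(timeout=timeout_s)}
    except concurrent.futures.TimeoutError:
        return {"status": "unhealthy", "reason": f"backend probe exceeded {timeout_s}s"}
    except Exception as exc:
        return {"status": "unhealthy", "reason": str(exc)}

if __name__ == "__main__":
    # Answers within ~2s even though the probe hangs; the stray worker thread
    # still finishes in the background before the interpreter exits.
    print(status())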

Segmentation

  • For security and isolation purposes, try to work with brokers: not just prepared statements, but intermediary daemons.

  • Versioned protocols should be a given, but honestly I've not seen that quality outside of - guess what - 24x7 safety-critical systems(*).

(Those that can be reinstalled with a single button-press without interruption of service and keep 10y+ of uptime)

Anyway, it's not that much extra work to have a protocol handshake.
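
A minimal sketch of such a handshake, with made-up version numbers: the client announces what it speaks, the server picks the highest common version, and a mismatch fails loudly instead of the two sides silently mis-parsing each other after an update:

SERVER_SUPPORTED = {1, 2}        # versions this daemon can still serve

def negotiate(client_versions):
    common = SERVER_SUPPORTED & set(client_versions)
    if not common:
        raise RuntimeError(
            f"no common protocol version (server {sorted(SERVER_SUPPORTED)}, "
            f"client {sorted(client_versions)})")
    return max(common)

# Rolling update: an old client (v1 only) and a new client keep working.
print(negotiate([1]))        # -> 1
print(negotiate([1, 2]))     # -> 2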

Be aware of languages / models that were made for robustness

I'm not saying we can rewrite the planet in Erlang. But I am saying you should be aware that our common choice of other languages WILL limit the results you see on the Ops end.

There is no discussing this: operability was a core idea in some of these languages, and if we go with something else, don't think that has no consequences.
Layering in k8s is a very helpful band-aid to stop the bleeding, but it's not an equal solution. (This is why admins dig unikernels.)

Include procedures

You might already invest a lot into proper testing. Make sure that includes normal operations testing. If you've not seen people go through(*) an application update, you can't assume it will be OK. If you're leaving after the design phase, then you'd better PUT THAT TEST IN THE SPEC. (*from the dev downloading struts right off the internet to the next guy grabbing a config snippet from an email because it never made it into the docs)

Reading recommendation (it almost makes me cry when I think of normal IT): https://www.amazon.com/Mission-Critical-Safety-Critical-Systems-Handbook-Applications-ebook/dp/B00486UK2A

That stuff is TOO MUCH for normal purposes, but it is helpful to be aware of what is done to ensure things stay maintainable, once you look beyond the typical Java web app.

But it's interesting how you get 100-page documents with interface contracts, full of DB schemas and stuff, yet not a single goddamn line on where the logs go and which entries indicate a failing (not failed!) system. And seriously, if that stuff is not in there, we've got an 80% design, and no one gets to say "don't tell me it's hard" when the most basic info is totally lacking.

I work with a standard template now.

Example Service Template

- Nagios Service Groups:

- Nagios Checks
  - "Docker_Health"
  - "Process DockerD"

- Server:
  - int-buildserver
  - prod-appserver

- Master / Cluster VIP:
  - none

- Trigger of installation:
  - Node Property:  docker == "yes"

- Monitoring Config:
  - none

- Firewall Ports:
  - none # needed for swarm mode and overlays, but not opened since not used

- Service control:
  - stop: systemctl stop docker
  - start: systemctl start docker
  - restart: systemctl restart docker
  - configtest: docker info

- Processes:
  - /usr/bin/dockerd
  - docker-containerd -l unix:///var/run/docker/libcontainerd/docker-containerd.sock --shim docker-containerd-shim --metrics-interval=0 --start-timeout 2m --state-dir /var/run/docker/libcontainerd/containerd --runtime docker-runc


- Storage Directories:
  - /var/lib/docker

- Logfile Directories:
  - /var/log

- Functional Role:
  - backend 

- Langdeps:
  - python

- Rudder objects:
  - "Docker install", 
  - "Docker Nagios Plugin", 
  - "UNIX Group for Docker", 

- Ansible playbooks:
  - none

Standards being ignored

There are official (ISO or whatever) international standards for software reliability design. But I would say: just try to be more like RAC or Galera than homegrown garage-design HA.

No contracts outside of interfaces

For all of this to happen, there needs to be an internal contract within the company. In classic DevOpsy style, it needs to be between:

Security	- will inform, and will set a deadline
Management	- will fund security fixes via a reserved budget, no discussions, and will accept downtimes
Development	- will prioritize any security fix and keep their CI in shape so they can actually build, instead of being lost under versions
Ops		- will immediately apply a delivered security fix

This is a job for the project manager and designer, and it needs to happen before the thing is finally put on the internet.

Management also needs the INFO about which parts of the application are highly explosive. They have to be asked whether sec+ops are allowed to take the application offline, or at least disable features.

Here, again, a modern development model seems helpful: just turn parts off. In reality, this is mostly irrelevant. A stale security issue won't be in the sexy new features you just turned on; it will be in the ones that are in widespread use.

Finally: Data models, checkpoints, configs.

You know why patching is done so super fucking conservatively? Because so many applications fuck up their schemas, and because you can't rely on them to notice when schema and data don't match. It is just too likely they'll tear through the live data and destroy it all.
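
A sketch of the "refuse to touch data you don't understand" rule, using a hypothetical schema_info table and column names: the migration aborts up front unless the on-disk schema version is exactly the one it was written against, and applies its step in a single transaction:

import sqlite3

EXPECTED_SCHEMA = 4          # the version this release's migration was written against

def migrate(db_path):
    conn = sqlite3.connect(db_path)
    (current,) = conn.execute("SELECT version FROM schema_info").fetchone()
    if current != EXPECTED_SCHEMA:
        conn.close()
        raise SystemExit(
            f"schema is v{current}, migration expects v{EXPECTED_SCHEMA}: "
            "refusing to touch live data (restore a checkpoint or run the "
            "intermediate migrations first)")
    with conn:               # one transaction: the whole step lands, or nothing does
        conn.execute("ALTER TABLE orders ADD COLUMN patched_at TEXT")
        conn.execute("UPDATE schema_info SET version = ?", (EXPECTED_SCHEMA + 1,))
    conn.close()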

Closing:

"Ops" can't fix any of that shit. If we would, we would get sued, and fired, or the other way round. If you don't know that, please eat a little cookie of modesty because apparently you indeed are allowed to change things. Not understanding does not make you clueful. So, please try to help instead of blaming the outdated content of the tarball on those who didn't even make it.

That blame is uninformed and malicious. People HATE running insecure stuff. People HATE running outdated stuff. We waste our lives trying to run this broken shit, and we would LOVE to run safe and fast stuff. But none of that matters, because it's not attainable. What have the new deployment methods brought us? Something like 30% of all Docker Hub images with open security issues? And of course, I know, everyone rebuilds in-house; except that tools to automate that, like Watchtower, dangled around with a few dozen likes for years. Don't lie to my face, thanks.

  • There will always be layering panaceas that don't change shit, but that we need to migrate to, because then it'll be better.
  • There will always be business constraints, people who think they're clever for saving some money on this.
  • There will always be that one bug that blocks the release.
  • There will always be that one issue no one saw, because we're now "efficient" and it's no longer hip to just check a sec list daily.
  • And even if that were OK, there will be that one day with the big outage, where no one looked at the list.

Instead, make it safe to patch. What would really matter is if it became commonplace to give us

  • checkpointed data updates
  • versioning on all interfaces (see above; it would be so cool not to have outages from updating)
  • thorough documentation

Then we could talk about doing the sexy stuff. Also, then we could just patch. Likely still at night when no one is looking, because someone would still say "no".
