50 years of reliability engineering
Large-scale systems like power grids, chemical processing, nuclear power,
traffic control or air and space travel have been around for quite some time.
Most of these are still larger than what we manage, but we are catching up.
Modern deployment methods and better schedulers allow for more scalability,
and to get there, we constantly make the gears turn faster.
Large-scale use of microservices or even unikernel "fog computing" promises to
add billions of short-lived components to our day-to-day environments.
Naturally, the complexity of those systems is bound to increase...
It's time to look outside our bubble: we need to review professional
practices that other industries have been improving on for over 50 years.
In this Ignite talk, you'll hear a few lessons from my dive into:
- human reliability analysis
- classical reliability engineering
- safety-critical software development.
What can only NASA afford?
What methods might be viable for us?
What could happen if we just adopted them?
Where could we draw the line between time-to-market and literally risking lives?
Florian Heigl