Skip to content

Instantly share code, notes, and snippets.

@y-o-u
Created December 8, 2021 07:40
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save y-o-u/86fd529dbdc678de661d4a59d06a14a5 to your computer and use it in GitHub Desktop.
Save y-o-u/86fd529dbdc678de661d4a59d06a14a5 to your computer and use it in GitHub Desktop.
pdfminerで英語pdfのテキストを抽出
Chaos Engineering:
Finding Failures Before
They Become Outages
Chaos Engineering: Breaking Your Systems for Fun and Profit
D
iane Glazman will never fly British Airways (BA)
again. Glazman and her husband were among
the 75,000 people affected by the three-day BA
system failure summer 2017. On their way from San
Francisco to their son’s college graduation in Edinburgh,
they were stranded in London—without their luggage—at
the beginning of what was to be the three-week dream
tour of Scotland. “Listening to the excuses was
frustrating because nothing explained why BA was so
unprepared for such a catastrophic failure,” says
Glazman.
BA lost an estimated $135 million due to that outage.
The culprit turned out to be a faulty uninterruptable
power supply device (UPS)—the corporate cousin to the
$10 gadget you can find in your corner Radio Shack. And
that loss figure doesn’t count the forever-gone trust of
customers like Glazman, who will look elsewhere for
transatlantic flights next time she travels.
BA of course isn’t alone for having suffered financially
for having its systems down. There were also United
Airlines (200 flights delayed for 2.5 hours, thousands of
passengers stranded or missed connections),
Starbucks (couldn’t accept any payments but cash in
affected stores), Facebook (millions of users offline and
tens of millions of ads not served during the 2.5 hours of
downtime), and WhatsApp (600 million users affected, 5
billion messages lost). And when Amazon S3 went down
in March 2017, it collectively cost Amazon's customers
$150 million.
1
Chaos Engineering: Breaking Your Systems for Fun and Profit
Suffered major outages in 2017
In fact, 2017 was a banner year for systems outages—
and for the cost of them.
The 2017 ITIC Cost of Downtime survey finds that 98%
of organizations say a single hour of downtime costs
more than $100,000. More than eight in 10 companies
indicated that 60 minutes of downtime costs their
business more $300,000. And a record one-third of
enterprises report that one hour of downtime costs
their firms $1 million to more than $5 million (see Figure
1). The average cost of a single hour of unplanned
downtime has risen by 25% to 30% since 2008 when
ITIC first began tracking these figures.
2
Chaos Engineering: Breaking Your Systems for Fun and Profit
$1M – $5M+
$300,000+
98%
$100,000+
% of Companies
Reporting
Cost of 1hr of
Downtime
33%
80%
$1M – $5M+
$300,000+
Figure 1: Cost of 60 minutes of downtime
So how can organizations cut the risk of downtime? The
answer: break your systems on purpose. Find out their
weaknesses and fix them before they break when least
expected.
Break your systems on
purpose. Find out their
weaknesses and fix
them before they break
when least expected.
It’s called chaos engineering, and it’s being adopted by
leading financial institutions, internet companies, and
manufacturing firms throughout the world. Such
businesses understand that the trillions of dollars lost
annually due to downtime is not acceptable to their
customers, their stockholders, and their employees.
3
Chaos Engineering: Breaking Your Systems for Fun and Profit
A more complex,
distributed world
In the traditional corporate computer environment of 30
years ago, software ran in a highly controlled
environment that had few moving parts or variables.
But in the new business world that depends on the
internet, globally connected systems, a mix of cloud and
bare-metal infrastructure, and more moving parts than
you can count, your software depends on a lot of
infrastructure and services that are outside your control
to run smoothly.
Already Moved
Planning to Move
86%
In Cloud by End of
2017
4
70%
16%
Already Moved
to the Cloud
Planning to Move
Chaos Engineering: Breaking Your Systems for Fun and Profit
First and foremost, there’s the rapid shift to the cloud. A
full 70% of companies have already moved at least one
application to the cloud, according to IDC, with 16%
more planning to do so by the end of 2017.
Then there’s the rise of microservices. Most enterprises
are developing their software today on a microservice
architecture. Applications are built as small and
independent but interconnected modular services. Each
service runs a unique process meant to meet a
particular business goal. For example, one microservice
might track inventory levels of products. Another might
handle serving personalized recommendations to
customers.
Then there are all the web servers, databases, load
balancers, routers, and more that must work together
to form a coherent whole. This is not easy.
The good news is that as this modular, distributed
infrastructure continues to evolve, businesses can do
things with software that simply weren’t possible
before. By shifting to what are also called ”loosely
coupled services” that can be developed and released
independently of each other, developed by similarly
organized teams who are empowered to make changes,
time to market for businesses can be radically reduced.
But these new capabilities come at a price.
5
Chaos Engineering: Breaking Your Systems for Fun and Profit
Today, businesses face a serious “complexity gap.” It’s
difficult for even the most technically astute individuals
to understand how all the different bits and pieces work
together. The ability of your IT professionals to manage
the ever-evolving sophistication of computer
infrastructure falls short of what’s needed. With so
many points of possible failure in your systems, this
makes your business extremely vulnerable.
This is happening at a time when people—
businesspeople as well as consumers—are increasingly
dependent on the internet and the services that
corporations deliver that depend on it. The number of
internet users worldwide in 2017 is 3.58 billion, up from
3.39 billion in 2016.
Even fewer people today can operate without a phone.
In 2017 the number of mobile phones reached 4.77
billion, and is expected to pass the five billion mark by
2019. And all these users--both consumers and
business--are demanding: Doubleclick found in 2016
that 53% of mobile site visits are abandoned if a web
page takes longer than three seconds to load.
As a result, the stakes have never been higher for
companies to maintain the uptime of their systems.
6
53% of mobile users
abandon webpages that
take more than three
seconds to load.
Chaos Engineering: Breaking Your Systems for Fun and Profit
Chaos engineering,
a primer
Imagine that it’s 1796, and you have been selected to be
injected with Edward Jenner’s brand-new smallpox
vaccine. You are told that he is going to put the actual
virus into your bloodstream. You are also told that you
won’t get sick. Instead, it will make it impossible for you
to get sick with this particular virus, because it will make
your system stronger. You might have recoiled, thinking
that the risk was too great. Yet you allow yourself to be
vaccinated. And, of course, you are much better off than
those who refuse the treatment.
This is exactly what chaos engineering does. All
computers have limits, and possible points of failure. By
injecting a system with something that has the potential
to disrupt it, you can identify where the system may be
weak, and can take steps to make it more resilient.
That covers the systems under your purview. Problems
that occur with your cloud service providers are out of
your control, and you can’t resolve outages by adding
extra boxes or power supplies.
By injecting a system
with something that has
the potential to disrupt
it, you can identify
where the system may
be weak, and can take
steps to make it more
resilient.
7
Chaos Engineering: Breaking Your Systems for Fun and Profit
Think of all your data living in the cloud, in Amazon S3
or DynamoDB, and the hosted services you depend on,
such as Salesforce or Workday. If they fail, you're at
their mercy. Chaos engineering isn’t just essential for
your applications, it’s essential for the companies
behind those applications, which is why Netflix, Uber,
and Amazon all have teams dedicated to chaos and
reliability: they know they cannot afford to let their
customers down.
Here’s where chaos engineering comes in: you know you
have these potential points of failure and vulnerabilities.
So why wait until there’s a problem?
Imagine attempting to break your systems. On purpose.
Before they fail on their own. Because that is what
chaos engineering does. By triggering failures
intentionally in a controlled way, you gain confidence
that your systems can deal with those failures before
they occur in production.
The goal of chaos engineering is to teach you something
new about your systems’ vulnerabilities by performing
experiments on them. You seek to identify hidden
problems that could arise in production prior to them
causing an outage. Only then will you be able to address
systemic weaknesses and make your systems fault-
tolerant.
8
Chaos Engineering: Breaking Your Systems for Fun and Profit
Critical to chaos engineering is that it is treated as a
scientific discipline. It uses precise engineering
processes to work. Four steps in particular are followed.
1.
Form a hypothesis: Ask yourself, "What could go
wrong?”
2.
Plan your experiment: Determine how you can
recreate that problem in a safe way that won’t
impact users (internal or external).
3.
Minimize the blast radius: Start with the
smallest experiment that will teach you
something.
4.
Run the experiment: Make sure to carefully
observe the results.
5.
Celebrate the outcome: If things didn't work as
they should, you found a bug! Success! If
everything went as planned, increase the blast
radius and start over at #1.
6.
Complete the mission: You’re done once you
have run the experiment at full scale in
production, and everything works as expected.
9
Chaos Engineering: Breaking Your Systems for Fun and Profit
Some examples of what you might do to the hypothetic
system when performing a chaos engineering
experiment:
Reboot or halt the host operating system. This
allows you to test things like how your system
reacts when losing one or more cluster machines.
Change the host’s system time. This can be used
to test your system’s capability to adjust to
daylight saving time and other time-related
events.
crashes.
Simulate an attack that kills a process. This can
be used to simulate application or dependency
Naturally, you immediately address any potential
problems that you uncover with chaos engineering.
Indeed, the point of simulating potentially catastrophic
events is to make them non-events  that are irrelevant to
our infrastructure’s ability to perform as required.
Chaos engineering differs from the regular testing that
everyone does as a matter of course in several ways.
Normal testing is done during build / compile activities,
and doesn't test for different configurations or
behaviors or factors beyond your control. Additionally,
routine testing doesn’t account for people--for training
and preparing them for the failures they will be
responsible for fixing live, in the middle of the night.
10
The point of simulating
potentially catastrophic
events is to make them
non-events  that are
irrelevant to our
infrastructure’s ability to
perform as required.
Chaos Engineering: Breaking Your Systems for Fun and Profit
Benefits of
chaos engineering
Companies like Amazon, Netflix, Salesforce, and Uber
have been using chaos engineering for years to make
their systems more reliable. For internet companies
whose very existence depends on their ability to be “up”
at all times, chaos engineering was a necessity. Now
businesses in other industries—financial services in
particular—are starting to follow suit, and implement
chaos engineering programs of their own.
The benefits of chaos engineering include the following:
Help technology professionals see how systems
behave in the face of failure, as their assumptions
are often incomplete or inaccurate
Validate that hypothetical defenses against
failure will work when needed by exercising them
at scale in production environments
Provide the ability to revert systems back to their
original states without impacting customers,
employees, or consumers
Save time and money spent responding to
systems outages
11
Chaos Engineering: Breaking Your Systems for Fun and Profit
How do you know if a chaos engineering program is
working? The top-level measure is overall system
availability. For example, companies like Amazon or
Netflix measure how available they are by whether their
customers can use their product. They define availability
in terms of “9s.” Four nines availability mean that a
system is available 99.99% of the time. Five nines
availability is better, meaning the system is available
99.999% of the time. Six 9s are even better. Specific
applications and sub-services are often measured using
this metric as well.
Translate these numbers into actual outage time, and
you see why it matters (see Figure 2). You can see why
six nines is today considered the gold standard of
reliability.
Another metric is the frequency and duration of
outages. Yet another metric is measuring the
operational burden of staff of system outages. How
often did you have to page an IT support professional?
How frequently did they have to answer a call at 2am to
firefight an issue?
Chaos engineering is also good for disaster recovery
(DR) efforts. If you regularly break your systems using
tight experimental controls, then when your systems go
down unexpectedly, you’re in a much better position to
recover quickly. You have your people trained, and you
can respond more promptly. You can even put self-
healing properties in place so you can continue to
maintain service with minimal disruption.
12
Chaos Engineering: Breaking Your Systems for Fun and Profit
If we break a system in a controlled and careful manner
and we make sure we can recover from it, then, when
outages happen unexpectedly, we're in a better position
because our human workers are being trained to
respond systematically to failure, as opposed to being
called at 2am.
Systems can also self-heal, auto-recover so that they
can operate in a degraded state and still maintain their
service levels. The goal is resilience, rather than
stability. Resilience meaning systems can gracefully
handle inevitable failure without impacting users.
Using chaos engineering for disaster recovery is also
important for compliance reasons. Sarbanes Oxley II as
well as industry- or geography-specific regulatory
mandates require that you can recover quickly from a
disaster. But efforts to comply are often done at the
theoretical level, as so-called “table-top” exercises, and
are therefore incomplete.
No. of 9s
Amount of downtime
Four nines (99.99% availability)
52 minutes 36 seconds
Five nines (99.999% availability)
5 minutes 15 seconds
Six nines (99.9999%
availability)
22 seconds
Figure 2: Translating 9s of availability into minutes of downtime
13
Chaos Engineering: Breaking Your Systems for Fun and Profit
Best practices in
chaos engineering
Many businesses are skeptical that deliberately trying to
crash systems will make them stronger. And they are
correct that there are risks. However, there are also
best practices that mitigate those risks.
Minimize the “blast radius.” Start with the smallest
chaos experiment you can perform that will teach you
something about your system. See what happens. Then
increase the scope as you learn and as your confidence
grows.
Don’t be a chaos monkey. Chaos Monkey was Netflix's
famous—or infamous—tool that randomly rebooted
servers. Unfortunately, today many people believe that
chaos engineering means randomly breaking things. The
reason this is not an optimal approach is that “random”
is difficult to measure. You are not approaching the
problem using experimental methods. The idea behind
chaos engineering is to perform thoughtful, planned,
and scientific experiments instead of simply random
failure testing.
14
Chaos Engineering: Breaking Your Systems for Fun and Profit
Build (using open source)
Buy
Limited set of tools available
Growing availability of solutions
Everyone who does it is reinventing the
wheel
Piggyback on earlier successes
Costly and time-consuming to train
internal team
Avoid engineer and administrator burnout
Unsupported open source releases open
up security vulnerabilities
Security embedded in solution
No “kill switch” or safety valve to stop out-
of-control experiments from taking down
production systems
Kill switch to avoid impacting users
Figure 3: The debate between building and buying chaos engineering tools
Start in a staging environment. Yes, you must
eventually test in your production system, but it makes
sense to start in a staging or development environment
and work your way up. Start with a single host,
container, or microservice in your test environment.
Then try to crash several of them. Once you've hit 100%
in your test environment, you reset to the smallest bit
possible in production, and take it from there.
15
Chaos Engineering: Breaking Your Systems for Fun and Profit
Avoid the “drift into failure.” This concept, invented by
flight accident investigation expert Sydney Dekker,
refers to the fact that tension always exists in systems
between efficiency and safety. Since businesses need to
be mindful of costs, they tend to operate on the edge of
safety. So once you understand a particular kind of
failure and you've tested it, you want to automate the
testing of it in a continuous deployment pipeline so you
maintain that competence.
Always have a kill switch. This is akin to an “undo”
button or safety valve. Make sure you have a way to stop
all chaos engineering experiments immediately, on the
spot, and return all systems to their normal state. If
your chaos engineering causes a high-severity incident
(SEV), then track it carefully and do a full post-mortem
analysis of what went wrong.
Fix known problems first. Never conduct a chaos
experiment in production if you already know that it will
cause severe damage, possibly affecting customers — 
and your reputation. Always try to fix known problems
first.
16
Chaos Engineering: Breaking Your Systems for Fun and Profit
Conclusion:
eschew complacency
Systems will break. And as systems continue to grow in
complexity, they will break more often. If you're not
prepared for that, they'll break in unexpected ways at
unexpected times and bring your software or service
down. You’ll have unhappy customers and unhappy
employees, and the costs are probably higher than you
think.
Too many businesses are complacent. They think that
just because they haven’t had a major system outage
before--or one that directly impacted customers--that
they’re safe. Or they think that the cost of deploying
chaos engineering is more than the cost of simply fixing
any problems that arise. They’re wrong. Companies that
don’t address resiliency issues with chaos engineering
may end up hiring tens or even hundreds of systems
administrators just to maintain system uptime. That
adds up.
In today’s interconnected, internet-based world, no one
is safe from system failure. The only way for it to not
impact your customers, employees, partners, your
reputation, and your bottom line, is to proactively
address it upfront. Chaos engineering is the optimal way
to do this.
17
Chaos Engineering: Breaking Your Systems for Fun and Profit
18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment