
## Extracting text from an English PDF with pdfminer


Chaos Engineering: Finding Failures Before They Become Outages


Chaos Engineering: Breaking Your Systems for Fun and Profit

Diane Glazman will never fly British Airways (BA) again. Glazman and her husband were among the 75,000 people affected by the three-day BA system failure in the summer of 2017. On their way from San Francisco to their son’s college graduation in Edinburgh, they were stranded in London, without their luggage, at the beginning of what was to be a three-week dream tour of Scotland. “Listening to the excuses was frustrating because nothing explained why BA was so unprepared for such a catastrophic failure,” says Glazman.

BA lost an estimated $135 million due to that outage. The culprit turned out to be a faulty uninterruptible power supply (UPS) device, the corporate cousin of the $10 gadget you can find in your corner Radio Shack. And that loss figure doesn’t count the forever-gone trust of customers like Glazman, who will look elsewhere for transatlantic flights the next time she travels.

BA, of course, isn’t alone in having suffered financially from system downtime. There were also United Airlines (200 flights delayed for 2.5 hours, thousands of passengers stranded or missing connections), Starbucks (couldn’t accept any payments but cash in affected stores), Facebook (millions of users offline and tens of millions of ads not served during the 2.5 hours of downtime), and WhatsApp (600 million users affected, 5 billion messages lost). And when Amazon S3 went down in March 2017, it collectively cost Amazon's customers $150 million.


[Figure: Companies that suffered major outages in 2017]

In fact, 2017 was a banner year for system outages, and for their cost.

The 2017 ITIC Cost of Downtime survey finds that 98% of organizations say a single hour of downtime costs more than $100,000. More than eight in 10 companies indicated that 60 minutes of downtime costs their business more than $300,000. And a record one-third of enterprises report that one hour of downtime costs their firms $1 million to more than $5 million (see Figure 1). The average cost of a single hour of unplanned downtime has risen by 25% to 30% since 2008, when ITIC first began tracking these figures.


| % of companies reporting | Cost of 1 hr of downtime |
|---|---|
| 98% | $100,000+ |
| 80% | $300,000+ |
| 33% | $1M – $5M+ |

Figure 1: Cost of 60 minutes of downtime

So how can organizations cut the risk of downtime? The answer: break your systems on purpose. Find out their weaknesses and fix them before they break when least expected.


It’s called chaos engineering, and it’s being adopted by leading financial institutions, internet companies, and manufacturing firms throughout the world. Such businesses understand that the trillions of dollars lost annually to downtime are not acceptable to their customers, their stockholders, or their employees.


A more complex, distributed world

In the traditional corporate computer environment of 30 years ago, software ran in a highly controlled environment that had few moving parts or variables. But in the new business world of the internet, globally connected systems, a mix of cloud and bare-metal infrastructure, and more moving parts than you can count, your software needs a lot of infrastructure and services that are outside your control in order to run smoothly.

[Figure: Share of companies in the cloud by the end of 2017. Already moved to the cloud: 70%; planning to move: 16%; total: 86%.]

First and foremost, there’s the rapid shift to the cloud. A full 70% of companies have already moved at least one application to the cloud, according to IDC, with 16% more planning to do so by the end of 2017.

Then there’s the rise of microservices. Most enterprises are developing their software today on a microservice architecture. Applications are built as small and independent but interconnected modular services. Each service runs a unique process meant to meet a particular business goal. For example, one microservice might track inventory levels of products. Another might handle serving personalized recommendations to customers.

Then there are all the web servers, databases, load balancers, routers, and more that must work together to form a coherent whole. This is not easy.

The good news is that as this modular, distributed infrastructure continues to evolve, businesses can do things with software that simply weren’t possible before. By shifting to what are also called “loosely coupled services,” which can be developed and released independently of each other by similarly organized teams who are empowered to make changes, businesses can radically reduce their time to market.

But these new capabilities come at a price.


Today, businesses face a serious “complexity gap.” It’s difficult for even the most technically astute individuals to understand how all the different bits and pieces work together. The ability of your IT professionals to manage the ever-evolving sophistication of computer infrastructure falls short of what’s needed. With so many points of possible failure in your systems, this makes your business extremely vulnerable.

This is happening at a time when people, businesspeople as well as consumers, are increasingly dependent on the internet and on the services that corporations deliver over it. The number of internet users worldwide in 2017 is 3.58 billion, up from 3.39 billion in 2016.

Even fewer people today can operate without a phone. In 2017 the number of mobile phones reached 4.77 billion, and it is expected to pass the five billion mark by 2019. And all these users, both consumers and businesses, are demanding: DoubleClick found in 2016 that 53% of mobile site visits are abandoned if a web page takes longer than three seconds to load.

As a result, the stakes have never been higher for companies to maintain the uptime of their systems.


Chaos engineering, a primer

Imagine that it’s 1796, and you have been selected to be injected with Edward Jenner’s brand-new smallpox vaccine. You are told that he is going to put the actual virus into your bloodstream. You are also told that you won’t get sick. Instead, it will make it impossible for you to get sick with this particular virus, because it will make your system stronger. You might recoil, thinking that the risk is too great. Yet you allow yourself to be vaccinated. And, of course, you are much better off than those who refuse the treatment.

This is exactly what chaos engineering does. All computers have limits and possible points of failure. By injecting a system with something that has the potential to disrupt it, you can identify where the system may be weak and take steps to make it more resilient.

That covers the systems under your purview. Problems that occur with your cloud service providers are out of your control, and you can’t resolve outages by adding extra boxes or power supplies.


Think of all your data living in the cloud, in Amazon S3 or DynamoDB, and the hosted services you depend on, such as Salesforce or Workday. If they fail, you're at their mercy. Chaos engineering isn’t just essential for your applications; it’s essential for the companies behind those applications, which is why Netflix, Uber, and Amazon all have teams dedicated to chaos and reliability: they know they cannot afford to let their customers down.

Here’s where chaos engineering comes in: you know you have these potential points of failure and vulnerabilities. So why wait until there’s a problem?

Imagine attempting to break your systems. On purpose. Before they fail on their own. Because that is what chaos engineering does. By triggering failures intentionally in a controlled way, you gain confidence that your systems can deal with those failures before they occur in production.

The goal of chaos engineering is to teach you something new about your systems’ vulnerabilities by performing experiments on them. You seek to identify hidden problems that could arise in production before they cause an outage. Only then will you be able to address systemic weaknesses and make your systems fault-tolerant.


Critical to chaos engineering is that it is treated as a scientific discipline. It relies on precise engineering processes. Six steps in particular are followed.

1. Form a hypothesis: Ask yourself, “What could go wrong?”
2. Plan your experiment: Determine how you can recreate that problem in a safe way that won’t impact users (internal or external).
3. Minimize the blast radius: Start with the smallest experiment that will teach you something.
4. Run the experiment: Make sure to carefully observe the results.
5. Celebrate the outcome: If things didn't work as they should, you found a bug! Success! If everything went as planned, increase the blast radius and start over at #1.
6. Complete the mission: You’re done once you have run the experiment at full scale in production, and everything works as expected.
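The steps above can be sketched as a small loop. This is only an illustration, not a tool from the whitepaper: `run_chaos_experiment`, `ExperimentResult`, and the `ToyCluster` stand-in are hypothetical names, and the "system" is a simulated five-host cluster that is assumed to need three healthy hosts to keep serving.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class ExperimentResult:
    blast_radius: int      # how many targets the failure was injected into
    hypothesis_held: bool  # True if the system behaved as predicted


def run_chaos_experiment(
    inject_failure: Callable[[int], None],
    hypothesis_holds: Callable[[], bool],
    rollback: Callable[[], None],
    blast_radii: Sequence[int],
) -> List[ExperimentResult]:
    """Run one hypothesis at growing blast radii; stop at the first surprise."""
    results: List[ExperimentResult] = []
    for radius in blast_radii:            # smallest experiment first
        try:
            inject_failure(radius)        # run the experiment
            held = hypothesis_holds()     # observe the result
        finally:
            rollback()                    # always restore the normal state
        results.append(ExperimentResult(radius, held))
        if not held:                      # found a weakness: fix it before going wider
            break
    return results


class ToyCluster:
    """Hypothetical stand-in for a real system (assumption, for illustration)."""

    def __init__(self, hosts: int = 5, quorum: int = 3) -> None:
        self.hosts = hosts
        self.quorum = quorum
        self.healthy = hosts

    def stop_hosts(self, count: int) -> None:
        self.healthy = self.hosts - count

    def is_serving(self) -> bool:
        return self.healthy >= self.quorum

    def restore(self) -> None:
        self.healthy = self.hosts


cluster = ToyCluster()
results = run_chaos_experiment(
    cluster.stop_hosts, cluster.is_serving, cluster.restore, blast_radii=[1, 2, 3]
)
for result in results:
    print(result)
```

In this toy run the hypothesis ("the cluster keeps serving") holds when one or two hosts are stopped and fails at three, which is the point where a real team would stop widening the experiment and fix the weakness.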


Some examples of what you might do to a hypothetical system when performing a chaos engineering experiment:

• Reboot or halt the host operating system. This allows you to test things like how your system reacts when losing one or more cluster machines.

• Change the host’s system time. This can be used to test your system’s capability to adjust to daylight saving time and other time-related events.

• Simulate an attack that kills a process. This can be used to simulate application or dependency crashes.
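As a rough illustration of the last item, the sketch below kills a process and checks whether a recovery path brings it back. Everything here is an assumption for demonstration: the "worker" is just a Python child process that sleeps, and `supervise` is a hypothetical one-shot restart check rather than a real process supervisor. It only touches a process it started itself.

```python
import subprocess
import sys

# Hypothetical stand-in worker: a child process that simply sleeps.
WORKER_CMD = [sys.executable, "-c", "import time; time.sleep(60)"]


def start_worker() -> subprocess.Popen:
    return subprocess.Popen(WORKER_CMD)


def supervise(worker: subprocess.Popen) -> subprocess.Popen:
    """Restart the worker if it has exited; otherwise return it unchanged."""
    if worker.poll() is not None:
        return start_worker()
    return worker


def kill_process_experiment() -> bool:
    """Kill the worker abruptly and report whether recovery succeeded."""
    worker = start_worker()
    try:
        worker.kill()                  # the injected failure
        worker.wait(timeout=10)        # confirm the process is really gone
        worker = supervise(worker)     # the recovery path under test
        return worker.poll() is None   # hypothesis: a worker is running again
    finally:
        if worker.poll() is None:      # clean up the replacement worker
            worker.kill()
            worker.wait(timeout=10)


print("recovered:", kill_process_experiment())
```

The shape is the same for a real service: inject the failure, observe whether the recovery mechanism does its job, and clean up afterwards.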

Naturally, you immediately address any potential problems that you uncover with chaos engineering. Indeed, the point of simulating potentially catastrophic events is to make them non-events that are irrelevant to our infrastructure’s ability to perform as required.

Chaos engineering differs in several ways from the regular testing that everyone does as a matter of course. Normal testing is done during build and compile activities, and doesn't test for different configurations, behaviors, or factors beyond your control. Additionally, routine testing doesn’t account for people: it does nothing to train and prepare them for the failures they will be responsible for fixing live, in the middle of the night.


Benefits of chaos engineering

Companies like Amazon, Netflix, Salesforce, and Uber have been using chaos engineering for years to make their systems more reliable. For internet companies whose very existence depends on their ability to be “up” at all times, chaos engineering was a necessity. Now businesses in other industries, financial services in particular, are starting to follow suit and implement chaos engineering programs of their own.

The benefits of chaos engineering include the following:

• Help technology professionals see how systems behave in the face of failure, as their assumptions are often incomplete or inaccurate

• Validate that hypothetical defenses against failure will work when needed by exercising them at scale in production environments

• Provide the ability to revert systems back to their original states without impacting customers, employees, or consumers

• Save time and money spent responding to systems outages


How do you know if a chaos engineering program is working? The top-level measure is overall system availability. For example, companies like Amazon or Netflix measure how available they are by whether their customers can use their product. They define availability in terms of “nines.” Four nines of availability means that a system is available 99.99% of the time. Five nines is better, meaning the system is available 99.999% of the time. Six nines is better still. Specific applications and sub-services are often measured using this metric as well.

Translate these numbers into actual outage time, and you see why it matters (see Figure 2). You can see why six nines is today considered the gold standard of reliability.

Another metric is the frequency and duration of outages. Yet another is the operational burden that system outages place on staff. How often did you have to page an IT support professional? How frequently did they have to answer a call at 2am to firefight an issue?

Chaos engineering is also good for disaster recovery (DR) efforts. If you regularly break your systems using tight experimental controls, then when your systems go down unexpectedly, you’re in a much better position to recover quickly. You have your people trained, and you can respond more promptly. You can even put self-healing properties in place so you can continue to maintain service with minimal disruption.


If we break a system in a controlled and careful manner and make sure we can recover from it, then when outages happen unexpectedly we're in a better position, because our people have been trained to respond systematically to failure rather than simply being called at 2am.

Systems can also self-heal and auto-recover so that they can operate in a degraded state and still maintain their service levels. The goal is resilience rather than stability: resilience means that systems can gracefully handle inevitable failure without impacting users.

Using chaos engineering for disaster recovery is also important for compliance reasons. Sarbanes-Oxley II as well as industry- or geography-specific regulatory mandates require that you can recover quickly from a disaster. But efforts to comply are often made only at the theoretical level, as so-called “table-top” exercises, and are therefore incomplete.

| No. of 9s | Amount of downtime (per year) |
|---|---|
| Four nines (99.99% availability) | 52 minutes 36 seconds |
| Five nines (99.999% availability) | 5 minutes 15 seconds |
| Six nines (99.9999% availability) | 32 seconds |

Figure 2: Translating 9s of availability into minutes of downtime
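The conversion from nines to downtime is simple arithmetic: multiply the length of a year by the fraction of time the system is allowed to be unavailable. The sketch below assumes an average year of 365.25 days; the function names are illustrative and not taken from the whitepaper.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # assumption: average year length


def downtime_seconds_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability of `nines` nines."""
    unavailability = 10 ** -nines  # e.g. 4 nines -> 0.0001
    return SECONDS_PER_YEAR * unavailability


def format_duration(seconds: float) -> str:
    """Render a duration rounded to the nearest whole second."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes} min {secs} s" if minutes else f"{secs} s"


for nines in (4, 5, 6):
    print(nines, "nines:", format_duration(downtime_seconds_per_year(nines)))
```

With this year length, four nines works out to 52 min 36 s and five nines to a little over 5 min 15 s per year. For six nines the same arithmetic gives roughly 32 seconds per year.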


Best practices in chaos engineering

Many businesses are skeptical that deliberately trying to crash systems will make them stronger. And they are correct that there are risks. However, there are also best practices that mitigate those risks.

Minimize the “blast radius.” Start with the smallest chaos experiment you can perform that will teach you something about your system. See what happens. Then increase the scope as you learn and as your confidence grows.

Don’t be a chaos monkey. Chaos Monkey was Netflix's famous (or infamous) tool that randomly rebooted servers. Unfortunately, many people today believe that chaos engineering means randomly breaking things. This is not an optimal approach, because “random” is difficult to measure and means you are not approaching the problem with experimental methods. The idea behind chaos engineering is to perform thoughtful, planned, and scientific experiments instead of simply random failure testing.


| Build (using open source) | Buy |
|---|---|
| Limited set of tools available | Growing availability of solutions |
| Everyone who does it is reinventing the wheel | Piggyback on earlier successes |
| Costly and time-consuming to train internal team | Avoid engineer and administrator burnout |
| Unsupported open source releases open up security vulnerabilities | Security embedded in solution |
| No “kill switch” or safety valve to stop out-of-control experiments from taking down production systems | Kill switch to avoid impacting users |

Figure 3: The debate between building and buying chaos engineering tools

Start in a staging environment. Yes, you must eventually test in your production system, but it makes sense to start in a staging or development environment and work your way up. Start with a single host, container, or microservice in your test environment. Then try to crash several of them. Once you've hit 100% in your test environment, you reset to the smallest bit possible in production, and take it from there.


Avoid the “drift into failure.” This concept, invented by flight accident investigation expert Sidney Dekker, refers to the fact that tension always exists in systems between efficiency and safety. Since businesses need to be mindful of costs, they tend to operate on the edge of safety. So once you understand a particular kind of failure and you've tested it, you want to automate the testing of it in a continuous deployment pipeline so you maintain that competence.

Always have a kill switch. This is akin to an “undo” button or safety valve. Make sure you have a way to stop all chaos engineering experiments immediately, on the spot, and return all systems to their normal state. If your chaos engineering causes a high-severity incident (SEV), then track it carefully and do a full post-mortem analysis of what went wrong.
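One way to picture a kill switch is as a shared flag that is checked before every injection step, combined with a rollback that always runs. The sketch below is a hypothetical design, not something the whitepaper prescribes: `KillSwitch` and `run_with_kill_switch` are illustrative names, and the "steps" merely append to a log instead of touching real hosts.

```python
import threading
from typing import Callable, Iterable, List


class KillSwitch:
    """A shared flag that an operator or monitor can trip to halt experiments."""

    def __init__(self) -> None:
        self._tripped = threading.Event()

    def trip(self) -> None:
        self._tripped.set()

    def is_tripped(self) -> bool:
        return self._tripped.is_set()


def run_with_kill_switch(
    steps: Iterable[Callable[[], None]],
    rollback: Callable[[], None],
    kill_switch: KillSwitch,
) -> int:
    """Run injection steps in order, stopping early if the switch is tripped.

    Returns the number of steps that ran. The rollback always executes,
    whether the run finished, was halted, or raised an exception.
    """
    completed = 0
    try:
        for step in steps:
            if kill_switch.is_tripped():
                break
            step()
            completed += 1
    finally:
        rollback()
    return completed


def stop_host_2_and_trip() -> None:
    log.append("stop host 2")
    switch.trip()  # simulates a monitor detecting user impact at this point


switch = KillSwitch()
log: List[str] = []
steps = [
    lambda: log.append("stop host 1"),
    stop_host_2_and_trip,
    lambda: log.append("stop host 3"),
]
completed = run_with_kill_switch(steps, lambda: log.append("rollback"), switch)
print(completed, log)
```

Because the flag is a `threading.Event`, it can be tripped from another thread, such as a monitoring loop or an operator command, while the experiment is running.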

Fix known problems first. Never conduct a chaos experiment in production if you already know that it will cause severe damage, possibly affecting customers and your reputation. Always try to fix known problems first.


Conclusion: eschew complacency

Systems will break. And as systems continue to grow in complexity, they will break more often. If you're not prepared for that, they'll break in unexpected ways at unexpected times and bring your software or service down. You’ll have unhappy customers and unhappy employees, and the costs are probably higher than you think.

Too many businesses are complacent. They think that just because they haven’t had a major system outage before, or at least not one that directly impacted customers, they’re safe. Or they think that the cost of deploying chaos engineering is more than the cost of simply fixing any problems that arise. They’re wrong. Companies that don’t address resiliency issues with chaos engineering may end up hiring tens or even hundreds of systems administrators just to maintain system uptime. That adds up.

In today’s interconnected, internet-based world, no one is safe from system failure. The only way to keep it from impacting your customers, employees, partners, reputation, and bottom line is to address it proactively, up front. Chaos engineering is the optimal way to do this.



	Chaos Engineering:
	Finding Failures Before
	They Become Outages


	Chaos Engineering: Breaking Your Systems for Fun and Proﬁt

	D

	iane Glazman will never ﬂy British Airways (BA)

	again. Glazman and her husband were among

	the 75,000 people aﬀected by the three-day BA

	system failure summer 2017. On their way from San

	Francisco to their son’s college graduation in Edinburgh,

	they were stranded in London—without their luggage—at

	the beginning of what was to be the three-week dream

	tour of Scotland. “Listening to the excuses was

	frustrating because nothing explained why BA was so

	unprepared for such a catastrophic failure,” says

	Glazman.

	BA lost an estimated $135 million due to that outage.

	The culprit turned out to be a faulty uninterruptable

	power supply device (UPS)—the corporate cousin to the

	$10 gadget you can ﬁnd in your corner Radio Shack. And

	that loss ﬁgure doesn’t count the forever-gone trust of

	customers like Glazman, who will look elsewhere for

	transatlantic ﬂights next time she travels.

	BA of course isn’t alone for having suﬀered ﬁnancially

	for having its systems down. There were also United

	Airlines (200 ﬂights delayed for 2.5 hours, thousands of

	passengers stranded or missed connections),

	Starbucks (couldn’t accept any payments but cash in

	aﬀected stores), Facebook (millions of users oﬄine and

	tens of millions of ads not served during the 2.5 hours of

	downtime), and WhatsApp (600 million users aﬀected, 5

	billion messages lost). And when Amazon S3 went down

	in March 2017, it collectively cost Amazon's customers

	$150 million.

	1


	Chaos Engineering: Breaking Your Systems for Fun and Proﬁt

	Suffered major outages in 2017

	In fact, 2017 was a banner year for systems outages—

	and for the cost of them.

	The 2017 ITIC Cost of Downtime survey ﬁnds that 98%

	of organizations say a single hour of downtime costs

	more than $100,000. More than eight in 10 companies

	indicated that 60 minutes of downtime costs their

	business more $300,000. And a record one-third of

	enterprises report that one hour of downtime costs

	their ﬁrms $1 million to more than $5 million (see Figure

	1). The average cost of a single hour of unplanned

	downtime has risen by 25% to 30% since 2008 when

	ITIC ﬁrst began tracking these ﬁgures.

	2


	Chaos Engineering: Breaking Your Systems for Fun and Proﬁt

	$1M – $5M+

	$300,000+

	98%

	$100,000+

	% of Companies
	Reporting

	Cost of 1hr of
	Downtime

	33%

	80%

	$1M – $5M+

	$300,000+

	Figure 1: Cost of 60 minutes of downtime

	So how can organizations cut the risk of downtime? The

	answer: break your systems on purpose. Find out their

	weaknesses and ﬁx them before they break when least

	expected.

	Break your systems on
	purpose. Find out their
	weaknesses and fix
	them before they break
	when least expected.

	It’s called chaos engineering, and it’s being adopted by

	leading ﬁnancial institutions, internet companies, and

	manufacturing ﬁrms throughout the world. Such

	businesses understand that the trillions of dollars lost

	annually due to downtime is not acceptable to their

	customers, their stockholders, and their employees.

	3


	Chaos Engineering: Breaking Your Systems for Fun and Proﬁt

	A more complex,
	distributed world

	In the traditional corporate computer environment of 30

	years ago, software ran in a highly controlled

	environment that had few moving parts or variables.

	But in the new business world that depends on the

	internet, globally connected systems, a mix of cloud and

	bare-metal infrastructure, and more moving parts than

	you can count, your software depends on a lot of

	infrastructure and services that are outside your control

	to run smoothly.

	Already Moved

	Planning to Move

	86%

	In Cloud by End of
	2017

	4

	70%

	16%

	Already Moved
	to the Cloud

	Planning to Move


	Chaos Engineering: Breaking Your Systems for Fun and Proﬁt

	First and foremost, there’s the rapid shift to the cloud. A

	full 70% of companies have already moved at least one

	application to the cloud, according to IDC, with 16%

	more planning to do so by the end of 2017.

	Then there’s the rise of microservices. Most enterprises

	are developing their software today on a microservice

	architecture. Applications are built as small and

	independent but interconnected modular services. Each

	service runs a unique process meant to meet a

	particular business goal. For example, one microservice

	might track inventory levels of products. Another might

	handle serving personalized recommendations to

	customers.

	Then there are all the web servers, databases, load

	balancers, routers, and more that must work together

	to form a coherent whole. This is not easy.

	The good news is that as this modular, distributed

	infrastructure continues to evolve, businesses can do

	things with software that simply weren’t possible

	before. By shifting to what are also called ”loosely

	coupled services” that can be developed and released

	independently of each other, developed by similarly

	organized teams who are empowered to make changes,

	time to market for businesses can be radically reduced.

	But these new capabilities come at a price.

	5


	Chaos Engineering: Breaking Your Systems for Fun and Proﬁt

	Today, businesses face a serious “complexity gap.” It’s

	diﬃcult for even the most technically astute individuals

	to understand how all the diﬀerent bits and pieces work

	together. The ability of your IT professionals to manage

	the ever-evolving sophistication of computer

	infrastructure falls short of what’s needed. With so

	many points of possible failure in your systems, this

	makes your business extremely vulnerable.

	This is happening at a time when people—

	businesspeople as well as consumers—are increasingly

	dependent on the internet and the services that

	corporations deliver that depend on it. The number of

	internet users worldwide in 2017 is 3.58 billion, up from

	3.39 billion in 2016.

	Even fewer people today can operate without a phone.

	In 2017 the number of mobile phones reached 4.77

	billion, and is expected to pass the ﬁve billion mark by

	2019. And all these users--both consumers and

	business--are demanding: Doubleclick found in 2016

	that 53% of mobile site visits are abandoned if a web

	page takes longer than three seconds to load.

	As a result, the stakes have never been higher for

	companies to maintain the uptime of their systems.

	6

	53% of mobile users
	abandon webpages that
	take more than three
	seconds to load.


	Chaos Engineering: Breaking Your Systems for Fun and Proﬁt

	Chaos engineering,
	a primer

	Imagine that it’s 1796, and you have been selected to be

	injected with Edward Jenner’s brand-new smallpox

	vaccine. You are told that he is going to put the actual

	virus into your bloodstream. You are also told that you

	won’t get sick. Instead, it will make it impossible for you

	to get sick with this particular virus, because it will make

	your system stronger. You might have recoiled, thinking

	that the risk was too great. Yet you allow yourself to be

	vaccinated. And, of course, you are much better oﬀ than

	those who refuse the treatment.

	This is exactly what chaos engineering does. All

	computers have limits, and possible points of failure. By

	injecting a system with something that has the potential

	to disrupt it, you can identify where the system may be

	weak, and can take steps to make it more resilient.

	That covers the systems under your purview. Problems

	that occur with your cloud service providers are out of

	your control, and you can’t resolve outages by adding

	extra boxes or power supplies.

	By injecting a system
	with something that has
	the potential to disrupt
	it, you can identify
	where the system may
	be weak, and can take
	steps to make it more
	resilient.

	7


	Chaos Engineering: Breaking Your Systems for Fun and Proﬁt

	Think of all your data living in the cloud, in Amazon S3

	or DynamoDB, and the hosted services you depend on,

	such as Salesforce or Workday. If they fail, you're at

	their mercy. Chaos engineering isn’t just essential for

	your applications, it’s essential for the companies

	behind those applications, which is why Netﬂix, Uber,

	and Amazon all have teams dedicated to chaos and

	reliability: they know they cannot aﬀord to let their

	customers down.

	Here’s where chaos engineering comes in: you know you

	have these potential points of failure and vulnerabilities.

	So why wait until there’s a problem?

	Imagine attempting to break your systems. On purpose.

	Before they fail on their own. Because that is what

	chaos engineering does. By triggering failures

	intentionally in a controlled way, you gain conﬁdence

	that your systems can deal with those failures before

	they occur in production.

	The goal of chaos engineering is to teach you something

	new about your systems’ vulnerabilities by performing

	experiments on them. You seek to identify hidden

	problems that could arise in production prior to them

	causing an outage. Only then will you be able to address

	systemic weaknesses and make your systems fault-

	tolerant.

	8


	Chaos Engineering: Breaking Your Systems for Fun and Proﬁt

	Critical to chaos engineering is that it is treated as a

	scientiﬁc discipline. It uses precise engineering

	processes to work. Four steps in particular are followed.

	1.

	Form a hypothesis: Ask yourself, "What could go

	wrong?”

	2.

	Plan your experiment: Determine how you can

	recreate that problem in a safe way that won’t

	impact users (internal or external).

	3.

	Minimize the blast radius: Start with the

	smallest experiment that will teach you

	something.

	4.

	Run the experiment: Make sure to carefully

	observe the results.

	5.

	Celebrate the outcome: If things didn't work as

	they should, you found a bug! Success! If

	everything went as planned, increase the blast

	radius and start over at #1.

	6.

	Complete the mission: You’re done once you

	have run the experiment at full scale in

	production, and everything works as expected.

	9


	Chaos Engineering: Breaking Your Systems for Fun and Proﬁt

	Some examples of what you might do to the hypothetic

	system when performing a chaos engineering

	experiment:

	•

	•

	•

	Reboot or halt the host operating system. This

	allows you to test things like how your system

	reacts when losing one or more cluster machines.

	Change the host’s system time. This can be used

	to test your system’s capability to adjust to

	daylight saving time and other time-related

	events.

	crashes.

	Simulate an attack that kills a process. This can

	be used to simulate application or dependency

	Naturally, you immediately address any potential

	problems that you uncover with chaos engineering.

	Indeed, the point of simulating potentially catastrophic
	events is to make them non-events that are irrelevant to

	our infrastructure’s ability to perform as required.

	Chaos engineering diﬀers from the regular testing that

	everyone does as a matter of course in several ways.

	Normal testing is done during build / compile activities,

	and doesn't test for diﬀerent conﬁgurations or

	behaviors or factors beyond your control. Additionally,

	routine testing doesn’t account for people--for training

	and preparing them for the failures they will be

	responsible for ﬁxing live, in the middle of the night.

	10

	The point of simulating
	potentially catastrophic
	events is to make them
	non-events that are
	irrelevant to our
	infrastructure’s ability to
	perform as required.


	Chaos Engineering: Breaking Your Systems for Fun and Proﬁt

	Benefits of
	chaos engineering

	Companies like Amazon, Netﬂix, Salesforce, and Uber

	have been using chaos engineering for years to make

	their systems more reliable. For internet companies

	whose very existence depends on their ability to be “up”

	at all times, chaos engineering was a necessity. Now

	businesses in other industries—ﬁnancial services in

	particular—are starting to follow suit, and implement

	chaos engineering programs of their own.

	The benefits of chaos engineering include the following:

	• Help technology professionals see how systems behave in the face of failure, as their assumptions are often incomplete or inaccurate
	• Validate that hypothetical defenses against failure will work when needed by exercising them at scale in production environments
	• Provide the ability to revert systems back to their original states without impacting customers, employees, or consumers
	• Save time and money spent responding to systems outages


	How do you know if a chaos engineering program is working? The top-level measure is overall system availability. For example, companies like Amazon or Netflix measure how available they are by whether their customers can use their product. They define availability in terms of “9s.” Four nines availability means that a system is available 99.99% of the time. Five nines availability is better, meaning the system is available 99.999% of the time. Six nines is better still. Specific applications and sub-services are often measured using this metric as well.

	Translate these numbers into actual outage time (see Figure 2), and you can see why they matter, and why six nines is today considered the gold standard of reliability.
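
The translation from nines to outage time is simple arithmetic: the allowed downtime is the unavailable fraction multiplied by the length of the period. The sketch below works it out per year. It assumes an average year of 365.25 days, and the function names are illustrative only.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # assumption: average year length


def downtime_seconds_per_year(availability_percent):
    """Return the yearly downtime, in seconds, allowed at a given availability."""
    return (1 - availability_percent / 100) * SECONDS_PER_YEAR


def format_duration(seconds):
    """Render a duration as minutes and whole seconds."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes} min {secs} s"


if __name__ == "__main__":
    for label, pct in [("four nines", 99.99),
                       ("five nines", 99.999),
                       ("six nines", 99.9999)]:
        print(label, format_duration(downtime_seconds_per_year(pct)))
```

Each additional nine divides the allowed downtime by ten, which is why moving from four nines to six is such a large step.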

	Another metric is the frequency and duration of outages. Yet another is the operational burden that system outages place on staff. How often did you have to page an IT support professional? How frequently did they have to answer a call at 2am to firefight an issue?
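
These measures are easy to compute once incidents are logged consistently. The following is a minimal sketch, assuming a hypothetical incident log of `(start, end, paged)` tuples. The 22:00 to 06:00 "after hours" window is also an assumption and should be adjusted to your own on-call policy.

```python
from datetime import datetime


def outage_metrics(incidents):
    """Summarize outage frequency, duration, and paging burden.

    `incidents` is a list of (start, end, paged) tuples, where start and end
    are datetimes and paged says whether someone had to be paged.
    """
    durations = [(end - start).total_seconds() / 60 for start, end, _ in incidents]
    pages = sum(1 for _, _, paged in incidents if paged)
    after_hours_pages = sum(
        1 for start, _, paged in incidents
        if paged and (start.hour >= 22 or start.hour < 6)  # assumed window
    )
    return {
        "outage_count": len(incidents),
        "total_minutes": sum(durations),
        "mean_minutes": sum(durations) / len(durations) if durations else 0.0,
        "pages": pages,
        "after_hours_pages": after_hours_pages,
    }


if __name__ == "__main__":
    log = [
        (datetime(2017, 3, 1, 2, 0), datetime(2017, 3, 1, 2, 30), True),
        (datetime(2017, 3, 5, 14, 0), datetime(2017, 3, 5, 14, 10), False),
    ]
    print(outage_metrics(log))
```

Tracking these numbers before and after introducing chaos experiments gives a simple way to see whether the program is reducing the burden on staff.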

	Chaos engineering is also good for disaster recovery (DR) efforts. If you regularly break your systems using tight experimental controls, then when your systems go down unexpectedly, you’re in a much better position to recover quickly. You have your people trained, and you can respond more promptly. You can even put self-healing properties in place so you can continue to maintain service with minimal disruption.


	If we break a system in a controlled and careful manner and make sure we can recover from it, then when outages happen unexpectedly we're in a better position, because our people have been trained to respond systematically to failure rather than meeting it for the first time when they are called at 2am.

	Systems can also self-heal, or auto-recover, so that they can operate in a degraded state and still maintain their service levels. The goal is resilience rather than stability, where resilience means that systems can gracefully handle inevitable failure without impacting users.

	Using chaos engineering for disaster recovery is also important for compliance reasons. Sarbanes Oxley II as well as industry- or geography-specific regulatory mandates require that you can recover quickly from a disaster. But compliance efforts are often carried out only at the theoretical level, as so-called “table-top” exercises, and are therefore incomplete.

	No. of 9s | Amount of downtime per year
	Four nines (99.99% availability) | 52 minutes 36 seconds
	Five nines (99.999% availability) | 5 minutes 15 seconds
	Six nines (99.9999% availability) | 32 seconds

	Figure 2: Translating 9s of availability into minutes of downtime


	Best practices in chaos engineering

	Many businesses are skeptical that deliberately trying to crash systems will make them stronger. And they are correct that there are risks. However, there are also best practices that mitigate those risks.

	Minimize the “blast radius.” Start with the smallest chaos experiment you can perform that will teach you something about your system. See what happens. Then increase the scope as you learn and as your confidence grows.

	Don’t be a chaos monkey. Chaos Monkey was Netflix's famous (or infamous) tool that randomly terminated server instances. Unfortunately, today many people believe that chaos engineering means randomly breaking things. This is not an optimal approach because “random” is difficult to measure: you are not approaching the problem using experimental methods. The idea behind chaos engineering is to perform thoughtful, planned, and scientific experiments instead of simply random failure testing.
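
One way to picture a planned experiment, as opposed to random breakage, is as a small procedure: state a hypothesis, confirm the system is in its normal state, inject one specific failure, measure, and always roll back. The sketch below is illustrative only. `run_experiment` and its callables are hypothetical names, and the hypothesis is reduced to a single threshold on a single metric.

```python
def run_experiment(measure, inject, rollback, max_allowed):
    """Run one planned chaos experiment against a single metric.

    Hypothesis: while the failure is injected, measure() stays <= max_allowed.
    """
    baseline = measure()
    if baseline > max_allowed:
        # The system is already unhealthy; do not make things worse.
        return {"ran": False, "baseline": baseline}

    inject()
    try:
        observed = measure()
    finally:
        rollback()  # always restore normal state, even if measuring fails

    return {
        "ran": True,
        "baseline": baseline,
        "observed": observed,
        "hypothesis_held": observed <= max_allowed,
    }


if __name__ == "__main__":
    # Toy system: a dict standing in for a service's response latency.
    system = {"latency_ms": 100}
    result = run_experiment(
        measure=lambda: system["latency_ms"],
        inject=lambda: system.update(latency_ms=180),
        rollback=lambda: system.update(latency_ms=100),
        max_allowed=250,
    )
    print(result)
```

Because the hypothesis and the metric are written down before anything is broken, the result is measurable in a way that random failure is not.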


	Build (using open source) | Buy
	Limited set of tools available | Growing availability of solutions
	Everyone who does it is reinventing the wheel | Piggyback on earlier successes
	Costly and time-consuming to train internal team | Avoid engineer and administrator burnout
	Unsupported open source releases open up security vulnerabilities | Security embedded in solution
	No “kill switch” or safety valve to stop out-of-control experiments from taking down production systems | Kill switch to avoid impacting users

	Figure 3: The debate between building and buying chaos engineering tools

	Start in a staging environment. Yes, you must eventually test in your production system, but it makes sense to start in a staging or development environment and work your way up. Start with a single host, container, or microservice in your test environment. Then try to crash several of them. Once you have covered 100% of your test environment, reset to the smallest possible scope in production, and take it from there.
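
The progression described above can be written down as an explicit plan before any experiment runs. The following is a minimal sketch. The doubling schedule is an assumption chosen for illustration (the text only says to start small and grow), and `blast_radius_steps` and `rollout_plan` are hypothetical names.

```python
def blast_radius_steps(total_hosts):
    """Return growing target counts: 1, 2, 4, ... up to total_hosts."""
    if total_hosts < 1:
        raise ValueError("need at least one host")
    steps = []
    n = 1
    while n < total_hosts:
        steps.append(n)
        n *= 2  # assumed growth rate; pick whatever your confidence supports
    steps.append(total_hosts)
    return steps


def rollout_plan(staging_hosts, production_hosts):
    """Cover all of staging first, then reset to one host in production."""
    plan = [("staging", n) for n in blast_radius_steps(staging_hosts)]
    plan += [("production", n) for n in blast_radius_steps(production_hosts)]
    return plan


if __name__ == "__main__":
    for environment, hosts in rollout_plan(staging_hosts=8, production_hosts=20):
        print(f"{environment}: target {hosts} host(s)")
```

Each step should be run only after the previous one has been understood, so in practice the plan is a checklist rather than a loop executed in one go.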


	Avoid the “drift into failure.” This concept, invented by flight accident investigation expert Sidney Dekker, refers to the fact that tension always exists in systems between efficiency and safety. Since businesses need to be mindful of costs, they tend to operate on the edge of safety. So once you understand a particular kind of failure and you've tested it, you want to automate the testing of it in a continuous deployment pipeline so you maintain that competence.

	Always have a kill switch. This is akin to an “undo” button or safety valve. Make sure you have a way to stop all chaos engineering experiments immediately, on the spot, and return all systems to their normal state. If your chaos engineering causes a high-severity incident (SEV), then track it carefully and do a full post-mortem analysis of what went wrong.
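
A kill switch can be as simple as a shared flag plus a record of every running experiment and how to undo it. The sketch below shows one possible shape, under stated assumptions: `ChaosExperiment` and `KillSwitch` are hypothetical classes, every experiment supplies its own rollback, and everything runs in a single process.

```python
import threading


class ChaosExperiment:
    """An injected failure paired with the action that undoes it."""

    def __init__(self, name, inject, rollback):
        self.name = name
        self.inject = inject      # callable that introduces the failure
        self.rollback = rollback  # callable that restores normal state
        self.active = False


class KillSwitch:
    """Tracks running experiments and can stop them all at once."""

    def __init__(self):
        self._halted = threading.Event()
        self._lock = threading.Lock()
        self._running = []

    def start(self, experiment):
        """Start an experiment unless the switch has already been thrown."""
        with self._lock:
            if self._halted.is_set():
                return False
            experiment.inject()
            experiment.active = True
            self._running.append(experiment)
            return True

    def halt_all(self):
        """Refuse new experiments and roll back every running one."""
        self._halted.set()
        with self._lock:
            while self._running:
                experiment = self._running.pop()
                experiment.rollback()
                experiment.active = False


if __name__ == "__main__":
    state = {"latency_ms": 0}
    switch = KillSwitch()
    delay = ChaosExperiment(
        "add latency",
        inject=lambda: state.update(latency_ms=200),
        rollback=lambda: state.update(latency_ms=0),
    )
    switch.start(delay)
    switch.halt_all()
    print(state, "restart allowed:", switch.start(delay))
```

The design choice worth noting is that an experiment cannot start without registering its rollback, so the switch never has to guess how to restore normal state.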

	Fix known problems first. Never conduct a chaos experiment in production if you already know that it will cause severe damage, possibly affecting customers and your reputation. Always try to fix known problems first.


	Conclusion: eschew complacency

	Systems will break. And as systems continue to grow in complexity, they will break more often. If you're not prepared for that, they'll break in unexpected ways at unexpected times and bring your software or service down. You’ll have unhappy customers and unhappy employees, and the costs are probably higher than you think.

	Too many businesses are complacent. They think that because they haven’t had a major system outage before, or at least not one that directly impacted customers, they’re safe. Or they think that the cost of deploying chaos engineering is more than the cost of simply fixing any problems that arise. They’re wrong. Companies that don’t address resiliency issues with chaos engineering may end up hiring tens or even hundreds of systems administrators just to maintain system uptime. That adds up.

	In today’s interconnected, internet-based world, no one is safe from system failure. The only way to keep it from impacting your customers, employees, partners, reputation, and bottom line is to address it proactively, up front. Chaos engineering is the optimal way to do this.
