Embracing Failure - Fault Injection and Service Resilience at Netflix

Embracing Failure
Fault Injection and Service Resilience at Netflix
Josh Evans – Director of Operations Engineering
Naresh Gopalani – Software Engineer and Architect
© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Netflix Ecosystem
• ~50 million members, ~50 countries
• > 1 billion hours per month
• > 1000 device types
• 3 AWS Regions, hundreds of services
• Hundreds of thousands of requests/second
• CDN serves petabytes of data at terabits/second
[Diagram labels: Static Content, Akamai, Netflix CDN, Service Partners, AWS/Netflix Control Plane, Internet]

Availability means that members can
● sign up
● activate a device
● browse
● watch

Failures can happen any time
• Disks fail
• Power outages
• Natural disasters
• Software bugs
• Human error

We design for failure
• Exception handling
• Fault tolerance and isolation
• Fall-backs and degraded experiences
• Auto-scaling clusters
• Redundancy

Testing for failure is hard
• Web-scale traffic
• Massive, changing data sets
• Complex interactions and request patterns
• Asynchronous, concurrent requests
• Complete and partial failure modes
• Constant innovation and change

What if we regularly inject failures into our systems under controlled circumstances?


Blast Radius
• Unit of isolation
• Scope of an outage
• Scope of a chaos exercise
Instance
Zone
Region
Global

An Instance Fails
Edge Cluster
Cluster A
Cluster B
Cluster C
Cluster D

Chaos Monkey
• Monkey loose in your DC
• Run during business hours
• What we learned
– Auto-replacement works
– State is problematic
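
The selection step of a Chaos Monkey style exercise can be sketched as follows. This is a minimal illustration, not the Simian Army implementation: the function names, the 9:00 to 15:00 weekday window, and the rule that single-instance clusters are skipped are all assumptions made for the example.

```python
import random
from datetime import datetime


def in_business_hours(now: datetime, start_hour: int = 9, end_hour: int = 15) -> bool:
    """True on weekdays between start_hour (inclusive) and end_hour (exclusive).

    The window itself is an assumption; the slide only says "business hours".
    """
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def pick_victims(clusters: dict, now: datetime, rng: random.Random) -> dict:
    """Pick at most one instance per cluster to terminate.

    Returns an empty dict outside business hours. Clusters with a single
    instance are skipped (an assumption) so the exercise exercises
    auto-replacement rather than causing a full cluster outage.
    """
    if not in_business_hours(now):
        return {}
    return {
        name: rng.choice(instances)
        for name, instances in clusters.items()
        if len(instances) > 1
    }


if __name__ == "__main__":
    clusters = {"Cluster A": ["i-1", "i-2", "i-3"], "Cluster B": ["i-4"]}
    print(pick_victims(clusters, datetime(2014, 11, 12, 14, 0), random.Random()))
```

Passing the clock and the random source in as arguments keeps the selection logic testable without terminating anything.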

A State of Xen - Chaos Monkey & Cassandra
Out of our 2700+ Cassandra nodes
• 218 rebooted
• 22 did not reboot successfully
• Automation replaced failed nodes
• 0 downtime due to reboot

An Availability Zone Fails
EU-West
US-West US-East
AZ1
AZ2

Chaos Gorilla
Simulate an availability zone outage
• 3-zone configuration
• Eliminate one zone
• Ensure that others can handle the load and nothing breaks
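
The "others can handle the load" check can be worked through with simple arithmetic. The sketch below assumes the failed zone's load is spread evenly over the surviving zones, which is a simplification; the function names and the capacity figures are hypothetical.

```python
def load_after_zone_loss(zone_load: dict, failed_zone: str) -> dict:
    """Spread the failed zone's load evenly across the surviving zones."""
    survivors = [zone for zone in zone_load if zone != failed_zone]
    if not survivors:
        raise ValueError("no surviving zones")
    extra = zone_load[failed_zone] / len(survivors)
    return {zone: zone_load[zone] + extra for zone in survivors}


def survives_zone_loss(zone_load: dict, zone_capacity: dict, failed_zone: str) -> bool:
    """True if every surviving zone stays within its capacity."""
    new_load = load_after_zone_loss(zone_load, failed_zone)
    return all(new_load[zone] <= zone_capacity[zone] for zone in new_load)


if __name__ == "__main__":
    load = {"AZ1": 100.0, "AZ2": 100.0, "AZ3": 100.0}
    capacity = {"AZ1": 160.0, "AZ2": 160.0, "AZ3": 160.0}
    # With three equally loaded zones, each survivor carries 1.5x its normal load.
    print(load_after_zone_loss(load, "AZ1"))
    print(survives_zone_loss(load, capacity, "AZ1"))
```

In a 3-zone configuration this means each zone needs roughly 50% headroom, either provisioned up front or reachable through auto-scaling before the shifted traffic arrives.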

Challenges
• Rapidly shifting traffic
– LBs must expire connections quickly
– Lingering connections to caches must be addressed
• Service configuration
– Not all clusters auto-scaled or pinned
– Services not configured for cross-zone calls
– Mismatched timeouts – fallbacks prevented fail-over

A Region Fails
US-West US-East EU-West

Chaos Kong
[Diagram labels, per region: Regional Load Balancers; Zuul – Traffic Shaping/Routing; AZ1, AZ2, AZ3; Data. Other labels: Geo-located, Customer Device]

Challenges
● Rapidly shifting traffic
○ Auto-scaling configuration
○ Static configuration/pinning
○ Instance start time
○ Cache fill time

Challenges
● Service Configuration
○ Timeout configurations
○ Fallbacks fail or don’t provide the desired experience
● No minimal (critical) stack
○ Any service may be critical!

A Service Fails
Service
Zone
Region
Global

Services Slow Down and Fail
Latency Monkey
Simulate latent/failed service calls
• Inject arbitrary latency and errors at the service level
• Observe the effects
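
Injecting latency and errors around a service call can be sketched with a wrapper. This is an illustrative decorator, not Latency Monkey itself; `inject_faults`, `InjectedFault`, and the parameters are hypothetical names chosen for the example.

```python
import random
import time
from functools import wraps


class InjectedFault(Exception):
    """Raised in place of a real call when an error is injected."""


def inject_faults(latency_s=0.0, error_rate=0.0, rng=None, sleep=time.sleep):
    """Wrap a callable so each call is delayed and may fail.

    latency_s:  seconds of delay added before every call
    error_rate: probability in [0, 1] that the call raises InjectedFault
    rng, sleep: injectable for deterministic tests
    """
    rng = rng or random.Random()

    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                sleep(latency_s)
            if rng.random() < error_rate:
                raise InjectedFault(f"injected failure in {fn.__name__}")
            return fn(*args, **kwargs)

        return wrapper

    return decorator


if __name__ == "__main__":
    @inject_faults(latency_s=0.05, error_rate=0.5)
    def get_recommendations(member_id):
        return ["title-1", "title-2"]

    try:
        print(get_recommendations("m-1"))
    except InjectedFault as exc:
        print("fallback path:", exc)
```

The caller's timeout and fallback behavior is what is being observed; the wrapped function itself is unchanged.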

Latency Monkey
[Diagram labels: Device, Internet, ELB, Zuul, Edge, Service A, Service B, Service C]

Challenges
• Startup resiliency is an issue
• Service owners don’t know all dependencies
• Fallbacks can fail too
• Second-order effects not easily tested
• Dependencies are in constant flux
• Latency Monkey tests function and scale
– Not a staged approach
– Lots of opt-outs

Distributed Systems Fail
● Complex interactions at scale
● Variability across services
● Byzantine failures
● Combinatorial complexity

Any service can cause cascading failures
ELB

Fault Injection Testing (FIT)
● Request-level simulations
● Device or Account Override
[Diagram labels: Device, Internet, ELB, Zuul, Edge, Service A, Service B, Service C]

Failure Injection Points
● IPC
● Cassandra Client
● Memcached Client
● Service Container
● Fault Tolerance

FIT Details
● Common Simulation Syntax
● Single Simulation Interface
● Transported via HTTP request header

Integrating Failure
[sendRequestHeader] >>fit.failure: 1|fit.Serializer|2|[[{"name":"failSocial","whitelist":false,"injectionPoints":["SocialService"]},{}]],{"Id":"252c403b-7e34-4c0b-a28a-3606fcc38768"}]]
[Diagram labels: request, response, Service A, Service B, Filter, Service, Ribbon, ServerRcv, ClientSend]
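
The idea of carrying a failure scenario in a request header and checking it at each injection point can be sketched as below. Only the header name `fit.failure` and the field names `name` and `injectionPoints` come from the slide; the payload here is plain JSON rather than the serializer format shown above, and every function and class name is hypothetical.

```python
import json
from typing import Optional

FIT_HEADER = "fit.failure"  # header name as shown on the slide


class SimulatedFailure(Exception):
    """Raised when the current request's scenario targets an injection point."""


def scenario_from_headers(headers: dict) -> Optional[dict]:
    """Decode the scenario from the request headers, if one is present.

    Assumption: the payload is plain JSON, not the real FIT wire format.
    """
    raw = headers.get(FIT_HEADER)
    return json.loads(raw) if raw else None


def check_injection_point(headers: dict, point: str) -> None:
    """Called where a dependency is about to be used; fails if targeted."""
    scenario = scenario_from_headers(headers)
    if scenario and point in scenario.get("injectionPoints", []):
        raise SimulatedFailure(f"{scenario.get('name')}: {point}")


def outbound_headers(inbound: dict) -> dict:
    """Copy the scenario onto downstream calls so it follows the request."""
    return {FIT_HEADER: inbound[FIT_HEADER]} if FIT_HEADER in inbound else {}


if __name__ == "__main__":
    payload = json.dumps({"name": "failSocial", "injectionPoints": ["SocialService"]})
    request = {FIT_HEADER: payload}
    try:
        check_injection_point(request, "SocialService")
    except SimulatedFailure as exc:
        print("simulated:", exc)
```

Because the scenario travels with the request, only requests that carry the header are affected; other traffic through the same services is untouched.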

Failure Scenarios
● Set of injection points to fail
● Defined based on
○ Past outages
○ Specific dependency interactions
○ Whitelist of a set of critical services
○ Dynamic tracing of dependencies
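
One way to turn a whitelist of critical services into a scenario is to fail every dependency that is not on the list. The sketch below assumes that reading of the whitelist flag (listed services are protected when it is set, targeted when it is not); that interpretation, the function name, and the service names are assumptions for illustration.

```python
def injection_points(dependencies, listed, whitelist):
    """Return the sorted injection points a scenario should fail.

    whitelist=True:  `listed` services are protected; fail all other dependencies.
    whitelist=False: fail only the `listed` services that are actual dependencies.
    """
    dependencies = set(dependencies)
    listed = set(listed)
    targets = dependencies - listed if whitelist else dependencies & listed
    return sorted(targets)


if __name__ == "__main__":
    deps = ["Playback", "Ratings", "SocialService", "Recommendations"]
    # Fail only the social dependency.
    print(injection_points(deps, ["SocialService"], whitelist=False))
    # Keep only the critical service alive and fail everything else.
    print(injection_points(deps, ["Playback"], whitelist=True))
```

The dependency set would in practice come from tracing rather than a hand-written list, since dependencies change constantly.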

FIT Insights: Salp
● Distributed tracing inspired by the Dapper paper
● Provides insight into dependencies
● Helps define & visualize scenarios

Dialing Up Failure
Functional Validation
● Isolated synthetic transactions
○ Set of devices
Validation at Scale
● Dial up customer traffic (percentage-based)
● Simulation of full service failure
Chaos!
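
Moving from a fixed set of devices to a percentage of customer traffic can be sketched with stable hashing, so the same account stays in or out as the dial moves. This is one common approach and not necessarily how FIT selects traffic; the function names and the explicit override set are assumptions.

```python
import hashlib


def bucket(account_id: str) -> int:
    """Map an account to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(account_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_experiment(account_id: str, percent: int, overrides=frozenset()) -> bool:
    """True if this account's requests should carry the failure scenario.

    overrides: accounts or devices always included (functional validation)
    percent:   share of remaining accounts included (validation at scale)
    """
    if account_id in overrides:
        return True
    return bucket(account_id) < percent


if __name__ == "__main__":
    accounts = [f"acct-{i}" for i in range(1000)]
    for percent in (0, 1, 10, 100):
        count = sum(in_experiment(a, percent) for a in accounts)
        print(f"{percent:>3}% dial -> {count} of {len(accounts)} accounts")
```

Because an account's bucket never changes, raising the dial only adds accounts; anyone included at 10% is still included at 50%.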

Continuous Validation
● Critical Services
● Non-critical Services
● Synthetic Transactions

Take-aways
• Don’t wait for random failures
– Cause failure to validate resiliency
– Remove uncertainty by forcing failures regularly
– Better to fail at 2pm than 2am
• Test design assumptions by stressing them
Embrace Failure

The Simian Army is part of the Netflix open source cloud platform
http://netflix.github.com

Netflix talks at re:Invent
Talk     Time               Title
BDT-403  Wednesday, 2:15pm  Next Generation Big Data Platform at Netflix
PFC-306  Wednesday, 3:30pm  Performance Tuning EC2
DEV-309  Wednesday, 3:30pm  From Asgard to Zuul, How Netflix’s proven Open Source Tools can accelerate and scale your services
ARC-317  Wednesday, 4:30pm  Maintaining a Resilient Front-Door at Massive Scale
PFC-304  Wednesday, 4:30pm  Effective Inter-process Communications in the Cloud: The Pros and Cons of Micro Services Architectures
ENT-209  Wednesday, 4:30pm  Cloud Migration, Dev-Ops and Distributed Systems
APP-310  Friday, 9:00am     Scheduling using Apache Mesos in the Cloud

Please give us your feedback on this presentation
Josh Evans
jevans@netflix.com
@josh_evans_nflx
Naresh Gopalani
ngopalani@netflix.com
Join the conversation on Twitter with #reinvent
