Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013

Netflix Development Patterns for
Rapid Iteration, Scale, Performance, & Availability
Neil Hunt, Netflix
November 13, 2013

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Are You Designing Systems That Are:
• Web-scale
• Global
• Highly-available
• Consumer-facing

• Cloud Native

Cloud Native
• Service oriented architecture
• Redundancy
• Statelessness
• NoSQL
• Eventual consistency

Assumptions
[Diagram: quadrants on two axes, Scale (small to large) and Speed (slowly changing to rapid change)]
• Telcos (slowly changing, large scale): Hardware will fail
• Web-Scale (rapid change, large scale): Everything is broken
• Enterprise IT (slowly changing, small scale): Everything works
• Startups (rapid change, small scale): Software will fail

Netflix Cloud Goals:
Availability, Scale, Performance

Performance
• Reduce session start by 1s
– Save 1 human lifetime per day!
– Win more moments of truth
• Suggest choices 1% better
– 500k hours/day additional value delivered

Scale
• 50% y/y traffic growth
• 50 Countries, 3 continents
• Tens of thousands of instances at peak
• 4 AWS regions, 12 datacenters
• ~$.001 per start

Availability
• Aspire to four nines (99.99% of starts successful)
• Per Quarter:
– Downtime: < 3 mins (peak time)
– Successful starts: 9.999B
– Failures: 1M
→ frustration, calls, lost business
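
The per-quarter figures above follow from simple arithmetic. A minimal sketch, assuming roughly 10 billion session starts per quarter (the sum of the successful starts and failures on the slide); the function name `start_budget` is hypothetical:

```python
def start_budget(total_starts, availability):
    """Split session starts into (successful, failed) for a target availability."""
    failed = round(total_starts * (1.0 - availability))
    return total_starts - failed, failed


# Assumed volume: about 10 billion starts per quarter (9.999B + 1M from the slide).
successes, failures = start_budget(10_000_000_000, 0.9999)
print(successes, failures)
```

Even at four nines, the failure budget at this volume is about a million unhappy sessions per quarter.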

Availabilities Compound
N Service Dependencies    Each Dependency    Availability
2                         99.99%             .9998
10                        99.99%             .999
100                       99.99%             .99
1000                      99.99%             .9
…
N dependencies            99.99%             (99.99%)^N
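
The compounding shown above can be reproduced in a few lines. This is a sketch under two assumptions that are not stated on the slide: dependencies fail independently, and the system is up only when every dependency is up. The function name `compound_availability` is hypothetical.

```python
def compound_availability(per_dependency, n):
    """Availability of a chain of n required, independent dependencies."""
    return per_dependency ** n


for n in (2, 10, 100, 1000):
    # Each dependency is assumed to be 99.99% available.
    print(n, "%.4f" % compound_availability(0.9999, n))
```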

Availabilities Compound
To achieve 99.99% availability with 1000 components requires:
• 99.99999% availability for each dependency
or
• Isolation for independence

Component failure leads to system failure

Component failure leads to degradation rather than system failure
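
The per-dependency requirement for a chained system can be checked by inverting the product rule. A minimal sketch, again assuming independent failures and that every dependency is required; `required_per_dependency` is a hypothetical name:

```python
def required_per_dependency(target, n):
    """Availability each of n chained dependencies needs to hit a system target."""
    return target ** (1.0 / n)


per_dep = required_per_dependency(0.9999, 1000)
print("%.7f%%" % (per_dep * 100))
```

Isolation sidesteps this requirement: when a component failure only degrades the experience, that component no longer multiplies into the availability of the whole system.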

Availability, Scale, Performance
Are Not Enough!

Rapid Iteration – Rate of Change
• Running tests
• Rolling out tests
– Engineering the winning test experience for scale

• Adding features
• Scaling up
• Removing features, simplifying, minimizing

Testing
• Up to 1,000 changes per day!

Rate of Change
• Change leads to bugs
– New features
– New configurations
– New types of inputs
– Scaling up
• Availability is in tension with rate of change

Availability / Rate of Change Tradeoff
[Chart: Availability (99% to 99.999%) plotted against Rate of Change (1 to 1000), with a curve marking the frontier of availability/change]

Shifting the Curve…
[Chart: Availability (99% to 99.999%) plotted against Rate of Change (1 to 1000)]

Shifting the Curve
• Must break the chained dependencies that compound in cascading system failure
• Subsystem isolation:
– Failure in one component should never result in cascading system failure

Isolating Subsystems
Redundant systems with timeout & failover
• Failure of instance
• Failure of network
• Latency monkey to test
[Diagram: Dependent System calls redundant Dependence instances with a timeout]

Redundant systems with timeout & failover
• Failure of instance
• Failure of network
• Latency monkey to test
[Diagram: Higher Tier System (longer timeout) calls Dependent System (short timeout), which calls Dependence]
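
Timeout and failover across redundant instances might look like the following. This is an illustrative sketch, not Netflix's implementation: `call_with_failover` and the two replica functions are hypothetical, and each replica is modeled as a plain Python callable.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def call_with_failover(replicas, request, timeout_s):
    """Try each redundant replica in order; fail over on error or timeout."""
    last_error = None
    pool = ThreadPoolExecutor(max_workers=max(1, len(replicas)))
    try:
        for replica in replicas:
            future = pool.submit(replica, request)
            try:
                return future.result(timeout=timeout_s)
            except FutureTimeout:
                last_error = TimeoutError("replica exceeded %.2fs" % timeout_s)
            except Exception as exc:  # instance failure or software bug
                last_error = exc
    finally:
        pool.shutdown(wait=False)  # do not block on a hung replica
    raise RuntimeError("all replicas failed") from last_error


def slow_replica(request):  # hypothetical: simulates a latent instance
    time.sleep(0.2)
    return "slow:" + request


def healthy_replica(request):  # hypothetical: responds immediately
    return "ok:" + request


result = call_with_failover([slow_replica, healthy_replica], "play", timeout_s=0.05)
```

For the tiered case, the higher tier's timeout has to be longer than the worst case below it (here, the short timeout multiplied by the number of replicas), otherwise the caller gives up before failover can finish.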

Timeout with fallback default response
• Network failure
• Software bug
Default response: { status=mem, plan=4, device=true }
[Diagram: Dependent System calls Dependence with a timeout and falls back to the default response]
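
A fallback default can be sketched as a wrapper that returns a canned response whenever the dependency is slow or throws. The default values are taken from the slide; the wrapper name `call_with_default` and the sample dependencies are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Default response as shown on the slide.
DEFAULT_RESPONSE = {"status": "mem", "plan": 4, "device": True}


def call_with_default(dependency, request, timeout_s, default=DEFAULT_RESPONSE):
    """Call a dependency; on timeout or any error, return a copy of the default."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(dependency, request).result(timeout=timeout_s)
    except Exception:  # covers timeouts (network failure) and raised errors (bugs)
        return dict(default)
    finally:
        pool.shutdown(wait=False)


def buggy_dependency(request):  # hypothetical: simulates a software bug
    raise KeyError(request)


def slow_dependency(request):  # hypothetical: simulates a network stall
    time.sleep(0.2)
    return {"status": "late"}


print(call_with_default(buggy_dependency, "user-1", timeout_s=0.05))
```

The caller gets a usable answer either way, so a failure in the dependency degrades the experience instead of breaking it.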

Canary Push
• Network failure
• Software bug
[Diagram: Dependent System (with timeout) calls Dependence; one canary instance runs the new code]
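
One way to picture a canary push: send a small slice of traffic to a single instance running the new code, then compare its error rate with the baseline before rolling out further. The sketch below is an assumption-laden illustration; the function names, the 5% slice, and the thresholds are all hypothetical and not from the talk.

```python
def route(request_id, canary_percent):
    """Deterministically send canary_percent% of requests to the canary."""
    return "canary" if request_id % 100 < canary_percent else "baseline"


def canary_verdict(baseline, canary, tolerance=0.01, min_requests=100):
    """Compare (errors, total) pairs and decide what to do with the canary."""
    b_err, b_total = baseline
    c_err, c_total = canary
    if c_total < min_requests or b_total == 0:
        return "wait"  # not enough traffic to judge yet
    if c_err / c_total > b_err / b_total + tolerance:
        return "rollback"  # new code is measurably worse
    return "promote"


canary_hits = sum(route(i, 5) == "canary" for i in range(1000))
print(canary_hits, canary_verdict((10, 10000), (30, 500)))
```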

Red/Black deployment
• Software bugs
[Diagram: bad code pushed to Dependence V2.3; Dependent System fails back to old code on Dependence V2.2]
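
The red/black idea is that the old version stays deployed while the new one takes traffic, so failing back is just a routing switch. A minimal sketch with a hypothetical `RedBlackRouter` class; the version labels come from the slide, the handlers are made up for illustration.

```python
class RedBlackRouter:
    """Keeps the previous version warm so traffic can be switched back quickly."""

    def __init__(self, version, handler):
        self.versions = {version: handler}
        self.active = version
        self.previous = None

    def push(self, version, handler):
        """Deploy a new version alongside the old one and route traffic to it."""
        self.versions[version] = handler
        self.previous, self.active = self.active, version

    def fail_back(self):
        """Route traffic back to the previous version."""
        if self.previous is None:
            raise RuntimeError("no previous version to fail back to")
        self.active, self.previous = self.previous, None

    def handle(self, request):
        return self.versions[self.active](request)


def bad_v23(request):  # hypothetical: the bad code that was pushed
    raise RuntimeError("bug in V2.3")


router = RedBlackRouter("V2.2", lambda request: "V2.2:" + request)
router.push("V2.3", bad_v23)
router.fail_back()
print(router.handle("play"))
```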

Standby Blue system
• Independent implementation
• Simplified logic
[Diagram: Dependent System fails over from Dependence V2.3 to a static reference implementation]


Zone isolation
• Infrastructure failure (e.g. power outage)
• Chaos Gorilla
[Diagram: Load Balancer in front of Zone A and Zone B, each with its own Dependent System and Dependence]
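
Zone isolation depends on the load balancer routing around a zone that has gone dark. The toy balancer below illustrates that behavior; the class name, instance names, and round-robin policy are assumptions for illustration, not a description of the actual load balancer.

```python
class ZoneAwareBalancer:
    """Round-robins across instances, skipping any zone marked as down."""

    def __init__(self, zones):
        self.zones = {zone: list(instances) for zone, instances in zones.items()}
        self.down = set()
        self._counter = 0

    def mark_zone_down(self, zone):
        self.down.add(zone)

    def mark_zone_up(self, zone):
        self.down.discard(zone)

    def pick(self):
        healthy = [
            instance
            for zone, instances in self.zones.items()
            if zone not in self.down
            for instance in instances
        ]
        if not healthy:
            raise RuntimeError("no healthy zones")
        instance = healthy[self._counter % len(healthy)]
        self._counter += 1
        return instance


lb = ZoneAwareBalancer({"Zone A": ["a1", "a2"], "Zone B": ["b1", "b2"]})
lb.mark_zone_down("Zone A")  # what a zone-level failure test would simulate
picks = [lb.pick() for _ in range(4)]
print(picks)
```

Each zone needs enough spare capacity to absorb the other's traffic, which is what taking a whole zone out in a test is meant to verify.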

Region isolation
• Infrastructure software bugs (e.g. load balancer fail)
• Chaos Kong
[Diagram: DNS in front of Region E and Region W; each region has its own Load Balancer with Zone A and Zone B, and each zone has its own Dependent System and Dependence]

Dependency Mode                      Isolating Technique
Instance failure, Network failure    Redundant systems with failover and timeout; Timeout with default response
Network failure, Software bug        Canary push; Red-black deployment; Blue systems
Infrastructure failure               Zone isolation
Cross-zone software bugs             Region isolation

Trying Harder Won’t Cut It
• Trying harder gets a linear return on an exponential problem
• Need to be great at execution AND have the right architecture
• What architectural features are you using to ensure availability, scale, performance, & rapid rate of change?

