Scaling your Kafka streaming pipeline can be a pain - but it doesn’t have to be!! with Opher Dubrovsky and Ido Nadler | Kafka Summit London 2022

Image by kimura2 from Pixabay
Scaling your Kafka
streaming pipeline can be a pain…
But it doesn’t have to be !!

Data is Like Cheese
If you wait too long, it SPOILS !

Expiration in a Data Pipeline
Consumer
New
messages
expiration
Lost data:
Expiration TTL (time to live): 24 hours

Slow Recovery == Ticking Timebomb

This is a true story...
The events described in this talk took place at
Nielsen Marketing Cloud in 2020

I Get an Emergency Call from My Boss
Check Slack !!

September 24, 2020 - Consequences
1 day of downtime
Data loss
5 working days to resolve
Unpleasant customer calls
$3,000 damages

In this talk…
o Kafka’s strengths and weaknesses
o Scaling your pipe up/down
o What could go wrong ☹
o Insights and tactics

Who Are We ?
Opher Dubrovsky
Director Big Data Engineering
Ido Nadler
Big Data Team Lead
Data Pipelines, Kafka, Serverless, Spark

Marketing Cloud
Build marketing audiences and device graphs
Data is used for
› Running campaigns
› Business decisions

Audiences and IDs
1. 639A714C-14AE-4571-AA1C-A1A852AC604A
2. 953FB9CC-E520-471B-8BBF-C652717FFE9C
3. 384A7C4D-4E5B-4C48-A751-226597E71B61
4. C3A38789-8DF7-4400-AF45-9080BB3DC2C3
5. 146137AB-60F4-41E1-A3C6-CA8A54D9BFA0
6. 2E7B1B28-DF19-4399-8692-B438AA7A5C43
7. 53397B0D-339F-4CF1-9B0D-D3711CF770A7
8. B729F5C7-F89D-4FE0-BC90-A68E61FFBC14
9. 32E89809-3A66-4F32-948C-CEF13CF58896
10. A5632598-22DD-4E03-9EC0-B384CA04FCD6
11. 88F156E8-35B9-48B2-95C4-2FC1D57BA55B
12. 99BBBF5B-EB00-4570-B474-17FAE14F2B28
13. 9BDCEE3C-09D0-4F79-A0E7-B6CDD067111D
14. DA43B1D0-F801-4D55-A930-5BE3C5402FF8
15. 2B59828E-566E-4914-BFEA-F10905D9DFAB
16. 32DE0E38-BA79-48C0-80E7-EE141F3C238C
Pizza Running Cycling
1. 639A714C-14AE-4571-AA1C-A1A852AC604A
2. 79F5D0FD-ED5E-4865-8C6C-96679D1D3E2E
3. 55BBD6A6-8BF2-4663-A596-FF81C506E9AC
4. FF3019E5-79D5-45AB-B79B-42D4CED37B19
5. C8375767-8188-483C-A112-3A4B0299D9BF
6. 4CE55A6E-B1E1-4038-B8A0-D80AF43A048D
7. F9C7C04F-F847-4DB7-803F-A1E44C14951D
1. 639A714C-14AE-4571-AA1C-A1A852AC604A
2. 694EB0DF-0F00-4770-813C-01E763F9B4A3
3. FC4E2648-85E6-46B9-8FDF-19C6F00909B3
4. 11E03256-D51D-48D7-953E-8D13FBC2EB4D
5. 06748B57-E716-436E-B57F-A6B6CDC40BB4
6. 014C0E92-0952-4918-8BB8-5E80AB58E6E9
7. 9B451E97-DC97-4222-8BEC-2B0CEFE993BB
8. 799CCDBD-F051-4598-AFBE-B09EF684DD82
9. 689A1F9D-B6DE-4967-9572-50439A5C6AF0
10. EB73033C-050E-4197-AC11-0A32945AEDFA
11. 22EB7822-CFDA-435A-AAF8-7FA73C87B7AC
12. E63D4FC7-B498-43D6-955C-3693197F0CCF
9m 12m
24m
24m
Pizza
12m
Cycling

About Nielsen Marketing Cloud
Cloud native
~6,000 nodes/day
Ingress ~60 TB/day
5 PetaByte
~6 million/day

Our Kafka Scale
Events
• 25 billion / day
• 300 K / sec
• Peak 5 million / sec
Size
• 200 TB / day
• 2.5 GB / sec
• Peak 5.3 GB / sec

Our Kafka in a Nutshell
Data Split over
• 7 Clusters
• 40 Topics
• 9000 Partitions
• 60 Brokers (instances)
Total Cost
• Monthly $45k
• Yearly $540k

Kafka in a nutshell
Producers Consumers
Publish-Subscribe messaging system

Topic A
Topic B
Data Structure – Topics and Partitions

Why we Love Kafka ?
Optimized for high-throughput
Data retention
Multiple consumers
Acts as a buffer between different systems
Allows reprocessing

Our Old System
Data Lake
AWS S3
200 TB/Day
Reading Data In Stream – 6min Windows

Reality
Check
Photo by Olia Nayda on Unsplash
Just because nobody complains
doesn't mean all parachutes are perfect…

Issue 1 – Glass Ceiling
Topic Consumers
Min Consumers = 1
Topic Consumers
Max Consumers = 3
Consuming Throughput is Limited by Partitions !
Max Throughput = (# Partitions) X (Consumer Throughput)
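
The ceiling formula above can be made concrete with a short calculation. This is a minimal sketch: the function name and the per-consumer rate of 100 (in any throughput unit) are illustrative assumptions, not figures from the talk. It relies only on the rule that, within one consumer group, a partition is read by at most one consumer.

```python
def max_consume_throughput(partitions: int, consumers: int,
                           per_consumer_rate: float) -> float:
    """Upper bound on consuming throughput for a single consumer group.

    Each partition is read by at most one consumer in the group, so any
    consumers beyond the partition count sit idle and add nothing.
    """
    if partitions < 1 or consumers < 1:
        raise ValueError("partitions and consumers must be >= 1")
    active_consumers = min(partitions, consumers)
    return active_consumers * per_consumer_rate


# Hypothetical topic with 3 partitions and a per-consumer rate of 100.
ceiling_with_3 = max_consume_throughput(3, 3, 100.0)
ceiling_with_10 = max_consume_throughput(3, 10, 100.0)  # 7 consumers idle
```

Adding consumers past the partition count leaves the ceiling unchanged, which is the glass ceiling this slide describes.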

Issue 2 – Wasted Resources
Traffic Moodiness
[Chart: traffic in GB / Hour (300 to 1,200) by time of day, 0:00 to 22:00, repeated over three days]

Issue 3 – Skew
Processing time
Partition Data
1 hour 2 hours 3 hours
Consumers
Processing
complete
100%
25%
50%
15%
50%

Issue 4 - Slow Recovery
System Restored
Downtime
start
Recovery
Process
Retention
Timeout
Bomb Exploded. Data Lost !!!
Recovery
completed
Lost
Data

Recovery Time
~7 hour recovery
~4 hour outage

Why is Recovery Slow ?
Topic
Consumers
During steady state

After an Outage
Topic
Consumers
Consumers are limited
After Outage
Slow Recovery !!!
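
One way to see why recovery is slow when consumers are capped: after an outage the consumers must drain the backlog while new data keeps arriving, so only the spare capacity above the ingest rate goes toward catching up. The sketch below is a simple constant-rate model, not the speakers' own calculation. The rates 7 and 11 are hypothetical values chosen so that the model reproduces the ~4 hour outage and ~7 hour recovery quoted in the talk.

```python
def recovery_hours(outage_hours: float, ingest_rate: float,
                   consume_rate: float) -> float:
    """Hours needed to drain the backlog built up during an outage.

    Assumes constant ingest and consume rates (same unit, per hour).
    Returns infinity when the consumers cannot outpace the ingest.
    """
    if consume_rate <= ingest_rate:
        return float("inf")
    backlog = outage_hours * ingest_rate      # data queued during the outage
    spare_capacity = consume_rate - ingest_rate
    return backlog / spare_capacity


# Hypothetical rates: ingest 7 units/hour, capped consumers 11 units/hour.
hours = recovery_hours(outage_hours=4, ingest_rate=7.0, consume_rate=11.0)
```

Under this model a 4 hour outage takes 7 hours to clear when the consumers have only about 57% headroom over the ingest rate, and raising the consume rate is the only lever that shortens it.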

Common sense solutions:
Scaling the Pipeline
No Choice But to Add Partitions !
○ Add more consumer machines → requires partitions
○ Move to larger machines

Adding More Partitions
Max Throughput = (# Partitions) x (Consumer Throughput)
BUT
1. Implications for Kafka cluster performance
2. Small data files
3. Hard to scale down partitions
More Partitions More Consumers

The Cost of Slow Recovery → Storage
Assume we need 2 days data retention
Reprocessing 2 days takes 3 days
We will need 5 days of data retention.
2.5X the cluster cost !!
$810,000 /y
Faster Burst Processing → Save $$$
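
The retention arithmetic on this slide can be written out as a small calculation. This is a sketch under two assumptions that the slide does not state explicitly: cluster cost scales linearly with retention days, and the $810,000 /y figure is the added cost on top of the $540k yearly cost quoted earlier in the talk.

```python
def required_retention_days(needed_days: float, reprocess_days: float) -> float:
    """Data must survive for the needed window plus the time spent reprocessing it."""
    return needed_days + reprocess_days


def cost_multiplier(needed_days: float, reprocess_days: float) -> float:
    """Cluster cost relative to the baseline, assuming cost is linear in retention."""
    return required_retention_days(needed_days, reprocess_days) / needed_days


def extra_yearly_cost(baseline_yearly: float, needed_days: float,
                      reprocess_days: float) -> float:
    """Added yearly cost over the baseline under the same linear assumption."""
    return baseline_yearly * (cost_multiplier(needed_days, reprocess_days) - 1)


days = required_retention_days(2, 3)        # 5 days of retention
multiplier = cost_multiplier(2, 3)          # 2.5x the cluster cost
extra = extra_yearly_cost(540_000, 2, 3)    # 810,000 per year
```

Shortening the reprocessing time is what shrinks the multiplier, which is the point of "Faster Burst Processing".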

Consumer Low Hours
[Chart: traffic in GB / Hour (300 to 1,200) by time of day, 0:00 to 22:00, repeated over three days]
Cluster is Idle
Wasted $$$

Data Bursts (& reprocessing)
Cluster is
underpowered
Cluster
Capacity
Long Queues !

Efficiency
Cluster
Capacity
Wasted
Processing
Efficiency is 30% ONLY !!

Rearchitecting
Rearchitecting the stream to batch……

Project Goals
1. Auto scaling
2. Quick burst processing
3. Cost savings

Topic
Will Not Work !
Common Sense Option
Consumers

Looking for a solution
Break the 1-1 relationship !
Topic Consumers
Scale

The Main Concept
Scaling Up / Down

Manage & Control Consumers’ Offsets
Offsets[1000, 10000]
Consumers
Partition

Scale Up
Offsets[1000, 5500]
Consumers
Partition
Offsets[5501, 10000]
Scale Up

Scale Up
Offsets[1000, 4000]
Consumers
Partition
Scale Up
Offsets[4001, 7000]
Offsets[7001, 10000]

Scale Down – Next Batch
Consumers
Partition
Offsets[1000, 4000]
Offsets[4001, 7000]
Offsets[7001, 10000]
new
data Offsets[10001, 12000]
Scale Down
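
The offset hand-out shown on these slides can be sketched as a function that cuts one partition's offset range into contiguous, non-overlapping slices, one per consumer. The function name and the rule of giving any remainder to the first slices are assumptions made for illustration. The talk does not show how the split is computed, only the resulting ranges.

```python
def split_offsets(start: int, end: int, consumers: int) -> list[tuple[int, int]]:
    """Split the inclusive offset range [start, end] into contiguous slices.

    Slices differ in size by at most one offset. If there are more
    consumers than offsets, fewer slices than consumers are returned.
    """
    if consumers < 1 or end < start:
        raise ValueError("need at least one consumer and a non-empty range")
    total = end - start + 1
    base, extra = divmod(total, consumers)
    slices = []
    low = start
    for index in range(consumers):
        size = base + (1 if index < extra else 0)
        if size == 0:
            break
        slices.append((low, low + size - 1))
        low += size
    return slices


one = split_offsets(1000, 10000, 1)          # [(1000, 10000)]
two = split_offsets(1000, 10000, 2)          # [(1000, 5500), (5501, 10000)]
three = split_offsets(1000, 10000, 3)        # [(1000, 4000), (4001, 7000), (7001, 10000)]
next_batch = split_offsets(10001, 12000, 1)  # scale down: one consumer for the new data
```

Because every batch gets its own split, the number of consumers can change from one batch to the next without touching the partition count.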

1st Solution
Scale out
using isolated clusters

Using Discrete Time Slots
time
4:00 5:00 6:00 7:00
3:00 8:00 9:00
Task
Task
Task

Benefits
● Consumer independence
○ No need to synchronize offsets
● Cost savings
○ Work at max throughput
○ Terminate when done

Task (& Cluster) Independence
Processing
Hour
4:00
5:00
6:00
8:00
3:00
7:00
9:00
1 hour 2 hours 3 hours
Some tasks are short
No more paying for
idle time !
Some tasks >1 hour
Next task

Pay as You Go
time
Hour
4:00
5:00
6:00
8:00
3:00
7:00
9:00
3:00 4:00 5:00
Savings
6:00 7:00

Cluster Efficiency and Cost
Warm-up
~0% load
Processing
~90% load
Processing time
Cluster
termination
Total efficiency ~75%
Improve warm-up time → higher efficiency !!
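
The efficiency figure can be read as a time-weighted average of the load in each phase. The sketch below assumes that reading. The 10 minute warm-up and 50 minute processing durations are hypothetical values picked so that the ~0% and ~90% loads from the slide yield the ~75% total; the talk does not give the actual durations.

```python
def cluster_efficiency(warmup_minutes: float, processing_minutes: float,
                       processing_load: float = 0.9,
                       warmup_load: float = 0.0) -> float:
    """Time-weighted average load over the cluster's lifetime."""
    total_minutes = warmup_minutes + processing_minutes
    if total_minutes <= 0:
        raise ValueError("cluster lifetime must be positive")
    useful = warmup_minutes * warmup_load + processing_minutes * processing_load
    return useful / total_minutes


baseline = cluster_efficiency(warmup_minutes=10, processing_minutes=50)
faster_warmup = cluster_efficiency(warmup_minutes=5, processing_minutes=50)
```

With these assumed durations the baseline comes out at 0.75, and halving the warm-up lifts it to roughly 0.82.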

Comparing to Previous System
Cluster
Capacity
Wasted
Processing
Efficiency was 30% ONLY !!
New System is 60% Cheaper !

Single Flow View
Airflow AWS S3
{ period: 05:00 - 06:00 }
Terminates on
completion
Leverage EMR
autoscaling to terminate
idle instances

Multi Flow View
Airflow AWS S3
{ period: 05:00 - 06:00 }
{ period: 06:00 - 07:00 }
{ period: 07:00 - 08:00 }
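
The per-hour payloads in this view can be generated mechanically from a start and end time. The sketch below is plain Python and does not use the Airflow API. The function name and the dictionary shape, which mirrors the `{ period: ... }` labels on the slide, are assumptions for illustration.

```python
from datetime import datetime, timedelta


def hourly_periods(start: datetime, end: datetime) -> list[dict[str, str]]:
    """Return one payload per full hour between start and end.

    Each payload describes an independent task, so the tasks can run
    on separate clusters in any order or at the same time.
    """
    one_hour = timedelta(hours=1)
    periods = []
    cursor = start
    while cursor + one_hour <= end:
        upper = cursor + one_hour
        periods.append({"period": f"{cursor:%H:%M} - {upper:%H:%M}"})
        cursor = upper
    return periods


# The date is arbitrary; only the hours matter for the labels.
payloads = hourly_periods(datetime(2020, 9, 24, 5), datetime(2020, 9, 24, 8))
```

For the three hours shown on the slide this produces the payloads for 05:00 - 06:00, 06:00 - 07:00 and 07:00 - 08:00.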

Single Process – Many Pipelines !
Airflow DAGs

From Micro-Batch to Batch
1 hour
1 hour
1 hour 1 hour 1 hour

60% Drop in
Costs
Cost Drop
BEFORE AFTER

Pay as You Go
Bytes per Hour
$ per Hour
$400k /year → $160k /year

Outage Handling
Recovery start
7 hours Recovery
completed
Old System
New System
Recovery start Recovery
completed
40 min
5 Concurrent
isolated workers
~85% Faster Recovery

Another Burst Example
~28 hour outage
~1.6 hours
recovery

Cluster Utilization
Clusters are Highly Utilized !

2nd Solution
Scale up
within the cluster

Can We Use Multiple Consumers?
Consumer
Partition
Consumer group
NO !

Scale Up – Hand Out the Offsets
Offsets[1000, 4000]
Consumers
Partition
Scale Up
Offsets[4001, 7000]
Offsets[7001, 10000]
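
To illustrate why handing out offsets sidesteps the consumer-group limit, the sketch below simulates one partition as an in-memory list and lets three workers read their assigned ranges concurrently. Everything here is a stand-in: no Kafka client is used, and the helper names are hypothetical. With a real client, each worker would presumably assign the partition explicitly and seek to its start offset instead of joining a consumer group.

```python
from concurrent.futures import ThreadPoolExecutor


def read_range(partition: list[str], first_offset: int,
               low: int, high: int) -> list[str]:
    """Read the messages with offsets low..high (inclusive), then stop."""
    return [partition[offset - first_offset] for offset in range(low, high + 1)]


def consume_in_parallel(partition: list[str], first_offset: int,
                        ranges: list[tuple[int, int]]) -> list[str]:
    """Run one bounded reader per offset range and join results in range order."""
    with ThreadPoolExecutor(max_workers=len(ranges)) as pool:
        chunks = pool.map(
            lambda bounds: read_range(partition, first_offset, *bounds), ranges
        )
        return [message for chunk in chunks for message in chunk]


# Simulated partition holding offsets 1000..10000, as on the slide.
partition = [f"msg-{offset}" for offset in range(1000, 10001)]
ranges = [(1000, 4000), (4001, 7000), (7001, 10000)]
messages = consume_in_parallel(partition, 1000, ranges)
```

Since the ranges are disjoint and cover the whole batch, every message is read exactly once and the workers never need to coordinate with each other.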

Process Time Single Partition
[Chart: Processing time / Consumers. Seconds (0 to 400) vs. # of consumers (1 to 8)]
Doubling of cores ≈ ½ process time

Logarithmic Scale
[Chart: log2(processing time) vs. log2(# of consumers), with fitted line y = -0.93x + 18.42]
Process time ≈ k / (# cores)
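
The fitted line y = -0.93x + 18.42 can be turned back into a prediction on the linear scale. The sketch below assumes both axes are base-2 logarithms, as the axis labels suggest. The time unit of the chart is not stated on the slide, so the function returns values in that unstated unit. A slope of exactly -1 would correspond to process time = k / consumers, so -0.93 is close to, but slightly short of, perfect scaling.

```python
import math

SLOPE = -0.93      # slope of the fitted line on the slide
INTERCEPT = 18.42  # intercept of the fitted line on the slide


def predicted_time(consumers: int) -> float:
    """Processing time implied by the fit, in the chart's unstated time unit."""
    if consumers < 1:
        raise ValueError("need at least one consumer")
    return 2 ** (INTERCEPT + SLOPE * math.log2(consumers))


def doubling_ratio() -> float:
    """Factor by which the predicted time changes when consumers double."""
    return 2 ** SLOPE


ratio = doubling_ratio()  # about 0.52, close to the ideal 0.5
```

Under this fit, doubling the consumers cuts the predicted time to roughly 52% of its previous value, in line with the "doubling of cores ≈ ½ process time" observation.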

Goal - Fully Scalable Consumers
time
4:00 5:00 6:00 7:00
3:00 8:00 9:00

Recap - What we’ve talked about
THE COST OF NOT SCALING UP / DOWN
KAFKA PIPELINE SCALING DIFFICULTIES
OPTIMIZING A KAFKA PIPELINE
ARCHITECTURE INSIGHTS

Recap - What we’ve talked about
o Streaming is expensive
o Fixed clusters are never perfect
o Off hours - $$$
o High loads - Slow
o Recovery is critical
o Isolation and parallelism allows you to
o Easily scale
o Save on costs
o Deal with loads

Takeaways
Prepare for the worst
Consider your stream….
Split your work into isolated tasks
Remove dependencies

You Can Reach Us At:
Opher Dubrovsky: /opher-dubrovsky, @jumpja
Ido Nadler: /ido-nadler, @IdoNadler
