The document discusses building an enterprise/cloud analytics platform using Jupyter notebooks and Apache Spark. It describes the challenges of deploying Jupyter notebooks at an enterprise scale, including collaboration, large-scale data analysis, security, and authentication. It outlines various approaches taken to address these challenges, such as running the entire Jupyter stack on a single large machine or giving each user their own container. However, these approaches have limitations. The document then introduces the Jupyter Enterprise Gateway as a solution developed by IBM to optimize resource allocation, support multi-users securely through impersonation, and enhance security overall when deploying Jupyter at an enterprise scale.
Flaky tests and bugs in Apache software (e.g. Hadoop) - Akihiro Suda
The document discusses flaky tests in Apache software projects. It notes that about half of development time in Hadoop projects is spent debugging test failures, and many builds fail due to tests. Flaky tests are a barrier to continuous integration and contribution. Common causes of flaky tests include asynchronous operations without proper timeouts, host configuration issues, and performance variations. Tools to identify and reproduce flaky tests are discussed.
Finding and Organizing a Great Cloud Foundry User Group - Daniel Krook
Slides from the 2015 Cloud Foundry Summit on May 12.
http://sched.co/2tGc
Virtualization and global distribution are great when it comes to cloud computing and open source. In both cases, physical location is irrelevant. But one of the best ways to join the Cloud Foundry community is to participate in a local meetup. The presenters will share their experience running user groups over the past decade and lessons learned from recent Cloud Foundry events.
This session will teach you how to:
1. Find an active Cloud Foundry (or related cloud computing) user group
2. Contribute your own knowledge at an upcoming event
3. Organize - and sustain - a strong Cloud Foundry community
After this presentation, you will:
1. Appreciate the professional (and social) benefits of attending a meetup
2. Know how to share your expertise and establish your eminence as a Cloud Foundry expert
3. Be prepared to effectively organize a sustainable Cloud Foundry user group
Puppet, Jenkins, and continuous integration (CI) were discussed. The presentation covered installing Jenkins as a master and slaves using Puppet, integrating Github pull requests with Jenkins and Mergeatron, and testing Puppet code with tools like Puppet Lint, Rspec-Puppet, and eventually running Puppet code on VMs. Future work may involve catalog checking and running Puppet code against real systems.
Tackling non-determinism in Hadoop - Testing and debugging distributed system... - Akihiro Suda
Earthquake is a tool for controlling non-determinism in distributed systems testing. It can schedule disk access, network packet, and function call events in a programmable way. Earthquake has found bugs in systems like ZooKeeper, YARN, and HDFS by reproducing rare non-deterministic execution paths. It achieves higher test coverage and bug reproduction rates compared to traditional testing approaches. Earthquake aims to be non-invasive, incrementally adoptable as understanding improves, and language independent.
Slack Bot: upload NUGET package to Artifactory - Sergey Dzyuban
What if the Jenkins CI automation that uploads a file to Artifactory fails, and users need a quick and safe mechanism to do the upload manually? A Slack bot can improve the user experience and add a bit of automation.
Continuous Deployment at Disqus (Pylons Minicon) - zeeg
The document discusses Disqus' approach to continuous deployment. It describes how code is automatically deployed as soon as tests pass, with the goal of releasing features incrementally. It outlines the workflow, pros and cons, and techniques used to simplify local development and ensure stability through testing. Pain points like test speed and database migrations are addressed along with tools developed in-house like Mule for distributed testing.
Xen Project Evangelist Russell Pavlicek talks about how the growing area of hypervisor-leveraging unikernels will help redefine the cloud.
MAJOR UPDATE: Deck is now the result of 2015 Ohio Linuxfest, about a year after the initial talk. Deck now contains almost twice as much information as the original talk.
SCALE13x: Next Generation of the Cloud - Rise of the Unikernel - The Linux Foundation
Russell Pavlicek discusses the potential of unikernels to improve cloud computing efficiency and security. Unikernels are specialized virtual machines containing just enough software to run an application, making them much smaller and faster than traditional virtual machines. Several unikernel projects are presented, including MirageOS, HaLVM, LING, ClickOS, OSv, and Rumprun. Unikernels could enable 1000s of lightweight virtual machines per server and transient microservices with lifetimes measured in fractions of a second. Open source projects are leading the development of unikernels and their supporting ecosystems.
This document provides an overview of the FusionInventory project. It discusses that FusionInventory is an open source inventory and asset management solution that integrates with the GLPI asset management platform. It allows for network discovery, inventory collection, Wake-on-LAN functionality, software deployment, and VMware ESXi inventory via APIs. The document outlines the project timeline, contributors, supported operating systems, information gathered, statistics on code size and tests, roadmap, and a use case of how FusionInventory has helped consolidate inventory needs for a school district.
This document discusses Docker and OpenStack integration. It begins with introductions to OpenStack and Docker, explaining that OpenStack is an open source cloud operating system and Docker is a container-based virtualization framework. It then discusses how Docker can be used with OpenStack, with Nova supporting Docker as a hypervisor starting in Havana. It concludes with mentioning a demo of Docker + OpenStack integration and inviting questions.
OpenStack is an open source cloud computing platform that controls pools of compute, storage, and networking resources throughout a datacenter, managed through a dashboard that is exposed through APIs. It is made up of interrelated projects that handle functions like computing, networking, storage, imaging, orchestration, and more. The platform provides tools to provision resources to users in a simple and automated manner at scale.
The lesson I learned is that open source quickly becomes the natural choice wherever commoditization is happening in the software stack. Thus we expect business-to-business open source, already a significant trend in recent history, to become an increasingly common form of open source collaboration. Companies who understand the ground rules of business-to-business open source will be better positioned to identify and take advantage of open source opportunities in the competitive spaces that they share with other companies.
In today's topic, I will share why an open source strategy is important for the enterprise, and how to contribute to open source projects.
(Embedded Linux Conference Europe 2014)
Linux is used in many kinds of embedded products, including not only consumer electronics but also control systems such as programmable logic controllers. There are many types of infrastructure systems, and each system has different technical requirements, including not only real-time performance but also reliability-related functions. Infrastructure systems have to meet all of these requirements. This presentation summarizes our study and development work to adapt Linux to infrastructure systems, and then discusses the direction of future development. Please note that this presentation does not focus on a specific product.
TryStack.cn is a non-profit OpenStack testbed and community project in China that aims to promote OpenStack adoption. It operates the largest OpenStack testbed in China with hardware from various vendors. TryStack.cn provides reference architectures, best practices, and contributes code back to the community. It also organizes OpenStack meetups and training to help grow the OpenStack ecosystem in China.
This document discusses Eclipse 4.0 and the e4 project. It provides an overview of why e4 was created, including to innovate Eclipse and prepare it for the web. It describes the key aspects of e4, including the modeled workbench, dependency injection, declarative styling using CSS, and a compatibility layer for Eclipse 3.x plugins. The presentation concludes by discussing where to learn more about e4.
The document discusses how open source technology is enabling the growth of the Internet of Things (IoT). It notes that IoT devices are growing rapidly in both popularity and scale. Many companies are using open source software, hardware, and standards to develop solutions for IoT markets. The document highlights several open source projects that are helping to support IoT development, including those focused on edge devices, operating systems, containerization, connectivity standards, and more. It provides examples of how these open source tools are enabling the scalability, security, and flexibility needed as IoT devices grow into the billions.
This document discusses containers and related technologies:
1. Containers provide isolated, portable environments for running applications and their dependencies. Docker is a popular container platform that packages applications into containers using Linux kernel features like namespaces and cgroups.
2. The Open Container Initiative (OCI) aims to develop standards around container formats and runtime. Technologies like Docker, rkt, and AppC implement the OCI specifications.
3. Container orchestration systems like Kubernetes and Mesos manage the deployment and lifecycles of containers at scale across clusters of hosts.
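To make the OCI point above concrete, here is a hedged sketch of a minimal container configuration in the spirit of the OCI runtime spec's `config.json`. Only a few fields (`ociVersion`, `process`, `root`) are shown, and the exact values are illustrative assumptions, not a normative example.

```python
import json

# Hypothetical, minimal subset of an OCI runtime config.json.
# Field names follow the OCI Runtime Specification (ociVersion,
# process.args, root.path); everything else is omitted for brevity.
config = {
    "ociVersion": "1.0.2",
    "process": {"args": ["/bin/sh", "-c", "echo hello"], "cwd": "/"},
    "root": {"path": "rootfs", "readonly": True},
}

# An OCI runtime (runc, crun, ...) consumes this JSON to create the
# container; here we just round-trip it to show the shape.
serialized = json.dumps(config, indent=2)
loaded = json.loads(serialized)
print(loaded["process"]["args"])
```

A runtime implementing the spec would combine this configuration with a root filesystem bundle and the kernel namespace/cgroup features mentioned above to launch the container.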
Building specialized container-based systems with Moby: a few use cases
This talk will explain how you can leverage the Moby project to assemble your own specialized container-based system, whether for IoT, cloud, or bare metal scenarios. We will cover Moby itself, the framework and tooling around the project, as well as many of its components: LinuxKit, InfraKit, containerd, SwarmKit, and Notary. Then we will present a few use cases and demos of how different companies have leveraged Moby and some of its components to create their own container-based systems.
Neo4j works very well in cloud environments. However, with such variance in compute, network, and storage options, the job of configuring a production database environment is getting complex. In this demo-oriented session, Patrick and David Makogon will introduce straightforward ways to configure and deploy Neo4j with Docker containers, as well as show how to use automated cloud resource configuration with the new Azure Resource Manager.
The Analytic Platform behind IBM’s Watson Data Platform - Big Data Spain 2017 - Luciano Resende
IBM has built a “Data Science Experience” cloud service that exposes Notebook services at web scale. Behind this service, there are various components that power this platform, including Jupyter Notebooks, an enterprise gateway that manages the execution of the Jupyter Kernels and an Apache Spark cluster that power the computation. In this session we will describe our experience and best practices putting together this analytical platform as a service based on Jupyter Notebooks and Apache Spark, in particular how we built the Enterprise Gateway that enables all the Notebooks to share the Spark cluster computational resources.
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a... - Big Data Spain
IBM has built a “Data Science Experience” cloud service that exposes Notebook services at web scale.
https://www.bigdataspain.org/2017/talk/the-analytic-platform-behind-ibms-watson-data-platform
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
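The resource optimization a kernel gateway performs, as described above, can be sketched conceptually: rather than running every notebook kernel on the notebook server itself, the gateway places each new kernel on the cluster node with the most spare capacity. The node names, counts, and the `place_kernel` helper below are illustrative only, not the actual Enterprise Gateway implementation.

```python
def place_kernel(nodes, kernels):
    """Return the node with the fewest kernels currently assigned.

    `kernels` maps node name -> number of running kernels; nodes not
    present in the map are treated as idle.
    """
    return min(nodes, key=lambda n: kernels.get(n, 0))

# Made-up cluster state: three worker nodes with different loads.
nodes = ["spark-worker-1", "spark-worker-2", "spark-worker-3"]
kernels = {"spark-worker-1": 4, "spark-worker-2": 1, "spark-worker-3": 2}

target = place_kernel(nodes, kernels)
print(target)  # spark-worker-2 (least loaded)
```

Real gateways weigh more than kernel counts (memory, GPU availability, user quotas), but the least-loaded placement idea is the core of sharing a cluster's computational resources across many notebooks.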
The document discusses software and programming concepts for IoT systems. It introduces the Raspberry Pi single board computer and how it can be used for IoT applications. Blockly and Python are presented as programming tools for IoT. Finally, a model IoT home automation system is demonstrated using sensors, actuators and single board computers connected through a home gateway.
Scaling notebooks for Deep Learning workloads - Luciano Resende
Deep learning workloads are compute-intensive, and training these types of models is better done with specialized hardware like GPUs. Luciano Resende outlines a pattern for building deep learning models using the Jupyter Notebook's interactive development on commodity hardware, then leveraging platforms and services such as Fabric for Deep Learning (FfDL) for cost-effective full-dataset training of deep learning models.
Jupyter con meetup extended jupyter kernel gateway - Luciano Resende
Data scientists are becoming a necessity for every company in today's data-centric world, and with them comes the requirement to make available an elastic and interactive analytics platform. This session will describe our experience and best practices putting together an analytics platform based on the Jupyter stack and different kernels running in a distributed Apache Spark cluster.
Strata - Scaling Jupyter with Jupyter Enterprise Gateway - Luciano Resende
Born in academia, Jupyter notebooks are prevalent in both learning and research environments throughout the scientific community. Due to the widespread adoption of big data, AI, and deep learning frameworks, notebooks are also finding their way into the enterprise, which introduces a different set of requirements.
Alan Chin and Luciano Resende explain how to introduce Jupyter Enterprise Gateway into new and existing notebook environments to enable a “bring your own notebook” model while simultaneously optimizing resources consumed by the notebook kernels running across managed clusters within the enterprise. Along the way, they detail how to use different frameworks with Enterprise Gateway to meet the needs of data scientists operating within the AI and deep learning ecosystems.
A Jupyter kernel for Scala and Apache Spark.pdf - Luciano Resende
Many data scientists are already making heavy use of the Jupyter ecosystem for analyzing data with interactive notebooks. Apache Toree (incubating) is a Jupyter kernel that enables data scientists and data engineers to easily connect to Apache Spark and leverage its powerful APIs from a standard Jupyter notebook to execute their analytics workloads. In this talk, we will go over what's new in the most recent Apache Toree release. We will cover available magics and visualization extensions that can be integrated with Toree to enable better data exploration and data visualization. We will also describe some of Toree's high-level design and how users can extend its functionality through Toree's powerful plugin system. All of this comes with multiple live demos that show how Toree can help with your analytics workloads in an Apache Spark environment.
The document summarizes Ulrich Krause's presentation on the latest developments from OpenNTF. The presentation covered:
- An overview of OpenNTF, its 800+ open source projects and 200k annual downloads.
- Current OpenNTF initiatives like CollaborationToday, XPages.info, contests and webinars.
- Specific projects like Bootstrap4XPages, org.openntf.domino, Tika for XPages, and Unplugged XPages mobile controls.
- The OpenNTF intellectual property policy and ways for developers to get involved.
This document provides an introduction to Jupyter Notebook and Azure Machine Learning Studio. It discusses popular programming languages like Python, R, and Julia that can be used with these tools. It also summarizes key features of Jupyter Notebook like code cells, kernels, and cloud deployment. Examples are given of using Python and R with Azure ML to fetch and transform data in Jupyter notebooks.
Come to this session to get an update about everything related to OpenNTF, the open source community for IBM Collaboration Solutions.
See the contest winning XPages projects live and learn about the new open source projects for IBM Connections.
The session will also cover the IBM Social Business Toolkit SDK which allows XPages, Java and JavaScript developers to easily access IBM Connections and IBM SmartCloud for Social Business from custom applications. Attend this session to see demos of the latest functionality and new samples of the toolkit.
Building analytical microservices powered by jupyter kernels - Luciano Resende
This document discusses building analytical microservices powered by Jupyter kernels. It provides an overview of Jupyter notebooks and their architecture. It then introduces the Jupyter Enterprise Gateway, which allows running Jupyter kernels on a distributed cluster for improved scalability and security. Finally, it demonstrates a use case of a sentiment analysis microservice that leverages PySpark on a Hadoop cluster via Jupyter kernels.
Big analytics meetup - Extended Jupyter Kernel Gateway - Luciano Resende
Luciano Resende from IBM's Spark Technology Center presented on building an enterprise/cloud analytics platform with Jupyter Notebooks and Apache Spark. The Spark Technology Center focuses on contributions to open source Apache Spark projects. Resende discussed limitations of the current Jupyter Notebook setup for multi-user shared clusters and demonstrated an Extended Jupyter Kernel Gateway that allows running kernels remotely in a cluster with enhanced security, resource optimization, and multi-user support through user impersonation. The Extended Jupyter Kernel Gateway is planned for open source release.
In this session, Luciano will walk you through a real use case pipeline that uses Elyra features to help analyze COVID-19 related datasets. He will introduce Elyra, a project built to extend JupyterLab with AI-centric capabilities, and showcase the extensions that allow you to build notebook pipelines and execute them in a Kubeflow environment, execute notebooks as batch jobs, and create, edit, and execute Python scripts directly from JupyterLab.
This document provides an overview and tutorial of using Google Colab. It discusses:
- The speaker's background and experience in big data, AI, and machine learning
- An introduction to Google Colab and its key features like GPU/TPU acceleration and hardware limitations
- A tutorial on connecting to Colab, accessing files from Google Drive, and comparing CPU and GPU performance
- Examples of using Colab for flower classification with Keras and TPU as well as homework on iris classification
Community works for multi-core embedded image processing - Jeongpyo Kong
1. The presentation discusses multi-core embedded image processing and the speaker's work with ETRI and KESSIA on related projects.
2. It provides technical background on requirements for embedded image processing such as low power and high performance. Approaches discussed include hardware-based ones using multi-core processors and software-based ones using efficient algorithms and frameworks.
3. The speaker's current work involves porting OpenCV to various hardware platforms from ETRI and conducting performance tests; future work may include developing specific applications for smart devices.
Luciano Resende - Scaling Big Data Interactive Workloads across Kubernetes Cl... - Codemotion
The Jupyter Notebook stack has become the "de facto" platform used by data scientists to interactively work on big data problems. With the popularity of deep learning, there is also an increasing need for resources to make deep learning effective. In this session, we will discuss how we brought support for Kubernetes into Jupyter Enterprise Gateway and touch on some best practices for scaling interactive big data workloads across a Kubernetes-managed cluster.
Leveraging Docker for Hadoop build automation and Big Data stack provisioning - DataWorks Summit
Apache Bigtop, an open source Hadoop distribution, focuses on developing packaging, testing, and deployment solutions that help infrastructure engineers build their own customized big data platform as easily as possible. However, packages deployed in production require a solid CI testing framework to ensure their quality, and the many Hadoop components must be verified to work together. In this presentation, we'll talk about how Bigtop delivers its containerized CI framework, which can be directly replicated by Bigtop users. The core innovations here are the newly developed Docker Provisioner, which leverages Docker for Hadoop deployment, and the Docker Sandbox, which lets developers quickly start a big data stack. This talk covers the containerized CI framework, the technical details of the Docker Provisioner and Docker Sandbox, the hierarchy of Docker images we designed, and several components we developed, such as the Bigtop Toolchain, to achieve build automation.
Iteratively introducing Puppet technologies in the brownfield; Jeffrey Miller - Puppet
This document summarizes a presentation about iteratively introducing Puppet technologies to manage infrastructure at Oak Ridge National Laboratory (ORNL). It discusses starting with basic automation tools like Bolt and working up to provisioning systems and configuring them using Puppet. The approach emphasizes testing changes using no-op runs before enforcing them and using iterative development practices with tools like Git. The overall goal is to gradually bring order and automation to a complex existing "brownfield" environment through a collaborative team effort.
This document summarizes a presentation about accelerating Apache Spark workloads using NVIDIA's RAPIDS accelerator. It notes that global data generation is expected to grow exponentially to 221 zettabytes by 2026. RAPIDS can provide significant speedups and cost savings for Spark workloads by leveraging GPUs. Benchmark results show a NVIDIA decision support benchmark running 5.7x faster and with 4.5x lower costs on a GPU cluster compared to CPU. The document outlines RAPIDS integration with Spark and provides information on qualification, configuration, and future developments.
Talk at SF Big Analytics https://www.meetup.com/sf-big-analytics/events/285731741/
Distributed systems are made up of many components such as authentication, a persistence layer, stateless services, load balancers, and stateful coordination services. These coordination services are central to the operation of the system, performing tasks such as maintaining system configuration state, ensuring service availability, name resolution, and storing other system metadata. Given their central role, it is essential that these services remain available, fault tolerant, and consistent. By providing a highly available file-system-like abstraction as well as powerful recipes such as leader election, Apache ZooKeeper is often used to implement these services. Although powerful, the ZooKeeper interface may not be flexible enough or provide sufficient performance for all applications, and many systems are replacing ZooKeeper-based solutions with Raft, which provides a more generic interface to high availability and fault tolerance through the use of state machine replication. This talk will go over a generic example of a stateful coordination service moving from ZooKeeper to Raft.
Speaker: Tyler Crain (Alluxio)
Tyler Crain is a software engineer at Alluxio, working on distributed systems within the Alluxio core team. Before this, Tyler held Post-Doc positions at the University of Sydney and Sorbonne Universities where he performed research on topics including distributed key-value stores, distributed consensus and blockchain. Tyler received his PhD from the University of Rennes where he worked on Transactional Memory. He also holds a Masters degree in Computer Science from University of California Santa Barbara.
Related Blog: https://www.alluxio.io/blog/from-zookeeper-to-raft-how-alluxio-stores-file-system-state-with-high-availability-and-fault-tolerance/
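The state machine replication idea behind Raft can be illustrated with a toy sketch: if every replica applies the same log of commands, in the same order, to a deterministic state machine, all replicas converge to the same state. The `apply_log` helper and the put/delete commands below are hypothetical illustrations of that property only; this is not Raft's consensus protocol, which is the (much harder) part that agrees on the log in the first place.

```python
def apply_log(log):
    """Apply a command log to a fresh key-value state machine."""
    state = {}
    for op, key, value in log:
        if op == "put":
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
    return state

# The same agreed-upon log, replayed independently on two replicas.
log = [("put", "leader", "node-a"), ("put", "epoch", 7), ("delete", "leader", None)]
replica_1 = apply_log(log)
replica_2 = apply_log(log)

assert replica_1 == replica_2  # same log, same order -> same state
print(replica_1)  # {'epoch': 7}
```

Because the state machine is generic, any service state (configuration, membership, metadata) can be made highly available this way, which is what makes Raft a more flexible building block than ZooKeeper's fixed znode interface.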
SF Big Analytics 2022-03-15: Persia: Scaling DL Based Recommenders up to 100 ... - Chester Chen
Recent years have witnessed exponential growth of model scale in recommendation/Ads/search, from Google's 2016 model with 1 billion parameters to Facebook's latest model with 12 trillion parameters. A significant quality boost has come with each jump in model capacity, which makes people believe the era of 100 trillion parameters is around the corner. To prepare for this exponential growth in model size, an efficient distributed training system is urgently needed. However, training such huge models is challenging even within industrial-scale data centers. In this talk, I will introduce Persia, an open training system developed by my team, which resolves this challenge through careful co-design of both the optimization algorithm and the distributed system architecture. Persia admits nearly linear speedup while scaling the number of workers and the model size. Besides the capability of training 100 trillion parameters, it also shows a clear advantage in efficiency over other open-source engines.
paper link:
https://arxiv.org/pdf/2111.05897.pdf
Speaker: Ji Liu
Dr. Ji Liu received his Ph.D. in computer science and his bachelor's degree in automation from the University of Wisconsin-Madison and the University of Science and Technology of China, respectively. After graduation, he joined the University of Rochester as an assistant professor, conducting research in machine learning, optimization, and reinforcement learning. The asynchronous and decentralized algorithms he developed are widely used in industry, for example at IBM and Microsoft. He left academia and joined Tencent in 2017, exploring AI's boundaries. The AI agent Tstarbot developed there was considered a milestone for mastering the most challenging RTS game, Starcraft II. His second stop in industry was Kwai, the second largest short video company in China, where he founded and led multiple international teams with different functions: a platform team, a product team, and a research team. His team contributed to 15+% annual revenue growth in Ads. He has published 100+ papers in top-tier CS conferences and journals and received multiple best paper awards (e.g., SIGKDD 2010 and the UAI 2015 Facebook best paper). He was an awardee of MIT TR 35 Under 35 in China and an IBM Faculty Award in 2017, and was nominated as one of China's top 5 AI innovators under 35 in 2018.
SF Big Analytics talk: NVIDIA FLARE: Federated Learning Application Runtime E... - Chester Chen
Topic:
NVIDIA FLARE: Federated Learning Application Runtime Environment for Developing Robust AI Models
Summary:
Federated learning (FL) enables building robust and generalizable AI models by leveraging diverse datasets from multiple collaborators without moving data. We created NVIDIA FLARE as an open-source SDK to make it easier for data scientists to use FL in their research. The SDK allows existing machine learning and deep learning workflows adapted for distributed learning across enterprises and enables platform developers to build a secure, privacy-preserving offering for multiparty collaboration utilizing homomorphic encryption or differential privacy. The SDK is a lightweight, flexible, and scalable Python package and allows researchers to bring their data science workflows implemented in any training libraries (PyTorch, TensorFlow, or even NumPy), and apply them in real-world FL settings. This talk will introduce the key design principles of NVIDIA FLARE and illustrate use cases (e.g., COVID analysis) with customizable FL workflows that implement different privacy-preserving algorithms.
Speaker: Dr. Holger Roth (NVIDIA)
Holger Roth is a Sr. Applied Research Scientist at NVIDIA focusing on deep learning for medical imaging. He has been working closely with clinicians and academics over the past several years to develop deep learning based medical image computing and computer-aided detection models for radiological applications. He is an Associate Editor for IEEE Transactions of Medical Imaging and holds a Ph.D. from University College London, UK. In 2018, he was awarded the MICCAI Young Scientist Publication Impact Award.
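The core federated learning step described above, combining locally trained models without moving any data, can be sketched as weighted federated averaging (FedAvg): each collaborator ships only its model weights, and the server averages them weighted by local dataset size. This is a minimal pure-Python illustration of the general idea, not NVIDIA FLARE's actual API; the weights and dataset sizes are made up.

```python
def federated_average(client_weights, client_sizes):
    """Combine per-client weight vectors, weighted by local dataset size."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]

# Two hypothetical collaborators (e.g. hospitals) with different
# amounts of local data; only weights leave the site, never the data.
weights = [[1.0, 2.0], [3.0, 4.0]]
sizes = [100, 300]
global_model = federated_average(weights, sizes)
print(global_model)  # [2.5, 3.5]
```

Privacy-preserving variants layer homomorphic encryption or differential privacy on top of this exchange, so the server never sees raw client updates either.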
A missing link in the ML infrastructure stack? - Chester Chen
Talk at SF Big Analytics
Machine learning is quickly becoming a product engineering discipline. Although several new categories of infrastructure and tools have emerged to help teams turn their models into production systems, doing so is still extremely challenging for most companies. In this talk, we survey the tooling landscape and point out several parts of the machine learning lifecycle that are still underserved. We propose a new category of tool that could help alleviate these challenges and connect the fragmented production ML tooling ecosystem. We conclude by discussing similarities and differences between our proposed system and those of a few top companies.
Bio: Josh Tobin is the founder and CEO of a stealth machine learning startup. Previously, Josh worked as a deep learning & robotics researcher at OpenAI and as a management consultant at McKinsey. He is also the creator of Full Stack Deep Learning (fullstackdeeplearning.com), the first course focused on the emerging engineering discipline of production machine learning. Josh did his PhD in Computer Science at UC Berkeley advised by Pieter Abbeel.
This document discusses the challenges of data discovery and management from the perspectives of frustrated data scientists and project managers. It explores three main problems with obtaining and working with data. While buying a solution was considered, there were also risks to mitigate. The document asks about the biggest flaw of Artifact and what is next for the company. It concludes by thanking the reader.
SF Big Analytics 20191112: How to performance-tune Spark applications in larg... - Chester Chen
Uber developed a new Spark ingestion system, Marmaray, for data ingestion from various sources. It's designed to ingest billions of Kafka messages every 30 minutes, and the amount of data handled by the pipeline is on the order of hundreds of TBs. Omkar details how to tackle such scale and shares insights into the optimization techniques. Some key highlights: how to understand bottlenecks in Spark applications; whether to cache your Spark DAG to avoid rereading your input data; how to effectively use accumulators to avoid unnecessary Spark actions; how to inspect your heap and non-heap memory usage across hundreds of executors; how you can change the layout of data to save long-term storage cost; how to effectively use serializers and compression to save network and disk traffic; how to amortize the cost of your application by multiplexing your jobs; and different techniques for reducing memory footprint, runtime, and on-disk usage. The team was able to significantly (~10%–40%) reduce memory footprint, runtime, and disk usage.
Speaker: Omkar Joshi (Uber)
Omkar Joshi is a senior software engineer on Uber’s Hadoop platform team, where he’s architecting Marmaray. Previously, he led object store and NFS solutions at Hedvig and was an initial contributor to Hadoop’s YARN scheduler.
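The "to cache or not to cache" question above can be made concrete with a toy model: in a lazily evaluated pipeline, two actions that share an input each trigger a re-read of that input unless the intermediate result is cached. This pure-Python sketch mimics that Spark DAG behavior with a read counter; it is not actual PySpark code, and the `read_input` source is made up.

```python
reads = {"count": 0}

def read_input():
    """Stand-in for an expensive source read (e.g. scanning Kafka/HDFS)."""
    reads["count"] += 1
    return list(range(10))

# Uncached lazy pipeline: each downstream action re-runs the read.
total = sum(read_input())
maximum = max(read_input())
uncached_reads = reads["count"]

# Cached: materialize once, reuse the result for both actions
# (analogous to calling .cache() on a shared DataFrame/RDD).
reads["count"] = 0
cached = read_input()
total, maximum = sum(cached), max(cached)
cached_reads = reads["count"]

print(uncached_reads, cached_reads)  # 2 1
```

The trade-off the talk explores is that caching costs executor memory, so it pays off only when the saved recomputation outweighs the memory pressure it creates.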
SF Big Analytics 2019112: Uncovering performance regressions in the TCP SACK... - Chester Chen
Uncovering performance regressions in the TCP SACKs vulnerability fixes
In early July 2019, Databricks noticed some Apache Spark workloads regressing by as much as 6x. In this talk, we'll discuss how we traced these regressions back to the Linux kernel and the fixes for the TCP SACKs vulnerabilities. We will explain the symptoms we were seeing, walk through how we debugged the TCP connections, and dive into the Linux source to uncover the root cause.
Speaker: Chris Stevens (Databricks)
Chris Stevens is a software engineer at Databricks, where he works on the reliability, scalability, and security of Apache Spark clusters. His work focuses on auto-scaling compute, auto-scaling storage, node initialization performance, and node health monitoring. Prior to Databricks, Chris founded the Minoca OS project, where he built a POSIX-compliant, general-purpose OS from scratch to run on resource-constrained devices. He got his start at Microsoft, working on the Windows kernel team and porting the Windows boot environment from BIOS to UEFI.
SFBigAnalytics_20190724: Monitor kafka like a ProChester Chen
Kafka operators need to provide guarantees to the business that Kafka is working properly and delivering data in real time, and they need to identify and triage problems so they can solve them before end users notice them. This elevates the importance of Kafka monitoring from a nice-to-have to an operational necessity. In this talk, Kafka operations experts Xavier Léauté and Gwen Shapira share their best practices for monitoring Kafka and the streams of events flowing through it: how to detect duplicates, catch buggy clients, and triage performance issues – in short, how to keep the business's central nervous system healthy and humming along, like a Kafka pro.
Speakers: Gwen Shapira, Xavier Léauté (Confluent)
Gwen is a software engineer at Confluent working on core Apache Kafka. She has 15 years of experience working with code and customers to build scalable data architectures. She currently specializes in building real-time reliable data processing pipelines using Apache Kafka. Gwen is an author of “Kafka - the Definitive Guide”, "Hadoop Application Architectures", and a frequent presenter at industry conferences. Gwen is also a committer on the Apache Kafka and Apache Sqoop projects.
Xavier Léauté was one of the first engineers to join the Confluent team; he is responsible for analytics infrastructure, including real-time analytics in Kafka Streams. He was previously a quantitative researcher at BlackRock. Prior to that, he held various research and analytics roles at Barclays Global Investors and MSCI.
SF Big Analytics 2019-06-12: Managing uber's data workflows at scaleChester Chen
Talk 2. Managing Uber’s Data workflow at Scale.
Uber's microservices serve millions of rides a day, generating 100+ PB of data. To democratize data pipelines, Uber needed a central tool that provides a way to author, manage, schedule, and deploy data workflows at scale. This talk details Uber's journey toward a unified and scalable data workflow system used to manage this data and shares the challenges faced and how the company has rearchitected several components of the system, such as scheduling and serialization, to make them highly available and more scalable.
Speaker Alex Kira (Uber)
Alex Kira is an engineering tech lead at Uber, where he works on the data workflow management team. His team provides a data infrastructure platform. Over his 19-year career, he has gained experience across several software disciplines, including distributed systems, data infrastructure, and full-stack development.
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...Chester Chen
Building highly efficient data lakes using Apache Hudi (Incubating)
Even with the exponential growth in data volumes, ingesting, storing, and managing big data remains unstandardized and inefficient. Data lakes are a common architectural pattern for organizing big data and democratizing access across the organization. In this talk, we will discuss different aspects of building sound data lake architectures, pinpointing technical challenges and areas of inefficiency. We will then re-architect the data lake using Apache Hudi (Incubating), which provides streaming primitives right on top of big data. We will show how upserts and incremental change streams provided by Hudi help optimize data ingestion and ETL processing. Further, Apache Hudi manages growth and file sizing in the resulting data lake using purely open-source file formats, while also providing optimized query performance and file-system listing. We will also provide hands-on tools and guides for trying this out on your own data lake.
Speaker: Vinoth Chandar (Uber)
Vinoth is a technical lead on Uber's Data Infrastructure team.
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftChester Chen
Talk 1. Scaling Apache Spark on Kubernetes at Lyft
As part of its mission, Lyft invests heavily in open source infrastructure and tooling. At Lyft, Kubernetes has emerged as the next generation of cloud-native infrastructure to support a wide variety of distributed workloads. Apache Spark at Lyft has evolved to serve both machine learning and large-scale ETL workloads. By combining the flexibility of Kubernetes with the data processing power of Apache Spark, Lyft is able to drive ETL data processing to a different level. In this talk, we will cover the challenges the Lyft team faced and the solutions they developed to support Apache Spark on Kubernetes in production and at scale. Topics include: key traits of Apache Spark on Kubernetes; a deep dive into Lyft's multi-cluster setup and operations to handle petabytes of production data; how Lyft extends and enhances Apache Spark to support capabilities such as Spark pod lifecycle metrics and state management, resource prioritization, and queuing and throttling; dynamic job scale estimation and runtime dynamic job configuration; and how Lyft powers internal data scientists, business analysts, and data engineers via a multi-cluster setup.
Speaker: Li Gao
Li Gao is the tech lead of the cloud-native Spark compute initiative at Lyft. Prior to Lyft, Li worked at Salesforce, Fitbit, Marin Software, and a few startups in various technical leadership positions on cloud-native and hybrid-cloud data platforms at scale. Besides Spark, Li has scaled and productionized other open source projects, such as Presto, Apache HBase, Apache Phoenix, Apache Kafka, Apache Airflow, Apache Hive, and Apache Cassandra.
SFBigAnalytics- hybrid data management using cdapChester Chen
Cloud has emerged as a critical enabler of digital transformation, with the aim of reducing IT overheads and costs. However, cloud migration is not instantaneous for a variety of reasons, including data sensitivity, compliance, and application performance. This results in the creation of diverse hybrid and multi-cloud environments and amplifies data management and integration challenges. This talk demonstrates how CDAP's flexibility can allow you to utilize your existing on-premises infrastructure as you evolve to the latest Big Data and Cloud services at your own pace, all while providing a single, unified view of all your data, wherever it resides.
Speaker: Bhooshan Mogal, Google
Bhooshan Mogal is a Product Manager at Google, where he is focused on delivering best-in-class Data and Analytics services to GCP users. Prior to Google, he worked on data systems at Cask Data Inc, Pivotal and Yahoo.
Bighead: Airbnb's end-to-end machine learning platform
Airbnb has a wide variety of ML problems, ranging from models on traditional structured data to models built on unstructured data such as user reviews, messages, and listing images. The ability to build, iterate on, and maintain healthy machine learning models is critical to Airbnb's success. Bighead aims to tie together various open source and in-house projects to remove incidental complexity from ML workflows. Bighead is built on Python, Spark, and Kubernetes. The components include a lifecycle management service, an offline training and inference engine, an online inference service, a prototyping environment, and a Docker image customization tool. Each component can be used individually. In addition, Bighead includes a unified model building API that smoothly integrates popular libraries including TensorFlow, XGBoost, and PyTorch. Each model is reproducible and iterable through standardization of data collection and transformation, model training environments, and production deployment. This talk covers the architecture, the problems that each individual component and the overall system aim to solve, and a vision for the future of machine learning infrastructure. Bighead is widely adopted at Airbnb, with a variety of models running in production, and we plan to open source it to allow the wider community to benefit from our work.
Speaker: Andrew Hoh
Andrew Hoh is the Product Manager for the ML Infrastructure and Applied ML teams at Airbnb. Previously, he has spent time building and growing Microsoft Azure's NoSQL distributed database. He holds a degree in computer science from Dartmouth College.
Sf big analytics_2018_04_18: Evolution of the GoPro's data platformChester Chen
Talk 1 : Evolution of the GoPro's data platform
In this talk, we will share GoPro's experiences in building a data analytics cluster in the cloud. We will discuss: the evolution of the data platform from fixed-size Hadoop clusters to a cloud-based Spark cluster with a centralized Hive Metastore + S3 (cost benefits and DevOps impact); a configurable, Spark-based batch ingestion/ETL framework; migrating the streaming framework to cloud + S3; analytics metrics delivery with Slack integration; BedRock, a data platform management, visualization & self-service portal; and visualizing machine learning features via Google Facets + Spark.
Speakers: Chester Chen
Chester Chen is the Head of Data Science & Engineering, GoPro. Previously, he was the Director of Engineering at Alpine Data Lab.
David Winters
David is an architect on the Data Science and Engineering team at GoPro and the creator of their Spark-Kafka data ingestion pipeline. Previously, he worked at Apple and Splice Machine.
Hao Zou
Hao is a senior big data engineer on the Data Science and Engineering team. Previously, he worked at Alpine Data Labs and Pivotal.
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...Chester Chen
GoPro's cameras, drones, and mobile devices, as well as its web and desktop applications, generate billions of event logs. The analytics metrics and insights that inform product, engineering, and marketing decisions need to be distributed quickly and efficiently, and we need to visualize the metrics to find trends and anomalies.
While building up a feature store for machine learning, we need to visualize the features. Google Facets is an excellent project for visualizing features, but can it visualize larger feature datasets?
These are issues we encountered at GoPro as part of the data platform's evolution. In this talk, we will discuss some of the progress we have made: how to use Slack + Plot.ly to deliver analytics metrics and visualizations, and our work to visualize large feature sets using Google Facets with Apache Spark.
Spark can be enhanced with data warehouse capabilities to leverage both open source analytics and enterprise data warehouse strengths. This includes incorporating star schema detection and referential integrity constraints to optimize queries. Performance can be improved by pushing down operations like joins, filters, and projections from Spark to underlying data sources using heuristics like star schema patterns. Push downs allow exploiting database indexes and reducing data transfer. Star schema detection and join push downs have shown speedups of 2-31x on TPC-DS benchmark queries.
This document summarizes new features in Apache Spark 2.3, including continuous processing mode for structured streaming, stream-stream joins, running Spark applications on Kubernetes, improved PySpark performance through vectorized UDFs and Pandas integration, and Databricks Delta for reliability and performance in data lakes. The author, an Apache Spark committer and PMC member, provides overviews and code examples of these features.
The document summarizes new features and improvements in Apache Spark 2.3 for machine learning. Key highlights include first-class support for loading image data, enhanced scalability of feature transformers by supporting multiple columns, parallelizing cross-validation for faster hyperparameter tuning, and a new scalable feature hashing transformer. Performance tests demonstrate that the multi-column transformers provide up to 2.7x speedup over the single-column approach. Parallel cross-validation also provides a 2-2.7x speedup using 3 threads. Future areas of focus include completing multi-column support, improving Python APIs, and enhancing techniques like gradient boosted trees.
The document discusses major deep learning frameworks that can be used with Spark, including Deeplearning4J, BigDL, Deep Learning Pipelines, TensorFlowOnSpark, and Microsoft Machine Learning on Spark. It provides brief overviews of each framework's capabilities and support for distributed GPU/CPU training. The document also lists some other frameworks and raises questions about real-world uses of deep learning with Spark, common problems being solved, and challenges integrating Spark with deep learning frameworks.
The Rise of Python in Finance,Automating Trading Strategies: _.pdfRiya Sen
In the dynamic realm of finance, where every second counts, the integration of technology has become indispensable. Aspiring traders and seasoned investors alike are turning to coding as a powerful tool to unlock new avenues of financial success. In this blog, we delve into the world of Python live trading strategies, exploring how coding can be the key to navigating the complexities of the market and securing your path to prosperity.
1. Overview of statistical software such as ODK, SurveyCTO, and CSPro
2. Software installation(for computer, and tablet or mobile devices)
3. Create a data entry application
4. Create the data dictionary
5. Create the data entry forms
6. Enter data
7. Add Edits to the Data Entry Application
8. CAPI questions and texts
Introduction to Data Science
1.1 What is data science; importance of data science
1.2 Big data and data science; the current scenario
1.3 Industry perspective; types of data: structured vs. unstructured data
1.4 Quantitative vs. categorical data
1.5 Big data vs. little data; the data science process
1.6 Role of the data scientist
I’m excited to finally share my research from last year on the hypnotic effects of mass media and digital platformization. This study explores how our attention is influenced through YouTube’s audio-visual content. Key points:
- **Objective:** Examine the hypnotic side effects of media on attention.
- **Focus:** Sound and visual experiences on YouTube.
- **Methodology:** Mixed digital approach with quantitative and qualitative analysis.
- **Findings:** Observations on techniques in attention-based economies and their cognitive impact.
- **Implications:** Considerations for future research in media and mind interactions, especially within OSINT-oriented communities.
Curious about the details? Check out my slide deck and let’s discuss the future possibilities.
#Research #AttentionEconomy #YouTube #DigitalMedia #MediaStudies #VisualNetworkAnalysis #HypnodelicMedia
2. 2
Hi!
Fred Reiss
• 2014-present: Chief Architect, IBM Spark Technology Center
• 2006-2014: Worked for IBM Research
• 2006: Ph.D. from U.C. Berkeley
3. 3
The Jupyter Project
• Open source project that builds software to enable interactive notebooks for data science
– Started in 2014
– Grew out of the IPython project
7. 7
Jupyter Notebooks
• Jupyter notebooks are widely used by data scientists, social scientists, physical scientists, engineers, and others
• Useful for many tasks
– Analyzing data
– Developing and debugging software
– Running experiments
– Keeping track of experimental results
– Presenting results
• Jupyter is a central part of the IBM Data Science Experience (http://datascience.ibm.com)
8. 8
Jupyter in the Enterprise: Key Challenges
• Collaboration among multiple users
• Large-scale data analysis (problems that don't fit on a laptop)
– Shared cloud infrastructure like Kubernetes
– Parallel frameworks like Spark
• Security and authentication
• Auditing and data access control
9. 9
Isn’t this just shipping strings around?
[Diagram: the naive view — the JavaScript front end sends the string "1+1" to the server, the server forwards it to a Python process, and the result "2" travels back the same way.]
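The naive "just shipping strings around" view sketched above can be written as a toy round trip: every layer simply forwards a code string inward and a result string outward. This is a deliberately simplistic stand-in, not Jupyter's real protocol:

```python
# Toy model of the naive view: each layer forwards a code string,
# the innermost layer evaluates it, and the result travels back out.

def python_process(code: str) -> str:
    # Stand-in for the kernel: evaluate the code, return its repr.
    return repr(eval(code))

def server(code: str) -> str:
    # Stand-in for the notebook server: just forward the string.
    return python_process(code)

def javascript_frontend(code: str) -> str:
    # Stand-in for the browser: forward the string, receive the result.
    return server(code)

print(javascript_frontend("1+1"))  # prints: 2
```

The rest of the talk explains why the real architecture is far more involved than this three-function pipeline.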
10. 10
Isn’t this just shipping strings around?
[Diagram: the same naive pipeline, with the server replaced by a FancyNewSystem that adds security, multitenancy, authentication, Spark, and Kubernetes while still forwarding "1+1" to the Python process and "2" back.]
14. 14
Asynchronous Operations
• Queue up multiple cells for execution… in arbitrary order
• Stream output while a cell is running
• Interrupt any operation
[Screenshot callout: the fifteenth cell that executed in this session]
15. 15
Jupyter’s Display System: Much More than Text
https://nbviewer.jupyter.org/github/ipython/ipython/blob/master/examples/IPython%20Kernel/Custom%20Display%20Logic.ipynb
22. The Actual Architecture of Jupyter Notebooks
[Diagram: the JavaScript front end talks to the notebook server process, a Python process containing notebook management, kernel management, notebook server state, and a kernel proxy, with notebooks stored on the local filesystem. The kernel proxy speaks to the IPython kernel over five channels (Shell, IOPub, stdin, control, heartbeat); the kernel holds session state and runs user code against libraries such as sklearn, Spark, TensorFlow, …]
28. The Actual Architecture of Jupyter Notebooks
[Same diagram as slide 22, annotated: five ZeroMQ message queues over unencrypted TCP sockets… per kernel]
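Those five queues carry structured messages, not bare strings. As a hedged sketch, here is roughly what an execute request looks like, with field names following the public Jupyter messaging specification (the ZeroMQ transport itself is omitted):

```python
# Sketch of a Jupyter protocol message as carried over the kernel's
# five channels (shell, iopub, stdin, control, heartbeat). Field names
# follow the Jupyter messaging specification; transport is omitted.
import uuid
from datetime import datetime, timezone

CHANNELS = ("shell", "iopub", "stdin", "control", "heartbeat")

def make_execute_request(code: str, session: str) -> dict:
    """Build an execute_request message for the shell channel."""
    return {
        "header": {
            "msg_id": uuid.uuid4().hex,
            "msg_type": "execute_request",
            "session": session,
            "username": "user",
            "date": datetime.now(timezone.utc).isoformat(),
            "version": "5.3",
        },
        "parent_header": {},  # replies link back to the request here
        "metadata": {},
        "content": {"code": code, "silent": False},
    }

msg = make_execute_request("1+1", session=uuid.uuid4().hex)
print(msg["header"]["msg_type"])  # prints: execute_request
```

Multiplying this envelope by five sockets per kernel, times many kernels, is what makes the unencrypted-TCP detail above an enterprise problem.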
29. 29
Third-Party Kernels
• The IPython kernel is the most common…
• …but there is a long tail of other Jupyter kernels
– 103 kernels currently listed on the Jupyter project's wiki
30. The Actual Architecture of Jupyter Notebooks
[Same diagram as slide 22, annotated: to share notebooks among users, you need to share the notebook server]
31. The Actual Architecture of Jupyter Notebooks
[Same diagram as slide 22, annotated: to use Apache Spark™ on YARN, the kernel needs to be inside the YARN cluster's network]
32. 32
Jupyter in the Enterprise: Key Challenges
• Collaboration among multiple users
• Large-scale data analysis
– Shared cloud infrastructure like Kubernetes
– Parallel frameworks like Spark
• Security and authentication
• Auditing and data access control
Bringing these properties to the Jupyter stack is hard!
36. 36
Compromise #1: Gigantic Server
• Find the biggest machine or container you can get
• Run the entire Jupyter stack on that one machine
• Issues:
– The machine needs to be sized for the maximum aggregate memory of all active users' active kernels
• Hard upper limit of 256 GB-1 TB in most organizations
• Very problematic if you have many users and big data
– Need to authenticate all these users to the same machine and notebook server
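The sizing constraint behind Compromise #1 is simple arithmetic: the single machine must hold the peak aggregate memory of every active kernel. A back-of-the-envelope helper (the function and the example numbers are illustrative, not from the slides):

```python
# Back-of-the-envelope sizing for the "one gigantic server" approach:
# the machine must be sized for peak aggregate kernel memory.

def required_memory_gb(active_users: int, kernels_per_user: int,
                       heap_per_kernel_gb: float) -> float:
    """Peak memory the single machine must accommodate."""
    return active_users * kernels_per_user * heap_per_kernel_gb

# 50 active users, 2 kernels each, 4 GB heap per kernel:
print(required_memory_gb(50, 2, 4))  # prints: 400.0
```

At 400 GB for a modest 50 users, this already exceeds the 256 GB floor of the hard upper limit cited above; larger teams blow past the 1 TB ceiling.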
37. 37
Compromise #2: Notebook Server Per User
• A proxy server manages a pool of containers, one per active user
• Each container contains an entire Jupyter notebook stack
• The JupyterHub project provides a pre-built implementation of this approach
• Issues:
– The container needs to be big enough for all the user's kernels
• What size container to allocate when the user logs in?
• Does a big enough container even exist?
– Disables collaboration features
– Many more moving parts → more failure modes
39. Compromise #3: Replace the Kernel
• Replace the IPython kernel with a proxy
• Put something enterprise-friendly on the other side of the proxy
• Apache Livy implements this approach
– https://github.com/jupyter-incubator/sparkmagic
• Issues:
– Breaks Jupyter's magics and extensions
– Breaks data visualization libraries
– Breaks third-party kernels
– Less control over code execution
[Diagram: the kernel proxy still exposes the Shell, IOPub, stdin, control, and heartbeat channels, but forwards execution to a RESTful web service.]
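For context, a Livy-style proxy such as sparkmagic swaps the ZeroMQ kernel protocol for REST calls. A minimal sketch of how such a client might build the request that submits code to a Livy session (the host is hypothetical, the endpoint shape follows Livy's documented REST API, and no network call is made):

```python
# Sketch of the REST round trip a kernel proxy makes against Apache
# Livy instead of speaking the ZeroMQ kernel protocol. Endpoint shape
# follows Livy's REST API; the host is a placeholder and nothing is sent.
import json

LIVY_URL = "http://livy-server:8998"  # hypothetical Livy host

def statement_request(session_id: int, code: str) -> tuple:
    """Return the (URL, JSON body) for submitting code to a Livy session."""
    url = f"{LIVY_URL}/sessions/{session_id}/statements"
    body = json.dumps({"code": code}).encode("utf-8")
    return url, body

url, body = statement_request(0, "1+1")
print(url)  # prints: http://livy-server:8998/sessions/0/statements
```

Everything Jupyter-specific (magics, display messages, third-party kernels) has no place in this request/response shape, which is why the slide lists those features as casualties of the approach.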
40. 40
The Five Stages of Enterprise Jupyter Deployment
1. Denial
2. Anger
3. Bargaining
41. 41
The Five Stages of Enterprise Jupyter Deployment
1. Denial
2. Anger
3. Bargaining
4. Depression
42. 42
The Five Stages of Enterprise Jupyter Deployment
1. Denial
2. Anger
3. Bargaining
4. Depression
5. Jupyter Enterprise Gateway
43. 43
The Origins of Jupyter Enterprise Gateway
• Multiple IBM products embedding Spark on YARN
• All wanted to add Jupyter notebooks with Spark
• Usual enterprise requirements (multitenancy, scalability, security, etc.)
• Had reached the “Bargaining” stage
– Mix of compromises 1, 2, and 3
46. Issue #1: All kernels run on a single node
[Chart: the maximum number of simultaneous kernels (4 GB heap each) stays at 8 whether the cluster has 4, 8, 12, or 16 nodes of 32 GB, because every kernel runs on the single notebook server node.]
47. Jupyter Enterprise Gateway: Initial Goals
• Optimized resource allocation
– Run Spark in YARN cluster mode to better utilize cluster resources
– Pluggable architecture for additional resource managers
• Multiuser support with user impersonation
– Enhance security and sandboxing by enabling user impersonation when running kernels (using Kerberos)
– Individual HDFS home folder for each notebook user
– Use the same user ID for notebook and batch jobs
• Enhanced security
– Secure socket communications
– Any network communication should be encrypted
49. Scalability Benefits
[Chart: before JEG, the maximum number of simultaneous kernels (4 GB heap each) is stuck at 8 regardless of cluster size; after JEG it scales with the cluster: 16, 32, 48, and 64 kernels on 4, 8, 12, and 16 nodes of 32 GB.]
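The before/after numbers above reduce to a simple model: before JEG, every kernel lands on the one gateway node, so capacity is flat; after JEG, kernels run in YARN cluster mode and spread across all nodes (the chart's figures correspond to about 4 kernels per node). A small illustration of that arithmetic:

```python
# Kernel capacity model matching the scalability chart:
# 32 GB nodes, 4 GB heap per kernel.

def max_kernels_before_jeg(nodes: int) -> int:
    # All kernels share the single gateway node, so capacity is flat.
    return 8

def max_kernels_after_jeg(nodes: int) -> int:
    # Kernels spread across the cluster; the chart implies ~4 per node.
    return 4 * nodes

for n in (4, 8, 12, 16):
    print(n, max_kernels_before_jeg(n), max_kernels_after_jeg(n))
```

Running this reproduces the chart's pairs: 8 kernels before JEG at every cluster size, versus 16, 32, 48, and 64 after.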
50. Jupyter Enterprise Gateway: Open Source
• Released through the Jupyter Incubator
– BSD License
– https://github.com/jupyter-incubator/enterprise_gateway
– Current release: 0.7.0
51. Jupyter Enterprise Gateway: Supported Platforms
• Python / Spark 2.x using the IPython kernel
– With Spark context delayed initialization
• Scala 2.11 / Spark 2.x using the Apache Toree kernel
– With Spark context delayed initialization
• R / Spark 2.x with IRkernel
52. Jupyter Enterprise Gateway – Roadmap
• Add support for other resource managers
– Kubernetes support
• Kernel configuration profiles
– Enable clients to request different resource configurations for kernels (e.g. small, medium, large)
– Profiles should be defined by administrators and enabled for a user or group of users
• Administration UI
– Dashboard with running kernels and administration actions (time running, stop/kill, profile management, etc.)
• User environments
• High availability
54. 54
Thank you!
And special thanks to the Jupyter Enterprise Gateway team: Luciano Resende, Kevin Bates, Kun Liu, Christian Kadner, Sanjay Saxena, Alan Chin, Sherry Guo, Alex Bozarth, Zee Chen
57. Jupyter Enterprise Gateway: Deployment
[Diagram: a management node, powered by Ambari, runs the Enterprise Gateway (EG) in front of a compute engine based on Apache Spark.]
58. Jupyter Enterprise Gateway: Deployment
• Ansible deployment scripts
– https://github.com/lresende/spark-cluster-install
• One-click deployment of the Spark cluster
– Configure your host inventory (see the example in the git repository)
– Run the "setup-ambari.yml" playbook
• $ ansible-playbook --verbose setup-ambari.yml -i hosts-fyre-ambari -c paramiko
• One-click deployment of the Jupyter Enterprise Gateway
– Run the "setup-enterprise-gateway.yml" playbook
• $ ansible-playbook --verbose setup-enterprise-gateway.yml -i hosts-fyre-ambari -c paramiko
59. Jupyter Enterprise Gateway - Deployment
• Docker images
– yarn-spark: basic one-node Spark-on-YARN configuration
– enterprise-gateway: adds Anaconda and Jupyter Enterprise Gateway to the yarn-spark image
– nb2kg: minimal Jupyter Notebook client configured with hooks to access the Enterprise Gateway
– https://github.com/jupyter-incubator/enterprise_gateway/tree/master/etc/docker
• Building the latest docker images
– git clone https://github.com/jupyter-incubator/enterprise_gateway
– make docker-clean docker-images
– Note: make also has individual targets to clean and build individual images (type make for help)
60. Jupyter Enterprise Gateway - Deployment
• Connecting to a Spark cluster using a docker image:
docker run -t --rm \
  -e KG_URL='http://<Enterprise Gateway IP>:8888' \
  -p 8888:8888 \
  -e VALIDATE_KG_CERT='no' \
  -e LOG_LEVEL=DEBUG \
  -e KG_REQUEST_TIMEOUT=40 \
  -e KG_CONNECT_TIMEOUT=40 \
  -v ${HOME}/opensource/jupyter/jupyter-notebooks/:/tmp/notebooks \
  -w /tmp/notebooks \
  elyra/nb2kg:dev
Editor's Notes
Now, when I first saw these requirements, my initial reaction was, “sounds easy”. I mean, to a first approximation, all that Jupyter is doing is passing strings around.
This is what I initially thought, and I’ve met a good number of other people who were in the same situation and came up with the same design. The problem with this design is that it’s actually only the first stage of a much longer process that I like to call…
And in particular, the first stage of this process is called…
Let me explain.
All these cool features of Jupyter notebooks rely on an architecture that is substantially more baroque than the cartoon picture from ten slides back…
When an enterprise architect becomes aware of all this complexity, that’s when he or she moves from stage 1 to stage 2, which is…
Let me explain.
This architecture was designed for an academic setting. When you try to transplant it into an enterprise environment and layer enterprise requirements on top of it, things go downhill rather quickly.
…and the purpose of this talk is to help you to work through this fourth stage as quickly as possible and move on to stage 5, which is…
…Jupyter Enterprise Gateway. (Bet you thought I was going to say “acceptance”). So, what is Jupyter Enterprise Gateway?