The document discusses building an enterprise/cloud analytics platform using Jupyter notebooks and Apache Spark. It describes the challenges of deploying Jupyter notebooks at an enterprise scale, including collaboration, large-scale data analysis, security, and authentication. It outlines various approaches taken to address these challenges, such as running the entire Jupyter stack on a single large machine or giving each user their own container. However, these approaches have limitations. The document then introduces the Jupyter Enterprise Gateway as a solution developed by IBM to optimize resource allocation, support multi-users securely through impersonation, and enhance security overall when deploying Jupyter at an enterprise scale.
Flaky tests and bugs in Apache software (e.g. Hadoop) - Akihiro Suda
The document discusses flaky tests in Apache software projects. It notes that about half of development time in Hadoop projects is spent debugging test failures, and many builds fail due to tests. Flaky tests are a barrier to continuous integration and contribution. Common causes of flaky tests include asynchronous operations without proper timeouts, host configuration issues, and performance variations. Tools to identify and reproduce flaky tests are discussed.
Finding and Organizing a Great Cloud Foundry User Group - Daniel Krook
Slides from the 2015 Cloud Foundry Summit on May 12.
http://sched.co/2tGc
Virtualization and global distribution are great when it comes to cloud computing and open source. In both cases, physical location is irrelevant. But one of the best ways to join the Cloud Foundry community is to participate in a local meetup. The presenters will share their experience running user groups over the past decade and lessons learned from recent Cloud Foundry events.
This session will teach you how to:
1. Find an active Cloud Foundry (or related cloud computing) user group
2. Contribute your own knowledge at an upcoming event
3. Organize - and sustain - a strong Cloud Foundry community
After this presentation, you will:
1. Appreciate the professional (and social) benefits of attending a meetup
2. Know how to share your expertise and establish your eminence as a Cloud Foundry expert
3. Be prepared to effectively organize a sustainable Cloud Foundry user group
Puppet, Jenkins, and continuous integration (CI) were discussed. The presentation covered installing Jenkins as a master and slaves using Puppet, integrating Github pull requests with Jenkins and Mergeatron, and testing Puppet code with tools like Puppet Lint, Rspec-Puppet, and eventually running Puppet code on VMs. Future work may involve catalog checking and running Puppet code against real systems.
Tackling non-determinism in Hadoop - Testing and debugging distributed system... - Akihiro Suda
Earthquake is a tool for controlling non-determinism in distributed systems testing. It can schedule disk access, network packet, and function call events in a programmable way. Earthquake has found bugs in systems like ZooKeeper, YARN, and HDFS by reproducing rare non-deterministic execution paths. It achieves higher test coverage and bug reproduction rates compared to traditional testing approaches. Earthquake aims to be non-invasive, incrementally adoptable as understanding improves, and language independent.
Slack Bot: upload NUGET package to Artifactory - Sergey Dzyuban
What if the Jenkins CI automation that uploads a file to Artifactory fails, and users need a quick and safe mechanism to do the upload manually? A Slack bot can improve the user experience and add a bit of automation.
Continuous Deployment at Disqus (Pylons Minicon) - zeeg
The document discusses Disqus' approach to continuous deployment. It describes how code is automatically deployed as soon as tests pass, with the goal of releasing features incrementally. It outlines the workflow, pros and cons, and techniques used to simplify local development and ensure stability through testing. Pain points like test speed and database migrations are addressed along with tools developed in-house like Mule for distributed testing.
Xen Project Evangelist Russell Pavlicek talks about how the growing area of hypervisor-leveraging unikernels will help redefine the cloud.
MAJOR UPDATE: Deck is now the result of 2015 Ohio Linuxfest, about a year after the initial talk. Deck now contains almost twice as much information as the original talk.
SCALE13x: Next Generation of the Cloud - Rise of the Unikernel - The Linux Foundation
Russell Pavlicek discusses the potential of unikernels to improve cloud computing efficiency and security. Unikernels are specialized virtual machines containing just enough software to run an application, making them much smaller and faster than traditional virtual machines. Several unikernel projects are presented, including MirageOS, HaLVM, LING, ClickOS, OSv, and Rumprun. Unikernels could enable 1000s of lightweight virtual machines per server and transient microservices with lifetimes measured in fractions of a second. Open source projects are leading the development of unikernels and their supporting ecosystems.
This document provides an overview of the FusionInventory project. It discusses that FusionInventory is an open source inventory and asset management solution that integrates with the GLPI asset management platform. It allows for network discovery, inventory collection, Wake-on-LAN functionality, software deployment, and VMware ESXi inventory via APIs. The document outlines the project timeline, contributors, supported operating systems, information gathered, statistics on code size and tests, roadmap, and a use case of how FusionInventory has helped consolidate inventory needs for a school district.
This document discusses Docker and OpenStack integration. It begins with introductions to OpenStack and Docker, explaining that OpenStack is an open source cloud operating system and Docker is a container-based virtualization framework. It then discusses how Docker can be used with OpenStack, with Nova supporting Docker as a hypervisor starting in Havana. It concludes with mentioning a demo of Docker + OpenStack integration and inviting questions.
OpenStack is an open source cloud computing platform that controls pools of compute, storage, and networking resources throughout a datacenter, managed through a dashboard that is exposed through APIs. It is made up of interrelated projects that handle functions like computing, networking, storage, imaging, orchestration, and more. The platform provides tools to provision resources to users in a simple and automated manner at scale.
The lesson I learned is that open source quickly becomes the natural choice wherever commoditization is happening in the software stack. Thus we expect business-to-business open source, already a significant trend in recent history, to become an increasingly common form of open source collaboration. Companies who understand the ground rules of business-to-business open source will be better positioned to identify and take advantage of open source opportunities in the competitive spaces that they share with other companies.
In today's topic, I will share why an open source strategy is important for the enterprise, and how to contribute to open source projects.
(Embedded Linux Conference Europe 2014)
Linux is used in many kinds of embedded products, including not only consumer electronics but also control systems such as programmable logic controllers. There are many types of infrastructure systems, and each system has different technical requirements, including not only real-time performance but also reliability-related functions. Infrastructure systems have to meet all of these requirements. This presentation summarizes our study and development work to adapt Linux to infrastructure systems, and then discusses the direction of future development. Please note that this presentation does not focus on a specific product.
TryStack.cn is a non-profit OpenStack testbed and community project in China that aims to promote OpenStack adoption. It operates the largest OpenStack testbed in China with hardware from various vendors. TryStack.cn provides reference architectures, best practices, and contributes code back to the community. It also organizes OpenStack meetups and training to help grow the OpenStack ecosystem in China.
This document discusses Eclipse 4.0 and the e4 project. It provides an overview of why e4 was created, including to innovate Eclipse and prepare it for the web. It describes the key aspects of e4, including the modeled workbench, dependency injection, declarative styling using CSS, and a compatibility layer for Eclipse 3.x plugins. The presentation concludes by discussing where to learn more about e4.
The document discusses how open source technology is enabling the growth of the Internet of Things (IoT). It notes that IoT devices are growing rapidly in both popularity and scale. Many companies are using open source software, hardware, and standards to develop solutions for IoT markets. The document highlights several open source projects that are helping to support IoT development, including those focused on edge devices, operating systems, containerization, connectivity standards, and more. It provides examples of how these open source tools are enabling the scalability, security, and flexibility needed as IoT devices grow into the billions.
This document discusses containers and related technologies:
1. Containers provide isolated, portable environments for running applications and their dependencies. Docker is a popular container platform that packages applications into containers using Linux kernel features like namespaces and cgroups.
2. The Open Container Initiative (OCI) aims to develop standards around container formats and runtime. Technologies like Docker, rkt, and AppC implement the OCI specifications.
3. Container orchestration systems like Kubernetes and Mesos manage the deployment and lifecycles of containers at scale across clusters of hosts.
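To make the OCI point above concrete, here is a hedged sketch of a minimal container configuration in the spirit of the OCI runtime spec's `config.json`. Only a few fields (`ociVersion`, `process`, `root`) are shown, and the exact values are illustrative assumptions, not a normative example.

```python
import json

# Hypothetical, minimal subset of an OCI runtime config.json.
# Field names follow the OCI Runtime Specification (ociVersion,
# process.args, root.path); everything else is omitted for brevity.
config = {
    "ociVersion": "1.0.2",
    "process": {"args": ["/bin/sh", "-c", "echo hello"], "cwd": "/"},
    "root": {"path": "rootfs", "readonly": True},
}

# An OCI runtime (runc, crun, ...) consumes this JSON to create the
# container; here we just round-trip it to show the shape.
serialized = json.dumps(config, indent=2)
loaded = json.loads(serialized)
print(loaded["process"]["args"])
```

A runtime implementing the spec would combine this configuration with a root filesystem bundle and the kernel namespace/cgroup features mentioned above to launch the container.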
Building specialized container-based systems with Moby: a few use cases
This talk will explain how you can leverage the Moby project to assemble your own specialized container-based system, whether for IoT, cloud, or bare metal scenarios. We will cover Moby itself, the framework and tooling around the project, as well as many of its components: LinuxKit, InfraKit, containerd, SwarmKit, and Notary. Then we will present a few use cases and demos of how different companies have leveraged Moby and some of its components to create their own container-based systems.
Neo4j works very well in cloud environments. However, with such variance in compute, network, and storage options, the job of configuring a production database environment is getting complex. In this demo-oriented session, Patrick and David Makogon will introduce straightforward ways to configure and deploy Neo4j with Docker containers, as well as show how to use automated cloud resource configuration with the new Azure Resource Manager.
The Analytic Platform behind IBM’s Watson Data Platform - Big Data Spain 2017 - Luciano Resende
IBM has built a “Data Science Experience” cloud service that exposes Notebook services at web scale. Behind this service, there are various components that power this platform, including Jupyter Notebooks, an enterprise gateway that manages the execution of the Jupyter Kernels and an Apache Spark cluster that power the computation. In this session we will describe our experience and best practices putting together this analytical platform as a service based on Jupyter Notebooks and Apache Spark, in particular how we built the Enterprise Gateway that enables all the Notebooks to share the Spark cluster computational resources.
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a... - Big Data Spain
IBM has built a “Data Science Experience” cloud service that exposes Notebook services at web scale.
https://www.bigdataspain.org/2017/talk/the-analytic-platform-behind-ibms-watson-data-platform
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
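The resource optimization a kernel gateway performs, as described above, can be sketched conceptually: rather than running every notebook kernel on the notebook server itself, the gateway places each new kernel on the cluster node with the most spare capacity. The node names, counts, and the `place_kernel` helper below are illustrative only, not the actual Enterprise Gateway implementation.

```python
def place_kernel(nodes, kernels):
    """Return the node with the fewest kernels currently assigned.

    `kernels` maps node name -> number of running kernels; nodes not
    present in the map are treated as idle.
    """
    return min(nodes, key=lambda n: kernels.get(n, 0))

# Made-up cluster state: three worker nodes with different loads.
nodes = ["spark-worker-1", "spark-worker-2", "spark-worker-3"]
kernels = {"spark-worker-1": 4, "spark-worker-2": 1, "spark-worker-3": 2}

target = place_kernel(nodes, kernels)
print(target)  # spark-worker-2 (least loaded)
```

Real gateways weigh more than kernel counts (memory, GPU availability, user quotas), but the least-loaded placement idea is the core of sharing a cluster's computational resources across many notebooks.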
The document discusses software and programming concepts for IoT systems. It introduces the Raspberry Pi single board computer and how it can be used for IoT applications. Blockly and Python are presented as programming tools for IoT. Finally, a model IoT home automation system is demonstrated using sensors, actuators and single board computers connected through a home gateway.
Scaling notebooks for Deep Learning workloads - Luciano Resende
Deep learning workloads are compute-intensive, and training these types of models is better done with specialized hardware like GPUs. Luciano Resende outlines a pattern for building deep learning models using the Jupyter Notebook's interactive development on commodity hardware, then leveraging platforms and services such as Fabric for Deep Learning (FfDL) for cost-effective full-dataset training of deep learning models.
Jupyter con meetup extended jupyter kernel gateway - Luciano Resende
Data scientists are becoming a necessity for every company in today's data-centric world, and with them comes the requirement to make available an elastic and interactive analytics platform. This session will describe our experience and best practices putting together an analytics platform based on the Jupyter stack and different kernels running in a distributed Apache Spark cluster.
Strata - Scaling Jupyter with Jupyter Enterprise Gateway - Luciano Resende
Born in academia, Jupyter notebooks are prevalent in both learning and research environments throughout the scientific community. Due to the widespread adoption of big data, AI, and deep learning frameworks, notebooks are also finding their way into the enterprise, which introduces a different set of requirements.
Alan Chin and Luciano Resende explain how to introduce Jupyter Enterprise Gateway into new and existing notebook environments to enable a “bring your own notebook” model while simultaneously optimizing resources consumed by the notebook kernels running across managed clusters within the enterprise. Along the way, they detail how to use different frameworks with Enterprise Gateway to meet the needs of data scientists operating within the AI and deep learning ecosystems.
A Jupyter kernel for Scala and Apache Spark.pdf - Luciano Resende
Many data scientists are already making heavy use of the Jupyter ecosystem for analyzing data with interactive notebooks. Apache Toree (incubating) is a Jupyter kernel that enables data scientists and data engineers to easily connect to Apache Spark and leverage its powerful APIs from a standard Jupyter notebook to execute their analytics workloads. In this talk, we will go over what's new in the most recent Apache Toree release. We will cover available magics and visualization extensions that can be integrated with Toree to enable better data exploration and data visualization. We will also describe some of Toree's high-level design and how users can extend its functionality through Toree's powerful plugin system. All of this comes with multiple live demos that show how Toree can help with your analytics workloads in an Apache Spark environment.
The document summarizes Ulrich Krause's presentation on the latest developments from OpenNTF. The presentation covered:
- An overview of OpenNTF, its 800+ open source projects and 200k annual downloads.
- Current OpenNTF initiatives like CollaborationToday, XPages.info, contests and webinars.
- Specific projects like Bootstrap4XPages, org.openntf.domino, Tika for XPages, and Unplugged XPages mobile controls.
- The OpenNTF intellectual property policy and ways for developers to get involved.
This document provides an introduction to Jupyter Notebook and Azure Machine Learning Studio. It discusses popular programming languages like Python, R, and Julia that can be used with these tools. It also summarizes key features of Jupyter Notebook like code cells, kernels, and cloud deployment. Examples are given of using Python and R with Azure ML to fetch and transform data in Jupyter notebooks.
Come to this session to get an update about everything related to OpenNTF, the open source community for IBM Collaboration Solutions.
See the contest winning XPages projects live and learn about the new open source projects for IBM Connections.
The session will also cover the IBM Social Business Toolkit SDK which allows XPages, Java and JavaScript developers to easily access IBM Connections and IBM SmartCloud for Social Business from custom applications. Attend this session to see demos of the latest functionality and new samples of the toolkit.
Building analytical microservices powered by jupyter kernels - Luciano Resende
This document discusses building analytical microservices powered by Jupyter kernels. It provides an overview of Jupyter notebooks and their architecture. It then introduces the Jupyter Enterprise Gateway, which allows running Jupyter kernels on a distributed cluster for improved scalability and security. Finally, it demonstrates a use case of a sentiment analysis microservice that leverages PySpark on a Hadoop cluster via Jupyter kernels.
Big analytics meetup - Extended Jupyter Kernel Gateway - Luciano Resende
Luciano Resende from IBM's Spark Technology Center presented on building an enterprise/cloud analytics platform with Jupyter Notebooks and Apache Spark. The Spark Technology Center focuses on contributions to open source Apache Spark projects. Resende discussed limitations of the current Jupyter Notebook setup for multi-user shared clusters and demonstrated an Extended Jupyter Kernel Gateway that allows running kernels remotely in a cluster with enhanced security, resource optimization, and multi-user support through user impersonation. The Extended Jupyter Kernel Gateway is planned for open source release.
In this session, Luciano will walk you through a real use case pipeline that uses Elyra features to help analyze COVID-19 related datasets. He will introduce Elyra, a project built to extend JupyterLab with AI-centric capabilities, and showcase the extensions that allow you to build notebook pipelines and execute them in a Kubeflow environment, execute notebooks as batch jobs, and create, edit, and execute Python scripts directly from JupyterLab.
This document provides an overview and tutorial of using Google Colab. It discusses:
- The speaker's background and experience in big data, AI, and machine learning
- An introduction to Google Colab and its key features like GPU/TPU acceleration and hardware limitations
- A tutorial on connecting to Colab, accessing files from Google Drive, and comparing CPU and GPU performance
- Examples of using Colab for flower classification with Keras and TPU as well as homework on iris classification
Community works for multi-core embedded image processing - Jeongpyo Kong
1. The presentation discusses multi-core embedded image processing and the speaker's work with ETRI and KESSIA on related projects.
2. It provides technical background on requirements for embedded image processing such as low power and high performance. Approaches discussed include hardware-based ones using multi-core processors and software-based ones using efficient algorithms and frameworks.
3. The speaker's current work involves porting OpenCV to various hardware platforms from ETRI and conducting performance tests; future work may include developing specific applications for smart devices.
Luciano Resende - Scaling Big Data Interactive Workloads across Kubernetes Cl... - Codemotion
The Jupyter Notebook stack has become the "de facto" platform used by data scientists to interactively work on big data problems. With the popularity of deep learning, there is also an increasing need for resources to make deep learning effective. In this session, we will discuss how we brought support for Kubernetes into Jupyter Enterprise Gateway and touch on some best practices for scaling interactive big data workloads across a Kubernetes-managed cluster.
Leveraging Docker for Hadoop build automation and Big Data stack provisioning - DataWorks Summit
Apache Bigtop, an open source Hadoop distribution, focuses on developing packaging, testing, and deployment solutions that help infrastructure engineers build their own customized big data platform as easily as possible. However, packages deployed in production require a solid CI testing framework to ensure their quality, and the many Hadoop components must be verified to work together. In this presentation, we'll talk about how Bigtop delivers its containerized CI framework, which can be directly replicated by Bigtop users. The core innovations here are the newly developed Docker Provisioner, which leverages Docker for Hadoop deployment, and the Docker Sandbox, which lets developers quickly start a big data stack. This talk covers the containerized CI framework, the technical details of the Docker Provisioner and Docker Sandbox, the hierarchy of Docker images we designed, and several components we developed, such as the Bigtop Toolchain, to achieve build automation.
Iteratively introducing Puppet technologies in the brownfield; Jeffrey Miller - Puppet
This document summarizes a presentation about iteratively introducing Puppet technologies to manage infrastructure at Oak Ridge National Laboratory (ORNL). It discusses starting with basic automation tools like Bolt and working up to provisioning systems and configuring them using Puppet. The approach emphasizes testing changes using no-op runs before enforcing them and using iterative development practices with tools like Git. The overall goal is to gradually bring order and automation to a complex existing "brownfield" environment through a collaborative team effort.
This document summarizes a presentation about accelerating Apache Spark workloads using NVIDIA's RAPIDS accelerator. It notes that global data generation is expected to grow exponentially to 221 zettabytes by 2026. RAPIDS can provide significant speedups and cost savings for Spark workloads by leveraging GPUs. Benchmark results show a NVIDIA decision support benchmark running 5.7x faster and with 4.5x lower costs on a GPU cluster compared to CPU. The document outlines RAPIDS integration with Spark and provides information on qualification, configuration, and future developments.
Talk at SF Big Analytics https://www.meetup.com/sf-big-analytics/events/285731741/
Distributed systems are made up of many components such as authentication, a persistence layer, stateless services, load balancers, and stateful coordination services. These coordination services are central to the operation of the system, performing tasks such as maintaining system configuration state, ensuring service availability, name resolution, and storing other system metadata. Given their central role, it is essential that these services remain available, fault tolerant, and consistent. By providing a highly available file-system-like abstraction as well as powerful recipes such as leader election, Apache ZooKeeper is often used to implement these services. Although powerful, the ZooKeeper interface may not be flexible enough or provide sufficient performance for all applications, and many systems are replacing ZooKeeper-based solutions with Raft, which provides a more generic interface to high availability and fault tolerance through the use of state machine replication. This talk will go over a generic example of a stateful coordination service moving from ZooKeeper to Raft.
Speaker: Tyler Crain (Alluxio)
Tyler Crain is a software engineer at Alluxio, working on distributed systems within the Alluxio core team. Before this, Tyler held Post-Doc positions at the University of Sydney and Sorbonne Universities where he performed research on topics including distributed key-value stores, distributed consensus and blockchain. Tyler received his PhD from the University of Rennes where he worked on Transactional Memory. He also holds a Masters degree in Computer Science from University of California Santa Barbara.
Related Blog: https://www.alluxio.io/blog/from-zookeeper-to-raft-how-alluxio-stores-file-system-state-with-high-availability-and-fault-tolerance/
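The state machine replication idea behind Raft can be illustrated with a toy sketch: if every replica applies the same log of commands, in the same order, to a deterministic state machine, all replicas converge to the same state. The `apply_log` helper and the put/delete commands below are hypothetical illustrations of that property only; this is not Raft's consensus protocol, which is the (much harder) part that agrees on the log in the first place.

```python
def apply_log(log):
    """Apply a command log to a fresh key-value state machine."""
    state = {}
    for op, key, value in log:
        if op == "put":
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
    return state

# The same agreed-upon log, replayed independently on two replicas.
log = [("put", "leader", "node-a"), ("put", "epoch", 7), ("delete", "leader", None)]
replica_1 = apply_log(log)
replica_2 = apply_log(log)

assert replica_1 == replica_2  # same log, same order -> same state
print(replica_1)  # {'epoch': 7}
```

Because the state machine is generic, any service state (configuration, membership, metadata) can be made highly available this way, which is what makes Raft a more flexible building block than ZooKeeper's fixed znode interface.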
SF Big Analytics 2022-03-15: Persia: Scaling DL Based Recommenders up to 100 ... - Chester Chen
Recent years have witnessed exponential growth of model scale in recommendation/Ads/search, from Google's 2016 model with 1 billion parameters to Facebook's latest model with 12 trillion parameters. A significant quality boost has come with each jump in model capacity, which makes people believe the era of 100 trillion parameters is around the corner. To prepare for this exponential growth in model size, an efficient distributed training system is urgently needed. However, training such huge models is challenging even within industrial-scale data centers. In this talk, I will introduce Persia, an open training system developed by my team, which resolves this challenge through careful co-design of both the optimization algorithm and the distributed system architecture. Persia admits nearly linear speedup while scaling the number of workers and the model size. Besides the capability of training 100 trillion parameters, it also shows a clear advantage in efficiency over other open-source engines.
paper link:
https://arxiv.org/pdf/2111.05897.pdf
Speaker: Ji Liu
Dr. Ji Liu received his Ph.D. in computer science and his bachelor's degree in automation from the University of Wisconsin-Madison and the University of Science and Technology of China, respectively. After graduation, he joined the University of Rochester as an assistant professor, conducting research in machine learning, optimization, and reinforcement learning. The asynchronous and decentralized algorithms he developed are widely used in industry, for example at IBM and Microsoft. He left academia and joined Tencent in 2017, exploring AI's boundaries. The AI agent Tstarbot developed there was considered a milestone for mastering the most challenging RTS game, Starcraft II. His second stop in industry was Kwai, the second largest short video company in China, where he founded and led multiple international teams with different functions: a platform team, a product team, and a research team. His team contributed to 15+% annual revenue growth in Ads. He has published 100+ papers in top-tier CS conferences and journals and received multiple best paper awards (e.g., SIGKDD 2010 and the UAI 2015 Facebook best paper). He was an awardee of MIT TR 35 Under 35 in China and an IBM Faculty Award in 2017, and was nominated as one of China's top 5 AI innovators under 35 in 2018.
SF Big Analytics talk: NVIDIA FLARE: Federated Learning Application Runtime E... - Chester Chen
Topic:
NVIDIA FLARE: Federated Learning Application Runtime Environment for Developing Robust AI Models
Summary:
Federated learning (FL) enables building robust and generalizable AI models by leveraging diverse datasets from multiple collaborators without moving data. We created NVIDIA FLARE as an open-source SDK to make it easier for data scientists to use FL in their research. The SDK allows existing machine learning and deep learning workflows adapted for distributed learning across enterprises and enables platform developers to build a secure, privacy-preserving offering for multiparty collaboration utilizing homomorphic encryption or differential privacy. The SDK is a lightweight, flexible, and scalable Python package and allows researchers to bring their data science workflows implemented in any training libraries (PyTorch, TensorFlow, or even NumPy), and apply them in real-world FL settings. This talk will introduce the key design principles of NVIDIA FLARE and illustrate use cases (e.g., COVID analysis) with customizable FL workflows that implement different privacy-preserving algorithms.
Speaker: Dr. Holger Roth (NVIDIA)
Holger Roth is a Sr. Applied Research Scientist at NVIDIA focusing on deep learning for medical imaging. He has been working closely with clinicians and academics over the past several years to develop deep learning based medical image computing and computer-aided detection models for radiological applications. He is an Associate Editor for IEEE Transactions of Medical Imaging and holds a Ph.D. from University College London, UK. In 2018, he was awarded the MICCAI Young Scientist Publication Impact Award.
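The core federated learning step described above, combining locally trained models without moving any data, can be sketched as weighted federated averaging (FedAvg): each collaborator ships only its model weights, and the server averages them weighted by local dataset size. This is a minimal pure-Python illustration of the general idea, not NVIDIA FLARE's actual API; the weights and dataset sizes are made up.

```python
def federated_average(client_weights, client_sizes):
    """Combine per-client weight vectors, weighted by local dataset size."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]

# Two hypothetical collaborators (e.g. hospitals) with different
# amounts of local data; only weights leave the site, never the data.
weights = [[1.0, 2.0], [3.0, 4.0]]
sizes = [100, 300]
global_model = federated_average(weights, sizes)
print(global_model)  # [2.5, 3.5]
```

Privacy-preserving variants layer homomorphic encryption or differential privacy on top of this exchange, so the server never sees raw client updates either.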
A missing link in the ML infrastructure stack? - Chester Chen
Talk at SF Big Analytics
Machine learning is quickly becoming a product engineering discipline. Although several new categories of infrastructure and tools have emerged to help teams turn their models into production systems, doing so is still extremely challenging for most companies. In this talk, we survey the tooling landscape and point out several parts of the machine learning lifecycle that are still underserved. We propose a new category of tool that could help alleviate these challenges and connect the fragmented production ML tooling ecosystem. We conclude by discussing similarities and differences between our proposed system and those of a few top companies.
Bio: Josh Tobin is the founder and CEO of a stealth machine learning startup. Previously, Josh worked as a deep learning & robotics researcher at OpenAI and as a management consultant at McKinsey. He is also the creator of Full Stack Deep Learning (fullstackdeeplearning.com), the first course focused on the emerging engineering discipline of production machine learning. Josh did his PhD in Computer Science at UC Berkeley advised by Pieter Abbeel.
This document discusses the challenges of data discovery and management from the perspectives of frustrated data scientists and project managers. It explores three main problems with obtaining and working with data. While buying a solution was considered, there were also risks to mitigate. The document asks about the biggest flaw of Artifact and what is next for the company. It concludes by thanking the reader.
SF Big Analytics 20191112: How to performance-tune Spark applications in larg... - Chester Chen
Uber developed a new Spark ingestion system, Marmaray, for data ingestion from various sources. It's designed to ingest billions of Kafka messages every 30 minutes, and the amount of data handled by the pipeline is on the order of hundreds of TBs. Omkar details how to tackle such scale and shares insights into the optimization techniques. Some key highlights: how to understand bottlenecks in Spark applications; whether to cache your Spark DAG to avoid rereading your input data; how to effectively use accumulators to avoid unnecessary Spark actions; how to inspect your heap and non-heap memory usage across hundreds of executors; how you can change the layout of data to save long-term storage cost; how to effectively use serializers and compression to save network and disk traffic; how to amortize the cost of your application by multiplexing your jobs; and different techniques for reducing memory footprint, runtime, and on-disk usage. The team was able to significantly (~10%–40%) reduce memory footprint, runtime, and disk usage.
Speaker: Omkar Joshi (Uber)
Omkar Joshi is a senior software engineer on Uber’s Hadoop platform team, where he’s architecting Marmaray. Previously, he led object store and NFS solutions at Hedvig and was an initial contributor to Hadoop’s YARN scheduler.
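The "to cache or not to cache" question above can be made concrete with a toy model: in a lazily evaluated pipeline, two actions that share an input each trigger a re-read of that input unless the intermediate result is cached. This pure-Python sketch mimics that Spark DAG behavior with a read counter; it is not actual PySpark code, and the `read_input` source is made up.

```python
reads = {"count": 0}

def read_input():
    """Stand-in for an expensive source read (e.g. scanning Kafka/HDFS)."""
    reads["count"] += 1
    return list(range(10))

# Uncached lazy pipeline: each downstream action re-runs the read.
total = sum(read_input())
maximum = max(read_input())
uncached_reads = reads["count"]

# Cached: materialize once, reuse the result for both actions
# (analogous to calling .cache() on a shared DataFrame/RDD).
reads["count"] = 0
cached = read_input()
total, maximum = sum(cached), max(cached)
cached_reads = reads["count"]

print(uncached_reads, cached_reads)  # 2 1
```

The trade-off the talk explores is that caching costs executor memory, so it pays off only when the saved recomputation outweighs the memory pressure it creates.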
SF Big Analytics 2019112: Uncovering performance regressions in the TCP SACK... - Chester Chen
Uncovering performance regressions in the TCP SACKs vulnerability fixes
In early July 2019, Databricks noticed some Apache Spark workloads regressing by as much as 6x. In this talk, we'll discuss how we traced these regressions back to the Linux kernel and the fixes for the TCP SACKs vulnerabilities. We will explain the symptoms we were seeing, walk through how we debugged the TCP connections, and dive into the Linux source to uncover the root cause.
Speaker: Chris Stevens (Databricks)
Chris Stevens is a software engineer at Databricks, where he works on the reliability, scalability, and security of Apache Spark clusters. His work focuses on auto-scaling compute, auto-scaling storage, node initialization performance, and node health monitoring. Prior to Databricks, Chris founded the Minoca OS project, where he built a POSIX-compliant, general-purpose OS from scratch to run on resource-constrained devices. He got his start at Microsoft, working on the Windows kernel team and porting the Windows boot environment from BIOS to UEFI.
SFBigAnalytics_20190724: Monitor kafka like a ProChester Chen
Kafka operators need to provide guarantees to the business that Kafka is working properly and delivering data in real time, and they need to identify and triage problems so they can solve them before end users notice them. This elevates the importance of Kafka monitoring from a nice-to-have to an operational necessity. In this talk, Kafka operations experts Xavier Léauté and Gwen Shapira share their best practices for monitoring Kafka and the streams of events flowing through it: how to detect duplicates, catch buggy clients, and triage performance issues – in short, how to keep the business's central nervous system healthy and humming along, like a Kafka pro.
Speakers: Gwen Shapira, Xavier Léauté (Confluent)
Gwen is a software engineer at Confluent working on core Apache Kafka. She has 15 years of experience working with code and customers to build scalable data architectures. She currently specializes in building real-time reliable data processing pipelines using Apache Kafka. Gwen is an author of “Kafka - the Definitive Guide”, "Hadoop Application Architectures", and a frequent presenter at industry conferences. Gwen is also a committer on the Apache Kafka and Apache Sqoop projects.
Xavier Léauté was one of the first engineers to join the Confluent team; he is responsible for analytics infrastructure, including real-time analytics in Kafka Streams. He was previously a quantitative researcher at BlackRock. Prior to that, he held various research and analytics roles at Barclays Global Investors and MSCI.
SF Big Analytics 2019-06-12: Managing uber's data workflows at scaleChester Chen
Talk 2. Managing Uber’s Data workflow at Scale.
Uber's microservices serve millions of rides a day, generating 100+ PB of data. To democratize data pipelines, Uber needed a central tool that provides a way to author, manage, schedule, and deploy data workflows at scale. This talk details Uber's journey toward a unified and scalable data workflow system used to manage this data and shares the challenges faced and how the company has rearchitected several components of the system, such as scheduling and serialization, to make them highly available and more scalable.
Speaker Alex Kira (Uber)
Alex Kira is an engineering tech lead at Uber, where he works on the data workflow management team. His team provides a data infrastructure platform. Over his 19-year career, he has gained experience across several software disciplines, including distributed systems, data infrastructure, and full-stack development.
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...Chester Chen
Building highly efficient data lakes using Apache Hudi (Incubating)
Even with the exponential growth in data volumes, ingesting, storing, and managing big data remains unstandardized and inefficient. Data lakes are a common architectural pattern for organizing big data and democratizing access across the organization. In this talk, we will discuss different aspects of building sound data lake architectures, pinpointing technical challenges and areas of inefficiency. We will then re-architect the data lake using Apache Hudi (Incubating), which provides streaming primitives right on top of big data. We will show how upserts and incremental change streams provided by Hudi help optimize data ingestion and ETL processing. Further, Apache Hudi manages growth and file sizing in the resulting data lake using purely open-source file formats, while also providing optimized query performance and file-system listing. We will also provide hands-on tools and guides for trying this out on your own data lake.
Speaker: Vinoth Chandar (Uber)
Vinoth is a technical lead on Uber's Data Infrastructure team.
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftChester Chen
Talk 1. Scaling Apache Spark on Kubernetes at Lyft
As part of its mission, Lyft invests heavily in open source infrastructure and tooling. At Lyft, Kubernetes has emerged as the next generation of cloud-native infrastructure to support a wide variety of distributed workloads. Apache Spark at Lyft has evolved to serve both machine learning and large-scale ETL workloads. By combining the flexibility of Kubernetes with the data processing power of Apache Spark, Lyft is able to drive ETL data processing to a different level. In this talk, we will cover the challenges the Lyft team faced and the solutions they developed to support Apache Spark on Kubernetes in production and at scale. Topics include: key traits of Apache Spark on Kubernetes; a deep dive into Lyft's multi-cluster setup and operations to handle petabytes of production data; how Lyft extends and enhances Apache Spark to support capabilities such as Spark pod lifecycle metrics and state management, resource prioritization, and queuing and throttling; dynamic job scale estimation and runtime dynamic job configuration; and how Lyft powers internal data scientists, business analysts, and data engineers via a multi-cluster setup.
Speaker: Li Gao
Li Gao is the tech lead of the cloud-native Spark compute initiative at Lyft. Prior to Lyft, Li worked at Salesforce, Fitbit, Marin Software, and a few startups in various technical leadership positions on cloud-native and hybrid-cloud data platforms at scale. Besides Spark, Li has scaled and productionized other open source projects, such as Presto, Apache HBase, Apache Phoenix, Apache Kafka, Apache Airflow, Apache Hive, and Apache Cassandra.
SFBigAnalytics- hybrid data management using cdapChester Chen
Cloud has emerged as a critical enabler of digital transformation, with the aim of reducing IT overheads and costs. However, cloud migration is not instantaneous for a variety of reasons, including data sensitivity, compliance, and application performance. This results in the creation of diverse hybrid and multi-cloud environments and amplifies data management and integration challenges. This talk demonstrates how CDAP's flexibility can allow you to utilize your existing on-premises infrastructure as you evolve to the latest Big Data and Cloud services at your own pace, all while providing a single, unified view of all your data, wherever it resides.
Speaker: Bhooshan Mogal, Google
Bhooshan Mogal is a Product Manager at Google, where he is focused on delivering best-in-class Data and Analytics services to GCP users. Prior to Google, he worked on data systems at Cask Data Inc, Pivotal and Yahoo.
Bighead: Airbnb's end-to-end machine learning platform
Airbnb has a wide variety of ML problems, ranging from models on traditional structured data to models built on unstructured data such as user reviews, messages, and listing images. The ability to build, iterate on, and maintain healthy machine learning models is critical to Airbnb's success. Bighead aims to tie together various open source and in-house projects to remove incidental complexity from ML workflows. Bighead is built on Python, Spark, and Kubernetes. The components include a lifecycle management service, an offline training and inference engine, an online inference service, a prototyping environment, and a Docker image customization tool. Each component can be used individually. In addition, Bighead includes a unified model building API that smoothly integrates popular libraries including TensorFlow, XGBoost, and PyTorch. Each model is reproducible and iterable through standardization of data collection and transformation, model training environments, and production deployment. This talk covers the architecture, the problems that each individual component and the overall system aim to solve, and a vision for the future of machine learning infrastructure. Bighead is widely adopted at Airbnb, with a variety of models running in production, and we plan to open source it to allow the wider community to benefit from our work.
Speaker: Andrew Hoh
Andrew Hoh is the Product Manager for the ML Infrastructure and Applied ML teams at Airbnb. Previously, he has spent time building and growing Microsoft Azure's NoSQL distributed database. He holds a degree in computer science from Dartmouth College.
Sf big analytics_2018_04_18: Evolution of the GoPro's data platformChester Chen
Talk 1 : Evolution of the GoPro's data platform
In this talk, we will share GoPro's experiences in building a data analytics cluster in the cloud. We will discuss: the evolution of the data platform from fixed-size Hadoop clusters to a cloud-based Spark cluster with a centralized Hive Metastore + S3 (cost benefits and DevOps impact); a configurable, Spark-based batch ingestion/ETL framework; migrating the streaming framework to cloud + S3; analytics metrics delivery with Slack integration; BedRock, a data platform management, visualization & self-service portal; and visualizing machine learning features via Google Facets + Spark.
Speakers: Chester Chen
Chester Chen is the Head of Data Science & Engineering, GoPro. Previously, he was the Director of Engineering at Alpine Data Lab.
David Winters
David is an architect on the Data Science and Engineering team at GoPro and the creator of their Spark-Kafka data ingestion pipeline. Previously, he worked at Apple and Splice Machine.
Hao Zou
Hao is a senior big data engineer on the Data Science and Engineering team. Previously, he worked at Alpine Data Labs and Pivotal.
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...Chester Chen
GoPro's cameras, drones, and mobile devices, as well as its web and desktop applications, generate billions of event logs. The analytics metrics and insights that inform product, engineering, and marketing decisions need to be distributed quickly and efficiently, and we need to visualize the metrics to find trends and anomalies.
While building up a feature store for machine learning, we need to visualize the features. Google Facets is an excellent project for visualizing features, but can it visualize larger feature datasets?
These are issues we encountered at GoPro as part of the data platform's evolution. In this talk, we will discuss some of the progress we have made: how to use Slack + Plot.ly to deliver analytics metrics and visualizations, and our work to visualize large feature sets using Google Facets with Apache Spark.
Spark can be enhanced with data warehouse capabilities to leverage both open source analytics and enterprise data warehouse strengths. This includes incorporating star schema detection and referential integrity constraints to optimize queries. Performance can be improved by pushing down operations like joins, filters, and projections from Spark to underlying data sources using heuristics like star schema patterns. Push downs allow exploiting database indexes and reducing data transfer. Star schema detection and join push downs have shown speedups of 2-31x on TPC-DS benchmark queries.
This document summarizes new features in Apache Spark 2.3, including continuous processing mode for structured streaming, stream-stream joins, running Spark applications on Kubernetes, improved PySpark performance through vectorized UDFs and Pandas integration, and Databricks Delta for reliability and performance in data lakes. The author, an Apache Spark committer and PMC member, provides overviews and code examples of these features.
The document summarizes new features and improvements in Apache Spark 2.3 for machine learning. Key highlights include first-class support for loading image data, enhanced scalability of feature transformers by supporting multiple columns, parallelizing cross-validation for faster hyperparameter tuning, and a new scalable feature hashing transformer. Performance tests demonstrate that the multi-column transformers provide up to 2.7x speedup over the single-column approach. Parallel cross-validation also provides a 2-2.7x speedup using 3 threads. Future areas of focus include completing multi-column support, improving Python APIs, and enhancing techniques like gradient boosted trees.
The document discusses major deep learning frameworks that can be used with Spark, including Deeplearning4J, BigDL, Deep Learning Pipelines, TensorFlowOnSpark, and Microsoft Machine Learning on Spark. It provides brief overviews of each framework's capabilities and support for distributed GPU/CPU training. The document also lists some other frameworks and raises questions about real-world uses of deep learning with Spark, common problems being solved, and challenges integrating Spark with deep learning frameworks.
The Rise of Python in Finance,Automating Trading Strategies: _.pdfRiya Sen
In the dynamic realm of finance, where every second counts, the integration of technology has become indispensable. Aspiring traders and seasoned investors alike are turning to coding as a powerful tool to unlock new avenues of financial success. In this blog, we delve into the world of Python live trading strategies, exploring how coding can be the key to navigating the complexities of the market and securing your path to prosperity.
1. Overview of statistical software such as ODK, SurveyCTO, and CSPro
2. Software installation(for computer, and tablet or mobile devices)
3. Create a data entry application
4. Create the data dictionary
5. Create the data entry forms
6. Enter data
7. Add Edits to the Data Entry Application
8. CAPI questions and texts
Introduction to Data Science
1.1 What is data science; importance of data science
1.2 Big data and data science; the current scenario
1.3 Industry perspective; types of data: structured vs. unstructured data
1.4 Quantitative vs. categorical data
1.5 Big data vs. little data; the data science process
1.6 Role of the data scientist
I’m excited to finally share my research from last year on the hypnotic effects of mass media and digital platformization. This study explores how our attention is influenced through YouTube’s audio-visual content. Key points:
- **Objective:** Examine the hypnotic side effects of media on attention.
- **Focus:** Sound and visual experiences on YouTube.
- **Methodology:** Mixed digital approach with quantitative and qualitative analysis.
- **Findings:** Observations on techniques in attention-based economies and their cognitive impact.
- **Implications:** Considerations for future research in media and mind interactions, especially within OSINT-oriented communities.
Curious about the details? Check out my slide deck and let’s discuss the future possibilities.
#Research #AttentionEconomy #YouTube #DigitalMedia #MediaStudies #VisualNetworkAnalysis #HypnodelicMedia
2. 2
Hi!
Fred Reiss
• 2014-present: Chief Architect, IBM Spark Technology Center
• 2006-2014: Worked for IBM Research
• 2006: Ph.D. from U.C. Berkeley
3. 3
The Jupyter Project
• Open source project that builds software to enable interactive notebooks for data science
– Started in 2014
– Grew out of the IPython project
7. 7
Jupyter Notebooks
• Jupyter notebooks are widely used by data scientists, social scientists, physical scientists, engineers, and others
• Useful for many tasks
– Analyzing data
– Developing and debugging software
– Running experiments
– Keeping track of experimental results
– Presenting results
• Jupyter is a central part of the IBM Data Science Experience (http://datascience.ibm.com)
8. 8
Jupyter in the Enterprise: Key Challenges
• Collaboration among multiple users
• Large-scale data analysis (problems that don't fit on a laptop)
– Shared cloud infrastructure like Kubernetes
– Parallel frameworks like Spark
• Security and authentication
• Auditing and data access control
9. 9
Isn’t this just shipping strings around?
[Diagram: the naive view — the JavaScript front end sends the string "1+1" to the server, the server forwards it to a Python process, and the result "2" travels back the same way.]
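The naive "just shipping strings around" view sketched above can be written as a toy round trip: every layer simply forwards a code string inward and a result string outward. This is a deliberately simplistic stand-in, not Jupyter's real protocol:

```python
# Toy model of the naive view: each layer forwards a code string,
# the innermost layer evaluates it, and the result travels back out.

def python_process(code: str) -> str:
    # Stand-in for the kernel: evaluate the code, return its repr.
    return repr(eval(code))

def server(code: str) -> str:
    # Stand-in for the notebook server: just forward the string.
    return python_process(code)

def javascript_frontend(code: str) -> str:
    # Stand-in for the browser: forward the string, receive the result.
    return server(code)

print(javascript_frontend("1+1"))  # prints: 2
```

The rest of the talk explains why the real architecture is far more involved than this three-function pipeline.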
10. 10
Isn’t this just shipping strings around?
[Diagram: the same naive pipeline, with the server replaced by a FancyNewSystem that adds security, multitenancy, authentication, Spark, and Kubernetes while still forwarding "1+1" to the Python process and "2" back.]
14. 14
Asynchronous Operations
• Queue up multiple cells for execution… in arbitrary order
• Stream output while a cell is running
• Interrupt any operation
[Screenshot callout: the fifteenth cell that executed in this session]
15. 15
Jupyter’s Display System: Much More than Text
https://nbviewer.jupyter.org/github/ipython/ipython/blob/master/examples/IPython%20Kernel/Custom%20Display%20Logic.ipynb
22. The Actual Architecture of Jupyter Notebooks
[Diagram: the JavaScript front end talks to the notebook server process, a Python process containing notebook management, kernel management, notebook server state, and a kernel proxy, with notebooks stored on the local filesystem. The kernel proxy speaks to the IPython kernel over five channels (Shell, IOPub, stdin, control, heartbeat); the kernel holds session state and runs user code against libraries such as sklearn, Spark, TensorFlow, …]
28. The Actual Architecture of Jupyter Notebooks
[Same diagram as slide 22, annotated: five ZeroMQ message queues over unencrypted TCP sockets… per kernel]
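Those five queues carry structured messages, not bare strings. As a hedged sketch, here is roughly what an execute request looks like, with field names following the public Jupyter messaging specification (the ZeroMQ transport itself is omitted):

```python
# Sketch of a Jupyter protocol message as carried over the kernel's
# five channels (shell, iopub, stdin, control, heartbeat). Field names
# follow the Jupyter messaging specification; transport is omitted.
import uuid
from datetime import datetime, timezone

CHANNELS = ("shell", "iopub", "stdin", "control", "heartbeat")

def make_execute_request(code: str, session: str) -> dict:
    """Build an execute_request message for the shell channel."""
    return {
        "header": {
            "msg_id": uuid.uuid4().hex,
            "msg_type": "execute_request",
            "session": session,
            "username": "user",
            "date": datetime.now(timezone.utc).isoformat(),
            "version": "5.3",
        },
        "parent_header": {},  # replies link back to the request here
        "metadata": {},
        "content": {"code": code, "silent": False},
    }

msg = make_execute_request("1+1", session=uuid.uuid4().hex)
print(msg["header"]["msg_type"])  # prints: execute_request
```

Multiplying this envelope by five sockets per kernel, times many kernels, is what makes the unencrypted-TCP detail above an enterprise problem.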
29. 29
Third-Party Kernels
• The IPython kernel is the most common…
• …but there is a long tail of other Jupyter kernels
– 103 kernels currently listed on the Jupyter project's wiki
30. The Actual Architecture of Jupyter Notebooks
[Same diagram as slide 22, annotated: to share notebooks among users, you need to share the notebook server]
31. The Actual Architecture of Jupyter Notebooks
[Same diagram as slide 22, annotated: to use Apache Spark™ on YARN, the kernel needs to be inside the YARN cluster's network]
32. 32
Jupyter in the Enterprise: Key Challenges
• Collaboration among multiple users
• Large-scale data analysis
– Shared cloud infrastructure like Kubernetes
– Parallel frameworks like Spark
• Security and authentication
• Auditing and data access control
Bringing these properties to the Jupyter stack is hard!
36. 36
Compromise #1: Gigantic Server
• Find the biggest machine or container you can get
• Run the entire Jupyter stack on that one machine
• Issues:
– The machine needs to be sized for the maximum aggregate memory of all active users' active kernels
• Hard upper limit of 256 GB-1 TB in most organizations
• Very problematic if you have many users and big data
– Need to authenticate all these users to the same machine and notebook server
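The sizing constraint behind Compromise #1 is simple arithmetic: the single machine must hold the peak aggregate memory of every active kernel. A back-of-the-envelope helper (the function and the example numbers are illustrative, not from the slides):

```python
# Back-of-the-envelope sizing for the "one gigantic server" approach:
# the machine must be sized for peak aggregate kernel memory.

def required_memory_gb(active_users: int, kernels_per_user: int,
                       heap_per_kernel_gb: float) -> float:
    """Peak memory the single machine must accommodate."""
    return active_users * kernels_per_user * heap_per_kernel_gb

# 50 active users, 2 kernels each, 4 GB heap per kernel:
print(required_memory_gb(50, 2, 4))  # prints: 400.0
```

At 400 GB for a modest 50 users, this already exceeds the 256 GB floor of the hard upper limit cited above; larger teams blow past the 1 TB ceiling.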
37. 37
Compromise #2: Notebook Server Per User
• A proxy server manages a pool of containers, one per active user
• Each container contains an entire Jupyter notebook stack
• The JupyterHub project provides a pre-built implementation of this approach
• Issues:
– The container needs to be big enough for all the user's kernels
• What size container to allocate when the user logs in?
• Does a big enough container even exist?
– Disables collaboration features
– Many more moving parts → more failure modes
39. Compromise #3: Replace the Kernel
• Replace the IPython kernel with a proxy
• Put something enterprise-friendly on the other side of the proxy
• Apache Livy implements this approach
– https://github.com/jupyter-incubator/sparkmagic
• Issues:
– Breaks Jupyter's magics and extensions
– Breaks data visualization libraries
– Breaks third-party kernels
– Less control over code execution
[Diagram: the kernel proxy still exposes the Shell, IOPub, stdin, control, and heartbeat channels, but forwards execution to a RESTful web service.]
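For context, a Livy-style proxy such as sparkmagic swaps the ZeroMQ kernel protocol for REST calls. A minimal sketch of how such a client might build the request that submits code to a Livy session (the host is hypothetical, the endpoint shape follows Livy's documented REST API, and no network call is made):

```python
# Sketch of the REST round trip a kernel proxy makes against Apache
# Livy instead of speaking the ZeroMQ kernel protocol. Endpoint shape
# follows Livy's REST API; the host is a placeholder and nothing is sent.
import json

LIVY_URL = "http://livy-server:8998"  # hypothetical Livy host

def statement_request(session_id: int, code: str) -> tuple:
    """Return the (URL, JSON body) for submitting code to a Livy session."""
    url = f"{LIVY_URL}/sessions/{session_id}/statements"
    body = json.dumps({"code": code}).encode("utf-8")
    return url, body

url, body = statement_request(0, "1+1")
print(url)  # prints: http://livy-server:8998/sessions/0/statements
```

Everything Jupyter-specific (magics, display messages, third-party kernels) has no place in this request/response shape, which is why the slide lists those features as casualties of the approach.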
40. 40
The Five Stages of Enterprise Jupyter Deployment
1. Denial
2. Anger
3. Bargaining
41. 41
The Five Stages of Enterprise Jupyter Deployment
1. Denial
2. Anger
3. Bargaining
4. Depression
42. 42
The Five Stages of Enterprise Jupyter Deployment
1. Denial
2. Anger
3. Bargaining
4. Depression
5. Jupyter Enterprise Gateway
43. 43
The Origins of Jupyter Enterprise Gateway
• Multiple IBM products embedding Spark on YARN
• All wanted to add Jupyter notebooks with Spark
• Usual enterprise requirements (multitenancy, scalability, security, etc.)
• Had reached the “Bargaining” stage
– Mix of compromises 1, 2, and 3
46. Issue #1: All kernels run on a single node
[Chart: the maximum number of simultaneous kernels (4 GB heap each) stays at 8 whether the cluster has 4, 8, 12, or 16 nodes of 32 GB, because every kernel runs on the single notebook server node.]
47. Jupyter Enterprise Gateway: Initial Goals
• Optimized resource allocation
– Run Spark in YARN cluster mode to better utilize cluster resources
– Pluggable architecture for additional resource managers
• Multiuser support with user impersonation
– Enhance security and sandboxing by enabling user impersonation when running kernels (using Kerberos)
– Individual HDFS home folder for each notebook user
– Use the same user ID for notebook and batch jobs
• Enhanced security
– Secure socket communications
– Any network communication should be encrypted
49. Scalability Benefits
[Chart: before JEG, the maximum number of simultaneous kernels (4 GB heap each) is stuck at 8 regardless of cluster size; after JEG it scales with the cluster: 16, 32, 48, and 64 kernels on 4, 8, 12, and 16 nodes of 32 GB.]
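The before/after numbers above reduce to a simple model: before JEG, every kernel lands on the one gateway node, so capacity is flat; after JEG, kernels run in YARN cluster mode and spread across all nodes (the chart's figures correspond to about 4 kernels per node). A small illustration of that arithmetic:

```python
# Kernel capacity model matching the scalability chart:
# 32 GB nodes, 4 GB heap per kernel.

def max_kernels_before_jeg(nodes: int) -> int:
    # All kernels share the single gateway node, so capacity is flat.
    return 8

def max_kernels_after_jeg(nodes: int) -> int:
    # Kernels spread across the cluster; the chart implies ~4 per node.
    return 4 * nodes

for n in (4, 8, 12, 16):
    print(n, max_kernels_before_jeg(n), max_kernels_after_jeg(n))
```

Running this reproduces the chart's pairs: 8 kernels before JEG at every cluster size, versus 16, 32, 48, and 64 after.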
50. Jupyter Enterprise Gateway: Open Source
• Released through the Jupyter Incubator
– BSD License
– https://github.com/jupyter-incubator/enterprise_gateway
– Current release: 0.7.0
51. Jupyter Enterprise Gateway: Supported Platforms
• Python / Spark 2.x using the IPython kernel
– With Spark context delayed initialization
• Scala 2.11 / Spark 2.x using the Apache Toree kernel
– With Spark context delayed initialization
• R / Spark 2.x with IRkernel
52. Jupyter Enterprise Gateway – Roadmap
• Add support for other resource managers
– Kubernetes support
• Kernel configuration profiles
– Enable clients to request different resource configurations for kernels (e.g. small, medium, large)
– Profiles should be defined by administrators and enabled for a user or group of users
• Administration UI
– Dashboard with running kernels and administration actions (time running, stop/kill, profile management, etc.)
• User environments
• High availability
54. 54
Thank you!
And special thanks to the Jupyter Enterprise Gateway team: Luciano Resende, Kevin Bates, Kun Liu, Christian Kadner, Sanjay Saxena, Alan Chin, Sherry Guo, Alex Bozarth, Zee Chen
57. Jupyter Enterprise Gateway: Deployment
[Diagram: a management node, powered by Ambari, runs the Enterprise Gateway (EG) in front of a compute engine based on Apache Spark.]
58. Jupyter Enterprise Gateway: Deployment
• Ansible deployment scripts
– https://github.com/lresende/spark-cluster-install
• One-click deployment of the Spark cluster
– Configure your host inventory (see the example in the git repository)
– Run the "setup-ambari.yml" playbook
• $ ansible-playbook --verbose setup-ambari.yml -i hosts-fyre-ambari -c paramiko
• One-click deployment of the Jupyter Enterprise Gateway
– Run the "setup-enterprise-gateway.yml" playbook
• $ ansible-playbook --verbose setup-enterprise-gateway.yml -i hosts-fyre-ambari -c paramiko
59. Jupyter Enterprise Gateway - Deployment
• Docker images
– yarn-spark: basic one-node Spark-on-YARN configuration
– enterprise-gateway: adds Anaconda and Jupyter Enterprise Gateway to the yarn-spark image
– nb2kg: minimal Jupyter Notebook client configured with hooks to access the Enterprise Gateway
– https://github.com/jupyter-incubator/enterprise_gateway/tree/master/etc/docker
• Building the latest docker images
– git clone https://github.com/jupyter-incubator/enterprise_gateway
– make docker-clean docker-images
– Note: make also has individual targets to clean and build individual images (type make for help)
60. Jupyter Enterprise Gateway - Deployment
• Connecting to a Spark cluster using a docker image:
docker run -t --rm \
  -e KG_URL='http://<Enterprise Gateway IP>:8888' \
  -p 8888:8888 \
  -e VALIDATE_KG_CERT='no' \
  -e LOG_LEVEL=DEBUG \
  -e KG_REQUEST_TIMEOUT=40 \
  -e KG_CONNECT_TIMEOUT=40 \
  -v ${HOME}/opensource/jupyter/jupyter-notebooks/:/tmp/notebooks \
  -w /tmp/notebooks \
  elyra/nb2kg:dev
Editor's Notes
Now, when I first saw these requirements, my initial reaction was, “sounds easy”. I mean, to a first approximation, all that Jupyter is doing is passing strings around.
This is what I initially thought, and I’ve met a good number of other people who were in the same situation and came up with the same design. The problem with this design is that it’s actually only the first stage of a much longer process that I like to call…
And in particular, the first stage of this process is called…
Let me explain.
All these cool features of Jupyter notebooks rely on an architecture that is substantially more baroque than the cartoon picture from ten slides back…
When an enterprise architect becomes aware of all this complexity, that’s when he or she moves from stage 1 to stage 2, which is…
Let me explain.
This architecture was designed for an academic setting. When you try to transplant it into an enterprise environment and layer enterprise requirements on top of it, things go downhill rather quickly.
…and the purpose of this talk is to help you to work through this fourth stage as quickly as possible and move on to stage 5, which is…
…Jupyter Enterprise Gateway. (Bet you thought I was going to say “acceptance”). So, what is Jupyter Enterprise Gateway?