Charts from NITK-IBM Computer Systems Research Group (NCSRG)
- Dennard Scaling, Moore's Law, OpenPOWER, Storage Class Memory, FPGA, GPU, CAPI, OpenCAPI, NVIDIA NVLink, and Google/Microsoft heterogeneous system usage
OpenPOWER Webinar on Machine Learning for Academic Research - Ganesan Narayanasamy
The document discusses machine learning and deep learning techniques. It provides examples of different machine learning algorithms like decision trees, linear regression, neural networks and deep learning models. It also discusses applications of machine learning in areas like computer vision, natural language processing and bioinformatics. Finally, it talks about technologies that can help democratize machine learning like distributed computing frameworks and open source libraries.
IBM provides infrastructure to accelerate medical research tasks like genomics, molecular simulation, diagnostics, and quality inspection. This infrastructure delivers faster insights through high-performance data and AI deployed at massive scale on IBM Power Systems and Storage. Case studies show the infrastructure reduces time to results for tasks like processing millions of cryogenic electron microscope images from days to hours.
The IBM POWER10 processor represents the 10th generation of the POWER family of enterprise computing engines. Its performance is a result of both powerful processing cores and high-bandwidth intra- and inter-chip interconnect. POWER10 systems can be configured with up to 16 processor chips and 1920 simultaneous threads of execution. Cross-system memory sharing, through the new Memory Inception technology, and 2 Petabytes of addressing space support an expansive memory system. The POWER10 processing core has been significantly enhanced over its POWER9 predecessor, including a doubling of vector units and the addition of an all-new matrix math engine. Throughput gains from POWER9 to POWER10 average 30% at the core level and three-fold at the socket level. Those gains can reach ten- or twenty-fold at the socket level for matrix-intensive computations.
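The matrix math engine's large speedups come from executing rank-1 (outer-product) accumulations in hardware. As an illustration only, not POWER10-specific code, here is a plain-Python sketch of the outer-product accumulation pattern such an engine implements:

```python
def matmul_outer_product(A, B):
    """Compute C = A @ B as a sum of rank-1 (outer-product) updates.

    This accumulation pattern, one outer product per step added into an
    accumulator, is the style of computation a hardware matrix engine
    performs; the function itself is just an illustrative sketch.
    """
    n, k, m = len(A), len(A[0]), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for t in range(k):                       # one rank-1 update per step
        col = [A[i][t] for i in range(n)]    # column t of A
        row = B[t]                           # row t of B
        for i in range(n):
            for j in range(m):
                C[i][j] += col[i] * row[j]   # accumulate outer product
    return C
```

Summing k such rank-1 updates reproduces the full matrix product, which is why matrix-intensive workloads map so efficiently onto this hardware pattern.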
Macromolecular crystallography is an experimental technique for exploring the 3D atomic structure of proteins, used by academics for research in biology and by pharmaceutical companies in rational drug design. While development of the technique has until now been limited by the performance of scientific instruments, computing performance has recently become a key limitation. In my presentation I will describe the computing challenge of handling an 18 GB/s data stream coming from the new X-ray detector. I will show PSI's experience in applying conventional hardware to the task and why this attempt failed. I will then present how the IC 922 server with OpenCAPI-enabled FPGA boards allowed us to build a sustainable and scalable solution for high-speed data acquisition. Finally, I will give a perspective on how advances in hardware development will enable better science by users of the Swiss Light Source.
This document provides a summary of the IBM POWER9 AC922 system with 6 GPUs. It includes details on the POWER9 processor which features 24 cores per die, an enhanced cache hierarchy up to 120MB, and on-chip accelerators. The AC922 system utilizes two POWER9 processors, supports up to 512GB memory via 16 DDR4 DIMMs, and has three Nvidia Volta GPUs per socket connected via NVLink 2.0. It also discusses the POWER ISA v3.0 instruction set and how POWER9 serves as a premier acceleration platform with technologies like CAPI, OpenCAPI, and NVLink.
The document discusses IBM AI solutions on Power systems. It provides an overview of key features including OpenPOWER collaboration, IBM machine learning and deep learning solutions designed for faster results, and Power9 servers adopted by research institutions. It then discusses specific IBM Power systems like the IBM Power AC922 that are optimized for AI workloads through features like CPU-GPU NVLink and large model support in TensorFlow.
Heterogeneous computing refers to systems that use more than one type of processor or core. It allows integration of CPUs and GPUs on the same bus, with shared memory and tasks. This is called the Heterogeneous System Architecture (HSA). The HSA aims to reduce latency between devices and make them easier to program together. Programming models for HSA include OpenCL, CUDA, and hUMA. Heterogeneous computing is used in platforms like smartphones, laptops, game consoles, and APUs from AMD. It provides benefits like increased performance, lower cost, and better battery life over traditional CPU-only designs, although discrete CPUs and GPUs can still deliver more raw performance, and new software models are needed.
This document discusses how HPC infrastructure is being transformed with AI. It summarizes that cognitive systems use distributed deep learning across HPC clusters to speed up training times. It also outlines IBM's hardware portfolio expansion for AI training, inference, and storage capabilities. The document discusses software stacks for AI like Watson Machine Learning Community Edition that use containers and universal base images to simplify deployment.
SCFE 2020 OpenCAPI presentation as part of OpenPOWER Tutorial - Ganesan Narayanasamy
This document introduces hardware acceleration using FPGAs with OpenCAPI. It discusses how classic FPGA acceleration has issues like slow CPU-managed memory access and lack of data coherency. OpenCAPI allows FPGAs to directly access host memory, providing faster memory access and data coherency. It also introduces the OC-Accel framework that allows programming FPGAs using C/C++ instead of HDL languages, addressing issues like long development times. Example applications demonstrated significant performance improvements using this approach over CPU-only or classic FPGA acceleration methods.
In this deck from the HPC User Forum in Tucson, Jeff Stuecheli from IBM presents: POWER9 for AI & HPC.
"Built from the ground-up for data intensive workloads, POWER9 is the only processor with state-of-the-art I/O subsystem technology, including next generation NVIDIA NVLink, PCIe Gen4, and OpenCAPI."
Watch the video: https://wp.me/p3RLHQ-isJ
Learn more: https://www.ibm.com/it-infrastructure/power/power9
and
http://hpcuserforum.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
IBM Bayesian Optimization Accelerator (BOA) is a do-it-yourself toolkit to apply state-of-the-art Bayesian inferencing techniques and obtain optimal solutions for complex, real-world design simulations without requiring deep machine learning skills. This talk will describe IBM BOA, its differentiation and ease of use, and how researchers can take advantage of it for optimizing any arbitrary HPC simulation.
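As a rough illustration of the optimize-observe loop that a tool like BOA automates, here is a toy sketch. It substitutes a simple distance-based surrogate for real Bayesian inference; the function `bayesian_opt_sketch` and all its parameters are invented for this example and are not BOA's API:

```python
import random

def bayesian_opt_sketch(f, bounds, n_init=3, n_iter=12, kappa=2.0, seed=0):
    """Minimise f on [lo, hi] with a toy surrogate-guided loop.

    Stand-in for a Gaussian-process surrogate: the predicted mean at x is an
    inverse-distance-weighted average of observed values, and 'uncertainty'
    is the distance to the nearest observation. Each iteration evaluates f
    where the lower-confidence-bound acquisition is smallest.
    """
    rng = random.Random(seed)
    lo, hi = bounds
    xs = [rng.uniform(lo, hi) for _ in range(n_init)]   # initial samples
    ys = [f(x) for x in xs]
    for _ in range(n_iter):
        candidates = [lo + (hi - lo) * i / 200 for i in range(201)]

        def acquisition(x):
            dists = [abs(x - xi) for xi in xs]
            if min(dists) < 1e-9:            # already sampled: skip
                return float("inf")
            w = [1.0 / d for d in dists]
            mean = sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)
            return mean - kappa * min(dists)  # lower confidence bound

        x_next = min(candidates, key=acquisition)
        xs.append(x_next)
        ys.append(f(x_next))
    best = min(range(len(xs)), key=lambda i: ys[i])
    return xs[best], ys[best]
```

The point is the structure: fit a cheap surrogate to past observations, trade off predicted value against uncertainty, and spend expensive simulation runs only where they are most informative.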
Xilinx provides adaptable acceleration platforms for data centers. Their Alveo product lineup includes the U280, U250, U200, and low-profile U50 accelerator cards. The cards feature FPGAs with up to 1.3 million logic cells and high-speed memory. Xilinx also offers the U25 SmartNIC which combines an FPGA, ARM CPU, and dual 25GbE ports. These platforms accelerate workloads such as AI, databases, storage, and networking using reconfigurable and adaptable hardware. Xilinx supports deployment from their devices to cloud platforms using a unified software stack.
This document discusses IBM's involvement in artificial intelligence and deep learning. It includes:
- An introduction to IBM's Cognitive Systems team working in AI.
- A brief history of IBM's AI projects including Deep Blue, Blue Gene, and Watson.
- Explanations of concepts like machine learning, deep learning, and how they relate to high performance computing.
- Details of IBM's current hardware, software, and services for AI workloads including the Power9 processor, PowerAI tools, and storage solutions.
The document provides an overview of IBM's expertise and offerings in the field of artificial intelligence.
IBM announced new Power Systems servers powered by the POWER8 processor. Power Systems with POWER8 offer faster performance, support for big data workloads, and an open innovation platform. New Power E880 and Power E870 servers provide up to 128 Power8 cores, 16TB of memory, and superior price/performance for cloud, analytics, and complex applications. Power Systems are designed to handle data-intensive workloads and provide flexibility, availability, and efficiency through features like Elastic Capacity on Demand and Power Enterprise Pools.
Snap ML is a machine learning framework for fast training of generalized linear models (GLMs) that can scale to large datasets. It uses multi-level parallelism across nodes and GPUs. Snap ML implementations include snap-ml-local for single nodes, snap-ml-mpi for multi-node HPC environments, and snap-ml-spark for Apache Spark clusters. Experimental results show Snap ML can train a logistic regression model on a 3TB Criteo dataset within 1.5 minutes using 16 GPUs.
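The GLM training that Snap ML parallelises can be sketched in plain Python. This is a generic full-batch gradient-descent logistic regression, not Snap ML's actual API or solver; frameworks like Snap ML accelerate the same computation by distributing the gradient sums across GPUs and nodes:

```python
import math

def train_logreg(X, y, lr=0.5, epochs=500):
    """Fit a logistic-regression GLM by full-batch gradient descent.

    X: list of feature vectors, y: list of 0/1 labels.
    Returns weights w and bias b.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid
            err = p - yi                     # gradient of the log-loss
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        w = [wj - lr * gj / n for wj, gj in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def predict(w, b, x):
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if z >= 0 else 0
```

The per-sample gradient terms are independent, which is exactly what makes this workload amenable to the multi-level (node- and GPU-level) parallelism described above.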
TAU Performance System and the Extreme-scale Scientific Software Stack (E4S) aim to improve productivity for HPC and AI workloads. TAU provides a portable performance evaluation toolkit, while E4S delivers modular and interoperable software stacks. Together, they lower barriers to using software tools from the Exascale Computing Project and enable performance analysis of complex, multi-component applications.
The document discusses IBM's POWER9 processor and OpenPOWER ecosystem. It provides an overview of the POWER9 features such as its new core microarchitecture, enhanced cache hierarchy, and acceleration capabilities through technologies like NVLink 2.0 and CAPI 2.0. It also discusses the OpenCAPI open standard and IBM's efforts to build supercomputers for the US Department of Energy using POWER, NVIDIA GPUs, and Mellanox networking technologies.
Design Considerations, Installation, and Commissioning of the RedRaider Cluster at the Texas Tech University
High Performance Computing Center
Outline of this talk
HPCC Staff and Students
Previous clusters
• History, performance, usage patterns, and experience
Motivation for Upgrades
• Compute Capacity Goals
• Related Considerations
Installation and Benchmarks
Conclusions and Q&A
In this video from the HPC User Forum in Tucson, Gregory Stoner from AMD presents: It's Time to ROC.
"With the announcement of the Boltzmann Initiative and the recent releases of ROCK and ROCR, AMD has ushered in a new era of Heterogeneous Computing. The Boltzmann initiative exposes cutting edge compute capabilities and features on targeted AMD/ATI Radeon discrete GPUs through an open source software stack. The Boltzmann stack is comprised of several components based on open standards, but extended so important hardware capabilities are not hidden by the implementation."
Learn more: http://gpuopen.com/getting-started-with-boltzmann-components-platforms-installation/
and
http://hpcuserforum.com
Watch the video presentation: http://wp.me/p3RLHQ-fcJ
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
The document summarizes the Cell processor architecture, which was developed as a collaboration between IBM, Sony, and Toshiba to address limitations in processor performance. The Cell consists of 9 cores - 1 PowerPC core called the PPE and 8 synergistic processor elements (SPEs) optimized for SIMD operations. It has a peak performance of over 200 GFLOPS and was used in the PlayStation 3 game console to enable graphics-intensive applications. The document outlines the Cell architecture and how it aims to overcome performance walls related to power, memory, and frequency limitations.
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise Linux - Filipe Miranda
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise Linux - Learn about the new IBM Power8 architecture, about Red Hat Enterprise Linux 7 for Power Systems and additional information on EnterpriseDB on how to migrate from Oracle to PostgreSQL.
UPDATED!
The LEGaTO project received funding from the EU's Horizon 2020 program to develop a heterogeneous hardware platform called RECS for cloud to edge computing. RECS uses a modular microserver approach integrating CPUs, GPUs, FPGAs, and SOCs. It allows for flexible node composition through virtual functions to enable different compute and communication topologies.
Socionext is developing low power ARM server solutions including the SC2A11 multicore processor and SC2A20 SoC switch. They aim to build scalable small core systems with optimized performance and power efficiency compared to traditional servers. Socionext has integrated their solutions into a prototype low power scalable server and is developing the necessary software including UEFI, Linux, and applications to support various server workloads.
Summit 16: Deploying Virtualized Mobile Infrastructures on Openstack - OPNFV
Service providers are evolving and competing with leaner over-the-top (OTT) providers such as Google and Amazon to provide mobile services. The future SP network has to be agile, resilient, and auto-scalable. SPs are leaning towards using COTS infrastructure, open networking (OPNFV, ONOS), and VNFs to run routers, switches, mobile gateways, firewall, NAT, and DPI functions. The session covers the design and deployment of virtualized mobile infrastructure such as the Virtual Evolved Packet Core, GiLAN, and VoLTE, as well as the 5G core. We will also cover performance fine-tuning using DPDK, SR-IOV, etc. We will present a case study using Cisco (VNF Manager and NFVO), Red Hat (NFVI), OpenStack, and block storage using Ceph technology. Participants will be able to understand the complexities of the mobile packet core, the evolution of NFV-based solutions, and an architecture framework for the 5G mobile packet core.
From Rack scale computers to Warehouse scale computers - Ryousei Takano
This document discusses the transition from rack-scale computers to warehouse-scale computers through the disaggregation of technologies. It provides examples of rack-scale architectures like Open Compute Project and Intel Rack Scale Architecture. For warehouse-scale computers, it examines HP's The Machine project using application-specific cores, universal memory, and photonics fabric. It also outlines UC Berkeley's FireBox project utilizing 1 terabit/sec optical fibers, many-core systems-on-chip, and non-volatile memory modules connected via high-radix photonic switches.
PCIe Gen 3.0 Presentation @ 4th FPGA Camp - FPGA Central
PCIe Gen3 presentation by PLDA at 4th FPGA Camp in Santa Clara, CA. For more details visit http://www.fpgacentral.com/fpgacamp or http://www.fpgacentral.com
OpenPOWER Acceleration of HPCC Systems - HPCC Systems
JT Kellington, IBM and Allan Cantle, Nallatech present at the 2015 HPCC Systems Engineering Summit Community Day about porting HPCC Systems to the POWER8-based ppc64el architecture.
Modular by Design: Supermicro’s New Standards-Based Universal GPU Server - Rebekah Rodriguez
In this webinar, members of the Server Solution Team as well as a member of Supermicro’s Product Office will discuss Supermicro’s Universal GPU Server, the server’s modular, standards-based design, the important role of the OCP Accelerator Module (OAM) form factor and Universal Baseboard (UBB) in the system, as well as touching on AMD's next-generation HPC accelerator. In addition, we will get some insights into trends in the HPC and AI/Machine Learning space, including the different software platforms and best practices that are driving innovation in our industry and daily lives. In particular:
• Tools to enable use of the high-performance hardware for HPC and Deep Learning applications
• Tools to enable use of multiple GPUs, including RDMA, to solve highly demanding HPC and deep learning models, such as BERT
• Running applications in containers with AMD’s next-generation GPU system
Ecosystem Alliance Manager Michael Ocampo talks about the CXL industry's effort to break through the memory wall, memory bound use cases, CXL for modular shared infrastructure, and critical CXL collaboration that's happening now.
The document discusses plans to establish an institutional high performance computing (HPC) facility at North-West University. It outlines the technical goals of building a Beowulf cluster to link existing departmental clusters and integrate with national and international computational grids. It also discusses management principles for the new HPC facility to ensure sustainability, efficiency, reliability, availability and high performance.
Join us for an exciting and informative preview of the broadest range of next-generation systems optimized for tomorrow’s data center workloads, Powered by 4th Gen Intel® Xeon® Scalable Processors (formerly codenamed Sapphire Rapids).
Experts from Supermicro and Intel will discuss how the upcoming Supermicro X13 systems will enable new performance levels utilizing state-of-the-art technology, including DDR5, PCIe 5.0, Compute Express Link™ 1.1, and Intel® Advanced Matrix Extensions (Intel AMX).
In this deck from the 2018 Swiss HPC Conference, Alexander Ruebensaal from ABC Systems AG presents: NVMe Takes It All, SCSI Has To Fall.
"NVMe has become the main focus of storage developments when it comes to latency, bandwidth, IOPS. There is already a broad range of standard products available - server or network based."
Watch the video: https://insidehpc.com/2018/06/video-nvme-takes-scsi-fall/
Learn more: http://www.abcsystems.ch/
and
http://www.hpcadvisorycouncil.com/events/2018/swiss-workshop/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
- POWER9 delivers 2x the compute resources per socket through new cores optimized for stronger thread performance and efficiency.
- It features direct memory attach with up to 8 DDR4 ports and buffered memory with 8 channels for scale-out and scale-up configurations.
- The processor provides leadership hardware acceleration through enhanced on-chip acceleration, NVLink 2.0, CAPI 2.0, and a new open CAPI interface using 25G signaling for high bandwidth and low latency attachment of accelerators.
Yesterday's thinking may still believe NVMe (NVM Express) is in transition to a production ready solution. In this session, we will discuss how the evolution of NVMe is ready for production, the history and evolution of NVMe and the Linux stack to address where NVMe has progressed today to become the low latency, highly reliable database key value store mechanism that will drive the future of cloud expansion. Examples of protocol efficiencies and types of storage engines that are optimizing for NVMe will be discussed. Please join us for an exciting session where in-memory computing and persistence have evolved.
Optimized HPC/AI cloud with OpenStack acceleration service and composable hardware - Shuquan Huang
Today data scientist is turning to cloud for AI and HPC workloads. However, AI/HPC applications require high computational throughput where generic cloud resources would not suffice. There is a strong demand for OpenStack to support hardware accelerated devices in a dynamic model.
In this session, we will introduce OpenStack Acceleration Service – Cyborg, which provides a management framework for accelerator devices (e.g. FPGA, GPU, NVMe SSD). We will also discuss Rack Scale Design (RSD) technology and explain how physical hardware resources can be dynamically aggregated to meet the AI/HPC requirements. The ability to “compose on the fly” with workload-optimized hardware and accelerator devices through an API allow data center managers to manage these resources in an efficient automated manner.
We will also introduce an enhanced telemetry solution with Gnocchi, bandwidth discovery, and smart scheduling, leveraging RSD technology for efficient workload management in the HPC/AI cloud.
This document discusses accelerated computing using GPUs and OpenCL. It begins by covering the evolution of x86 processors towards multi-core designs and the use of GPUs as accelerators. It then introduces accelerated processing units that combine CPU and GPU components. The document concludes by introducing OpenCL as an open standard for programming GPUs and heterogeneous systems that allows developers to write code that scales across CPUs and GPUs.
The document discusses hardware platforms and accelerators for VEDLIoT. It describes the VEDLIoT Hardware Platform as a heterogeneous, modular, and scalable microserver system that supports the IoT spectrum from embedded to edge to cloud. It then provides details on several platforms: the RECS|Box platform which uses Computer-on-Module standards to achieve flexibility and performance; the t.RECS platform optimized for local edge applications; and the uRECS embedded device platform that supports machine learning acceleration and communication interfaces. Diagrams and specifications are given for the architectures of these platforms.
2016 Sept 1st - IBM Consultants & System Integrators Interchange - Big Data ... - Anand Haridass
An unprecedented increase in the use of digital devices is causing an explosion in the amount of data generated and captured by businesses. The need to extract economic value from all this "Big Data", which has the potential to transform businesses completely, is immense and drives a whole slew of new workloads. Organizations need to continuously align strategy, business processes, and infrastructure investments to derive these insights. This session will discuss how solutions based on POWER deliver this in a cost-effective, open, scalable, high-performing, and reliable manner.
IBM Data Engine for Hadoop and Spark - POWER System Edition ver1 March 2016 - Anand Haridass
This document describes the IBM Data Engine for Hadoop and Spark (IDEHS) - Power Systems Edition, an IBM integrated solution. This solution features a technical-computing architecture that supports running Big Data-related workloads more easily and with higher performance. It includes the servers, network switches, and software needed to run MapReduce and Spark-based workloads.
Moore's law scaling in the sub-100nm technology nodes, while providing increased circuit density, is no longer driving sufficient cost/performance improvements from generation to generation. The industry is moving towards tightly coupling pieces of the entire stack (technology, processors, memory, firmware, operating systems, accelerators, I/O, and hardware/software co-optimization) to get there. The Open Compute Project and the OpenPOWER consortium are two examples of collaborative innovation that could define open hardware development to address this cost/performance requirement.
FreqLeak: A frequency step based method for efficient leakage power characterization in a system,
20th International Symposium on Low Power Electronics and Design (ISLPED 2015)
Arun Joseph, Anand Haridass, Charles Lefurgy, Sreekanth Pai, Spandana Rachamalla, Francesco Campisano, IBM Systems
Best Paper Nominee @ ISLPED15.
Abstract: Accurate estimation of leakage power at runtime requires post-silicon power measurements across a wide range of temperature and voltage conditions. Testing individual chips, especially at high-temperature corner conditions, is expensive in cost and time. We examine this problem in an industrial context and introduce FreqLeak, a frequency step based method for inexpensive and efficient leakage power characterization in a system. It enables a more thorough characterization than can be accomplished on a wafer prober alone due to time and equipment costs. Experimental evaluation on IBM POWER8 based systems demonstrates the efficiency of the proposed method, within an error of 5%. Further, we discuss the application of FreqLeak in system level power management.
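The abstract's core observation, that at fixed voltage and temperature dynamic power scales with frequency while leakage does not, suggests a simple way to illustrate frequency-step-based characterization: fit measured power against frequency and read the leakage estimate off the zero-frequency intercept. The sketch below is an illustration of that principle only, not the paper's actual algorithm:

```python
def leakage_from_freq_steps(freqs_mhz, power_w):
    """Estimate leakage power as the zero-frequency intercept of a
    least-squares line fit of measured total power vs. frequency,
    assuming fixed voltage and temperature across the steps."""
    n = len(freqs_mhz)
    mf = sum(freqs_mhz) / n
    mp = sum(power_w) / n
    cov = sum((f - mf) * (p - mp) for f, p in zip(freqs_mhz, power_w))
    var = sum((f - mf) ** 2 for f in freqs_mhz)
    slope = cov / var            # dynamic power per MHz
    return mp - slope * mf       # intercept = leakage estimate
```

Because frequency steps are cheap to apply in a running system, this style of measurement avoids expensive per-chip characterization on a wafer prober, which is the cost argument the abstract makes.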
FirmLeak: A framework for efficient and accurate runtime estimation of leakage power by firmware
Arun Joseph, Anand Haridass, Charles Lefurgy, Spandana Rachamalla, Sreekanth Pai, Diyanesh Chinnakkonda, Vidushi Goyal
2015 28th International Conference on VLSI Design (VLSID),
3-7 Jan. 2015
Abstract: Separating the dynamic power and leakage power components from total microprocessor power can enable new optimizations for cloud computing. To this end, we introduce FirmLeak, a new framework that enables accurate, real-time estimation of microprocessor leakage power by system software. FirmLeak accounts for power-gating regions, per-core voltage domains, and manufacturing variation. We present an experimental evaluation of FirmLeak on a POWER7+ microprocessor for a range of hardware parts, voltages and temperatures. We discuss how this can be used in two applications to manage power by 1) improving billing of energy for cloud computing and 2) optimizing fan power consumption.
The document discusses OpenPOWER, an open ecosystem using the POWER architecture to share expertise, investment, and intellectual property. It outlines the goals of the OpenPOWER Foundation to serve evolving customer needs through collaborative innovation and solutions. Examples are provided of innovations developed through partnerships, such as accelerated databases, optimized flash storage, and high performance computing systems. The benefits of the OpenPOWER approach for customers are affirmed through adoption of Linux distributions and cloud deployments.
Quality Patents: Patents That Stand the Test of TimeAurora Consulting
Is your patent a vanity piece of paper for your office wall? Or is it a reliable, defendable, assertable, property right? The difference is often quality.
Is your patent simply a transactional cost and a large pile of legal bills for your startup? Or is it a leverageable asset worthy of attracting precious investment dollars, worth its cost in multiples of valuation? The difference is often quality.
Is your patent application only good enough to get through the examination process? Or has it been crafted to stand the tests of time and varied audiences if you later need to assert that document against an infringer, find yourself litigating with it in an Article 3 Court at the hands of a judge and jury, God forbid, end up having to defend its validity at the PTAB, or even needing to use it to block pirated imports at the International Trade Commission? The difference is often quality.
Quality will be our focus for a good chunk of the remainder of this season. What goes into a quality patent, and where possible, how do you get it without breaking the bank?
** Episode Overview **
In this first episode of our quality series, Kristen Hansen and the panel discuss:
⦿ What do we mean when we say patent quality?
⦿ Why is patent quality important?
⦿ How to balance quality and budget
⦿ The importance of searching, continuations, and draftsperson domain expertise
⦿ Very practical tips, tricks, examples, and Kristen’s Musts for drafting quality applications
https://www.aurorapatents.com/patently-strategic-podcast.html
What Not to Document and Why_ (North Bay Python 2024)Margaret Fero
We’re hopefully all on board with writing documentation for our projects. However, especially with the rise of supply-chain attacks, there are some aspects of our projects that we really shouldn’t document, and should instead remediate as vulnerabilities. If we do document these aspects of a project, it may help someone compromise the project itself or our users. In this talk, you will learn why some aspects of documentation may help attackers more than users, how to recognize those aspects in your own projects, and what to do when you encounter such an issue.
These are slides as presented at North Bay Python 2024, with one minor modification to add the URL of a tweet screenshotted in the presentation.
Are you interested in dipping your toes in the cloud native observability waters, but as an engineer you are not sure where to get started with tracing problems through your microservices and application landscapes on Kubernetes? Then this is the session for you, where we take you on your first steps in an active open-source project that offers a buffet of languages, challenges, and opportunities for getting started with telemetry data.
The project is called openTelemetry, but before diving into the specifics, we’ll start with de-mystifying key concepts and terms such as observability, telemetry, instrumentation, cardinality, percentile to lay a foundation. After understanding the nuts and bolts of observability and distributed traces, we’ll explore the openTelemetry community; its Special Interest Groups (SIGs), repositories, and how to become not only an end-user, but possibly a contributor.We will wrap up with an overview of the components in this project, such as the Collector, the OpenTelemetry protocol (OTLP), its APIs, and its SDKs.
Attendees will leave with an understanding of key observability concepts, become grounded in distributed tracing terminology, be aware of the components of openTelemetry, and know how to take their first steps to an open-source contribution!
Key Takeaways: Open source, vendor neutral instrumentation is an exciting new reality as the industry standardizes on openTelemetry for observability. OpenTelemetry is on a mission to enable effective observability by making high-quality, portable telemetry ubiquitous. The world of observability and monitoring today has a steep learning curve and in order to achieve ubiquity, the project would benefit from growing our contributor community.
Blockchain and Cyber Defense Strategies in new genre timesanupriti
Explore robust defense strategies at the intersection of blockchain technology and cybersecurity. This presentation delves into proactive measures and innovative approaches to safeguarding blockchain networks against evolving cyber threats. Discover how secure blockchain implementations can enhance resilience, protect data integrity, and ensure trust in digital transactions. Gain insights into cutting-edge security protocols and best practices essential for mitigating risks in the blockchain ecosystem.
The DealBook is our annual overview of the Ukrainian tech investment industry. This edition comprehensively covers the full year 2023 and the first deals of 2024.
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...Erasmo Purificato
Slide of the tutorial entitled "Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Emerging Trends" held at UMAP'24: 32nd ACM Conference on User Modeling, Adaptation and Personalization (July 1, 2024 | Cagliari, Italy)
Data Protection in a Connected World: Sovereignty and Cyber Securityanupriti
Delve into the critical intersection of data sovereignty and cyber security in this presentation. Explore unconventional cyber threat vectors and strategies to safeguard data integrity and sovereignty in an increasingly interconnected world. Gain insights into emerging threats and proactive defense measures essential for modern digital ecosystems.
Interaction Latency: Square's User-Centric Mobile Performance MetricScyllaDB
Mobile performance metrics often take inspiration from the backend world and measure resource usage (CPU usage, memory usage, etc) and workload durations (how long a piece of code takes to run).
However, mobile apps are used by humans and the app performance directly impacts their experience, so we should primarily track user-centric mobile performance metrics. Following the lead of tech giants, the mobile industry at large is now adopting the tracking of app launch time and smoothness (jank during motion).
At Square, our customers spend most of their time in the app long after it's launched, and they don't scroll much, so app launch time and smoothness aren't critical metrics. What should we track instead?
This talk will introduce you to Interaction Latency, a user-centric mobile performance metric inspired from the Web Vital metric Interaction to Next Paint"" (web.dev/inp). We'll go over why apps need to track this, how to properly implement its tracking (it's tricky!), how to aggregate this metric and what thresholds you should target.
Coordinate Systems in FME 101 - Webinar SlidesSafe Software
If you’ve ever had to analyze a map or GPS data, chances are you’ve encountered and even worked with coordinate systems. As historical data continually updates through GPS, understanding coordinate systems is increasingly crucial. However, not everyone knows why they exist or how to effectively use them for data-driven insights.
During this webinar, you’ll learn exactly what coordinate systems are and how you can use FME to maintain and transform your data’s coordinate systems in an easy-to-digest way, accurately representing the geographical space that it exists within. During this webinar, you will have the chance to:
- Enhance Your Understanding: Gain a clear overview of what coordinate systems are and their value
- Learn Practical Applications: Why we need datams and projections, plus units between coordinate systems
- Maximize with FME: Understand how FME handles coordinate systems, including a brief summary of the 3 main reprojectors
- Custom Coordinate Systems: Learn how to work with FME and coordinate systems beyond what is natively supported
- Look Ahead: Gain insights into where FME is headed with coordinate systems in the future
Don’t miss the opportunity to improve the value you receive from your coordinate system data, ultimately allowing you to streamline your data analysis and maximize your time. See you there!
Fluttercon 2024: Showing that you care about security - OpenSSF Scorecards fo...Chris Swan
Have you noticed the OpenSSF Scorecard badges on the official Dart and Flutter repos? It's Google's way of showing that they care about security. Practices such as pinning dependencies, branch protection, required reviews, continuous integration tests etc. are measured to provide a score and accompanying badge.
You can do the same for your projects, and this presentation will show you how, with an emphasis on the unique challenges that come up when working with Dart and Flutter.
The session will provide a walkthrough of the steps involved in securing a first repository, and then what it takes to repeat that process across an organization with multiple repos. It will also look at the ongoing maintenance involved once scorecards have been implemented, and how aspects of that maintenance can be better automated to minimize toil.
1. IBM Confidential
Heterogeneous Computing
The Future of Systems
Anand Haridass
Senior Technical Staff Member
IBM Cognitive Systems
NITK (KREC) – Batch of ‘95 (E&C)
IBM Academy of Technology
NITK-IBM Computer Systems Research Group (NCSRG)
Seminar Sep/18/2017
2. 2
Agenda
System Overview
Technology Trends – End of Dennard Scaling
Vertical Integration - OpenPOWER
“Feeding the Engine” – Memory / Storage
Need for High Performance Bus – OpenCAPI
GPU Attach - NVLINK
Accelerator Examples
3. 3
Von Neumann Architecture
• First published by John von Neumann in 1945.
• Design consists of a Control Unit, Arithmetic & Logic Unit (ALU), Memory Unit, Registers & Inputs/Outputs.
• Stored-program computer concept instruction data and program data are stored in the same memory.
• Most Servers & PC’s produced today use this design.
4. 4
Typical 2 Socket Systems [2017]
[Diagram: two CPU sockets, each with its own Memory, an attached Accelerator, and IO / Storage / NW]
5. 5
Processor Technology Trends
Moore's Law
Alive & Kicking
Moore's Law (1965)
"Number of transistors in a dense integrated circuit
doubles approximately every two years"
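The doubling statement can be written as a one-line model (the 2x-per-2-years rate is the slide's figure; the starting count is illustrative only):

```python
# Moore's Law as a simple exponential: count doubles every ~2 years.
def transistor_count(start, years, doubling_period_years=2):
    return start * 2 ** (years / doubling_period_years)

# Ten years at this pace is five doublings, a 32x increase.
print(transistor_count(1_000_000, 10))   # -> 32000000.0
```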
6. 6
Dennard Scaling Limits
Dennard scaling: as transistors get smaller, their power density stays constant, so power use stays in proportion with area; both voltage and current scale (downward) with length.
Transistor dimensions are scaled by 30% (0.7x) every technology generation, reducing their area by 50%. This reduces the delay by 30% (0.7x) and therefore increases operating frequency by about 40% (1.4x). To keep the electric field constant, voltage is reduced by 30%, reducing energy by 65% and power (at 1.4x frequency) by 50%.
• Voltage scaling for high-performance designs is limited
• By leakage issues: can’t reduce threshold voltages
• Need steeper sub-threshold slopes
• Limited by variability, esp VT variability
• Need to minimize random dopant fluctuations
• Limited by gate oxide thickness
• Some relief from high-K materials
• Limited voltage scaling + decreasing feature sizes
Increasing electric fields
• New device structures needed (FinFETs)
• Reliability challenges (devices and wires)
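The classical scaling arithmetic quoted above can be checked in a few lines (a sketch, using the slide's 0.7x linear-shrink factor):

```python
# Classical Dennard scaling for one technology generation.
k = 0.7                            # linear dimensions scale by 0.7x

area = k ** 2                      # ~0.49: transistor area halves
frequency = 1 / k                  # ~1.4x: delay drops 30%, clock rises ~40%
capacitance = k                    # gate capacitance scales with length
voltage = k                        # constant-field scaling: V drops 30%

energy = capacitance * voltage**2  # ~0.34: ~65% less energy per switch
power = energy * frequency         # ~0.49: ~50% less power per transistor
power_density = power / area       # ~1.0: power density stays constant

print(f"area={area:.2f} freq={frequency:.2f} energy={energy:.2f} "
      f"power={power:.2f} power_density={power_density:.2f}")
```

The last line is the punchline: constant power density is what allowed frequency to rise for free, and its breakdown (leakage, variability) is what the bullets above describe.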
7. 7
CMOS Power - Performance Scaling
Where this curve is flat, can only improve chip frequency by:
a) Pushing core/chip to higher power density (air cooling limits)
b) Design power efficiency improvements (low-hanging fruit all gone)
[Figure: relative performance metric (at constant power density) vs. feature pitch in microns, log-log scale; the curve climbs steeply "when scaling was good" and flattens at small feature sizes]
12. 12
End customer doesn't care about Frequency / ST performance & other 'processor' metrics
Cost/Performance is the metric
Processors
Semiconductor Technology
Industry trends, Challenges & Opportunities
Microprocessors alone no longer drive sufficient Cost/Performance improvements
15. 15
Materials Innovations - Increased Complexity & Cost
Global Foundries projects that a
computer chip manufacturing plant in NY
would cost $14.7 billion to build
16. 16
“Data Access” Performance
(bandwidth & latency) & Cost
(Power) still very challenging
Some techniques to hide
latency/bw/pwr
Caches
Locality optimization
Out-of-order execution
Multithreading
Pre-fetching
"Fat" pipes / Memory Buffers
[Figure: memory-storage latency spectrum in ns; Storage Class Memory sits between memory and storage at 100 – 1000ns. Source: SNIA]
“Feeding the Engine” Challenge
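Of the latency-hiding techniques listed above, locality optimization is the easiest to demonstrate; a minimal Python sketch (interpreter overhead dominates here, so the cache effect is far smaller than in compiled code):

```python
# Locality in miniature: row-major traversal touches memory sequentially,
# column-major traversal does not. Same result, different access pattern.
import time

N = 1000
matrix = [[1] * N for _ in range(N)]

def row_major():
    return sum(matrix[i][j] for i in range(N) for j in range(N))

def col_major():
    return sum(matrix[i][j] for j in range(N) for i in range(N))

for fn in (row_major, col_major):
    t0 = time.perf_counter()
    total = fn()
    print(fn.__name__, total, f"{time.perf_counter() - t0:.3f}s")
```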
17. 17
Access latency in uP cycles (@ 4GHz). Source: H. Hunter, IBM
[Figure: log2 latency scale from 2^1 to 2^23 processor cycles. "Memory Calls" (Load/Store): L1/L2 (SRAM) at the low end, then L3/L4, then DRAM. "I/O Calls" (Read/Writes): Flash, then HDD at the high end]
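The chart's powers of two convert to wall-clock time at the quoted 4 GHz clock; which tier sits at which tick is read approximately from the figure:

```python
# Convert a power-of-two cycle count into nanoseconds at 4 GHz.
def latency_ns(log2_cycles, freq_hz=4e9):
    return (2 ** log2_cycles) / freq_hz * 1e9

print(f"2^3  cycles (cache-like): {latency_ns(3):12.1f} ns")
print(f"2^9  cycles (DRAM-like) : {latency_ns(9):12.1f} ns")
print(f"2^23 cycles (HDD-like)  : {latency_ns(23):12.1f} ns")   # ~2 ms
```

The spread is the point: the same axis spans half a nanosecond to milliseconds, which is why the "Feeding the Engine" techniques above exist.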
Memory / Storage
Storage Class of Memory
NVMe - Non-Volatile Memory express (PCIe)
• Standardized high-performance interface for PCI Express SSDs. Available today in three different form factors: PCIe add-in card, SFF 2.5" and M.2
• PCIe Gen3 (today): x8 ~8GB/s [x4 ~4GB/s, x2 ~2GB/s] vs SAS 12Gb/s [1.5GB/s per port]
• PCIe Gen4 (2018): x8 ~16GB/s [x4 ~8GB/s, x2 ~4GB/s] vs SAS 24Gb/s [3GB/s per port]
NVMe over fabrics (low-latency RDMA access): <10us including switches
CAPI-based Flash (today): x16 (16GB/s) – at faster access latencies (more on this later)
HBM (High Bandwidth memory)
• 3D Stacked DRAM from AMD/Hynix/Samsung
• HBM2 256GB/sec ~4GB/package (8 DRAM TSV stacked)
• 1024bits x 2GT/s
• HBM3 512GB/sec ~2020 time frame
NVDIMM
• Persistent memory solution on DDR interface
• Combines DRAM, NAND Flash and power source
• Delivers DRAM R/W perf with the persistence & reliability
of NAND
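The bandwidth figures above can be sanity-checked from first principles (a sketch, assuming the 128b/130b line encoding PCIe Gen3/Gen4 use):

```python
# Usable GB/s per direction for a PCIe link: transfer rate in GT/s,
# 128b/130b encoding efficiency, 8 bits per byte, times lane count.
def pcie_gbps(gt_per_s, lanes):
    return gt_per_s * (128 / 130) / 8 * lanes

hbm2_gbps = 1024 * 2 / 8   # 1024-bit HBM2 bus at 2 GT/s -> GB/s

print(f"PCIe Gen3 x8 : {pcie_gbps(8, 8):.1f} GB/s")    # ~7.9, quoted ~8
print(f"PCIe Gen4 x8 : {pcie_gbps(16, 8):.1f} GB/s")   # ~15.8, quoted ~16
print(f"HBM2         : {hbm2_gbps:.0f} GB/s")          # 256, as quoted
```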
18. 18 Source: SNIA
The Contenders
https://www.snia.org/sites/default/files/NVM/2016/presentations/Panel_1_Combined_NVM_Futures%20Revision.pdf
19. 19
Function offload – greater concurrency & utilization
Power efficiency (performance/watt)
Workloads
Encryption-decryption / Compression-decompression / Encoding-decoding / Network Controllers / Math Libraries / DB queries / Search
Deep Learning (arms race!) for training & inferencing
Hardware Acceleration
Types of Accelerators
General Purpose GPU / Many Integrated Core (MIC)
Nvidia Tesla/Volta, Intel Xeon Phi, AMD Radeon
Field Programmable Gate Array (FPGA)
Xilinx, Altera (now Intel)
Purpose Built / Custom ASIC’s
Google’s TPU
Intelligent Network Controllers
Cavium ARM-accelerated NIC
Mellanox NIC+FPGA
Microsoft FPGA-only network adapter
Traditionally (“IO” limited) sequential instructions
on processor / parallel compute offloaded to
accelerator
Penalty for “IO” operations heavy
20. 20
HPC & Hyper-scale datacenters (Cloud) are driving need for higher network bandwidth
HPC & Deep learning require more bandwidth between accelerators and memory
PCI Express has limitations (coherence / bandwidth / protocol overhead)
Desired Attributes
Low Latency / High Bandwidth / Coherence
Emergence of complex storage & memory solutions (BW & latency & heterogeneity)
Growing demand for network performance (BW & latency)
Various form factors (e.g., GPUs, FPGAs, ASICs, etc.)
Open standard for broad industry, architecture agnostic participation / avoid vendor lock-in
Volume pricing advantages & Broad software ecosystem growth and adoption
Vendor specific variants
Intel Omni Path Architecture, Nvidia Nvlink, AMD Hypertransport
Open Standards evolving
Cache Coherent Interconnect for Accelerators (CCIX) www.ccixconsortium.com
Gen-Z genzconsortium.org
Open Coherent Accelerator Processor Interface (OpenCAPI) opencapi.org
Need for High Performance Next Generation Bus/Interconnect
21. 21
Coherent Accelerator Processor Interface (CAPI) - 2014
[Diagram: Power processor with an integrated CAPP, connected over PCIe to an FPGA; the FPGA carries the IBM-supplied POWER Service Layer plus accelerator Function0 … Functionn, communicating via CAPI]
Virtual Addressing
Removes the requirement for pinning system memory for PCIe
transfers
Eliminates the copying of data into and out of the pinned DMA buffers
Eliminates the operating system call overhead to pin memory for
DMA
Accelerator can work with same addresses that the processors use
Pointers can be de-referenced same as the host application
- Example: Enables the ability to traverse data structures
Coherent Caching of Data
Enables an accelerator to cache data structures
Enables Cache to Cache transfers between accelerator and processor
Enables the accelerator to participate in “Locks” as a normal thread
Elimination of Device Driver
Direct communication with Application
No requirement to call an OS device driver or Hypervisor function for
mainline processing
Enables Accelerator Features not possible with PCIe
Enables efficient Hybrid Applications
Applications partially implemented in the accelerator and partially on
the host CPU
Visibility to full system memory
Simpler programming model for Application Modules
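The pointer-dereferencing benefit can be pictured with a host-side analogy (plain Python, not CAPI APIs; the names are illustrative):

```python
# Analogy: with shared virtual addressing the accelerator walks the
# application's own pointer-based structure; a classic DMA model must
# first flatten it into a pinned buffer (an extra copy of every element).
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    value: int
    next: Optional["Node"] = None

def coherent_walk(head):            # CAPI-style: dereference host pointers
    total = 0
    while head:
        total += head.value
        head = head.next
    return total

def dma_style(head):                # PCIe-style: copy, then compute
    buffer = []                     # stands in for the pinned DMA buffer
    while head:
        buffer.append(head.value)
        head = head.next
    return sum(buffer)

lst = Node(1, Node(2, Node(3)))
assert coherent_walk(lst) == dma_style(lst) == 6
```

Same answer either way; what CAPI removes is the copy step and the driver calls around it.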
Coherent Accelerator Processor Proxy (CAPP)
– Proxy for FPGA Accelerator on PowerBus
– Integrated into Processor
– Programmable (Table Driven) Protocol for CAPI
– Shadow Cache Directory for Accelerator
• Up to 1MB Cache Tags (Line based)
• Larger block based Cache
POWER Service Layer (PSL)
– Implemented in FPGA Technology
– Provides Address Translation for Accelerator
• Compatible with POWER Architecture
– Provides Cache for Accelerator
– Facilities for downloading Accelerator Functions
22. 22
How CAPI Works
[Diagram: POWER8 processor connected over PCIe to a CAPI Developer Kit Card running the accelerated algorithm]
Acceleration Portion: data- or compute-intensive, storage or external I/O
Application Portion: data set-up, control
Sharing the same memory space
Accelerator is a peer to POWER8 Core
Coherent Accelerator Processor Interface (CAPI) - 2014
Accelerator is a Full Peer to Processor
Accelerator Function(s) use an unmodified
Effective address
Full access to Real address space
Utilize Processor’s Page Tables Directly
Page Faults handled by System Software
Multiple Functions can exist in a single
Accelerator
23. 23
IO Attached Accelerator
[Diagram: six POWER8 cores share a memory subsystem addressed by virtual addresses; the application (App) reaches the FPGA over PCIe through a device driver (DD), leaving copies of the Variables, Input Data and Output Data in application memory, in the device driver's storage area, and on the accelerator]
An application called a device driver to utilize an FPGA Accelerator.
The device driver performed a memory mapping operation.
3 versions of the data (not coherent).
1000s of instructions in the device driver.
24. 24
CAPI Coherency
[Diagram: six POWER8 cores and the FPGA (through the PSL, over PCIe) share one memory subsystem addressed by virtual addresses; a single copy of the Variables, Input Data and Output Data lives in application memory]
With CAPI, the FPGA shares memory with the cores
1 coherent version of the data.
No device driver call/instructions.
25. 25
Typical I/O Model Flow:
DD Call → Copy or Pin Source Data → MMIO Notify Accelerator → Acceleration (application dependent, but equal to the coherent model's) → Poll / Interrupt Completion → Copy or Unpin Result Data → Ret. From DD Completion
Instruction counts across the flow: 300 + 10,000 + 3,000 + 1,000 + 1,000 instructions (7.9µs + 4.9µs) → total ~13µs for data prep

Flow with a Coherent Model:
Shared Mem. Notify Accelerator → Acceleration (application dependent, but equal to the I/O model's) → Shared Memory Completion
Instruction counts across the flow: 400 + 100 instructions (0.3µs + 0.06µs) → total 0.36µs
CAPI vs. I/O Device Driver: Data Prep
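The totals on this slide reduce to a quick calculation:

```python
# Data-prep overhead comparison using the figures quoted on the slide.
io_model_us = 7.9 + 4.9            # typical I/O (device-driver) model
capi_model_us = 0.3 + 0.06         # coherent (CAPI) model

io_instructions = 300 + 10_000 + 3_000 + 1_000 + 1_000
capi_instructions = 400 + 100

print(f"I/O model : {io_instructions} instructions, {io_model_us:.1f} us")
print(f"CAPI      : {capi_instructions} instructions, {capi_model_us:.2f} us")
print(f"Speedup   : ~{io_model_us / capi_model_us:.0f}x less data-prep overhead")
```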
26. 26
IBM Accelerated GZIP Compression
An FPGA-based low-latency GZIP compressor & decompressor with single-thread throughput of ~2GB/s and a compression ratio significantly better than low-CPU-overhead compressors like snappy.
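For a rough software baseline to compare against the ~2GB/s FPGA figure, one can time Python's standard zlib on compressible input (a sketch only; throughput varies widely with machine, level, and data):

```python
# Single-thread software DEFLATE baseline using the stdlib zlib module.
import time
import zlib

data = b"heterogeneous computing " * 1_000_000   # ~24 MB, compressible
t0 = time.perf_counter()
compressed = zlib.compress(data, level=1)        # low-CPU-overhead setting
elapsed = time.perf_counter() - t0

assert zlib.decompress(compressed) == data       # round-trip sanity check
print(f"zlib level 1: {len(data) / elapsed / 1e9:.2f} GB/s, "
      f"ratio {len(data) / len(compressed):.1f}x")
```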
29. 29
CAPI Acceleration
Egress Transform – examples: encryption, compression, erasure prior to network or storage
Bi-Directional Transform – examples: NoSQL such as Neo4J with graph node traversals, etc.
Memory Transform – examples: machine or deep learning, potentially using OpenCAPI attached memory
Basic work offload
Needle-in-a-Haystack Engine – examples: database searches, joins, intersections, merges
Ingress Transform – examples: video analytics, HFT, VPN/IPsec/SSL, Deep Packet Inspection (DPI), Data Plane Accelerator (DPA), video encoding (H.265), etc.
[Diagram: in each pattern a processor chip connects over TLx/DLx to an accelerator operating on the data]
OpenCAPI WINS due to bandwidth to/from accelerators, best-of-breed latency, and the flexibility of an Open architecture
30. 30
NVLink 1
4 links
20 GBps per link raw bandwidth each
direction
~160GBps total net NVLink bandwidth
NVLink 2
6 links
25GBps per link raw bandwidth each
direction
~300GBps total net NVLink bandwidth
Volta GV100
• 15 TFLOPS FP32
• 16GB HBM2 – 900 GB/s
• 300W TDP
• 50 GFLOPS/W (FP32)
• 12nm process
• 300GB/s NV Link2
• Tensor Core....
Source: Nvidia
NVIDIA GPU
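The aggregate figures above follow from links × per-link raw bandwidth × two directions:

```python
# Total raw NVLink bandwidth from the per-link figures on this slide.
def nvlink_total(links, gbps_per_link_per_direction):
    return links * gbps_per_link_per_direction * 2   # both directions

print(nvlink_total(4, 20))   # NVLink 1: 160 GB/s
print(nvlink_total(6, 25))   # NVLink 2: 300 GB/s
```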
31. 31
“Minsky” S822LC for HPC
• Tight coupling: strong CPU: strong GPU performance
• Equalizing access to memory - for all kinds of programming
• Closer programming to the CPU paradigm
OpenPOWER P8' Design
[Diagram: two P8' CPUs, each with DDR4 memory at 115GB/s, connected to each other and to two Tesla P100 GPUs apiece over 80GB/s NVLink]
For x86 Servers: PCIe Bottleneck
[Diagram: two x86 CPUs reach their four GPUs only over PCIe at 32GBps; no NVLink between CPU & GPU]
2.7X faster query response time on “Minsky”
87% of the total speedup (2.35x of 2.7x
improvement) is due to the NVLink Interface
from CPU:GPU
• Profiling result based on running Kinetica “Filter by geographic area” queries on data set of 280 million simulated 1 simultaneous query stream each with 0 think time.
• Power System S822LC for HPC; 20 cores (2 x 10c chips) / 160 threads, POWER8 with NVLink; 2.86 GHz, 1024 GB memory, 2x 6Gb SSDs, 2-port 10 GbEth, 4x Tesla P100 GPU; Ubuntu 16.04.
• Competitive stack: 2x Xeon E5-2640 v4; 20 cores (2 x 10c chips) / 40 threads; Intel Xeon E5-2640 v4; 2.4 GHz; 512GB memory 2x 6Gb SSDs, 2-port 10 GbEth, 4x Tesla P100 GPU, Ubuntu 16.04.
33. 33
Google TPU 1.0
[Jouppi et al., ISCA 2017]
Relative performance/Watt (TDP) of GPU server (blue) and
TPU server (red) to CPU server, and TPU server to GPU
server (orange).
TPU’ is an improved TPU that uses GDDR5 memory. The
green bar shows its ratio to the CPU server, and the lavender
bar shows its relation to the GPU server.
Total includes host server power, but incremental doesn’t. GM
and WM are the geometric and weighted means.
34. 34
Google TPU performance
Stars are for the TPU
Triangles are for the K80
Circles are for Haswell.
[Jouppi et al., ISCA 2017]
35. 35
Microsoft Azure FPGA Usage
[M.Russinovich, MSBuild 2017]
FPGA for SDN Offload FPGA for Bing
37. 37
Ease of Consumption
Compiler Optimization
Math libraries optimization
Native Support for CUDA / OpenMP / OpenCL ..
Native Support for Frameworks, e.g., for Deep Learning (Torch / TensorFlow / Caffe …)
42. 42
When to Use FPGAs
Transistor Efficiency & Extreme Parallelism
Bit-level operations
Variable-precision floating point
Power-Performance Advantage
>2x compared to Multicore (MIC) or GPGPU
Unused LUTs are powered off
Technology Scaling better than CPU/GPU
FPGAs are not frequency or power limited yet
3D has great potential
Dynamic reconfiguration
Flexibility for application tuning at run-time vs. compile-time
Additional advantages when FPGAs are network connected: allows network as well as compute specialization

When to Use GPGPUs
Extreme FLOPS & Parallelism
Double-precision floating point leadership
Hundreds of GPGPU cores
Programming Ease & Software Group Interest
CUDA & extensive libraries
OpenCL
IBM Java (coming soon)
Bandwidth Advantage on Power
Start w/ PCIe Gen3 x16 and then move to NVLink
Leverage existing GPGPU eco-system and development base
Lots of existing use-cases to build on
Heavy HPC investment in GPGPU
45. Use Cases – A truly heterogeneous architecture built upon OpenCAPI
OpenCAPI 3.0
OpenCAPI 3.1
OpenCAPI specifications are downloadable from the website at www.opencapi.org
- Register
- Download
46. OpenCAPI Advantages for Memory
Open standard interface enables attachment of a wide range of devices
OpenCAPI protocol was architected to minimize latency
Especially advantageous for classic DRAM memory
Extreme bandwidth beyond classical DDR memory interface
Agnostic interface allows extension to evolving memory technologies in the future (e.g., compute-in-memory)
Ability to handle a memory buffer to decouple raw memory and host interfaces to optimize power, cost and performance
Common physical interface between non-memory and memory devices
47. 47
OpenCAPI Key Attributes
• Architecture agnostic bus – Applicable with any system/microprocessor architecture
• Coherency - Attached devices operate natively within application’s user space and coherently with host uP
• High performance interface design with no ‘overhead’, optimized for high bandwidth and low latency
• Point to point construct optimized within a system
• Allows attached device to fully participate in application without kernel involvement/overhead
• 25Gbit/sec signaling and protocol to enable very low latency interface on CPU and attached device
• Supports a wide range of use cases and access semantics
• Hardware accelerators
• High-performance I/O devices
• Advanced memories and Classic memory
• Various form factors (e.g., GPUs, FPGAs, ASICs, memory, etc.)
• Reduced complexity of design implementation
• Wanted to make this easy for the accelerator, memory and system design teams
• Moved complexities of coherence and virtual addressing onto the host microprocessor to simplify
attached devices and facilitate interoperability across multiple CPU architectures
48. Virtual Addressing and Benefits
An OpenCAPI device operates in the virtual address spaces of the applications that it supports
• Eliminates kernel and device driver software overhead
• Allows device to operate on application memory without kernel-level data copies/pinned pages
• Simplifies programming effort to integrate accelerators into applications
• Improves accelerator performance
The Virtual-to-Physical Address Translation occurs in the host CPU
• Reduces design complexity of OpenCAPI-attached devices
• Makes it easier to ensure interoperability between OpenCAPI devices and different CPU architectures
• Security - Since the OpenCAPI device never has access to a physical address, this eliminates the
possibility of a defective or malicious device accessing memory locations belonging to the kernel or
other applications that it is not authorized to access