IBM Confidential
Heterogeneous Computing
The Future of Systems
Anand Haridass
Senior Technical Staff Member
IBM Cognitive Systems
NITK (KREC) – Batch of ‘95 (E&C)
IBM Academy of Technology
NITK-IBM Computer Systems Research Group (NCSRG)
Seminar Sep/18/2017
System Overview
Technology Trends – End of Dennard Scaling
Vertical Integration - OpenPOWER
“Feeding the Engine” – Memory / Storage
Need for High Performance Bus – OpenCAPI
Accelerator Examples
Von Neumann Architecture
• First published by John von Neumann in 1945.
• Design consists of a Control Unit, Arithmetic & Logic Unit (ALU), Memory Unit, Registers & Inputs/Outputs.
• Stored-program computer concept: instruction data and program data are stored in the same memory.
• Most servers and PCs produced today use this design.
Typical 2 Socket Systems [2017]
Memory Memory
IO/ Storage / NW
IO/ Storage / NW
Processor Technology Trends
Moore’s Law
Alive & Kicking
Moore’s Law (1965)
“Number of transistors in a dense integrated circuit
doubles approximately every two years”
Dennard Scaling Limits
Dennard scaling: as transistors get smaller, their power density stays constant, so that power use stays in proportion with area; both voltage and current scale (downward) with length.
Transistor dimensions are scaled by 30% (0.7x) every technology generation, thus reducing their area by 50%. This reduces the delay by 30% (0.7x) and therefore increases operating
frequency by about 40% (1.4x). To keep the electric field constant, voltage is reduced by 30%, reducing energy by 65% and power (at 1.4x
frequency) by 50%.
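The percentages above can be checked with a short calculation. This is a minimal sketch of ideal constant-field scaling; the assumption that capacitance scales with length is standard for Dennard scaling but is not stated on the slide, and the function name is only illustrative.

```python
def dennard_generation(s: float = 0.7) -> dict:
    """Ideal constant-field scaling factors for one technology generation.

    s is the linear shrink (0.7x in the text). Capacitance scaling with
    length is an assumption of this sketch, not a figure from the slide.
    """
    area = s ** 2                  # area goes with length squared
    delay = s                      # gate delay goes with length
    frequency = 1.0 / delay        # frequency is the inverse of delay
    voltage = s                    # constant field: voltage goes with length
    capacitance = s                # assumed: capacitance goes with length
    energy = capacitance * voltage ** 2   # switching energy ~ C * V^2
    power = energy * frequency            # power per transistor ~ C * V^2 * f
    return {
        "area": area,
        "frequency": frequency,
        "energy": energy,
        "power": power,
        "power_density": power / area,
    }


factors = dennard_generation()
for name, factor in factors.items():
    print(f"{name:>13}: x{factor:.2f}")
```

Power per transistor and area both shrink to roughly half, so power density stays constant. That is the property that stops holding once voltage can no longer be reduced.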
• Voltage scaling for high-performance designs is limited
• By leakage issues: can’t reduce threshold voltages
• Need steeper sub-threshold slopes
• Limited by variability, esp VT variability
• Need to minimize random dopant fluctuations
• Limited by gate oxide thickness
• Some relief from high-K materials
• Limited voltage scaling + decreasing feature sizes
Increasing electric fields
• New device structures needed (FinFETs)
• Reliability challenges (devices and wires)
CMOS Power - Performance Scaling
Where this curve is flat, can only improve chip frequency by:
a) Pushing core/chip to higher power density (air cooling limits)
b) Design power efficiency improvements (low-hanging fruit all gone)
[Figure: CMOS power-performance scaling versus feature pitch (microns), log axis from 0.01 to 10; the constant-power-density region is annotated “When scaling was good…”]
Processor Technology Trends
‘Affordable’ Air Cooled
Limit ~120-190W
Dennard Scaling
limiting from 2002-04
Processor Technology Trends
Processor Frequency
peaks at ~6GHz and
settles between 2-4GHz
Processor Technology Trends
Multi-Cores (& threads)
Parallel Programming to
End customer doesn't care about Frequency / ST performance & other ‘processor’ metrics
Cost/Performance is the metric
Semiconductor Technology
Industry trends, Challenges & Opportunities
Microprocessors alone no longer drive sufficient Cost/Performance improvements
System stack innovations are required to drive
OpenPOWER Foundation
Materials Innovations - Increased Complexity & Cost
Global Foundries projects that a
computer chip manufacturing plant in NY
would cost $14.7 billion to build
“Data Access” Performance
(bandwidth & latency) & Cost
(Power) still very challenging
Some techniques to hide
Locality optimization
Out-of-order execution
“Fat” pipes / Memory Buffers
Storage Class Memory
(100 – 1000ns)
Source: SNIA
“Feeding the Engine” Challenge
Access latency in
uP cycles
(@ 4GHz)
Source H.Hunter IBM
[Figure axis ticks: 2^1, 2^3, 2^9, 2^11, 2^13, 2^15, 2^17, 2^19, 2^23 uP cycles]
“I/O Calls” (Read/Writes) “Memory Calls” (Load/Store)
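To relate the nanosecond figures to "uP cycles (@ 4GHz)", a latency can be converted to processor cycles. This is a rough sketch; the function name is illustrative, and only the 100 to 1000ns Storage Class Memory range and the 4GHz clock come from the slides.

```python
import math


def ns_to_cycles(latency_ns: float, clock_ghz: float = 4.0) -> float:
    """Convert an access latency in nanoseconds to processor cycles.

    One GHz is one cycle per nanosecond, so cycles = ns * GHz.
    """
    return latency_ns * clock_ghz


# Storage Class Memory range quoted in the deck: 100 to 1000 ns.
for latency_ns in (100, 1000):
    cycles = ns_to_cycles(latency_ns)
    print(f"{latency_ns:>5} ns = {cycles:>6.0f} cycles (about 2^{math.log2(cycles):.1f})")
```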
Memory / Storage
Storage Class of Memory
NVMe - Non-Volatile Memory express (PCIe)
• Standardized high performance interface for PCI Express SSD.
Available today in three different form factors: PCIe Add in Card, SFF
2.5” and M.2
• PCIe Gen3 (today) x8 ~8GB/s [x4 ~4GB/s, x2 ~2GB/s] vs SAS 12Gb/s
[1.5GB/s per port]
• PCIe Gen4 (2018) x8 ~16GB/s [x4 ~8GB/s, x2 ~4GB/s] vs SAS 24Gb/s
[3GB/s per port]
NVMe over fabrics (low latency RDMA access) <10us including switches
CAPI based Flash (today) x16 (16GB/s) – at faster access latencies
(more on this later)
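The PCIe figures above follow from the per-lane transfer rate. The sketch below assumes the PCIe Gen3/Gen4 rates of 8 and 16 GT/s per lane with 128b/130b line encoding (standard PCIe parameters, not stated on the slide) and ignores protocol overhead; the names are illustrative.

```python
# Per-lane transfer rate in GT/s for each PCIe generation (assumed here).
PCIE_GT_PER_S = {3: 8.0, 4: 16.0}


def pcie_gbytes_per_s(gen: int, lanes: int) -> float:
    """Approximate one-direction PCIe bandwidth in GB/s.

    Applies 128b/130b line encoding; packet and protocol overhead
    are not modelled, so real throughput is somewhat lower.
    """
    payload_fraction = 128 / 130
    return PCIE_GT_PER_S[gen] * payload_fraction / 8 * lanes


for gen in (3, 4):
    for lanes in (2, 4, 8):
        print(f"Gen{gen} x{lanes}: ~{pcie_gbytes_per_s(gen, lanes):.1f} GB/s")

# The SAS numbers on the slide are the raw line rate divided by 8 bits.
print(f"SAS 12Gb/s: {12 / 8} GB/s per port, SAS 24Gb/s: {24 / 8} GB/s per port")
```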
HBM (High Bandwidth memory)
• 3D Stacked DRAM from AMD/Hynix/Samsung
• HBM2 256GB/sec ~4GB/package (8 DRAM TSV stacked)
• 1024bits x 2GT/s
• HBM3 512GB/sec ~2020 time frame
• Persistent memory solution on DDR interface
• Combines DRAM, NAND Flash and power source
• Delivers DRAM R/W perf with the persistence & reliability
Source: SNIA
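The HBM2 figure quoted earlier (1024 bits x 2GT/s giving 256GB/sec) is a width-times-rate product, sketched here with an illustrative function name:

```python
def interface_gbytes_per_s(width_bits: int, gtransfers_per_s: float) -> float:
    """Peak bandwidth of a parallel interface: bits per transfer times
    transfers per second, divided by 8 bits per byte."""
    return width_bits * gtransfers_per_s / 8


hbm2 = interface_gbytes_per_s(width_bits=1024, gtransfers_per_s=2.0)
print(f"HBM2: {hbm2:.0f} GB/s per package")
```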
The Contenders
Function offload – greater concurrency & utilization
Power efficiency (performance/watt)
Encryption-decryption / Compression-decompression / Encoding-decoding / Network
Controllers / Math Libraries / DB queries / Search
Deep Learning (Arms race !) for training &
Hardware Acceleration
Types of Accelerators
General Purpose GPU / Many Integrated Core (MIC)
Nvidia Tesla/Volta, Intel Xeon Phi, AMD Radeon
Field Programmable Gate Array (FPGA)
Xilinx, Altera (now Intel)
Purpose Built / Custom ASICs
Google’s TPU
Intelligent Network Controllers
Cavium ARM-accelerated NIC
Mellanox NIC+FPGA
Microsoft FPGA-only network adapter
Traditionally (“IO” limited) sequential instructions
on processor / parallel compute offloaded to
Penalty for “IO” operations heavy
HPC & Hyper-scale datacenters (Cloud) are driving need for higher network bandwidth
HPC & Deep learning require more bandwidth between accelerators and memory
PCI Express has limitations (coherence / bandwidth / protocol overhead)
Desired Attributes
Low Latency / High Bandwidth / Coherence
Emergence of complex storage & memory solutions (BW & latency & heterogeneity)
Growing demand for network performance (BW & latency)
Various form factors (e.g., GPUs, FPGAs, ASICs, etc.)
Open standard for broad industry, architecture agnostic participation / avoid vendor lock-in
Volume pricing advantages & Broad software ecosystem growth and adoption
Vendor specific variants
Intel Omni-Path Architecture, Nvidia NVLink, AMD HyperTransport
Open Standards evolving
Cache Coherent Interconnect for Accelerators (CCIX) www.ccixconsortium.com
Gen-Z genzconsortium.org
Open Coherent Accelerator Processor Interface (OpenCAPI) opencapi.org
Need for High Performance Next Generation Bus/Interconnect
Coherent Accelerator Processor Interface (CAPI) - 2014
Power Processor
IBM Supplied POWER
Service Layer
Virtual Addressing
Removes the requirement for pinning system memory for PCIe
Eliminates the copying of data into and out of the pinned DMA buffers
Eliminates the operating system call overhead to pin memory for
Accelerator can work with same addresses that the processors use
Pointers can be de-referenced same as the host application
- Example: Enables the ability to traverse data structures
Coherent Caching of Data
Enables an accelerator to cache data structures
Enables Cache to Cache transfers between accelerator and processor
Enables the accelerator to participate in “Locks” as a normal thread
Elimination of Device Driver
Direct communication with Application
No requirement to call an OS device driver or Hypervisor function for
mainline processing
Enables Accelerator Features not possible with PCIe
Enables efficient Hybrid Applications
Applications partially implemented in the accelerator and partially on
the host CPU
Visibility to full system memory
Simpler programming model for Application Modules
Coherent Accelerator Processor Proxy (CAPP)
– Proxy for FPGA Accelerator on PowerBus
– Integrated into Processor
– Programmable (Table Driven) Protocol for CAPI
– Shadow Cache Directory for Accelerator
• Up to 1MB Cache Tags (Line based)
• Larger block based Cache
POWER Service Layer (PSL)
– Implemented in FPGA Technology
– Provides Address Translation for Accelerator
• Compatible with POWER Architecture
– Provides Cache for Accelerator
– Facilities for downloading Accelerator Functions
How CAPI Works
Algorithm
POWER8 Processor
Acceleration Portion:
Data or Compute Intensive,
Storage or External I/O
Application Portion:
Data Set-up, Control
Sharing the same memory space
Accelerator is a peer to POWER8 Core
CAPI Developer Kit Card
Coherent Accelerator Processor Interface (CAPI) - 2014
Accelerator is a Full Peer to Processor
Accelerator Function(s) use an unmodified
Effective address
Full access to Real address space
Utilize Processor’s Page Tables Directly
Page Faults handled by System Software
Multiple Functions can exist in a single
Memory Subsystem
Virt Addr
IO Attached Accelerator
Device Driver
Storage Area
An application called a device driver to utilize an FPGA Accelerator.
The device driver performed a memory mapping operation.
3 versions of the data (not coherent).
1000s of instructions in the device driver.
Memory Subsystem
Virt Addr
CAPI Coherency
With CAPI, the FPGA shares memory with the cores
1 coherent version of the data.
No device driver call/instructions.
Typical I/O Model Flow:
DD Call → Copy or Pin Source Data → MMIO Notify → (Dependent, but equal to below) → Poll / Interrupt → Copy or Unpin Result Data → Ret. From DD
300 Instructions, 10,000 Instructions, 3,000 Instructions, 1,000 Instructions, 1,000 Instructions
7.9µs + 4.9µs: Total ~13µs for data prep
Flow with a Coherent Model:
Shared Mem. Notify Accelerator → (Dependent, but equal to above) → Shared Memory
400 Instructions, 100 Instructions
0.3µs + 0.06µs: Total 0.36µs
CAPI vs. I/O Device Driver: Data Prep
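The data-prep comparison can be totalled directly from the slide's figures. This is a rough sketch: the list names are illustrative, and the numbers are only those quoted on the slide, not new measurements.

```python
# Instruction counts and times quoted on the slide for data preparation.
io_model_instructions = [300, 10_000, 3_000, 1_000, 1_000]
io_model_us = [7.9, 4.9]

coherent_instructions = [400, 100]
coherent_us = [0.3, 0.06]

io_instr_total = sum(io_model_instructions)
coherent_instr_total = sum(coherent_instructions)
io_time = sum(io_model_us)
coherent_time = sum(coherent_us)

print(f"I/O model:      {io_instr_total:>6} instructions, {io_time:.1f} us")
print(f"Coherent model: {coherent_instr_total:>6} instructions, {coherent_time:.2f} us")
print(f"Instruction ratio: {io_instr_total / coherent_instr_total:.1f}x")
print(f"Time ratio:        {io_time / coherent_time:.1f}x")
```

The acceleration step itself is the same in both flows, so the difference is entirely overhead on either side of it.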
IBM Accelerated GZIP Compression
An FPGA-based low-latency GZIP Compressor & Decompressor with
single-thread throughput of ~2GB/s and a compression rate significantly
better than low-CPU overhead compressors like snappy.
CAPI Attached Flash
CAPI Acceleration
Examples: Encryption, Compression, Erasure prior to network or storage
Egress Transform
Bi-Directional Transform
Examples: NoSQL such as Neo4J with Graph Node Traversals, etc
Needle-in-a-haystack Engine
Examples: Machine or Deep Learning potentially using OpenCAPI attached memory
Memory Transform
Example: Basic work offload
Examples: Database searches, joins, intersections, merges
Ingress Transform
Examples: Video Analytics, HFT, VPN/IPsec/SSL, Deep Packet Inspection (DPI),
Data Plane Accelerator (DPA), Video Encoding (H.265), etc
Needle-In-A-Haystack Engine
OpenCAPI WINS due to Bandwidth
to/from accelerators, best of breed
latency, and flexibility of an Open
NVLink 1
4 links
20 GBps per link raw bandwidth each
~160GBps total net NVLink bandwidth
NVLink 2
6 links
25GBps per link raw bandwidth each
~300GBps total net NVLink bandwidth
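The NVLink totals above are consistent with counting both directions of every link. The sketch below makes that assumption explicit; it is an interpretation of the slide's numbers, and the function name is illustrative.

```python
def nvlink_total_gbps(links: int, per_link_gbps: float, directions: int = 2) -> float:
    """Aggregate NVLink bandwidth in GB/s, assuming the per-link figure
    is per direction and the total counts both directions."""
    return links * per_link_gbps * directions


print(f"NVLink 1: {nvlink_total_gbps(4, 20):.0f} GB/s total")
print(f"NVLink 2: {nvlink_total_gbps(6, 25):.0f} GB/s total")
```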
Volta GV100
• 15 TFLOPS FP32
• 16GB HBM2 – 900 GB/s
• 300W TDP
• 50 GFLOPS/W (FP32)
• 12nm process
• 300GB/s NV Link2
• Tensor Core....
Source: Nvidia
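The efficiency bullet above is the peak FP32 rate divided by TDP, which a one-line check confirms (variable names are illustrative):

```python
peak_fp32_tflops = 15.0   # Volta GV100 peak FP32, from the slide
tdp_watts = 300.0         # TDP, from the slide

gflops_per_watt = peak_fp32_tflops * 1000 / tdp_watts
print(f"{gflops_per_watt:.0f} GFLOPS/W (FP32)")
```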
“Minsky” S822LC for HPC
• Tight coupling: strong CPU: strong GPU performance
• Equalizing access to memory - for all kinds of programming
• Closer programming to the CPU paradigm
115GB/S 115GB/S
80GB/S Tesla
OpenPOWER P8’ Design
For x86 Servers: PCIe Bottleneck
No NVLink between CPU & GPU
2.7X faster query response time on “Minsky”
87% of the total speedup (2.35x of 2.7x
improvement) is due to the NVLink Interface
from CPU:GPU
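The 87% share quoted above is the ratio of the two speedup factors. This sketch reproduces the slide's arithmetic only; how the speedup was split between NVLink and other effects is not described in the source.

```python
total_speedup = 2.7     # query response speedup on "Minsky", from the slide
nvlink_speedup = 2.35   # portion attributed to NVLink, from the slide

nvlink_share = nvlink_speedup / total_speedup
print(f"NVLink share of the speedup: {nvlink_share:.0%}")
```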
• Profiling result based on running Kinetica “Filter by geographic area” queries on data set of 280 million simulated 1 simultaneous query stream each with 0 think time.
• Power System S822LC for HPC; 20 cores (2 x 10c chips) / 160 threads, POWER8 with NVLink; 2.86 GHz, 1024 GB memory, 2x 6Gb SSDs, 2-port 10 GbEth, 4x Tesla P100 GPU; Ubuntu 16.04.
• Competitive stack: 2x Xeon E5-2640 v4; 20 cores (2 x 10c chips) / 40 threads; Intel Xeon E5-2640 v4; 2.4 GHz; 512GB memory 2x 6Gb SSDs, 2-port 10 GbEth, 4x Tesla P100 GPU, Ubuntu 16.04.
Custom ASICs
Reducing Flexibility
Increasing Efficiency
Source: William Dally, Nvidia
Google TPU 1.0
[Jouppi et al., ISCA 2017]
Relative performance/Watt (TDP) of GPU server (blue) and
TPU server (red) to CPU server, and TPU server to GPU
server (orange).
TPU’ is an improved TPU that uses GDDR5 memory. The
green bar shows its ratio to the CPU server, and the lavender
bar shows its relation to the GPU server.
Total includes host server power, but incremental doesn’t. GM
and WM are the geometric and weighted means.
Google TPU performance
Stars are for the TPU
Triangles are for the K80
Circles are for Haswell.
[Jouppi et al., ISCA 2017]
Microsoft Azure FPGA Usage
[M.Russinovich, MSBuild 2017]
FPGA for SDN Offload FPGA for Bing
Hardware Micro-services
A hardware-only self-contained service that can be distributed and
accessed from across the datacenter compute fabric
Ease of Consumption
Compiler Optimization
Math libraries optimization
Native Support for CUDA / OpenMP / OpenCL ...
Native Support for Frameworks, e.g., for Deep Learning (Torch/TensorFlow/Caffe …)
POWER9 (SO) – Premier Accelerator Platform
2 Socket SMP: 256 GB/s
OpenCAPI and/or NVLink 2.0
200-300 GB/s
3x16 PCIeG4 : 192 GB/s
CAPI 2.0 Links : 128 GB/s
(Uses up to 2 x16 ports)
8 DDR4 ports @ 2667 MT/s
PCIe Device
IBM / Partner
IBM / Partner
Bandwidths shown are bi-directional
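The bi-directional PCIe and CAPI 2.0 figures can be reproduced from lane counts. This is a minimal sketch assuming 16 GT/s per PCIe Gen4 lane with encoding overhead ignored; the DDR4 estimate additionally assumes 8 bytes of data per transfer per port, which is not stated on the slide.

```python
def pcie_g4_bidir_gbytes(lanes: int) -> float:
    """Raw bi-directional PCIe Gen4 bandwidth in GB/s.

    Assumes 16 GT/s per lane, one bit per transfer, no encoding overhead.
    """
    per_lane_one_way = 16 / 8          # 2 GB/s per lane per direction
    return lanes * per_lane_one_way * 2


def ddr_gbytes(ports: int, mega_transfers: float, bytes_per_transfer: int = 8) -> float:
    """Peak DDR bandwidth in GB/s; 8 bytes per transfer is an assumption."""
    return ports * mega_transfers * bytes_per_transfer / 1000


print(f"3 x16 PCIe G4:        {pcie_g4_bidir_gbytes(3 * 16):.0f} GB/s")
print(f"CAPI 2.0 (2 x16):     {pcie_g4_bidir_gbytes(2 * 16):.0f} GB/s")
print(f"8 DDR4 ports @ 2667:  {ddr_gbytes(8, 2667):.1f} GB/s (estimate)")
```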
Newell POWER9 System - 6 GPU / 2CAPI
Source: SNIA / Flash Summit
When to Use FPGAs
Transistor Efficiency & Extreme Parallelism
Bit-level operations
Variable-precision floating point
Power-Performance Advantage
>2x compared to Multicore (MIC) or GPGPU
Unused LUTs are powered off
Technology Scaling better than CPU/GPU
FPGAs are not frequency or power limited yet
3D has great potential
Dynamic reconfiguration
Flexibility for application tuning at run-time vs.
Additional advantages when FPGAs are network
connected ...
allows network as well as compute
Extreme FLOPS & Parallelism
Double-precision floating point leadership
Hundreds of GPGPU cores
Programming Ease & Software Group Interest
CUDA & extensive libraries
IBM Java (coming soon)
Bandwidth Advantage on Power
Start w/PCIe gen3 x16 and then move to
Leverage existing GPGPU eco-system and
development base
Lots of existing use-Cases to build on
Heavy HPC investment in GPGPU
When to Use GPGPUs
Source: Brad Benton, AMD, OpenFabrics Alliance Annual Workshop 2017
Use Cases – A truly heterogeneous architecture built upon OpenCAPI
OpenCAPI 3.0
OpenCAPI 3.1
OpenCAPI specifications are
downloadable from the
at www.opencapi.org
- Register
- Download
OpenCAPI Advantages for Memory
Open standard interface enables attachment of a wide range of devices
OpenCAPI protocol was architected to minimize latency
Especially advantageous for classic DRAM memory
Extreme bandwidth beyond classical DDR memory interface
Agnostic interface allows extension to evolving memory technologies in the future
(e.g., compute-in-memory)
Ability to handle a memory buffer to decouple raw memory and host interfaces to
optimize power, cost and performance
Common physical interface between non-memory and memory devices
OpenCAPI Key Attributes
• Architecture agnostic bus – Applicable with any system/microprocessor architecture
• Coherency - Attached devices operate natively within application’s user space and coherently with host uP
• High performance interface design with no ‘overhead’ and optimized for a high bandwidth and low latency
• Point to point construct optimized within a system
• Allows attached device to fully participate in application without kernel involvement/overhead
• 25Gbit/sec signaling and protocol to enable very low latency interface on CPU and attached device
• Supports a wide range of use cases and access semantics
• Hardware accelerators
• High-performance I/O devices
• Advanced memories and Classic memory
• Various form factors (e.g., GPUs, FPGAs, ASICs, memory, etc.)
• Reduced complexity of design implementation
• Wanted to make this easy for the accelerator, memory and system design teams
• Moved complexities of coherence and virtual addressing onto the host microprocessor to simplify
attached devices and facilitate interoperability across multiple CPU architectures
Virtual Addressing and Benefits
An OpenCAPI device operates in the virtual address spaces of the applications that it supports
• Eliminates kernel and device driver software overhead
• Allows device to operate on application memory without kernel-level data copies/pinned pages
• Simplifies programming effort to integrate accelerators into applications
• Improves accelerator performance
The Virtual-to-Physical Address Translation occurs in the host CPU
• Reduces design complexity of OpenCAPI-attached devices
• Makes it easier to ensure interoperability between OpenCAPI devices and different CPU architectures
• Security - Since the OpenCAPI device never has access to a physical address, this eliminates the
possibility of a defective or malicious device accessing memory locations belonging to the kernel or
other applications that it is not authorized to access
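The security argument above can be illustrated with a toy model in which the device only ever issues virtual addresses and the host performs translation. This is a conceptual sketch, not the OpenCAPI protocol; every class, method, and constant here is hypothetical.

```python
PAGE_SIZE = 4096  # illustrative page size


class Host:
    """Toy host that owns the page tables and physical memory."""

    def __init__(self):
        self._page_tables = {}   # pid -> {virtual page: physical page}
        self.memory = {}         # physical address -> value

    def map_page(self, pid: int, vpage: int, ppage: int) -> None:
        self._page_tables.setdefault(pid, {})[vpage] = ppage

    def _translate(self, pid: int, vaddr: int) -> int:
        vpage, offset = divmod(vaddr, PAGE_SIZE)
        table = self._page_tables.get(pid, {})
        if vpage not in table:
            # Unmapped pages are rejected on the host side.
            raise PermissionError(f"pid {pid}: no mapping for 0x{vaddr:x}")
        return table[vpage] * PAGE_SIZE + offset

    def load(self, pid: int, vaddr: int) -> int:
        """Serve a device load; the physical address never leaves the host."""
        return self.memory[self._translate(pid, vaddr)]


class Device:
    """Toy accelerator bound to one application's address space."""

    def __init__(self, host: Host, pid: int):
        self.host = host
        self.pid = pid

    def read(self, vaddr: int) -> int:
        return self.host.load(self.pid, vaddr)


host = Host()
host.map_page(pid=1, vpage=2, ppage=7)
host.memory[7 * PAGE_SIZE + 5] = 42

device = Device(host, pid=1)
print(device.read(2 * PAGE_SIZE + 5))   # an address the application mapped
```

Because translation happens inside `Host`, a request for any address the application has not mapped fails before it reaches memory, which is the point made in the Security bullet.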

More Related Content

What's hot

OpenPOWER Webinar on Machine Learning for Academic Research
OpenPOWER Webinar on Machine Learning for Academic Research OpenPOWER Webinar on Machine Learning for Academic Research
OpenPOWER Webinar on Machine Learning for Academic Research
Ganesan Narayanasamy
Covid-19 Response Capability with Power Systems
Covid-19 Response Capability with Power SystemsCovid-19 Response Capability with Power Systems
Covid-19 Response Capability with Power Systems
Ganesan Narayanasamy
POWER10 innovations for HPC
POWER10 innovations for HPCPOWER10 innovations for HPC
POWER10 innovations for HPC
Ganesan Narayanasamy
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
Ganesan Narayanasamy
Summit workshop thompto
Summit workshop thomptoSummit workshop thompto
Summit workshop thompto
Ganesan Narayanasamy
OpenPOWER/POWER9 Webinar from MIT and IBM
OpenPOWER/POWER9 Webinar from MIT and IBM OpenPOWER/POWER9 Webinar from MIT and IBM
OpenPOWER/POWER9 Webinar from MIT and IBM
Ganesan Narayanasamy
Heterogeneous computing
Heterogeneous computingHeterogeneous computing
Heterogeneous computing
Rashid Ansari
IBM HPC Transformation with AI
IBM HPC Transformation with AI IBM HPC Transformation with AI
IBM HPC Transformation with AI
Ganesan Narayanasamy
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
SCFE 2020 OpenCAPI presentation as part of OpenPWOER TutorialSCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
Ganesan Narayanasamy
Ganesan Narayanasamy
Xilinx Edge Compute using Power 9 /OpenPOWER systems
Xilinx Edge Compute using Power 9 /OpenPOWER systemsXilinx Edge Compute using Power 9 /OpenPOWER systems
Xilinx Edge Compute using Power 9 /OpenPOWER systems
Ganesan Narayanasamy
@IBM Power roadmap 8
@IBM Power roadmap 8 @IBM Power roadmap 8
@IBM Power roadmap 8
Diego Alberto Tamayo
2018 bsc power9 and power ai
2018   bsc power9 and power ai 2018   bsc power9 and power ai
2018 bsc power9 and power ai
Ganesan Narayanasamy
Announcement Overview 4Q14 (ext)
Announcement Overview 4Q14 (ext)Announcement Overview 4Q14 (ext)
Announcement Overview 4Q14 (ext)
David Spurway
Ganesan Narayanasamy
TAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platformTAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platform
Ganesan Narayanasamy
Power overview 2018 08-13b
Power overview 2018 08-13bPower overview 2018 08-13b
Power overview 2018 08-13b
Ganesan Narayanasamy
Design installation-commissioning-red raider-cluster-ttu
Design installation-commissioning-red raider-cluster-ttuDesign installation-commissioning-red raider-cluster-ttu
Design installation-commissioning-red raider-cluster-ttu
Alan Sill
AMD It's Time to ROC
AMD It's Time to ROCAMD It's Time to ROC
AMD It's Time to ROC

What's hot (20)

OpenPOWER Webinar on Machine Learning for Academic Research
OpenPOWER Webinar on Machine Learning for Academic Research OpenPOWER Webinar on Machine Learning for Academic Research
OpenPOWER Webinar on Machine Learning for Academic Research
Covid-19 Response Capability with Power Systems
Covid-19 Response Capability with Power SystemsCovid-19 Response Capability with Power Systems
Covid-19 Response Capability with Power Systems
POWER10 innovations for HPC
POWER10 innovations for HPCPOWER10 innovations for HPC
POWER10 innovations for HPC
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
Summit workshop thompto
Summit workshop thomptoSummit workshop thompto
Summit workshop thompto
OpenPOWER/POWER9 Webinar from MIT and IBM
OpenPOWER/POWER9 Webinar from MIT and IBM OpenPOWER/POWER9 Webinar from MIT and IBM
OpenPOWER/POWER9 Webinar from MIT and IBM
Heterogeneous computing
Heterogeneous computingHeterogeneous computing
Heterogeneous computing
IBM HPC Transformation with AI
IBM HPC Transformation with AI IBM HPC Transformation with AI
IBM HPC Transformation with AI
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
SCFE 2020 OpenCAPI presentation as part of OpenPWOER TutorialSCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
Xilinx Edge Compute using Power 9 /OpenPOWER systems
Xilinx Edge Compute using Power 9 /OpenPOWER systemsXilinx Edge Compute using Power 9 /OpenPOWER systems
Xilinx Edge Compute using Power 9 /OpenPOWER systems
@IBM Power roadmap 8
@IBM Power roadmap 8 @IBM Power roadmap 8
@IBM Power roadmap 8
2018 bsc power9 and power ai
2018   bsc power9 and power ai 2018   bsc power9 and power ai
2018 bsc power9 and power ai
Announcement Overview 4Q14 (ext)
Announcement Overview 4Q14 (ext)Announcement Overview 4Q14 (ext)
Announcement Overview 4Q14 (ext)
TAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platformTAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platform
Power overview 2018 08-13b
Power overview 2018 08-13bPower overview 2018 08-13b
Power overview 2018 08-13b
Design installation-commissioning-red raider-cluster-ttu
Design installation-commissioning-red raider-cluster-ttuDesign installation-commissioning-red raider-cluster-ttu
Design installation-commissioning-red raider-cluster-ttu
AMD It's Time to ROC
AMD It's Time to ROCAMD It's Time to ROC
AMD It's Time to ROC

Similar to Heterogeneous Computing : The Future of Systems

The Cell Processor
The Cell ProcessorThe Cell Processor
The Cell Processor
Heiko Joerg Schick
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...
Filipe Miranda
LEGaTO Heterogeneous Hardware
LEGaTO Heterogeneous HardwareLEGaTO Heterogeneous Hardware
LEGaTO Heterogeneous Hardware
LEGATO project
BUD17 Socionext SC2A11 ARM Server SoC
BUD17 Socionext SC2A11 ARM Server SoCBUD17 Socionext SC2A11 ARM Server SoC
BUD17 Socionext SC2A11 ARM Server SoC
Summit 16: Deploying Virtualized Mobile Infrastructures on Openstack
Summit 16: Deploying Virtualized Mobile Infrastructures on OpenstackSummit 16: Deploying Virtualized Mobile Infrastructures on Openstack
Summit 16: Deploying Virtualized Mobile Infrastructures on Openstack
From Rack scale computers to Warehouse scale computers
From Rack scale computers to Warehouse scale computersFrom Rack scale computers to Warehouse scale computers
From Rack scale computers to Warehouse scale computers
Ryousei Takano
PCIe Gen 3.0 Presentation @ 4th FPGA Camp
PCIe Gen 3.0 Presentation @ 4th FPGA CampPCIe Gen 3.0 Presentation @ 4th FPGA Camp
PCIe Gen 3.0 Presentation @ 4th FPGA Camp
FPGA Central
OpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsOpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC Systems
HPCC Systems
Modular by Design: Supermicro’s New Standards-Based Universal GPU Server
Modular by Design: Supermicro’s New Standards-Based Universal GPU ServerModular by Design: Supermicro’s New Standards-Based Universal GPU Server
Modular by Design: Supermicro’s New Standards-Based Universal GPU Server
Rebekah Rodriguez
Q1 Memory Fabric Forum: Breaking Through the Memory Wall
Q1 Memory Fabric Forum: Breaking Through the Memory WallQ1 Memory Fabric Forum: Breaking Through the Memory Wall
Q1 Memory Fabric Forum: Breaking Through the Memory Wall
Memory Fabric Forum
Modular by Design: Supermicro’s New Standards-Based Universal GPU Server
Modular by Design: Supermicro’s New Standards-Based Universal GPU ServerModular by Design: Supermicro’s New Standards-Based Universal GPU Server
Modular by Design: Supermicro’s New Standards-Based Universal GPU Server
Rebekah Rodriguez
POWER9: IBM’s Next Generation POWER Processor
POWER9: IBM’s Next Generation POWER ProcessorPOWER9: IBM’s Next Generation POWER Processor
POWER9: IBM’s Next Generation POWER Processor
Wilhelm van Belkum
X13 Pre-Release Update featuring 4th Gen Intel® Xeon® Scalable Processors
X13 Pre-Release Update featuring 4th Gen Intel® Xeon® Scalable Processors X13 Pre-Release Update featuring 4th Gen Intel® Xeon® Scalable Processors
X13 Pre-Release Update featuring 4th Gen Intel® Xeon® Scalable Processors
Rebekah Rodriguez
NVMe Takes It All, SCSI Has To Fall
NVMe Takes It All, SCSI Has To FallNVMe Takes It All, SCSI Has To Fall
NVMe Takes It All, SCSI Has To Fall
IBM Power9 Features and Specifications
IBM Power9 Features and SpecificationsIBM Power9 Features and Specifications
IBM Power9 Features and Specifications
IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...
IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...
IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...
In-Memory Computing Summit
Optimized HPC/AI cloud with OpenStack acceleration service and composable har...
Optimized HPC/AI cloud with OpenStack acceleration service and composable har...Optimized HPC/AI cloud with OpenStack acceleration service and composable har...
Optimized HPC/AI cloud with OpenStack acceleration service and composable har...
Shuquan Huang
Amd accelerated computing -ufrj
Amd   accelerated computing -ufrjAmd   accelerated computing -ufrj
Amd accelerated computing -ufrj
Roberto Brandao
HiPEAC-CSW 2022_Kevin Mika presentation
HiPEAC-CSW 2022_Kevin Mika presentationHiPEAC-CSW 2022_Kevin Mika presentation
HiPEAC-CSW 2022_Kevin Mika presentation
VEDLIoT Project

Similar to Heterogeneous Computing : The Future of Systems (20)

The Cell Processor
The Cell ProcessorThe Cell Processor
The Cell Processor
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...
LEGaTO Heterogeneous Hardware
LEGaTO Heterogeneous HardwareLEGaTO Heterogeneous Hardware
LEGaTO Heterogeneous Hardware
BUD17 Socionext SC2A11 ARM Server SoC
BUD17 Socionext SC2A11 ARM Server SoCBUD17 Socionext SC2A11 ARM Server SoC
BUD17 Socionext SC2A11 ARM Server SoC
Summit 16: Deploying Virtualized Mobile Infrastructures on Openstack
Summit 16: Deploying Virtualized Mobile Infrastructures on OpenstackSummit 16: Deploying Virtualized Mobile Infrastructures on Openstack
Summit 16: Deploying Virtualized Mobile Infrastructures on Openstack
From Rack scale computers to Warehouse scale computers
From Rack scale computers to Warehouse scale computersFrom Rack scale computers to Warehouse scale computers
From Rack scale computers to Warehouse scale computers
PCIe Gen 3.0 Presentation @ 4th FPGA Camp
PCIe Gen 3.0 Presentation @ 4th FPGA CampPCIe Gen 3.0 Presentation @ 4th FPGA Camp
PCIe Gen 3.0 Presentation @ 4th FPGA Camp
OpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsOpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC Systems
Modular by Design: Supermicro’s New Standards-Based Universal GPU Server
Modular by Design: Supermicro’s New Standards-Based Universal GPU ServerModular by Design: Supermicro’s New Standards-Based Universal GPU Server
Modular by Design: Supermicro’s New Standards-Based Universal GPU Server
Q1 Memory Fabric Forum: Breaking Through the Memory Wall
Q1 Memory Fabric Forum: Breaking Through the Memory WallQ1 Memory Fabric Forum: Breaking Through the Memory Wall
Q1 Memory Fabric Forum: Breaking Through the Memory Wall
Modular by Design: Supermicro’s New Standards-Based Universal GPU Server
Modular by Design: Supermicro’s New Standards-Based Universal GPU ServerModular by Design: Supermicro’s New Standards-Based Universal GPU Server
Modular by Design: Supermicro’s New Standards-Based Universal GPU Server
POWER9: IBM’s Next Generation POWER Processor
POWER9: IBM’s Next Generation POWER ProcessorPOWER9: IBM’s Next Generation POWER Processor
POWER9: IBM’s Next Generation POWER Processor
X13 Pre-Release Update featuring 4th Gen Intel® Xeon® Scalable Processors
X13 Pre-Release Update featuring 4th Gen Intel® Xeon® Scalable Processors X13 Pre-Release Update featuring 4th Gen Intel® Xeon® Scalable Processors
X13 Pre-Release Update featuring 4th Gen Intel® Xeon® Scalable Processors
NVMe Takes It All, SCSI Has To Fall
NVMe Takes It All, SCSI Has To FallNVMe Takes It All, SCSI Has To Fall
NVMe Takes It All, SCSI Has To Fall
IBM Power9 Features and Specifications
IBM Power9 Features and SpecificationsIBM Power9 Features and Specifications
IBM Power9 Features and Specifications
IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...
IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...
IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...
Optimized HPC/AI cloud with OpenStack acceleration service and composable har...
Optimized HPC/AI cloud with OpenStack acceleration service and composable har...Optimized HPC/AI cloud with OpenStack acceleration service and composable har...
Optimized HPC/AI cloud with OpenStack acceleration service and composable har...
Amd accelerated computing -ufrj
Amd   accelerated computing -ufrjAmd   accelerated computing -ufrj
Amd accelerated computing -ufrj
HiPEAC-CSW 2022_Kevin Mika presentation
HiPEAC-CSW 2022_Kevin Mika presentationHiPEAC-CSW 2022_Kevin Mika presentation
HiPEAC-CSW 2022_Kevin Mika presentation

Heterogeneous Computing : The Future of Systems

  • 1. IBM Confidential Heterogeneous Computing The Future of Systems Anand Haridass Senior Technical Staff Member IBM Cognitive Systems NITK (KREC) – Batch of ‘95 (E&C) IBM Academy of Technology NITK-IBM Computer Systems Research Group (NCSRG) Seminar Sep/18/2017
  • 2. 2 Agenda System Overview Technology Trends – End of Dennard Scaling Vertical Integration - OpenPOWER “Feeding the Engine” – Memory / Storage Need for High Performance Bus – OpenCAPI GPU Attach - NVLINK Accelerator Examples
  • 3. 3 Von Neumann Architecture • First published by John von Neumann in 1945. • Design consists of a Control Unit, Arithmetic & Logic Unit (ALU), Memory Unit, Registers & Inputs/Outputs. • Stored-program computer concept instruction data and program data are stored in the same memory. • Most Servers & PC’s produced today use this design.
  • 4. 4 Typical 2 Socket Systems [2017] CPU CPU Memory Memory IO/ Storage / NW AcceleratorAccelerator IO/ Storage / NW
  • 5. 5 Processor Technology Trends Moore’s Law Alive & Kicking Moore’s Law (1965) “Number of transistors in a dense integrated circuit doubles approximately every two years”
  • 6. 6 Dennard Scaling Limits Dennard scaling: As transistors get smaller, their power density stays constant, so that power use stays in proportion with area; both voltage and current scale (downward) with length. Transistor dimensions are scaled by 30% (0.7x) every technology generation, thus reducing their area by 50%. This reduces the delay by 30% (0.7x) and therefore increases operating frequency by about 40% (1.4x). To keep the electric field constant, voltage is reduced by 30%, reducing energy by 65% and power (at 1.4x frequency) by 50%. • Voltage scaling for high-performance designs is limited • By leakage issues: can’t reduce threshold voltages • Need steeper sub-threshold slopes • Limited by variability, esp. VT variability • Need to minimize random dopant fluctuations • Limited by gate oxide thickness • Some relief from high-K materials • Limited voltage scaling + decreasing feature sizes lead to increasing electric fields • New device structures needed (FinFETs) • Reliability challenges (devices and wires)
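The percentages on the Dennard scaling slide can be reproduced with a few lines of arithmetic. The sketch below assumes ideal constant-field scaling, in which capacitance, voltage and gate delay all scale with the linear dimension; the variable names are illustrative and not taken from the deck.

```python
# Ideal (constant-field) Dennard scaling for one technology generation.
# Assumption: capacitance, voltage and gate delay all scale with length.
S = 0.7  # linear scaling factor per generation, as quoted on the slide

area = S ** 2                        # area scales with length squared
delay = S                            # gate delay scales with length
frequency = 1 / delay                # operating frequency rises as delay falls
voltage = S                          # constant electric field: V scales with length
capacitance = S                      # gate capacitance scales with length
energy = capacitance * voltage ** 2  # switching energy ~ C * V^2
power = energy * frequency           # per-transistor switching power
power_density = power / area         # power per unit area

print(f"area          x{area:.2f}")
print(f"frequency     x{frequency:.2f}")
print(f"energy        x{energy:.3f}")
print(f"power         x{power:.2f}")
print(f"power density x{power_density:.2f}")
```

Under these assumptions power per transistor falls by the same factor as area, so power density is unchanged; once voltage can no longer be reduced with each generation, that balance breaks, which is the limit the bullets on this slide describe.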
  • 7. 7 CMOS Power - Performance Scaling Where this curve is flat, can only improve chip frequency by: a) Pushing core/chip to higher power density (air cooling limits) b) Design power efficiency improvements (low-hanging fruit all gone) [Plot: relative performance metric (at constant power density) vs. feature pitch (microns); annotation: “When scaling was good…”]
  • 8. 8 Processor Technology Trends ‘Affordable’ Air Cooled Limit ~120-190W Dennard Scaling limiting from 2002-04
  • 9. 9 Processor Technology Trends Processor frequency peaks at ~6GHz and settles between 2-4GHz
  • 11. 11 Processor Technology Trends Multi-Cores (& threads) Parallel Programming to leverage
  • 12. 12 End customer doesn't care about Frequency / ST performance & other ‘processor’ metrics Cost/Performance is the metric Processors Semiconductor Technology Industry trends, Challenges & Opportunities Microprocessors alone no longer drive sufficient Cost/Performance improvements
  • 13. 13 System stack innovations are required to drive Cost/Performance
  • 15. 15 Materials Innovations - Increased Complexity & Cost Global Foundries projects that a computer chip manufacturing plant in NY would cost $14.7 billion to build
  • 16. 16 “Feeding the Engine” Challenge “Data Access” Performance (bandwidth & latency) & Cost (Power) still very challenging Some techniques to hide latency/bw/pwr: Caches, Locality optimization, Out-of-order execution, Multithreading, Pre-fetching, “Fat” pipes / Memory Buffers [Figure: latency spectrum from Memory (ns) through Storage Class Memory (100 – 1000ns) to Storage. Source: SNIA]
  • 17. 17 Memory / Storage [Figure: access latency in uP cycles (@ 4GHz), on a scale from 2^1 to 2^23 cycles, covering L1/L2 (SRAM), L3/L4, DRAM, Storage Class of Memory, Flash and HDD; “Memory Calls” (Load/Store) vs. “I/O Calls” (Read/Writes). Source: H. Hunter, IBM] NVMe - Non-Volatile Memory express (PCIe) • Standardized high performance interface for PCI Express SSD. Available today in three different form factors: PCIe Add in Card, SFF 2.5” and M.2 • PCIe Gen3 (today) x8 ~8GB/s [x4 ~4GB/s, x2 ~2GB/s] vs SAS 12Gb/s [1.5GB/s /port] • PCIe Gen4 (2018) x8 ~16GB/s [x4 ~8GB/s, x2 ~4GB/s] vs SAS 24Gb/s [3GB/s /port] NVMe over fabrics (low latency RDMA access) <10us including switches CAPI based Flash (today) x16 (16GB/s) – at faster access latencies (more on this later) HBM (High Bandwidth memory) • 3D Stacked DRAM from AMD/Hynix/Samsung • HBM2 256GB/sec ~4GB/package (8 DRAM TSV stacked) • 1024bits x 2GT/s • HBM3 512GB/sec ~2020 time frame NVDIMM • Persistent memory solution on DDR interface • Combines DRAM, NAND Flash and power source • Delivers DRAM R/W perf with the persistence & reliability of NAND
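The bandwidth figures on this slide follow from interface width times transfer rate. The sketch below is a rough check that computes raw per-direction peaks only, assuming 8 GT/s per lane for PCIe Gen3 and 16 GT/s for Gen4 and ignoring line-encoding and protocol overhead; the helper name `peak_gb_per_s` is hypothetical.

```python
def peak_gb_per_s(width_bits: int, gigatransfers_per_s: float) -> float:
    """Raw peak bandwidth in GB/s: bits per transfer x GT/s / 8 bits per byte."""
    return width_bits * gigatransfers_per_s / 8


# HBM2: 1024-bit interface at 2 GT/s, as listed on the slide.
hbm2 = peak_gb_per_s(1024, 2)

# PCIe x8: eight serial lanes, one bit per lane per transfer.
# Assumption: 8 GT/s (Gen3) and 16 GT/s (Gen4) per lane, encoding overhead ignored.
pcie_gen3_x8 = peak_gb_per_s(8, 8)
pcie_gen4_x8 = peak_gb_per_s(8, 16)

# SAS 12Gb/s: a single serial link, raw line rate.
sas_12g = peak_gb_per_s(1, 12)

print(f"HBM2         {hbm2:.1f} GB/s")
print(f"PCIe Gen3 x8 {pcie_gen3_x8:.1f} GB/s")
print(f"PCIe Gen4 x8 {pcie_gen4_x8:.1f} GB/s")
print(f"SAS 12Gb/s   {sas_12g:.1f} GB/s per port")
```

Delivered throughput is lower than these raw figures once encoding and protocol overhead are included, which is consistent with the slide quoting the PCIe numbers as approximate.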
  • 18. 18 Source: SNIA The Contenders https://www.snia.org/sites/default/files/NVM/2016/presentations/Panel_1_Combined_NVM_Futures%20Revision.pdf
  • 19. 19 Function offload – greater concurrency & utilization Power efficiency (performance/watt) Workloads Encryption-decryption / Compression- decompression / Encoding-decoding / Network Controllers / Math Libraries / DB queries / Search Deep Learning (Arms race !) for training & inferencing Hardware Acceleration Types of Accelerators General Purpose GPU / Many Integrated Core (MIC) Nvidia Tesla/Volta, Intel Xeon Phi, AMD Radeon Field Programmable Gate Array (FPGA) Xilinx, Altera (now Intel) Purpose Built / Custom ASIC’s Google’s TPU Intelligent Network Controllers Cavium ARM-accelerated NIC Mellanox NIC+FPGA Microsoft FPGA-only network adapter Traditionally (“IO” limited) sequential instructions on processor / parallel compute offloaded to accelerator Penalty for “IO” operations heavy
  • 20. 20 HPC & Hyper-scale datacenters (Cloud) are driving need for higher network bandwidth HPC & Deep learning require more bandwidth between accelerators and memory PCI Express has limitations (coherence / bandwidth / protocol overhead) Desired Attributes Low Latency / High Bandwidth / Coherence Emergence of complex storage & memory solutions (BW & latency & heterogeneity) Growing demand for network performance (BW & latency) Various form factors (e.g., GPUs, FPGAs, ASICs, etc.) Open standard for broad industry, architecture agnostic participation / avoid vendor lock-in Volume pricing advantages & Broad software ecosystem growth and adoption Vendor specific variants Intel Omni Path Architecture, Nvidia Nvlink, AMD Hypertransport Open Standards evolving Cache Coherent Interconnect for Accelerators (CCIX) www.ccixconsortium.com Gen-Z genzconsortium.org Open Coherent Accelerator Processor Interface (OpenCAPI) opencapi.org Need for High Performance Next Generation Bus/Interconnect
  • 21. 21 Coherent Accelerator Processor Interface (CAPI) - 2014 CAPP PCIe Power Processor FPGA Functionn Function0 Function1 Function2 CAPI IBM Supplied POWER Service Layer Virtual Addressing Removes the requirement for pinning system memory for PCIe transfers Eliminates the copying of data into and out of the pinned DMA buffers Eliminates the operating system call overhead to pin memory for DMA Accelerator can work with same addresses that the processors use Pointers can be de-referenced same as the host application - Example: Enables the ability to traverse data structures Coherent Caching of Data Enables an accelerator to cache data structures Enables Cache to Cache transfers between accelerator and processor Enables the accelerator to participate in “Locks” as a normal thread Elimination of Device Driver Direct communication with Application No requirement to call an OS device driver or Hypervisor function for mainline processing Enables Accelerator Features not possible with PCIe Enables efficient Hybrid Applications Applications partially implemented in the accelerator and partially on the host CPU Visibility to full system memory Simpler programming model for Application Modules Coherent Accelerator Processor Proxy (CAPP) – Proxy for FPGA Accelerator on PowerBus – Integrated into Processor – Programmable (Table Driven) Protocol for CAPI – Shadow Cache Directory for Accelerator • Up to 1MB Cache Tags (Line based) • Larger block based Cache POWER Service Layer (PSL) – Implemented in FPGA Technology – Provides Address Translation for Accelerator • Compatible with POWER Architecture – Provides Cache for Accelerator – Facilities for downloading Accelerator Functions
  • 22. 22 How CAPI Works [Diagram: POWER8 Processor and CAPI Developer Kit Card connected over PCIe, each labelled “Algorithm”] Acceleration Portion: Data or Compute Intensive, Storage or External I/O Application Portion: Data Set-up, Control Sharing the same memory space Accelerator is a peer to POWER8 Core Coherent Accelerator Processor Interface (CAPI) - 2014 Accelerator is a Full Peer to Processor Accelerator Function(s) use an unmodified Effective address Full access to Real address space Utilize Processor’s Page Tables Directly Page Faults handled by System Software Multiple Functions can exist in a single Accelerator
  • 23. 23 Memory Subsystem Virt Addr IO Attached Accelerator POWER8 Core POWER8 Core POWER8 Core POWER8 Core POWER8 Core POWER8 Core App FPGA PCIE Variables Input Data DD Device Driver Storage Area Variables Input Data Variables Input Data Output Data Output Data An application called a device driver to utilize an FPGA Accelerator. The device driver performed a memory mapping operation. 3 versions of the data (not coherent). 1000s of instructions in the device driver.
  • 24. 24 Memory Subsystem Virt Addr CAPI Coherency POWER8 Core POWER8 Core POWER8 Core POWER8 Core POWER8 Core POWER8 Core App FPGA PCIE With CAPI, the FPGA shares memory with the cores PSL Variables Input Data Output Data 1 coherent version of the data. No device driver call/instructions.
  • 25. 25 Typical I/O Model Flow: Flow with a Coherent Model: Shared Mem. Notify Accelerator Acceleration Shared Memory Completion DD Call Copy or Pin Source Data MMIO Notify Accelerator Acceleration Poll / Interrupt Completion Copy or Unpin Result Data Ret. From DD Completion Application Dependent, but Equal to below Application Dependent, but Equal to above 300 Instructions 10,000 Instructions 3,000 Instructions 1,000 Instructions 1,000 Instructions 7.9µs 4.9µs Total ~13µs for data prep 400 Instructions 100 Instructions 0.3µs 0.06µs Total 0.36µs CAPI vs. I/O Device Driver: Data Prep
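The totals on the data-prep comparison can be checked by summing the figures listed on the slide. In the sketch below the numbers are grouped only by flow (device-driver model vs. coherent model), because the flattened transcript does not preserve which count belongs to which individual step; the list names are illustrative.

```python
# Figures copied from the slide, grouped by flow only (per-step mapping not recoverable).
io_instructions = [300, 10_000, 3_000, 1_000, 1_000]  # typical I/O (device driver) model
io_latency_us = [7.9, 4.9]

coherent_instructions = [400, 100]                     # coherent (CAPI) model
coherent_latency_us = [0.3, 0.06]

io_total_instr = sum(io_instructions)
io_total_us = sum(io_latency_us)
coherent_total_instr = sum(coherent_instructions)
coherent_total_us = sum(coherent_latency_us)

print(f"I/O model:      {io_total_instr} instructions, {io_total_us:.1f} us")
print(f"Coherent model: {coherent_total_instr} instructions, {coherent_total_us:.2f} us")
print(f"Instruction ratio: {io_total_instr / coherent_total_instr:.1f}x")
print(f"Latency ratio:     {io_total_us / coherent_total_us:.1f}x")
```

The summed latencies match the slide's "~13µs" and "0.36µs" totals, which puts the data-prep overhead of the device-driver path at roughly 35 times that of the coherent path for the figures shown.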
  • 26. 26 IBM Accelerated GZIP Compression An FPGA-based low-latency GZIP Compressor & Decompressor with single-thread throughput of ~2GB/s and a compression rate significantly better than low-CPU overhead compressors like snappy.
  • 29. 29 CAPI Acceleration 29 Examples: Encryption, Compression, Erasure prior to network or storage Processor Chip Acc Data Egress Transform DLx/TLx Processor Chip Acc Data Bi-Directional Transform Acc TLx/DLx Examples: NoSQL such as Neo4J with Graph Node Traversals, etc Needle-in-a-haystack Engine Examples: Machine or Deep Learning potentially using OpenCAPI attached memory Memory Transform Processor Chip Acc DataDLx/TLx Example: Basic work offload Processor Chip Acc NeedlesDLx/TLx Examples: Database searches, joins, intersections, merges Ingress Transform Processor Chip Acc DataDLx/TLx Examples: Video Analytics, HFT, VPN/IPsec/SSL, Deep Packet Inspection (DPI), Data Plane Accelerator (DPA), Video Encoding (H.265), etc Needle-In-A-Haystack Engine Haystack Data OpenCAPI WINS due to Bandwidth to/from accelerators, best of breed latency, and flexibility of an Open architecture
  • 30. 30 NVLink 1 4 links 20 GBps per link raw bandwidth each direction ~160GBps total net NVLink bandwidth NVLink 2 6 links 25GBps per link raw bandwidth each direction ~300GBps total net NVLink bandwidth Volta GV100 • 15 TFLOPS FP32 • 16GB HBM2 – 900 GB/s • 300W TDP • 50 GFLOPS/W (FP32) • 12nm process • 300GB/s NV Link2 • Tensor Core.... Source: Nvidia NVIDIA GPU
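The aggregate NVLink figures on this slide are consistent with links times per-link rate times two directions, and the efficiency figure with peak FLOPS divided by TDP. The sketch below is a minimal check using only numbers from the slide; the function name is hypothetical.

```python
def nvlink_total_gb_per_s(links: int, gb_per_s_per_direction: float) -> float:
    """Aggregate bandwidth over all links, counting both directions."""
    return links * gb_per_s_per_direction * 2


nvlink1 = nvlink_total_gb_per_s(4, 20)  # NVLink 1: 4 links at 20 GB/s each direction
nvlink2 = nvlink_total_gb_per_s(6, 25)  # NVLink 2: 6 links at 25 GB/s each direction

# Volta GV100: 15 TFLOPS FP32 at 300 W TDP.
gv100_gflops_per_watt = 15_000 / 300

print(f"NVLink 1: {nvlink1:.0f} GB/s")
print(f"NVLink 2: {nvlink2:.0f} GB/s")
print(f"GV100:    {gv100_gflops_per_watt:.0f} GFLOPS/W (FP32)")
```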
  • 31. 31 “Minsky” S822LC for HPC • Tight coupling: strong CPU: strong GPU performance • Equalizing access to memory - for all kinds of programming • Closer programming to the CPU paradigm 115GB/S 115GB/S NVLink DDR4 P8’ DDR4 P8’ Tesla P100 Tesla P100 80GB/S Tesla P100 Tesla P100 80GB/S OpenPOWER P8’ Design PCIe 32GBps GPUGPU x86x86 GPUGPU GPUGPU x86x86 GPUGPU For x86 Servers: PCIe Bottleneck No NVLink between CPU & GPU 2.7X faster query response time on “Minsky” 87% of the total speedup (2.35x of 2.7x improvement) is due to the NVLink Interface from CPU:GPU • Profiling result based on running Kinetica “Filter by geographic area” queries on data set of 280 million simulated 1 simultaneous query stream each with 0 think time. • Power System S822LC for HPC; 20 cores (2 x 10c chips) / 160 threads, POWER8 with NVLink; 2.86 GHz, 1024 GB memory, 2x 6Gb SSDs, 2-port 10 GbEth, 4x Tesla P100 GPU; Ubuntu 16.04. • Competitive stack: 2x Xeon E5-2640 v4; 20 cores (2 x 10c chips) / 40 threads; Intel Xeon E5-2640 v4; 2.4 GHz; 512GB memory 2x 6Gb SSDs, 2-port 10 GbEth, 4x Tesla P100 GPU, Ubuntu 16.04.
  • 32. 32 Custom ASIC’s Reducing Flexibility CPU > GPU > FPGA > ASIC Increasing Efficiency CPU < GPU < FPGA < ASIC Source: William Dally, Nvidia
  • 33. 33 Google TPU 1.0 [Jouppi et al., ISCA 2017] Relative performance/Watt (TDP) of GPU server (blue) and TPU server (red) to CPU server, and TPU server to GPU server (orange). TPU’ is an improved TPU that uses GDDR5 memory. The green bar shows its ratio to the CPU server, and the lavender bar shows its relation to the GPU server. Total includes host server power, but incremental doesn’t. GM and WM are the geometric and weighted means.
  • 34. 34 Google TPU performance Stars are for the TPU Triangles are for the K80 Circles are for Haswell. [Jouppi et al., ISCA 2017]
  • 35. 35 Microsoft Azure FPGA Usage [M.Russinovich, MSBuild 2017] FPGA for SDN Offload FPGA for Bing
  • 36. 36 Hardware Micro-services A hardware-only self-contained service that can be distributed and accessed from across the datacenter compute fabric
  • 37. 37 Ease of Consumption Compiler Optimization Math libraries optimization Native Support for CUDA / OpenMP / OpenCL .. Native Support for Frameworks for eg for Deep Learning (Torch/Tensorflow/Caffe …)
  • 38. 38 POWER9 (SO) – Premier Accelerator Platform [Block diagram: POWER9 cores and on-chip accelerators on an on-chip interconnect, with Memory (DDR4), I/O (PCIe Gen4), CAPI, SMP, NVLink and OpenCAPI interfaces; 16Gb/s and 25Gb/s signaling] 2 Socket SMP: 256 GB/s OpenCAPI and/or NVLink 2.0: 200-300 GB/s 3 x16 PCIe G4: 192 GB/s CAPI 2.0 Links: 128 GB/s (uses up to 2 x16 ports) 8 DDR4 ports @ 2667 MT/s Attached devices: PCIe Device, IBM / Partner Device, NVIDIA GPU Bandwidths shown are bi-directional 512k L2 / SMT8 Core + 120MB L3 NUCA Cache
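The PCIe and CAPI 2.0 numbers on the POWER9 slide can be derived from lane counts, since the slide states that bandwidths are bi-directional. The sketch below assumes 16 Gb/s per lane for PCIe Gen4 (which CAPI 2.0 runs over) and ignores encoding overhead; `bidir_gb_per_s` is a hypothetical helper.

```python
def bidir_gb_per_s(lanes: int, gbit_per_s_per_lane: float) -> float:
    """Raw bi-directional bandwidth in GB/s for a group of serial lanes."""
    return lanes * gbit_per_s_per_lane * 2 / 8


# Assumption: 16 Gb/s per lane (PCIe Gen4 signaling), no encoding overhead.
pcie_g4_3x16 = bidir_gb_per_s(3 * 16, 16)  # three x16 ports
capi2_2x16 = bidir_gb_per_s(2 * 16, 16)    # CAPI 2.0 uses up to two of the x16 ports

print(f"3 x16 PCIe G4: {pcie_g4_3x16:.0f} GB/s")
print(f"CAPI 2.0:      {capi2_2x16:.0f} GB/s")
```

Because CAPI 2.0 borrows PCIe ports, the 128 GB/s is a share of the 192 GB/s PCIe budget rather than additional bandwidth.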
  • 39. 39 Newell POWER9 System - 6 GPU / 2CAPI
  • 41. 41 Source: SNIA / Flash Summit
  • 42. 42 When to Use FPGAs: Transistor Efficiency & Extreme Parallelism Bit-level operations Variable-precision floating point Power-Performance Advantage >2x compared to Multicore (MIC) or GPGPU Unused LUTs are powered off Technology Scaling better than CPU/GPU FPGAs are not frequency or power limited yet 3D has great potential Dynamic reconfiguration Flexibility for application tuning at run-time vs. compile-time Additional advantages when FPGAs are network connected ... allows network as well as compute specialization When to Use GPGPUs: Extreme FLOPS & Parallelism Double-precision floating point leadership Hundreds of GPGPU cores Programming Ease & Software Group Interest CUDA & extensive libraries OpenCL IBM Java (coming soon) Bandwidth Advantage on Power Start w/PCIe gen3 x16 and then move to NVLink Leverage existing GPGPU eco-system and development base Lots of existing use-cases to build on Heavy HPC investment in GPGPU
  • 43. 43 CCIX Source: Brad Benton, AMD, OpenFabrics Alliance Annual Workshop 2017
  • 44. 44 Gen-Z Source: Brad Benton, AMD, OpenFabrics Alliance Annual Workshop 2017
  • 45. Use Cases – A truly heterogeneous architecture built upon OpenCAPI OpenCAPI 3.0 OpenCAPI 3.1 OpenCAPI specifications are downloadable from the website at www.opencapi.org - Register - Download
  • 46. OpenCAPI Advantages for Memory Open standard interface enables attaching a wide range of devices OpenCAPI protocol was architected to minimize latency Especially advantageous for classic DRAM memory Extreme bandwidth beyond classical DDR memory interface Agnostic interface allows extension to evolving memory technologies in the future (e.g., compute-in-memory) Ability to handle a memory buffer to decouple raw memory and host interfaces to optimize power, cost and performance Common physical interface between non-memory and memory devices
  • 47. 47 OpenCAPI Key Attributes • Architecture agnostic bus – Applicable with any system/microprocessor architecture • Coherency - Attached devices operate natively within application’s user space and coherently with host uP • High performance interface design with no ‘overhead’ and optimized for high bandwidth and low latency • Point to point construct optimized within a system • Allows attached device to fully participate in application without kernel involvement/overhead • 25Gbit/sec signaling and protocol to enable very low latency interface on CPU and attached device • Supports a wide range of use cases and access semantics • Hardware accelerators • High-performance I/O devices • Advanced memories and Classic memory • Various form factors (e.g., GPUs, FPGAs, ASICs, memory, etc.) • Reduced complexity of design implementation • Wanted to make this easy for the accelerator, memory and system design teams • Moved complexities of coherence and virtual addressing onto the host microprocessor to simplify attached devices and facilitate interoperability across multiple CPU architectures
  • 48. Virtual Addressing and Benefits An OpenCAPI device operates in the virtual address spaces of the applications that it supports • Eliminates kernel and device driver software overhead • Allows device to operate on application memory without kernel-level data copies/pinned pages • Simplifies programming effort to integrate accelerators into applications • Improves accelerator performance The Virtual-to-Physical Address Translation occurs in the host CPU • Reduces design complexity of OpenCAPI-attached devices • Makes it easier to ensure interoperability between OpenCAPI devices and different CPU architectures • Security - Since the OpenCAPI device never has access to a physical address, this eliminates the possibility of a defective or malicious device accessing memory locations belonging to the kernel or other applications that it is not authorized to access