Amazon Elastic Inference Documentation

Amazon Elastic Inference allows you to attach just the amount of GPU-powered inference acceleration you need to any Amazon EC2 instance, Amazon SageMaker instance, or Amazon ECS task. This means you can choose the CPU instance that is best suited to the overall compute, memory, and storage needs of your application, and then separately configure the amount of GPU-powered inference acceleration that you need.

Integrated with Amazon SageMaker, Amazon EC2, and Amazon ECS

There are multiple ways to run inference workloads on AWS: deploy your model on Amazon SageMaker for a fully managed experience, or run it on Amazon EC2 instances or Amazon ECS tasks and manage it yourself. Amazon Elastic Inference is integrated to work with Amazon SageMaker, Amazon EC2, and Amazon ECS, allowing you to add inference acceleration in all scenarios. You can specify the desired amount of inference acceleration when you create your model's HTTPS endpoint in Amazon SageMaker, when you launch your Amazon EC2 instance, and when you define your Amazon ECS task.
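The three integration points described above can be sketched as request parameters. This is a minimal illustration, not a complete deployment: the helper function names, the instance types, the `eia2.medium` accelerator size, and the `device_1` device name are assumptions chosen for the example. The dictionaries are shaped for the boto3 calls `create_endpoint_config` (SageMaker), `run_instances` (EC2), and `register_task_definition` (ECS). Networking, IAM roles, and model artifacts are omitted.

```python
def sagemaker_endpoint_config(model_name, accelerator_type="ml.eia2.medium"):
    """Parameters for sagemaker.create_endpoint_config(**params)."""
    return {
        "EndpointConfigName": f"{model_name}-config",
        "ProductionVariants": [{
            "VariantName": "AllTraffic",
            "ModelName": model_name,
            "InstanceType": "ml.m5.large",        # assumed CPU instance type
            "InitialInstanceCount": 1,
            "AcceleratorType": accelerator_type,  # acceleration is chosen here
        }],
    }


def ec2_run_instances_params(ami_id, accelerator_type="eia2.medium"):
    """Parameters for ec2.run_instances(**params)."""
    return {
        "ImageId": ami_id,
        "InstanceType": "c5.large",               # assumed CPU instance type
        "MinCount": 1,
        "MaxCount": 1,
        "ElasticInferenceAccelerators": [
            {"Type": accelerator_type, "Count": 1},
        ],
    }


def ecs_task_definition(family, image, accelerator_type="eia2.medium"):
    """Parameters for ecs.register_task_definition(**params)."""
    device = "device_1"                           # hypothetical device name
    return {
        "family": family,
        "inferenceAccelerators": [
            {"deviceName": device, "deviceType": accelerator_type},
        ],
        "containerDefinitions": [{
            "name": "inference",
            "image": image,
            "memory": 2048,
            # The container refers to the accelerator by its device name.
            "resourceRequirements": [
                {"type": "InferenceAccelerator", "value": device},
            ],
        }],
    }
```

In each case the CPU resource (instance type or container) and the accelerator size are separate fields, which is what lets you size them independently.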

TensorFlow, Apache MXNet, and PyTorch support

Amazon Elastic Inference is designed to be used with AWS’s enhanced versions of TensorFlow Serving, Apache MXNet, and PyTorch. These enhancements enable the frameworks to detect the presence of inference accelerators, optimally distribute the model operations between the accelerator’s GPU and the instance’s CPU, and securely control access to your accelerators using AWS Identity and Access Management (IAM) policies. The enhanced TensorFlow Serving, MXNet, and PyTorch libraries are provided in Amazon SageMaker, AWS Deep Learning AMIs, and AWS Deep Learning Containers, so you don't have to make any code changes to deploy your models in production.
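The IAM-based access control mentioned above can be illustrated with a policy document. The following is a minimal sketch, assuming that the role attached to your instance needs the `elastic-inference:Connect` action to reach its accelerator. The function name is hypothetical, and a production policy would likely include further statements and a narrower `Resource`.

```python
import json


def elastic_inference_connect_policy():
    """Build a minimal IAM policy document as a Python dict."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            # Assumed action that lets the instance connect to an accelerator.
            "Action": ["elastic-inference:Connect"],
            "Resource": "*",
        }],
    }


# Serialize to the JSON text that IAM expects for a policy document.
policy_json = json.dumps(elastic_inference_connect_policy(), indent=2)
print(policy_json)
```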

Open Neural Network Exchange (ONNX) format support

ONNX is an open format that makes it possible to train a model in one deep learning framework and then transfer it to another for inference. This allows you to take advantage of the relative strengths of different frameworks. ONNX is integrated into PyTorch, MXNet, Chainer, Caffe2, and Microsoft Cognitive Toolkit, and there are connectors for many other frameworks including TensorFlow. To use ONNX models with Amazon Elastic Inference, your trained models need to be transferred to the AWS-optimized version of Apache MXNet for production deployment.

Choice of single or mixed precision operations

Amazon Elastic Inference accelerators support both single-precision (32-bit floating point) operations and mixed-precision (16-bit floating point) operations. Single precision provides an extremely large numerical range to represent the parameters used by your model. However, most models do not need this much precision, and computing with 32-bit values costs performance unnecessarily. Mixed-precision operations avoid that cost by halving the width of each value from 32 bits to 16 bits, which yields greater inference performance.
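The trade-off between the two formats can be seen with a small experiment. This sketch uses only the Python standard library to round-trip a value through 32-bit and 16-bit IEEE 754 encodings. It illustrates the number formats themselves, not how the accelerator executes operations, and the helper name `round_trip` is an assumption for the example.

```python
import struct


def round_trip(value, fmt):
    """Encode a float in the given struct format and decode it again."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


value = 0.1
as_fp32 = round_trip(value, "<f")  # single precision, 4 bytes
as_fp16 = round_trip(value, "<e")  # half precision, 2 bytes

# Absolute error introduced by each encoding.
error_fp32 = abs(as_fp32 - value)
error_fp16 = abs(as_fp16 - value)

print("bytes per value:", struct.calcsize("<f"), "vs", struct.calcsize("<e"))
print("fp32 error:", error_fp32)
print("fp16 error:", error_fp16)
```

The 16-bit value takes half the storage and carries a larger rounding error, which many inference workloads tolerate in exchange for throughput.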

Available in multiple amounts of acceleration

Amazon Elastic Inference is available in multiple throughput sizes ranging from 1 to 32 trillion floating point operations per second (TFLOPS) per accelerator, making it efficient for accelerating a wide range of inference models including computer vision, natural language processing, and speech recognition. Compared to standalone Amazon EC2 P3 instances that start at 125 TFLOPS (the smallest P3 instance available), Amazon Elastic Inference starts at a single TFLOPS per accelerator. This allows you to scale up inference acceleration in more appropriate increments. You can also select from larger accelerator sizes, up to 32 TFLOPS per accelerator, for more complex models.
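Choosing a size in appropriate increments can be sketched as a lookup over the available accelerators. The table below is an assumption for illustration: the type names and the per-precision TFLOPS figures are not stated in this document and should be checked against the current accelerator listing. The function name is hypothetical.

```python
# Assumed accelerator sizes: (type name, FP32 TFLOPS, FP16 TFLOPS).
ACCELERATORS = [
    ("eia2.medium", 1, 8),
    ("eia2.large", 2, 16),
    ("eia2.xlarge", 4, 32),
]


def smallest_accelerator(required_tflops, mixed_precision=True):
    """Return the smallest accelerator type that meets the requirement.

    Returns None when no single accelerator in the table is large enough.
    """
    for name, fp32_tflops, fp16_tflops in ACCELERATORS:
        capacity = fp16_tflops if mixed_precision else fp32_tflops
        if capacity >= required_tflops:
            return name
    return None


print(smallest_accelerator(10))                        # mixed precision
print(smallest_accelerator(1, mixed_precision=False))  # single precision
```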

Auto-scaling

Amazon Elastic Inference can be part of the same Amazon EC2 Auto Scaling group you use to scale your Amazon SageMaker, Amazon EC2, and Amazon ECS instances. When EC2 Auto Scaling adds more EC2 instances to meet the demands of your application, each new instance launches with its own attached accelerator, so acceleration capacity grows with the group. Similarly, when Auto Scaling removes EC2 instances as demand goes down, the accelerators attached to those instances are removed with them. This makes it easier to scale your inference acceleration alongside your application’s compute capacity.
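One way to picture this is a launch template that carries the accelerator specification, so that every instance the group launches gets the same accelerator. The sketch below builds a dictionary shaped like the `LaunchTemplateData` argument of the boto3 `create_launch_template` call. The function names, instance type, and accelerator size are assumptions for the example, and no AWS call is made.

```python
def launch_template_data(ami_id, instance_type="c5.large",
                         accelerator_type="eia2.medium"):
    """Build LaunchTemplateData that attaches one accelerator per instance."""
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "ElasticInferenceAccelerators": [
            {"Type": accelerator_type, "Count": 1},
        ],
    }


def accelerators_in_group(instance_count, template):
    """Total accelerators in a group of instances built from the template."""
    per_instance = sum(
        spec["Count"] for spec in template["ElasticInferenceAccelerators"]
    )
    return instance_count * per_instance


template = launch_template_data("ami-0123456789abcdef0")  # placeholder AMI ID
print(accelerators_in_group(2, template))  # group at two instances
print(accelerators_in_group(5, template))  # group after scaling out
```

Because the accelerator is part of the template, accelerator count tracks instance count in both directions.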

Additional Information

For additional information about service controls, security features and functionalities, including, as applicable, information about storing, retrieving, modifying, restricting, and deleting data, please see https://docs.aws.amazon.com/index.html. This additional information does not form part of the Documentation for purposes of the AWS Customer Agreement available at http://aws.amazon.com/agreement, or other agreement between you and AWS governing your use of AWS’s services.