PR-274 | Mixed Precision Training
2020/09/06
Ho Seong Lee (hoya012)
Cognex Deep Learning Lab, Research Engineer
Contents
• Introduction
• Related Work
• Implementation
• Results
• PyTorch 1.6 AMP New features & Experiment
• Conclusion
Introduction
Main Contributions
Increasing the size of a neural network typically improves accuracy
• But it also increases the memory and compute requirements for training the model.
• Introduce a methodology for training deep neural networks using half-precision floating-point numbers, without losing model accuracy or having to modify hyper-parameters.
• Introduce three techniques to prevent model accuracy loss.
• Using these techniques, demonstrate that a wide variety of network architectures and applications can be trained to match the accuracy of FP32 training.
Related Works
Network Compression
• Low-precision Training
• Train networks with low-precision weights.
• Quantization
• Quantize a pretrained model by reducing the number of bits.
• Pruning
• Remove connections according to an importance criterion.
• Dedicated architectures
• Design architectures to be memory efficient, such as SqueezeNet, MobileNet, and ShuffleNet.
Related Works
Network Compression in PR-12 Study
• In total, 23 papers were covered! → 23/274 = almost 8%!
• But low-precision training is, as far as I know, covered here for the first time.
Related Works
Related Works – Low Precision Training
• “BinaryConnect: Training deep neural networks with binary weights during propagations”, 2015 NIPS
• Proposed training with binary weights; all other tensors and arithmetic were kept in full precision.
• “Binarized neural networks.”, 2016 NIPS
• Also binarize the activations, but gradients were stored and computed in single precision.
• “Quantized neural networks: Training neural networks with low precision weights and activations”, 2016 arXiv
• Quantize weights and activations to 2, 4, and 6 bits, but gradients were real numbers.
• “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”, 2016 ECCV
• Binarize all tensors, including the gradients, but this led to a non-trivial loss of accuracy.
Related Works
Main Contributions
• All tensors and arithmetic for forward and backward passes use reduced precision, FP16.
• No hyper-parameters (such as layer width) are adjusted.
• Models trained with these techniques do not incur accuracy loss when compared to FP32 baselines.
• Demonstrate that this technique works across a variety of applications.
Implementation
IEEE 754 Floating Point Representation
• A number is represented as (−1)^S × 1.M × 2^(E − Bias), where S is the sign bit, M the mantissa, and E the biased exponent (see the decomposition example below).
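As a quick illustration of this formula, the sketch below (assuming numpy is available) splits an FP16 value into its sign, exponent, and mantissa fields and rebuilds the number; the field widths and bias match the FP16 row of the format table below.

```python
import numpy as np

def fp16_fields(x):
    """Split an FP16 number into its IEEE 754 fields (1 sign, 5 exponent, 10 mantissa bits)."""
    bits = int(np.array([x], dtype=np.float16).view(np.uint16)[0])
    sign = (bits >> 15) & 0x1
    exponent = (bits >> 10) & 0x1F
    mantissa = bits & 0x3FF
    return sign, exponent, mantissa

s, e, m = fp16_fields(-1.5)
value = (-1) ** s * (1 + m / 2 ** 10) * 2 ** (e - 15)  # FP16 bias is 15 (normal numbers only)
print(s, e, m, value)  # 1 15 512 -1.5
```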
Implementation
Bonus) New Floating-Point Formats

Format               Sign    Exponent    Mantissa
IEEE 754 FP32        1 bit   8 bit       23 bit
IEEE 754 FP16        1 bit   5 bit       10 bit
Google bfloat16      1 bit   8 bit       7 bit
NVIDIA TensorFloat   1 bit   8 bit       10 bit
AMD FP24             1 bit   7 bit       16 bit
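A small sketch (assuming PyTorch is installed) to inspect the numerical range of the storage formats above; TensorFloat-32 is exposed in PyTorch as a matmul mode rather than a storage dtype, so it is omitted.

```python
import torch

# Compare max value, smallest normal value, and machine epsilon of each storage dtype.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):>15s}  max={info.max:.3e}  min normal={info.tiny:.3e}  eps={info.eps:.3e}")
```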
Implementation
1. FP32 Master copy of weights
• In mixed precision training, weights, activations, and gradients are stored as FP16, halving the storage and bandwidth requirements for these tensors.
• In order to match the accuracy of FP32 networks, an FP32 master copy of the weights is maintained and updated with the weight gradient during the optimizer step (a minimal sketch follows below).
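A minimal sketch of the master-copy idea with plain SGD (assumes a CUDA GPU; the toy objective and sizes are illustrative, and real tooling such as Apex or torch.cuda.amp handles this automatically):

```python
import torch
from torch import nn

model = nn.Linear(1024, 1024).cuda().half()                         # FP16 weights for forward/backward
master = [p.detach().clone().float() for p in model.parameters()]   # FP32 master copy
lr = 0.01

x = torch.randn(64, 1024, device="cuda", dtype=torch.float16)
loss = model(x).float().pow(2).mean()        # toy objective
loss.backward()                              # produces FP16 weight gradients

with torch.no_grad():
    for p, m in zip(model.parameters(), master):
        m -= lr * p.grad.float()             # optimizer update applied in FP32
        p.copy_(m.half())                    # refresh the FP16 working copy from the master
        p.grad = None
```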
Implementation
1. FP32 Master copy of weights → Why?
• The weight update (weight gradient multiplied by the learning rate) becomes too small to be represented in FP16 when it is smaller than 2^-24, as the example below shows.
W_new = W_old − η · ∂E/∂W
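A small numeric check of this claim (PyTorch assumed; the gradient value and learning rate are made up for illustration): an update of roughly 2^-26 underflows to zero when computed in FP16 but survives in FP32.

```python
import torch

grad = torch.tensor(2.0 ** -16, dtype=torch.float16)  # a representable FP16 gradient
lr = 1e-3

update_fp16 = lr * grad          # ~2**-26, below 2**-24 -> rounds to zero in FP16
update_fp32 = lr * grad.float()  # the same update kept in FP32

print(update_fp16.item())  # 0.0 -> the weight would never move
print(update_fp32.item())  # ~1.53e-08
```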
Implementation
1. FP32 Master copy of weights → Experiments
• Train the Mandarin speech model with and without an FP32 master copy of the weights.
• Updating the FP16 weights directly results in an 80% relative accuracy loss, i.e. much worse than keeping an FP32 master copy.
Implementation
2. Loss Scaling
• Activation gradient values tend to be dominated by small magnitudes.
• Scaling them by a factor of 8 is sufficient to match the accuracy achieved with FP32 training.
• This means activation gradient values below 2^-27 were irrelevant to training (scaling by 8 = 2^3 lifts anything above 2^-27 into the FP16-representable range, which starts at 2^-24), while values above 2^-27 still had to be preserved.
Implementation
2. Loss Scaling
• One efficient way to shift the gradient values into FP16-representable range is to scale the loss value
computed in the forward pass, prior to starting back-propagation.
• This keeps the relevant gradient values from flushing to zero.
• Weight gradients must be unscaled before the weight update to maintain the update magnitudes (a minimal sketch of this scale/unscale step follows below).
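A minimal sketch of this static loss-scaling recipe (assumes a CUDA GPU; the model, data, and scale of 1024 are illustrative, and a full recipe would also keep the FP32 master copy from the previous slides):

```python
import torch
from torch import nn

model = nn.Linear(512, 10).cuda().half()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
scale = 1024.0

x = torch.randn(32, 512, device="cuda", dtype=torch.float16)
y = torch.randint(0, 10, (32,), device="cuda")

optimizer.zero_grad()
loss = criterion(model(x), y)
(loss * scale).backward()          # scale the loss so small activation gradients survive FP16
for p in model.parameters():
    p.grad.div_(scale)             # unscale the weight gradients before the update
optimizer.step()
```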
Implementation
2. Loss Scaling – How to choose the loss scaling factor?
• A simple way is to pick a constant scaling factor empirically.
• Or, if gradient statistics are available, directly choose a factor so that its product with the maximum absolute gradient value stays below 65,504 (the maximum value representable in FP16); a small snippet below sketches this.
• There is no downside to choosing a large scaling factor as long as it does not cause overflow during
backpropagation.
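A tiny sketch of the second option (the measured maximum gradient here is an assumed number, e.g. taken from an FP32 run):

```python
import math

max_abs_grad = 2.0 ** -17  # assumed measurement of the largest absolute gradient
scale = 2.0 ** math.floor(math.log2(65504.0 / max_abs_grad))
print(scale)               # 2**32: the largest power of two keeping scaled gradients below 65,504
```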
Implementation
2. Loss Scaling – Automatic Mixed Precision
• A more robust way is to choose the loss scaling factor dynamically (automatically).
• The basic idea is to start with a large scaling factor and then reconsider it in each training iteration.
• If an overflow occurs, skip the weight update and decrease the scaling factor.
• If no overflow occurs for a chosen number of iterations N, increase the scaling factor.
Reference: https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html
Use N = 2000, increase ×2 on growth, decrease ×0.5 on overflow (sketched below).
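A hypothetical sketch of this rule (the class and method names are mine, not the paper's or PyTorch's implementation); a training step would scale the loss by scaler.scale before backward() and then call scaler.step(optimizer, model.parameters()).

```python
import torch

class DynamicLossScaler:
    def __init__(self, init_scale=2.0 ** 15, growth_interval=2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.good_steps = 0

    def step(self, optimizer, parameters):
        grads = [p.grad for p in parameters if p.grad is not None]
        if any(not torch.isfinite(g).all() for g in grads):
            self.scale *= 0.5          # overflow: skip this update and shrink the scale
            self.good_steps = 0
            return
        for g in grads:
            g.div_(self.scale)         # unscale the gradients before the weight update
        optimizer.step()
        self.good_steps += 1
        if self.good_steps % self.growth_interval == 0:
            self.scale *= 2.0          # N good iterations in a row: grow the scale
```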
Implementation
3. Arithmetic Precision
• Neural network arithmetic falls into three categories: vector dot-products, reductions, and point-wise
operations.
• To maintain model accuracy, the authors found that some networks require FP16 vector dot-products to accumulate the partial products into an FP32 value, which is converted to FP16 before writing to memory (illustrated below).
Reference: https://www.quora.com/How-does-Fused-Multiply-Add-FMA-work-and-what-is-its-importance-in-computing
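A quick illustration of why FP32 accumulation matters (numpy assumed; the sizes and random data are arbitrary): summing FP16 partial products in an FP16 accumulator drifts, while accumulating the same products in FP32 stays close to an FP64 reference.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(4096).astype(np.float16)
b = rng.random(4096).astype(np.float16)

acc16 = np.float16(0.0)
acc32 = np.float32(0.0)
for x, y in zip(a, b):
    p = x * y                                   # FP16 partial product
    acc16 = np.float16(acc16 + p)               # accumulate in FP16
    acc32 = np.float32(acc32 + np.float32(p))   # accumulate the same products in FP32

ref = np.dot(a.astype(np.float64), b.astype(np.float64))
print(acc16, acc32, ref)                        # the FP16 accumulator is visibly off
```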
Implementation
3. Arithmetic Precision
• Large reductions (sums across elements of a vector) should be carried out in FP32.
• Such reductions mostly come up in batch-normalization layers and softmax layers.
• Both layer types in the authors' implementation still read and write FP16 tensors from memory while performing the arithmetic in FP32 → this did not slow down the training process (see the sketch below).
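A one-line sketch of that pattern for softmax (PyTorch assumed; the tensor shape is arbitrary): the FP16 tensor is the in-memory representation, the reduction runs in FP32, and the result is stored back as FP16.

```python
import torch

x_fp16 = (torch.randn(32, 1000) * 8).half()              # FP16 activations "in memory"
y_fp16 = torch.softmax(x_fp16.float(), dim=-1).half()    # arithmetic in FP32, output stored as FP16
```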
Results
Comparison of the FP32 baseline with Mixed Precision
• (The accuracy tables from the paper shown on these two slides are not reproduced in this transcript; mixed precision matches the FP32 baseline across the evaluated tasks.)
PyTorch 1.6 AMP New features & Experiment
Automatic Mixed Precision in PyTorch
• Last July, PyTorch released version 1.6, which officially supports Automatic Mixed Precision!
• We can use Automatic Mixed Precision very simply. Just add 5 lines.
• (NVIDIA Apex AMP has been merged into PyTorch and the standalone version is deprecated!)
PyTorch 1.6 AMP New features & Experiment
Automatic Mixed Precision in PyTorch
• Just add 5 lines, and we can use Automatic Mixed Precision training in PyTorch! The slide shows the before/after training loop; a sketch is reproduced after the reference below.
Reference: https://github.com/hoya012/automatic-mixed-precision-tutorials-pytorch
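The before/after screenshots from the slide are not reproduced in this transcript; the sketch below shows the typical change with the torch.cuda.amp API introduced in PyTorch 1.6 (the toy model, data, and loop are illustrative, not the repository's exact code).

```python
import torch
from torch import nn
from torch.cuda.amp import autocast, GradScaler

# Toy setup so the loops below run (assumes a CUDA GPU; shapes are arbitrary).
model = nn.Linear(512, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
loader = [(torch.randn(32, 512, device="cuda"), torch.randint(0, 10, (32,), device="cuda"))]

# Before: a standard FP32 training step.
for inputs, targets in loader:
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()

# After: the same loop with Automatic Mixed Precision (roughly the "5 added lines").
scaler = GradScaler()
for inputs, targets in loader:
    optimizer.zero_grad()
    with autocast():                   # forward pass runs in mixed precision
        loss = criterion(model(inputs), targets)
    scaler.scale(loss).backward()      # scale the loss and backprop the scaled gradients
    scaler.step(optimizer)             # unscales gradients and skips the step on inf/NaN
    scaler.update()                    # adjusts the loss scale dynamically
```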
PyTorch 1.6 AMP New features & Experiment
Image Classification with Automatic Mixed-Precision Training PyTorch Tutorial
• To verify the effect of AMP, perform a simple classification experiment.
• Use the Kaggle Intel Image Classification dataset.
• It contains around 25k images of size 150x150 distributed across 6 categories.
PyTorch 1.6 AMP New features & Experiment
Image Classification with Automatic Mixed-Precision Training PyTorch Tutorial
• Use an ImageNet-pretrained ResNet-18.
• Use a GTX 1080 Ti (without Tensor Cores) and an RTX 2080 Ti (with Tensor Cores).
• Fix the training settings (batch size = 256, epochs = 120, lr, augmentation, optimizer, etc.).
PyTorch 1.6 AMP New features & Experiment
Image Classification with Automatic Mixed-Precision Training PyTorch Tutorial
• We can save almost 30~40% of GPU memory!
• If we use a GPU with Tensor Cores, we can also save training time!
• NVIDIA Tensor Cores provide hardware acceleration for mixed precision training.
Reference: https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html
Conclusion
• Introduce a methodology for training deep neural networks using half-precision floating point.
• Introduce three techniques to prevent model accuracy loss.
• PyTorch officially supports Automatic Mixed Precision training.
