PR-274 | Mixed Precision Training
2020/09/06
Ho Seong Lee (hoya012)
Cognex Deep Learning Lab, Research Engineer
Contents
• Introduction
• Related Work
• Implementation
• Results
• PyTorch 1.6 AMP New features & Experiment
• Conclusion
Introduction
Main Contributions
Increasing the size of a neural network typically improves accuracy
• But it also increases the memory and compute requirements for training the model.
• Introduce a methodology for training deep neural networks using half-precision floating-point numbers, without losing model accuracy or having to modify hyper-parameters.
• Introduce three techniques to prevent model accuracy loss.
• Using these techniques, demonstrate that a wide variety of network architectures and applications can be trained to match the accuracy of FP32 training.
Related Works
Network Compression
• Low-precision Training
• Train networks with low-precision weights.
• Quantization
• Quantize a pretrained model by reducing the number of bits.
• Pruning
• Remove connections according to an importance criterion.
• Dedicated architectures
• Design architectures to be memory efficient, such as SqueezeNet, MobileNet, and ShuffleNet.
Related Works
Network Compression in PR-12 Study
• In total, 23 papers were covered! → 23/274 = almost 8%!
• But low-precision training is, as far as I know, covered here for the first time.
Related Works
Related Works – Low Precision Training
• “BinaryConnect: Training deep neural networks with binary weights during propagations”, 2015 NIPS
• Proposed training with binary weights; all other tensors and arithmetic were kept in full precision.
• “Binarized neural networks.”, 2016 NIPS
• Also binarize the activations, but gradients were stored and computed in single precision.
• “Quantized neural networks: Training neural networks with low precision weights and activations”, 2016 arXiv
• Quantize weights and activations to 2, 4, and 6 bits, but gradients were real numbers.
• “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”, 2016 ECCV
• Binarize all tensors, including the gradients, but this led to a non-trivial loss of accuracy.
Related Works
Main Contributions
• All tensors and arithmetic for forward and backward passes use reduced precision, FP16.
• No hyper-parameters (such as layer width) are adjusted.
• Models trained with these techniques do not incur accuracy loss when compared to FP32 baselines.
• Demonstrate that this technique works across a variety of applications.
Implementation
IEEE 754 Floating Point Representation
• A number is represented as (−1)^S × 1.M × 2^(E − Bias), where S is the sign bit, M the mantissa, and E the biased exponent (see the decomposition example below).
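As a quick illustration of this formula, the sketch below (assuming numpy is available) splits an FP16 value into its sign, exponent, and mantissa fields and rebuilds the number; the field widths and bias match the FP16 row of the format table below.

```python
import numpy as np

def fp16_fields(x):
    """Split an FP16 number into its IEEE 754 fields (1 sign, 5 exponent, 10 mantissa bits)."""
    bits = int(np.array([x], dtype=np.float16).view(np.uint16)[0])
    sign = (bits >> 15) & 0x1
    exponent = (bits >> 10) & 0x1F
    mantissa = bits & 0x3FF
    return sign, exponent, mantissa

s, e, m = fp16_fields(-1.5)
value = (-1) ** s * (1 + m / 2 ** 10) * 2 ** (e - 15)  # FP16 bias is 15 (normal numbers only)
print(s, e, m, value)  # 1 15 512 -1.5
```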
Implementation
Bonus) New Floating-Point Formats

Format               Sign    Exponent    Mantissa
IEEE 754 FP32        1 bit   8 bit       23 bit
IEEE 754 FP16        1 bit   5 bit       10 bit
Google bfloat16      1 bit   8 bit       7 bit
NVIDIA TensorFloat   1 bit   8 bit       10 bit
AMD FP24             1 bit   7 bit       16 bit
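A small sketch (assuming PyTorch is installed) to inspect the numerical range of the storage formats above; TensorFloat-32 is exposed in PyTorch as a matmul mode rather than a storage dtype, so it is omitted.

```python
import torch

# Compare max value, smallest normal value, and machine epsilon of each storage dtype.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):>15s}  max={info.max:.3e}  min normal={info.tiny:.3e}  eps={info.eps:.3e}")
```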
Implementation
1. FP32 Master copy of weights
• In mixed precision training, weights, activations, and gradients are stored as FP16, halving the storage and bandwidth requirements for these tensors.
• In order to match the accuracy of FP32 networks, an FP32 master copy of the weights is maintained and updated with the weight gradient during the optimizer step (a minimal sketch follows below).
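A minimal sketch of the master-copy idea with plain SGD (assumes a CUDA GPU; the toy objective and sizes are illustrative, and real tooling such as Apex or torch.cuda.amp handles this automatically):

```python
import torch
from torch import nn

model = nn.Linear(1024, 1024).cuda().half()                         # FP16 weights for forward/backward
master = [p.detach().clone().float() for p in model.parameters()]   # FP32 master copy
lr = 0.01

x = torch.randn(64, 1024, device="cuda", dtype=torch.float16)
loss = model(x).float().pow(2).mean()        # toy objective
loss.backward()                              # produces FP16 weight gradients

with torch.no_grad():
    for p, m in zip(model.parameters(), master):
        m -= lr * p.grad.float()             # optimizer update applied in FP32
        p.copy_(m.half())                    # refresh the FP16 working copy from the master
        p.grad = None
```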
Implementation
1. FP32 Master copy of weights → Why?
• The weight update (weight gradient multiplied by the learning rate) becomes too small to be represented in FP16 when it is smaller than 2^-24, as the example below shows.
W_new = W_old − η · ∂E/∂W
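A small numeric check of this claim (PyTorch assumed; the gradient value and learning rate are made up for illustration): an update of roughly 2^-26 underflows to zero when computed in FP16 but survives in FP32.

```python
import torch

grad = torch.tensor(2.0 ** -16, dtype=torch.float16)  # a representable FP16 gradient
lr = 1e-3

update_fp16 = lr * grad          # ~2**-26, below 2**-24 -> rounds to zero in FP16
update_fp32 = lr * grad.float()  # the same update kept in FP32

print(update_fp16.item())  # 0.0 -> the weight would never move
print(update_fp32.item())  # ~1.53e-08
```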
Implementation
1. FP32 Master copy of weights → Experiments
• Train the Mandarin speech model with and without an FP32 master copy of the weights.
• Updating the FP16 weights directly results in an 80% relative accuracy loss, i.e. much worse than keeping an FP32 master copy.
Implementation
2. Loss Scaling
• Activation gradient values tend to be dominated by small magnitudes.
• Scaling them by a factor of 8 is sufficient to match the accuracy achieved with FP32 training.
• This means activation gradient values below 2^-27 were irrelevant to training (scaling by 8 = 2^3 lifts anything above 2^-27 into the FP16-representable range, which starts at 2^-24), while values above 2^-27 still had to be preserved.
Implementation
2. Loss Scaling
• One efficient way to shift the gradient values into FP16-representable range is to scale the loss value
computed in the forward pass, prior to starting back-propagation.
• This keeps the relevant gradient values from flushing to zero.
• Weight gradients must be unscaled before the weight update to maintain the update magnitudes (a minimal sketch of this scale/unscale step follows below).
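A minimal sketch of this static loss-scaling recipe (assumes a CUDA GPU; the model, data, and scale of 1024 are illustrative, and a full recipe would also keep the FP32 master copy from the previous slides):

```python
import torch
from torch import nn

model = nn.Linear(512, 10).cuda().half()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
scale = 1024.0

x = torch.randn(32, 512, device="cuda", dtype=torch.float16)
y = torch.randint(0, 10, (32,), device="cuda")

optimizer.zero_grad()
loss = criterion(model(x), y)
(loss * scale).backward()          # scale the loss so small activation gradients survive FP16
for p in model.parameters():
    p.grad.div_(scale)             # unscale the weight gradients before the update
optimizer.step()
```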
Implementation
2. Loss Scaling – How to choose the loss scaling factor?
• A simple way is to pick a constant scaling factor empirically.
• Or, if gradient statistics are available, directly choose a factor so that its product with the maximum absolute gradient value stays below 65,504 (the maximum value representable in FP16); a small snippet below sketches this.
• There is no downside to choosing a large scaling factor as long as it does not cause overflow during
backpropagation.
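A tiny sketch of the second option (the measured maximum gradient here is an assumed number, e.g. taken from an FP32 run):

```python
import math

max_abs_grad = 2.0 ** -17  # assumed measurement of the largest absolute gradient
scale = 2.0 ** math.floor(math.log2(65504.0 / max_abs_grad))
print(scale)               # 2**32: the largest power of two keeping scaled gradients below 65,504
```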
Implementation
2. Loss Scaling – Automatic Mixed Precision
• A more robust way is to choose the loss scaling factor dynamically (automatically).
• The basic idea is to start with a large scaling factor and then reconsider it in each training iteration.
• If an overflow occurs, skip the weight update and decrease the scaling factor.
• If no overflow occurs for a chosen number of iterations N, increase the scaling factor.
Reference: https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html
Use N = 2000, increase ×2 on growth, decrease ×0.5 on overflow (sketched below).
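A hypothetical sketch of this rule (the class and method names are mine, not the paper's or PyTorch's implementation); a training step would scale the loss by scaler.scale before backward() and then call scaler.step(optimizer, model.parameters()).

```python
import torch

class DynamicLossScaler:
    def __init__(self, init_scale=2.0 ** 15, growth_interval=2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.good_steps = 0

    def step(self, optimizer, parameters):
        grads = [p.grad for p in parameters if p.grad is not None]
        if any(not torch.isfinite(g).all() for g in grads):
            self.scale *= 0.5          # overflow: skip this update and shrink the scale
            self.good_steps = 0
            return
        for g in grads:
            g.div_(self.scale)         # unscale the gradients before the weight update
        optimizer.step()
        self.good_steps += 1
        if self.good_steps % self.growth_interval == 0:
            self.scale *= 2.0          # N good iterations in a row: grow the scale
```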
Implementation
3. Arithmetic Precision
• Neural network arithmetic falls into three categories: vector dot-products, reductions, and point-wise
operations.
• To maintain model accuracy, the authors found that some networks require FP16 vector dot-products to accumulate the partial products into an FP32 value, which is converted to FP16 before writing to memory (illustrated below).
Reference: https://www.quora.com/How-does-Fused-Multiply-Add-FMA-work-and-what-is-its-importance-in-computing
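A quick illustration of why FP32 accumulation matters (numpy assumed; the sizes and random data are arbitrary): summing FP16 partial products in an FP16 accumulator drifts, while accumulating the same products in FP32 stays close to an FP64 reference.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(4096).astype(np.float16)
b = rng.random(4096).astype(np.float16)

acc16 = np.float16(0.0)
acc32 = np.float32(0.0)
for x, y in zip(a, b):
    p = x * y                                   # FP16 partial product
    acc16 = np.float16(acc16 + p)               # accumulate in FP16
    acc32 = np.float32(acc32 + np.float32(p))   # accumulate the same products in FP32

ref = np.dot(a.astype(np.float64), b.astype(np.float64))
print(acc16, acc32, ref)                        # the FP16 accumulator is visibly off
```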
Implementation
3. Arithmetic Precision
• Large reductions (sums across elements of a vector) should be carried out in FP32.
• Such reductions mostly come up in batch-normalization layers and softmax layers.
• Both layer types in the authors' implementation still read and write FP16 tensors from memory while performing the arithmetic in FP32 → this did not slow down the training process (see the sketch below).
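A one-line sketch of that pattern for softmax (PyTorch assumed; the tensor shape is arbitrary): the FP16 tensor is the in-memory representation, the reduction runs in FP32, and the result is stored back as FP16.

```python
import torch

x_fp16 = (torch.randn(32, 1000) * 8).half()              # FP16 activations "in memory"
y_fp16 = torch.softmax(x_fp16.float(), dim=-1).half()    # arithmetic in FP32, output stored as FP16
```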
Results
Comparison of the FP32 baseline with Mixed Precision
• (The accuracy tables from the paper shown on these two slides are not reproduced in this transcript; mixed precision matches the FP32 baseline across the evaluated tasks.)
PyTorch 1.6 AMP New features & Experiment
Automatic Mixed Precision in PyTorch
• Last July, PyTorch released version 1.6, which officially supports Automatic Mixed Precision!
• We can use Automatic Mixed Precision very simply. Just add 5 lines.
• (NVIDIA Apex AMP has been merged into PyTorch and the standalone version is deprecated!)
PyTorch 1.6 AMP New features & Experiment
Automatic Mixed Precision in PyTorch
• Just add 5 lines, and we can use Automatic Mixed Precision training in PyTorch! The slide shows the before/after training loop; a sketch is reproduced after the reference below.
Reference: https://github.com/hoya012/automatic-mixed-precision-tutorials-pytorch
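The before/after screenshots from the slide are not reproduced in this transcript; the sketch below shows the typical change with the torch.cuda.amp API introduced in PyTorch 1.6 (the toy model, data, and loop are illustrative, not the repository's exact code).

```python
import torch
from torch import nn
from torch.cuda.amp import autocast, GradScaler

# Toy setup so the loops below run (assumes a CUDA GPU; shapes are arbitrary).
model = nn.Linear(512, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
loader = [(torch.randn(32, 512, device="cuda"), torch.randint(0, 10, (32,), device="cuda"))]

# Before: a standard FP32 training step.
for inputs, targets in loader:
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()

# After: the same loop with Automatic Mixed Precision (roughly the "5 added lines").
scaler = GradScaler()
for inputs, targets in loader:
    optimizer.zero_grad()
    with autocast():                   # forward pass runs in mixed precision
        loss = criterion(model(inputs), targets)
    scaler.scale(loss).backward()      # scale the loss and backprop the scaled gradients
    scaler.step(optimizer)             # unscales gradients and skips the step on inf/NaN
    scaler.update()                    # adjusts the loss scale dynamically
```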
PyTorch 1.6 AMP New features & Experiment
Image Classification with Automatic Mixed-Precision Training PyTorch Tutorial
• To verify the effect of AMP, perform a simple classification experiment.
• Use the Kaggle Intel Image Classification dataset.
• It contains around 25k images of size 150x150 distributed across 6 categories.
PyTorch 1.6 AMP New features & Experiment
Image Classification with Automatic Mixed-Precision Training PyTorch Tutorial
• Use an ImageNet-pretrained ResNet-18.
• Use a GTX 1080 Ti (without Tensor Cores) and an RTX 2080 Ti (with Tensor Cores).
• Fix the training settings (batch size = 256, epochs = 120, lr, augmentation, optimizer, etc.).
PyTorch 1.6 AMP New features & Experiment
Image Classification with Automatic Mixed-Precision Training PyTorch Tutorial
• We can save almost 30~40% of GPU memory!
• If we use a GPU with Tensor Cores, we can also save training time!
• NVIDIA Tensor Cores provide hardware acceleration for mixed precision training.
Reference: https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html
Conclusion
• Introduce a methodology for training deep neural networks using half-precision floating point.
• Introduce three techniques to prevent model accuracy loss.
• PyTorch officially supports Automatic Mixed Precision training.
