PR-433: Test-time Training with Masked Autoencoders

Gandelsman, Yossi, et al. "Test-time training with masked autoencoders." Advances in Neural Information
Processing Systems 35 (2022): 29374-29385.
Sunghoon Joo, VUNO Inc.
2023. 4. 16.

1. Research Background
Reference
•Sun, Yu, et al. "Test-time training with self-supervision for generalization under distribution shifts." International Conference on Machine Learning. PMLR, 2020.
•https://yueatsprograms.github.io/ttt/home.html

Problem Settings
Generalization under distribution shifts
•Generalization is intrinsically hard without access to training data from the test distribution
•The common practice is to avoid distribution shifts altogether by using a wider training
distribution that hopefully contains the test distribution – with more training data or data
augmentation.
Geirhos, Robert, et al. "Generalisation in humans and deep neural networks." Advances in neural information processing systems 31 (2018).
[Figure: example images corrupted with salt-and-pepper noise and with uniform noise]
Hard to know the test distribution!

Test time training (Sun et al., ICML, 2020)
•The self-supervised pretext task employed by TTT is rotation prediction.
•This task is limited in generality, because it can often be too easy or too hard.
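A minimal sketch of the rotation-prediction pretext task, assuming an image is a plain 2D list of pixel values (the function names are hypothetical and not taken from the TTT code base). Each image yields four samples, one per rotation, and the label is the rotation index. The last helper illustrates one way the task can become too hard: for an image that looks the same under rotation, the four samples are identical, so the label cannot be recovered.

```python
def rotate90(grid):
    """Rotate a 2D list by 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def rotation_pretext_samples(image):
    """Return [(rotated_image, label)], label k meaning k * 90 degrees clockwise."""
    samples = []
    current = [list(row) for row in image]
    for label in range(4):
        samples.append((current, label))
        current = rotate90(current)
    return samples


def distinct_rotations(image):
    """Count how many of the four rotated copies are actually different."""
    rotated = {tuple(map(tuple, img)) for img, _ in rotation_pretext_samples(image)}
    return len(rotated)


asymmetric = [[1, 2], [3, 4]]
symmetric = [[0, 1, 0], [1, 1, 1], [0, 1, 0]]  # looks the same under any 90-degree rotation
print(distinct_rotations(asymmetric), distinct_rotations(symmetric))
```

The symmetric case loosely mirrors the top-down-view classes discussed later in the slides, where rotation gives little usable signal.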

Autoencoders for representation learning
The most successful work is masked autoencoders (MAE)
•He, Kaiming, et al. "Masked autoencoders are scalable vision learners." CVPR. 2022.
•PR-355
The proposed method simply substitutes MAE for the self-supervised part of TTT (rotation prediction in the original TTT).
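A minimal sketch of the random patch masking that MAE reconstruction relies on, using the 75% mask ratio quoted on the test-time training slide. The helper name `random_mask` and the patch count of 196 (a 224x224 image cut into 16x16 patches) are illustrative assumptions, not values stated in these slides.

```python
import random


def random_mask(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into visible (kept) and masked sets."""
    rng = random.Random(seed)
    num_keep = int(num_patches * (1 - mask_ratio))
    order = list(range(num_patches))
    rng.shuffle(order)
    keep = sorted(order[:num_keep])      # patches fed to the encoder
    masked = sorted(order[num_keep:])    # patches the decoder must reconstruct
    return keep, masked


keep, masked = random_mask(196, mask_ratio=0.75, seed=0)
print(len(keep), len(masked))
```

The encoder only sees the visible patches, and the reconstruction loss asks the decoder to fill in the rest.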

2. Methods
Design choices - Architecture
•Y-shaped (original TTT paper)
•Y-shaped (TTT-MAE)
• f: MAE encoder (ViT)
• g: MAE decoder (ViT)
• h: main task head (e.g. object recognition), a ViT-Base (for ViT probing)

Training-time training: 1. training encoder and decoder
•MAE encoder f and decoder g: ViT-Large, pre-trained for 800 epochs on ImageNet-1k
•ViT probing: train only h, with f frozen. Here, h is a ViT-Base.

Training-time training: 2. training main task head
•Training set with n samples (x_i, y_i)
•h: main task head
•f_0: encoder produced by MAE pre-training
• MAE encoder: ViT-Large, pre-trained for ImageNet-1k reconstruction
•l_m: cross entropy loss for classification
•Augmentation: image cropping and horizontal flips
• No other augmentations (random changes in brightness, contrast, color and sharpness) are used
•800 epochs
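The symbols on this slide (main task loss l_m, pre-trained encoder f_0, head h, and n training samples x_i with labels y_i) suggest the following objective for the head. This is a reconstruction from those symbols, assuming the head is trained on top of the frozen encoder as in ViT probing, and should be checked against the paper:

```latex
h_0 = \arg\min_{h} \; \frac{1}{n} \sum_{i=1}^{n} l_m\big( h \circ f_0(x_i),\; y_i \big)
```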

Test-time training
Test input x arrives.
•Starting from f_0 and g_0, minimize the self-supervised reconstruction loss l_s (pixel-wise mean squared error) on x to obtain f_x.
•random mask (75%)
•SGD for 20 steps, using a momentum of 0.9, weight decay of 0.2, batch size of 128, and a fixed learning rate of 5e-3.
Make a prediction on x as h ∘ f_x(x).
Reset the weights to f_0 and g_0 for the next test input x.
•By test-time training on the test inputs independently, we do not assume that they come from the same distribution.
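The per-input procedure can be sketched with a toy model. This is not the authors' implementation: the encoder and decoder here are tiny linear maps over a list of floats (stand-ins for the ViT encoder f and decoder g), the loss is averaged over all entries, and weight decay and the batch of 128 are left out for brevity. Only the step count, momentum, learning rate and mask ratio follow the slide. All function names are hypothetical.

```python
import random


def encode(w_f, x, keep):
    # f: the latent is a weighted sum of the visible (unmasked) entries
    return sum(w_f[j] * x[j] for j in keep)


def decode(w_g, z):
    # g: reconstruct every entry of the input from the latent
    return [w * z for w in w_g]


def ttt_predict(x, f0, g0, head, steps=20, lr=5e-3, momentum=0.9,
                mask_ratio=0.75, seed=0):
    """Adapt copies of (f0, g0) on one test input x, then predict with the head."""
    rng = random.Random(seed)
    d = len(x)
    # Work on copies: f0 and g0 stay untouched, which is the "reset" step.
    w_f, w_g = list(f0), list(g0)
    v_f, v_g = [0.0] * d, [0.0] * d
    num_keep = max(1, int(d * (1 - mask_ratio)))
    for _ in range(steps):
        keep = rng.sample(range(d), num_keep)          # fresh random mask
        z = encode(w_f, x, keep)
        err = [xh - xi for xh, xi in zip(decode(w_g, z), x)]
        # Gradients of the mean squared reconstruction error.
        grad_g = [2.0 * e * z / d for e in err]
        dz = sum(2.0 * e * w / d for e, w in zip(err, w_g))
        grad_f = [dz * x[j] if j in keep else 0.0 for j in range(d)]
        # SGD with momentum.
        v_g = [momentum * v + g for v, g in zip(v_g, grad_g)]
        v_f = [momentum * v + g for v, g in zip(v_f, grad_f)]
        w_g = [w - lr * v for w, v in zip(w_g, v_g)]
        w_f = [w - lr * v for w, v in zip(w_f, v_f)]
    # Predict on the full (unmasked) input with the adapted encoder f_x.
    return head(encode(w_f, x, range(d))), w_f


f0, g0 = [0.1] * 4, [0.1] * 4
head = lambda z: int(z > 0.5)   # toy stand-in for the main task head h
prediction, f_x = ttt_predict([1.0, 2.0, 3.0, 4.0], f0, g0, head)
```

Because each call starts again from `f0` and `g0`, nothing learned on one test input carries over to the next, matching the independence assumption stated on the slide.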

Optimizer for TTT
Figure 2: We experiment with two optimizers for TTT. MAE [19] uses AdamW for pre-training. But our results (left) show that
AdamW for TTT requires early stopping, which is unrealistic for generalization to unknown distributions without a validation
set. We instead use SGD, which keeps improving performance even after 20 steps (right).
•Original TTT simply takes the same optimizer setting as during the last epoch of training-time training of the self-supervised task.
•This choice is not available for MAE, because the learning rate schedule of MAE reaches zero by the end of pre-training.
•With AdamW, excessive TTT iterations can hurt performance.
•With SGD, more iterations consistently improve performance on all distribution shifts.
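A small sketch of why the last-epoch optimizer setting cannot be reused. It assumes a half-cycle cosine decay to zero, which is a common choice for this kind of pre-training. The schedule shape, the base learning rate and the function name are assumptions for illustration, and warmup is omitted.

```python
import math


def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Half-cycle cosine decay from base_lr (step 0) to min_lr (last step)."""
    progress = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical base learning rate; 800 matches the number of pre-training epochs.
for epoch in (0, 400, 800):
    print(epoch, cosine_lr(epoch, 800, base_lr=1.5e-4))
```

With a learning rate of zero at the end of pre-training, copying that setting into test-time training would leave the weights unchanged, so a separate fixed learning rate has to be chosen.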

3. Experimental Results
Calibration on out of distribution data
•ImageNet-C: 15 types of corruption applied to ImageNet images, each at 5 levels of severity
• D. Hendrycks and T. Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. ICLR, 2019
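To make the size of the benchmark concrete, the evaluation grid can be enumerated as below. The corruption names follow the ImageNet-C benchmark. The helper names are hypothetical, and the snippet only lists the settings; it does not load any data.

```python
from itertools import product

CORRUPTIONS = [
    "gaussian_noise", "shot_noise", "impulse_noise",          # noise
    "defocus_blur", "glass_blur", "motion_blur", "zoom_blur",  # blur
    "snow", "frost", "fog", "brightness",                      # weather
    "contrast", "elastic_transform", "pixelate", "jpeg_compression",  # digital
]
SEVERITIES = range(1, 6)


def evaluation_grid():
    """All (corruption, severity) pairs, each one a separate test distribution."""
    return list(product(CORRUPTIONS, SEVERITIES))


def grid_at_severity(level):
    """Restrict the grid to a single severity level."""
    return [(c, s) for c, s in evaluation_grid() if s == level]


print(len(evaluation_grid()), len(grid_at_severity(5)))
```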

Main results on ImageNet-C
TTT-MAE achieves larger gains over its baseline than TTT-Rot does over its own baseline, on all corruptions.
• Joint Train: ResNet-16-layers, after joint training for rotation prediction and object recognition (baseline for TTT-Rot)
• TTT-Rot: original paper (rotation task, ResNet-18)
• Baseline: pre-trained MAE encoder with ViT probing (no TTT)
• TTT-MAE (red): TTT on top of our baseline, which significantly improves performance.

TTT-MAE in rotation invariant classes
• Rotation-invariant classes: classes whose images are usually taken from top-down views
• TTT-MAE is agnostic to rotation invariance and still helps on these classes.

Design choices - Training setup
1. Fine-tuning: train h ∘ f end-to-end. This works poorly with TTT.
2. ViT probing: train only h, with f frozen. Here, h is a ViT-Base.
3. Joint training: train both h ∘ f and g ∘ f, by summing their losses together. This is used by TTT with rotation prediction. But with MAE, it performs worse on the ImageNet validation set.
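The three setups differ in which parts are updated and which losses are summed. A compact way to write that down, as an illustrative sketch with hypothetical names (f is the encoder, g the decoder, h the main task head):

```python
TRAINING_SETUPS = {
    # train h ∘ f end-to-end on the main task
    "fine_tuning": {"trainable": {"f", "h"}, "losses": ["main"]},
    # train only h; the pre-trained encoder f stays frozen
    "vit_probing": {"trainable": {"h"}, "losses": ["main"]},
    # train h ∘ f and g ∘ f together by summing both losses
    "joint_training": {"trainable": {"f", "g", "h"},
                       "losses": ["main", "self_supervised"]},
}


def total_loss(setup, main_loss, self_supervised_loss):
    """Sum the loss terms that the given setup optimizes."""
    terms = {"main": main_loss, "self_supervised": self_supervised_loss}
    return sum(terms[name] for name in TRAINING_SETUPS[setup]["losses"])


print(total_loss("vit_probing", 1.0, 0.5), total_loss("joint_training", 1.0, 0.5))
```

In ViT probing the encoder is the only part later changed by test-time training, which keeps the head's input distribution close to what it saw during training.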

Accuracy comparison of three designs (ViT probing, fine-tuning, joint training)
•The first three rows are only for training-time training, after which a fixed model is applied during testing.
•Joint training: does not achieve satisfactory performance on most corruptions
•Fine-tuning: initially performs better than ViT probing, but is not amenable to TTT
•TTT-MAE: test-time training after ViT probing, which performs the best across all corruption types

Performance on other ImageNet variants
ImageNet-R
• ImageNet-R is a benchmark dataset for evaluating the robustness of image classification models.
• The dataset contains renditions of ImageNet classes, such as art, cartoons, graffiti, sculptures and sketches, rather than natural photographs.
ImageNet-A
• ImageNet-A is a dataset designed to test the robustness of computer vision models against real-world, unmodified images.
• The dataset includes images of ImageNet classes that are hard to classify because of challenges such as occlusion, low resolution, and unusual viewpoints.
• Baseline: pre-trained MAE encoder with ViT probing (no TTT)

4. Conclusions
• Main contribution
• TTT-MAE, a new method for addressing the problem of domain shift in visual recognition tasks.
• TTT can alternatively be viewed as one-sample unsupervised domain adaptation (UDA).
• Limitations & future work
• Slower at test time than the baseline that applies a fixed model (inference speed has not been the focus of this paper). It might be improved through better hyper-parameters, optimizers, training techniques and architectural designs.
• Studying the generalization of spatial autoencoding to other main tasks and test distributions beyond object recognition and the benchmarks used in this study.
• Exploring test-time training on video streams in human-like environments, where self-supervised learning can take advantage of past frames.
Thank you.