# Facebook: Training ImageNet in 1 Hour

## Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

### Main Idea

* Higher training speed requires a larger mini-batch size.

> 8,192 images per minibatch on 256 GPUs

* A larger mini-batch size naively leads to lower accuracy.
* A linear scaling rule adjusts the learning rate as a function of the minibatch size.
* A warmup scheme overcomes optimization challenges early in training.

### Background

* mini-batch SGD
* A larger mini-batch size leads to lower accuracy.

### mini-batch SGD

![](https://1274047417-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-L_G0aSn1Ck9aGSTDmHR%2F-L_G1BDIg9dyfMtC4gJK%2F-L_G1E3StizpGkSaNAWN%2Fmbgd.png?generation=1551842735399760\&alt=media)

* Iteration (as given in the Facebook paper):

![](https://1274047417-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-L_G0aSn1Ck9aGSTDmHR%2F-L_G1BDIg9dyfMtC4gJK%2F-L_G1E3UjXQoAbmTfxnr%2Fmsgd_fomula.png?generation=1551842735420706\&alt=media)

* Convergence:

  * Learning rate: ![](https://www.zhihu.com/equation?tex=\gamma+%3D+1%2F\sqrt{MK\sigma^2})

  * Convergence speed: ![](https://www.zhihu.com/equation?tex=1%2F\sqrt{MK})

  > M: minibatch size, K: number of iterations, σ²: variance of the stochastic gradients
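The per-iteration update above can be sketched as follows: average the gradients over the minibatch, then take one step against that average. This is a minimal illustration on a toy quadratic loss (the data, loss, and all names here are illustrative, not the paper's ImageNet setup):

```python
import numpy as np

def sgd_step(w, minibatch, grad_fn, lr):
    """One minibatch SGD step: w <- w - lr * (mean gradient over the batch)."""
    g = np.mean([grad_fn(x, w) for x in minibatch], axis=0)
    return w - lr * g

# Toy example: minimize the mean of (w - x)^2 over data points x.
grad_fn = lambda x, w: 2.0 * (w - x)      # d/dw of (w - x)^2
rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=0.5, size=256)

w = 0.0
for _ in range(100):
    batch = rng.choice(data, size=32)     # minibatch of size n = 32
    w = sgd_step(w, batch, grad_fn, lr=0.05)
# w converges toward the data mean (around 3.0)
```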

### Goal

* Use large minibatches
  * scale to multiple workers
* Maintain training and generalization accuracy

### Solution

* Linear Scaling Rule: When the minibatch size is multiplied by k, multiply the learning rate by k.
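The rule is trivial to apply in code. A small sketch (the base setting of lr = 0.1 for a 256-image minibatch matches the paper's ResNet-50 reference configuration; the function name is illustrative):

```python
def scaled_lr(base_lr, base_batch, batch):
    """Linear scaling rule: multiply the learning rate by the same
    factor k that the minibatch size was multiplied by."""
    return base_lr * batch / base_batch

# Paper's reference setting: lr = 0.1 for a 256-image minibatch.
# Scaling to 8192 images (k = 32) gives lr = 3.2.
print(scaled_lr(0.1, 256, 8192))  # 3.2
```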

### Analysis

* k iterations with a minibatch size of n:

  ![](https://1274047417-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-L_G0aSn1Ck9aGSTDmHR%2F-L_G1BDIg9dyfMtC4gJK%2F-L_G1E3W7rZDkqdudm24%2Fk_nsgd.png?generation=1551842735319475\&alt=media)
* 1 iteration with a minibatch size of kn:

  ![](https://1274047417-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-L_G0aSn1Ck9aGSTDmHR%2F-L_G1BDIg9dyfMtC4gJK%2F-L_G1E3YR-hXB10qkQ4d%2Fkn_sgd.png?generation=1551842735346344\&alt=media)
* Assume the gradients in the two updates above are equal.
  * The two updates can be similar only if the second learning rate is set to k times the first.
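This equivalence is easy to verify numerically. The sketch below takes every gradient at the same point w₀ (the paper's approximation that gradients barely change across the k small steps) and shows that k small steps at rate η equal one large step at rate kη (toy loss and names are illustrative):

```python
import numpy as np

# Gradient of the mean of 0.5 * (w - x)^2 over a batch.
grad_fn = lambda batch, w: (w - batch).mean()

rng = np.random.default_rng(1)
data = rng.normal(size=1024)
k, n, lr = 4, 8, 0.01
batches = [rng.choice(data, size=n) for _ in range(k)]

w0 = 0.5
# k small steps of size n at rate lr, with all gradients frozen at w0
# (the assumption under which the linear scaling rule is derived).
w_small = w0 - lr * sum(grad_fn(b, w0) for b in batches)

# 1 large step of size k*n at the linearly scaled rate k*lr.
w_big = w0 - k * lr * grad_fn(np.concatenate(batches), w0)

print(np.isclose(w_small, w_big))  # True: the updates match
```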

### Conditions where the assumption does not hold

* The initial training epochs, when the network is changing rapidly.
* Results are stable across a large range of minibatch sizes but degrade beyond a certain point:

  ![](https://1274047417-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-L_G0aSn1Ck9aGSTDmHR%2F-L_G1BDIg9dyfMtC4gJK%2F-L_G1E3_8fBIoVrKI5Y5%2Fsize-acc.png?generation=1551842735476266\&alt=media)

### Warm Up

* Use a low learning rate at the start of training to cope with the rapid change of the network.
* Constant warmup: the sudden jump back to the large learning rate afterwards causes the training error to spike.
* Gradual warmup: ramp the learning rate up from a small value to the large one.
  * Start from a learning rate of η and increase it by a constant amount at each iteration so that it reaches η̂ = kη after 5 epochs.
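The gradual warmup schedule can be sketched as a simple function of the iteration index (the function name and `iters_per_epoch` value below are illustrative, not from the paper):

```python
def warmup_lr(iteration, base_lr, k, iters_per_epoch, warmup_epochs=5):
    """Gradual warmup: ramp the learning rate linearly from base_lr (eta)
    to k * base_lr (eta_hat) over the first warmup_epochs epochs,
    then hold it at k * base_lr."""
    warmup_iters = warmup_epochs * iters_per_epoch
    if iteration >= warmup_iters:
        return k * base_lr
    # Linear interpolation between eta and eta_hat = k * eta.
    return base_lr + (k * base_lr - base_lr) * iteration / warmup_iters

# Example: base lr 0.1 (batch 256), scaled 32x for batch 8192,
# with a hypothetical 100 iterations per epoch.
print(warmup_lr(0, 0.1, 32, 100))    # 0.1 at the start
print(warmup_lr(500, 0.1, 32, 100))  # 3.2 after 5 warmup epochs
```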

### Reference

* [Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour](https://research.fb.com/wp-content/uploads/2017/06/imagenet1kin1h3.pdf?)
* [Jiqizhixin (Synced) question on Zhihu: How should Facebook's "Training ImageNet in 1 Hour" paper be evaluated?](https://www.zhihu.com/question/60874090)
* [Asynchronous Parallel Stochastic Gradient for Nonconvex Optimization](https://arxiv.org/abs/1506.08272)
* [Entropy-SGD: Biasing Gradient Descent Into Wide Valleys](https://arxiv.org/pdf/1611.01838.pdf)
