Higher training speed requires a larger mini-batch size.

In the Facebook paper: 8192 images per mini-batch, distributed across 256 GPUs.

Larger mini-batch size leads to lower accuracy

Linear scaling rule for adjusting learning rates as a function of minibatch size

Warmup scheme overcomes optimization challenges early in training

mini-batch SGD

A larger mini-batch size leads to lower accuracy.

The analysis compares the SGD iteration (as written in the Facebook paper), its convergence, the learning rate, and the convergence speed, with M: batch size, K: number of iterations, σ²: stochastic gradient variance.
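
The mini-batch SGD update referenced above, as written in the Facebook paper (B is a minibatch of n samples and l the per-sample loss):

w_{t+1} = w_t - \eta \, \frac{1}{n} \sum_{x \in B} \nabla l(x, w_t)

For convergence, learning rate, and convergence speed, a hedged sketch, assuming the notes refer to the standard smooth-convex analysis of mini-batch SGD (an assumption, not something stated in the paper):

\mathbb{E}[f(\bar{w}_K)] - f^* \le O\!\left( \frac{\sigma}{\sqrt{MK}} + \frac{1}{K} \right)

Under such a bound the dominant error term depends only on the total number of processed samples MK, so multiplying the batch size M by k cuts the required number of iterations K roughly by k, and the corresponding optimal learning rate grows with the batch size.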

Goal: use large minibatches to scale training to multiple workers while maintaining training and generalization accuracy.

Linear Scaling Rule: When the minibatch size is multiplied by k, multiply the learning rate by k.
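
Concretely, with the paper's reference setting of η = 0.1 for a minibatch of n = 256 images, a minibatch of 8192 gives k = 8192 / 256 = 32 and therefore η̂ = kη = 32 × 0.1 = 3.2.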

k iterations with a minibatch size of n, versus 1 iteration with a minibatch size of kn (both updates are written out below).

Assume the gradients in the two formulas are approximately equal, i.e. ∇l(x, w_t) ≈ ∇l(x, w_{t+j}) for j < k.

Then the two updates can be similar only if we set the second learning rate to k times the first, η̂ = kη.
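
Written out, following the derivation in the Facebook paper (B_j is the j-th minibatch of size n and l the per-sample loss):

w_{t+k} = w_t - \eta \, \frac{1}{n} \sum_{j<k} \sum_{x \in B_j} \nabla l(x, w_{t+j})

\hat{w}_{t+1} = w_t - \hat{\eta} \, \frac{1}{kn} \sum_{j<k} \sum_{x \in B_j} \nabla l(x, w_t)

If \nabla l(x, w_t) \approx \nabla l(x, w_{t+j}) for j < k, then choosing \hat{\eta} = k\eta gives \hat{w}_{t+1} \approx w_{t+k}.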

The linear scaling rule runs into trouble during the initial training epochs, when the network is changing rapidly.

Results are stable for a large range of minibatch sizes, breaking down only beyond a certain point.

Warmup: use a low learning rate at the start of training to cope with the rapidly changing network.

Constant warmup: train with a low constant learning rate for the first few epochs; the sudden change back to the large learning rate causes the training error to spike.

Gradual warmup: Ramping up the learning rate from a small to a large value.

Start from a learning rate of η and increment it by a constant amount at each iteration such that it reaches η̂ = kη after 5 epochs.
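
A minimal Python sketch of gradual warmup combined with the linear scaling rule, assuming the paper's reference values (learning rate 0.1 for a 256-image minibatch, 5 warmup epochs); the function and parameter names (learning_rate, iters_per_epoch, and so on) are illustrative, not taken from the paper's code.

def learning_rate(iteration, batch_size, iters_per_epoch,
                  base_lr=0.1, base_batch=256, warmup_epochs=5):
    """Learning rate at a given iteration: gradual warmup, then the linearly scaled rate."""
    k = batch_size / base_batch    # minibatch scaling factor
    target_lr = k * base_lr        # linear scaling rule: eta_hat = k * eta
    warmup_iters = warmup_epochs * iters_per_epoch
    if iteration < warmup_iters:
        # Ramp from base_lr to target_lr by a constant increment per iteration.
        return base_lr + (target_lr - base_lr) * iteration / warmup_iters
    return target_lr

# Example: an 8192-image minibatch on ImageNet (~1.28M images) gives roughly
# 156 iterations per epoch, so warmup lasts about 780 iterations and the
# learning rate ramps from 0.1 up to 32 * 0.1 = 3.2.
for it in (0, 390, 780, 2000):
    print(it, learning_rate(it, batch_size=8192, iters_per_epoch=156))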