Optimizers¶
Optimizers govern the path your neural network takes as it tries to minimize error. The choice of optimizer and its hyperparameters can determine whether your network learns successfully or fails to learn at all. PyTorch already implements the most widely used flavors, such as SGD, Adam and RMSProp. Here we strive to include optimizers that PyTorch has missed (and any cutting-edge ones that have not yet been added).
A2Grad¶
-
class
pywick.optimizers.a2grad.
A2GradUni
(params: Union[Iterable[torch.Tensor], Iterable[Dict[str, Any]]], lr: Optional[float] = None, beta: float = 10, lips: float = 10)[source]¶ Implements A2GradUni Optimizer Algorithm.
It has been proposed in Optimal Adaptive and Accelerated Stochastic Gradient Descent.
- Arguments:
- params: iterable of parameters to optimize or dicts defining parameter groups
- lr: not used for this optimizer (default: None)
- beta: (default: 10)
- lips: Lipschitz constant (default: 10)
- Note:
- Reference code: https://github.com/severilov/A2Grad_optimizer
-
class
pywick.optimizers.a2grad.
A2GradInc
(params: Union[Iterable[torch.Tensor], Iterable[Dict[str, Any]]], lr: Optional[float] = None, beta: float = 10, lips: float = 10)[source]¶ Implements A2GradInc Optimizer Algorithm.
It has been proposed in Optimal Adaptive and Accelerated Stochastic Gradient Descent.
- Arguments:
- params: iterable of parameters to optimize or dicts defining parameter groups
- lr: not used for this optimizer (default: None)
- beta: (default: 10)
- lips: Lipschitz constant (default: 10)
- Note:
- Reference code: https://github.com/severilov/A2Grad_optimizer
-
class
pywick.optimizers.a2grad.
A2GradExp
(params: Union[Iterable[torch.Tensor], Iterable[Dict[str, Any]]], lr: Optional[float] = None, beta: float = 10, lips: float = 10, rho: float = 0.5)[source]¶ Implements A2GradExp Optimizer Algorithm.
It has been proposed in Optimal Adaptive and Accelerated Stochastic Gradient Descent.
- Arguments:
- params: iterable of parameters to optimize or dicts defining parameter groups
- lr: not used for this optimizer (default: None)
- beta: (default: 10)
- lips: Lipschitz constant (default: 10)
- rho: represents the degree of weighting decrease, a constant smoothing factor between 0 and 1 (default: 0.5)
- Note:
- Reference code: https://github.com/severilov/A2Grad_optimizer
AdaBelief¶
-
class
pywick.optimizers.adabelief.
AdaBelief
(params: Union[Iterable[torch.Tensor], Iterable[Dict[str, Any]]], lr: float = 0.001, betas: Tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, amsgrad: bool = False, weight_decouple: bool = False, fixed_decay: bool = False, rectify: bool = False)[source]¶ Implements AdaBelief Optimizer Algorithm. It has been proposed in AdaBelief Optimizer, adapting stepsizes by the belief in observed gradients.
- Arguments:
- params: iterable of parameters to optimize or dicts defining parameter groups
- lr: learning rate (default: 1e-3)
- betas: coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))
- eps: term added to the denominator to improve numerical stability (default: 1e-8)
- weight_decay: weight decay (L2 penalty) (default: 0)
- amsgrad: whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond (default: False)
- weight_decouple: if True, the optimizer uses decoupled weight decay as in AdamW (default: False)
- fixed_decay: only used when weight_decouple is True. When fixed_decay == True, the weight decay is performed as $W_{new} = W_{old} - W_{old} \times decay$. When fixed_decay == False, the weight decay is performed as $W_{new} = W_{old} - W_{old} \times decay \times lr$. Note that in this case, the weight decay ratio decreases with learning rate (lr). (default: False)
- rectify: if True, perform the rectified update similar to RAdam (default: False)
- Note:
- Reference code: https://github.com/juntang-zhuang/Adabelief-Optimizer
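The two fixed_decay formulas above can be illustrated on a single scalar weight. This is a minimal sketch of the decay term only, not the library's implementation; the helper name decoupled_weight_decay is hypothetical and the gradient-based part of the step is left out.

```python
def decoupled_weight_decay(weight, lr, decay, fixed_decay):
    """Apply only the decoupled weight-decay part of a step to a scalar weight.

    Hypothetical helper for illustration; it mirrors the two formulas in the
    fixed_decay description and ignores the gradient update entirely.
    """
    if fixed_decay:
        # W_new = W_old - W_old * decay
        return weight - weight * decay
    # W_new = W_old - W_old * decay * lr
    return weight - weight * decay * lr


w = 2.0
fixed = decoupled_weight_decay(w, lr=0.1, decay=0.01, fixed_decay=True)
scaled = decoupled_weight_decay(w, lr=0.1, decay=0.01, fixed_decay=False)
print(fixed, scaled)
```

With fixed_decay == False the shrinkage is multiplied by lr, so a learning-rate schedule that lowers lr also weakens the decay.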
AdaHessian¶
AdaHessian Optimizer
Lifted from https://github.com/davda54/ada-hessian/blob/master/ada_hessian.py Originally licensed MIT, Copyright 2020, David Samuel
-
class
pywick.optimizers.adahessian.
Adahessian
(params, lr=0.1, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.0, hessian_power=1.0, update_each=1, n_samples=1, avg_conv_kernel=False)[source]¶ Implements the AdaHessian algorithm from “ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning”
- Arguments:
- params (iterable): iterable of parameters to optimize or dicts defining parameter groups
- lr (float, optional): learning rate (default: 0.1)
- betas ((float, float), optional): coefficients used for computing running averages of gradient and the squared hessian trace (default: (0.9, 0.999))
- eps (float, optional): term added to the denominator to improve numerical stability (default: 1e-8)
- weight_decay (float, optional): weight decay (L2 penalty) (default: 0.0)
- hessian_power (float, optional): exponent of the hessian trace (default: 1.0)
- update_each (int, optional): compute the hessian trace approximation only after this number of steps (to save time) (default: 1)
- n_samples (int, optional): how many times to sample z for the approximation of the hessian trace (default: 1)
-
is_second_order
¶
-
set_hessian
()[source]¶ Computes the Hutchinson approximation of the hessian trace and accumulates it for each trainable parameter.
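As a rough illustration of the Hutchinson approximation mentioned above, the sketch below estimates the trace of a small explicit matrix from products with random sign vectors z. It is plain Python and only an assumption about the general technique: the function name hutchinson_trace and the explicit matrix H are hypothetical, and a real optimizer would obtain the Hessian-vector product from automatic differentiation rather than from a stored matrix.

```python
import random


def hutchinson_trace(matvec, dim, n_samples=1, seed=0):
    """Estimate trace(H) as the average of z^T (H z) over random sign vectors z."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        # Rademacher vector: each entry is +1 or -1 with equal probability.
        z = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        hz = matvec(z)  # stands in for a Hessian-vector product
        total += sum(zi * hzi for zi, hzi in zip(z, hz))
    return total / n_samples


# Hypothetical 3x3 diagonal "Hessian" used only for this example.
H = [[2.0, 0.0, 0.0],
     [0.0, 3.0, 0.0],
     [0.0, 0.0, 5.0]]


def matvec(v):
    return [sum(h * x for h, x in zip(row, v)) for row in H]


estimate = hutchinson_trace(matvec, dim=3, n_samples=4)
print(estimate)
```

For a diagonal matrix every sample already equals the exact trace; with off-diagonal entries the estimate is noisy, and a larger n_samples reduces the variance.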
AdamP¶
AdamP Optimizer Implementation copied from https://github.com/clovaai/AdamP/blob/master/adamp/adamp.py Paper: Slowing Down the Weight Norm Increase in Momentum-based Optimizers - https://arxiv.org/abs/2006.08217 Code: https://github.com/clovaai/AdamP Copyright (c) 2020-present NAVER Corp. MIT license
AdamW¶
-
class
pywick.optimizers.adamw.
AdamW
(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, amsgrad=False)[source]¶ Implements AdamW algorithm.
The original Adam algorithm was proposed in Adam: A Method for Stochastic Optimization. The AdamW variant was proposed in Decoupled Weight Decay Regularization.
- Arguments:
- params (iterable): iterable of parameters to optimize or dicts defining parameter groups
- lr (float, optional): learning rate (default: 1e-3)
- betas (Tuple[float, float], optional): coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))
- eps (float, optional): term added to the denominator to improve numerical stability (default: 1e-8)
- weight_decay (float, optional): weight decay coefficient (default: 1e-2)
- amsgrad (boolean, optional): whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond (default: False)
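To make the "decoupled" part of AdamW concrete, here is a minimal scalar sketch that contrasts it with a classic L2 penalty folded into the gradient. It is an illustration under simplifying assumptions (one scalar parameter, no AMSGrad), not the code of this class; the function name adam_like_step and its decoupled flag are hypothetical.

```python
import math


def adam_like_step(param, grad, m, v, t, lr=1e-3, betas=(0.9, 0.999),
                   eps=1e-8, weight_decay=1e-2, decoupled=True):
    """One Adam-style step on a scalar, with decoupled or L2-style weight decay."""
    beta1, beta2 = betas
    if decoupled:
        # AdamW style: shrink the weight directly, outside the adaptive update.
        param = param * (1.0 - lr * weight_decay)
    else:
        # L2 style: the penalty becomes part of the gradient and gets rescaled.
        grad = grad + weight_decay * param
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)  # bias-corrected second moment
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# With a zero loss gradient, only the weight decay moves the parameter.
p_decoupled, _, _ = adam_like_step(1.0, 0.0, m=0.0, v=0.0, t=1, decoupled=True)
p_l2, _, _ = adam_like_step(1.0, 0.0, m=0.0, v=0.0, t=1, decoupled=False)
print(p_decoupled, p_l2)
```

In the L2 variant the penalty passes through the adaptive normalization, so on this first step the weight moves by roughly lr regardless of how small weight_decay is; the decoupled variant shrinks it by lr * weight_decay.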
AddSign¶
-
class
pywick.optimizers.addsign.
AddSign
(params, lr=0.001, beta=0.9, alpha=1, sign_internal_decay=None)[source]¶ Implements AddSign algorithm.
It has been proposed in Neural Optimizer Search with Reinforcement Learning.
Parameters: - params – (iterable): iterable of parameters to optimize or dicts defining parameter groups
- lr – (float, optional): learning rate (default: 1e-3)
- beta – (float, optional): coefficient used for computing the running average of the gradient (default: 0.9)
- alpha – (float, optional): term added to the internal_decay * sign(g) * sign(m) (default: 1)
- sign_internal_decay – (callable, optional): a function that returns an internal decay calculated based on the current training step and the total number of training steps. If None, the internal decay is assumed to be 1.
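The parameter descriptions above suggest an update in which the gradient is scaled by alpha + internal_decay * sign(g) * sign(m). The scalar sketch below follows that reading; it is an assumption based on the descriptions and on the cited paper, not a copy of this class, and the helper names sign and addsign_step are hypothetical.

```python
def sign(x):
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def addsign_step(param, grad, m, lr=0.001, beta=0.9, alpha=1.0, internal_decay=1.0):
    """One AddSign-style step on a scalar parameter (illustrative sketch)."""
    # Running average of the gradient.
    m = beta * m + (1.0 - beta) * grad
    # Larger step when gradient and running average agree in sign.
    scale = alpha + internal_decay * sign(grad) * sign(m)
    param = param - lr * scale * grad
    return param, m


agree, _ = addsign_step(1.0, grad=0.5, m=0.2)      # signs agree: scale is 2
disagree, _ = addsign_step(1.0, grad=0.5, m=-1.0)  # signs disagree: scale is 0
print(agree, disagree)
```

With the defaults, a gradient that disagrees with its running average produces no movement at all, while an agreeing gradient is applied with twice the base step.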
Apollo¶
-
class
pywick.optimizers.apollo.
Apollo
(params: Union[Iterable[torch.Tensor], Iterable[Dict[str, Any]]], lr: float = 0.01, beta: float = 0.9, eps: float = 0.0001, warmup: int = 0, init_lr: float = 0.01, weight_decay: float = 0)[source]¶ Implements Apollo Optimizer Algorithm.
It has been proposed in Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization.
- Arguments:
- params: iterable of parameters to optimize or dicts defining parameter groups
- lr: learning rate (default: 1e-2)
- beta: coefficient used for computing running averages of gradient (default: 0.9)
- eps: term added to the denominator to improve numerical stability (default: 1e-4)
- warmup: number of warmup steps (default: 0)
- init_lr: initial learning rate for warmup (default: 0.01)
- weight_decay: weight decay (L2 penalty) (default: 0)
- Note:
- Reference code: https://github.com/XuezheMax/apollo
Lars¶
PyTorch LARS / LARC Optimizer An implementation of LARS (SGD) + LARC in PyTorch Based on:
Additional cleanup and modifications to properly support PyTorch XLA. Copyright 2021 Ross Wightman
-
class
pywick.optimizers.lars.
Lars
(params, lr=1.0, momentum=0, dampening=0, weight_decay=0, nesterov=False, trust_coeff=0.001, eps=1e-08, trust_clip=False, always_adapt=False)[source]¶ LARS for PyTorch
Paper: Large batch training of Convolutional Networks - https://arxiv.org/pdf/1708.03888.pdf
- Args:
- params (iterable): iterable of parameters to optimize or dicts defining parameter groups.
- lr (float, optional): learning rate (default: 1.0).
- momentum (float, optional): momentum factor (default: 0)
- weight_decay (float, optional): weight decay (L2 penalty) (default: 0)
- dampening (float, optional): dampening for momentum (default: 0)
- nesterov (bool, optional): enables Nesterov momentum (default: False)
- trust_coeff (float): trust coefficient for computing adaptive lr / trust_ratio (default: 0.001)
- eps (float): eps for division denominator (default: 1e-8)
- trust_clip (bool): enable LARC trust ratio clipping (default: False)
- always_adapt (bool): always apply LARS LR adapt, otherwise only when group weight_decay != 0 (default: False)
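The trust_coeff and eps arguments can be read against the layer-wise trust ratio from the LARS paper. The sketch below computes that ratio for one layer in plain Python; it is a simplified assumption about the technique (no momentum, no trust_clip, no always_adapt logic), and the function name lars_scaled_grad is hypothetical.

```python
import math


def lars_scaled_grad(weights, grads, weight_decay=0.0, trust_coeff=0.001, eps=1e-8):
    """Scale one layer's gradient by a LARS-style trust ratio (illustrative sketch)."""
    w_norm = math.sqrt(sum(w * w for w in weights))
    g_norm = math.sqrt(sum(g * g for g in grads))
    if w_norm > 0 and g_norm > 0:
        trust_ratio = trust_coeff * w_norm / (g_norm + weight_decay * w_norm + eps)
    else:
        # Fall back to an unscaled update when either norm is zero.
        trust_ratio = 1.0
    scaled = [trust_ratio * (g + weight_decay * w) for w, g in zip(weights, grads)]
    return scaled, trust_ratio


# Layer with weight norm 5 and gradient norm 1.
scaled, ratio = lars_scaled_grad([3.0, 4.0], [0.6, 0.8])
_, ratio_zero_grad = lars_scaled_grad([3.0, 4.0], [0.0, 0.0])
print(ratio, scaled)
```

Because the ratio is proportional to the weight norm divided by the gradient norm, layers with large weights and small gradients take relatively larger steps than a single global learning rate would give them.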
Eve¶
-
class
pywick.optimizers.eve.
Eve
(params, lr=0.001, betas=(0.9, 0.999, 0.999), eps=1e-08, k=0.1, K=10, weight_decay=0)[source]¶ Implementation of Eve: A Gradient Based Optimization Method with Locally and Globally Adaptive Learning Rates
-
step
(closure)[source]¶ Parameters: closure – (closure). see http://pytorch.org/docs/optim.html#optimizer-step-closure Returns: loss
-
Lookahead¶
-
class
pywick.optimizers.lookahead.
Lookahead
(optimizer, k=5, alpha=0.5)[source]¶ Implementation of Lookahead Optimizer: k steps forward, 1 step back
- Args:
param optimizer: - the optimizer to work with (sgd, adam etc)
param k: (int) - number of steps to look ahead (default=5)
param alpha: (float) - slow weights step size
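The "k steps forward, 1 step back" idea can be sketched without any framework. Below, plain gradient descent plays the role of the wrapped optimizer on a one-dimensional quadratic; the function name lookahead_minimize, the inner_lr argument and the test function are assumptions made for illustration, not part of this class.

```python
def lookahead_minimize(grad_fn, w0, inner_lr=0.1, k=5, alpha=0.5, outer_steps=20):
    """Minimize a 1-D function with a Lookahead-style slow/fast weight scheme."""
    slow = w0
    for _ in range(outer_steps):
        fast = slow
        # k steps forward with the inner optimizer (plain gradient descent here).
        for _ in range(k):
            fast = fast - inner_lr * grad_fn(fast)
        # 1 step back: move the slow weights a fraction alpha toward the fast weights.
        slow = slow + alpha * (fast - slow)
    return slow


# f(w) = (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3.
w_final = lookahead_minimize(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_final)
```

Here alpha is the slow weights step size: alpha=1 would simply adopt the fast weights every k steps, while smaller values interpolate more conservatively.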
LookaheadSGD¶
MADGrad¶
PyTorch MADGRAD optimizer MADGRAD: https://arxiv.org/abs/2101.11075 Code from: https://github.com/facebookresearch/madgrad
-
class
pywick.optimizers.madgrad.
MADGRAD
(params: Any, lr: float = 0.01, momentum: float = 0.9, weight_decay: float = 0, eps: float = 1e-06, decoupled_decay: bool = False)[source]¶ MADGRAD: A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization (https://arxiv.org/abs/2101.11075).
MADGRAD is a general purpose optimizer that can be used in place of SGD or Adam and may converge faster and generalize better. Currently GPU-only. Typically, the same learning rate schedule that is used for SGD or Adam may be used. The overall learning rate is not comparable to either method and should be determined by a hyper-parameter sweep.
MADGRAD requires less weight decay than other methods, often as little as zero. Momentum values used for SGD or Adam’s beta1 should work here also.
On sparse problems both weight_decay and momentum should be set to 0.
- Arguments:
- params (iterable): Iterable of parameters to optimize or dicts defining parameter groups.
- lr (float): Learning rate (default: 1e-2).
- momentum (float): Momentum value in the range [0,1) (default: 0.9).
- weight_decay (float): Weight decay, i.e. a L2 penalty (default: 0).
- eps (float): Term added to the denominator outside of the root operation to improve numerical stability (default: 1e-6).
-
step
(closure: Optional[Callable[float]] = None) → Optional[float][source]¶ Performs a single optimization step. Arguments:
closure (callable, optional): A closure that reevaluates the model and returns the loss.
-
supports_flat_params
¶
-
supports_memory_efficient_fp16
¶
Nadam¶
-
class
pywick.optimizers.nadam.
Nadam
(params, lr=0.002, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, schedule_decay=0.004)[source]¶ Implements Nadam algorithm (a variant of Adam based on Nesterov momentum).
It has been proposed in Incorporating Nesterov Momentum into Adam.
Parameters: - params – (iterable): iterable of parameters to optimize or dicts defining parameter groups
- lr – (float, optional): learning rate (default: 2e-3)
- betas – (Tuple[float, float], optional): coefficients used for computing running averages of gradient and its square
- eps – (float, optional): term added to the denominator to improve numerical stability (default: 1e-8)
- weight_decay – (float, optional): weight decay (L2 penalty) (default: 0)
- schedule_decay – (float, optional): momentum schedule decay (default: 4e-3)
PowerSign¶
-
class
pywick.optimizers.powersign.
PowerSign
(params, lr=0.001, beta=0.9, alpha=2.718281828459045, sign_internal_decay=None)[source]¶ Implements PowerSign algorithm.
It has been proposed in Neural Optimizer Search with Reinforcement Learning.
Parameters: - params – (iterable): iterable of parameters to optimize or dicts defining parameter groups
- lr – (float, optional): learning rate (default: 1e-3)
- beta – (float, optional): coefficient used for computing the running average of the gradient (default: 0.9)
- alpha – (float, optional): term powered to the internal_decay * sign(g) * sign(m) (default: math.e)
- sign_internal_decay – (callable, optional): a function that returns an internal decay calculated based on the current training step and the total number of training steps. If None, the internal decay is assumed to be 1.
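Reading the alpha description above, the gradient is scaled by alpha raised to the power internal_decay * sign(g) * sign(m). The scalar sketch below follows that reading; it is an assumption drawn from the parameter descriptions and the cited paper rather than this class's code, and the helper names sign and powersign_step are hypothetical.

```python
import math


def sign(x):
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def powersign_step(param, grad, m, lr=0.001, beta=0.9, alpha=math.e, internal_decay=1.0):
    """One PowerSign-style step on a scalar parameter (illustrative sketch)."""
    # Running average of the gradient.
    m = beta * m + (1.0 - beta) * grad
    # Multiply the step by alpha when signs agree, divide by alpha when they disagree.
    scale = alpha ** (internal_decay * sign(grad) * sign(m))
    param = param - lr * scale * grad
    return param, m


agree, _ = powersign_step(1.0, grad=0.5, m=0.2)
disagree, _ = powersign_step(1.0, grad=0.5, m=-1.0)
print(1.0 - agree, 1.0 - disagree)
```

Unlike the additive variant, a disagreeing gradient is damped by a factor of 1/alpha rather than cancelled outright.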
QHAdam¶
-
class
pywick.optimizers.qhadam.
QHAdam
(params: Union[Iterable[torch.Tensor], Iterable[Dict[str, Any]]], lr: float = 0.001, betas: Tuple[float, float] = (0.9, 0.999), nus: Tuple[float, float] = (1.0, 1.0), weight_decay: float = 0.0, decouple_weight_decay: bool = False, eps: float = 1e-08)[source]¶ Implements the QHAdam optimization algorithm.
It has been proposed in Quasi-hyperbolic momentum and Adam for deep learning.
- Arguments:
- params: iterable of parameters to optimize or dicts defining parameter groups
- lr: learning rate (default: 1e-3)
- betas: coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))
- nus: immediate discount factors used to estimate the gradient and its square (default: (1.0, 1.0))
- eps: term added to the denominator to improve numerical stability (default: 1e-8)
- weight_decay: weight decay (L2 penalty) (default: 0)
- decouple_weight_decay: whether to decouple the weight decay from the gradient-based optimization step (default: False)
- Note:
- Reference code: https://github.com/facebookresearch/qhoptim
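The role of nus can be sketched on a scalar: each nu blends the raw gradient (or its square) with the corresponding running average. The code below is a simplified reading of the quasi-hyperbolic update with Adam-style bias correction, written as an assumption for illustration; it is not the reference implementation, and the function name qhadam_step and the state dictionary layout are hypothetical.

```python
import math


def qhadam_step(param, grad, state, lr=0.001, betas=(0.9, 0.999),
                nus=(1.0, 1.0), eps=1e-8):
    """One QHAdam-style step on a scalar parameter (illustrative sketch)."""
    beta1, beta2 = betas
    nu1, nu2 = nus
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1.0 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * grad * grad
    m_hat = state["m"] / (1.0 - beta1 ** state["t"])
    v_hat = state["v"] / (1.0 - beta2 ** state["t"])
    # nu1 and nu2 interpolate between the immediate gradient and the averages.
    numer = (1.0 - nu1) * grad + nu1 * m_hat
    denom = math.sqrt((1.0 - nu2) * grad * grad + nu2 * v_hat) + eps
    return param - lr * numer / denom


state = {"t": 0, "m": 0.0, "v": 0.0}
p = 1.0
for g in (0.5, -0.2, 0.1):
    p = qhadam_step(p, g, state, nus=(0.7, 1.0))
print(p)
```

In this sketch nus=(1.0, 1.0), the default, reduces to a plain Adam step, while nus=(0.0, 1.0) drops momentum from the numerator and keeps only the adaptive denominator.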
RAdam¶
-
class
pywick.optimizers.nadam.
Nadam
(params, lr=0.002, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, schedule_decay=0.004)[source] Implements Nadam algorithm (a variant of Adam based on Nesterov momentum).
It has been proposed in Incorporating Nesterov Momentum into Adam.
Parameters: - params – (iterable): iterable of parameters to optimize or dicts defining parameter groups
- lr – (float, optional): learning rate (default: 2e-3)
- betas – (Tuple[float, float], optional): coefficients used for computing running averages of gradient and its square
- eps – (float, optional): term added to the denominator to improve numerical stability (default: 1e-8)
- weight_decay – (float, optional): weight decay (L2 penalty) (default: 0)
- schedule_decay – (float, optional): momentum schedule decay (default: 4e-3)
-
step
(closure=None)[source] Performs a single optimization step.
Parameters: closure – (callable, optional): A closure that reevaluates the model and returns the loss.
Ralamb¶
-
class
pywick.optimizers.radam.
RAdam
(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.0001)[source]¶ Implementation of the RAdam optimizer.
The learning rate warmup for Adam is a must-have trick for stable training in certain situations (or eps tuning). But the underlying mechanism is largely unknown. In our study, we suggest one fundamental cause is the large variance of the adaptive learning rates, and provide both theoretical and empirical supporting evidence.
- Args:
- params (iterable): iterable of parameters to optimize or dicts defining parameter groups
- lr (float, optional): learning rate (default: 1e-3)
- betas (Tuple[float, float], optional): coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))
- eps (float, optional): term added to the denominator to improve numerical stability (default: 1e-8)
- weight_decay (float, optional): weight decay coefficient (default: 1e-4)
RangerLARS¶
SGDW¶
-
class
pywick.optimizers.sgdw.
SGDW
(params, lr=0.003, momentum=0, dampening=0, weight_decay=0, nesterov=False)[source]¶ Implements stochastic gradient descent with decoupled weight decay (SGDW), optionally with momentum.
It has been proposed in Fixing Weight Decay Regularization in Adam.
Nesterov momentum is based on the formula from On the importance of initialization and momentum in deep learning.
Parameters: - params – (iterable): iterable of parameters to optimize or dicts defining parameter groups
- lr – (float): learning rate (default: 3e-3)
- momentum – (float, optional): momentum factor (default: 0)
- weight_decay – (float, optional): weight decay (L2 penalty) (default: 0)
- dampening – (float, optional): dampening for momentum (default: 0)
- nesterov – (bool, optional): enables Nesterov momentum (default: False)
- Example:
>>> from pywick.optimizers.sgdw import SGDW
>>> optimizer = SGDW(model.parameters(), lr=0.1, momentum=0.9)
>>> optimizer.zero_grad()
>>> loss_fn(model(input_), target).backward()
>>> optimizer.step()
Note
The implementation of SGD with Momentum/Nesterov subtly differs from Sutskever et al. and implementations in some other frameworks.
Considering the specific case of Momentum, the update can be written as
\[\begin{split}v = \rho * v + g \\ p = p - lr * v\end{split}\]
where p, g, v and \(\rho\) denote the parameters, gradient, velocity, and momentum respectively.
This is in contrast to Sutskever et al. and other frameworks which employ an update of the form
\[\begin{split}v = \rho * v + lr * g \\ p = p - v\end{split}\]
The Nesterov version is analogously modified.
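The two momentum formulations in the note can be compared numerically. The sketch below applies both to the same scalar gradient sequence; the function names and the per-step learning-rate list are assumptions made for this illustration.

```python
def momentum_lr_outside(p, grads, lrs, rho):
    """v = rho * v + g; p = p - lr * v (the form used in this implementation)."""
    v = 0.0
    for g, lr in zip(grads, lrs):
        v = rho * v + g
        p = p - lr * v
    return p


def momentum_lr_inside(p, grads, lrs, rho):
    """v = rho * v + lr * g; p = p - v (the Sutskever et al. form)."""
    v = 0.0
    for g, lr in zip(grads, lrs):
        v = rho * v + lr * g
        p = p - v
    return p


grads = [1.0, 1.0]

# Constant learning rate: both forms give the same parameters.
const_a = momentum_lr_outside(0.0, grads, [0.1, 0.1], rho=0.9)
const_b = momentum_lr_inside(0.0, grads, [0.1, 0.1], rho=0.9)

# Learning rate dropped after the first step: the two forms diverge.
sched_a = momentum_lr_outside(0.0, grads, [0.1, 0.01], rho=0.9)
sched_b = momentum_lr_inside(0.0, grads, [0.1, 0.01], rho=0.9)
print(const_a, const_b, sched_a, sched_b)
```

With a constant learning rate the velocity of the second form is just lr times the velocity of the first, so the trajectories coincide. Once the learning rate changes, the first form rescales the whole accumulated velocity immediately, while the second keeps earlier contributions at their old scale.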
SWA¶
-
class
pywick.optimizers.swa.
SWA
(optimizer, swa_start=None, swa_freq=None, swa_lr=None)[source]¶ Implements Stochastic Weight Averaging (SWA).
Stochastic Weight Averaging was proposed in Averaging Weights Leads to Wider Optima and Better Generalization by Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov and Andrew Gordon Wilson (UAI 2018).
SWA is implemented as a wrapper class taking optimizer instance as input and applying SWA on top of that optimizer.
SWA can be used in two modes: automatic and manual. In the automatic mode SWA running averages are automatically updated every swa_freq steps after swa_start steps of optimization. If swa_lr is provided, the learning rate of the optimizer is reset to swa_lr at every step starting from swa_start. To use SWA in automatic mode provide values for both swa_start and swa_freq arguments.
Alternatively, in the manual mode, use update_swa() or update_swa_group() methods to update the SWA running averages.
At the end of training use the swap_swa_sgd method to set the optimized variables to the computed averages.
Parameters: - optimizer – (torch.optim.Optimizer): optimizer to use with SWA
- swa_start – (int): number of steps before starting to apply SWA in automatic mode; if None, manual mode is selected (default: None)
- swa_freq – (int): number of steps between subsequent updates of SWA running averages in automatic mode; if None, manual mode is selected (default: None)
- swa_lr – (float): learning rate to use starting from step swa_start in automatic mode; if None, learning rate is not changed (default: None)
- Examples:
>>> from pywick.optimizers import SWA
>>> # automatic mode
>>> base_opt = torch.optim.SGD(model.parameters(), lr=0.1)
>>> opt = SWA(base_opt, swa_start=10, swa_freq=5, swa_lr=0.05)
>>> for _ in range(100):
>>>     opt.zero_grad()
>>>     loss_fn(model(input_), target).backward()
>>>     opt.step()
>>> opt.swap_swa_sgd()
>>> # manual mode
>>> opt = SWA(base_opt)
>>> for i in range(100):
>>>     opt.zero_grad()
>>>     loss_fn(model(input_), target).backward()
>>>     opt.step()
>>>     if i > 10 and i % 5 == 0:
>>>         opt.update_swa()
>>> opt.swap_swa_sgd()
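What the automatic mode accumulates can be sketched in plain Python as an incremental mean over the weights seen at the selected steps. This is a minimal illustration assuming averaging starts strictly after swa_start and happens on every step divisible by swa_freq; the function name swa_running_average and the fake weight trajectory are hypothetical.

```python
def swa_running_average(weights_per_step, swa_start, swa_freq):
    """Average a scalar weight over selected steps, as an SWA-style running mean."""
    swa_buffer, n_avg = 0.0, 0
    for step, w in enumerate(weights_per_step, start=1):
        if step > swa_start and step % swa_freq == 0:
            # Incremental mean: new_avg = avg + (w - avg) / (n + 1)
            swa_buffer += (w - swa_buffer) / (n_avg + 1)
            n_avg += 1
    return swa_buffer, n_avg


# Fake trajectory in which the weight after step i equals i.
trajectory = [float(i) for i in range(1, 21)]
avg, count = swa_running_average(trajectory, swa_start=10, swa_freq=5)
print(avg, count)
```

The incremental form needs only one buffer per parameter plus a counter, which is why the average can be maintained during training without storing past weights.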
Note
SWA does not support parameter-specific values of swa_start, swa_freq or swa_lr. In automatic mode SWA uses the same swa_start, swa_freq and swa_lr for all parameter groups. If needed, use manual mode with update_swa_group() to use different update schedules for different parameter groups.
Note
Call swap_swa_sgd() at the end of training to use the computed running averages.
Note
If you are using SWA to optimize the parameters of a Neural Network containing Batch Normalization layers, you need to update the running_mean and running_var statistics of the Batch Normalization module. You can do so by using torchcontrib.optim.swa.bn_update utility. For further description see this article.
-
add_param_group
(param_group)[source]¶ Add a param group to the Optimizer's param_groups.
This can be useful when fine tuning a pre-trained network as frozen layers can be made trainable and added to the Optimizer as training progresses.
Parameters: param_group – (dict): Specifies what Tensors should be optimized along with group specific optimization options.
-
static
bn_update
(loader, model, device=None)[source]¶ Updates BatchNorm running_mean, running_var buffers in the model.
It performs one pass over data in loader to estimate the activation statistics for BatchNorm layers in the model.
Parameters: - loader – (torch.utils.data.DataLoader): dataset loader to compute the activation statistics on. Each data batch should be either a tensor, or a list/tuple whose first element is a tensor containing data.
- model – (torch.nn.Module): model for which we seek to update BatchNorm statistics.
- device – (torch.device, optional): If set, data will be transferred to device before being passed into model.
-
load_state_dict
(state_dict)[source]¶ Loads the optimizer state.
Parameters: state_dict – (dict): SWA optimizer state. Should be an object returned from a call to state_dict.
-
state_dict
()[source]¶ Returns the state of SWA as a dict.
It contains three entries:
- opt_state - a dict holding current optimization state of the base optimizer. Its content differs between optimizer classes.
- swa_state - a dict containing current state of SWA. For each optimized variable it contains swa_buffer keeping the running average of the variable
- param_groups - a dict containing all parameter groups
-
swap_swa_sgd
()[source]¶ Swaps the values of the optimized variables and swa buffers.
It’s meant to be called at the end of training to use the collected swa running averages. It can also be used to evaluate the running averages during training; to continue training, swap_swa_sgd should be called again.
-
update_swa_group
(group)[source]¶ Updates the SWA running averages for the given parameter group.
Parameters: group – (dict): Specifies for what parameter group SWA running averages should be updated
- Examples:
>>> # manual mode
>>> from pywick.optimizers import SWA
>>> base_opt = torch.optim.SGD([{'params': [x]},
>>>             {'params': [y], 'lr': 1e-3}], lr=1e-2, momentum=0.9)
>>> opt = SWA(base_opt)
>>> for i in range(100):
>>>     opt.zero_grad()
>>>     loss_fn(model(input_), target).backward()
>>>     opt.step()
>>>     if i > 10 and i % 5 == 0:
>>>         # Update SWA for the second parameter group
>>>         opt.update_swa_group(opt.param_groups[1])
>>> opt.swap_swa_sgd()