Optimizers

Optimizers govern the path that your neural network takes as it tries to minimize error. Choosing the right optimizer and initializing it with suitable parameters can decide whether your network learns successfully or fails to learn at all. PyTorch already implements the most widely used flavors, such as SGD, Adam and RMSProp. Here we strive to include optimizers that PyTorch has missed (and any cutting-edge ones that have not yet been added).

A2Grad

class pywick.optimizers.a2grad.A2GradUni(params: Union[Iterable[torch.Tensor], Iterable[Dict[str, Any]]], lr: Optional[float] = None, beta: float = 10, lips: float = 10)

Implements A2GradUni Optimizer Algorithm.

It has been proposed in Optimal Adaptive and Accelerated Stochastic Gradient Descent.

Arguments:

params: iterable of parameters to optimize or dicts defining parameter groups
lr: not used for this optimizer (default: None)
beta: (default: 10)
lips: Lipschitz constant (default: 10)

Note: Reference code: https://github.com/severilov/A2Grad_optimizer

step(closure: Optional[Callable[[], float]] = None) → Optional[float]

Performs a single optimization step.

Arguments:

closure: A closure that reevaluates the model and returns the loss.

class pywick.optimizers.a2grad.A2GradInc(params: Union[Iterable[torch.Tensor], Iterable[Dict[str, Any]]], lr: Optional[float] = None, beta: float = 10, lips: float = 10)

Implements A2GradInc Optimizer Algorithm.

It has been proposed in Optimal Adaptive and Accelerated Stochastic Gradient Descent.

Arguments:

params: iterable of parameters to optimize or dicts defining parameter groups
lr: not used for this optimizer (default: None)
beta: (default: 10)
lips: Lipschitz constant (default: 10)

Note: Reference code: https://github.com/severilov/A2Grad_optimizer

step(closure: Optional[Callable[[], float]] = None) → Optional[float]

Performs a single optimization step.

Arguments:

closure: A closure that reevaluates the model and returns the loss.

class pywick.optimizers.a2grad.A2GradExp(params: Union[Iterable[torch.Tensor], Iterable[Dict[str, Any]]], lr: Optional[float] = None, beta: float = 10, lips: float = 10, rho: float = 0.5)

Implements A2GradExp Optimizer Algorithm.

It has been proposed in Optimal Adaptive and Accelerated Stochastic Gradient Descent.

Arguments:

params: iterable of parameters to optimize or dicts defining parameter groups
lr: not used for this optimizer (default: None)
beta: (default: 10)
lips: Lipschitz constant (default: 10)
rho: represents the degree of weighting decrease, a constant smoothing factor between 0 and 1 (default: 0.5)

Note: Reference code: https://github.com/severilov/A2Grad_optimizer

step(closure: Optional[Callable[[], float]] = None) → Optional[float]

Performs a single optimization step.

Arguments:

closure: A closure that reevaluates the model and returns the loss.

AdaBelief

class pywick.optimizers.adabelief.AdaBelief(params: Union[Iterable[torch.Tensor], Iterable[Dict[str, Any]]], lr: float = 0.001, betas: Tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, amsgrad: bool = False, weight_decouple: bool = False, fixed_decay: bool = False, rectify: bool = False)

Implements AdaBelief Optimizer Algorithm. It has been proposed in AdaBelief Optimizer, adapting stepsizes by the belief in observed gradients.

Arguments:

params: iterable of parameters to optimize or dicts defining parameter groups
lr: learning rate (default: 1e-3)
betas: coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))
eps: term added to the denominator to improve numerical stability (default: 1e-8)
weight_decay: weight decay (L2 penalty) (default: 0)
amsgrad: whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond (default: False)
weight_decouple: if set to True, the optimizer uses decoupled weight decay as in AdamW (default: False)
fixed_decay: used only when weight_decouple is set to True. When fixed_decay == True, the weight decay is performed as $W_{new} = W_{old} - W_{old} \times decay$. When fixed_decay == False, the weight decay is performed as $W_{new} = W_{old} - W_{old} \times decay \times lr$. Note that in the latter case the weight decay ratio decreases with the learning rate (lr). (default: False)
rectify: if set to True, perform the rectified update similar to RAdam (default: False)
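The two fixed_decay rules can be compared on a single scalar weight. The following is a minimal sketch of the formulas stated above, not the pywick implementation; the helper name decoupled_weight_decay is hypothetical.

```python
def decoupled_weight_decay(w_old, decay, lr, fixed_decay):
    """Apply one decoupled weight-decay update to a scalar weight.

    Hypothetical helper that only mirrors the two documented formulas.
    """
    if fixed_decay:
        # W_new = W_old - W_old * decay
        return w_old - w_old * decay
    # W_new = W_old - W_old * decay * lr
    return w_old - w_old * decay * lr


w = 2.0
fixed = decoupled_weight_decay(w, decay=0.1, lr=0.5, fixed_decay=True)
scaled = decoupled_weight_decay(w, decay=0.1, lr=0.5, fixed_decay=False)
print(fixed, scaled)
```

With fixed_decay=False the shrinkage is multiplied by lr, so a learning-rate schedule that lowers lr also weakens the decay.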

Note: Reference code: https://github.com/juntang-zhuang/Adabelief-Optimizer

step(closure: Optional[Callable[[], float]] = None) → Optional[float]

Performs a single optimization step.

Arguments:

closure: A closure that reevaluates the model and returns the loss.

AdaHessian

AdaHessian Optimizer

class pywick.optimizers.adahessian.Adahessian(params, lr=0.1, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.0, hessian_power=1.0, update_each=1, n_samples=1, avg_conv_kernel=False)

Implements the AdaHessian algorithm from “ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning”

Arguments:

params (iterable): iterable of parameters to optimize or dicts defining parameter groups
lr (float, optional): learning rate (default: 0.1)
betas ((float, float), optional): coefficients used for computing running averages of gradient and the squared hessian trace (default: (0.9, 0.999))
eps (float, optional): term added to the denominator to improve numerical stability (default: 1e-8)
weight_decay (float, optional): weight decay (L2 penalty) (default: 0.0)
hessian_power (float, optional): exponent of the hessian trace (default: 1.0)
update_each (int, optional): compute the hessian trace approximation only after this number of steps (to save time) (default: 1)
n_samples (int, optional): how many times to sample z for the approximation of the hessian trace (default: 1)

get_params()

Gets all parameters in all param_groups with gradients

is_second_order

set_hessian()

Computes the Hutchinson approximation of the hessian trace and accumulates it for each trainable parameter.
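To illustrate the idea behind the Hutchinson approximation, here is a pure-Python sketch of the standard estimator on a toy quadratic. It is not the pywick code: the function hutchinson_diagonal, the matrix A and the helper hvp are hypothetical, and the Hessian-vector product is computed directly from A rather than through autograd. Each entry of z * (Hz), with z drawn from {-1, +1}, is an unbiased estimate of the corresponding Hessian diagonal entry, and the sum of the entries estimates the trace.

```python
import random


def hutchinson_diagonal(hvp, dim, n_samples=1, seed=0):
    """Average z * (H z) over n_samples Rademacher vectors z."""
    rng = random.Random(seed)
    estimate = [0.0] * dim
    for _ in range(n_samples):
        z = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        hz = hvp(z)  # Hessian-vector product H z
        for i in range(dim):
            estimate[i] += z[i] * hz[i] / n_samples
    return estimate


# Hypothetical test problem: f(x) = 0.5 * x^T A x, so H = A and H z = A z.
A = [[2.0, 0.0], [0.0, 5.0]]


def hvp(z):
    return [sum(a_ij * z_j for a_ij, z_j in zip(row, z)) for row in A]


diag = hutchinson_diagonal(hvp, dim=2, n_samples=4)
print(diag, sum(diag))
```

For a diagonal Hessian the estimate is exact for every sample; with off-diagonal terms, more samples (the n_samples argument above) reduce the variance.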

step(closure=None)

Performs a single optimization step.

Arguments:

closure (callable, optional): a closure that reevaluates the model and returns the loss (default: None)

zero_hessian()

Zeros out the accumulated hessian traces.

AdamP

AdamP Optimizer Implementation copied from https://github.com/clovaai/AdamP/blob/master/adamp/adamp.py

Paper: Slowing Down the Weight Norm Increase in Momentum-based Optimizers - https://arxiv.org/abs/2006.08217
Code: https://github.com/clovaai/AdamP
Copyright (c) 2020-present NAVER Corp. MIT license

class pywick.optimizers.adamp.AdamP(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, delta=0.1, wd_ratio=0.1, nesterov=False)

step(closure=None)

AdamW

class pywick.optimizers.adamw.AdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, amsgrad=False)

Implements AdamW algorithm.

The original Adam algorithm was proposed in Adam: A Method for Stochastic Optimization. The AdamW variant was proposed in Decoupled Weight Decay Regularization.

Arguments:

params (iterable): iterable of parameters to optimize or dicts defining parameter groups
lr (float, optional): learning rate (default: 1e-3)
betas (Tuple[float, float], optional): coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))
eps (float, optional): term added to the denominator to improve numerical stability (default: 1e-8)
weight_decay (float, optional): weight decay coefficient (default: 1e-2)
amsgrad (boolean, optional): whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond (default: False)

step(closure=None)

Performs a single optimization step.

Arguments:

closure (callable, optional): A closure that reevaluates the model and returns the loss.

AddSign

class pywick.optimizers.addsign.AddSign(params, lr=0.001, beta=0.9, alpha=1, sign_internal_decay=None)

Implements AddSign algorithm.

It has been proposed in Neural Optimizer Search with Reinforcement Learning.

Parameters:

params – (iterable): iterable of parameters to optimize or dicts defining parameter groups
lr – (float, optional): learning rate (default: 1e-3)
beta – (float, optional): coefficients used for computing running averages of gradient (default: 0.9)
alpha – (float, optional): term added to the internal_decay * sign(g) * sign(m) (default: 1)
sign_internal_decay – (callable, optional): a function that returns an internal decay calculated based on the current training step and the total number of training steps. If None, the internal decay is assumed to be 1.
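The role of alpha and the internal decay can be shown with a scalar sketch of the AddSign rule as described in Neural Optimizer Search with Reinforcement Learning. This is an illustration only: the function addsign_step is hypothetical, the internal decay is passed in as a plain number, and updating the running average m before taking its sign is an assumption about ordering.

```python
def sign(x):
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def addsign_step(p, g, m, lr=1e-3, beta=0.9, alpha=1.0, internal_decay=1.0):
    """One AddSign update for a scalar parameter p with gradient g."""
    # Running average of the gradient.
    m = beta * m + (1.0 - beta) * g
    # alpha is added to internal_decay * sign(g) * sign(m).
    scale = alpha + internal_decay * sign(g) * sign(m)
    return p - lr * scale * g, m


# Gradient and running average agree in sign: the step is scaled by alpha + 1.
p_agree, m_agree = addsign_step(p=1.0, g=2.0, m=0.0)
# They disagree: with alpha = 1 the scale is 0 and the parameter does not move.
p_disagree, m_disagree = addsign_step(p=1.0, g=0.5, m=-1.0)
print(p_agree, p_disagree)
```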

step(closure=None)

Performs a single optimization step.

Parameters:

closure – (callable, optional): A closure that reevaluates the model and returns the loss.

Apollo

class pywick.optimizers.apollo.Apollo(params: Union[Iterable[torch.Tensor], Iterable[Dict[str, Any]]], lr: float = 0.01, beta: float = 0.9, eps: float = 0.0001, warmup: int = 0, init_lr: float = 0.01, weight_decay: float = 0)

Implements Apollo Optimizer Algorithm.

It has been proposed in Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization.

Arguments:

params: iterable of parameters to optimize or dicts defining parameter groups
lr: learning rate (default: 1e-2)
beta: coefficient used for computing running averages of gradient (default: 0.9)
eps: term added to the denominator to improve numerical stability (default: 1e-4)
warmup: number of warmup steps (default: 0)
init_lr: initial learning rate for warmup (default: 0.01)
weight_decay: weight decay (L2 penalty) (default: 0)

Note: Reference code: https://github.com/XuezheMax/apollo

step(closure: Optional[Callable[[], float]] = None) → Optional[float]

Performs a single optimization step.

Arguments:

closure: A closure that reevaluates the model and returns the loss.

Lars

PyTorch LARS / LARC Optimizer

An implementation of LARS (SGD) + LARC in PyTorch. Based on:

PyTorch SGD: https://github.com/pytorch/pytorch/blob/1.7/torch/optim/sgd.py#L100

NVIDIA APEX LARC: https://github.com/NVIDIA/apex/blob/master/apex/parallel/LARC.py

class pywick.optimizers.lars.Lars(params, lr=1.0, momentum=0, dampening=0, weight_decay=0, nesterov=False, trust_coeff=0.001, eps=1e-08, trust_clip=False, always_adapt=False)

LARS for PyTorch

Paper: Large batch training of Convolutional Networks - https://arxiv.org/pdf/1708.03888.pdf

Args:

params (iterable): iterable of parameters to optimize or dicts defining parameter groups.
lr (float, optional): learning rate (default: 1.0).
momentum (float, optional): momentum factor (default: 0)
weight_decay (float, optional): weight decay (L2 penalty) (default: 0)
dampening (float, optional): dampening for momentum (default: 0)
nesterov (bool, optional): enables Nesterov momentum (default: False)
trust_coeff (float): trust coefficient for computing adaptive lr / trust_ratio (default: 0.001)
eps (float): eps for division denominator (default: 1e-8)
trust_clip (bool): enable LARC trust ratio clipping (default: False)
always_adapt (bool): always apply LARS LR adapt, otherwise only when group weight_decay != 0 (default: False)
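How trust_coeff, eps and trust_clip interact can be sketched with the layer-wise trust ratio from the LARS paper plus LARC-style clipping. This is a pure-Python illustration under assumptions, not the pywick source: the names l2_norm and lars_trust_ratio are hypothetical, and falling back to a ratio of 1.0 when either norm is zero is an assumed convention.

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))


def lars_trust_ratio(weights, grads, lr=1.0, weight_decay=0.0,
                     trust_coeff=0.001, eps=1e-8, trust_clip=False):
    """Layer-wise scaling factor applied to the gradient of one layer."""
    w_norm = l2_norm(weights)
    g_norm = l2_norm(grads)
    if w_norm == 0.0 or g_norm == 0.0:
        # Assumed fallback: leave the update unscaled.
        return 1.0
    ratio = trust_coeff * w_norm / (g_norm + weight_decay * w_norm + eps)
    if trust_clip:
        # LARC clipping: the effective learning rate never exceeds lr.
        ratio = min(ratio / lr, 1.0)
    return ratio


plain = lars_trust_ratio([3.0, 4.0], [0.6, 0.8])
clipped = lars_trust_ratio([3.0, 4.0], [0.6, 0.8], lr=0.001, trust_clip=True)
print(plain, clipped)
```

A layer whose weights are large relative to its gradient gets a larger ratio, which is what lets LARS use one global learning rate across layers of very different scale.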

step(closure=None)

Performs a single optimization step.

Args:

closure (callable, optional): A closure that reevaluates the model and returns the loss.

Eve

class pywick.optimizers.eve.Eve(params, lr=0.001, betas=(0.9, 0.999, 0.999), eps=1e-08, k=0.1, K=10, weight_decay=0)

Implementation of Eve: A Gradient Based Optimization Method with Locally and Globally Adaptive Learning Rates

step(closure)

Parameters:

closure – (closure): see http://pytorch.org/docs/optim.html#optimizer-step-closure

Returns: loss

Lookahead

class pywick.optimizers.lookahead.Lookahead(optimizer, k=5, alpha=0.5)

Implementation of Lookahead Optimizer: k steps forward, 1 step back

Args:

optimizer: the optimizer to work with (sgd, adam etc)
k: (int) number of steps to look ahead (default: 5)
alpha: (float) slow weights step size (default: 0.5)
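The "k steps forward, 1 step back" rule can be sketched on a scalar problem. The example below is a minimal illustration, assuming plain gradient descent as the inner (fast) optimizer; the function lookahead_descent and the toy objective f(x) = x^2 are hypothetical and not part of pywick.

```python
def lookahead_descent(grad_fn, x0, lr=0.1, k=5, alpha=0.5, n_steps=5):
    """Run n_steps of gradient descent wrapped in the Lookahead rule."""
    slow = x0  # slow weights
    fast = x0  # fast weights, updated by the inner optimizer
    for step in range(1, n_steps + 1):
        # k steps forward: ordinary inner-optimizer update.
        fast -= lr * grad_fn(fast)
        if step % k == 0:
            # 1 step back: move slow weights toward the fast weights ...
            slow += alpha * (fast - slow)
            # ... and restart the fast weights from the slow weights.
            fast = slow
    return fast, slow


# Toy objective f(x) = x^2 with gradient 2x.
fast, slow = lookahead_descent(lambda x: 2.0 * x, x0=1.0)
print(fast, slow)
```

After the k-th step the fast weights have reached 0.8^5 of their start value, and the slow weights sit halfway (alpha = 0.5) between the start and that point.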

add_param_group(param_group)

load_state_dict(state_dict)

state_dict()

step(closure=None)

update(group)

update_lookahead()

LookaheadSGD

class pywick.optimizers.lookaheadsgd.LookaheadSGD(params, lr, alpha=0.5, k=6, momentum=0.9, dampening=0, weight_decay=0.0001, nesterov=False)

MADGrad

PyTorch MADGRAD optimizer

MADGRAD: https://arxiv.org/abs/2101.11075
Code from: https://github.com/facebookresearch/madgrad

class pywick.optimizers.madgrad.MADGRAD(params: Any, lr: float = 0.01, momentum: float = 0.9, weight_decay: float = 0, eps: float = 1e-06, decoupled_decay: bool = False)

MADGRAD: A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization (https://arxiv.org/abs/2101.11075).

MADGRAD is a general-purpose optimizer that can be used in place of SGD or Adam and may converge faster and generalize better. Currently GPU-only. Typically, the same learning rate schedule that is used for SGD or Adam may be used. The overall learning rate is not comparable to either method and should be determined by a hyper-parameter sweep.

MADGRAD requires less weight decay than other methods, often as little as zero. Momentum values used for SGD or Adam’s beta1 should work here also.

On sparse problems both weight_decay and momentum should be set to 0.

Arguments:

params (iterable): Iterable of parameters to optimize or dicts defining parameter groups.
lr (float): Learning rate (default: 1e-2).
momentum (float): Momentum value in the range [0,1) (default: 0.9).
weight_decay (float): Weight decay, i.e. a L2 penalty (default: 0).
eps (float): Term added to the denominator outside of the root operation to improve numerical stability. (default: 1e-6).

step(closure: Optional[Callable[[], float]] = None) → Optional[float]

Performs a single optimization step.

Arguments:

closure (callable, optional): A closure that reevaluates the model and returns the loss.

supports_flat_params

supports_memory_efficient_fp16

Nadam

class pywick.optimizers.nadam.Nadam(params, lr=0.002, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, schedule_decay=0.004)

Implements Nadam algorithm (a variant of Adam based on Nesterov momentum).

It has been proposed in Incorporating Nesterov Momentum into Adam.

Parameters:

params – (iterable): iterable of parameters to optimize or dicts defining parameter groups
lr – (float, optional): learning rate (default: 2e-3)
betas – (Tuple[float, float], optional): coefficients used for computing running averages of gradient and its square
eps – (float, optional): term added to the denominator to improve numerical stability (default: 1e-8)
weight_decay – (float, optional): weight decay (L2 penalty) (default: 0)
schedule_decay – (float, optional): momentum schedule decay (default: 4e-3)

step(closure=None)

Performs a single optimization step.

Parameters:

closure – (callable, optional): A closure that reevaluates the model and returns the loss.

PowerSign

class pywick.optimizers.powersign.PowerSign(params, lr=0.001, beta=0.9, alpha=2.718281828459045, sign_internal_decay=None)

Implements PowerSign algorithm.

It has been proposed in Neural Optimizer Search with Reinforcement Learning.

Parameters:

params – (iterable): iterable of parameters to optimize or dicts defining parameter groups
lr – (float, optional): learning rate (default: 1e-3)
beta – (float, optional): coefficients used for computing running averages of gradient (default: 0.9)
alpha – (float, optional): term powered to the internal_decay * sign(g) * sign(m) (default: math.e)
sign_internal_decay – (callable, optional): a function that returns an internal decay calculated based on the current training step and the total number of training steps. If None, the internal decay is assumed to be 1.
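A scalar sketch of the PowerSign rule, as described in Neural Optimizer Search with Reinforcement Learning, shows how alpha is raised to the power internal_decay * sign(g) * sign(m). It is an illustration only: powersign_step is a hypothetical name, the internal decay is passed in as a plain number, and updating the running average m before taking its sign is an assumption about ordering.

```python
import math


def sign(x):
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def powersign_step(p, g, m, lr=1e-3, beta=0.9, alpha=math.e, internal_decay=1.0):
    """One PowerSign update for a scalar parameter p with gradient g."""
    # Running average of the gradient.
    m = beta * m + (1.0 - beta) * g
    # alpha is raised to the power internal_decay * sign(g) * sign(m).
    scale = alpha ** (internal_decay * sign(g) * sign(m))
    return p - lr * scale * g, m


# Signs agree: the step is amplified by a factor of alpha.
p_agree, _ = powersign_step(p=1.0, g=2.0, m=0.0)
# Signs disagree: the step is damped by a factor of 1 / alpha.
p_disagree, _ = powersign_step(p=1.0, g=0.5, m=-1.0)
print(p_agree, p_disagree)
```

Unlike AddSign, the step never vanishes when the signs disagree; it is only shrunk.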

step(closure=None)

Performs a single optimization step.

Parameters:

closure – (callable, optional): A closure that reevaluates the model and returns the loss.

QHAdam

class pywick.optimizers.qhadam.QHAdam(params: Union[Iterable[torch.Tensor], Iterable[Dict[str, Any]]], lr: float = 0.001, betas: Tuple[float, float] = (0.9, 0.999), nus: Tuple[float, float] = (1.0, 1.0), weight_decay: float = 0.0, decouple_weight_decay: bool = False, eps: float = 1e-08)

Implements the QHAdam optimization algorithm.

It has been proposed in Quasi-hyperbolic momentum and Adam for deep learning.

Arguments:

params: iterable of parameters to optimize or dicts defining parameter groups
lr: learning rate (default: 1e-3)
betas: coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))
nus: immediate discount factors used to estimate the gradient and its square (default: (1.0, 1.0))
eps: term added to the denominator to improve numerical stability (default: 1e-8)
weight_decay: weight decay (L2 penalty) (default: 0)
decouple_weight_decay: whether to decouple the weight decay from the gradient-based optimization step (default: False)
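The effect of nus is easiest to see in a scalar sketch of the quasi-hyperbolic update: each nu interpolates between the raw gradient statistic and its bias-corrected running average. The code below is a hedged illustration, not the pywick or qhoptim implementation; qhadam_step is a hypothetical name, weight decay is omitted, and the bias correction is written in the textbook Adam form.

```python
import math


def qhadam_step(p, g, m, v, t, lr=1e-3, betas=(0.9, 0.999),
                nus=(1.0, 1.0), eps=1e-8):
    """One QHAdam update for a scalar parameter p at step t (t starts at 1)."""
    beta1, beta2 = betas
    nu1, nu2 = nus
    # Running averages of the gradient and its square.
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    # Bias-corrected estimates, as in Adam.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # nus blend the immediate gradient with the running averages.
    numerator = (1.0 - nu1) * g + nu1 * m_hat
    denominator = math.sqrt((1.0 - nu2) * g * g + nu2 * v_hat) + eps
    return p - lr * numerator / denominator, m, v


# nus = (1, 1) reduces to a plain Adam step.
p_adam, m1, v1 = qhadam_step(p=1.0, g=2.0, m=0.0, v=0.0, t=1)
# nus = (0, 0) ignores the running averages and uses only the current gradient.
p_raw, _, _ = qhadam_step(p=1.0, g=-4.0, m=5.0, v=3.0, t=10, nus=(0.0, 0.0))
print(p_adam, p_raw)
```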

Note: Reference code: https://github.com/facebookresearch/qhoptim

step(closure: Optional[Callable[[], float]] = None) → Optional[float]

Performs a single optimization step.

Arguments:

closure: A closure that reevaluates the model and returns the loss.

RAdam

class pywick.optimizers.radam.RAdam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.0001)

Implementation of the RAdam optimizer.

The learning rate warmup for Adam is a must-have trick for stable training in certain situations (or eps tuning). But the underlying mechanism is largely unknown. In our study, we suggest one fundamental cause is the large variance of the adaptive learning rates, and provide both theoretical and empirical support evidence.

Args:

params (iterable): iterable of parameters to optimize or dicts defining parameter groups
lr (float, optional): learning rate (default: 1e-3)
betas (Tuple[float, float], optional): coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))
eps (float, optional): term added to the denominator to improve numerical stability (default: 1e-8)
weight_decay (float, optional): weight decay coefficient (default: 1e-4)

step(closure=None)

RangerLARS

class pywick.optimizers.rangerlars.RangerLars(params, alpha=0.5, k=6, *args, **kwargs)

SGDW

class pywick.optimizers.sgdw.SGDW(params, lr=0.003, momentum=0, dampening=0, weight_decay=0, nesterov=False)

Implements stochastic gradient descent with decoupled weight decay (SGDW), optionally with momentum.

It has been proposed in Fixing Weight Decay Regularization in Adam.

Nesterov momentum is based on the formula from On the importance of initialization and momentum in deep learning.

Parameters:

params – (iterable): iterable of parameters to optimize or dicts defining parameter groups
lr – (float): learning rate (default: 3e-3)
momentum – (float, optional): momentum factor (default: 0)
weight_decay – (float, optional): weight decay (L2 penalty) (default: 0)
dampening – (float, optional): dampening for momentum (default: 0)
nesterov – (bool, optional): enables Nesterov momentum (default: False)

Example:

>>> from pywick.optimizers.sgdw import SGDW
>>> optimizer = SGDW(model.parameters(), lr=0.1, momentum=0.9)
>>> optimizer.zero_grad()
>>> loss_fn(model(input_), target).backward()
>>> optimizer.step()

Note

The implementation of SGD with Momentum/Nesterov subtly differs from Sutskever et al. and implementations in some other frameworks.

Considering the specific case of Momentum, the update can be written as

\[\begin{split}v = \rho * v + g \\ p = p - lr * v\end{split}\]

where p, g, v and $\rho$ denote the parameters, gradient, velocity, and momentum respectively.

This is in contrast to Sutskever et al. and other frameworks, which employ an update of the form

\[\begin{split}v = \rho * v + lr * g \\ p = p - v\end{split}\]

The Nesterov version is analogously modified.
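The difference between the two momentum forms only shows up when the learning rate changes during training. The following pure-Python sketch applies both update rules from the note to a scalar parameter; the function names and the toy gradient sequence are hypothetical.

```python
def momentum_this_form(grads, lrs, rho=0.9):
    """v = rho * v + g ; p = p - lr * v"""
    p, v = 0.0, 0.0
    for g, lr in zip(grads, lrs):
        v = rho * v + g
        p = p - lr * v
    return p


def momentum_sutskever_form(grads, lrs, rho=0.9):
    """v = rho * v + lr * g ; p = p - v"""
    p, v = 0.0, 0.0
    for g, lr in zip(grads, lrs):
        v = rho * v + lr * g
        p = p - v
    return p


grads = [1.0, 1.0]

# Constant learning rate: both forms give the same parameter.
same_a = momentum_this_form(grads, [0.1, 0.1])
same_b = momentum_sutskever_form(grads, [0.1, 0.1])

# Learning rate dropped after the first step: the forms diverge, because the
# second form has already baked the old learning rate into the velocity.
decayed_a = momentum_this_form(grads, [0.1, 0.01])
decayed_b = momentum_sutskever_form(grads, [0.1, 0.01])
print(same_a, same_b, decayed_a, decayed_b)
```

In the first form a learning-rate drop takes effect on the whole accumulated velocity immediately; in the second form it only affects newly added gradients.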

SWA

class pywick.optimizers.swa.SWA(optimizer, swa_start=None, swa_freq=None, swa_lr=None)

Implements Stochastic Weight Averaging (SWA).

Stochastic Weight Averaging was proposed in Averaging Weights Leads to Wider Optima and Better Generalization by Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov and Andrew Gordon Wilson (UAI 2018).

SWA is implemented as a wrapper class taking optimizer instance as input and applying SWA on top of that optimizer.

SWA can be used in two modes: automatic and manual. In the automatic mode SWA running averages are automatically updated every swa_freq steps after swa_start steps of optimization. If swa_lr is provided, the learning rate of the optimizer is reset to swa_lr at every step starting from swa_start. To use SWA in automatic mode provide values for both swa_start and swa_freq arguments.

Alternatively, in the manual mode, use update_swa() or update_swa_group() methods to update the SWA running averages.

At the end of training, use the swap_swa_sgd() method to set the optimized variables to the computed averages.

Parameters:

optimizer – (torch.optim.Optimizer): optimizer to use with SWA
swa_start – (int): number of steps before starting to apply SWA in automatic mode; if None, manual mode is selected (default: None)
swa_freq – (int): number of steps between subsequent updates of SWA running averages in automatic mode; if None, manual mode is selected (default: None)
swa_lr – (float): learning rate to use starting from step swa_start in automatic mode; if None, learning rate is not changed (default: None)

Examples:

>>> from pywick.optimizers import SWA
>>> # automatic mode
>>> base_opt = torch.optim.SGD(model.parameters(), lr=0.1)
>>> opt = SWA(base_opt, swa_start=10, swa_freq=5, swa_lr=0.05)
>>> for _ in range(100):
>>>     opt.zero_grad()
>>>     loss_fn(model(input_), target).backward()
>>>     opt.step()
>>> opt.swap_swa_sgd()
>>> # manual mode
>>> opt = SWA(base_opt)
>>> for i in range(100):
>>>     opt.zero_grad()
>>>     loss_fn(model(input_), target).backward()
>>>     opt.step()
>>>     if i > 10 and i % 5 == 0:
>>>         opt.update_swa()
>>> opt.swap_swa_sgd()
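What the automatic mode accumulates can be sketched without PyTorch. The example below is a simplified illustration of a running average that is updated every swa_freq steps once swa_start steps have passed; swa_running_average is a hypothetical function, and the exact boundary condition (step > swa_start) is an assumption about the schedule.

```python
def swa_running_average(weights_per_step, swa_start, swa_freq):
    """Return the SWA average of a scalar weight and the number of samples."""
    swa_buffer = 0.0  # running average of the weight
    n_avg = 0         # number of weights averaged so far
    for step, w in enumerate(weights_per_step, start=1):
        if step > swa_start and step % swa_freq == 0:
            # Incremental mean: avg += (w - avg) / (n + 1)
            swa_buffer += (w - swa_buffer) / (n_avg + 1)
            n_avg += 1
    return swa_buffer, n_avg


# Pretend the weight after step s is simply s, for s = 1..20.
trajectory = [float(s) for s in range(1, 21)]
average, count = swa_running_average(trajectory, swa_start=10, swa_freq=5)
print(average, count)
```

Here only the weights at steps 15 and 20 enter the average; swap_swa_sgd() would then exchange the live weight with this averaged value.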

Note

SWA does not support parameter-specific values of swa_start, swa_freq or swa_lr. In automatic mode SWA uses the same swa_start, swa_freq and swa_lr for all parameter groups. If needed, use manual mode with update_swa_group() to use different update schedules for different parameter groups.

Note

Call swap_swa_sgd() at the end of training to use the computed running averages.

Note

If you are using SWA to optimize the parameters of a neural network containing Batch Normalization layers, you need to update the running_mean and running_var statistics of the Batch Normalization modules. You can do so with the SWA.bn_update utility documented on this page. For further description see this article.

add_param_group(param_group)

Add a param group to the Optimizer’s param_groups.

This can be useful when fine tuning a pre-trained network as frozen layers can be made trainable and added to the Optimizer as training progresses.

Parameters:

param_group – (dict): Specifies what Tensors should be optimized along with group specific optimization options.

static bn_update(loader, model, device=None)

Updates BatchNorm running_mean, running_var buffers in the model.

It performs one pass over data in loader to estimate the activation statistics for BatchNorm layers in the model.

Parameters:

loader – (torch.utils.data.DataLoader): dataset loader to compute the activation statistics on. Each data batch should be either a tensor, or a list/tuple whose first element is a tensor containing data.
model – (torch.nn.Module): model for which we seek to update BatchNorm statistics.
device – (torch.device, optional): If set, data will be transferred to device before being passed into model.

load_state_dict(state_dict)

Loads the optimizer state.

Parameters:

state_dict – (dict): SWA optimizer state. Should be an object returned from a call to state_dict.

state_dict()

Returns the state of SWA as a dict.

It contains three entries:

opt_state - a dict holding current optimization state of the base optimizer. Its content differs between optimizer classes.
swa_state - a dict containing current state of SWA. For each optimized variable it contains swa_buffer keeping the running average of the variable
param_groups - a dict containing all parameter groups

swap_swa_sgd()

Swaps the values of the optimized variables and SWA buffers.

It’s meant to be called at the end of training to use the collected SWA running averages. It can also be used to evaluate the running averages during training; to continue training, swap_swa_sgd should be called again.

update_swa()

Updates the SWA running averages of all optimized parameters.

update_swa_group(group)

Updates the SWA running averages for the given parameter group.

Parameters:

group – (dict): Specifies for what parameter group SWA running averages should be updated

Examples:

>>> from pywick.optimizers import SWA
>>> # manual mode (no swa_start / swa_freq given)
>>> base_opt = torch.optim.SGD([{'params': [x]},
>>>             {'params': [y], 'lr': 1e-3}], lr=1e-2, momentum=0.9)
>>> opt = SWA(base_opt)
>>> for i in range(100):
>>>     opt.zero_grad()
>>>     loss_fn(model(input_), target).backward()
>>>     opt.step()
>>>     if i > 10 and i % 5 == 0:
>>>         # Update SWA for the second parameter group
>>>         opt.update_swa_group(opt.param_groups[1])
>>> opt.swap_swa_sgd()