**Introduction**

**PyTorch** is one of the fastest-growing deep learning frameworks and is used by many large companies. It wraps many algorithms, methods, and classes so that each one can be used in a single line of code. PyTorch has many optimizer classes, such as Adadelta, Adam, and SGD, to name a few.

An optimizer takes the parameters we want to update and the learning rate we want to use, and it updates the weights through its step() method.

In this blog, we are going to see 13 such optimizers which are supported by the PyTorch framework.

**Table of contents**

- torch.optim
- AdaDelta Class
- AdaGrad Class
- Adam Class
- AdamW Class
- SparseAdam Class
- Adamax Class
- LBFGS Class
- RMSprop Class
- Rprop Class
- SGD Class
- ASGD Class
- NAdam Class
- RAdam Class
- Conclusion

**torch.optim**

**torch.optim** is a PyTorch package containing various optimization algorithms. The most commonly used optimization methods are already supported, and the interface is simple enough that more complex ones can easily be integrated in the future. To use **torch.optim**, you have to construct an optimizer object that holds the current state and updates the parameters based on the computed gradients.

**import torch.optim as optim**

**optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)**

**optimizer = optim.Adam([var1, var2], lr=0.0001)**
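
As a minimal sketch of how these two lines fit into a script, the snippet below builds a small stand-in model and constructs both optimizers. The `nn.Linear` model and the tensors `var1` and `var2` are placeholders assumed for illustration; any module or tensors with `requires_grad=True` would do.

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Placeholder model and tensors, assumed for illustration only.
model = nn.Linear(4, 2)
var1 = torch.zeros(3, requires_grad=True)
var2 = torch.ones(3, requires_grad=True)

# Pass an iterable of parameters plus optimizer-specific options.
sgd = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
adam = optim.Adam([var1, var2], lr=0.0001)

# The hyperparameters are stored per parameter group.
print(sgd.param_groups[0]["lr"], sgd.param_groups[0]["momentum"])
print(adam.param_groups[0]["lr"])
```

Reading `param_groups` is a handy way to check which hyperparameters an optimizer is actually using.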

**AdaDelta Class**

It implements the Adadelta algorithm, which was proposed in the paper **ADADELTA: An Adaptive Learning Rate Method**. Adadelta does not require you to choose an initial learning rate constant. You can implement it without any torch method by defining a function like this:

    # Assumption: `nd` is MXNet's NDArray module, which this snippet relies on.
    from mxnet import nd

    def Adadelta(weights, sqrs, deltas, rho, batch_size):
        # sqrs and deltas hold the running averages of squared gradients
        # and squared updates, one array per weight.
        eps_stable = 1e-5
        for weight, sqr, delta in zip(weights, sqrs, deltas):
            g = weight.grad / batch_size
            sqr[:] = rho * sqr + (1. - rho) * nd.square(g)
            cur_delta = nd.sqrt(delta + eps_stable) / nd.sqrt(sqr + eps_stable) * g
            delta[:] = rho * delta + (1. - rho) * cur_delta * cur_delta
            # update weight in place.
            weight[:] -= cur_delta

With the help of PyTorch, you can do the same with just a single line of code, as shown below:

**torch.optim.Adadelta( params, lr=1.0, rho=0.9, eps=1e-06, weight_decay=0)**

**AdaGrad Class**

**Adagrad** (short for adaptive gradient) lowers the learning rate for parameters that are updated frequently and gives a larger learning rate to sparse parameters, that is, parameters that are updated less often. You can implement it without any class like this:

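    import numpy as np

    def Adagrad(data, weights, compute_gradients, lr=0.01,
                num_iterations=100, epsilon=1e-8):
        # compute_gradients(data, weights) is supplied by the caller and
        # returns the gradient of the loss with respect to the weights.
        gradient_sums = np.zeros_like(weights, dtype=float)
        for t in range(num_iterations):
            gradients = compute_gradients(data, weights)
            # Accumulate squared gradients per parameter.
            gradient_sums += gradients ** 2
            gradient_update = gradients / (np.sqrt(gradient_sums + epsilon))
            weights = weights - lr * gradient_update
        return weights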

In many problems, the most critical information lies in data that occurs infrequently. So, if your use case involves sparse data, Adagrad can be useful. With torch, you can call this optimizer using the command below:

**torch.optim.Adagrad( params, lr=0.01, lr_decay=0, weight_decay=0, initial_accumulator_value=0, eps=1e-10)**

Adagrad has some drawbacks too: it can be computationally expensive, and because the accumulated sum of squared gradients keeps growing, the effective learning rate keeps decreasing, which slows down training.
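
The shrinking learning rate can be seen in a small sketch. It assumes, purely to isolate the effect, that the gradient is 1.0 at every step, and it adds epsilon outside the square root; the variable names are illustrative only.

```python
import math

lr = 0.01
eps = 1e-10
gradient_sum = 0.0
step_sizes = []

# Feed the same gradient at every step (an assumption for illustration).
for t in range(1, 101):
    g = 1.0
    gradient_sum += g ** 2
    # Effective step taken by Adagrad at iteration t.
    step_sizes.append(lr * g / (math.sqrt(gradient_sum) + eps))

# Steps at iterations 1, 4 and 100 shrink roughly as lr / sqrt(t).
print(step_sizes[0], step_sizes[3], step_sizes[99])
```

Even though the gradient never changes, each step is smaller than the one before it.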

**Adam Class**

**Adam**, short for Adaptive Moment Estimation, is one of the most popular optimizers. It combines the good properties of the **Adagrad** and **RMSprop** optimizers into one and hence tends to do well on most problems. You can simply call this class using the command below:

**torch.optim.Adam( params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)**
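
To make the parameters in this signature more concrete, here is a simplified sketch of the Adam update rule for a single scalar weight. It is not PyTorch's implementation; the function `adam_step` and the quadratic objective are assumptions made for illustration, and weight decay and amsgrad are left out.

```python
import math

def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-08):
    """One Adam update for a scalar weight w with gradient g at step t (1-based)."""
    m = beta1 * m + (1 - beta1) * g        # moving average of the gradient
    v = beta2 * v + (1 - beta2) * g * g    # moving average of the squared gradient
    m_hat = m / (1 - beta1 ** t)           # bias correction for the first moment
    v_hat = v / (1 - beta2 ** t)           # bias correction for the second moment
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 5001):
    w, m, v = adam_step(w, 2 * (w - 3), m, v, t, lr=0.01)
print(w)
```

The first step has a size of roughly `lr` regardless of how large the gradient is, which is one reason the default learning rate works across many problems.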

**AdamW Class**

The authors of **AdamW** proposed an improved version of Adam in which weight decay is performed only after controlling the parameter-wise step size. The weight decay or regularization term does not end up in the moving averages, so it is proportional only to the weight itself. The authors show experimentally that AdamW yields better training loss and that the resulting models generalize much better than models trained with Adam, allowing it to compete with SGD with momentum.

In PyTorch, you can simply call this algorithm using the command below:

**torch.optim.AdamW( params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, amsgrad=False)**
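
The difference between the two ways of applying weight decay can be sketched for the very first optimizer step on one scalar weight. This is a simplified illustration, not PyTorch's code; the helper `first_step` and the choice of a zero loss gradient are assumptions made to isolate the decay term.

```python
import math

def first_step(w, grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-08,
               weight_decay=0.01, decoupled=False):
    """First optimizer step (t = 1, zero initial moments) for a scalar weight."""
    if decoupled:
        # AdamW style: shrink the weight directly, outside the moment estimates.
        w = w * (1 - lr * weight_decay)
    else:
        # Adam with L2 regularization: the decay term is folded into the gradient.
        grad = grad + weight_decay * w
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1)
    v_hat = v / (1 - beta2)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)

w0 = 10.0
# With a zero loss gradient, only the weight decay term moves the weight.
adam_l2 = first_step(w0, grad=0.0, decoupled=False)
adamw = first_step(w0, grad=0.0, decoupled=True)
print(w0 - adam_l2, w0 - adamw)
```

In the L2 variant the decay passes through Adam's normalization, so the weight moves by about `lr` no matter how large it is. In the decoupled variant it moves by `lr * weight_decay * w0`, which is proportional to the weight itself.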

**SparseAdam Class**

SparseAdam implements a lazy version of the Adam algorithm that is suitable for sparse tensors.

In this variant of the Adam optimizer, only the moments that show up in the gradient get updated, and only those portions of the gradient get applied to the parameters.

In PyTorch, you can simply call this algorithm using the command below:

**torch.optim.SparseAdam( params, lr=0.001, betas=(0.9, 0.999), eps=1e-08)**
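
A typical source of sparse gradients is an embedding layer created with `sparse=True`. The sketch below, with an embedding size and indices chosen only for illustration, checks which rows of the embedding change after one step.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# sparse=True makes the embedding layer produce sparse gradients.
embedding = nn.Embedding(num_embeddings=10, embedding_dim=4, sparse=True)
optimizer = torch.optim.SparseAdam(list(embedding.parameters()), lr=0.001)

before = embedding.weight.detach().clone()

# Only rows 1 and 3 are looked up, so only they receive gradients.
indices = torch.tensor([1, 3])
loss = embedding(indices).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()

after = embedding.weight.detach()
# Collect the indices of the rows whose values changed.
changed_rows = (before != after).any(dim=1).nonzero().flatten().tolist()
print(changed_rows)
```

Rows that were never looked up keep their old values, which is the lazy behavior described above.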

**Adamax Class**

It implements the Adamax algorithm (a variant of Adam based on the infinity norm).

In PyTorch, you can simply call this algorithm using the command below:

**torch.optim.Adamax( params, lr=0.002, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)**

**LBFGS Class**

This class implements the L-BFGS algorithm and is heavily inspired by minFunc (minFunc – unconstrained differentiable multivariate optimization in Matlab).

In PyTorch, you can simply call it with the help of the torch method:

**torch.optim.LBFGS( params, lr=1, max_iter=20, max_eval=None, tolerance_grad=1e-07, tolerance_change=1e-09, history_size=100, line_search_fn=None)**
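
Unlike most other optimizers, LBFGS needs to re-evaluate the loss several times per step, so `step()` must be given a closure that recomputes it. A minimal sketch follows, with a one-dimensional quadratic objective assumed purely for illustration:

```python
import torch

# A single parameter to optimize; the objective is an illustrative assumption.
x = torch.tensor([0.0], requires_grad=True)
optimizer = torch.optim.LBFGS([x], lr=1, max_iter=20)

def closure():
    # Clear old gradients, recompute the loss, and backpropagate.
    optimizer.zero_grad()
    loss = ((x - 3.0) ** 2).sum()
    loss.backward()
    return loss

for _ in range(5):
    optimizer.step(closure)

print(x.item())
```

The closure pattern also works with the other optimizers, but for LBFGS it is required.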

**RMSprop Class**

This class implements the RMSprop algorithm, which was proposed by G. Hinton.

The centered version first appears in Generating Sequences with Recurrent Neural Networks. The implementation here takes the square root of the gradient average before adding epsilon (note that TensorFlow interchanges these two operations). The effective learning rate is thus $\gamma/(\sqrt{v} + \epsilon)$, where $\gamma$ is the scheduled learning rate and $v$ is the weighted moving average of the squared gradient.

To use this optimizer in PyTorch, you can use the command below:

**torch.optim.RMSprop( params, lr=0.01, alpha=0.99, eps=1e-08, weight_decay=0, momentum=0, centered=False)**
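
The order of operations described above can be sketched for a single scalar weight. This is a simplified illustration without momentum or centering, and `rmsprop_step` is a hypothetical helper rather than a PyTorch function.

```python
import math

def rmsprop_step(w, g, v, lr=0.01, alpha=0.99, eps=1e-08):
    """One RMSprop update (no momentum, not centered) for a scalar weight."""
    v = alpha * v + (1 - alpha) * g * g      # weighted moving average of g^2
    w = w - lr * g / (math.sqrt(v) + eps)    # square root first, then add eps
    return w, v

w, v = 1.0, 0.0
w, v = rmsprop_step(w, g=0.5, v=v)
print(w, v)
```

Here `alpha` plays the role of the smoothing constant in the signature above, and `lr` corresponds to the learning rate in the formula.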

**Rprop Class**

This class implements the resilient backpropagation algorithm.

In PyTorch, you can use the command below to call the optimizer:

**torch.optim.Rprop( params, lr=0.01, etas=(0.5, 1.2), step_sizes=(1e-06, 50))**
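
To give a feel for what `etas` and `step_sizes` control, here is a simplified scalar sketch of the general Rprop idea, in which only the sign of the gradient is used and the step size grows or shrinks depending on whether that sign changed. It is not PyTorch's implementation; `rprop_step` and the quadratic objective are assumptions for illustration.

```python
def rprop_step(w, g, prev_g, step, etas=(0.5, 1.2), step_sizes=(1e-06, 50)):
    """One Rprop update for a scalar weight; returns (w, gradient to remember, step)."""
    eta_minus, eta_plus = etas
    step_min, step_max = step_sizes
    if g * prev_g > 0:
        # Same sign as last time: grow the step, capped at step_max.
        step = min(step * eta_plus, step_max)
    elif g * prev_g < 0:
        # Sign flipped: shrink the step and skip this update.
        step = max(step * eta_minus, step_min)
        g = 0.0
    sign = (g > 0) - (g < 0)
    return w - sign * step, g, step

# Minimize f(w) = (w - 3)^2 using only the sign of the gradient.
w, prev_g, step = 0.0, 0.0, 0.01
for _ in range(200):
    w, prev_g, step = rprop_step(w, 2 * (w - 3), prev_g, step)
print(w)
```

The initial step of 0.01 mirrors the role of `lr` in the signature above.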

**SGD Class**

This class implements stochastic gradient descent (optionally with momentum).

Nesterov momentum is based on the formula from the paper On the importance of initialization and momentum in deep learning.

A plain implementation of the algorithm looks like this:

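    import numpy as np

    def SGD(data, batch_size, lr, backprop):
        # data is a list of (X, y) samples; backprop(X, y, lr) is supplied by
        # the caller and applies one gradient update for a mini-batch.
        N = len(data)
        np.random.shuffle(data)
        # Split the shuffled samples into mini-batches.
        mini_batches = [data[i:i + batch_size]
                        for i in range(0, N, batch_size)]
        for batch in mini_batches:
            X, y = zip(*batch)
            backprop(X, y, lr)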

In PyTorch, you can use this algorithm by calling the command below:

**torch.optim.SGD( params, lr=<required parameter>, momentum=0, dampening=0, weight_decay=0, nesterov=False)**

To use the optimizer when building a deep neural network model, follow the example below:

**optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)**

**optimizer.zero_grad()**

**loss_fn(model(input), target).backward()**

**optimizer.step()**
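
These four lines can be placed inside a training loop. The sketch below is one possible arrangement; the synthetic regression data, the `nn.Linear` model, and the number of epochs are assumptions made only so that the example runs on its own.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic regression data, assumed for illustration only.
inputs = torch.randn(64, 3)
targets = inputs @ torch.tensor([[1.0], [-2.0], [0.5]]) + 0.3

model = nn.Linear(3, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

losses = []
for epoch in range(200):
    optimizer.zero_grad()                    # clear gradients from the previous step
    loss = loss_fn(model(inputs), targets)   # forward pass
    loss.backward()                          # compute gradients
    optimizer.step()                         # update the weights
    losses.append(loss.item())

print(losses[0], losses[-1])
```

Calling `zero_grad()` on every iteration matters because PyTorch accumulates gradients across `backward()` calls by default.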

**ASGD Class**

It implements the Averaged Stochastic Gradient Descent (ASGD) algorithm.

It was proposed in the paper Acceleration of stochastic approximation by averaging.

In PyTorch, you can use this algorithm by calling the command below:

**torch.optim.ASGD( params, lr=0.01, lambd=0.0001, alpha=0.75, t0=1000000.0, weight_decay=0)**

**NAdam Class**

This class implements the NAdam optimization algorithm.

For further details regarding the algorithm, you can refer to the paper Incorporating Nesterov Momentum into Adam.

In PyTorch, you can use this algorithm by calling the command below:

**torch.optim.NAdam( params, lr=0.002, betas=(0.9,0.999), eps=1e-08, weight_decay=0, momentum_decay=0.004, foreach=None)**

**RAdam Class**

It implements the RAdam optimization algorithm.

For more details regarding the algorithm, you can refer to the paper On the variance of the adaptive learning rate and beyond.

In PyTorch, you can use this algorithm by calling the command below:

**torch.optim.RAdam( params, lr=0.001, betas=(0.9,0.999), eps=1e-08, weight_decay=0, foreach=None)**

**Conclusion**

In this blog, we saw 13 optimizers supported by the PyTorch framework and how to call them. PyTorch is a powerful tool for building deep learning models, whether for a research problem or a business problem. For more in-depth details on the parameters, you can refer to the official documentation.