Introduction
PyTorch is one of the fastest-growing deep learning frameworks and is used by many top Fortune 500 companies. It packages many algorithms, methods, and classes behind interfaces you can call in a single line of code. PyTorch offers many optimizer classes, such as AdaDelta, Adam, and SGD, to name a few.
An optimizer takes the parameters we want to update and the learning rate we want to use, and it updates the weights through its step() method.
In this blog, we will look at 13 optimizers supported by the PyTorch framework.
Table of contents
- Torch.optim
- AdaDelta Class
- AdaGrad Class
- Adam Class
- AdamW Class
- SparseAdam Class
- Adamax Class
- LBFGS Class
- RMSprop Class
- Rprop Class
- SGD Class
- ASGD Class
- NAdam Class
- RAdam Class
- Conclusion
Torch.optim
torch.optim is a PyTorch package containing various optimization algorithms. The most commonly used optimization methods are already supported, and the interface is simple enough that more complex ones can easily be integrated in the future. To use torch.optim, you construct an optimizer object that holds the current state and updates the parameters based on their gradients.
import torch.optim as optim
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
optimizer = optim.Adam([var1, var2], lr=0.0001)
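torch.optim also accepts per-parameter options, so different parts of a model can train with different hyperparameters. A minimal sketch, assuming a hypothetical model with base and classifier submodules:
optimizer = optim.SGD([
    {'params': model.base.parameters()},                     # uses the default lr below
    {'params': model.classifier.parameters(), 'lr': 1e-3},   # overrides lr for this group
], lr=1e-2, momentum=0.9)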
AdaDelta Class
It implements the Adadelta algorithm, which was proposed in the paper ADADELTA: An Adaptive Learning Rate Method. With Adadelta you do not need to hand-tune an initial learning rate. You can implement it without any torch method by defining a function like this:
import torch

def Adadelta(weights, sqrs, deltas, rho, batch_size):
    # Manual Adadelta step over lists of parameters and their state tensors.
    eps_stable = 1e-5
    for weight, sqr, delta in zip(weights, sqrs, deltas):
        g = weight.grad / batch_size
        # Running average of squared gradients.
        sqr[:] = rho * sqr + (1. - rho) * g * g
        # Rescale the gradient by the ratio of the two running averages.
        cur_delta = torch.sqrt(delta + eps_stable) / torch.sqrt(sqr + eps_stable) * g
        # Running average of squared updates.
        delta[:] = rho * delta + (1. - rho) * cur_delta * cur_delta
        # Update the weight in place.
        weight[:] -= cur_delta
With the help of PyTorch, you can do the same with a single line of code, as shown below:
torch.optim.Adadelta(params, lr=1.0, rho=0.9, eps=1e-06, weight_decay=0)
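For example, a minimal usage sketch (assuming model is an nn.Module defined elsewhere):
optimizer = torch.optim.Adadelta(model.parameters(), rho=0.9, eps=1e-6)
Note that the default lr=1.0 only scales the computed update; Adadelta derives the actual step size from its running averages.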
AdaGrad Class
Adagrad (short for adaptive gradient) adapts the learning rate per parameter: it shrinks the learning rate for parameters that are updated frequently and keeps it larger for sparse parameters, i.e. those that are updated less often. You can implement it without any class like this:
import numpy as np

def Adagrad(data, weights, lr, num_iterations, epsilon=1e-10):
    # Accumulate the squared gradients of every parameter.
    gradient_sums = np.zeros(weights.shape[0])
    for t in range(num_iterations):
        gradients = compute_gradients(data, weights)  # user-supplied gradient function
        gradient_sums += gradients ** 2
        # Scale each gradient by the inverse root of its accumulated history.
        gradient_update = gradients / np.sqrt(gradient_sums + epsilon)
        weights = weights - lr * gradient_update
    return weights
In many problems, the most critical information is carried by features that occur only rarely, so Adagrad can be useful when your use case involves sparse data. You can call the optimizer with the following torch command:
torch.optim.Adagrad(params, lr=0.01, lr_decay=0, weight_decay=0, initial_accumulator_value=0, eps=1e-10)
There are some drawbacks, though: the per-parameter state is computationally expensive to maintain, and because the accumulated squared gradients only grow, the learning rate keeps shrinking, which can make training slow.
Adam Class
Adam (Adaptive Moment Estimation) is one of the most popular optimizers. It combines the good properties of Adadelta and RMSprop, adding momentum-style moving averages of the gradient to adaptive per-parameter learning rates, and hence tends to do well on most problems. You can simply call this class using the command below:
torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)
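A minimal training-step sketch (assuming model, loss_fn, input, and target are defined as in the SGD example later in this post):
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
optimizer.zero_grad()                      # clear gradients from the previous step
loss_fn(model(input), target).backward()   # compute gradients
optimizer.step()                           # apply the Adam update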
AdamW Class
The authors later suggested an improved version of Adam called AdamW, in which weight decay is performed only after controlling the parameter-wise step size. The weight decay (regularization) term no longer ends up in the moving averages; it is applied in proportion to the weight itself. The authors show empirically that AdamW yields better training loss and that the resulting models generalize better than models trained with Adam, allowing AdamW to compete with SGD with momentum.
In PyTorch, you can simply call this algorithm using the command below:
torch.optim.AdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, amsgrad=False)
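To see the difference in practice, compare how weight decay is passed to the two classes; a sketch assuming model is defined elsewhere:
# Adam: weight_decay is added to the gradient (classic L2 regularization),
# so it also flows into the moving averages.
adam = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-2)
# AdamW: weight_decay is decoupled and applied directly to the weights
# after the adaptive step size has been computed.
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)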
SparseAdam Class
SparseAdam implements a lazy version of the Adam algorithm that is suitable for sparse tensors.
In this variant of the Adam optimizer, only the moments that show up in the gradient get updated, and only those portions of the gradient get applied to the parameters.
In PyTorch, you can simply call this algorithm using the command below:
torch.optim.SparseAdam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08)
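SparseAdam is typically paired with layers that emit sparse gradients, such as an embedding built with sparse=True. A minimal sketch (the dimensions here are arbitrary):
import torch

# An embedding layer with sparse=True produces sparse gradients,
# which is what SparseAdam expects.
embedding = torch.nn.Embedding(num_embeddings=10000, embedding_dim=128, sparse=True)
optimizer = torch.optim.SparseAdam(list(embedding.parameters()), lr=1e-3)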
Adamax Class
It implements the Adamax algorithm (a variant of Adam based on the infinity norm).
In PyTorch, you can simply call this algorithm using the command below:
torch.optim.Adamax(params, lr=0.002, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)
LBFGS Class
This class implements the L-BFGS algorithm, heavily inspired by minFunc (minFunc – unconstrained differentiable multivariate optimization in Matlab).
In PyTorch, you can simply call it with the help of the torch method below:
torch.optim.LBFGS(params, lr=1, max_iter=20, max_eval=None, tolerance_grad=1e-07, tolerance_change=1e-09, history_size=100, line_search_fn=None)
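Unlike the other optimizers here, LBFGS needs to re-evaluate the loss several times per step, so its step() method takes a closure. A minimal sketch (assuming model, loss_fn, input, and target are defined as in the SGD example later in this post):
optimizer = torch.optim.LBFGS(model.parameters(), lr=1.0, max_iter=20)

def closure():
    # LBFGS calls this repeatedly to re-evaluate the loss and gradients.
    optimizer.zero_grad()
    loss = loss_fn(model(input), target)
    loss.backward()
    return loss

optimizer.step(closure)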
RMSprop Class
This class implements the RMSprop algorithm, which was proposed by G. Hinton.
The centered version first appears in Generating Sequences with Recurrent Neural Networks. The implementation here takes the square root of the gradient average before adding epsilon (note that TensorFlow interchanges these two operations). The effective learning rate is thus γ / (√v + ε), where γ is the scheduled learning rate and v is the weighted moving average of the squared gradient.
To use this optimizer in PyTorch, you can use the command below:
torch.optim.RMSprop(params, lr=0.01, alpha=0.99, eps=1e-08, weight_decay=0, momentum=0, centered=False)
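For the centered variant mentioned above, which normalizes the gradient by an estimate of its variance, you simply set centered=True; a sketch assuming model is defined elsewhere:
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.01, alpha=0.99, momentum=0.9, centered=True)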
Rprop Class
This class implements the resilient backpropagation (Rprop) algorithm.
In PyTorch, you can call the optimizer with the command below:
torch.optim.Rprop(params, lr=0.01, etas=(0.5, 1.2), step_sizes=(1e-06, 50))
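Because Rprop adapts its per-parameter step sizes from the sign of the gradient rather than its magnitude, it is usually run on full-batch gradients rather than noisy mini-batches. A minimal sketch assuming model is defined elsewhere:
# etas are the step-size decrease/increase factors; step_sizes bound the allowed step.
optimizer = torch.optim.Rprop(model.parameters(), lr=0.01, etas=(0.5, 1.2), step_sizes=(1e-6, 50))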
SGD Class
It implements stochastic gradient descent (optionally with momentum).
Nesterov momentum is based on the formula from On the importance of initialization and momentum in deep learning.
A simple from-scratch implementation of mini-batch SGD (with a user-supplied backprop function) looks like this:
import numpy as np

def SGD(data, batch_size, lr):
    # data is a list of (X, y) training pairs.
    N = len(data)
    np.random.shuffle(data)
    mini_batches = [data[i:i + batch_size] for i in range(0, N, batch_size)]
    for mini_batch in mini_batches:
        for X, y in mini_batch:
            # backprop computes the gradients and applies the update for one example.
            backprop(X, y, lr)
In PyTorch, you can call the optimizer with the command below:
torch.optim.SGD(params, lr=<required parameter>, momentum=0, dampening=0, weight_decay=0, nesterov=False)
To use the optimizer when training a deep neural network, you can follow the example below:
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
optimizer.zero_grad()
loss_fn(model(input), target).backward()
optimizer.step()
ASGD Class
It implements the Averaged Stochastic Gradient Descent (ASGD) algorithm.
It was proposed in the paper Acceleration of stochastic approximation by averaging.
In PyTorch, you can call the optimizer with the command below:
torch.optim.ASGD(params, lr=0.01, lambd=0.0001, alpha=0.75, t0=1000000.0, weight_decay=0)
NAdam Class
This class implements the NAdam optimization algorithm.
For further details regarding the algorithm, you can refer to the paper Incorporating Nesterov Momentum into Adam.
In PyTorch, you can call the optimizer with the command below:
torch.optim.NAdam(params, lr=0.002, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, momentum_decay=0.004, foreach=None)
RAdam Class
It implements the RAdam optimization algorithm.
For more details regarding the algorithm, you can refer to the paper On the variance of the adaptive learning rate and beyond.
In PyTorch, you can call the optimizer with the command below:
torch.optim.RAdam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, foreach=None)
Conclusion
In this blog, we looked at 13 optimizers supported by the PyTorch framework and how to use them. PyTorch is a powerful tool for building deep learning models, whether you are working on a research problem or a business problem. For in-depth details on each optimizer's parameters, you can refer to the official PyTorch documentation.