Optimization in deep learning- Learn with examples

June 24, 2022

The goal of Optimization in Deep learning-

Although optimization serves deep learning by lowering the loss function, the goals of the two are fundamentally different. Optimization is concerned with minimizing an objective, whereas deep learning is concerned with finding a good model given a finite amount of data. The difference shows up as training error versus generalization error: the optimization algorithm's objective function is usually a loss computed on the training dataset, so the purpose of optimization is to reduce training error. Deep learning (or, more broadly, statistical inference) aims to reduce generalization error. To achieve the latter, we must guard against overfitting in addition to using the optimization procedure to lower the training error.
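The distinction can be written down compactly. The notation below is an assumption made for illustration (a model f_theta, a loss ell, n training pairs, and an unknown data distribution P); it is a sketch of the two quantities, not a formal treatment.

```latex
% Training error: the average loss over the n training examples.
\hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(f_\theta(x_i),\, y_i\bigr)

% Generalization error: the expected loss over the (unknown) data distribution P.
R(\theta) = \mathbb{E}_{(x, y) \sim P}\bigl[\ell\bigl(f_\theta(x),\, y\bigr)\bigr]
```

The optimizer only ever sees the first quantity. Overfitting is the situation where the first keeps falling while the second stops improving.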

Gradient Descent Deep Learning Optimizer-

Gradient descent is the most widely known optimizer of the group. It uses calculus to adjust the parameters step by step until it reaches a local minimum. Before going any further, you might be wondering what a gradient is.

Imagine holding a ball on the rim of a bowl. When you release it, it rolls along the steepest path until it reaches the bottom of the bowl. The gradient plays the role of that slope: it points in the direction of steepest ascent, so stepping against it moves the parameters downhill toward the local minimum, which is the bottom of the bowl.

Gradient descent starts with a set of coefficients, computes their cost, and looks for coefficients with a lower cost than the current ones. It moves the coefficients a small step in the direction that reduces the cost and repeats. The procedure continues until it reaches a local minimum, a point from which no small step lowers the cost any further.
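The loop described here can be sketched in a few lines of Python. The cost function f(x) = (x - 3)^2, the function names, the learning rate, and the step count are all assumptions chosen for illustration.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        # Move a small step in the direction that lowers the cost.
        x = x - learning_rate * grad(x)
    return x


def cost_gradient(x):
    # Gradient of the hypothetical cost f(x) = (x - 3)**2.
    return 2.0 * (x - 3.0)


x_min = gradient_descent(cost_gradient, x0=0.0)
print(x_min)
```

For this bowl-shaped cost the iterate settles very close to x = 3, the bottom of the bowl.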

Gradient descent is a sound default in many cases. It does, however, have significant drawbacks. Computing the gradient over the whole dataset is expensive when the data is large. Gradient descent also works well for convex functions, but on nonconvex functions it has no way of knowing how far to travel along the gradient and may settle in a local minimum rather than the global one.
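A small experiment makes the nonconvex caveat concrete. The function f(x) = x^4 - 3x^2 + x below is a hypothetical example with two valleys, and the learning rate and step count are assumptions. The same update rule is run from two different starting points.

```python
def f(x):
    # Hypothetical nonconvex cost with two separate valleys.
    return x**4 - 3 * x**2 + x


def f_grad(x):
    return 4 * x**3 - 6 * x + 1


def descend(x, learning_rate=0.01, steps=2000):
    for _ in range(steps):
        x -= learning_rate * f_grad(x)
    return x


left = descend(-2.0)   # start on the left slope
right = descend(2.0)   # start on the right slope
print(left, f(left))
print(right, f(right))
```

Both runs stop at a point where the gradient is essentially zero, but the run started on the right ends in the shallower valley. Plain gradient descent has no mechanism to notice that a deeper minimum exists elsewhere.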

Stochastic Gradient Descent Deep Learning Optimizer-

On large datasets, gradient descent may not be the best choice. Stochastic gradient descent (SGD) addresses the problem. The word stochastic refers to the randomness built into the algorithm. Instead of using the entire dataset for each iteration, stochastic gradient descent uses a randomly selected sample (or small batch) of the data, so each update looks at only a small portion of the dataset. The first step is to choose the starting parameters and the learning rate. Then, in each pass, the data is shuffled at random and the parameters are updated sample by sample, moving toward an approximate minimum. Compared with gradient descent, the path taken by the algorithm is noisy, because each iteration uses only a piece of the dataset rather than all of it.

As a result, SGD needs more iterations to reach the local minimum, and more iterations mean more computation overall. However, each iteration is so much cheaper that the total computation cost typically remains lower than that of the gradient descent optimizer. Therefore, if the data is large and processing time matters, stochastic gradient descent should be preferred over batch gradient descent.
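As a rough sketch of the procedure, the code below fits a line y = w*x + b with one randomly ordered sample per update. The toy dataset (generated from y = 2x + 1 without noise), the squared-error loss, the function name, and all hyperparameters are assumptions made for illustration.

```python
import random


def sgd_linear_fit(data, learning_rate=0.05, epochs=200, seed=0):
    """Fit y = w*x + b with single-sample stochastic gradient descent."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0              # starting parameters
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)     # visit the data in a new random order each pass
        for x, y in samples:
            error = (w * x + b) - y
            # Gradient of the squared error for this one sample only.
            w -= learning_rate * 2.0 * error * x
            b -= learning_rate * 2.0 * error
    return w, b


# Hypothetical noiseless data drawn from y = 2x + 1.
toy_data = [(i / 10.0, 2.0 * (i / 10.0) + 1.0) for i in range(20)]
w, b = sgd_linear_fit(toy_data)
print(w, b)
```

Each update touches a single sample, which is what makes an iteration cheap and the path noisy.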

Mini-batch Stochastic Gradient Descent-

Mini-batch SGD sits between the two preceding approaches and combines strengths of both. It draws a random subset of training samples from the entire dataset (the so-called mini-batch) and computes gradients from these alone. By sampling only a fraction of the data, it aims to approximate batch gradient descent.

Because each update averages over a chunk of data rather than a single sample, mini-batch gradient descent needs fewer iterations than stochastic gradient descent, and each iteration is cheaper than a pass over the entire dataset. In practice it is often more efficient than batch gradient descent and more stable than stochastic gradient descent. Because the method works in batches, the full training set does not need to be loaded into memory at once, which further improves efficiency. The cost function in mini-batch gradient descent is noisier than in batch gradient descent but smoother than in stochastic gradient descent. Mini-batch gradient descent therefore offers a good balance of speed and accuracy.

Mini-batch SGD is the most commonly used variant in practice, since it is computationally inexpensive and converges more stably.
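A minimal sketch of the mini-batch variant follows, fitting a line y = w*x + b. The toy dataset, the batch size of 4, the function name, and the hyperparameters are assumptions; the only change from single-sample updates is that each gradient is averaged over one small batch.

```python
import random


def minibatch_sgd_fit(data, batch_size=4, learning_rate=0.1, epochs=500, seed=0):
    """Fit y = w*x + b, averaging the gradient over one mini-batch per update."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)
        for start in range(0, len(samples), batch_size):
            batch = samples[start:start + batch_size]
            # Average the squared-error gradient over the batch.
            grad_w = sum(2.0 * ((w * x + b) - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2.0 * ((w * x + b) - y) for x, y in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Hypothetical noiseless data drawn from y = 2x + 1.
toy_data = [(i / 10.0, 2.0 * (i / 10.0) + 1.0) for i in range(20)]
w, b = minibatch_sgd_fit(toy_data)
print(w, b)
```

Only one batch is needed at a time, so in a real pipeline the batches could be streamed from disk instead of holding the full dataset in memory.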

Adagrad (Adaptive Gradient Descent) Optimizer-

Adagrad keeps a running sum of the squared gradients in each dimension and, at every update, scales the learning rate by that sum. As a result, each parameter has its own adaptive learning rate. Because the update divides by the square root of the summed squared gradients, only the magnitude of the gradients matters for the scaling, not their sign. The effective learning rate is lower for parameters whose gradients have been large and higher for parameters whose gradients have been small. One of Adagrad's major flaws follows from the monotonic growth of the running sum: the effective learning rate keeps shrinking over time.
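The update can be sketched as follows. The toy cost f(p) = p[0]^2 + 10*p[1]^2, the function names, and the hyperparameters are assumptions for illustration; the small eps term is there only to avoid division by zero.

```python
import math


def toy_grad(params):
    # Gradient of the hypothetical cost f(p) = p[0]**2 + 10 * p[1]**2.
    return [2.0 * params[0], 20.0 * params[1]]


def adagrad(grad, params, learning_rate=1.0, steps=200, eps=1e-8):
    params = list(params)
    squared_sum = [0.0] * len(params)   # running sum of squared gradients, per dimension
    for _ in range(steps):
        g = grad(params)
        for i, g_i in enumerate(g):
            squared_sum[i] += g_i * g_i  # only ever grows
            # Per-parameter step: the base rate divided by the root of the sum.
            params[i] -= learning_rate * g_i / (math.sqrt(squared_sum[i]) + eps)
    return params, squared_sum


final_params, squared_sum = adagrad(toy_grad, [5.0, 5.0])
print(final_params)
```

The second coordinate has gradients ten times larger, so its running sum grows faster and its effective learning rate is scaled down accordingly. Since `squared_sum` never decreases, the effective rate can only shrink as training goes on.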

RMSprop (Root Mean Square Propagation) Optimizer-

RMSprop is a popular optimizer among deep learning practitioners. Part of its reputation comes from the fact that it was never formally published, yet it is well known in the community. RMSprop is a natural extension of Rprop, which tackles the problem of gradients that vary widely in magnitude: some gradients are small while others are very large, so a single global learning rate may not be ideal. Rprop adjusts the step size for each weight individually, based on the sign of the gradient. It first compares the signs of the two most recent gradients for that weight: if they agree, the step size is increased, and if they differ, it is decreased. RMSprop carries the idea over to mini-batches by dividing each gradient by the root of a moving average of recent squared gradients.
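For comparison with Rprop's sign-based rule, here is a minimal sketch of the RMSprop update itself, which divides each gradient by the root of a moving average of recent squared gradients. The toy cost, the function names, and the hyperparameter values are assumptions made for illustration.

```python
import math


def toy_grad(params):
    # Gradient of the hypothetical cost f(p) = p[0]**2 + 10 * p[1]**2.
    return [2.0 * params[0], 20.0 * params[1]]


def rmsprop(grad, params, learning_rate=0.01, decay=0.9, steps=2000, eps=1e-8):
    params = list(params)
    avg_sq = [0.0] * len(params)   # moving average of squared gradients
    for _ in range(steps):
        g = grad(params)
        for i, g_i in enumerate(g):
            avg_sq[i] = decay * avg_sq[i] + (1.0 - decay) * g_i * g_i
            # Normalizing by the root mean square makes the step size
            # roughly independent of the raw gradient magnitude.
            params[i] -= learning_rate * g_i / (math.sqrt(avg_sq[i]) + eps)
    return params


final_params = rmsprop(toy_grad, [5.0, 5.0])
print(final_params)
```

Because the average forgets old gradients, the effective learning rate does not decay toward zero. With a fixed learning rate the iterate tends to hover within a small distance of the minimum rather than landing on it exactly.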

Adam Deep Learning Optimizer-

Adam is a further development of stochastic gradient descent for updating network weights during training. Unlike SGD, which keeps a single learning rate for the entire training run, Adam adapts the learning rate for each network weight individually. Adam inherits characteristics of both Adagrad and RMSprop. Like RMSprop, it scales learning rates using a running average of the second moment of the gradients; in addition, it keeps a running average of the first moment (the mean). The second moment here means the uncentered variance (the mean is not subtracted).
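A compact sketch of the Adam update is shown below. The toy cost, the function names, the learning rate of 0.05, and the step count are assumptions for illustration; beta1 = 0.9, beta2 = 0.999, and eps = 1e-8 are the commonly quoted defaults.

```python
import math


def toy_grad(params):
    # Gradient of the hypothetical cost f(p) = p[0]**2 + 10 * p[1]**2.
    return [2.0 * params[0], 20.0 * params[1]]


def adam(grad, params, learning_rate=0.05, beta1=0.9, beta2=0.999,
         steps=2000, eps=1e-8):
    params = list(params)
    m = [0.0] * len(params)   # running mean of gradients (first moment)
    v = [0.0] * len(params)   # running mean of squared gradients (second moment)
    for t in range(1, steps + 1):
        g = grad(params)
        for i, g_i in enumerate(g):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g_i
            v[i] = beta2 * v[i] + (1.0 - beta2) * g_i * g_i
            # Bias correction: both averages start at zero and are too small early on.
            m_hat = m[i] / (1.0 - beta1 ** t)
            v_hat = v[i] / (1.0 - beta2 ** t)
            params[i] -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return params


final_params = adam(toy_grad, [5.0, 5.0])
print(final_params)
```

The first-moment average acts like momentum, while dividing by the root of the second-moment average gives each weight its own step size.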

AdaDelta Deep Learning Optimizer -

AdaDelta is a more robust extension of the AdaGrad optimizer. It is based on adaptive learning rates and is intended to address two major shortcomings of AdaGrad. The first is that the initial learning rate must be set manually, a limitation RMSprop shares. The second is the continually decaying learning rate, which eventually becomes infinitesimally small, so that after a certain number of iterations the model can no longer learn anything new.
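The sketch below shows one way to write the AdaDelta update. Note that the function takes no learning rate argument: the step size comes from the ratio of two moving averages. The toy cost, the function names, and the step count are assumptions, and rho = 0.95 with eps = 1e-6 are commonly cited values rather than requirements.

```python
import math


def toy_cost(params):
    # Hypothetical cost f(p) = p[0]**2 + 10 * p[1]**2.
    return params[0] ** 2 + 10.0 * params[1] ** 2


def toy_grad(params):
    return [2.0 * params[0], 20.0 * params[1]]


def adadelta(grad, params, rho=0.95, eps=1e-6, steps=500):
    params = list(params)
    avg_sq_grad = [0.0] * len(params)   # moving average of squared gradients
    avg_sq_step = [0.0] * len(params)   # moving average of squared updates
    for _ in range(steps):
        g = grad(params)
        for i, g_i in enumerate(g):
            avg_sq_grad[i] = rho * avg_sq_grad[i] + (1.0 - rho) * g_i * g_i
            # Step size = RMS of recent updates / RMS of recent gradients.
            step = -math.sqrt(avg_sq_step[i] + eps) / math.sqrt(avg_sq_grad[i] + eps) * g_i
            avg_sq_step[i] = rho * avg_sq_step[i] + (1.0 - rho) * step * step
            params[i] += step
    return params


start = [5.0, 5.0]
final_params = adadelta(toy_grad, start)
print(toy_cost(start), toy_cost(final_params))
```

Both averages use a decay factor instead of an ever-growing sum, so the step size does not shrink toward zero the way it does in AdaGrad. With such a small eps the first steps are tiny, so this short run only shows the cost going down, not full convergence.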

Conclusion-

This article gave an overview of the main optimization methods used in deep learning. We went through three variants of gradient descent and then moved on to adaptive optimizers. Optimization remains an active area of work.

For now, the key is to understand your requirements and the kind of data you are working with, so that you can choose the most suitable optimization technique and obtain good results.



Niharika Srivastava
