Deep learning relies on optimization methods. Training a complicated deep learning model can take hours, days, or even weeks, and the efficiency of that training is directly influenced by the optimization algorithm's performance. Understanding the fundamentals of different optimization algorithms and the role of their hyperparameters allows us to tune those hyperparameters in a targeted manner and improve deep learning model performance.
In this blog, we'll go through some of the most popular deep learning optimization techniques in detail.
Table of Contents:
- The goal of Optimization in Deep learning
- Gradient Descent Deep Learning Optimizer
- Stochastic Gradient Descent Deep Learning Optimizer
- Mini-batch Stochastic Gradient Descent
- Adagrad (Adaptive Gradient) Optimizer
- RMSprop (Root Mean Square Propagation) Optimizer
- Adam Deep Learning Optimizer
- AdaDelta Deep Learning Optimizer
The goal of Optimization in Deep learning-
Although optimization helps deep learning by lowering the loss function, the aims of optimization and deep learning are fundamentally different. The former is focused on minimizing an objective, whereas the latter is concerned with finding a good model given a finite quantity of data. Training error and generalization error illustrate the difference: the optimization algorithm's objective function is usually a loss function based on the training dataset, so the purpose of optimization is to minimize training error. Deep learning (or, to put it another way, statistical inference) aims to decrease generalization error. To achieve the latter, we must guard against overfitting in addition to using the optimization procedure to lower the training error.
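The gap between training error and generalization error can be seen on a toy problem. The sketch below is only an illustration under assumed conditions: the data source `sample`, the polynomial degrees, and the noise level are all hypothetical choices, and a polynomial fit stands in for a deep learning model.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(x):
    """Hypothetical data source: a smooth curve plus random noise."""
    return np.sin(3 * x) + 0.2 * rng.normal(size=x.shape)

x_train = np.linspace(-1.0, 1.0, 20)
x_val = np.linspace(-0.95, 0.95, 20)   # held-out points, never used for fitting
y_train, y_val = sample(x_train), sample(x_val)

def mse(coeffs, x, y):
    """Mean squared error of a polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

errors = {}
for degree in (1, 3, 9):
    # polyfit minimizes the error on the training points only
    coeffs = np.polyfit(x_train, y_train, degree)
    errors[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_val, y_val))
    print(f"degree={degree}  train={errors[degree][0]:.4f}  held-out={errors[degree][1]:.4f}")
```

The training error can only go down as the degree grows, because each higher-degree fit contains the lower-degree ones. The held-out error is the number to watch when judging whether the model is overfitting.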
Gradient Descent Deep Learning Optimizer-
Gradient descent is the most basic and most widely used optimizer in this family. It uses calculus to make consistent changes to the parameters and move toward a local minimum. Before you go any further, you might be wondering what a gradient is.
Imagine holding a ball on the rim of a bowl. When you release the ball, it rolls down along the steepest direction until it reaches the bowl's bottom. The gradient points in the direction of steepest increase, so moving against it (along the negative gradient) takes the ball down the steepest path toward the local minimum, which is the bowl's bottom.
Gradient descent starts with a set of coefficients, calculates their cost, and looks for coefficients whose cost is lower than the current one. It moves to the lower-cost values, updates the coefficients, and repeats the procedure until it reaches a local minimum. A local minimum is a point where the cost is lower than at every nearby point, so no small step reduces it further.
Gradient descent is a sound default for many problems. It does, however, have significant drawbacks. Calculating the gradients is time-consuming when the data is large, because every update uses the entire dataset. For convex functions, gradient descent works well, but for nonconvex functions it can get stuck in a local minimum or stall at a saddle point, and the result is sensitive to how far it travels along the gradient at each step (the learning rate).
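The update rule described above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the function names, the learning rate of 0.1, and the bowl-shaped test function with its minimum at (3, -1) are all assumptions chosen for the example.

```python
import numpy as np

def gradient_descent(grad, x0, learning_rate=0.1, n_steps=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        # The negative gradient is the direction of steepest descent.
        x = x - learning_rate * grad(x)
    return x

# Hypothetical "bowl": f(x) = (x[0] - 3)^2 + (x[1] + 1)^2, lowest at (3, -1).
def bowl_grad(x):
    return 2.0 * (x - np.array([3.0, -1.0]))

x_min = gradient_descent(bowl_grad, x0=[0.0, 0.0])
print(x_min)  # close to [3, -1]
```

On this convex bowl any reasonably small learning rate reaches the bottom. A learning rate that is too large would make the iterates overshoot and diverge, which is one reason the step size has to be chosen with care.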
Stochastic Gradient Descent Deep Learning Optimizer-
On large datasets, gradient descent may not be the best solution. We use stochastic gradient descent (SGD) to address the problem. The word stochastic refers to the randomness built into the algorithm. Instead of using the entire dataset for each iteration, stochastic gradient descent uses a randomly selected sample of the data, in its strictest form a single training example. As a result, each update looks at only a small portion of the dataset. The first step in this technique is to choose the starting parameters and learning rate. Then, in each epoch, shuffle the data at random and update the parameters one sample at a time to approach an approximate minimum. Compared to the gradient descent approach, the path taken by the algorithm is noisy, since each iteration uses only a small part of the dataset rather than all of it.
As a result, SGD requires more iterations to reach the local minimum, and the overall computing time grows with the number of iterations. However, because each iteration is so much cheaper, the total computation cost usually remains lower than that of the gradient descent optimizer. Therefore, if the data is large and processing time is a consideration, stochastic gradient descent should be favored over batch gradient descent.
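The procedure above (pick starting parameters and a learning rate, shuffle, update on one sample at a time) might look like the following sketch. Linear regression with a squared-error loss is used purely as a stand-in model, and the function name, synthetic data, and hyperparameters are illustrative assumptions.

```python
import numpy as np

def sgd_linear_regression(X, y, learning_rate=0.01, n_epochs=50, seed=0):
    """Fit weights w so that X @ w approximates y, one sample per update."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])                    # starting parameters
    for _ in range(n_epochs):
        for i in rng.permutation(len(X)):       # visit samples in random order
            error = X[i] @ w - y[i]
            # Gradient of the single-sample loss (x_i . w - y_i)^2
            w -= learning_rate * 2.0 * error * X[i]
    return w

# Synthetic data generated from known weights, for illustration only.
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(200, 2))
true_w = np.array([2.0, -3.0])
y = X @ true_w + 0.01 * data_rng.normal(size=200)

w_sgd = sgd_linear_regression(X, y)
print(w_sgd)  # close to true_w, up to sampling noise
```

Each update touches a single row of `X`, so the cost per step does not depend on the dataset size. The price is a noisier path toward the minimum.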
Mini-batch Stochastic Gradient Descent-
Mini-batch SGD sits between the two preceding approaches and combines the strengths of both. It draws a small random subset of training samples from the entire dataset (the so-called mini-batch) and computes gradients just from these. By sampling only a fraction of the data, it aims to approximate batch gradient descent.
Each update needs far less computation than batch gradient descent because it uses a chunk of data rather than the entire dataset, and it needs fewer iterations than single-sample SGD because averaging over a batch gives a less noisy gradient. As a result, mini-batch gradient descent is often more efficient and more reliable than the other two variants. Because the method employs batching, the training data does not all need to be loaded into memory at once, making the process more efficient. In addition, the cost function in mini-batch gradient descent is noisier than that in batch gradient descent but smoother than that in stochastic gradient descent. Mini-batch gradient descent therefore delivers a good balance of speed and precision.
Mini-batch SGD is the most commonly used version in practice since it is computationally inexpensive and converges more stably than single-sample SGD.
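A mini-batch version differs from single-sample SGD only in how many rows feed each gradient estimate. The sketch below again assumes a linear regression model with a mean-squared-error loss; the batch size of 32, the learning rate, and the synthetic data are hypothetical choices made for the example.

```python
import numpy as np

def minibatch_sgd(X, y, batch_size=32, learning_rate=0.05, n_epochs=100, seed=0):
    """Fit weights w so that X @ w approximates y, one mini-batch per update."""
    rng = np.random.default_rng(seed)
    n_samples = len(X)
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        order = rng.permutation(n_samples)              # reshuffle every epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]       # indices of this mini-batch
            residual = X[idx] @ w - y[idx]
            # Mean-squared-error gradient averaged over the mini-batch only
            grad = 2.0 * X[idx].T @ residual / len(idx)
            w -= learning_rate * grad
    return w

# Synthetic data generated from known weights, for illustration only.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(256, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * data_rng.normal(size=256)

w_mb = minibatch_sgd(X, y)
print(w_mb)  # close to true_w
```

Setting `batch_size=1` recovers single-sample SGD, and setting it to the dataset size recovers batch gradient descent, which is why the mini-batch form is a convenient general template.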
Adagrad (Adaptive Gradient) Optimizer-
Adagrad keeps a running total of the squares of the gradient in each dimension, and in each update it divides the learning rate by the square root of that total. As a result, each parameter has its own adaptive learning rate. Because the gradients are squared before the root is taken, only their magnitude matters, not their sign. The effective learning rate is smaller for parameters whose gradients have been large, and larger for parameters whose gradients have been small. One of Adagrad's major flaws follows from the monotonic growth of the running squared sum: the effective learning rate keeps shrinking over time and can eventually become too small for training to make progress.
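The per-parameter scaling can be made concrete with a small sketch. The function names, the base learning rate of 0.5, the stabilizing constant `eps`, and the test function (steep in one direction, flat in the other) are all assumptions picked for illustration.

```python
import numpy as np

def adagrad(grad, x0, learning_rate=0.5, n_steps=500, eps=1e-8):
    """Adagrad: scale each coordinate's step by its accumulated squared gradients."""
    x = np.asarray(x0, dtype=float)
    sq_sum = np.zeros_like(x)       # running total of squared gradients, per dimension
    rates = []                      # effective learning rate at every step
    for _ in range(n_steps):
        g = grad(x)
        sq_sum += g ** 2            # only grows, never shrinks
        effective_rate = learning_rate / (np.sqrt(sq_sum) + eps)
        x = x - effective_rate * g
        rates.append(effective_rate)
    return x, np.array(rates)

# Hypothetical test function f(x) = 0.5 * (100 * x[0]^2 + x[1]^2):
# steep along x[0], flat along x[1].
def steep_and_flat_grad(x):
    return np.array([100.0, 1.0]) * x

x_final, rates = adagrad(steep_and_flat_grad, x0=[1.0, 1.0])
print(x_final)            # near the minimum at the origin
print(rates[0], rates[-1])
```

The steep coordinate receives a much smaller effective rate than the flat one, and both rates never increase from one step to the next, which is the shrinking-learning-rate behavior described above.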
RMSprop (Root Mean Square Propagation) Optimizer-
Among deep learning aficionados, RMSprop is a popular optimizer. It is notable for never having been formally published (Geoffrey Hinton introduced it in a lecture for his Coursera course) while nonetheless being well known in the community. RMSprop is a natural extension of Rprop. Rprop addresses the problem of gradients that vary widely in magnitude: some may be tiny while others are rather large, so a single global learning rate may not be the ideal option. Rprop adjusts the step size for each weight based only on the sign of the gradient. In this technique, the signs of the current and previous gradients for a weight are compared first; if they agree, the step size is increased, and if they differ, it is decreased. RMSprop carries this idea over to mini-batch training by dividing the learning rate for each weight by the root of a moving average of its recent squared gradients.
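A common formulation of the RMSprop update is sketched below. Since the method was never formally published, treat this as one widely used variant rather than a canonical definition; the decay of 0.9, the learning rate, `eps`, and the test function are assumptions made for the example.

```python
import numpy as np

def rmsprop(grad, x0, learning_rate=0.01, decay=0.9, n_steps=1000, eps=1e-8):
    """RMSprop: divide each step by the root of a moving average of squared gradients."""
    x = np.asarray(x0, dtype=float)
    avg_sq = np.zeros_like(x)       # exponentially decaying average of g^2
    for _ in range(n_steps):
        g = grad(x)
        avg_sq = decay * avg_sq + (1.0 - decay) * g ** 2
        x = x - learning_rate * g / (np.sqrt(avg_sq) + eps)
    return x

# Hypothetical test function f(x) = 0.5 * (100 * x[0]^2 + x[1]^2),
# whose gradients differ in magnitude by a factor of 100 between coordinates.
def steep_and_flat_grad(x):
    return np.array([100.0, 1.0]) * x

x_rms = rmsprop(steep_and_flat_grad, x0=[1.0, 1.0])
print(x_rms)  # both coordinates end near 0 despite the different gradient scales
```

Because the average forgets old gradients, the effective learning rate can recover after a stretch of large gradients instead of shrinking forever as it does in Adagrad.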
Adam Deep Learning Optimizer-
Adam is a further development of stochastic gradient descent for updating network weights during training. Unlike SGD, the Adam optimizer adapts the learning rate for each network weight individually, rather than keeping a single learning rate for the entire training. Adam inherits characteristics of both the Adagrad and RMSprop algorithms. Where RMSprop adapts learning rates using only the second moment of the gradients, Adam also uses the first moment (the mean). The second moment here means the uncentered variance (we don't subtract the mean).
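The way Adam combines the two moments can be sketched as follows. The hyperparameter values (`beta1=0.9`, `beta2=0.999`, `eps=1e-8`) are the commonly quoted defaults; the learning rate, step count, function names, and test function are assumptions for this illustration.

```python
import numpy as np

def adam(grad, x0, learning_rate=0.01, beta1=0.9, beta2=0.999,
         n_steps=2000, eps=1e-8):
    """Adam: moving averages of the gradient (m) and of its square (v)."""
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)            # first moment: mean of the gradients
    v = np.zeros_like(x)            # second moment: uncentered variance
    for t in range(1, n_steps + 1):
        g = grad(x)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g ** 2
        # Both averages start at zero, so correct their bias toward zero.
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        x = x - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return x

# Hypothetical test function f(x) = 0.5 * (100 * x[0]^2 + x[1]^2).
def steep_and_flat_grad(x):
    return np.array([100.0, 1.0]) * x

x_adam = adam(steep_and_flat_grad, x0=[1.0, 1.0])
print(x_adam)  # near the minimum at the origin
```

The first moment acts like momentum and smooths the direction of travel, while the second moment rescales each coordinate's step in the same spirit as RMSprop.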
AdaDelta Deep Learning Optimizer-
AdaDelta is a more robust variant of the AdaGrad optimizer. It is based on adaptive learning and is intended to address two major shortcomings of the optimizers above. The first is that the initial learning rate of AdaGrad and RMSprop must be set manually. The second is AdaGrad's decaying learning rate, which eventually becomes infinitesimally small, so that after a given number of iterations the model can no longer learn anything new. AdaDelta replaces AdaGrad's ever-growing sum with a decaying average of squared gradients, and it scales each step by a decaying average of past squared updates instead of a hand-set learning rate.
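A sketch of the AdaDelta update shows how the hand-set learning rate disappears. The decay `rho=0.95` and `eps=1e-6` are typical values rather than requirements, and the function names, step count, and simple bowl-shaped test function are assumptions chosen for illustration.

```python
import numpy as np

def adadelta(grad, x0, rho=0.95, eps=1e-6, n_steps=2000):
    """AdaDelta: step size is the ratio of two running root-mean-squares."""
    x = np.asarray(x0, dtype=float)
    avg_sq_grad = np.zeros_like(x)  # decaying average of squared gradients
    avg_sq_step = np.zeros_like(x)  # decaying average of squared updates
    for _ in range(n_steps):
        g = grad(x)
        avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * g ** 2
        # No learning rate: past step sizes set the scale of the new step.
        step = -np.sqrt(avg_sq_step + eps) / np.sqrt(avg_sq_grad + eps) * g
        avg_sq_step = rho * avg_sq_step + (1.0 - rho) * step ** 2
        x = x + step
    return x

# Hypothetical test function f(x) = 0.5 * (x[0]^2 + x[1]^2), lowest at the origin.
def bowl_grad(x):
    return x

x_ada = adadelta(bowl_grad, x0=[1.0, -0.5])
print(x_ada)  # near the minimum at the origin
```

With these settings the first steps are small, since `eps` alone sets their scale, so AdaDelta tends to start slowly and pick up speed as the average of past updates builds up.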
This has been an overview of the main optimization methods used in deep learning. We went through three different types of gradient descent and then moved on to adaptive optimizers. There is still a lot of work to be done in the field of optimization.
However, for the time being, it is critical to understand your needs and the type of data you are working with in order to select the most suitable optimization technique and obtain good results.