Introduction
If you've been training deep learning models, whether for research or production, you've probably come across the many techniques that can help boost a model's performance. Numerous computer vision papers cover tricks for improving object detection, image classification, and image segmentation.
Picking the right training recipe for a deep learning model is still challenging. A recent article showed how crucial the right recipe can be: a few training strategies raised ResNet50's top-1 accuracy on ImageNet-1K to 80.4%, up from the original recipe's 75.8%. Achieving such a large gain purely by tweaking the training recipe is remarkable.
Tricks aside, neural network training comes with plenty of difficulties of its own. To begin with, selecting an architecture for a deep learning task typically means searching the literature for a model built on the most recent state-of-the-art techniques. Automated algorithms can, in principle, find the ideal hyperparameters for the architecture and its training, but these methods tend to work better in theory than in practice. On top of that, a significant change in the data often forces you to revisit the model's hyperparameters and training procedure.
With new research and advances appearing all the time, it is important to know some of the tips and tricks that make training neural networks easier.
Let's look at five must-have tricks for training neural networks in this blog.
Table of contents:
- Exponential Moving Average to improve model performance
- Overfit a single batch of data
- Zero weight decay on BatchNorm and bias
- Data augmentation to avoid overfitting
- Fine-tuning the model
- Conclusion
1) Exponential Moving Average (EMA) to improve model performance
By helping the model avoid getting stuck in poor local minima, EMA improves the stability of convergence and helps reach a better overall solution. To avoid abrupt changes in the model's weights during training, we keep a separate copy of the weights alongside the ones being optimized. After each optimization step, we update this copy to be a weighted average of its previous value and the model's newly updated weights.
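One common way to write this update uses a decay factor $\beta$ close to 1 (the symbol and the typical value of 0.999 below are illustrative choices rather than part of a specific recipe):

$$ w_{\text{EMA}} \leftarrow \beta \, w_{\text{EMA}} + (1 - \beta) \, w, \qquad \beta \approx 0.999 $$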
Adding EMA does not alter the weights learned during training; it simply saves an additional, averaged set of weights alongside them. Although the EMA weights are usually quite beneficial, you can easily decide at the end of training not to use them if they don't add any value.
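As a concrete illustration, here is a minimal sketch of how such an EMA copy can be maintained in PyTorch. The `EMA` class, the decay value, and the usage snippet in the comments are assumptions for illustration, not a specific library API:

```python
import copy

import torch
from torch import nn


class EMA:
    """Keeps an exponential moving average of a model's weights.

    The training weights are never modified; the averaged copy is stored
    separately and can be used (or discarded) at the end of training.
    """

    def __init__(self, model: nn.Module, decay: float = 0.999):
        self.decay = decay
        # Detached copy of the model that will hold the averaged weights.
        self.ema_model = copy.deepcopy(model).eval()
        for p in self.ema_model.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: nn.Module):
        # w_ema <- decay * w_ema + (1 - decay) * w
        for ema_p, p in zip(self.ema_model.parameters(), model.parameters()):
            ema_p.mul_(self.decay).add_(p, alpha=1.0 - self.decay)
        # Buffers (e.g. BatchNorm running stats) are usually copied directly.
        for ema_b, b in zip(self.ema_model.buffers(), model.buffers()):
            ema_b.copy_(b)


# Typical use inside a training loop (model, optimizer, loss are assumed):
# ema = EMA(model, decay=0.999)
# loss.backward(); optimizer.step(); optimizer.zero_grad()
# ema.update(model)            # after every optimizer step
# validate(ema.ema_model)      # evaluate with the averaged weights
```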
2) Overfit a single batch of data
Once you have decided on a model architecture, the next step is to probe the training loop. Take a single batch and train on it repeatedly. The first things to check are whether the initial loss makes sense and whether it decreases after weight updates. Because this is only one batch, it should be quick and easy to see whether the training loop can truly overfit it, i.e., learn.
The goal is to convince yourself that the training loop produces the desired behavior before scaling up. Training on the full dataset may still surface other problems, such as memory leaks, but at least you have some confidence that the basic training loop works.
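A minimal PyTorch sketch of this sanity check, assuming a `model`, `loss_fn`, and `train_loader` already exist for your task:

```python
import torch

# Assumes `model`, `loss_fn`, and `train_loader` are already defined.
xb, yb = next(iter(train_loader))          # grab a single batch and reuse it
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

model.train()
for step in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(xb), yb)
    loss.backward()
    optimizer.step()
    if step % 20 == 0:
        print(f"step {step:3d}  loss {loss.item():.4f}")

# If the loss does not drive towards ~0 (or the batch accuracy towards 100%),
# something in the model, the loss, or the data pipeline is broken.
```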
3) Zero weight decay on BatchNorm and bias
Any "go-to" models for different computer vision workloads are probably going to have biases and ‘BatchNorm’ layers added to their linear or convolution layers. Setting the batch norm and bias weights' weight decay in the optimizer to zero while training this kind of model yields better results. A regularization parameter called weight decay stops the model weights from "exploding." Although many projects and frameworks typically zero the weight decay for these parameters by default, it's still important to check as PyTorch still doesn't do this by default.
4) Data augmentation to avoid overfitting
Data augmentation is another popular trick for preventing overfitting when only limited data is available. There are two ways to augment data. One option is to carry out all the necessary transformations in advance, effectively growing your dataset. Alternatively, you can apply these transformations to each mini-batch right before feeding it to your model.
The first option is known as offline augmentation. Since it multiplies the size of the dataset by a factor equal to the number of transformations you apply, this strategy is best suited to relatively small datasets.
The second option is known as online augmentation, or augmentation on the fly. It is suited to larger datasets, where you can't afford the multiplied storage cost; instead of growing the dataset, you transform each mini-batch right before feeding it to the model.
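Online augmentation is commonly wired into the data pipeline, as in this torchvision sketch; the dataset path and the specific transforms are placeholder choices:

```python
import torchvision.transforms as T
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder

# Each epoch sees a freshly transformed version of every image,
# so the dataset never grows on disk.
train_transform = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.ColorJitter(brightness=0.2, contrast=0.2),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# "data/train" is a hypothetical folder of class subdirectories.
train_set = ImageFolder("data/train", transform=train_transform)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=4)
```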
5) Fine-tuning the model
Fine-tuning is a form of transfer learning: you take a model that has already been trained on one specific task and refine or tweak it so that it can perform a second, similar task.
In simple terms, suppose the model's last layer previously classified whether an image contained a car. We remove that layer and add a new one in its place whose purpose is to classify whether an image contains a truck.
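A minimal sketch of this head swap in PyTorch/torchvision; the ResNet50 backbone, the two-class "truck" head, and the frozen-backbone choice are illustrative assumptions:

```python
import torch
from torch import nn
from torchvision import models

# Load a backbone pretrained on ImageNet (torchvision >= 0.13 weights API assumed).
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Freeze the pretrained layers so only the new head is trained at first.
for param in model.parameters():
    param.requires_grad = False

# Swap the old 1000-class head for a new binary head ("truck" vs. "not truck").
model.fc = nn.Linear(model.fc.in_features, 2)

# Only the new layer's parameters are handed to the optimizer.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```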
Conclusion
In this article, we saw why knowing the right tricks and techniques matters for training neural networks more effectively. We covered EMA, overfitting a single batch, zero weight decay on BatchNorm and biases, data augmentation, and fine-tuning, which are must-have tricks to keep in mind when training a neural network.
References:
[1] https://deci.ai/blog/tricks-training-neural-networks/
[2] https://www.doc.ic.ac.uk/~nuric/deeplearning/5-tips-for-training-neural-networks.html
[3] https://medium.com/mlearning-ai/tips-and-tricks-for-training-neural-networks-b0fad16d915c