Strategies for Training Large Scale Neural Network Language Models

October 12, 2022

Tags

ai-machine-learning blog computer-vision

Statistical language models are an essential component of many natural language applications, such as machine translation and automated speech recognition. For over three decades, backoff models based on n-gram statistics have dominated the language modeling field, owing to their simplicity and cheap computer complexity.

Other forms of language models have received a lot of attention in recent years. The maximum entropy language model (ME LMs) and the Neural network-based language model (NN LMs) are two effective examples of such models. Their fundamental appeal stems from their ability to mimic real language more precisely than backoff models. Their main disadvantage is their enormous computational complexity. As a result, much academic effort has gone into making these models more computationally efficient.

In this blog, we briefly discuss known ways for lowering the computational complexity of NN LMs (the majority of which also apply to ME LMs). We suggest new straightforward strategies for lowering the computing cost of the training and testing stages. We demonstrate how these new strategies compliment established ones.

‍

Maximum Entropy Model

A maximum entropy model looks like this:

where f is a collection of features, λ is a set of weights, and h denotes a history Learning the set of weights is the first step in training the maximum entropy model. The most common characteristics are n-grams, although any information source, such as triggers or grammatical features, may be easily included. The selection of features is normally done manually, and it has a considerable impact on the overall performance of the model.

ME models trained with only n-gram features outperform traditional backoff models with modified Kneser-Ney smoothing.

A standard neural network language model

The conventional neural network language model is fairly similar in structure. The primary distinction is that the characteristics of this model are learned automatically as a function of history. Furthermore, the ME model's typical characteristics are binary, whereas NN models employ continuous-valued features. The NN LM can be described as follows:

‍

‍

where s denotes a hidden layer state The state of the hidden layer in the feedforward NN LM architecture is determined by a projection layer, which is constructed by projecting N-1 recent words into low-dimensional space. Comparable words have similar low-dimensional representations once the model has been trained.

Because of their capacity to cluster related words (or comparable histories), neural network models outperform state-of-the-art backoff methods. Furthermore, NN LMs complement backoff models, and further improvements can be realized by linearly interpolating them.

‍

Recurrent neural network-based language model (RNN LM)

‍

Figure: Feedforward neural network 4-gram model (on the left) and Recurrent neural network language model (on the right).

The condition of the concealed layer can be affected by the most recent word and the prior time step. As a result, the time is not clearly represented. This recurrence enables the hidden layer to represent a low-dimensional representation of the whole history (or, in other words, it gives memory to the model). We name this architecture the Recurrent neural network-based language model (RNN LM). RNN LM outperforms typical feedforward NN LM designs, as well as many other sophisticated language modelling approaches.

‍

Complexity in Computation

A neural network language model's computational complexity is high for various reasons, and attempts have been made to address practically all of them. The N-gram neural network LM training time is proportional to

where I is the number of training epochs until convergence, W is the number of tokens in the training set, N is the N-gram order, D is the dimensionality of words in the low-dimensional space, H is the hidden layer size, and V is the vocabulary size. The word (N - 1) x D denotes the size of the projection layer. The computational complexity of the recurrent NN LM is as follows:

The complexity of the feedforward architecture grows linearly with increasing order N, but it remains constant for the recurrent one (really, N has no relevance in RNN LM). If the maximum entropy model employs feature set f with complete N-gram features and is trained using on-line stochastic gradient descent in the same manner as neural network models, its computational cost is

‍

Reduction of Training Epochs

NN LMs are often trained using stochastic gradient descent with on-the-fly weight updates. Typically, 10-50 training epochs are believed to be required to achieve convergence, while there are outliers (thousands of epochs are said to be required).

Reduction in the count of Training Tokens

Backoff n-gram language models are often trained on as much data as is available. However, only a tiny portion of this data is in-domain for conventional voice recognition tasks. Out-of-domain data often account for more than 90% of the training corpus, although their weight in the final model is minimal. As a result, NN LMs are often trained solely on in-domain corpora. [10] trains NN LMs using in-domain data plus a randomly subsampled portion of out-of-domain data that is picked at the start of each training period.

Reduction of Vocabulary size

For LVCSR tasks, the hidden layer H typically has between 100 and 500 neurons, while the vocabulary V has between 50k and 300k words. As a result, several attempts have been made to limit the amount of vocabulary. The most basic method is to construct probability distributions just for the top M words in the neural network model; the other words are based on backoff n-gram probabilities. The list of the top M words is then referred to as a shortlist. However, for small values of M, this strategy degrades performance significantly, and even with M = 2000, the complexity of the H V term remains large.

Smaller Size of the Hidden Layer

Another technique to lower H V is to select a low H value. For example, H = 100 is employed when the training data size exceeds 600M words. However, as long as the standard NN LM architecture is utilized, H = 100 is inadequate to provide decent performance when the amount of training data is considerable.

Parallelization

Artificial neural networks are inherently easy to parallelize. It is feasible to divide the matrix times vector calculation across many CPUs, or to analyse numerous cases simultaneously, allowing for matrix times matrix computing, which may be improved using existing libraries such as BLAS. Also, a several-fold speedup from parallelization in the context of NN LMs is also observed. Because the state of the hidden layer depends on the prior state, it may appear that recurrent networks are significantly more difficult to parallelize. However, only the computation between the hidden and output layers may be parallelized. It is also feasible to parallelize the whole network by simultaneously training from various points in the training data.

Conclusion

Another strategy is to divide the training data into K subgroups and train a single NN model on each. However, because neural network models benefit from grouping comparable events, such an approach would provide inferior results. In addition, throughout the testing process, we would end up having K models rather than one.

In this blog, we showed how neural network models, namely recurrent architecture, may deliver considerable gains over state-of-the-art voice recognition setups. Furthermore, even with extremely small hidden layers, training an RNN model with direct connections can result in high performance on perplexity and word error rates. By rejecting and sorting portions of the training data, we may obtain a 10% reduction in perplexity when compared to standard stochastic gradient descent. This gain would most likely be substantially greater for situations where just a subset of training corpora may be deemed in-domain.

Sign up for Free Trial

Latest Blogs