A simple method of extracting features from text data is known as the Bag of Words model. It is widely used in machine learning tasks and is also associated with conversational AI. In this model, a sentence is represented by the occurrence of the words it contains. The Bag of Words model has been immensely successful in solving problems related to document classification and language modeling.
It is known as a 'Bag of Words' because all information about the structure or order of the words is tossed out. The model is not concerned with where a word appears in a sentence, only with whether (and how often) the word is present.
Bag of Words Model: Explained
When text is modeled with machine learning algorithms, it must first be converted into numbers, and that is what the Bag of Words model does. For a better understanding of the concept, follow these steps:
- Collect the required data.
- Choose and design the vocabulary.
- Create a vector representation for each document.
- Score each word, for example by counting how often it occurs.
Let's understand the concept of Bag of Words with an example:
- The dog sat. (Sentence 1)
- The dog sat in the hat. (Sentence 2)
- The dog with the hat. (Sentence 3)
There are three sentences, which together make up one small document collection.
The vocabulary here is: the, dog, sat, in, hat, with.
So, the length-6 vector representation of each sentence is:
Sentence 1 - (1, 1, 1, 0, 0, 0)
Sentence 2 - (2, 1, 1, 1, 1, 0)
Sentence 3 - (2, 1, 0, 0, 1, 1)
The information is now represented as vectors rather than text.
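The worked example above can be sketched in a few lines of Python. This is a minimal illustration, assuming the fixed vocabulary order used in the text; real pipelines would normally build the vocabulary from the corpus itself.

```python
from collections import Counter

# The three example sentences from the text.
sentences = [
    "The dog sat.",
    "The dog sat in the hat.",
    "The dog with the hat.",
]

# Fixed vocabulary, in the order used in the example above.
vocabulary = ["the", "dog", "sat", "in", "hat", "with"]

def bow_vector(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    # Lowercase and strip punctuation before splitting into tokens.
    tokens = "".join(
        c for c in sentence.lower() if c.isalnum() or c.isspace()
    ).split()
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vectors = [bow_vector(s, vocabulary) for s in sentences]
# vectors == [[1, 1, 1, 0, 0, 0], [2, 1, 1, 1, 1, 0], [2, 1, 0, 0, 1, 1]]
```

The output matches the three vectors listed above, confirming that each position simply counts one vocabulary word.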
Managing the Vocabulary
The vector representation of each document grows as the vocabulary grows. Most entries in these vectors will be zero, and vectors dominated by zero scores are known as sparse vectors.
When modeling whole documents, sparse vectors require more computational resources and memory, and it becomes very hard for traditional algorithms to model data with such a vast number of dimensions.
To solve these challenges, some simple text cleaning techniques can be followed. Such techniques are stated below:
- Ignore cases and punctuation.
- Ignore frequently occurring words like 'a', 'is', 'of', etc., known as stop words.
- Fix the words that are misspelled.
- Use stemming algorithms to reduce the words to their stem. For instance, 'going' to 'go'.
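The cleaning steps above can be combined into one small function. This is only a sketch: the stop-word list and suffix rules below are toy assumptions for illustration, while real pipelines would use a library such as NLTK or spaCy, which the article does not cover.

```python
import re

# Toy stop-word list and suffixes, chosen only for this example.
STOP_WORDS = {"a", "an", "is", "of", "the", "in", "with"}
SUFFIXES = ("ing", "ed", "s")

def clean(text):
    """Lowercase, drop punctuation, remove stop words, crudely stem."""
    # Keep only runs of letters/digits, in lowercase.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    cleaned = []
    for token in tokens:
        if token in STOP_WORDS:
            continue  # ignore frequent, low-information words
        for suffix in SUFFIXES:
            # Strip a suffix only if a reasonable stem remains.
            if token.endswith(suffix) and len(token) > len(suffix) + 1:
                token = token[: -len(suffix)]
                break
        cleaned.append(token)
    return cleaned

clean("The dog is going into the hats.")
# → ["dog", "go", "into", "hat"]
```

Note how 'going' reduces to 'go', as in the stemming example above, and how case, punctuation, and stop words disappear, shrinking the vocabulary.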
Scoring Words
After choosing a vocabulary, the occurrence of words in each document must be scored. Common scoring methods are:
- Binary: mark each word with 1 if it is present in the document and 0 if it is absent.
- Counts: count the number of times each word occurs in the document.
- Frequencies: calculate the frequency of each word, i.e. its count divided by the total number of words in the document.
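The three scoring methods can be compared side by side on the second example sentence. This is a minimal sketch, reusing the vocabulary order from earlier in the article.

```python
from collections import Counter

vocabulary = ["the", "dog", "sat", "in", "hat", "with"]
tokens = "the dog sat in the hat".split()
counts = Counter(tokens)

# Binary: 1 if the word appears at all, 0 otherwise.
binary = [1 if counts[w] > 0 else 0 for w in vocabulary]

# Counts: raw number of occurrences of each word.
raw_counts = [counts[w] for w in vocabulary]

# Frequencies: occurrences divided by the total tokens in the document.
frequencies = [counts[w] / len(tokens) for w in vocabulary]

# binary      == [1, 1, 1, 1, 1, 0]
# raw_counts  == [2, 1, 1, 1, 1, 0]
# frequencies == [2/6, 1/6, 1/6, 1/6, 1/6, 0.0]
```

All three produce vectors of the same length; they differ only in how much weight repeated words receive.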
TF-IDF
Generally, highly frequent words tend to dominate the representation of a document even when they carry little informational content. It is important to give more weight to the words that do carry informational content. Such an approach to scoring is called 'Term Frequency-Inverse Document Frequency' (TF-IDF).
Term frequency scores how often a word appears in the current document, whereas inverse document frequency scores how rare the word is across all documents, down-weighting words that appear everywhere.
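A basic TF-IDF computation over the three example sentences can be sketched as follows. The exact IDF formula varies between libraries; the smoothed variant below is one common assumption, not the only correct choice.

```python
import math

# The three example sentences, already tokenized.
documents = [
    ["the", "dog", "sat"],
    ["the", "dog", "sat", "in", "the", "hat"],
    ["the", "dog", "with", "the", "hat"],
]

def tf(word, doc):
    """Term frequency: share of the document taken up by this word."""
    return doc.count(word) / len(doc)

def idf(word, docs):
    """Smoothed inverse document frequency across all documents."""
    containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / (1 + containing)) + 1

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

# 'the' appears in every document, so its IDF is low;
# 'with' appears in only one document, so its IDF is higher.
```

With this scheme, a common word like 'the' ends up with a lower score than a rare word like 'with', even though 'the' occurs more often within each sentence.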
The Bag of Words model allows you to better understand and organize text data by representing it in the form of vectors. It makes the data compact and machine-readable.