Sentiment analysis is a popular application of natural language processing (NLP) that has many practical uses in business and marketing. It can be used to monitor brand reputation, track customer feedback, and identify emerging trends in customer sentiment.
In this post, we will use a pre-trained LLM from Hugging Face as a feature extractor to perform sentiment analysis on a dataset of movie reviews. We will then use MinIO to store the data and the model.
To get started, head over to MyAccount, sign up, and provision a GPU node. We also recommend installing the Remote Explorer extension for VS Code so that you can use the remote node as if it were your local development environment.
Next, we need to install the following libraries:
- Hugging Face Transformers (pip package: transformers)
- Hugging Face Datasets (pip package: datasets)
- MinIO Python SDK (pip package: minio)
Once we have these libraries installed, we can download the movie review dataset from Hugging Face's datasets library using the following code:
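A minimal sketch of this step. The post does not name the dataset, so the Hub ID imdb (the IMDB movie review dataset) and the load_reviews helper are assumptions made for illustration.

```python
def load_reviews(dataset_id="imdb"):
    """Download a dataset from the Hugging Face Hub and return its splits."""
    # Third-party import kept local to the helper.
    from datasets import load_dataset

    # load_dataset returns a DatasetDict keyed by split name (for example "train").
    return load_dataset(dataset_id)
```

Calling load_reviews() downloads the data on first use and caches it locally, so later calls do not hit the network again.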
The Hugging Face Datasets library provides a convenient way to download and work with datasets. However, when dealing with enterprise data, it is not always feasible to upload data to and download it from the Hugging Face Hub. A better solution is to store the data as objects in MinIO buckets and then load it into the Datasets library's internal structures.
The dataset contains 50,000 movie reviews that are labelled as either positive or negative. We will use this dataset to train our sentiment analysis model.
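In this article, we will explore how to use Hugging Face Datasets and MinIO to perform feature extraction and transfer learning, using a pre-trained model from Hugging Face to perform sentiment analysis on a dataset of movie reviews. We will store the dataset in MinIO and reload it, load the model and tokenizer, tokenize the data, extract features, train a classifier on those features, and analyse the results.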
Hugging Face Datasets and MinIO
First, let’s create some helper functions for getting data into and out of MinIO. These functions are below. The get_object() function will retrieve an object from MinIO and save it as a file. The put_file() function will upload a file to a specified bucket within MinIO. If the bucket does not exist, it will be created.
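One way these two helpers could look is sketched below. It assumes the caller passes in an already configured Minio client from the MinIO Python SDK; the client parameter, the endpoint, and the credentials shown in the comment are placeholders, not values from this post.

```python
def put_file(client, bucket_name, object_name, file_path):
    """Upload a local file to MinIO, creating the bucket if it is missing."""
    if not client.bucket_exists(bucket_name):
        client.make_bucket(bucket_name)
    client.fput_object(bucket_name, object_name, file_path)


def get_object(client, bucket_name, object_name, file_path):
    """Download an object from MinIO and save it as a local file."""
    client.fget_object(bucket_name, object_name, file_path)


# Example client setup (endpoint and credentials are placeholders):
#   from minio import Minio
#   client = Minio("localhost:9000", access_key="...", secret_key="...", secure=False)
```

Passing the client in explicitly keeps credentials out of the helpers and makes them easy to exercise with a stand-in object during testing.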
To create the files and upload them to MinIO, run the snippet below. You will get one file for each set. Other file types that are supported are CSV, Arrow and Parquet.
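A sketch of this step, assuming the splits are written as JSON Lines files (the post does not say which format it used) and that client is a configured Minio client. The upload_splits name, the bucket name, and the file naming scheme are hypothetical.

```python
import os


def upload_splits(dataset, client, bucket_name, local_dir="."):
    """Write each split to a JSON Lines file and upload it to MinIO."""
    if not client.bucket_exists(bucket_name):
        client.make_bucket(bucket_name)

    uploaded = []
    for split_name, split in dataset.items():
        file_name = f"{split_name}.json"
        file_path = os.path.join(local_dir, file_name)
        # Dataset.to_json writes one JSON record per line by default;
        # to_csv and to_parquet are the equivalents for those formats.
        split.to_json(file_path)
        client.fput_object(bucket_name, file_name, file_path)
        uploaded.append(file_name)
    return uploaded
```

With a DatasetDict named dataset, upload_splits(dataset, client, "reviews") would produce one object per split, such as train.json and test.json.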
Finally, we can reload our data from MinIO using the code below.
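The reload could be sketched as follows, assuming one JSON file per split was stored under the object name SPLIT.json and that client is a configured Minio client. The helper names and the naming scheme are assumptions for illustration.

```python
import os


def download_splits(client, bucket_name, split_names, local_dir="."):
    """Fetch one JSON file per split from MinIO and return their local paths."""
    data_files = {}
    for split_name in split_names:
        file_name = f"{split_name}.json"
        file_path = os.path.join(local_dir, file_name)
        client.fget_object(bucket_name, file_name, file_path)
        data_files[split_name] = file_path
    return data_files


def load_splits(data_files):
    """Build a DatasetDict from local JSON files, one entry per split."""
    # Third-party import kept local to the helper.
    from datasets import load_dataset

    return load_dataset("json", data_files=data_files)
```

For example, load_splits(download_splits(client, "reviews", ["train", "validation", "test"])) would return a DatasetDict with those three splits, where "reviews" is a hypothetical bucket name.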
We now have a DatasetDict object loaded with a training set, a validation set, and a test set. We can look at the columns using the column_names property.
Load the Model and Tokenizer
To load a pre-trained model from Hugging Face, we can use the from_pretrained method of the appropriate model class. For example, to load a DistilBERT model for sequence classification, we can use the following code:
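A minimal sketch, assuming the distilbert-base-uncased checkpoint and two output labels (positive and negative). The post does not state which checkpoint it used, so both values and the load_model helper are assumptions.

```python
def load_model(checkpoint="distilbert-base-uncased", num_labels=2):
    """Load DistilBERT with a sequence classification head."""
    # Third-party import kept local to the helper.
    from transformers import DistilBertForSequenceClassification

    # num_labels sizes the classification head; its weights start untrained.
    return DistilBertForSequenceClassification.from_pretrained(
        checkpoint, num_labels=num_labels
    )
```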
We can also load a tokenizer for the model using the from_pretrained method of the appropriate tokenizer class:
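A matching sketch for the tokenizer, again assuming the distilbert-base-uncased checkpoint. The tokenizer should come from the same checkpoint as the model so that the vocabulary matches.

```python
def load_tokenizer(checkpoint="distilbert-base-uncased"):
    """Load the fast DistilBERT tokenizer for the given checkpoint."""
    # Third-party import kept local to the helper.
    from transformers import DistilBertTokenizerFast

    return DistilBertTokenizerFast.from_pretrained(checkpoint)
```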
Tokenize the Data
Before we can train our model, we need to preprocess the data. This involves tokenizing the text and converting it into a format that can be used by our model.
To tokenize our data using the tokenizer we loaded earlier, we can use the following code:
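A sketch of the tokenization step, assuming the review text lives in a column named text and that reviews are padded or truncated to 512 tokens. The column name, the length, and the helper names are assumptions.

```python
def tokenize_batch(batch, tokenizer, max_length=512):
    """Tokenize a batch of reviews, padding and truncating to max_length."""
    return tokenizer(
        batch["text"],
        padding="max_length",  # pad every review to the same fixed length
        truncation=True,       # cut reviews longer than max_length
        max_length=max_length,
    )


def tokenize_dataset(dataset, tokenizer, max_length=512):
    """Apply tokenize_batch to the dataset in batches via map."""
    return dataset.map(
        tokenize_batch,
        batched=True,
        fn_kwargs={"tokenizer": tokenizer, "max_length": max_length},
    )
```

Using batched=True lets the tokenizer process many reviews per call, which is usually much faster than tokenizing one review at a time.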
This code tokenizes each review in the dataset using our tokenizer. It also pads each review to a fixed length and truncates any reviews that are too long.
Feature Extraction
Feature extraction is a technique that involves using a pre-trained model to extract features from data. In our case, we will use our pre-trained DistilBERT model to extract features from our tokenized movie review dataset.
To perform feature extraction, we can use the following code:
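One possible sketch is shown here. It assumes the feature for each review is the final-layer hidden state of the first ([CLS]) token, which is a common choice but is not confirmed by the post, and that the dataset has already been tokenized into input_ids and attention_mask columns. The extract_features name and the batch size are hypothetical.

```python
def extract_features(dataset, model, batch_size=32, device="cpu"):
    """Return one feature vector per review, collected in a list."""
    # Third-party import kept local to the helper.
    import torch

    model.to(device)
    model.eval()  # disable dropout so features are deterministic

    features = []
    for start in range(0, len(dataset), batch_size):
        batch = dataset[start:start + batch_size]  # dict of column -> list
        inputs = {
            name: torch.tensor(batch[name]).to(device)
            for name in ("input_ids", "attention_mask")
        }
        with torch.no_grad():  # no gradients are needed for extraction
            outputs = model(**inputs, output_hidden_states=True)
        # Last layer, first token of every sequence in the batch.
        cls_states = outputs.hidden_states[-1][:, 0]
        features.extend(cls_states.cpu().numpy())
    return features
```

On a GPU node, passing device="cuda" moves the model and each batch to the GPU; the features are copied back to the CPU as NumPy arrays before being stored.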
This code iterates over each review in our dataset and uses our pre-trained model to extract features from it. The features are then stored in a list.
Transfer Learning
Transfer learning is a technique that involves using knowledge learned from one task to improve performance on another related task. In our case, we will use transfer learning to train a new model for sentiment analysis using the features extracted from our pre-trained model.
Install the scikit-learn package with pip, for example by running pip install scikit-learn.
To train our new model using transfer learning, we can use the following code:
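A sketch of the training step using scikit-learn, assuming features is the list of extracted vectors and labels holds the matching 0/1 sentiment labels. The 80/20 split, the random seed, and the max_iter value are illustrative choices, not settings taken from the post.

```python
def train_and_evaluate(features, labels, test_size=0.2, seed=42):
    """Fit logistic regression on extracted features and report test accuracy."""
    # Third-party imports kept local to the helper.
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Hold out part of the data; stratify keeps the label balance in both parts.
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=test_size, random_state=seed, stratify=labels
    )

    # A higher max_iter helps the solver converge on high-dimensional features.
    classifier = LogisticRegression(max_iter=1000)
    classifier.fit(X_train, y_train)

    accuracy = accuracy_score(y_test, classifier.predict(X_test))
    return classifier, accuracy
```

Only the small logistic regression model is trained here; the DistilBERT weights stay fixed, which is what keeps this approach cheap compared with fine-tuning the whole network.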
This code splits our extracted features into training and test sets and trains a logistic regression model on them. We then evaluate the performance of our new model on the test set.
Analysing the Results
Our logistic regression model has achieved an accuracy of 0.87 on the test set. This demonstrates that transfer learning can be an effective technique for training models on limited data.
In this article, we explored how to use Hugging Face Datasets and MinIO to perform feature extraction and transfer learning. We used a pre-trained LLM from Hugging Face to perform sentiment analysis on a dataset of movie reviews, used MinIO to store the data and the model, and demonstrated how transfer learning can be used to train models on limited data.