How to Create a Vector-Based Recommendation System

In the age of information overload, recommendation systems have become indispensable tools for helping users discover content tailored to their interests. Whether it's suggesting movies, music, or products, recommendation systems rely on sophisticated algorithms and data analysis to predict what users will like. One of the most powerful approaches is the vector-based recommendation system. In this blog post, we will explore how to create a vector-based recommendation system using movie recommendations as our example.

Understanding Recommendation Systems

Before diving into the technical details, let's first understand the fundamentals of recommendation systems:

Collaborative Filtering: This method suggests items based on the preferences and behavior of users. It assumes that users who have shown similar behavior in the past will have similar preferences in the future.
Content-Based Filtering: This approach recommends items based on their features and a user's past behavior. For movie recommendations, it can involve analyzing movie metadata such as genre, actors, and directors.
Vector-Based Recommendation Systems: These systems represent both users and items as vectors in a multi-dimensional space. Recommendations are made by finding the similarity between these vectors.

The Vector-Based Recommendation System

Vector-based recommendation systems take a different approach by representing items and users as vectors in a multi-dimensional space. In this space, similar items and users are located closer to each other. The concept is similar to mapping user preferences and item attributes in a common vector space, making it easier to measure similarity and make recommendations.

Here's how vector-based recommendation systems work:

Embedding Items and Users: Each item and user is assigned a vector representation in a high-dimensional space. These vectors capture various attributes, preferences, and features. For example, in a movie recommendation system, vectors could represent factors like genre, director, actor, and user ratings.
Learning Embeddings: The core of vector-based recommendation systems lies in learning these embeddings. This process involves sophisticated machine learning algorithms, such as matrix factorization or deep learning, that aim to minimize the difference between predicted and actual user-item interactions.
Recommendations: To make recommendations, the system calculates the similarity between a user's vector and items in the database. It suggests items that are most similar to the user's preferences based on the proximity of vectors in the embedded space.
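
The recommendation step above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production implementation: the vectors are made-up toy values instead of learned embeddings, and the names `top_k_items`, `user_vector`, and `item_vectors` are hypothetical.

```python
import numpy as np

def top_k_items(user_vector, item_vectors, k=2):
    """Return the indices and cosine scores of the k items closest to the user."""
    # Normalise to unit length so that a dot product equals cosine similarity
    user = user_vector / np.linalg.norm(user_vector)
    items = item_vectors / np.linalg.norm(item_vectors, axis=1, keepdims=True)
    scores = items @ user
    order = np.argsort(-scores)[:k]  # highest similarity first
    return order.tolist(), scores[order].tolist()

# Toy 3-dimensional embeddings (hypothetical values)
item_vectors = np.array([
    [0.9, 0.1, 0.0],  # item 0
    [0.0, 1.0, 0.2],  # item 1
    [0.8, 0.3, 0.1],  # item 2
])
user_vector = np.array([1.0, 0.2, 0.0])

indices, scores = top_k_items(user_vector, item_vectors, k=2)
print(indices, scores)
```

Items 0 and 2 point in nearly the same direction as the user vector, so they rank ahead of item 1.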

Building a Vector-Based Recommendation System

Now, let's walk through the steps to create a vector-based recommendation system for movie recommendations.

Data Collection: The first step is to gather data. In the case of movie recommendations, you'll need a dataset that contains information about movies (e.g., title, genre, actors, directors) and user interactions (e.g., ratings, reviews). Websites like MovieLens and IMDb provide such datasets for research and development.
Data Preprocessing: Clean and preprocess your data. Remove duplicates, handle missing values, and transform categorical data into numerical form. For example, you can one-hot encode movie genres or create actor and director embeddings.
Creating User and Movie Vectors: To build user and movie vectors, use techniques such as matrix factorization or deep learning models like neural collaborative filtering. These methods extract latent features that represent users and movies in the same vector space.
Calculating Similarities: Once you have your user and movie vectors, calculate the similarity between them. Cosine similarity and the Pearson correlation coefficient are commonly used metrics.
Generating Recommendations: For a given user, identify the movies with the highest similarity scores and recommend them. You can also incorporate user-specific data, such as their past interactions or ratings, to personalize the recommendations further.
Evaluation: Evaluate your recommendation system using metrics like Mean Average Precision (MAP), Root Mean Square Error (RMSE), or precision-recall curves to ensure the recommendations are accurate and relevant to users.
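
To make the vector-creation and evaluation steps more concrete, here is a minimal sketch of matrix factorization trained with stochastic gradient descent and scored with RMSE. The ratings matrix is a toy example, and the function names and hyperparameters (`factorize`, `rmse`, `n_factors`, `lr`, `reg`, `epochs`) are assumptions chosen for illustration, not tuned values.

```python
import numpy as np

def factorize(ratings, n_factors=2, lr=0.01, reg=0.02, epochs=500, seed=0):
    """Learn user and item vectors from observed ratings (0 means unobserved)."""
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    users = rng.normal(scale=0.1, size=(n_users, n_factors))
    items = rng.normal(scale=0.1, size=(n_items, n_factors))
    observed = [(u, i) for u in range(n_users)
                for i in range(n_items) if ratings[u, i] > 0]
    for _ in range(epochs):
        for u, i in observed:
            err = ratings[u, i] - users[u] @ items[i]
            user_old = users[u].copy()
            # Gradient step on squared error with L2 regularisation
            users[u] += lr * (err * items[i] - reg * users[u])
            items[i] += lr * (err * user_old - reg * items[i])
    return users, items

def rmse(ratings, users, items):
    """Root mean square error over the observed ratings only."""
    mask = ratings > 0
    predictions = users @ items.T
    return float(np.sqrt(np.mean((ratings[mask] - predictions[mask]) ** 2)))

# Toy ratings: rows are users, columns are movies, 0 = not rated
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

rmse_before = rmse(ratings, *factorize(ratings, epochs=0))
users, items = factorize(ratings)
rmse_after = rmse(ratings, users, items)
print(rmse_before, rmse_after)
```

The learned `users` and `items` arrays live in the same vector space, so a similarity or dot product between a user row and an item row gives a score for any user-movie pair, including unrated ones. In practice you would measure RMSE on held-out ratings rather than on the training data as this sketch does.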

Tools and Technologies

To implement a vector-based recommendation system, you can use a variety of tools and technologies, including Python, popular libraries like NumPy, pandas, and scikit-learn, and machine learning frameworks like TensorFlow or PyTorch.

Benefits of Vector-Based Recommendation Systems

Vector-based recommendation systems offer several advantages over traditional methods:

Improved Personalization: Vector-based systems provide highly personalized recommendations because they can capture complex relationships between users and items in a multi-dimensional space.
Diversity in Recommendations: They are better at suggesting diverse and unexpected items, as they can identify less obvious connections and preferences.
Cold Start Problem Mitigation: When vectors are built from item attributes (such as genre or plot text) rather than interaction history alone, these systems can mitigate the cold start problem: a new item can be placed in the vector space and recommended before anyone has rated it.
Scalability: These systems are scalable and adaptable to various domains, from e-commerce to content streaming, allowing for seamless expansion.
Constant Learning: They can continuously learn and adapt to changes in user preferences, keeping recommendations up-to-date.
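
The cold start point above can be illustrated with a small content-based sketch in plain Python. It assumes each movie is described only by a set of genres; the genre list, the movie titles, and the helper names `genre_vector` and `cosine` are hypothetical.

```python
GENRES = ["action", "comedy", "drama", "sci-fi"]

def genre_vector(genres):
    """One-hot style vector: 1.0 for each genre the movie has, else 0.0."""
    return [1.0 if g in genres else 0.0 for g in GENRES]

def cosine(a, b):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# User profile: the average genre vector of movies the user liked
liked = [{"action", "sci-fi"}, {"action"}]
liked_vectors = [genre_vector(g) for g in liked]
profile = [sum(column) / len(liked_vectors) for column in zip(*liked_vectors)]

# Brand-new movies with no ratings yet can still be scored from their genres
new_movies = {
    "New Space Film": {"sci-fi", "action"},
    "New Romcom": {"comedy"},
}
scores = {title: cosine(profile, genre_vector(g)) for title, g in new_movies.items()}
print(scores)
```

Neither new movie has any interaction history, yet the one that shares genres with the user's profile still receives the higher score.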

Tutorial: Vector-Based Movie Recommendation System

In this tutorial, we will walk through the code provided for ‘Movie Recommender using Vector’. This code leverages natural language processing techniques to recommend movies based on their plot synopses. The code uses various libraries and techniques to achieve this, including Levenshtein distance, sentence embeddings, and nearest neighbours. We will explain each step and provide a clear understanding of the code.

Prerequisites

Before you get started, make sure you have the necessary libraries installed. You can install the required libraries by running the following commands:

!pip install "tqdm>=4.62.2"
!pip install Levenshtein

The code also uses the Sentence Transformers library, which you can install using the following command:

!pip install sentence-transformers

Code Walkthrough

Let's go through the code step by step:

1. Import Necessary Libraries

import pandas as pd
from Levenshtein import distance
from tqdm import tqdm
from sentence_transformers import SentenceTransformer
import numpy as np

2. Download and Load the Dataset

I've obtained the initial dataset from Kaggle, specifically the MPST dataset named "Movie Plot Synopses with Tags," which was authored by Sudipta Kar. To handle this dataset, I'll utilize the pandas library. For our current task, we're primarily interested in two types of data: the movie title and its description. Users will use the title to identify the movie they're interested in, while the movie's description will be encoded into a vector representation. Once the data is encoded, there's no longer a need for the movie descriptions.

Load movie data from a CSV file:

df = pd.read_csv('/content/mpst_full_data.csv')

3. Preprocess the Movie Dataset

To ensure accurate recommendations, we need to remove duplicate movies. Since some duplicates may not be identical, we'll employ a custom algorithm to identify and remove them.

# Sort by title so that near-duplicate titles end up on adjacent rows
df = df.sort_values('title').reset_index(drop=True)
df['lev'] = None
df
from Levenshtein import distance
for a in range(len(df)-1):
  # Flag neighbouring titles that are at most 3 edits apart
  if distance(df.iloc[a].title, df.iloc[a+1].title) <= 3:
    print(a, df.iloc[a].title, df.iloc[a+1].title)
    df.at[a, 'lev'] = distance(df.iloc[a].title, df.iloc[a+1].title)
df
#we filter similar movies

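# Find 'Avengers' duplicates: look for rows whose title contains 'Avengers'
for a in range(len(df)):
    if df.iloc[a]['title'].find('Avengers') != -1:
        pass
        #print(a)  # uncomment to print the matching row indices
# Drop the extra row by index; list several indices to drop multiple rows
df = df.drop([9572]).reset_index(drop=True)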
df
df.to_csv('mpst_no_duplicates.csv')
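
To make the `<= 3` threshold more tangible, here is a minimal pure-Python sketch of the Levenshtein (edit) distance. The tutorial itself uses the `distance` function from the Levenshtein package; `edit_distance` is a hypothetical stand-in written for illustration, and the example titles are not taken from the dataset.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]

THRESHOLD = 3
pairs = [("Toy Story", "Toy Story 2"), ("The Avengers", "Avengers")]
for first, second in pairs:
    d = edit_distance(first, second)
    status = "flagged" if d <= THRESHOLD else "not flagged"
    print(first, "|", second, "->", d, status)
```

The two pairs show the limits of a fixed threshold: a sequel title can sit within 3 edits of the original, while a true duplicate that differs by a leading "The " is 4 edits away. That is one reason a manual check, like the 'Avengers' search above, can still be useful.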

4. Encode the Data

This code segment processes text data in a DataFrame to encode it into vector representations using the 'SentenceTransformer' model. It tracks the progress of this encoding task with a progress bar.

from tqdm import tqdm
from sentence_transformers import SentenceTransformer
import numpy as np
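tqdm.pandas()  # enables progress_apply, which shows a progress bar
model = SentenceTransformer('all-MiniLM-L6-v2')  # 'all-mpnet-base-v2' is an alternative model
df_ = df.copy()
# Encode each plot synopsis into a fixed-length vector
df_['plot_synopsis'] = df_['plot_synopsis'].progress_apply(lambda x : model.encode(x))
df_index = df_.pop('title')
df_ = df_[['plot_synopsis']]
# Expand the column of vectors into one numeric column per embedding dimension
df_ = pd.DataFrame(np.column_stack(list(zip(*df_.values))))
df_.index = df_index  # index the vectors by movie title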
df_

Save the encoded vectors to a CSV file:

df_.to_csv('mpst_encoded_no_duplicates.csv')
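
The reshaping expression in the encoding step is fairly dense, so here is a small NumPy-only sketch of what it does. It uses fake 4-dimensional vectors in place of real sentence embeddings, and the variable names are illustrative.

```python
import numpy as np

# Three fake 4-dimensional "embeddings", stored one per row in an object
# column to mimic df_[['plot_synopsis']].values after encoding
vectors = [np.arange(4) + 10 * i for i in range(3)]
values = np.empty((3, 1), dtype=object)
for row, vec in enumerate(vectors):
    values[row, 0] = vec

# The expression used in the tutorial
wide_original = np.column_stack(list(zip(*values)))
# A more direct way to get the same result: stack the vectors as rows
wide_simple = np.vstack(list(values[:, 0]))

print(wide_original.shape)
print(np.array_equal(wide_original, wide_simple))
```

Either way, the result is a plain numeric matrix with one row per movie and one column per embedding dimension, which is the layout the nearest-neighbour search expects.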

5. Perform a Vector Search: Test Your Recommendation System

To perform the vector search, we'll use the nearest neighbour search from scikit-learn (sklearn). First, we load the encoded dataset.

import pandas as pd
df_movies_encoded = pd.read_csv('/content/mpst_encoded_no_duplicates.csv')
df_movies_encoded.index = df_movies_encoded.pop('title')
df_movies_encoded

Next, we train the nearest neighbour model.

from Levenshtein import distance
from sklearn.neighbors import NearestNeighbors
import numpy as np
import pandas as pd
# Fit a ball-tree index; with n_neighbors=2 a query for a movie in the dataset
# normally returns the movie itself plus its closest match
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(df_movies_encoded)
# String search: return the dataset title with the smallest Levenshtein distance to the query
def closest_title(title):
    m = pd.DataFrame(df_movies_encoded.index)
    m['lev'] = m['title'].apply(lambda x : distance(x, title))
    return m.sort_values('lev', ascending=True)['title'].iloc[0]

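The closest_title function handles searches for movies that are not in the dataset, or that contain typos, by returning the dataset title with the smallest Levenshtein distance to the query. The next function uses it to look up a movie's vector and print its nearest neighbours.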

def find_similar_movies(df, nbrs, title):
    # if the title is not in df, fall back to the closest matching title
    title = closest_title(title)
    distances, indices = nbrs.kneighbors([df.loc[title]])
    # the first neighbour is normally the queried movie itself, so skip it
    for index in indices[0][1:]:
        print('index', index)
        print(title, '->', df.iloc[index].name)

Finally, we can recommend movies based on user input.

# Replace 'Avengers' with your own movie title to get a recommendation
find_similar_movies(df_movies_encoded, nbrs, 'Avengers')
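
To show what the nearest-neighbour lookup computes, here is a brute-force NumPy sketch using Euclidean distance, which is the default metric of scikit-learn's NearestNeighbors. The titles and 2-dimensional vectors are toy values, and `k_nearest` is a hypothetical helper rather than part of the tutorial code.

```python
import numpy as np

def k_nearest(vectors, query_index, k=2):
    """Brute-force Euclidean nearest neighbours of one row, closest first."""
    diffs = vectors - vectors[query_index]
    distances = np.sqrt((diffs ** 2).sum(axis=1))
    order = np.argsort(distances)[:k]
    return order.tolist(), distances[order].tolist()

titles = ["Movie A", "Movie B", "Movie C", "Movie D"]  # hypothetical titles
vectors = np.array([
    [0.0, 0.0],
    [5.0, 5.0],
    [0.1, 0.2],
    [4.8, 5.1],
])

indices, distances = k_nearest(vectors, query_index=0, k=2)
# The closest match to a movie is the movie itself (distance 0), so skip it
recommended = [titles[i] for i in indices[1:]]
print(recommended)
```

This mirrors why the tutorial asks for two neighbours and then drops the first: the query movie is always its own nearest neighbour. A tree-based index such as the ball tree used above returns the same neighbours without comparing the query against every row.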

6. Conclusion

This code demonstrates a movie recommender system that uses Levenshtein distance to deduplicate and match movie titles, encodes plot synopses with a Sentence Transformer model, and finds similar movies with a nearest-neighbour search. It can be a useful starting point for recommendations based on textual data such as plot synopses, and you can adapt and extend it for your own use case.

Conclusion

Creating a vector-based recommendation system is a powerful way to provide personalized content recommendations to users. Whether it's suggesting movies, songs, or products, understanding the fundamental concepts of recommendation systems and following the steps outlined in this blog post will help you build an effective recommendation engine. Keep in mind that recommendation systems are dynamic and require continuous monitoring and refinement to adapt to changing user preferences. As you delve deeper into the world of recommendation systems, you'll discover more advanced techniques and approaches to further enhance the quality of your recommendations.
