<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: seniordatascientist</title>
    <description>The latest articles on Forem by seniordatascientist (@seniordatascientist).</description>
    <link>https://forem.com/seniordatascientist</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F785769%2Fe6de39ac-600f-41d7-8657-c5129eb4a6d5.png</url>
      <title>Forem: seniordatascientist</title>
      <link>https://forem.com/seniordatascientist</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/seniordatascientist"/>
    <language>en</language>
    <item>
      <title>Content-based Recommender System with Python</title>
      <dc:creator>seniordatascientist</dc:creator>
      <pubDate>Tue, 04 Jan 2022 13:20:07 +0000</pubDate>
      <link>https://forem.com/seniordatascientist/content-based-recommender-system-with-python-5g85</link>
      <guid>https://forem.com/seniordatascientist/content-based-recommender-system-with-python-5g85</guid>
<description>&lt;p&gt;Recommender systems are methods that predict the interests of users and generate relevant recommendations for different products or services. These can range from songs to play on Apple Music, to movies to watch on a streaming service, articles to read in a news journal, or products to buy on Amazon.&lt;/p&gt;

&lt;p&gt;Recommender systems are differentiated mainly by the type of data in use. &lt;/p&gt;

&lt;p&gt;Whereas content-based recommenders rely on features of users and/or items, collaborative filtering uses information on the interactions between users and items, as captured in the user-item matrix.&lt;/p&gt;

&lt;p&gt;Recommender systems are generally divided into three main approaches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;content-based recommendation engines&lt;/li&gt;
&lt;li&gt;collaborative filtering recommendation engines&lt;/li&gt;
&lt;li&gt;and hybrid recommendation systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What are content-based recommender systems?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Content-based recommenders produce recommendations using the features or attributes of items and/or users. &lt;/p&gt;

&lt;p&gt;User attributes can include age, sex, job and other personal information. Item attributes, on the other hand, are descriptive properties that distinguish items from each other. &lt;/p&gt;

&lt;p&gt;Example features for movies include title, cast, description and genre.&lt;/p&gt;
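&lt;p&gt;As a toy illustration (the titles and values below are made up for the example, not taken from any dataset), such item features can be collected in a small table:&lt;/p&gt;

```python
import pandas as pd

# hypothetical item features for a few movies (made-up values)
items = pd.DataFrame({
    "title": ["Batman", "Toy Story", "Heat"],
    "genre": ["action", "animation", "crime"],
    "description": [
        "the dark knight of gotham",
        "toys that come alive",
        "a heist crew against a detective",
    ],
})

# each row is one item; each column is one descriptive attribute
print(items[["title", "genre"]])
```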

&lt;p&gt;Through their reliance on features, content-based methods are similar to traditional machine learning models, which are often feature-based. &lt;/p&gt;

&lt;p&gt;One of the inherent advantages of content-based recommenders is that they have a certain degree of user independence. To generate recommendations for a user, they do not need information about other users, as collaborative filtering (CF) methods do. &lt;/p&gt;

&lt;p&gt;The content-based approach is thus easier to scale. Explainability of AI models has also become very important in recent years; a whole field, called XAI (explainable AI), has developed from efforts in this area. &lt;/p&gt;

&lt;p&gt;There are many nice libraries available to help with the explainability of AI predictions; personally, I like SHAP and LIME. &lt;/p&gt;

&lt;p&gt;Content-based methods are also better with respect to explainability, as it is easier to explain their recommendations than those of collaborative filtering. &lt;/p&gt;

&lt;p&gt;That said, CF methods offer some explainability as well. The CF library implicit (&lt;a href="https://github.com/benfred/implicit"&gt;https://github.com/benfred/implicit&lt;/a&gt;), which I used a lot in my past projects, for example has the method model.explain available for that. &lt;/p&gt;

&lt;p&gt;Returning to the content-based approach, it also has its drawbacks. One of them is that it can over-specialize: if the user is only interested in specific categories, the recommender will have difficulty recommending items outside of them, so the user may remain confined to the same kinds of items. &lt;/p&gt;

&lt;p&gt;I will now build an example of a content-based recommender in Python, using the MovieLens data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Content-based recommender system for movie recommendation&lt;/strong&gt;&lt;br&gt;
Our recommender system will be able to recommend movies to us.&lt;/p&gt;

&lt;p&gt;First, we import the required libraries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

import ast

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.metrics.pairwise import cosine_similarity

import seaborn as sns

import numpy as np

import matplotlib.pyplot as plt`

import pandas as pd

import ast

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.metrics.pairwise import cosine_similarity

import seaborn as sns

import numpy as np

import matplotlib.pyplot as plt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We next get our data set from &lt;a href="https://www.kaggle.com/rounakbanik/the-movies-dataset"&gt;https://www.kaggle.com/rounakbanik/the-movies-dataset&lt;/a&gt; and &lt;a href="https://grouplens.org/datasets/movielens/latest/"&gt;https://grouplens.org/datasets/movielens/latest/&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df_data = pd.read_csv(‘movies_metadata.csv’, low_memory=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As part of pre-processing, we remove movies which have a low number of votes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
df_data = df_data[df_data['vote_count'].notna()]

plt.figure(figsize=(20,5))

sns.distplot(df_data['vote_count'])

plt.title("Histogram of vote counts")

df_data = df_data[df_data['vote_count'].notna()]

plt.figure(figsize=(20,5))

sns.distplot(df_data['vote_count'])

plt.title("Histogram of vote counts")
# determine the minimum number of votes that the movie must have to be included 

min_votes = np.percentile(df_data['vote_count'].values, 85)
1
min_votes = np.percentile(df_data['vote_count'].values, 85)
# exclude movies that do not have minimum number of votes

df = df_data.copy(deep=True).loc[df_data['vote_count'] &amp;gt; min_votes]
1
df = df_data.copy(deep=True).loc[df_data['vote_count'] &amp;gt; min_votes]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our content-based recommender will aim to recommend movies whose plots are similar to that of a selected movie.&lt;/p&gt;

&lt;p&gt;We will use the “overview” feature from our dataset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
# removing rows with missing overview

df = df[df['overview'].notna()]

df.reset_index(inplace=True)


# processing of overviews

def process_text(text):

    # replace multiple spaces with one

    text = ' '.join(text.split())

    # lowercase

    text = text.lower()

    return text

df['overview'] = df.apply(lambda x: process_text(x.overview),axis=1)

# removing rows with missing overview

df = df[df['overview'].notna()]

df.reset_index(inplace=True)


# processing of overviews

def process_text(text):

    # replace multiple spaces with one

    text = ' '.join(text.split())

    # lowercase

    text = text.lower()

    return text

df['overview'] = df.apply(lambda x: process_text(x.overview),axis=1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To compare movie plots, we first need to compute their vector representations. Various methods are available, from bag of words and word embeddings to TF-IDF; we will use the latter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TF-IDF approach&lt;/strong&gt;&lt;br&gt;
The TF-IDF of a word in a document that is part of a larger text corpus is a combination of two values. One is the term frequency (TF), which measures how frequently the word occurs in the document.&lt;/p&gt;

&lt;p&gt;However, some words, such as “the” and “is”, occur frequently in all documents, and we want to scale down their importance. This is done by multiplying the term frequency with the inverse document frequency (IDF).&lt;/p&gt;

&lt;p&gt;In this way, only words that are frequent in a given document but comparatively rare in the rest of the corpus are considered relevant to it.&lt;/p&gt;
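&lt;p&gt;As a quick sketch of this effect on a toy corpus (not the MovieLens data): scikit-learn's TfidfVectorizer exposes the learned IDF weights, and a word occurring in every document receives a lower IDF than a rare one:&lt;/p&gt;

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# toy corpus: "the" occurs in every document, "cat" in only one
corpus = ["the cat sat", "the dog ran", "the bird flew"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)

# IDF downweights words that occur in many documents
vocab = vectorizer.vocabulary_
assert vectorizer.idf_[vocab["the"]] < vectorizer.idf_[vocab["cat"]]
```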

&lt;p&gt;For building the TF-IDF representation of movie plots we will use the TfidfVectorizer from scikit-learn. We fit the TfidfVectorizer on the movie plot descriptions and then transform them into their TF-IDF numerical representation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tf_idf = TfidfVectorizer(stop_words='english')

tf_idf_matrix = tf_idf.fit_transform(df['overview']);

tf_idf = TfidfVectorizer(stop_words='english')

tf_idf_matrix = tf_idf.fit_transform(df['overview']);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can now compute the similarity of movies by calculating their pair-wise cosine similarities and storing them in a cosine similarity matrix:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cosine_similarity_matrix = cosine_similarity(tf_idf_matrix, tf_idf_matrix)
1
2
3
# calculating cosine similarity between movies

cosine_similarity_matrix = cosine_similarity(tf_idf_matrix, tf_idf_matrix)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
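&lt;p&gt;To make the similarity measure concrete, here is a minimal standalone sketch showing that cosine similarity is the dot product of two vectors divided by the product of their norms, so parallel vectors score 1 and orthogonal vectors score 0:&lt;/p&gt;

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([[1.0, 2.0, 0.0]])
b = np.array([[2.0, 4.0, 0.0]])  # same direction as a, twice the length
c = np.array([[0.0, 0.0, 3.0]])  # orthogonal to a

# manual computation: dot product divided by the product of the norms
manual = float(a @ b.T) / (np.linalg.norm(a) * np.linalg.norm(b))

assert np.isclose(cosine_similarity(a, b)[0, 0], manual)
assert np.isclose(cosine_similarity(a, b)[0, 0], 1.0)  # parallel vectors
assert np.isclose(cosine_similarity(a, c)[0, 0], 0.0)  # orthogonal vectors
```

Note that cosine similarity ignores vector length, which is why the movie with the most words does not automatically dominate the rankings.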



&lt;p&gt;With the cosine similarity matrix computed, we can define the function “recommendations” that will return the top recommendations for a given movie:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

def index_from_title(df,title):

return df[df['original_title']==title].index.values[0]

# function that returns the title of the movie from its index

def title_from_index(df,index):

return df[df.index==index].original_title.values[0]`



# generating recommendations for given title

def recommendations( original_title, df,cosine_similarity_matrix,number_of_recommendations):

index = index_from_title(df,original_title)

similarity_scores = list(enumerate(cosine_similarity_matrix[index]))

similarity_scores_sorted = sorted(similarity_scores, key=lambda x: x[1], reverse=True)

recommendations_indices = [t[0] for t in similarity_scores_sorted[1:(number_of_recommendations+1)]]

return df['original_title'].iloc[recommendations_indices]


def index_from_title(df,title):

return df[df['original_title']==title].index.values[0]


# function that returns the title of the movie from its index

def title_from_index(df,index):

return df[df.index==index].original_title.values[0]


# generating recommendations for given title

def recommendations( original_title, df,cosine_similarity_matrix,number_of_recommendations):

index = index_from_title(df,original_title)

similarity_scores = list(enumerate(cosine_similarity_matrix[index]))

similarity_scores_sorted = sorted(similarity_scores, key=lambda x: x[1], reverse=True)

recommendations_indices = [t[0] for t in similarity_scores_sorted[1:(number_of_recommendations+1)]]

return df['original_title'].iloc[recommendations_indices]


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can now produce our recommendation for a given film, e.g. ‘Batman’:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;recommendations('Batman', df, cosine_similarity_matrix, 10)

3693    Batman Beyond: Return of the Joker
5962    The Dark Knight Rises
7379    Batman vs Dracula
5476    Batman: Under the Red Hood
6654    Batman: Mystery of the Batwoman
3911    Batman Begins
6334    Batman: The Dark Knight Returns, Part
1770    Batman &amp;amp; Robin
4725    The Dark Knight
709     Batman Returns
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;In the second article, we will build another content-based recommender. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use cases of content-based recommenders&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Content-based recommenders can be used for many different purposes, and we have built them for several platforms in the past. At &lt;a href="https://www.onlinestores.ai"&gt;online-stores.ai&lt;/a&gt; we built a content-based recommender which suggests stores similar to a given input online store, using product names as the relevant feature. The product names were transformed into vector representations using sentence embeddings, and we were surprised how well the recommender performed with this approach. &lt;/p&gt;

&lt;p&gt;For another platform, &lt;a href="https://www.trendingproducts.io"&gt;trending-products.io&lt;/a&gt;, we built a content-based recommender which predicts, for a given trending product, which other trending products would also interest you. The key part was classifying the trending products into categories according to the Google taxonomy. We used a &lt;a href="https://www.productcategorization.com"&gt;product categorization API&lt;/a&gt; for this purpose, as manually classifying the more than 0.5 million trending products covered would take far too much time. &lt;/p&gt;

&lt;p&gt;These are only a few content-based recommender use cases; there are many others out there. What they share is the vectorization of features, followed by a nearest-neighbour search over those vectors, commonly done with a machine learning library. When you are dealing with millions of vectors with 100+ dimensions, we can recommend Spotify's annoy library for this purpose: &lt;a href="https://github.com/spotify/annoy"&gt;https://github.com/spotify/annoy&lt;/a&gt;. &lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>ai</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
