
Vector Embeddings in Action: Visualizing IMDB's Top 250 Movies

A Python guide to scrape IMDB and explore semantic connections between movies with vector embeddings

July 15, 2024

In our last post, we explored Retrieval Augmented Generation by building a RAG app from scratch. One of the most important concepts in that post was Vector Embeddings, a very powerful tool that allows us to represent information as vectors in a high-dimensional space that encode its semantics. This was very useful for semantic search, where we used vector embeddings to find the pieces of information closest in meaning to a specific query.

Given that vector embeddings are one of the most powerful and fascinating concepts in AI, I thought it would be interesting to build a simple project to gain a better intuition about them by actually visualizing the embeddings and their relationships. And to make the project more fun, we are going to be exploring connections between popular movies.

So here is the plan:

  1. Scrape IMDB's Top 250 movies chart to get each movie's title and plot.
  2. Convert the movie plots into vector embeddings using OpenAI's embeddings API.
  3. Reduce the embeddings to 2 and 3 dimensions and plot them to explore the semantic connections between movies.

And you can find all the code in the following GitHub repository.

#Scraping

To scrape IMDB we are going to use the amazing Beautiful Soup Python library, which makes parsing and extracting data from HTML pages a breeze. We are also using the Requests library to make the necessary HTTP requests. Our scraping work will be done in two parts:

  1. First, we'll get the list of 250 movie titles and their URLs from IMDB's Top 250 movies chart.
  2. Then, we'll extract a short and a long plot from each of those movies. This is the actual text that will be converted to vector embeddings later.

Let's begin by extracting the top 250 movies from the URL https://www.imdb.com/chart/top/. The simplest way I found to scrape the necessary data is to read a <script> tag that contains the top 250 movies data in JSON-LD format:

Scraping IMDB's top 250 movies

This is simpler than parsing the actual HTML elements that display the movies on the webpage, because these are not all displayed at once: only the first 25 are shown, and the rest are loaded dynamically using JavaScript. To scrape dynamically loaded content, we would need a more advanced scraping library. The JSON-LD script, however, contains the data we need for all 250 movies and is straightforward to extract.
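For reference, the JSON-LD data follows schema.org's ItemList format. Abbreviated and purely illustrative (not IMDB's verbatim output), each entry looks roughly like this; non-English movies also carry an alternateName field with the English title:

{
  "@type": "ItemList",
  "itemListElement": [
    {
      "@type": "ListItem",
      "item": {
        "@type": "Movie",
        "name": "The Godfather",
        "url": "https://www.imdb.com/title/tt0068646/"
      }
    },
    ...
  ]
}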

Let's take a look at the code:

import html
import json
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

# Headers to emulate a regular web browser request
headers = {
    'Accept': '*/*',
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1'
}

def scrape_imdb_top_250():
    url = 'https://www.imdb.com/chart/top/'
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    movies = []
    # The JSON-LD script contains the data for all 250 movies
    data = json.loads(soup.find('script', {'type': 'application/ld+json'}).text)
    for item in tqdm(data['itemListElement']):
        # Need to unescape HTML characters to extract the title properly
        title = html.unescape(item['item'].get('alternateName', item['item']['name']))
        url = item['item']['url']
        short_plot, long_plot = get_movie_plot(url + 'plotsummary/')
        movies.append({'title': title, 'short_plot': short_plot, 'long_plot': long_plot})
        time.sleep(1)  # Pause between requests to go easy on IMDB's servers
    return pd.DataFrame(movies)

As you can see above, we first make a request to the top 250 movies URL using specific headers to emulate a regular web browser request and avoid being blocked for scraping. We then simply find the application/ld+json script and load the JSON-LD data. The title for each movie in the list is contained under ['item']['name'] or ['item']['alternateName'] (for the English title of non-English movies). The URL is contained under ['item']['url'].

For each of the 250 movies, we then scrape the movie plot subpage with the function get_movie_plot and extract the short and long plot. And, finally, we build a Pandas DataFrame with the title, short plot and long plot for each movie. We use a DataFrame because it makes it easy to work with the data and save it to a Comma-Separated Values (CSV) file.

This is the code for the get_movie_plot function that scrapes the plots for each movie:

def get_movie_plot(url):
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    summaries = soup.select('li.ipc-metadata-list__item div.ipc-html-content-inner-div div.ipc-html-content.ipc-html-content--base div.ipc-html-content-inner-div')
    # The first summary is the short plot
    plot = summaries[0].get_text().strip()
    # Remove the storyline author <span> element from the longer plot
    storyline_author = summaries[1].find('span')
    if storyline_author:
        storyline_author.decompose()
    storyline = summaries[1].get_text().strip()
    return plot, storyline

To understand what it's doing, let's look at the plot subpage for The Godfather. The page lists several plot summaries; we are extracting the first one (a short plot) and the second one (a longer plot). These are the same ones that appear on the main movie page, but they are easier to scrape from the plot subpage. Notice also how we remove the unwanted author <span> in the second plot so that it doesn't affect the semantics of the plot when we embed it.

Plot subpage for The Godfather

Finally, we can run our scraping function and save the returned Pandas DataFrame as a CSV file for later use:

movies_df = scrape_imdb_top_250()
movies_df.to_csv('data/movies.csv', index=False)

#Embedding

As mentioned earlier, we are going to use OpenAI's Embeddings to convert the movie plots into vector embeddings. We are using their latest and most performant model text-embedding-3-large via the OpenAI API. To get started with their API you can follow this guide and you can also check the Embeddings API reference for more detailed information.
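The code below assumes an openai_client has already been created. With the official openai Python package (version 1 or later), that setup looks like this; by default, the client reads your API key from the OPENAI_API_KEY environment variable:

from openai import OpenAI

# Reads the API key from the OPENAI_API_KEY environment variable by default
openai_client = OpenAI()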

The code to create the embeddings is really simple:

def embed_movies(movies_df, model='text-embedding-3-large'):
    # Concatenate the short and long plots into a single text per movie
    movie_plots = (movies_df['short_plot'] + '\n' + movies_df['long_plot']).to_list()
    embed_res = openai_client.embeddings.create(input=movie_plots, model=model)
    movies_df['embedding'] = [d.embedding for d in embed_res.data]
    return movies_df

Notice how we are concatenating the short and long plots for each movie, embedding the combined text, and then adding the resulting vectors to the movies DataFrame as a new column.

We can now run the function and save the result to another CSV file, which we'll use for analysis and plotting:

movies_df = embed_movies(movies_df)
movies_df.to_csv('data/movies_embeddings.csv', index=False)

You can find both CSV files (with and without embeddings) in the data folder of the repo, so you can easily plot the movie embeddings without having to create an OpenAI account and generate the embeddings if you don't want to.
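Before moving on to plotting, here is a quick, optional sanity check (an illustrative sketch, not part of the repo code): if the embeddings capture meaning, the nearest vectors to a given movie should be semantically related films. We can verify this with cosine similarity:

import numpy as np
import pandas as pd
from ast import literal_eval

def most_similar(title, file='data/movies_embeddings.csv', top_n=5):
    df = pd.read_csv(file)
    # The embedding column is stored as a string in the CSV, so parse it back into lists
    embeddings = np.array(df['embedding'].apply(literal_eval).tolist())
    # Normalize rows so that dot products become cosine similarities
    normalized = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    idx = df.index[df['title'] == title][0]
    sims = normalized @ normalized[idx]
    # Rank by similarity, excluding the movie itself
    ranking = sims.argsort()[::-1]
    return [df['title'][i] for i in ranking if i != idx][:top_n]

# For example: most_similar('The Godfather')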

#Plotting Vector Embeddings in 2D and 3D

Now that we have converted the movie plots to high-dimensional vector embeddings (specifically, vectors of 3072 dimensions using OpenAI's text-embedding-3-large model), we face a challenge: How can we visualize data with 3072 dimensions?

The answer is using dimensionality reduction techniques. Dimensionality reduction allows us to transform high-dimensional data into a lower-dimensional space while preserving as much of the important information as possible. Our goal is to reduce those 3072 dimensions to 2 or 3 dimensions that we can easily plot in Python, while still keeping as many of the relationships between the data points as possible.

If you are wondering: yes, much of the information will be lost when moving from 3072 down to 2 or 3 dimensions. However, with the use of powerful dimensionality reduction techniques, enough of the key information will be retained to visualize the connections between movies and gain a better intuition about vector embeddings.

The technique we are going to use is t-SNE (t-Distributed Stochastic Neighbor Embedding), which works really well for reducing high-dimensional data to 2 or 3 dimensions and is particularly good at preserving local relationships in the data. And thanks to the wonderful scikit-learn library, we can apply it with a single line of Python.

Let's take a look at the code to plot the movie embeddings in 2D:

import pandas as pd
import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
import plotly.express as px
from ast import literal_eval

def plot_movies_2d(file='data/movies_embeddings.csv'):
    df = pd.read_csv(file)
    # Parse the embedding column back from its string representation in the CSV
    embeddings = np.array(df['embedding'].apply(literal_eval).tolist())

    # Reduce the 3072-dimensional embeddings to 2 dimensions and normalize the result
    tsne = TSNE(n_components=2)
    scaler = StandardScaler()
    embeddings_2d = tsne.fit_transform(embeddings)
    embeddings_2d = scaler.fit_transform(embeddings_2d)

    df['x'] = embeddings_2d[:, 0]
    df['y'] = embeddings_2d[:, 1]

    fig = px.scatter(
        df, x='x', y='y', text='title',
        hover_data={'title': True, 'x': False, 'y': False},
        title='Movie Similarity based on Plot Embeddings'
    )
    fig.update_traces(textposition='top center', textfont=dict(size=6))
    fig.show()

Let's highlight the important parts of the code:

  1. We read the embeddings back from the CSV file and use ast.literal_eval to parse each embedding from its string representation into a list of floats, stacking them into a NumPy array.
  2. TSNE(n_components=2) reduces the 3072-dimensional embeddings to 2 dimensions, and StandardScaler then normalizes the resulting coordinates.
  3. Plotly Express creates an interactive scatter plot, with each movie's title displayed next to its point and on hover.

Let's now take a look at the results. You can click on the image to view a full size version:

2D plot of movie embeddings showing clusters of related movies

The plot contains only the top 150 movies to make it more legible, and I've added hand-drawn clusters to highlight groups of films that are close together and semantically related. This visualization gives us a fascinating way to explore the "semantic space" of movie plots, and the clusters reveal very interesting insights about how our embedding model understands and relates different movies.

You can find clusters of similar genres, themes or styles. And sometimes the clusters reveal unexpected similarities between films. A few interesting examples:

The 3D plotting code is almost identical to the 2D version, so I won't repeat it here, but you can check it in detail in this file. I recommend running the code yourself to fully explore and interact with the 3D plot.
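As a rough sketch of the differences (illustrative and abbreviated; the exact code is in the repo file linked above), the 3D version runs t-SNE with three components and plots with Plotly Express's scatter_3d:

# Inside a plot_movies_3d function, analogous to plot_movies_2d above
tsne = TSNE(n_components=3)
scaler = StandardScaler()
embeddings_3d = scaler.fit_transform(tsne.fit_transform(embeddings))

df['x'] = embeddings_3d[:, 0]
df['y'] = embeddings_3d[:, 1]
df['z'] = embeddings_3d[:, 2]

fig = px.scatter_3d(
    df, x='x', y='y', z='z', text='title',
    hover_data={'title': True, 'x': False, 'y': False, 'z': False},
    title='Movie Similarity based on Plot Embeddings'
)

Let's see what it looks like: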

3D plot of movie embeddings

The transition from the 2D to the 3D visualization adds a new layer of depth to the movies' representation and relationships. With this additional dimension, we can observe more nuanced clusters and complex interconnections that were not apparent in the 2D plot.

While the 3D plot is more informative than the 2D version, it's still a massive simplification of the original 3072-dimensional vector embeddings space. You can intuitively imagine the incredible level of detail and nuance that can be captured in 3072 dimensions. It's like having 3072 scales or spectrums to describe each movie, each of those scales representing a different feature. This high-dimensionality is what gives vector embeddings their power, allowing them to represent complex concepts and relationships.

By exploring these visualizations, it's a lot easier to understand how vector embeddings work. In this application, we used them to cluster and explore relationships between movies. But they can similarly be used with other types of content and formats, like novels, scientific papers, emails, and even images and audio. And they can be used for applications like semantic search (as we saw in the previous post) or content recommendation systems.
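For instance, a minimal semantic search over our movie embeddings (an illustrative sketch, assuming the openai_client from earlier and the imports from the previous snippets) just embeds a free-text query with the same model and ranks movies by cosine similarity:

def search_movies(query, file='data/movies_embeddings.csv', top_n=5, model='text-embedding-3-large'):
    df = pd.read_csv(file)
    embeddings = np.array(df['embedding'].apply(literal_eval).tolist())
    # Embed the query with the same model used for the movie plots
    query_emb = np.array(openai_client.embeddings.create(input=[query], model=model).data[0].embedding)
    # Cosine similarity between the query and every movie plot
    sims = (embeddings @ query_emb) / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_emb))
    return df['title'].iloc[sims.argsort()[::-1][:top_n]].tolist()

# For example: search_movies('a mafia family drama about loyalty and power')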

Vector embeddings are an incredible tool to understand how machines can represent concepts and meaning, and they open up new possibilities for AI applications that can understand and work with human-generated content.