Vector Embeddings in Action: Visualizing IMDB's Top 250 Movies
A Python guide to scrape IMDB and explore semantic connections between movies with vector embeddings
July 15, 2024
In our last post, we explored Retrieval Augmented Generation by building a RAG app from scratch. One of the most important concepts in that post was Vector Embeddings, a very powerful tool that lets us represent information as vectors in a high-dimensional space, encoding its semantics. This was very useful for semantic search, where we used vector embeddings to find the pieces of information closest in meaning to a specific query.
Given that vector embeddings are one of the most powerful and fascinating concepts in AI, I thought it would be interesting to build a simple project to gain a better intuition about them by actually visualizing the embeddings and their relationships. And to make the project more fun, we are going to be exploring connections between popular movies.
So here is the plan:
- First of all, we'll scrape IMDB's top 250 movies, extracting the titles and plots.
- Then we'll convert the plots into vector embeddings using OpenAI's Embeddings, capturing their meanings and relationships.
- And finally we'll transform these high-dimensional embeddings into 2D and 3D representations that we can plot, visualizing the semantic relationships among the movies.
And you can find all the code in the following GitHub repository.
#Scraping
To scrape IMDB we are going to use the amazing Beautiful Soup Python library, which makes parsing and extracting data from HTML pages a breeze. We are also using the Requests library to make the necessary HTTP requests. Our scraping work will be done in two parts:
- First, we'll get the list of 250 movie titles and their URLs from IMDB's Top 250 movies chart.
- Then, we'll extract a short and a long plot from each of those movies. This is the actual text that will be converted to vector embeddings later.
Let's begin by extracting the top 250 movies from the URL https://www.imdb.com/chart/top/. The simplest way I found to scrape the necessary data is to read a `<script>` tag that contains the data for all top 250 movies in JSON-LD format:
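Abridged, the JSON-LD data looks something like this (an illustration showing only the fields we will actually use):

```json
{
  "@type": "ItemList",
  "itemListElement": [
    {
      "item": {
        "name": "The Shawshank Redemption",
        "url": "https://www.imdb.com/title/tt0111161/"
      }
    },
    ...
  ]
}
```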
This is simpler than parsing the actual HTML elements that display the movies on the webpage, because those are not all rendered at once: only the first 25 are shown, and the rest are loaded dynamically with JavaScript. Scraping dynamically loaded content would require a more advanced tool that can drive a real browser, such as Selenium or Playwright. The JSON-LD script, however, contains the data we need for all 250 movies and is straightforward to extract.
Let's take a look at the code:
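Here's a minimal sketch of what this step can look like (the function name, headers and other details are illustrative, not necessarily the repo's exact code):

```python
import json

import requests
from bs4 import BeautifulSoup

TOP_250_URL = "https://www.imdb.com/chart/top/"

# Browser-like headers to avoid being blocked as an obvious bot.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}


def get_top_250_movies() -> list[dict]:
    """Return the title and URL of every movie in IMDB's Top 250 chart."""
    response = requests.get(TOP_250_URL, headers=HEADERS)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # The JSON-LD <script> tag contains the data for all 250 movies.
    script = soup.find("script", type="application/ld+json")
    data = json.loads(script.string)

    movies = []
    for element in data["itemListElement"]:
        item = element["item"]
        # Non-English movies store their English title in 'alternateName'.
        title = item.get("alternateName") or item["name"]
        movies.append({"title": title, "url": item["url"]})
    return movies
```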
As you can see above, we first make a request to the top 250 movies URL using specific headers to emulate a regular web browser request and avoid web scraping blocking. We then simply find the `application/ld+json` script and load the JSON-LD data. The title for each movie in the list is contained under `['item']['name']` or `['item']['alternateName']` (for the English title of non-English movies), and the URL is contained under `['item']['url']`.
For each of the 250 movies, we then scrape the movie's plot subpage with the function `get_movie_plot` and extract the short and long plots. And, finally, we build a Pandas DataFrame with the title, short plot and long plot for each movie. We use a DataFrame because it makes it easy to operate on the data and save it to a Comma-Separated Values (CSV) file.
This is the code for the `get_movie_plot` function that scrapes the plots for each movie:
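Something along these lines; note in particular that the CSS class in the selector below is illustrative and depends on IMDB's markup at the time of writing, so it may need adjusting:

```python
def get_movie_plot(movie_url: str) -> tuple[str, str]:
    """Scrape the short and long plots from a movie's plot subpage."""
    plot_url = movie_url.rstrip("/") + "/plotsummary/"
    response = requests.get(plot_url, headers=HEADERS)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # The first summary on the page is the short plot; the second is the long one.
    summaries = soup.select("div.ipc-html-content-inner-div")
    short_plot, long_plot = summaries[0], summaries[1]

    # Remove the trailing author credit <span> from the long plot so it
    # doesn't affect the semantics of the text when we embed it.
    author_span = long_plot.find("span")
    if author_span is not None:
        author_span.decompose()

    return short_plot.get_text(strip=True), long_plot.get_text(strip=True)
```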
To understand what it's doing, let's look at the plot subpage for The Godfather. The page lists several plot summaries; we are extracting the first one (a short plot) and the second one (a longer plot). These are the same summaries that appear on the main movie page, but they are easier to scrape from the plot subpage. Notice also how we remove the unwanted author `<span>` from the second plot so that it doesn't affect the semantics of the plot when we embed it.
Finally, we can run our scraping function and save the returned Pandas DataFrame as a CSV file for later use:
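For example (the file name here is just an illustration):

```python
import pandas as pd

# Scrape the chart, then the two plots for each movie.
movies = get_top_250_movies()
for movie in movies:
    movie["short_plot"], movie["long_plot"] = get_movie_plot(movie["url"])

# Store everything in a DataFrame and persist it as CSV.
df = pd.DataFrame(movies)
df.to_csv("movies.csv", index=False)
```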
#Embedding
As mentioned earlier, we are going to use OpenAI's Embeddings to convert the movie plots into vector embeddings. We are using their latest and most performant model, `text-embedding-3-large`, via the OpenAI API. To get started with their API you can follow this guide, and you can also check the Embeddings API reference for more detailed information.
The code to create the embeddings is really simple:
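A minimal sketch using the official `openai` Python package (the function name is my own, not necessarily the repo's):

```python
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable


def embed_movie_plots(df):
    """Embed the concatenated short and long plot of each movie."""
    texts = (df["short_plot"] + "\n\n" + df["long_plot"]).tolist()
    # The Embeddings API accepts a batch of inputs, so a single call
    # can embed all 250 plots at once.
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=texts,
    )
    df["embedding"] = [item.embedding for item in response.data]
    return df
```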
Notice how we are concatenating the short and long plots for each movie, embedding the combined text, and adding the resulting vectors to the movies DataFrame as a new column.
We can now run the function and save the result to another CSV file, which we'll use for analysis and plotting:
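For example (again, the file names are illustrative):

```python
df = pd.read_csv("movies.csv")
df = embed_movie_plots(df)
df.to_csv("movies_with_embeddings.csv", index=False)
```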
You can find both CSV files (with and without embeddings) in the data folder of the repo, so if you prefer, you can plot the movie embeddings without having to create an OpenAI account and generate them yourself.
#Plotting Vector Embeddings in 2D and 3D
Now that we have converted the movie plots to high-dimensional vector embeddings (specifically, vectors of 3072 dimensions, using OpenAI's `text-embedding-3-large` model), we face a challenge: how can we visualize data with 3072 dimensions?
The answer is using dimensionality reduction techniques. Dimensionality reduction allows us to transform high-dimensional data into a lower-dimensional space while preserving as much of the important information as possible. Our goal is to reduce those 3072 dimensions to 2 or 3 dimensions that we can easily plot in Python, while still keeping as many of the relationships between the data points as possible.
If you are wondering: yes, much of the information will be lost when moving from 3072 dimensions down to 2 or 3. However, with the use of powerful dimensionality reduction techniques, enough of the key information will be retained to visualize the connections between movies and gain a better intuition about vector embeddings.
The technique we are going to use is t-SNE (t-Distributed Stochastic Neighbor Embedding), which works really well for reducing high-dimensional data to 2 or 3 dimensions and is particularly good at preserving local relationships in the data. And thanks to the wonderful scikit-learn library, we can apply it with a single line of Python.
Let's take a look at the code to plot the movie embeddings in 2D:
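Here is a sketch that follows the same steps (illustrative, not the repo's verbatim code):

```python
from ast import literal_eval

import numpy as np
import pandas as pd
import plotly.express as px
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Load the DataFrame and convert the stringified embeddings column
# back into a (250, 3072) NumPy array.
df = pd.read_csv("movies_with_embeddings.csv")
embeddings = np.array(df["embedding"].apply(literal_eval).tolist())

# Reduce the 3072-dimensional embeddings to 2D with t-SNE,
# then normalize the resulting coordinates.
reduced = TSNE(n_components=2, random_state=42).fit_transform(embeddings)
reduced = StandardScaler().fit_transform(reduced)

# Add the 2D coordinates to the DataFrame and create an
# interactive scatter plot labeled with the movie titles.
df["x"], df["y"] = reduced[:, 0], reduced[:, 1]
fig = px.scatter(df, x="x", y="y", text="title")
fig.update_traces(textposition="top center")
fig.show()
```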
Let's highlight the important parts of the code:
- The first thing we do is read the previously stored DataFrame, including the vector embeddings, from the CSV file. We are using the slightly awkward line `np.array(df['embedding'].apply(literal_eval).tolist())` to convert the embeddings column in the CSV file from text to a list of lists of floats and then to a NumPy array.
- We then use scikit-learn's TSNE class to apply t-SNE to the data and transform it to two dimensions, and we also use the StandardScaler class to normalize the results.
- Finally, we add the `x` and `y` coordinates from the reduced 2D vectors to the original DataFrame and use the Plotly library to create an interactive scatter plot. Each point in the scatter plot represents a movie, with the title as a label, and its position is determined by the coordinates of the reduced 2D vector embedding.
Let's now take a look at the results. You can click on the image to view a full size version:
The plot contains only the top 150 movies to make it more legible, and I've added hand-drawn clusters to highlight groups of films that are close together and semantically related. This visualization gives us a fascinating way to explore the "semantic space" of movie plots, and the clusters reveal very interesting insights about how our embedding model understands and relates different movies.
You can find clusters of similar genres, themes or styles. And sometimes the clusters reveal unexpected similarities between films. A few interesting examples:
- The Matrix, Inception and Metropolis, despite being very different films, share interesting similarities and appear close together.
- Casablanca is not clearly contained by any of the clusters but seems to be positioned somewhere in between the Noir Thrillers and the WWII / Nazi Germany films.
- Pan's Labyrinth appears to be thematically close to the Studio Ghibli and Japanese animated films.
The 3D plotting code is almost identical to the 2D version, so I won't repeat it all here, but you can check it in detail in this file.
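The only substantive differences are along these lines (a sketch continuing from the 2D code above):

```python
# Ask t-SNE for three components instead of two...
reduced = TSNE(n_components=3, random_state=42).fit_transform(embeddings)
reduced = StandardScaler().fit_transform(reduced)

# ...and plot with a z coordinate using Plotly's 3D scatter.
df["x"], df["y"], df["z"] = reduced[:, 0], reduced[:, 1], reduced[:, 2]
fig = px.scatter_3d(df, x="x", y="y", z="z", text="title")
fig.show()
```

I recommend running the code yourself to fully explore and interact with the 3D plot in detail. Let's see how it looks: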
The transition from the 2D to the 3D visualization adds a new layer of depth to the movies' representation and relationships. With this additional dimension, we can observe more nuanced clusters and complex interconnections that were not apparent in the 2D plot.
While the 3D plot is more informative than the 2D version, it's still a massive simplification of the original 3072-dimensional vector embeddings space. You can intuitively imagine the incredible level of detail and nuance that can be captured in 3072 dimensions. It's like having 3072 scales or spectrums to describe each movie, each of those scales representing a different feature. This high-dimensionality is what gives vector embeddings their power, allowing them to represent complex concepts and relationships.
By exploring these visualizations, it's a lot easier to understand how vector embeddings work. In this application, we used them to cluster and explore relationships between movies. But they can similarly be used with other types of content and formats, like novels, scientific papers, emails, and even images and audio. And they can be used for applications like semantic search (as we saw in the previous post) or content recommendation systems.
Vector embeddings are an incredible tool to understand how machines can represent concepts and meaning, and they open up new possibilities for AI applications that can understand and work with human-generated content.