Movie Recommender - Week 2


    [Image: Memento]

    By AI Club on 2/17/2025

    Week 2

    Welcome to Week 2 of the AI Club movie recommender project. Last week, we formalized what we are building and set up our environment. This week, we will look at the data we downloaded and transform it for the ML models we will build in the following weeks.

    Data

    Data is the single most important thing for ML. Last week, we downloaded a dataset of movies with their features. In a real ML project, though, gathering data takes much longer, because you are often the one collecting it; here, we merely grabbed it from another source. Now that we have it downloaded, take a look at it. The data is pretty readable for humans. You'll see that it has basic information about each movie, like the year, release date, etc. Intuitively, you can see how this data would be useful for movie recommendation. If someone likes movies from the 1960s, it would be a fair bet to recommend them another movie from that era. But what about data like the plot summary? How can you compare two paragraph-long plot summaries? Think about this for yourself. Would you compare each word and count how many the two summaries have in common? Maybe just look at proper nouns? This would be pretty time-consuming, even for a computer. Keep these questions in mind while we go through and try to transform each feature we have.
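    To make the "count the words in common" idea concrete, here is a tiny sketch of that naive approach. The function name and the two toy summaries are made up for illustration; note how crude it is, since related words like "steals" and "steal" don't match at all:

```python
# Naive plot-summary comparison: count the distinct words two summaries share.
# This is the brute-force idea from the questions above -- it runs fine,
# but it knows nothing about word meaning.

def shared_words(summary_a: str, summary_b: str) -> int:
    """Count distinct words appearing in both summaries (case-insensitive)."""
    words_a = set(summary_a.lower().split())
    words_b = set(summary_b.lower().split())
    return len(words_a & words_b)

plot_1 = "A thief who steals corporate secrets through dream-sharing technology"
plot_2 = "A thief plans one last heist to steal corporate secrets"

# Only "a", "thief", "corporate", and "secrets" match exactly.
print(shared_words(plot_1, plot_2))  # -> 4
```

    Limitations like this are exactly why we will move to smarter text representations later on.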

    The intuitive features

    As mentioned earlier, some features are pretty intuitive. Years, for example, are simple because they are already numbers and have an obvious sense of proximity. If two movies share a year, it's reasonable to expect some correlation between them, and even close years, like 1999 and 2000, clearly suggest similarity. Features like this, along with rating and popularity, are fairly straightforward, and for the most part we can leave them as they are. However, there is one thing we need to do: normalization. This is especially important because, for example, if a movie and its remake are from 1960 and 2022, they'd seem drastically different, even though most of their features, like the description, genre, etc., should be nearly identical. Someone who enjoyed the original might want to watch the remake, so we don't want the large gap in years to skew the comparison. Without normalization, features like the release year could overshadow everything else, leading to unintended biases. Basically, instead of having a 62-year gap, all the years get squished into a common range so they are not as far apart. While some bias can be useful, like giving extra weight to two movies with the same name, it's a good idea to normalize all the features at the start to reduce unintended distortions. It is also worth noting that different models deal with this differently: some are more prone to bias than others, making normalization more important for them.
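    One common way to do this is min-max normalization, which squishes a feature into the [0, 1] range. A minimal sketch, using made-up year values (including the 1960/2022 remake gap from above):

```python
# Min-max normalization: rescale values to [0, 1] so that no single
# feature (like a 62-year gap in release years) dwarfs the others.

def min_max_normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

years = [1960, 1999, 2000, 2022]
normalized = min_max_normalize(years)
print(normalized)
# 1960 maps to 0.0 and 2022 maps to 1.0; 1999 and 2000 land
# close together in between.
```

    Libraries like scikit-learn provide the same idea as `MinMaxScaler`, which is what we would likely reach for in the actual project.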

    Other features

    In addition to the numerical features we discussed earlier, there are also text-based features, such as the title, director, and summaries. For these, we need to convert the text into numbers in a way that captures their meaning. There are various techniques for this, ranging from simple approaches like Bag of Words to more advanced methods like BERT. Essentially, all of these techniques transform text into numerical representations that capture some aspect of its meaning. Bag of Words is fairly basic: it just counts how often each word appears, so it captures almost nothing about the relationships between words. On the other hand, pre-trained models like BERT are designed to place words in a vector space where related words are grouped together, so "king" and "queen" end up near each other. A representation this strong even allows for vector arithmetic: the classic example is that "king" - "man" + "woman" lands near "queen", which is exactly the kind of semantic understanding we want for our project. Next week, we'll use one of these advanced models to accomplish this. In the meantime, take a look at the other methods in this link. While we're using state-of-the-art models right now, it's valuable to understand earlier techniques and the progression that led to where we are today.

    To do

    For this week, please look at all the features in the CSV files. Just note which ones need modification and think about how you would go about doing that. Next week, we will go through each one step by step. After that, we will be able to start transforming the data into something we can use, and create our first model!
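    If you want a quick way to inspect the features, Python's built-in `csv` module works fine. The sketch below uses a tiny in-memory stand-in with made-up columns; on your own machine, replace it with `open(...)` on the CSV file you downloaded:

```python
# Peek at the first row of a CSV to see what features it holds.
# The StringIO sample here is a placeholder -- swap in
# open("your_file.csv", newline="", encoding="utf-8") for the real data.

import csv
import io

sample = io.StringIO(
    "title,year,rating,summary\n"
    "Memento,2000,8.4,A man with short-term memory loss hunts a killer\n"
)

reader = csv.DictReader(sample)
row = next(reader)
for feature, value in row.items():
    print(f"{feature}: {value!r}")
```

    Notice that every value comes back as a string, even the year and rating; converting and normalizing those is exactly the transformation work we'll do next week.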

    Extra

    We are doing all of our work in Google Colab because of its simple UI and ease of sharing. Along the way, we will have example notebooks for you to copy and run yourself. In the meantime, if you are unfamiliar with Google Colab, please check it out. Not only is it a good tool, the service also gives you access to powerful GPUs that you may need in projects you do in the future.
