Collaborative Filtering and some history on The Netflix Prize
In 2006, Netflix began an open competition to improve its recommendation algorithm for movies. The prize was for a grand total of $1,000,000! Seems like a lot for a humble data scientist like me but probably not a lot for Netflix, as they knew better recommendations led to more movie rentals and therefore more $$$. In order to win, the winning team had to improve Netflix’s recommendation system by at least 10%. To help, Netflix published all of its data containing all ratings at the time from 480,189 users and 17,770 movies.
This single competition basically started a revolution in the engineering and computer science world. Companies and people realized that recommendation systems could be used for a lot more stuff than just movies (such as clothing, music, and basically everything that Amazon sells now). Every data engineer at the time was willing to give it a shot. This led to numerous discoveries and breakthroughs in the field and became an event that is basically mentioned in every single data science course.
Almost three years later, in 2009, BellKor’s Pragmatic Chaos team won the grand prize by predicting ratings with an improvement of 10.06%. They used an ensemble algorithm (aka combining a bunch of different models and making them vote to come up with a final prediction). Ironically, another team had matched their results but because Bellkor submitted their results just 20 minutes earlier they became the grand winners. Really Netflix? Why not give some kind of second prize?
Moving on from this history piece, the recommendation system used in this type of prediction is called collaborative filtering. And in very simple terms collaborative filtering consists of using other people’s ratings to help rank and recommend things to other people. It was first invented to sort emails and show the most important emails first. Basically, it brings order to this chaotic world of online information.
Below I will show how to easily use Netflix’s Prize data and apply a collaborative filtering model. Due to the mere size of this dataset, we will use Apache Spark in Python.
As you can see, nowadays, participating in the Netflix competition would have been much easier! And that’s what I love about data science, is an ever-evolving field, and every day it becomes more accessible for those curious enough. My very simple algorithm had a root-mean-square error of 0.96, BellKor’s had an error of 0.86… have some catching up to do but not bad for a first try!