Title
Parallel Recommendation Engine
Team members:
 Kai Kang (kaik1)
 Jianfei Liu (jianfeil)
Summary
We are going to implement a parallel collaborative filtering recommendation engine on a single machine, exploring optimization algorithms that address memory access problems as well as model parallelism across multiple GPUs.
Background and Challenge
Recommendation systems are crucial to tech companies such as Facebook, Amazon, and Netflix because they help users quickly find what they need and help companies sell their products (news for Facebook, retail goods for Amazon, and movies for Netflix). In fact, Netflix hosted the Netflix Prize contest, awarding 1 million dollars to the winning recommendation algorithm. To evaluate a recommendation system, we look at two factors: speed and accuracy. Speed is as important as accuracy because a huge amount of new training data arrives every second, and it is important to put the new data to use as soon as possible.
During my last internship at Bloomberg, I implemented a recommendation engine for Bloomberg research reports. However, I encountered some difficulties in the process:

distributed vs. single machine: I experimented with distributed platforms such as Hadoop and Spark. On the one hand, the dataset is not large enough to amortize the overhead of a distributed system; on the other hand, single-machine implementations suffer from memory and speed problems.

data parallelism: The collaborative filtering algorithm has almost no dependencies and is easy to parallelize naively. The biggest challenge, however, is that the memory access pattern is very random, so the naive implementation has no locality. What's more, since the dataset is very sparse, the most naive implementation might not even be runnable on a single machine (this actually happened to me). We want to address this problem by developing compressed data structures and algorithms.

model parallelism: Since the problem has few dependencies, model parallelism will give us a large speedup if implemented correctly. However, distributing across multiple machines is complicated in practice, especially in the development stage, when all effort should be focused on improving the algorithm. We address this problem by developing multi-GPU solutions.

approximation algorithm: Furthermore, it seems unnecessary to compute all pairwise user similarities; in practice, an approximate solution will often suffice. We will develop parallel approximation algorithms to preprocess the raw dataset.
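To make the memory concern concrete, here is a minimal sketch of the kind of compressed data structure we have in mind: a CSR (compressed sparse row) layout for the sparse user-item rating matrix. The struct and function names are illustrative, not part of our final design.

```cpp
#include <cassert>
#include <tuple>
#include <vector>

// Compressed Sparse Row (CSR) layout for a sparse user-item rating matrix.
// Only nonzero ratings are stored, so a matrix that would blow up memory in
// dense form stays resident, and each user's ratings are contiguous (locality).
struct CSRMatrix {
    std::vector<int>   row_ptr;  // ratings of user u live in [row_ptr[u], row_ptr[u+1])
    std::vector<int>   col_idx;  // item id of each stored rating
    std::vector<float> values;   // the rating itself
};

// Build a CSR matrix from (user, item, rating) triples already sorted by user.
CSRMatrix build_csr(int num_users,
                    const std::vector<std::tuple<int, int, float>>& triples) {
    CSRMatrix m;
    m.row_ptr.assign(num_users + 1, 0);
    // Count ratings per user, then prefix-sum to get row offsets.
    for (const auto& [u, i, r] : triples) m.row_ptr[u + 1]++;
    for (int u = 0; u < num_users; ++u) m.row_ptr[u + 1] += m.row_ptr[u];
    // Because the triples are sorted by user, appending preserves row order.
    for (const auto& [u, i, r] : triples) {
        m.col_idx.push_back(i);
        m.values.push_back(r);
    }
    return m;
}
```

Iterating one user's ratings then touches a contiguous slice of `col_idx`/`values`, which is exactly the locality the naive hash-map-of-hash-maps representation lacks.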
We will be looking at the collaborative filtering algorithm in depth:
Collaborative Filtering
recommend(u) {
    likelihood = {}
    V = similar_users(u)
    for v in V:
        for item in (items v purchased but u hasn't):
            likelihood[item] += similarity(u, v)
    return items ranked by likelihood
}
In collaborative filtering, finding the users similar to a given user can be reduced to a clustering problem, where we cluster users by their similarities. The remaining double for loop is also computation-intensive and takes effort to parallelize efficiently.
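As a sketch of how that double loop might be parallelized with OpenMP, the version below scores unseen items across candidate users, with each thread accumulating into a private map that is merged at the end (avoiding atomics on the randomly accessed likelihood table). The Jaccard-style `similarity` and all names here are illustrative assumptions, not our final algorithm.

```cpp
#include <cassert>
#include <cmath>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical per-user data: the set of items each user purchased.
using ItemSet = std::unordered_set<int>;

// Illustrative set similarity (Jaccard) standing in for a real similarity metric.
double similarity(const ItemSet& a, const ItemSet& b) {
    int inter = 0;
    for (int item : a) inter += (int)b.count(item);
    int uni = (int)a.size() + (int)b.size() - inter;
    return uni == 0 ? 0.0 : (double)inter / uni;
}

// Score items user u hasn't purchased, from a candidate set of similar users.
// Each OpenMP thread accumulates into a private map; maps are merged under a
// critical section, so the hot inner loop is contention-free.
std::unordered_map<int, double> recommend(int u,
                                          const std::vector<int>& similar,
                                          const std::vector<ItemSet>& purchases) {
    std::unordered_map<int, double> likelihood;
    #pragma omp parallel
    {
        std::unordered_map<int, double> local;
        #pragma omp for nowait
        for (int k = 0; k < (int)similar.size(); ++k) {
            int v = similar[k];
            double s = similarity(purchases[u], purchases[v]);
            for (int item : purchases[v])
                if (!purchases[u].count(item)) local[item] += s;
        }
        #pragma omp critical
        for (const auto& [item, score] : local) likelihood[item] += score;
    }
    return likelihood;
}
```

The pragmas degrade gracefully: compiled without `-fopenmp`, the same code runs serially.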
In short, this is our roadmap:
The idea of matrix factorization is that we have a rating matrix R, where each row represents a user, each column represents an item, and each entry represents the rating of that user for that item. What we will be implementing is a matrix factorization algorithm that factors R into a user matrix U and an item matrix V, where U and V essentially represent the inherent attributes of each user and item. This algorithm by [Simon Funk](http://sifter.org/~simon/journal/20061211.html) rose to fame during the Netflix Prize. Here is an example to illustrate how it works. Imagine our R is:
Using an online calculator, we get U to be:
We get V to be:
Matrix factorization automatically extracts the features of each movie and each person's feature preferences. In this example, the two features are apparently romance and adventure: romantic movies have a large value for the romance feature and a small value for the adventure feature, Alice has a large value for romance, and Bob has a large value for adventure.
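A minimal serial sketch of the SGD update at the heart of Funk-style factorization: learn rank-k factors U (users) and V (items) so that dot(U[u], V[i]) approximates each observed rating. The rank, learning rate, regularization, and epoch count are illustrative placeholders, not tuned values.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Rating { int user, item; float value; };

// SGD in the spirit of Simon Funk's approach: for each observed rating,
// nudge the user and item factor vectors along the prediction-error gradient,
// with L2 regularization to keep the factors small.
void funk_sgd(std::vector<std::vector<float>>& U,
              std::vector<std::vector<float>>& V,
              const std::vector<Rating>& ratings,
              int epochs = 300, float lr = 0.01f, float reg = 0.02f) {
    const int k = (int)U[0].size();  // factorization rank
    for (int e = 0; e < epochs; ++e) {
        for (const Rating& r : ratings) {
            float pred = 0.0f;
            for (int f = 0; f < k; ++f) pred += U[r.user][f] * V[r.item][f];
            float err = r.value - pred;
            for (int f = 0; f < k; ++f) {
                float uf = U[r.user][f];
                U[r.user][f] += lr * (err * V[r.item][f] - reg * uf);
                V[r.item][f] += lr * (err * uf - reg * V[r.item][f]);
            }
        }
    }
}
```

Parallelizing these updates is nontrivial because concurrent ratings may touch the same user or item row; that conflict structure is exactly what the parallel matrix factorization papers in our resources address.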
Resources
We will be referring to the following academic papers (the list will be updated during the project):
 Parallel Matrix Factorization for Recommender Systems
 Large-scale Parallel Collaborative Filtering for the Netflix Prize
 A Parallel Clustering Algorithm with MPI - MKmeans
We will be using the following datasets for testing purposes:
Goals and Deliverables
 A frontend web app that allows users to upload their training and test data, and visualizes the correctness of each algorithm (and possibly of each implementation, e.g. GPU vs. OpenMP)
 A backend service API which implements the different algorithms (collaborative filtering, clustering)
Platform Choice
We will be using C++ as our main programming language because we want to use CUDA and OpenMP to parallelize our program, and C++ strikes a good balance between performance and code readability. Our code will run on multi-core CPUs and NVIDIA GPUs, and we will try to keep it compatible with all operating systems.
Schedule
 Friday, April 8: Have a RESTful C++ server ready, design the backend API, discuss which versions of the algorithms to implement
 Friday, April 15: Study, implement, and experiment with clustering algorithms
 Friday, April 22: Use clustering algorithms to implement collaborative filtering, benchmark different implementations
 Friday, April 29: Implement Simon Funk’s matrix factorization algorithm and connect with APIs
 Friday, May 6: Wrap up the project with output visualization, final writeup, and reflection