ParaRec Project Proposal


Parallel Recommendation Engine

Team members:

  • Kai Kang (kaik1)
  • Jianfei Liu (jianfeil)


We are going to implement a parallel collaborative filtering recommendation engine on a single machine, exploring optimization algorithms that address memory access problems, and model parallelism using multiple GPUs.

Background and Challenge

Recommendation systems are crucial to many tech companies such as Facebook, Amazon, and Netflix because they help users quickly find what they need and help companies sell their products (news for Facebook, retail goods for Amazon, and movies for Netflix). In fact, Netflix hosted the Netflix Prize contest, awarding one million dollars to the winning recommendation algorithm. To evaluate a recommendation system, we look at two factors: speed and accuracy. Speed matters as much as accuracy because a huge amount of new training data arrives every second, and it is important to use the new data as soon as we can.

In my last internship at Bloomberg, I implemented a recommendation engine for Bloomberg research reports. However, I encountered several difficulties during the process:

  • distributed vs. single machine: I experimented with distributed platforms such as Hadoop and Spark. On the one hand, the problem with distributed systems is that the dataset is not large enough to amortize the overhead of distribution. On the other hand, single-machine implementations suffer from memory and speed problems.

  • data parallelism: The collaborative filtering algorithm has almost no dependencies and is easy to parallelize naively. However, the biggest challenge is that the memory access pattern is very random, so there is no locality in the naive implementation. What's more, since the dataset is very sparse, the most naive implementation might not even be runnable on a single machine (this actually happened to me). We want to address this problem by developing compressed data structures and algorithms.

  • model parallelism: Since the problem has few dependencies, model parallelism will give us a huge speedup if implemented correctly. However, distributing to multiple machines is complicated in practice, especially in the development stage when all the effort should be focused on improving the algorithm. We address this problem by developing multi-GPU solutions.

  • approximation algorithm: Moreover, it seems unnecessary to compute all pairwise user similarities. In practice, an approximate solution will often suffice. We will develop parallel approximation algorithms to preprocess the raw dataset.
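To make the compressed-data-structure point concrete, here is a minimal sketch of the kind of layout we have in mind; the names and the CSR (compressed sparse row) choice are our own illustration, not a final design decision:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Minimal CSR (compressed sparse row) store for a sparse rating matrix.
// Only nonzero ratings are kept, so memory scales with the number of
// ratings rather than users * items.
struct CsrRatings {
    std::vector<std::size_t> row_ptr;  // row_ptr[u]..row_ptr[u+1] spans u's ratings
    std::vector<int> col;              // item id of each stored rating
    std::vector<float> val;            // the rating itself

    // Build from per-user (item, rating) lists, sorted by item id.
    explicit CsrRatings(const std::vector<std::vector<std::pair<int, float>>>& rows) {
        row_ptr.push_back(0);
        for (const auto& r : rows) {
            for (const auto& [item, rating] : r) {
                col.push_back(item);
                val.push_back(rating);
            }
            row_ptr.push_back(col.size());
        }
    }

    // Dot product of two users' rating vectors (an unnormalized similarity).
    // Both rows are sorted by item id, so a two-pointer merge visits each
    // stored rating at most once and scans memory contiguously.
    float dot(int u, int v) const {
        std::size_t i = row_ptr[u], j = row_ptr[v];
        float acc = 0.0f;
        while (i < row_ptr[u + 1] && j < row_ptr[v + 1]) {
            if (col[i] < col[j]) ++i;
            else if (col[i] > col[j]) ++j;
            else acc += val[i++] * val[j++];
        }
        return acc;
    }
};
```

Besides fitting the sparse dataset in memory, the contiguous row storage gives the sequential access pattern that the naive implementation lacks.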

We will be looking at the collaborative filtering algorithm in depth:

Collaborative Filtering

recommend(u) {
  likelihood = {}
  V = similar_users(u)
  for v in V:
    for item in (items v purchased but u hasn't):
      likelihood[item] += similarity(u, v)
  return items ranked by likelihood
}

In Collaborative Filtering, getting the similar users of a user can be reduced to a clustering problem, where we want to cluster users by their similarities. The remaining double for loop is also computation-intensive and requires care to parallelize efficiently.
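Since we plan to use OpenMP, the loop over candidate users parallelizes naturally. A minimal sketch of one way to do it, where the data layout and the Jaccard similarity are placeholder assumptions rather than our final design:

```cpp
#include <cassert>
#include <cmath>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical layout: purchases[u] is the set of items user u bought.
using Purchases = std::vector<std::unordered_set<int>>;

// Jaccard similarity stands in for whatever measure we end up using.
double similarity(const Purchases& p, int u, int v) {
    int common = 0;
    for (int item : p[u]) common += static_cast<int>(p[v].count(item));
    int total = static_cast<int>(p[u].size() + p[v].size()) - common;
    return total == 0 ? 0.0 : static_cast<double>(common) / total;
}

// Score items for user u over a candidate set V. The candidate loop is
// embarrassingly parallel: each iteration fills a private map, and the
// maps are merged afterwards to avoid contention on a shared likelihood map.
std::unordered_map<int, double> recommend(const Purchases& p, int u,
                                          const std::vector<int>& V) {
    std::vector<std::unordered_map<int, double>> partial(V.size());
    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(V.size()); ++i) {
        int v = V[i];
        double s = similarity(p, u, v);
        for (int item : p[v])
            if (!p[u].count(item)) partial[i][item] += s;
    }
    std::unordered_map<int, double> likelihood;
    for (const auto& m : partial)
        for (const auto& [item, s] : m) likelihood[item] += s;
    return likelihood;
}
```

The per-thread accumulation is one of the design choices we expect to benchmark against alternatives such as atomics or a shared map with locks.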

In short, our roadmap is to parallelize clustering, then collaborative filtering, then matrix factorization.

Matrix Factorization

The idea of matrix factorization is that we have a rating matrix R, where each row represents a user, each column represents an item, and each entry represents the rating that user gave that item. What we will be implementing is a matrix factorization algorithm to factor R into a user matrix U and an item matrix V, so that R ≈ UVᵀ, where U and V essentially represent the inherent attributes of each user and item. This algorithm, popularized by Simon Funk during the Netflix Prize, is one of the best-known approaches from that competition. Here is an example to illustrate how it works. Imagine our R is:

Using an online calculator, we get U to be:

We get V to be:

Matrix factorization automatically extracts the features of each movie and each person's preferred features. In this example, it is apparent that the two features are romance and adventure. Thus, the romantic movies have a large value for the romance feature and a small value for the adventure feature, while Alice has a large value for romance and Bob has a large value for adventure.
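A minimal sketch of the SGD update at the core of Funk-style matrix factorization, to show what we will be implementing; the dimensions, initialization, and hyperparameters here are illustrative assumptions:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Rating { int user, item; double value; };

// Funk-style matrix factorization: learn U (users x k) and V (items x k) so
// that dot(U[u], V[i]) approximates rating r_{ui}, by stochastic gradient
// descent on the regularized squared error of each observed rating.
void factorize(const std::vector<Rating>& ratings,
               std::vector<std::vector<double>>& U,
               std::vector<std::vector<double>>& V,
               int k, int epochs = 200, double lr = 0.05, double reg = 0.02) {
    for (int e = 0; e < epochs; ++e) {
        for (const Rating& r : ratings) {
            double pred = 0.0;
            for (int f = 0; f < k; ++f) pred += U[r.user][f] * V[r.item][f];
            double err = r.value - pred;
            for (int f = 0; f < k; ++f) {
                double uf = U[r.user][f];
                U[r.user][f] += lr * (err * V[r.item][f] - reg * uf);
                V[r.item][f] += lr * (err * uf - reg * V[r.item][f]);
            }
        }
    }
}
```

Only observed entries drive the updates, which is what makes the algorithm practical on a sparse rating matrix; the sequential dependency between updates is also precisely what will make parallelizing it interesting.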


We will be referring to the following academic papers (the list will be updated during the project):

We will be using the following dataset for testing purposes:

Goals and Deliverables

  • A frontend web app that allows users to upload their training data and test data, and visualizes the correctness of each algorithm (and possibly of each implementation, GPU vs. OpenMP)
  • A backend service API which implements the different algorithms (collaborative filtering, clustering)

Platform Choice

We will be using C++ as our main programming language because we want to use CUDA and OpenMP to parallelize our program, and C++ strikes a good balance between performance and code readability. Our code will run on a single multi-core machine with multiple GPUs, and we will try to keep it portable across operating systems.


Schedule

  • Friday, April 8: Have a RESTful C++ server ready, design the backend API, discuss which versions of the algorithms to implement
  • Friday, April 15: Study, implement, and experiment with clustering algorithms
  • Friday, April 22: Use clustering algorithms to implement collaborative filtering, benchmark different implementations
  • Friday, April 29: Implement Simon Funk’s matrix factorization algorithm and connect with APIs
  • Friday, May 6: Wrap up the project with output visualization, final writeup, and reflection