ParaRec - a parallel recommendation system
http://pavelkang.github.io/418finalproject/418finalproject//
Thu, 14 Jul 2016 18:29:12 +0000
<h1 id="pararec-final-report">ParaRec Final Report</h1>
<h1 id="summary">Summary</h1>
<p>We have implemented and optimized a parallel collaborative filtering engine which features:</p>
<ul>
<li>A <strong>compact data structure and related algorithms</strong> that we designed ourselves (we have <strong>not</strong> seen any prior work on data compression in the collaborative filtering literature)</li>
<li>A <strong>multi-GPU</strong>, matrix-based solution</li>
<li><strong>Parallel locality sensitive hashing</strong> preprocessing algorithm for user clustering</li>
</ul>
<p>Our deliverables will include:</p>
<ul>
<li>Performance graphs of the algorithms we have implemented.</li>
</ul>
<p>Here is a link to our final presentation slides:</p>
<ul>
<li><a href="https://drive.google.com/file/d/0B4lG-7EeCB0cVV9WQnFNOEtXUDg/view?usp=sharing">Slides</a></li>
</ul>
<h1 id="background-and-approach-and-results">Background and Approach and Results</h1>
<p>We have written a separate post for each optimization technique. <strong>Detailed explanations</strong> of the optimization techniques, algorithms, designs, and results can be found in the <strong>following posts</strong>:</p>
<ul>
<li><a href="http://pavelkang.github.io/418finalproject/2016/cuda/">CUDA Matrix</a></li>
<li><a href="http://pavelkang.github.io/418finalproject/2016/datacompression/">Compressed Data Structure</a></li>
<li><a href="http://pavelkang.github.io/418finalproject/2016/nearestneighbor/">Locality Sensitive Hashing</a></li>
</ul>
<h1 id="more-results">More Results</h1>
<ul>
<li>We measured performance using the <code class="highlighter-rouge">CycleTimer.h</code> from the course assignments.</li>
<li>For the matrix solution, the precise setup is in the slides above. The compressed data structure and nearest neighbor search implementations were tested on the latedays cluster.</li>
<li>Graphs can be found in those reports.</li>
<li>We used MovieLens datasets ranging from 100k to 10m ratings.</li>
<li>As to what limits our speedup, please read those reports above.</li>
</ul>
<h1 id="references">References</h1>
<ul>
<li><a href="http://istc-bigdata.org/plsh/docs/plsh_paper.pdf">Streaming Similarity Search over one Billion Tweets using Parallel Locality-Sensitive Hashing</a></li>
<li><a href="http://www.grappa.univ-lille3.fr/~mary/cours/stats/centrale/reco/paper/MatrixFactorizationALS.pdf">Large-scale Parallel Collaborative Filtering for the Netflix Prize</a></li>
<li><a href="http://infolab.stanford.edu/~ullman/mmds/ch3.pdf">Stanford Data Mining Textbook</a></li>
</ul>
<h1 id="list-of-work">List of work</h1>
<p>We have done an equal amount of work on this project.</p>
Wed, 04 May 2016 00:00:00 +0000
http://pavelkang.github.io/418finalproject/418finalproject//2016/final-writeup/
<h1 id="cuda-based-approximate-nearest-neighbor-search-with-locality-sensitive-hashing">CUDA-based Approximate Nearest Neighbor Search with Locality Sensitive Hashing</h1>
<h2 id="motivation">Motivation</h2>
<p>The first step of the collaborative filtering algorithm computes pairwise user similarity. However, only the most similar users matter in the second step, because the rest have very small weights and contribute very little to the final result. Intuitively: we only care about the top n most similar users; who cares what the 1000th most similar user likes?</p>
<p>Therefore, we want to <strong>cluster</strong> our users so that we can do nearest neighbor queries fast. Essentially, we want to build a data structure such that it supports the function <code class="highlighter-rouge">find_most_similar_users(target_user, n)</code>.</p>
<h2 id="brute-force-solution-benchmark">Brute Force Solution Benchmark</h2>
<p>The brute-force solution calculates pairwise distances using our <strong>pearson_correlation</strong> and sorts them from smallest to largest. To serve a query for <code class="highlighter-rouge">target_user</code> and <code class="highlighter-rouge">n</code>, we simply take the top <code class="highlighter-rouge">n</code> elements in the sorted list for <code class="highlighter-rouge">target_user</code>. This is essentially what the first step already does; note that it saves no work. It serves as a correctness benchmark for our approximation algorithms.</p>
<h2 id="locality-sensitive-hashing">Locality Sensitive Hashing</h2>
<p>However, if approximate nearest neighbors are good enough, we can take advantage of the approximation algorithm <code class="highlighter-rouge">Locality Sensitive Hashing</code>. Here we implement locality sensitive hashing with CUDA; moreover, the implementation is based on the compressed data format we developed previously.</p>
<p>Here is an illustration of the parallel locality sensitive hashing algorithm we are going to implement:
<img src="http://pavelkang.github.io/418finalproject/assets/lsh.svg" alt="Locality Sensitive Hashing" title="Logo Title Text 1" /></p>
<h2 id="parallel-min-hashing">Parallel Min-hashing</h2>
<p>This is based on the original min-hashing algorithm from the <a href="http://infolab.stanford.edu/~ullman/mmds/ch3.pdf">Stanford Data Mining Textbook</a>. We developed our own parallel version of this algorithm using the compressed data structure. The original min-hashing algorithm is item-major; we modified it to be user-major so that it fits our compressed data structure.</p>
<ul>
<li>Step 1. Calculate the mean rating for each user and “binarize” each user’s ratings against that mean, using <code class="highlighter-rouge">mean_kernel</code> and <code class="highlighter-rouge">binarize_kernel</code>:</li>
</ul>
<div class="language-c highlighter-rouge"><pre class="highlight"><code> <span class="n">mean_kernel</span><span class="o"><<<</span><span class="n">UPDIV</span><span class="p">(</span><span class="n">USER_SIZE</span><span class="p">,</span> <span class="n">tpb</span><span class="p">),</span> <span class="n">tpb</span><span class="o">>>></span><span class="p">(</span><span class="n">compact_data_cuda</span><span class="p">,</span> <span class="n">compact_index_cuda</span><span class="p">,</span> <span class="n">mean_cuda</span><span class="p">);</span>
<span class="n">binarize</span><span class="o"><<<</span><span class="n">UPDIV</span><span class="p">(</span><span class="n">USER_SIZE</span><span class="p">,</span> <span class="n">tpb</span><span class="p">),</span> <span class="n">tpb</span><span class="o">>>></span><span class="p">(</span><span class="n">compact_data_cuda</span><span class="p">,</span> <span class="n">compact_index_cuda</span><span class="p">,</span> <span class="n">mean_cuda</span><span class="p">);</span>
</code></pre>
</div>
<ul>
<li>Step 2. Calculate the hash-function matrix:</li>
</ul>
<div class="language-c highlighter-rouge"><pre class="highlight"><code><span class="n">mean_kernel</span><span class="o"><<<</span><span class="n">UPDIV</span><span class="p">(</span><span class="n">USER_SIZE</span><span class="p">,</span> <span class="n">tpb</span><span class="p">),</span> <span class="n">tpb</span><span class="o">>>></span><span class="p">(</span><span class="n">compact_data_cuda</span><span class="p">,</span> <span class="n">compact_index_cuda</span><span class="p">,</span> <span class="n">mean_cuda</span><span class="p">);</span>
<span class="n">binarize</span><span class="o"><<<</span><span class="n">UPDIV</span><span class="p">(</span><span class="n">USER_SIZE</span><span class="p">,</span> <span class="n">tpb</span><span class="p">),</span> <span class="n">tpb</span><span class="o">>>></span><span class="p">(</span><span class="n">compact_data_cuda</span><span class="p">,</span> <span class="n">compact_index_cuda</span><span class="p">,</span> <span class="n">mean_cuda</span><span class="p">);</span>
</code></pre>
</div>
<ul>
<li>Step 3. We run <code class="highlighter-rouge">lsh_kernel</code> for each user</li>
</ul>
<div class="language-c highlighter-rouge"><pre class="highlight"><code><span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="n">u_start</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">u_end</span><span class="p">;</span> <span class="n">i</span><span class="o">+=</span><span class="mi">2</span><span class="p">)</span> <span class="p">{</span>
  <span class="kt">int</span> <span class="n">item</span> <span class="o">=</span> <span class="n">compact_data</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
  <span class="kt">int</span> <span class="n">rating</span> <span class="o">=</span> <span class="n">compact_data</span><span class="p">[</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">];</span>
  <span class="k">if</span> <span class="p">(</span><span class="n">rating</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="k">continue</span><span class="p">;</span>
  <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">j</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">j</span> <span class="o"><</span> <span class="mi">100</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// for all possible hash functions</span>
    <span class="kt">int</span> <span class="n">hashed_item</span> <span class="o">=</span> <span class="n">hash</span><span class="p">[</span><span class="n">item</span> <span class="o">*</span> <span class="mi">100</span> <span class="o">+</span> <span class="n">j</span><span class="p">];</span>
    <span class="n">sigs</span><span class="p">[</span><span class="n">tid</span> <span class="o">*</span> <span class="mi">100</span> <span class="o">+</span> <span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="n">min</span><span class="p">(</span><span class="n">sigs</span><span class="p">[</span><span class="n">tid</span> <span class="o">*</span> <span class="mi">100</span> <span class="o">+</span> <span class="n">j</span><span class="p">],</span> <span class="n">hashed_item</span><span class="p">);</span>
  <span class="p">}</span>
<span class="p">}</span>
</code></pre>
</div>
<h2 id="results">Results</h2>
<p>For the union-find step that follows the <code class="highlighter-rouge">lsh_kernel</code> implementation, we implemented it using thrust vectors and related functions such as <code class="highlighter-rouge">sequence</code>, <code class="highlighter-rouge">sort</code>, etc. However, union find is inherently sequential.</p>
<p>Preprocessing takes 1.1x ~ 2.5x the time of the compressed data structure implementation, depending on the dataset. After preprocessing, a <code class="highlighter-rouge">recommend()</code> query takes almost no time.</p>
<p><strong>Analysis</strong>: There are several bottlenecks. First, we have to pack the numbers in each band into a struct and put the struct in a thrust vector. Second, to avoid the <script type="math/tex">O(n^2)</script> pairwise band comparison, we sort so that identical bands are always adjacent to each other. This sorting step is very slow, and we have not yet found a way around it.</p>
Thu, 28 Apr 2016 15:04:23 +0000
http://pavelkang.github.io/418finalproject/418finalproject//2016/nearestneighbor/
<h1 id="data-compression">Data Compression</h1>
<h2 id="motivation">Motivation</h2>
<p>One of the biggest challenges in improving the performance of a recommendation engine is reducing data access time. By the nature of the collaborative filtering algorithm, the data access pattern is very random. Moreover, since the user rating matrix is very sparse, consecutive memory loads are likely to hit very different addresses. Therefore, a compressed data structure that exploits memory locality is extremely helpful.</p>
<p>We designed and implemented a compact data structure that allows collaborative filtering algorithms to take advantage of memory locality. We are not aware of any prior work in the literature on compressed data structures for collaborative filtering.</p>
<p><img src="http://pavelkang.github.io/418finalproject/assets/illustration.svg" alt="data structure" title="Logo Title Text 1" />
On the right is an illustration of the memory footprint of collaborative filtering on a matrix; on the left, our compressed data structure. Our data structure converts random memory accesses into contiguous memory accesses.</p>
<h2 id="data-structure">Data Structure</h2>
<p><img src="http://pavelkang.github.io/418finalproject/assets/illustration.svg" alt="data structure" title="Logo Title Text 1" /></p>
<h2 id="algorithm-implementation">Algorithm Implementation</h2>
<ol>
<li>Calculating User Similarity with Pearson Correlation</li>
</ol>
<p>We want to compute user-user similarity, and the algorithm depends on finding the items two users have in common. In a sparse matrix representation this is easy, because checking whether a user has consumed an item is $O(1)$. In our compressed data structure, however, the most naive implementation requires an $O(n)$ lookup per item, so computing one pairwise user similarity takes $O(mn)$, where $m$ is the number of items consumed by one user and $n$ the number consumed by the other.</p>
<p>To combat this problem, we developed an $O(m+n)$ algorithm. Essentially, it looks at the two data “regions” for the two users. Since the incoming data is item-sorted for each user, we can find the common items in the two “regions” by maintaining two pointers starting at the respective starting points and advancing the one with the smaller item index. Once either pointer reaches the end, we know for sure there will be no more common items.</p>
<p><img src="http://pavelkang.github.io/418finalproject/assets/algo.svg" alt="data structure" title="Logo Title Text 1" /></p>
<div class="language-c highlighter-rouge"><pre class="highlight"><code> <span class="k">if</span> <span class="p">(</span><span class="n">user</span> <span class="o">==</span> <span class="n">USER_SIZE</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
<span class="n">u_end</span> <span class="o">=</span> <span class="n">DATA_SIZE</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">u_end</span> <span class="o">=</span> <span class="n">compact_index</span><span class="p">[</span><span class="n">user</span><span class="o">+</span><span class="mi">1</span><span class="p">];</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">tid</span> <span class="o">==</span> <span class="n">USER_SIZE</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
<span class="n">v_end</span> <span class="o">=</span> <span class="n">DATA_SIZE</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">v_end</span> <span class="o">=</span> <span class="n">compact_index</span><span class="p">[</span><span class="n">tid</span><span class="o">+</span><span class="mi">1</span><span class="p">];</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="n">u_start</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">j</span> <span class="o">=</span> <span class="n">v_start</span><span class="p">;</span>
<span class="kt">double</span> <span class="n">rui</span><span class="p">,</span> <span class="n">rvi</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">item_i</span><span class="p">,</span> <span class="n">item_j</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="n">i</span> <span class="o"><</span> <span class="n">u_end</span> <span class="o">&&</span> <span class="n">j</span> <span class="o"><</span> <span class="n">v_end</span><span class="p">)</span> <span class="p">{</span>
<span class="n">item_i</span> <span class="o">=</span> <span class="n">compact_data</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
<span class="n">item_j</span> <span class="o">=</span> <span class="n">compact_data</span><span class="p">[</span><span class="n">j</span><span class="p">];</span>
<span class="k">if</span> <span class="p">(</span><span class="n">item_i</span> <span class="o">==</span> <span class="n">item_j</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// common item
</span> <span class="n">commons</span><span class="o">++</span><span class="p">;</span>
<span class="n">rui</span> <span class="o">=</span> <span class="n">compact_data</span><span class="p">[</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">];</span>
<span class="n">rvi</span> <span class="o">=</span> <span class="n">compact_data</span><span class="p">[</span><span class="n">j</span><span class="o">+</span><span class="mi">1</span><span class="p">];</span>
<span class="n">a</span> <span class="o">+=</span> <span class="p">(</span><span class="n">rui</span> <span class="o">-</span> <span class="n">u_mean</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="n">rvi</span> <span class="o">-</span> <span class="n">v_mean</span><span class="p">);</span>
<span class="n">b</span> <span class="o">+=</span> <span class="p">(</span><span class="n">rui</span> <span class="o">-</span> <span class="n">u_mean</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="n">rui</span> <span class="o">-</span> <span class="n">u_mean</span><span class="p">);</span>
<span class="n">c</span> <span class="o">+=</span> <span class="p">(</span><span class="n">rvi</span> <span class="o">-</span> <span class="n">v_mean</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="n">rvi</span> <span class="o">-</span> <span class="n">v_mean</span><span class="p">);</span>
<span class="n">i</span> <span class="o">+=</span> <span class="mi">2</span><span class="p">;</span>
<span class="n">j</span> <span class="o">+=</span> <span class="mi">2</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">item_i</span> <span class="o"><</span> <span class="n">item_j</span><span class="p">)</span> <span class="p">{</span>
<span class="n">i</span> <span class="o">+=</span> <span class="mi">2</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">j</span> <span class="o">+=</span> <span class="mi">2</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre>
</div>
<ol>
<li>Calculating Item Preference</li>
</ol>
<p>To calculate item preferences, instead of finding common items we want to find items present in one user’s ratings but not the other’s. The idea is similar to the previous algorithm: we keep two pointers i and j. If the item pointed to by i is smaller than the item pointed to by j, then item i does not appear in the other user’s consumption.</p>
<div class="language-c highlighter-rouge"><pre class="highlight"><code> <span class="k">while</span> <span class="p">(</span><span class="n">i</span> <span class="o"><</span> <span class="n">u_end</span> <span class="o">&&</span> <span class="n">j</span> <span class="o"><</span> <span class="n">v_end</span><span class="p">)</span> <span class="p">{</span>
<span class="n">item_i</span> <span class="o">=</span> <span class="n">compact_data</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
<span class="n">item_j</span> <span class="o">=</span> <span class="n">compact_data</span><span class="p">[</span><span class="n">j</span><span class="p">];</span>
<span class="k">if</span> <span class="p">(</span><span class="n">item_i</span> <span class="o">==</span> <span class="n">item_j</span><span class="p">)</span> <span class="p">{</span>
<span class="n">i</span> <span class="o">+=</span> <span class="mi">2</span><span class="p">;</span>
<span class="n">j</span> <span class="o">+=</span> <span class="mi">2</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">item_i</span> <span class="o"><</span> <span class="n">item_j</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// possible item_j appear in u's ratings
</span> <span class="n">i</span> <span class="o">+=</span> <span class="mi">2</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="c1">// item_i > item_j, item_j won't be rated by u
</span> <span class="n">like</span><span class="p">[</span><span class="n">item_j</span><span class="p">]</span> <span class="o">+=</span> <span class="n">sim</span><span class="p">[</span><span class="n">tid</span><span class="p">]</span> <span class="o">*</span> <span class="n">compact_data</span><span class="p">[</span><span class="n">j</span><span class="o">+</span><span class="mi">1</span><span class="p">];</span>
<span class="n">j</span> <span class="o">+=</span> <span class="mi">2</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">j</span> <span class="o">!=</span> <span class="n">v_end</span><span class="p">)</span> <span class="p">{</span>
<span class="k">while</span> <span class="p">(</span><span class="n">j</span> <span class="o"><</span> <span class="n">v_end</span><span class="p">)</span> <span class="p">{</span>
<span class="n">item_j</span> <span class="o">=</span> <span class="n">compact_data</span><span class="p">[</span><span class="n">j</span><span class="p">];</span>
<span class="n">like</span><span class="p">[</span><span class="n">item_j</span><span class="p">]</span> <span class="o">+=</span> <span class="n">sim</span><span class="p">[</span><span class="n">tid</span><span class="p">]</span> <span class="o">*</span> <span class="n">compact_data</span><span class="p">[</span><span class="n">j</span><span class="o">+</span><span class="mi">1</span><span class="p">];</span>
<span class="n">j</span> <span class="o">+=</span> <span class="mi">2</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre>
</div>
<h2 id="performance">Performance</h2>
<p>We have tested four implementations on two datasets of size 100k and 10m. The four implementations are: a serial CPU version, a naive CUDA implementation, the <script type="math/tex">O(mn)</script> naive algorithm on the compressed data structure, and our <script type="math/tex">O(m+n)</script> algorithm on the compressed data structure.
<img src="http://pavelkang.github.io/418finalproject/assets/100k.png" alt="data structure" title="Logo Title Text 1" />
<img src="http://pavelkang.github.io/418finalproject/assets/10m.png" alt="data structure" title="Logo Title Text 1" /></p>
Wed, 27 Apr 2016 15:04:23 +0000
http://pavelkang.github.io/418finalproject/418finalproject//2016/datacompression/
<h1 id="cuda-optimization">CUDA Optimization</h1>
<h1 id="cuda-implementation-of-collaborative-filtering">CUDA Implementation of Collaborative Filtering</h1>
<h2 id="implementations">Implementations</h2>
<h3 id="serial-version">Serial Version</h3>
<p>For each user, we first calculate the correlation between that user and all other users, obtaining a list of correlation results. Then, for each of the user’s unrated items, we calculate the item’s score by summing, over all users who rated the item, the product of their similarity and their rating. We then sort the resulting list and return the 5 items with the highest scores. Running on the ml-1M dataset, it performed 6040 recommendations in 737.455 seconds.</p>
<h3 id="parallelizing-the-correlation-calculation">Parallelizing the Correlation Calculation</h3>
<p>We first tried to improve the program by parallelizing the correlation calculation. We call the matrix containing all the users’ ratings the rating matrix; it has <em>ITEM_COUNT</em> columns and <em>USER_COUNT</em> rows. To populate the correlation scores for user A, we take user A’s row in the rating matrix and compute its correlation against the transpose of the rating matrix. We parallelize this by dividing the rating matrix into chunks and mapping them to threads on the GPU. During this process, we found that threads can actually share a large part of the data, so we decided to store it in shared memory. As shown in the figure, the blue part can be stored in shared memory since it is reused when computing the orange block.
After running on the ml-1M dataset, it performed 6040 recommendations in 341.807 seconds, roughly a 2x speedup. We can definitely do better, so we decided to parallelize further.</p>
<p><img src="http://pavelkang.github.io/418finalproject/assets/similarity.png" alt="recommendation engine" title="Logo Title Text 1" /></p>
<h3 id="parallelizing-the-recommendation-calculation">Parallelizing the Recommendation Calculation</h3>
<p>The next thing we tried was parallelizing the recommendation step. We identified that the recommendation process is just the matrix-vector product of user A’s similarity list with the rating matrix. Using similar methods, we stored parts of the vector in shared memory to improve arithmetic intensity.
After running on the ml-1M dataset, it performed 6040 recommendations in 152.90467 seconds. We were confused as to why it only achieved a roughly 4.8x speedup. Our hypothesis was that calling the same CUDA kernel over and over leaves the application bound by the memory system, since we are constantly copying the user rating matrix from the host to the device.</p>
<p>In order to verify our hypothesis, we decided to do a detailed benchmark of the program using the Nvidia Visual Profiler. From the timeline we can see that the program was constantly moving data between the device and the host.</p>
<p><img src="http://pavelkang.github.io/418finalproject/assets/slow.png" alt="recommendation engine" title="Logo Title Text 1" /></p>
<p>After performing a kernel analysis on the recommendation kernel, we found that memory utilization on the device is much higher than computational utilization. This confirms our hypothesis that the program is memory-bound. To improve, we should find ways to share data between kernel calls.</p>
<p><img src="http://pavelkang.github.io/418finalproject/assets/recommend-slow.png" alt="recommendation engine" title="Logo Title Text 1" /></p>
<h3 id="further-parallelizing-the-correlation-calculation">Further Parallelizing the Correlation Calculation</h3>
<p>We then noticed that when calculating the correlation for a single user, a lot of data from the rating matrix can be reused. So instead of computing the correlation for one user at a time, we compute the correlations between all users in parallel. Although this kernel takes longer to run, its results are reused across all the recommendation calls. We populate the correlations of all users using the rating matrix and its transpose; we store a separate transposed copy of the rating matrix to maintain spatial locality when accessing the columns. As in the previous correlation kernel, two blocks can be kept in shared memory so that threads share data and arithmetic intensity increases.
After running on the ml-1M dataset, it performed 6040 recommendations in 119.90333 seconds. This did not achieve the speedup we had in mind, so we headed back to the Nvidia Visual Profiler to find the problem.</p>
<p><img src="http://pavelkang.github.io/418finalproject/assets/semi-slow.png" alt="recommendation engine" title="Logo Title Text 1" /></p>
<p>From the timeline we can see that the correlation kernel (cyan block) took up only 5% of the total time. Similar to what happened before, the program is still constantly moving data between the host and the device.</p>
<p><img src="http://pavelkang.github.io/418finalproject/assets/semi-slow-rec.png" alt="recommendation engine" title="Logo Title Text 1" /></p>
<p>A kernel analysis of the recommendation kernel showed that although memory utilization is a bit lower, it still dominates the timeline. The kernel is still memory-bound.</p>
<h3 id="further-parallelizing-the-recommendation-calculation">Further Parallelizing the Recommendation Calculation</h3>
<p>After going over the code multiple times, we found the bottleneck of the whole program: each recommendation kernel call copied the entire rating matrix from the host to the device! We decided to share this data across all calls of the recommendation kernel, calculate the recommendations for all users in parallel, and store the results in memory. When a user then requests a recommendation, we simply read it off from memory.
After running on the ml-1M dataset, it performed 6040 recommendations in only <em>5.59757</em> seconds. That is a roughly 130x speedup over the serial version.</p>
<p><img src="http://pavelkang.github.io/418finalproject/assets/fastest.png" alt="recommendation engine" title="Logo Title Text 1" /></p>
<p>Heading back to the visual profiler we found out that for more than 90% of the time, the GPU is doing arithmetic computation. This shows that the program achieved a relatively high arithmetic intensity.</p>
<p><img src="http://pavelkang.github.io/418finalproject/assets/fastest_recommendation.png" alt="recommendation engine" title="Logo Title Text 1" />
<img src="http://pavelkang.github.io/418finalproject/assets/fastest_correlation.png" alt="recommendation engine" title="Logo Title Text 1" /></p>
<p>Looking at the two kernels individually, we confirmed that both are compute-bound. Since the only way to get more computational power is essentially a better GPU, we consider the program well parallelized.</p>
<h2 id="results">Results</h2>
<p>By comparing the performance across different versions of the program we can see the effects of each small improvement.
<img src="http://pavelkang.github.io/418finalproject/assets/comparison.png" alt="recommendation engine" title="Logo Title Text 1" /></p>
Sun, 27 Mar 2016 15:04:23 +0000
http://pavelkang.github.io/418finalproject/418finalproject//2016/cuda/
<h1 id="checkpoint-1">Checkpoint 1</h1>
<h6 id="instead-of-focusing-on-collaborative-filtering-then-matrix-factorization-we-decided-to-implement-both-serial-versions-so-that-we-can-parallelize-our-work-after-the-checkpoint-here-is-a-list-of-things-we-have-already-finished">Instead of focusing on collaborative filtering then matrix factorization, we decided to implement both serial versions so that we can parallelize our work after the checkpoint. <strong>Here is a list of things we have already finished</strong>:</h6>
<h2 id="a-serial-implementation-of-collaborative-filtering-and-performance-analysis">A Serial Implementation of Collaborative Filtering and Performance Analysis</h2>
<ul>
<li>Code is available here at <a href="https://github.com/pavelkang/cf/blob/master/cf/main.cpp">Github</a></li>
<li>Algorithm explanation:
<ol>
<li>We calculated pairwise user similarity using the <a href="http://grouplens.org/blog/similarity-functions-for-user-user-collaborative-filtering/">Pearson Correlation</a></li>
<li>To recommend to user <script type="math/tex">u</script>, for any other user <script type="math/tex">u'</script> different from <script type="math/tex">u</script>, we look at the items <script type="math/tex">u'</script> has rated, weight them by <script type="math/tex">u'</script>’s rating and the similarity between <script type="math/tex">u</script> and <script type="math/tex">u'</script>, and use this as the predicted score for user <script type="math/tex">u</script>.</li>
<li>We ran the algorithm on the <script type="math/tex">ml-100k</script> dataset mentioned in the proposal
<ul>
<li>We ran <code class="highlighter-rouge">gprof</code> on this serial implementation; the result is available here: <a href="https://gist.github.com/pavelkang/4c3a9ae32699fe0d4b1a3544c685ceb2">gist</a></li>
</ul>
</li>
</ol>
</li>
</ul>
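<p>The similarity step above can be sketched as follows. This is a minimal, hedged illustration of Pearson correlation over the items two users have both rated; the function name <code>pearson</code> and the map-based rating storage are our own simplifications, not the project’s actual code layout:</p>

```cpp
#include <cmath>
#include <map>

// Ratings for one user: item id -> rating.
using Ratings = std::map<int, double>;

// Pearson correlation between two users, computed over the
// items that both of them have rated.
double pearson(const Ratings &a, const Ratings &b) {
    double sum_a = 0, sum_b = 0, sum_aa = 0, sum_bb = 0, sum_ab = 0;
    int n = 0;
    for (const auto &[item, ra] : a) {
        auto it = b.find(item);
        if (it == b.end()) continue;  // only co-rated items count
        double rb = it->second;
        sum_a += ra;        sum_b += rb;
        sum_aa += ra * ra;  sum_bb += rb * rb;
        sum_ab += ra * rb;
        ++n;
    }
    if (n == 0) return 0.0;  // no overlap: treat as uncorrelated
    double num = sum_ab - sum_a * sum_b / n;
    double den = std::sqrt((sum_aa - sum_a * sum_a / n) *
                           (sum_bb - sum_b * sum_b / n));
    return den == 0.0 ? 0.0 : num / den;
}
```

In the serial implementation this function is called for every pair of users, which is exactly the quadratic hotspot that <code>gprof</code> exposes.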
<h2 id="a-cuda-implementation-of-collaborative-filtering-and-performance-analysis">A CUDA Implementation of Collaborative Filtering and Performance Analysis</h2>
<ul>
<li>Code is available here at <a href="https://github.com/pavelkang/cf/tree/cuda-cf/cuda-cf">Github</a></li>
</ul>
<h2 id="a-serial-implementation-of-svd-matrix-factorization-and-performance-analysis">A Serial Implementation of SVD Matrix Factorization and Performance Analysis</h2>
<ul>
<li>Code is available here at <a href="https://github.com/pavelkang/cf/blob/master/mf/main.cpp">Github</a></li>
</ul>
<h2 id="api">API</h2>
<p>Our API is available at <a href="https://docs.pararec2.apiary.io">Apiary</a>. We designed this API so that even if we don’t have time to furnish the frontend, we still have a RESTful service available.</p>
<h2 id="server-and-front-end">Server and Front-end</h2>
<ul>
<li>Code is available here at <a href="https://github.com/jeff95723/418FinalServer">Github</a></li>
</ul>
<h2 id="issues-and-concerns">Issues and Concerns</h2>
<ul>
<li>We decided not to continue with matrix factorization because we see many more promising optimization opportunities in collaborative filtering.</li>
</ul>
<h2 id="goals-and-deliverables">Goals and Deliverables</h2>
<ul>
<li>We won’t be implementing the parallel matrix factorization algorithm. Instead, we will focus more on the collaborative filtering algorithm.</li>
</ul>
<h2 id="future-plans">Future Plans</h2>
<ul>
<li>We are going to try these techniques on the collaborative filtering algorithm next:
<ol>
<li>Data Compression</li>
<li>Multi-GPU (we have two)</li>
<li>CuBLAS</li>
<li>Faster matrix multiplication so that the amortized cost of recommending multiple users is low</li>
<li>Sorting optimization (since we only need the top <script type="math/tex">n</script> recommendations, we don’t have to sort everything)</li>
<li>Use the k-nearest-neighbor algorithm to find clusters of similar users; in this way, we can approximate the final result.</li>
</ol>
</li>
</ul>
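<p>Optimization 5 above can be sketched with <code>std::partial_sort</code>: surfacing the top <script type="math/tex">n</script> recommendations does not require sorting every score. A minimal sketch (the function name <code>top_n</code> is ours):</p>

```cpp
#include <algorithm>
#include <vector>

// Return the indices of the n highest scores without fully
// sorting the array: roughly O(m log n) instead of O(m log m).
std::vector<int> top_n(const std::vector<double> &scores, int n) {
    std::vector<int> idx(scores.size());
    for (int i = 0; i < (int)idx.size(); ++i) idx[i] = i;
    n = std::min(n, (int)idx.size());
    // Move the n best candidates to the front, ordering only them.
    std::partial_sort(idx.begin(), idx.begin() + n, idx.end(),
                      [&](int a, int b) { return scores[a] > scores[b]; });
    idx.resize(n);
    return idx;
}
```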
Sun, 27 Mar 2016 15:04:23 +0000
http://pavelkang.github.io/418finalproject/2016/checkpoint-1/
ParaRec Project Proposal<h2 id="title">Title</h2>
<p><strong>Parallel Recommendation Engine</strong></p>
<p>Team members:</p>
<ul>
<li>Kai Kang (kaik1)</li>
<li>Jianfei Liu (jianfeil)</li>
</ul>
<h2 id="summary">Summary</h2>
<p>We are going to implement a parallel collaborative filtering recommendation engine on a single machine, and explore optimization techniques that address memory access problems, as well as model parallelism using multiple GPUs.</p>
<h2 id="background-and-challenge">Background and Challenge</h2>
<p><img src="http://pavelkang.github.io/418finalproject/assets/rec.svg" alt="recommendation engine" title="Logo Title Text 1" />
Recommendation systems are crucial to many tech companies such as Facebook, Amazon, and Netflix because they help users quickly find what they need and help companies sell their products (news for Facebook, retail products for Amazon, and movies for Netflix). In fact, Netflix hosted a contest, the <a href="http://www.netflixprize.com/"><em>Netflix Prize</em></a>, awarding <strong>1 million dollars</strong> to the winning recommendation algorithm. To evaluate a recommendation system, we look at two factors: speed and accuracy. Speed is as important as accuracy because a huge amount of new training data comes in every second, and it is important to use the new data as soon as we can.</p>
<p>In my last internship at Bloomberg, I implemented a recommendation engine of the Bloomberg research report. However, I encountered some difficulties during the process:</p>
<ul>
<li>
<p><strong>distributed vs. single machine</strong>: I experimented with distributed platforms such as Hadoop and Spark. On the one hand, the problem with distributed systems is that the dataset itself is not large enough to amortize the overhead of a distributed system. On the other hand, single-machine implementations suffer from memory and speed problems.</p>
</li>
<li>
<p><strong>data parallelism</strong>: The collaborative filtering algorithm has almost no dependencies and is easy to parallelize naively. However, the biggest challenge is that the memory access pattern is effectively random, so there is no locality in the naive implementation. What’s more, since the dataset is very sparse, the most naive implementation might not even be runnable on a single machine (this actually happened to me). We want to address this problem by developing a compressed data structure and related algorithms.</p>
</li>
<li>
<p><strong>model parallelism</strong>: Since the problem has few dependencies, model parallelism will give us a huge speedup if implemented correctly. However, distributing to multiple machines is complicated in practice, especially during development, when all effort should be focused on improving the algorithm. We address this problem by developing multi-GPU solutions.</p>
</li>
<li>
<p><strong>approximation algorithm</strong>: What is more, it seems unnecessary to compute all pairwise user similarities. In practice, an approximate solution will often suffice. We will develop parallel approximation algorithms to preprocess the raw dataset.</p>
</li>
</ul>
<p>We will be looking at the collaborative filtering algorithm in depth:</p>
<p><strong>Collaborative Filtering</strong></p>
<div class="language-c highlighter-rouge"><pre class="highlight"><code>recommend(u) {
  likelihood = {}
  V = similar_users(u)
  for v in V:
    for each item i that v purchased but u hasn't:
      likelihood[i] += rating(v, i) * similarity(u, v)
  return items ordered by likelihood
}
</code></pre>
</div>
<p>In <em>Collaborative Filtering</em>, finding the similar users of a given user reduces to a clustering problem: we want to cluster users by their similarities. The remaining double for loop is also computation-intensive and requires effort to parallelize efficiently.</p>
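<p>The double for loop can be parallelized naively by giving each thread a disjoint block of target users, so no synchronization is needed on the output. A hedged OpenMP sketch, assuming a dense similarity matrix and a dense ratings matrix where 0 means “not rated” (both simplifications; the real dataset is sparse):</p>

```cpp
#include <vector>

// Predict a score for every (user, item) pair. Each OpenMP thread
// owns a disjoint block of rows of `scores`, so writes never conflict.
void predict_all(const std::vector<std::vector<double>> &sim,
                 const std::vector<std::vector<double>> &ratings,
                 std::vector<std::vector<double>> &scores) {
    int n_users = ratings.size();
    int n_items = ratings[0].size();
    #pragma omp parallel for schedule(static)
    for (int u = 0; u < n_users; ++u) {
        for (int v = 0; v < n_users; ++v) {
            if (v == u) continue;
            for (int i = 0; i < n_items; ++i) {
                // Only items v rated that u has not seen contribute.
                if (ratings[u][i] == 0 && ratings[v][i] > 0)
                    scores[u][i] += sim[u][v] * ratings[v][i];
            }
        }
    }
}
```

The catch, as noted above, is that this dense layout wastes memory on a sparse dataset and the inner loop’s access pattern has poor locality, which is what our compressed data structure targets.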
<p>In short, this is our roadmap:
<img src="http://pavelkang.github.io/418finalproject/assets/overview.svg" alt="roadmap" title="Logo Title Text 1" /></p>
<p><strong>Matrix Factorization</strong> <a id="abcd" name="abcd"></a></p>
<p>The idea of matrix factorization is that we have <script type="math/tex">M</script>, where each row represents a <strong>user</strong>, each column represents an <strong>item</strong>, and entry <script type="math/tex">M[u][i]</script> is the rating of user <script type="math/tex">u</script> for item <script type="math/tex">i</script>. We will implement a matrix factorization algorithm that factors <script type="math/tex">M</script> into a user matrix <script type="math/tex">U</script> and an item matrix <script type="math/tex">I</script>, where <script type="math/tex">U</script> and <script type="math/tex">I</script> capture the inherent attributes of each user and item. This approach by <a href="http://sifter.org/~simon/journal/20061211.html">Simon Funk</a> became famous during the Netflix Prize competition. Here is an example to illustrate how it works.
Imagine our <script type="math/tex">M</script> is:</p>
<script type="math/tex; mode=display">% <![CDATA[
\left[
\begin{array}{cccc}
& \text{Titanic} & \text{Interstellar} & \text{The Notebook}\\
Alice & 5.0 & 2.0 & 4.0\\
Bob & 2.0 & 5.0 & 1.0\\
\end{array}
\right] %]]></script>
<p>Using an online calculator, we get <script type="math/tex">U</script> to be:</p>
<script type="math/tex; mode=display">% <![CDATA[
\left[
\begin{array}{ccc}
Alice & 1.0 & 0.0\\
Bob & 0.4 & 1.0 &\\
\end{array}
\right] %]]></script>
<p>We get <script type="math/tex">I</script> to be:</p>
<script type="math/tex; mode=display">% <![CDATA[
\left[
\begin{array}{ccc}
\text{Titanic} & \text{Interstellar} & \text{The Notebook}\\
5.0 & 2.0 & 4.0\\
0.0 & 4.2 & -0.6\\
\end{array}
\right] %]]></script>
<p>Matrix factorization <em>automatically</em> extracts the features of each movie and each person’s preference for those features. In this example, the two features are apparently <em>romance</em> and <em>adventure</em>: the romantic movies score high on the romance feature and low on the adventure feature, Alice weights romance heavily, and Bob weights adventure heavily.</p>
<h2 id="resources">Resources</h2>
<p>We will be referring to the following academic papers (the list will be updated during the project):</p>
<ul>
<li><a href="http://www.cs.utexas.edu/~inderjit/public_papers/kais-pmf.pdf">Parallel Matrix Factorization for Recommender Systems</a></li>
<li><a href="http://www.grappa.univ-lille3.fr/~mary/cours/stats/centrale/reco/paper/MatrixFactorizationALS.pdf">Large-scale Parallel Collaborative Filtering for the Netflix Prize</a></li>
<li><a href="http://www.jcomputers.us/vol8/jcp0801-02.pdf">A Parallel Clustering Algorithm with MPI - MKmeans</a></li>
</ul>
<p>We will be using the following datasets for testing purposes:</p>
<ul>
<li><a href="https://snap.stanford.edu/data/web-Amazon.html">Amazon Reviews</a></li>
<li><a href="https://gist.github.com/entaroadun/1653794">Public Dataset(more than 5 categories)</a></li>
<li><a href="http://academictorrents.com/details/9b13183dc4d60676b773c9e2cd6de5e5542cee9a">Netflix Prize Data Set</a></li>
</ul>
<h2 id="goals-and-deliverables">Goals and Deliverables</h2>
<ul>
<li>A frontend web app which allows users to upload their training and test data and visualizes the correctness of each algorithm (and possibly of each implementation variant using GPU/OpenMP)</li>
<li>A backend service API which implements the different algorithms (collaborative filtering, clustering)</li>
</ul>
<h2 id="platform-choice">Platform Choice</h2>
<p>We will be using C++ as our main programming language because we want to use CUDA and OpenMP to parallelize our program; C++ strikes a good balance between performance and code readability. We will try to make our code compatible with all major operating systems.</p>
<h2 id="schedule">Schedule</h2>
<ul>
<li><strong>Friday, April 8</strong>: Have a RESTful C++ server ready, design the backend API, discuss which versions of the algorithms to implement</li>
<li><strong>Friday, April 15</strong>: Study, implement, and experiment clustering algorithms</li>
<li><strong>Friday, April 22</strong>: Use clustering algorithms to implement collaborative filtering, benchmark different implementations</li>
<li><strong>Friday, April 29</strong>: Implement Simon Funk’s matrix factorization algorithm and connect with APIs</li>
<li><strong>Friday, May 6</strong>: Wrap up the project with output visualization, final writeup, and reflection</li>
</ul>
Fri, 08 Jan 2016 15:04:23 +0000