Parallelism and Acceleration in Recommendation Systems
In the era of big data, recommendation systems are widely used across commercial e-shops, social networks, product review sites, and other SaaS applications. Many websites collect user profiles to provide valuable information through personalized recommendation.
Collaborative filtering (CF) is a very common technique used in recommender systems. The key idea of collaborative filtering is to recommend items based on similarity measures between users or items. For instance, the items recommended to a user are those preferred by similar users. The technique is widely used because (1) a single algorithm works on any kind of item and (2) it can handle the complexity of users’ preferences.
Because the scale of data in these applications is constantly increasing, in some settings the computational time of collaborative filtering algorithms may be unsatisfactory. Different approaches have been proposed to deal with the scalability of recommender system algorithms. This project focuses on implementing parallel approaches with CUDA and comparing their computational time with the sequential approach.
Source: https://grouplens.org/datasets/movielens/
Dataset | Users (U) | Movies (M) | Ratings (N)
---|---|---|---
tiny (for debug) | 5 | 5 |
100k | 943 | 1682 |
small | 610 | 9725 |
1m | 6040 | 3706 |
There are two approaches to implementing collaborative filtering: (1) user-user similarity and (2) item-item similarity; both have been implemented in this project. Here we use item-item similarity for the demonstration, since the algorithm is essentially the same.
-
Data structure
- M = no of items, U = no of users, N = no of ratings
- Rating = <user id OR item id, value>
struct Rating { int id; double value; };
- Rating ratings[N] – sorted by item and divided into M groups
- int count[M] (no of ratings for each movie)
- double magnitudes[M]
- double similarities[M*M]
- double predictions[M*N] (output)
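To make the layout above concrete, here is a minimal sketch of how these arrays might be allocated with CUDA unified memory (cudaMallocManaged) so that both host and device code can access them; the exact allocation code in the project may differ.

```cpp
// Sketch only: allocating the arrays listed above with unified memory.
// M, U and N are read from the first line of the input file.
struct Rating { int id; double value; };

Rating *ratings;       // N ratings, sorted by item
int    *count;         // number of ratings for each movie
double *magnitudes;    // ||r|| for each movie
double *similarities;  // M*M item-item similarity matrix
double *predictions;   // output, sized M*N as in the list above

cudaMallocManaged(&ratings,      sizeof(Rating) * (size_t)N);
cudaMallocManaged(&count,        sizeof(int)    * (size_t)M);
cudaMallocManaged(&magnitudes,   sizeof(double) * (size_t)M);
cudaMallocManaged(&similarities, sizeof(double) * (size_t)M * M);
cudaMallocManaged(&predictions,  sizeof(double) * (size_t)M * N);
```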
-
Input processing mapMovie():
- Read each (user, item, rating) input
- Group ratings into (item, (user, rating)) pairs by inserting the ratings into a self-sorted multimap (from the STL)
- Hence, the ratings are now sorted by item
multimap<int, Rating> ratingsMap; readInput(filename, 0, ratingsMap);
- Time complexity:
$O(N \log N)$
Parallelization: Not implemented for this step.
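A sketch of what this input step might look like, assuming the readInput(filename, key, map) signature used above, where the key argument selects whether ratings are keyed by movie or by user; the project's actual parsing code may differ.

```cpp
#include <cstdio>
#include <map>

struct Rating { int id; double value; };

// Read "U M N" then N lines of "user_id movie_id rating" and insert each
// rating into the multimap; the multimap keeps the entries sorted by key.
void readInput(const char *filename, int key, std::multimap<int, Rating> &out) {
    FILE *fp = fopen(filename, "r");
    if (!fp) return;
    int U, M, N;
    fscanf(fp, "%d %d %d", &U, &M, &N);
    for (int n = 0; n < N; n++) {
        int user, movie;
        double value;
        fscanf(fp, "%d %d %lf", &user, &movie, &value);
        if (key == 0) out.insert({movie, Rating{user, value}});   // group by movie
        else          out.insert({user,  Rating{movie, value}});  // group by user
    }
    fclose(fp);
}
```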
-
Data standardization + Calculate rating magnitude of each item
The magnitude $\|r\|$ of each movie is calculated by the following equation:
$$\|r_{x}\| = \sqrt{\sum_{i \in N} r_{xi}^{2}}$$
calculateMagnitude(): For each item
- Calculate the mean of each item (we can treat missing data as 0; by doing this, the missing data will not affect the items’ similarities)
- For each user-item rating,
- Subtract its mean from the rating
- Add the square of rating to the sum
- magnitude[m] = square root of the sum
- Time complexity:
$O(N)$
Parallelization:
- Since the calculation for each item is independent, it can be parallelized as calculateMagnitude<<<M, 1>>>() (sketched below)
- Each thread performs an independent calculation on one item
- Time complexity for each thread: proportional to the number of users that rated the item, i.e. $O(U)$
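A minimal sketch of what this kernel could look like; itemStart (the offset of each item's ratings within the sorted array, e.g. a prefix sum of count) is an assumption about the data layout, not necessarily the project's exact code.

```cuda
__global__ void calculateMagnitude(Rating *ratings, const int *itemStart,
                                   const int *count, double *magnitudes, int M) {
    int m = blockIdx.x * blockDim.x + threadIdx.x;  // with <<<M, 1>>>, one item per single-thread block
    if (m >= M || count[m] == 0) return;

    // mean of the item's existing ratings
    double mean = 0.0;
    for (int n = 0; n < count[m]; n++)
        mean += ratings[itemStart[m] + n].value;
    mean /= count[m];

    // subtract the mean, then accumulate the sum of squares
    double sum = 0.0;
    for (int n = 0; n < count[m]; n++) {
        double r = ratings[itemStart[m] + n].value - mean;
        ratings[itemStart[m] + n].value = r;
        sum += r * r;
    }
    magnitudes[m] = sqrt(sum);
}
```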
-
Calculate item-item similarity
Several similarity metrics can be used. In this implementation, cosine similarity is used as the similarity measure.
$$\operatorname{sim}(x, y) = \cos(r_x, r_y) = \frac{r_x \cdot r_y}{\|r_{x}\| \cdot \|r_{y}\|}$$
calculateSimilarity(): For each item-item pair (m1, m2)
- mag_prod = magnitude[m1] * magnitude[m2]
- For each user u that has rated both m1 and m2:
- dot_prod += rating[u,m1] * rating[u,m2]
- similarities[m1,m2] = dot_prod / mag_prod;
- Time complexity:
$O(M^2U)$
Parallelization:
- Since the calculation for each item against all other items is independent, it can be parallelized as calculateSimilarities<<<M, 1>>>() (sketched below)
- Each thread performs an independent calculation for one item, iterating through all other items
- Why not <<<M*M, 1>>>? Data locality for one item is preserved within each thread
- Time complexity for each thread:
$O(MU)$
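As a sketch, the kernel might look like the following; the merge-style dotProduct() helper and the itemStart offsets are assumptions (in particular, they presume each item's ratings are sorted by user id).

```cuda
__device__ double dotProduct(const Rating *a, int na, const Rating *b, int nb) {
    // merge-walk two rating lists that share the user-id ordering
    double dot = 0.0;
    int i = 0, j = 0;
    while (i < na && j < nb) {
        if      (a[i].id < b[j].id) i++;
        else if (a[i].id > b[j].id) j++;
        else { dot += a[i].value * b[j].value; i++; j++; }
    }
    return dot;
}

__global__ void calculateSimilarities(const Rating *ratings, const int *itemStart,
                                      const int *count, const double *magnitudes,
                                      double *similarities, int M) {
    int m1 = blockIdx.x * blockDim.x + threadIdx.x;  // with <<<M, 1>>>, one item per single-thread block
    if (m1 >= M) return;
    for (int m2 = 0; m2 < M; m2++) {
        double magProd = magnitudes[m1] * magnitudes[m2];
        double dot = dotProduct(ratings + itemStart[m1], count[m1],
                                ratings + itemStart[m2], count[m2]);
        similarities[m1 * M + m2] = (magProd > 0.0) ? dot / magProd : 0.0;
    }
}
```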
-
Regroup the ratings by users
mapUser():
- Read each (user, item, rating) input
- Group ratings into (user, (item, rating)) pairs by inserting the ratings into a self-sorted multimap (from the STL)
- Hence, the ratings are now sorted by user
Time complexity: $O(N \log N)$
multimap<int, Rating> ratingsMap; readInput(filename, 0, ratingsMap);
Parallelization: Not implemented for this step.
-
Predict rating for each user-item
The rating $r_{xi}$ of user x for item i is predicted based on the similarities between item i and the other items that have been rated by the user:
$$r_{xi}=\frac{\sum_{j \in N(i ; x)} S_{ij} \cdot r_{xj}}{\sum_{j \in N(i ; x)} S_{ij}}$$
predictRatings(): For each user-item pair (u, m1)
- For each item that has been rated by the user (u,m2)
- sum_rating += similarities[m1,m2] * rating[u, m2]
- sum_sim += similarities[m1,m2]
- predict rating[u, m1] = sum_rating / sum_sim
- Time complexity:
$O(UM^2)$
Parallelization:
- Since the calculation for each user is independent, it can be parallelized as predictRatings<<<U, 1>>>() (sketched below)
- Each thread performs an independent calculation for one user, iterating through all items
- Why not <<<U*M, 1>>>? Data locality for one user is preserved within each thread
- Time complexity for each thread:
$O(M^2)$
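A sketch of the prediction kernel; userStart/userCount describe the ratings regrouped by user in the previous step, and the u * M + m1 indexing of predictions is an assumption about the output layout.

```cuda
__global__ void predictRatings(const Rating *userRatings, const int *userStart,
                               const int *userCount, const double *similarities,
                               double *predictions, int M, int U) {
    int u = blockIdx.x * blockDim.x + threadIdx.x;  // with <<<U, 1>>>, one user per single-thread block
    if (u >= U) return;
    for (int m1 = 0; m1 < M; m1++) {
        double sumRating = 0.0, sumSim = 0.0;
        // only the items m2 this user has actually rated contribute
        for (int n = 0; n < userCount[u]; n++) {
            int m2 = userRatings[userStart[u] + n].id;
            double sim = similarities[m1 * M + m2];
            sumRating += sim * userRatings[userStart[u] + n].value;
            sumSim    += sim;
        }
        predictions[u * M + m1] = (sumSim != 0.0) ? sumRating / sumSim : 0.0;
    }
}
```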
-
Optimizing pairwise operation
The elapsed time is mainly determined by the calculateSimilarity() function, which is very time-consuming: it consists of M*M pairwise operations.
-
Method 1:
for (int i = index; i < M; i+=stride ) for (int j = 0; j < M; j++) // job with constant length
Problem: Half of the calculation is duplicated, since sim(i, j) = sim(j, i) and each pair is computed twice
-
Method 2:
for (int i = index; i < M; i+=stride ) for (int j = i+1; j < M; j++) // job length decreases with i
Problem: total time is determined by the last completed job, since the workload is unbalanced across threads
-
Method 3 (Currently used):
for (int i = index; i < M; i+=stride ) for (int j = 0; j < M; j++) if (((i + j) % 2 && i < j) || ((i + j) % 2 == 0 && i > j)) // compute sim(i, j)
Each unordered pair (i, j) is now computed by exactly one thread, and the workload of each outer iteration is roughly M/2 (a quick check of this pairing rule is sketched below)
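A quick host-side check (not part of the project code) that this parity rule really assigns every unordered pair {i, j} to exactly one row, which is what keeps the per-thread workload balanced:

```cpp
#include <cassert>

void checkPairing(int M) {
    for (int a = 0; a < M; a++) {
        for (int b = a + 1; b < M; b++) {
            int owners = 0;
            // would row a compute sim(a, b)?
            if (((a + b) % 2 == 1 && a < b) || ((a + b) % 2 == 0 && a > b)) owners++;
            // would row b compute sim(b, a)?
            if (((a + b) % 2 == 1 && b < a) || ((a + b) % 2 == 0 && b > a)) owners++;
            assert(owners == 1);   // each pair is computed exactly once
        }
    }
}
```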
-
Optimizing data structure
Method 1:
-
Data structure: use separate array for count, user id, rating, magnitudes, similarities
int movieList[M]; int userCount[M]; int userId[N]; double u_ratings[N]; double magnitudes[M];
-
Result: time elapsed (ms)
Dataset | Algorithm | Sequential | Parallel 1 <32,32>
---|---|---|---
4x5 | item | 0.008 | 0.24
4x5 | user | 0.008 | 0.34
610x9125 | item | 2605.565 | 7343.519 / 2323.9
610x9125 | user | 116.78 | 769.604 / 421.8
943x1682 | item | 728.141 | 7343.519 / 989.608
943x1682 | user | 567.318 | 4557.879 / 3373.767
6040x3264 | item | 4421.31 | 47038.702 / 23848.671
6040x3264 | user | 8090.46 | 1365.302 / 711.995

Parallel 1 <32,32>: before optimizing pairwise operation / after optimizing pairwise operation
-
Experimented with other <<<blockPerGrid, threadPerBlock>>> combinations, but they were still slower than the sequential code
Method 2:
-
Try to group the data for each item and user into profiles
struct Rating { int id; double value; }; struct Profile1 { int count; double magnitude; Rating* ratings; }; struct Profile2 { int count; Rating* ratings; };
-
Dynamically allocate memory for each item / user during mapMovie() / mapUser()
For each item:
int cnt = m_u_ratings.count(it->first);
cudaMallocManaged(&mData[m], sizeof(T) + sizeof(Rating) * cnt);
mData[m]->count = cnt;
mData[m]->ratings = (Rating*)(mData[m] + 1); // the ratings array lives right after the profile header in the same allocation
movieList[m] = it->first;
-
Result: time elapsed (ms)
Dataset | Mode | Sequential | Parallel 2 <M,1>
---|---|---|---
4x5 | item | 0.01 | 0.876
4x5 | user | 0.012 | 0.88
610x9125 | item | 2510.706 | 523.93
610x9125 | user | Segmentation fault | 285.996
943x1682 | item | 467.644 | 150.769
943x1682 | user | 308.696 | 155.981
6040x3264 | item | 9283.719 | Segmentation fault
6040x3264 | user | 17495.76 | Segmentation fault

- Generally faster than the sequential code
-
Problem 1: The presence of the pointer Rating* makes things complicated and hard to debug; I don’t recommend doing it this way
-
Problem 2: Segmentation faults occur due to failures of cudaMalloc() / cudaMallocManaged(), even though there is enough memory on the GPU. I couldn’t fix that.
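One defensive pattern that would at least surface the failure earlier is to check the return code of every allocation instead of dereferencing a possibly invalid pointer (a sketch, not the project's code):

```cpp
cudaError_t err = cudaMallocManaged(&mData[m], sizeof(T) + sizeof(Rating) * cnt);
if (err != cudaSuccess) {
    // requires <cstdio> and <cstdlib>
    fprintf(stderr, "cudaMallocManaged failed for item %d: %s\n", m, cudaGetErrorString(err));
    exit(EXIT_FAILURE);
}
```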
Method 3 (Currently used):
-
Back to method 1, but group id and rating value together as Rating
struct Rating { int id; double value; };
-
No more passing pointer, yay!
-
If a value in an array is used repeatedly, store it in a register (a local variable) in the kernel
For example,
sim = similarities[m*M + ratings[n].id]
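For instance, inside the prediction loop the similarity value can be loaded into a local variable once and reused (a sketch of the idea, not the exact project code):

```cuda
for (int n = 0; n < userCount[u]; n++) {
    double sim = similarities[m * M + ratings[n].id];  // read from global memory once
    sumRating += sim * ratings[n].value;
    sumSim    += sim;
}
```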
-
Result: time elapsed (ms)
Dataset | Mode | Sequential | Parallel 2 <M,1> | Parallel 3 <M,1>
---|---|---|---|---
4x5 | item | 0.01 | 0.876 | 0.619
4x5 | user | 0.012 | 0.88 | 0.61
610x9125 | item | 2510.706 | 523.93 | 384.981
610x9125 | user | Segmentation fault | 285.996 | 101.199
943x1682 | item | 467.644 | 150.769 | 86.457
943x1682 | user | 308.696 | 155.981 | 75.279
6040x3264 | item | 9283.719 | Segmentation fault | 2675.681
6040x3264 | user | 17495.76 | Segmentation fault | 4003.035

- Obvious improvement in speed
- No more segmentation fault
-
Experiments with other <<<blockPerGrid, threadPerBlock>>> combinations
Conclusion:
- <<<M, 1>>> for item-based and <<<U, 1>>> for user-based achieve the best result
- Only in some cases, <<<M/8, 8>>> or <<<U/8, 8>>> perform better
Dataset | Mode | Parallel 2 <M,1> | Parallel 2 <M/8,8>
---|---|---|---
943x1682 | item | 384.981 | 362.527
6040x3264 | user | 4003.035 | 3435.059

- More threadPerBlock may result in segmentation fault
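As a sketch (reusing the hypothetical names from the kernel sketches above), the [threadNo] argument could map onto the launch configuration like this:

```cuda
int threadsPerBlock = threadNo;   // e.g. 1 or 8
int blocksPerGrid   = (M + threadsPerBlock - 1) / threadsPerBlock;   // enough blocks to cover all M items
calculateSimilarities<<<blocksPerGrid, threadsPerBlock>>>(ratings, itemStart, count,
                                                          magnitudes, similarities, M);
cudaDeviceSynchronize();          // wait for the kernel before reading results
```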
-
Input files
- There are 4 datasets placed under input/
- input/ratings_tiny.dat
- input/ratings_100k.dat
- input/ratings_small.dat
- input/ratings_1m.dat
- Input format
Movielens dataset is modified to meet the following input format:
- First line: U M N (specifying no. of users, no. of movies, no. of ratings)
- Following by N lines of: user_id movie_id rating
- U, M, N are integers. user_id and movie_id are integers starting from 1, whereas rating is between 1.0 and 5.0.
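For illustration only, a made-up toy file in this format (3 users, 2 movies, 4 ratings; not one of the datasets above) would look like:

```
3 2 4
1 1 4.0
1 2 3.5
2 1 5.0
3 2 2.0
```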
-
Compilation and Execution
-
Evaluating performance (This will print time elapsed in each stage and GPU memory used)
make eval && srun -p ipc21 --gres=gpu:1 -N 1 ./eval input/[dataset] [mode] [threadNo]
For example,
make eval && srun -p ipc21 --gres=gpu:1 -N 1 ./eval input/ratings_1m.dat user 8
-
Print predicted ratings (the output file will contain U*M-N lines of user_id movie_id rating)
make predict && srun -p ipc21 --gres=gpu:1 -N 1 ./predict input/[dataset] [mode] [threadNo] > [Outputfile]
For example,
make predict && srun -p ipc21 --gres=gpu:1 -N 1 ./predict input/ratings_1m.dat user 8 > Output.txt
-
Evaluating performance of method 2
make v2 && srun -p ipc21 --gres=gpu:1 -N 1 ./v2 input/[dataset] [mode]
-
Evaluating performance of sequential code
make seq && ./seq input/[dataset] [mode]
-
Sequential
Dataset | algorithm | Input Processing | Calculate Magnitude | Calculate Similarities | Predict Ratings | Total
---|---|---|---|---|---|---
4x5 | item | 0.129 | 0.003 | 0.004 | 0.003 | 0.01
4x5 | user | 0.075 | 0.003 | 0.006 | 0.003 | 0.012
943x1682 | item | 90.959 | 0.269 | 465.615 | 1.76 | 467.644
943x1682 | user | 82.573 | 0.263 | 308.433 | 0.016 | 308.696
610x9125 | item | 83.326 | 0.428 | 2507.027 | 3.251 | 2510.706
610x9125 | user | 87.705 | 0.252 | 147.057 | Seg. fault | Seg. fault
6040x3264 | item | 931.874 | 2.649 | 9255.989 | 25.081 | 9283.719
6040x3264 | user | 782.601 | 2.7322 | 17493.03 | 0.005 | 17495.76
Parallel 2 <M,1>
Dataset | algorithm | Input Processing | Calculate Magnitude | Calculate Similarities | Predict Ratings | Total
---|---|---|---|---|---|---
4x5 | item | 0.298 | 0.34 | 0.028 | 0.508 | 0.876
4x5 | user | 0.214 | 0.299 | 0.036 | 0.545 | 0.88
943x1682 | item | 57.347 | 1.538 | 99.15 | 50.081 | 150.769
943x1682 | user | 48.434 | 1.522 | 90.162 | 64.297 | 155.981
610x9125 | item | 113.392 | 3.054 | 462.547 | 58.329 | 523.93
610x9125 | user | 48.088 | 3.343 | 113.224 | 169.429 | 285.996
6040x3264 | item | Segmentation fault | Segmentation fault | Segmentation fault | Segmentation fault | Segmentation fault
6040x3264 | user | Segmentation fault | Segmentation fault | Segmentation fault | Segmentation fault | Segmentation fault
Parallel 3 <M,1>
Dataset | mode | Input Processing | Calculate Magnitude | Calculate Similarities | Predict Ratings | Total
---|---|---|---|---|---|---
4x5 | item | 72.93 | 0.279 | 0.027 | 0.313 | 0.619
4x5 | user | 82.56 | 0.27 | 0.037 | 0.303 | 0.61
943x1682 | item | 116.509 | 0.917 | 83.855 | 1.685 | 86.457
943x1682 | user | 128.393 | 0.853 | 73.501 | 0.925 | 75.279
610x9125 | item | 126.718 | 1.076 | 379.831 | 4.074 | 384.981
610x9125 | user | 140.972 | 1.021 | 99.201 | 0.977 | 101.199
6040x3264 | item | 762.175 | 5.902 | 2655.184 | 14.595 | 2675.681
6040x3264 | user | 545.094 | 6.094 | 3990.43 | 6.511 | 4003.035