Notes on how recommender systems work and how to apply the different algorithms in your projects, with links to real-life implementations.
https://github.com/alextanhongpin/data-science-learn/blob/master/similarity.md
Related: how to build a search ranking algorithm using Learning To Rank
- Collaborative Filtering
- Content Based (item based or user based)
- Hybrid recommender
Input
- data in the form of ratings, scores, events, or sentiment
Output
- ranking of items to recommend
- pool of items
- score for the items
What are the different types of similarity functions, and which use cases are they suitable for?
- https://dataaspirant.com/2015/04/11/five-most-popular-similarity-measures-implementation-in-python/
- https://pdfs.semanticscholar.org/d490/d24e64022735116b45dd18307d6dba97170c.pdf
Using the Slope One algorithm for binary ranking (1, 0): https://en.wikipedia.org/wiki/Slope_One
- how to write one from scratch
- how to scale one
- using existing libraries
- deploying it to production
- retraining the recommendation algorithm
- https://medium.com/arc-software/recommender-systems-behind-the-scenes-a39c831a0ae2
- https://towardsdatascience.com/prototyping-a-recommender-system-step-by-step-part-1-knn-item-based-collaborative-filtering-637969614ea
- https://en.m.wikipedia.org/wiki/Recommender_system#Content-based_filtering
- https://en.m.wikipedia.org/wiki/Tf–idf
- https://en.m.wikipedia.org/wiki/SMART_Information_Retrieval_System
- https://en.m.wikipedia.org/wiki/Latent_semantic_analysis
- https://en.m.wikipedia.org/wiki/Latent_Dirichlet_allocation
- https://www.kaggle.com/pierremegret/gensim-word2vec-tutorial
Viral factor calculator - calculates the net worth of a social media celebrity. Also includes recency (newest comments/feeds will have higher score) and the success rate of the merchants they promote.
Recommender system algorithm + hot rank algorithm. Incorporate recency into the recommender engine so that newer items are ranked higher than older ones. Why is this important? Some products (such as events/promotions) have an expiry time and are valid only for a certain period. Others, such as movies/songs, are seasonal - they follow a certain trend, usually according to the time. So it might be more reasonable to recommend current songs than songs from the 50s, even if the similarity score of the latter is higher.
Displays a list of top GitHub users with recommendations. Skills proven: data recommendation for users, data scraping and data visualization.
User Action | Event Meaning | Event Name |
---|---|---|
Scrolling a themed row | User is interested in the theme, e.g. Dramas | genreView |
Placing the mouse over a film to request an overview of content | User is interested in the movie (a drama) thereby showing interest in this category | details |
Clicking the film to request the details of the content | User is more interested in the movie | moreDetails |
Adding the film to my list | User intends to watch the movie later | addToList |
Starting to watch the movie | User purchases the movie | playStart |
userID | contentID | event | date |
---|---|---|---|
1234 | 1 | addToList | 2017-01-01 |
1234 | 2 | genreView | 2017-01-01 |
```js
function onScroll(evt) {
  // Log the genre when a themed carousel row is scrolled
  const el = evt.target
  if (el.matches('.carousel') && el.dataset.genre) logGenre(el.dataset.genre)
}

function onClick() {
  logEvent('details', userId)
}
```
Clicks
- links: user wants to get more information
- see more (expansion clicks): user is interested to know more
- item: user selects the items of preference
- filter: user has a specific search criterion (which can be stored for future use)
- share: user shows interest in an item
- save for later/bookmark: user shows interest in an item
- subscribe/unsubscribe: user wants to be notified of new content

Scroll
- user wants to see more items

Rate: user wants to rate an item. Ratings can be explicit or implicit
- explicit: added manually by the user
- implicit: calculated from the evidence you collected

Vote: user wants to vote/unvote an item he/she likes, agrees/disagrees with, or has a strong opinion about. Sites that use voting are called reputation systems.

Like: user shows a like/dislike on an item

Comment: user wants to leave a comment (positive/negative/neutral)

Review: user wants to leave a review on a product (positive/negative/neutral)

Search
- keywords: user is looking for a particular keyword
- filters: user has search criteria (which may change over time, so it's better to set an "expiry" time on the preference, or store a history of it)

Page duration: what it means depends on the application. A video streaming site might have users staying longer (if the user stayed for the whole movie, they probably loved it!), while an ecommerce site might expect users to stay only for a short duration.

Activity: let's take a music playing app for example
- start playing: the user is interested; that is already a positive
- stop playing: maybe the user thinks it's bad, but this is relative to where the user stops. Stopping at the start could mean it is bad; stopping at the end could mean something else
- resume playing: forget the negative implicit ratings the system has registered; the user was probably distracted by something else (someone calling them, etc.). If playing resumes within 5 minutes, the user probably had something else at hand; if it resumes 24 hours later, it could mean something else
- skipping: if the user skips something in the middle, it's probably not a good sign, unless the user is resuming
- playing it to the end: we have a winner. This is a good sign, unless the user is idle and left the playlist playing
- replaying: this could be a positive indicator, but replaying a song and replaying an educational video have different contexts - the latter might suggest the content is too difficult to follow
Creating a profile for a single user vs an averaged group
- when building a profile, we usually target a group of users and average their behaviours
- this might or might not work, as different users really do have different habits
- but building targeted per-user personalisation would consume too many resources
Duration on Page | What it means |
---|---|
less than 5 seconds | no interest |
more than 5 seconds | interested |
more than 1 minute | very interested |
more than 5 minutes | probably went to get coffee |
more than 10 minutes | interrupted, or went away from the page without following the link |
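The table above can be sketched as a simple mapping from time-on-page to an interest label. The thresholds mirror the table; the label strings are illustrative and would be tuned per application.

```python
def interest_from_duration(seconds):
    """Map time-on-page (seconds) to an interest label, per the table above."""
    if seconds < 5:
        return 'no interest'
    if seconds < 60:
        return 'interested'
    if seconds < 300:
        return 'very interested'
    if seconds < 600:
        return 'probably went to get coffee'
    return 'interrupted or left the page'
```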
Dealing with cold start
- show default data by most popular/latest
- we can also segment them by demographic/age etc
- gray sheep are users whose individual taste does not resemble that of other users, which makes it hard to recommend products to them
- https://www.codeproject.com/Articles/1232150/Building-a-Recommendation-Engine-in-Csharp
- https://stackoverflow.com/questions/2323768/how-does-the-amazon-recommendation-feature-work
- https://medium.com/netflix-techblog/netflix-recommendations-beyond-the-5-stars-part-2-d9b96aa399f5
Case-based reasoning (CBR):
- https://en.m.wikipedia.org/wiki/Case-based_reasoning
- https://www.quora.com/What-are-examples-of-rule-engines-combined-with-machine-learning
Link to another repo here.
- https://en.m.wikipedia.org/wiki/Matrix_factorization_(recommender_systems)
- http://rstudio-pubs-static.s3.amazonaws.com/335300_11d40bf12d8940f78d9661b3c63150dc.html
- https://www.freecodecamp.org/news/singular-value-decomposition-vs-matrix-factoring-in-recommender-systems-b1e99bc73599/
https://github.com/alextanhongpin/go-rate
- http://restfulmvc.com/reddit-algorithm.shtml
- https://www.reddit.com/r/changelog/comments/7hkvjn/what_we_think_about_when_we_think_about_ranking/
- https://www.reddit.com/r/changelog/comments/7j5w9f/keeping_the_home_feed_fresh/
- https://www.reddit.com/r/changelog/comments/7spgg0/best_is_the_new_hotness/
Pseudo code:
score = (P - 1) / (T + 2)^G

where,
P = points of an item (the -1 negates the submitter's own vote)
T = time since submission (in hours)
G = gravity, defaults to 1.8
Effects of gravity (G) and time (T)
- the score decreases as T increases, meaning older items will get lower and lower score
- the score decreases much faster for older items if gravity is increased
Python implementation:

```python
def calculate_score(votes, item_hour_age, gravity=1.8):
    return (votes - 1) / pow(item_hour_age + 2, gravity)
```
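The effects of gravity and time described above can be checked numerically. The vote counts below are arbitrary; the function is repeated so the snippet is self-contained.

```python
def calculate_score(votes, item_hour_age, gravity=1.8):
    return (votes - 1) / pow(item_hour_age + 2, gravity)

fresh = calculate_score(100, 0)                 # brand-new item
day_old = calculate_score(100, 24)              # same points, one day old
heavy = calculate_score(100, 24, gravity=2.5)   # higher gravity decays faster
```

`fresh` is far larger than `day_old`, and raising the gravity shrinks the day-old score further, matching both observations above.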
References:
- https://medium.com/recombee-blog/machine-learning-for-recommender-systems-part-1-algorithms-evaluation-and-cold-start-6f696683d0ed
- https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-recommendation-engine-python/
- https://en.wikipedia.org/wiki/Recommender_system
- https://towardsdatascience.com/learning-to-make-recommendations-745d13883951
- types of Learning to Rank (LTR) algorithms, such as pointwise, pairwise and listwise comparison of ranks
- bayesian personalized ranking (BPR)
- produces a score for each item and then ranks them accordingly
- the difference between rating prediction and ranking is that with ranking, you don't care whether an item has a utility score of a million or one within a rating scale, as long as the score symbolises a position in the rank
- a type of binary classifier
- a function that takes two items and returns an ordering of the two
- in pairwise ranking, the output is optimised to minimise the number of inversions of items compared to the optimal rank of the items
- an inversion is when two items change places
- e.g. Bayesian Personalized Ranking (BPR)
- the king of all LTR subgroups
- looks at the whole ranked list and optimises that
- advantage of listwise ranking is that it intuits that ordering is more important at the top of a ranked list than at the bottom
- pointwise and pairwise algorithms don’t distinguish where on the ranked list you are
- e.g. CoFiRank (collaborative filtering for ranking)
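A sketch of the inversion count that pairwise LTR minimises: it counts the item pairs whose relative order in a predicted ranking disagrees with the optimal ranking. The O(n^2) loop is for illustration only; a merge-sort based count would be used at scale.

```python
def count_inversions(predicted, optimal):
    """Number of item pairs ordered differently in `predicted` vs `optimal`."""
    pos = {item: i for i, item in enumerate(optimal)}  # item -> optimal position
    inversions = 0
    for i in range(len(predicted)):
        for j in range(i + 1, len(predicted)):
            # pair (i, j) is inverted if predicted order contradicts optimal order
            if pos[predicted[i]] > pos[predicted[j]]:
                inversions += 1
    return inversions
```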
Implicit Rating, IR = #event_1 * w_1 + #event_2 * w_2 + #event_3 * w_3 + … + #event_n * w_n
IR: the implicit rating
#event: the number of times the specific event occurred
w: The weights that you set based on previous analysis (and probably tweak again when you start creating recommendations)
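The implicit-rating formula translates directly to code. The event names echo the earlier event table, but the weights and counts below are hypothetical, standing in for values you would tune with analysis.

```python
def implicit_rating(event_counts, weights):
    """IR = sum over events of (#event * weight)."""
    return sum(event_counts.get(event, 0) * w for event, w in weights.items())

# Hypothetical weights, to be tuned against real data
weights = {'genreView': 0.1, 'details': 0.3, 'addToList': 0.8, 'playStart': 1.0}
events = {'genreView': 3, 'details': 2, 'addToList': 1}
ir = implicit_rating(events, weights)  # 3*0.1 + 2*0.3 + 1*0.8 = 1.7
```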
- memory-based: recommender accesses the logs data in real-time
- model-based: signifies that the algorithm aggregates the data beforehand to make it more responsive
The similarity of item 1 and item 2 can be represented as the following:
sim(i_1, i_2), 0=not similar, 1=exact
- similarity is the inverse of distance
- as the distance increases, the similarity goes towards zero
- as the distance decreases, the similarity goes towards one
Data Type | Similarity |
---|---|
unary/binary data | jaccard similarity |
quantitative data | pearson/cosine similarity |
- jaccard
- l1 norm - manhattan distance
- l2 norm - euclidean distance
- pearson
- cosine
- Jaccard similarity index/coefficient compares the members of two sets to see which members are shared and which are distinct.
- It is a measure of similarity for the two sets of data, with a range from 0 to 1.
- The higher the value, the more similar the two populations
J(X, Y) = |X ∩ Y| / |X ∪ Y|
In steps, that is:
- count the number of similar members between both sets (intersection)
- count the total number of unique members in both sets (union)
- divide them to get the jaccard similarity
Observation:
- the value is 1 when both sets have the same members. (a non-empty set that is computed against itself will return 1 too)
- the value is 0 if the sets have no members in common
- the value is 0.5 when the intersection is half the size of the union
- the value is 0 if exactly one set is empty; when both sets are empty the ratio is 0/0, which is undefined (often set to 0 or 1 by convention)
Jaccard Distance
- as opposed to the jaccard similarity, the jaccard distance measures how dissimilar two sets are.
- in set notation, subtract the Jaccard similarity from 1
D(X, Y) = 1 - J(X, Y)
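The intersection/union steps above translate directly to Python. This is a minimal sketch; the return value for two empty sets is a convention choice, as noted above.

```python
def jaccard_similarity(x, y):
    """|X ∩ Y| / |X ∪ Y| for two iterables of hashable members."""
    x, y = set(x), set(y)
    if not x and not y:
        return 0.0  # 0/0 is undefined; returning 0 here is just one convention
    return len(x & y) / len(x | y)

def jaccard_distance(x, y):
    """How dissimilar two sets are: 1 minus the Jaccard similarity."""
    return 1 - jaccard_similarity(x, y)
```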
- aka manhattan distance or taxicab norm
- the idea is that if you want to measure the distance between two street corners in Manhattan, you drive a grid, rather than measure as the crow flies
- the sum of magnitudes of the vectors in space
Given X = [3,4] // x and y coordinate
||X||1 = |3| + |4| = 7
The similarity is just the inverse of the distance. We add one to the denominator to avoid zero-division.
S = 1 / (||X||1 + 1)
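A minimal sketch of the L1 (taxicab) distance between two vectors, plus the distance-to-similarity conversion described above:

```python
def manhattan_distance(x, y):
    """L1 (taxicab) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))

def similarity_from_distance(distance):
    """Inverse of distance; the +1 in the denominator avoids division by zero."""
    return 1 / (distance + 1)
```

For example, the distance between the origin and `[3, 4]` is `|3| + |4| = 7`, giving a similarity of `1/8`.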
- aka euclidean norm
- the distance between two points not travelled by a taxi in Manhattan, but by the crow, going directly from one point to another.
- pythagorean theorem,
a^2 + b^2 = c^2
Given X = [3,4] // x and y coordinate
||X||2 = sqrt(3^2 + 4^2) = 5
- the angle between two points in a vector space
sim(i, j) = (i · j) / (||i|| * ||j||)
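The cosine formula as a small Python sketch, assuming dense, equal-length, non-zero vectors:

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between vectors x and y (assumes neither is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)
```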
Implicit vs explicit ratings
- implicit ratings: deduced from events triggered by the user. reliable if the context is known
- explicit ratings: actual ratings a user inserts. May not be reliable, as they can be biased by social influences.
E.g. if a user says he loves Movie A by rating it 5 stars (explicit), but he purchases a ticket for Movie B screening on the same day (implicit), we know he would probably like Movie B more.
Conversion rate = number of goal achievements / number of visits
Conversion funnel = shows the path users take before they convert. It is called a funnel because it starts out wide, then narrows.
Conversion paths = the sequence of pages and actions a user takes from the time of arrival at the landing page until the conversion.
- view content
- view details
- look inside
- like content
- share content
- sign up for newsletter
- search result click
- campaign link
- add to cart
- add to favorite
- rate
- write review
- buy
With Surprise: using a library instead of implementing one yourself https://realpython.com/build-recommendation-engine-collaborative-filtering/
Learning to rank with Bayesian Personalized Ranking from implicit feedback
- https://researchcode.com/code/1395261448/bpr-bayesian-personalized-ranking-from-implicit-feedback/
- https://medium.com/radon-dev/als-implicit-collaborative-filtering-5ed653ba39fe
- https://towardsdatascience.com/recommender-system-using-bayesian-personalized-ranking-d30e98bba0b9
Recommendation with Singular Value Decomposition (SVD)
- how to build the svd matrix?
- how to perform dimensionality reduction?
- how to add new users?
- how to update the ranking when new items are added
http://rstudio-pubs-static.s3.amazonaws.com/335300_11d40bf12d8940f78d9661b3c63150dc.html http://www.cs.carleton.edu/cs_comps/0607/recommend/recommender/svd.html
Using machine learning to rank https://dec0de.me/2014/10/learning-to-rank-1 https://towardsdatascience.com/learning-to-rank-with-python-scikit-learn-327a5cfd81f https://thenewstack.io/letor-machine-learning-web-search-technique-thats-turned-key-information-retrieval-tool/ http://times.cs.uiuc.edu/course/598f14/l2r.pdf
- https://github.com/practical-recommender-systems/moviegeek/blob/master/recs/bpr_recommender.py
- http://www.gabormelli.com/RKB/Bayesian_Personalized_Ranking_(BPR)_Algorithm
- https://towardsdatascience.com/recommender-system-using-bayesian-personalized-ranking-d30e98bba0b9
- https://medium.com/radon-dev/implicit-bayesian-personalized-ranking-in-tensorflow-b4dfa733c478
- https://github.com/benfred/implicit/blob/master/implicit/bpr.pyx
- https://github.com/alfredolainez/bpr-spark/blob/master/bpr.py
- https://medium.com/seek-blog/learning-to-rank-dogs-87a6c68dda43
- http://fa.bianp.net/blog/2012/learning-to-rank-with-scikit-learn-the-pairwise-transform/
- http://www.alfredo.motta.name/learning-to-rank-with-python-scikit-learn/
- https://www.infoworld.com/article/3259845/introduction-to-learning-to-rank-ltr-search-analysis.html
Say we have a gamification leaderboard of top answerers on a question-and-answer platform. How do we deal with a tie at the cutoff position? E.g. we are supposed to take the top 10, but the 10th and 11th person have the same score.
Take the average timestamp at which they answered; it should be lower (earlier) for those who answered first. So: first come, first served.
As an improvement, just compare the timestamp of each person's last answer.
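The tie-breaking rule above can be sketched as a sort with a composite key. The `(user, score, avg_answer_timestamp)` record shape is an assumption for illustration.

```python
def top_n(records, n=10):
    """records: (user, score, avg_answer_timestamp) tuples.

    Sort by score descending; break ties with the earlier average
    timestamp (first come, first served).
    """
    return sorted(records, key=lambda r: (-r[1], r[2]))[:n]

# 'a' and 'b' are tied on score; 'b' answered earlier on average, so 'b' ranks first
records = [('a', 10, 100), ('b', 10, 50), ('c', 8, 10)]
top2 = top_n(records, 2)
```

The "improvement" variant would simply carry each person's last-answer timestamp in the third field instead of the average.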