Link prediction using proximity-based methods
This project was completed for COMP90051 (Statistical Machine Learning), taken in Semester 2, 2020 at the University of Melbourne.
- Ranked 14th out of 132 teams. https://www.kaggle.com/c/comp90051-2020-sem2-proj1/leaderboard
Of the many approaches we tried, the description below covers our final one. We implemented the following features:
- Jaccard distance
- Cosine distance
- Adamic-Adar index
- Preferential attachment
- Resource allocation
- Other features: the followers and followees of the source and of the sink, and their common followers/followees
We referred to some existing implementations on GitHub, but most features were easy to implement directly from their formulas using Python dictionaries. We mainly used two dictionaries: 1) one that maps each node to the nodes it follows, and 2) one that maps each node to the nodes that follow it.
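As a rough illustration of how the two dictionaries lead to the features listed above, here is a minimal sketch. The function names (`build_dicts`, `pair_features`) are hypothetical, and the choice to treat a node's neighbourhood as the union of its followers and followees is an assumption, not necessarily what the project code does.

```python
import math
from collections import defaultdict


def build_dicts(edges):
    """Build the two lookup dictionaries from (source, sink) edges."""
    followees = defaultdict(set)  # node -> nodes it follows
    followers = defaultdict(set)  # node -> nodes that follow it
    for src, dst in edges:
        followees[src].add(dst)
        followers[dst].add(src)
    return followees, followers


def pair_features(u, v, followees, followers):
    """Proximity features for a candidate edge (u, v)."""

    def neighbours(n):
        # Assumption: neighbourhood = followers plus followees of n.
        return followees.get(n, set()) | followers.get(n, set())

    nu, nv = neighbours(u), neighbours(v)
    common = nu & nv
    union = nu | nv

    # Similarities fall back to 0 when a neighbourhood is empty.
    jaccard_sim = len(common) / len(union) if union else 0.0
    denom = math.sqrt(len(nu) * len(nv))
    cosine_sim = len(common) / denom if denom else 0.0

    # Degrees of the common neighbours; nodes with fewer than two
    # neighbours are skipped because log(1) = 0 would divide by zero.
    degrees = [len(neighbours(z)) for z in common]
    degrees = [d for d in degrees if d > 1]

    return {
        "jaccard_distance": 1.0 - jaccard_sim,
        "cosine_distance": 1.0 - cosine_sim,
        "adamic_adar": sum(1.0 / math.log(d) for d in degrees),
        "resource_allocation": sum(1.0 / d for d in degrees),
        "preferential_attachment": len(nu) * len(nv),
    }
```

For example, `pair_features(2, 4, *build_dicts([(1, 2), (1, 3), (2, 3), (4, 3), (4, 1)]))` computes the features for the candidate edge from node 2 to node 4. A directed variant could instead compare only the followees of the source with the followers of the sink.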
- XGBoost: a strong choice for classification problems. With the objective set to 'binary:logistic', it directly outputs the probability of the positive label.
- Training data: random sampling of 50k positive and 50k negative examples
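The balanced sampling step could look roughly like the sketch below. The function name `sample_train_pairs` and its signature are hypothetical (this is not the project's own sampling function), and it assumes that any node pair absent from the known edge list can be treated as a negative example.

```python
import random


def sample_train_pairs(edges, n_pos, n_neg, seed=0):
    """Sample n_pos existing edges and n_neg non-edges with labels."""
    rng = random.Random(seed)
    edge_set = set(edges)
    nodes = sorted({n for e in edge_set for n in e})

    # Guard against an endless loop when too few non-edges exist.
    real_edges = sum(1 for u, v in edge_set if u != v)
    max_neg = len(nodes) * (len(nodes) - 1) - real_edges
    if n_pos > len(edge_set) or n_neg > max_neg:
        raise ValueError("requested more samples than are available")

    # Positive examples: edges that are known to exist.
    positives = rng.sample(sorted(edge_set), n_pos)

    # Negative examples: random ordered pairs that are not known edges.
    negatives = set()
    while len(negatives) < n_neg:
        u, v = rng.choice(nodes), rng.choice(nodes)
        if u != v and (u, v) not in edge_set:
            negatives.add((u, v))

    pairs = positives + sorted(negatives)
    labels = [1] * n_pos + [0] * n_neg
    return pairs, labels
```

Rejection sampling like this is cheap when the graph is sparse, since almost every random pair is a non-edge. The resulting pairs would then be turned into feature rows and passed to the classifier together with the labels.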
To run quickly, reduce the sample sizes used in def get_trained():
e.g. get_trainset(500, 500) instead of get_trainset(50000, 50000)
- Final (Private leaderboard) score: 0.89480