- Develop a recommendation system that suggests businesses to users based on location, star ratings, category, reviews, and business hours.
- Build the recommendation system on Apache Spark.
The JSON files have been converted into CSV using the sample code provided by Yelp at https://github.com/Yelp/dataset-examples. The converted CSV files cannot be hosted on GitHub due to their large size.
- Converting all files from JSON to CSV for easier reading, utilizing the existing resources. [ref: https://github.com/Yelp/dataset-examples]
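Yelp's converter script at the repository above performs the official conversion; the following is only a minimal sketch of the same idea, assuming newline-delimited JSON (one object per line) as in the Yelp dataset. The function name `json_lines_to_csv` is illustrative, not the project's code.

```python
import csv
import json

def json_lines_to_csv(json_path, csv_path):
    """Convert a newline-delimited JSON file (one object per line) to CSV.

    Sketch of what Yelp's json_to_csv_converter does; nested fields
    (e.g. 'hours') are kept as JSON strings rather than flattened.
    """
    with open(json_path, encoding="utf-8") as f:
        rows = [json.loads(line) for line in f if line.strip()]
    if not rows:
        return
    # Union of all keys, so rows with missing fields still line up.
    fieldnames = sorted({key for row in rows for key in row})
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for row in rows:
            writer.writerow({
                k: json.dumps(v) if isinstance(v, (dict, list)) else v
                for k, v in row.items()
            })
```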
- Loading business data and dropping the variables ['type','postal_code','latitude','longitude','review_count','attributes','address','is_open','hours','stars']
- Filtering the data to contain only state 'NC'
- Loading review data, keeping only business id, user id, stars, and the funny, cool, and useful vote counts
- Merging both datasets by business id
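The loading, dropping, filtering, and merging steps above could look roughly like this in pandas (file paths, the `preprocess` helper, and column lists are assumptions based on the Yelp CSV schema, not the project's actual code):

```python
import pandas as pd

# Business columns dropped during pre-processing, per the steps above.
DROP_COLS = ['type', 'postal_code', 'latitude', 'longitude', 'review_count',
             'attributes', 'address', 'is_open', 'hours', 'stars']

def preprocess(business_csv, review_csv):
    """Load, reduce, filter (NC only), and merge the Yelp CSV exports."""
    business = pd.read_csv(business_csv)
    business = business.drop(columns=DROP_COLS, errors='ignore')
    business = business[business['state'] == 'NC']
    reviews = pd.read_csv(review_csv,
                          usecols=['business_id', 'user_id', 'stars',
                                   'funny', 'cool', 'useful'])
    # Inner merge keeps only reviews of the filtered NC businesses.
    return reviews.merge(business, on='business_id', how='inner')
```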
- Updating the value of the rating 'stars' by the following formula (it definitely introduces some bias):

      if stars >= 3:
          stars = stars + (0.1*funny + 0.6*useful + 0.3*cool) / (funny + useful + cool + 1)
      else:
          stars = stars - (0.1*funny + 0.6*useful + 0.3*cool) / (funny + useful + cool + 1)
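As a sketch, the adjustment can be written as a small function; `adjust_rating` is an illustrative name, not necessarily the one used in the project:

```python
def adjust_rating(stars, funny, useful, cool):
    """Adjust a review's star rating using its vote counts.

    Ratings of 3 or more are nudged up, lower ratings nudged down;
    the weights 0.1/0.6/0.3 favour 'useful' votes, and the +1 in the
    denominator keeps reviews with zero votes unchanged.
    """
    bonus = (0.1 * funny + 0.6 * useful + 0.3 * cool) / (funny + useful + cool + 1)
    return stars + bonus if stars >= 3 else stars - bonus
```

Note that the adjustment is bounded: the bonus is always below the largest weight (0.6), so an adjusted rating stays within roughly half a star of the original.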
- Use Apache Spark to store and transform data in parallel (via the `sc` SparkContext variable)
- Use Alternating Least Squares (ALS) as the collaborative filtering algorithm to recommend businesses
- Apply RMSE (root mean squared error) to evaluate the model
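The project trains ALS through Spark MLlib; purely as an illustration of what alternating least squares and RMSE do, here is a toy dense NumPy version (this is not the project's code, and `als`/`rmse` are illustrative names):

```python
import numpy as np

def rmse(pred, actual, mask):
    """Root mean squared error over the observed (mask=True) entries."""
    err = (pred - actual)[mask]
    return np.sqrt(np.mean(err ** 2))

def als(ratings, mask, rank=2, reg=0.1, iters=15, seed=0):
    """Toy alternating least squares on a dense user x business matrix.

    Spark's ALS performs the same factorization on a distributed,
    sparse representation; this dense loop is for illustration only.
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))
    V = rng.normal(scale=0.1, size=(n_items, rank))
    eye = reg * np.eye(rank)
    for _ in range(iters):
        for u in range(n_users):      # fix V, solve for each user vector
            obs = mask[u]
            A = V[obs].T @ V[obs] + eye
            b = V[obs].T @ ratings[u, obs]
            U[u] = np.linalg.solve(A, b)
        for i in range(n_items):      # fix U, solve for each item vector
            obs = mask[:, i]
            A = U[obs].T @ U[obs] + eye
            b = U[obs].T @ ratings[obs, i]
            V[i] = np.linalg.solve(A, b)
    return U, V
```

In Spark the equivalent step is roughly `ALS.train(ratings_rdd, rank, iterations)` from `pyspark.mllib.recommendation`, with the trained model evaluated by the same RMSE metric on held-out ratings.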
|_ Work: where all code resides.
|_ Documents: All the documents provided by the professor.
|_ Sample_Data: The filtered data files and sample initial files.
├── Work\
| ├── ReadData.py
| ├── business_recommender.ipynb
├── Documents\
| ├── Business Questions.txt
| ├── FinalProjectProblemsProposal.pdf
├── Sample_Data\
| ├── business.txt
| ├── checkin.txt
| ├── merged_BR3.csv
| ├── tip.txt
Work Folder Details:
- ReadData.py reads the CSV files, merges the data, reduces its dimensionality, and applies data modifications such as the rating adjustment. After pre-processing completes, the final dataframe is stored in a file so the processing does not have to be repeated. Check the data file paths in the script before running it.
- Before running the notebook (business_recommender.ipynb), make sure Spark is working properly on your machine. Jupyter Notebook is required to run it. From the terminal, invoke:

      IPYTHON_OPTS="notebook" $SPARK_HOME/bin/pyspark

  The code can then be executed using the Run button.
- Anshuman Goel (agoel5@ncsu.edu)
- Siddharth Heggadahalli (ssheggad@ncsu.edu)
- Huy Tu (hqtu@ncsu.edu)