INF-553 2019 Fall, by Prof. Anna Farzindar
The homework code is not allowed to be published, so only tips for the homework are shared here. I hope they help you.
[TOC]
- Lec1: Introduction & Large-Scale File System & MapReduce1
- Lec2: MapReduce2 & 3
- Lec3: Find Frequent Itemsets 1
- Lec4: Frequent Items 2 & 3
- Lec5: Find Similar Sets 1 & 2
- Lec6: Find Similar Sets 3
- Lec7: Recommender System 1 & 2
- Lec8: Recommender System 3 & 4
- Lec9: Social Networks 1
- Lec10: Social Networks 2 & Clustering
- Lec11: Link Analysis
- Lec12: Mining Data Streams
Environment: macOS Mojave 10.14.5
Requirements: Python 3.6, Scala 2.11, and Spark 2.3.3
Apart from the requirements above, you may only use standard Python libraries.
Before installing, make sure you have both Anaconda and Homebrew installed.
Download the Java JDK from the link, and open the dmg file to install it.
Then add `JAVA_HOME` in the `~/.zshrc` file (if using bash, just add this to the `~/.bash_profile` file):
export JAVA_HOME="/Library/Java/JavaVirtualMachines/jdk1.8.0_221.jdk/Contents/Home"
Save the changes and activate them in the terminal:
source ~/.zshrc
or for bash:
source ~/.bash_profile
Some other installation tutorials suggest setting `JAVA_HOME` to `usr/lib/jvm/xxx.jdk`, but this didn't work on my system.
Now check whether you have installed the Java JDK successfully in the terminal:
java -version
I originally tried to install Java via `brew`:
brew cask install java
which installs Java 12. However, there seemed to be a problem between Java 12 and Spark 2.3.3, so I finally used the Java 8 JDK (this problem was mentioned here).
Open the download link and choose the version you want. Here I chose `spark-2.3.3-bin-hadoop2.7.tgz`.
When the download finishes, unpack it and move it to your `/opt` folder:
tar -xzf spark-2.3.3-bin-hadoop2.7.tgz
mv spark-2.3.3-bin-hadoop2.7 /opt/spark-2.3.3
Create a symbolic link (assuming you have multiple spark versions):
sudo ln -s /opt/spark-2.3.3 /opt/spark
Then add the Spark path in the `~/.zshrc` or `~/.bash_profile` file and activate it:
export SPARK_HOME="/opt/spark"
export PATH="$SPARK_HOME/bin:$PATH"
Now run the example to check whether the installation succeeded:
cd /opt/spark/bin
# use grep to get clean output result
run-example SparkPi 2>&1 | grep "Pi is"
Create a new conda environment:
conda create -n inf553 python=3.6
and activate the environment when the creation finishes:
conda activate inf553
Now install PySpark using `pip`:
pip install pyspark==2.3.3
I first tried to install `pyspark==2.4.4`, but this causes an incompatibility with the Spark JVM libraries since Spark 2.3.3 is used! If you use `pyspark==2.4.4`, running the test file shown here ends with an error:
Traceback (most recent call last):
  File "test.py", line 5, in <module>
    numAs = logData.filter(lambda line: 'a' in line).count()
  File "//anaconda3/envs/inf553/lib/python3.6/site-packages/pyspark/rdd.py", line 403, in filter
    return self.mapPartitions(func, True)
  File "//anaconda3/envs/inf553/lib/python3.6/site-packages/pyspark/rdd.py", line 353, in mapPartitions
    return self.mapPartitionsWithIndex(func, preservesPartitioning)
  File "//anaconda3/envs/inf553/lib/python3.6/site-packages/pyspark/rdd.py", line 365, in mapPartitionsWithIndex
    return PipelinedRDD(self, f, preservesPartitioning)
  File "//anaconda3/envs/inf553/lib/python3.6/site-packages/pyspark/rdd.py", line 2514, in __init__
    self.is_barrier = prev._is_barrier() or isFromBarrier
  File "//anaconda3/envs/inf553/lib/python3.6/site-packages/pyspark/rdd.py", line 2414, in _is_barrier
    return self._jrdd.rdd().isBarrier()
  File "//anaconda3/envs/inf553/lib/python3.6/site-packages/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "//anaconda3/envs/inf553/lib/python3.6/site-packages/py4j/protocol.py", line 332, in get_return_value
    format(target_id, ".", name, value))
py4j.protocol.Py4JError: An error occurred while calling o23.isBarrier. Trace:
py4j.Py4JException: Method isBarrier([]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    at py4j.Gateway.invoke(Gateway.java:274)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Add these lines to the `~/.zshrc` or `~/.bash_profile` file to enable Python/IPython from the conda environment to use PySpark:
export PYSPARK_PYTHON="/anaconda3/envs/inf553/bin/python"
export PYSPARK_DRIVER_PYTHON="/anaconda3/envs/inf553/bin/ipython"
Then activate the changes:
source ~/.zshrc
or
source ~/.bash_profile
Now run `test.py` using `spark-submit test.py` to see the result. The `test.py` script is shown below:
# test.py
from pyspark import SparkContext
sc = SparkContext('local', 'test')
logFile = "file:///opt/spark/README.md"
logData = sc.textFile(logFile, 2).cache()
numAs = logData.filter(lambda line: 'a' in line).count()
numBs = logData.filter(lambda line: 'b' in line).count()
print('Lines with a: %s, Lines with b: %s' % (numAs, numBs))
The output is:
Lines with a: 61, Lines with b: 30
There may be a lot of INFO log messages printed in the output when running `spark-submit`. We can hide them.
Go to the Spark configuration folder:
cd /opt/spark/conf
Copy the `log4j.properties.template` file and edit the copy:
cp log4j.properties.template log4j.properties
vim log4j.properties
Then you can see that line 19 reads
log4j.rootCategory=INFO, console
Change `INFO` to `WARN` and save the file. Then it is done!
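Alternatively, if you only want to quieten the logs for a single application without touching the global configuration, Spark's `SparkContext.setLogLevel()` can be called from the driver code. A minimal sketch (the script name and the toy job are made up for illustration):

```python
# quiet_logs.py -- hypothetical example, not part of the homework
from pyspark import SparkContext

sc = SparkContext('local', 'quiet-logs')
sc.setLogLevel('WARN')  # suppress INFO messages for this application only
print(sc.parallelize(range(10)).sum())
```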
Download Scala 2.11.8 from the link, then run:
sudo tar -zxf scala-2.11.8.tgz -C /usr/local
cd /usr/local/
sudo mv ./scala-2.11.8/ ./scala
Add this line to `~/.zshrc` or `~/.bash_profile`:
export PATH="/usr/local/scala/bin:$PATH"
Now run `scala` in the terminal to check the installation.
It seems there will be problems if you install Scala 2.11.12.
| | Setting | Duration Benchmark (sec) | Local Duration (sec) | Result Benchmark | Local Result |
|---|---|---|---|---|---|
| HW2 Task 1 | Case 1: Support=4; Case 2: Support=9 | Case 1: <=200; Case 2: <=100 | Case 1: 7; Case 2: 8 | | |
| HW2 Task 2 | Filter Threshold=20, Support=50 | <=500 | 14 | | |
| HW3 Task 1 | Jaccard similarity | <=120 | 12 | Recall>=0.95; Precision=1.0 | Recall=0.99; Precision=1.0 |
| HW3 Task 2 | Model-Based: rank=3, lambda=0.2, iterations=15 | Model-Based: <=50; User-Based: <=180 | Model-Based: 17; User-Based: 13 | Model-Based RMSE: 1.30; User-Based RMSE: 1.18 | Model-Based RMSE: 1.066; User-Based RMSE: 1.09 |
| HW4 Task 1 | | <=500 | 210 | | |
| HW4 Task 2 | | <=500 | (Unrecorded) | | |
- Task 1: use `user.json` (a minimal JSON-loading sketch follows this list)
- Task 2: use `user.json`
- Task 3: use `review.json` and `business.json`
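Since all three tasks read the Yelp JSON dumps, here is a minimal sketch of loading one of these files into an RDD. It assumes the file is line-delimited JSON (one record per line) and that a `user_id` field exists; the path and field name are placeholders, not instructions from the assignment:

```python
# load_json_sketch.py -- minimal sketch, assuming line-delimited JSON input
import json
from pyspark import SparkContext

sc = SparkContext('local[*]', 'hw1-sketch')
# each line of user.json is assumed to be one JSON object, so parse line by line
users = sc.textFile('user.json').map(json.loads)
# e.g. count the distinct users ('user_id' is assumed to be the field name)
print(users.map(lambda u: u['user_id']).distinct().count())
```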
- Task 1: use the A-Priori & SON algorithms to find all possible frequent itemsets (see the SON sketch after this list)
  - for `small2.csv` case 1 with `support=4`, local tests show that `minPartition=3` and `A_priori_short_basket()` work better. The local test takes 7 seconds.
  - for `small1.csv` case 2 with `support=9`, local tests show that `minPartition=2` and `A_priori_long_basket()` work better. The local test takes 8 seconds.
  - `A_priori_long_basket()` optimizes the process of generating itemsets of size $k+1$ from itemsets of size $k$
- Task 2:
  - collect frequent singletons as well as frequent pairs using brute force (emit all possible singletons/pairs in each basket, then filter using `support`)
  - then delete all baskets with `size=1` or `size=2` (around 16000 such baskets are deleted), which helps to speed up the later steps
  - use A-Priori only to find candidate itemsets with `size>=3`
  - use `A_priori_long_basket()`; local tests show `minPartition=3` works better. The local test takes around 17 seconds.
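To make the two-pass structure concrete, here is a minimal SON sketch on top of RDDs. It is illustrative only: the toy baskets, the support value, and the brute-force counting inside `local_frequent()` (a simplified stand-in for the `A_priori_short_basket()` / `A_priori_long_basket()` helpers mentioned above) are all made up, not the homework code:

```python
# son_sketch.py -- illustrative SON sketch, not the homework solution
from itertools import combinations
from pyspark import SparkContext

def local_frequent(partition, total_count, support):
    # Pass-1 worker: find locally frequent itemsets in one partition using a
    # scaled-down threshold support * (partition size / total basket count)
    baskets = list(partition)
    threshold = support * len(baskets) / total_count
    counts = {}
    for basket in baskets:
        for size in (1, 2):  # singletons and pairs only, for brevity
            for itemset in combinations(sorted(basket), size):
                counts[itemset] = counts.get(itemset, 0) + 1
    return [itemset for itemset, c in counts.items() if c >= threshold]

if __name__ == '__main__':
    sc = SparkContext('local[*]', 'son-sketch')
    support = 3
    baskets = sc.parallelize([{'a', 'b', 'c'}, {'a', 'b'}, {'a', 'c'},
                              {'b', 'c'}, {'a', 'b', 'c', 'd'}], 2)
    total = baskets.count()

    # Pass 1: candidates = union of the locally frequent itemsets of every partition
    candidates = (baskets
                  .mapPartitions(lambda p: local_frequent(p, total, support))
                  .distinct()
                  .collect())

    # Pass 2: count the true support of each candidate over all baskets
    frequent = (baskets
                .flatMap(lambda b: [(c, 1) for c in candidates if set(c).issubset(b)])
                .reduceByKey(lambda x, y: x + y)
                .filter(lambda kv: kv[1] >= support)
                .collect())
    print(sorted(frequent))
```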
- Task 1: min-hash & LSH to find similar `business_id` pairs (see the min-hash/LSH sketch after this list)
  - Jaccard similarity:
    - Use `mapPartitions()` instead of `.map()` for most `RDD` operations to speed things up
    - local tests show the optimal `numPartitions=5` when using `sc.parallelize()` to load the input file (sometimes `parallelize()` cannot handle an input file that is too large), with
      - pure computation time of 8 seconds for the whole process (load data $\to$ min-hash $\to$ LSH $\to$ compute similarity $\to$ write result),
      - script running time of 12.4 seconds (CPU time, measured with `time spark-submit script.py`),
      - precision=1.0,
      - recall=0.99.
    - local `numPartitions`-vs-data-load-method experiment results are in the task 1 experiment log file (experiment script)
    - Question: `sc.textFile()` with a customized `minPartitions` works similarly to `sc.parallelize()` with a customized `numPartitions`, so what's the difference? (not clear after searching on Google)
  - Cosine similarity: optional, not implemented
- Task 2: collaborative filtering. For details and tips, see the implementation description file.
  - Model-based
  - User-based (see the similarity sketch after this list)
    - Use the global average when calculating similarity rather than the co-rated item average! (lower RMSE in this case)
    - `statistics.mean(list)` is slower than `sum(list)/len(list)`! After replacing `statistics.mean()` with `sum(list)/len(list)`, the local user-based test time is around 70s (150s before the replacement)
    - Surprisingly, it seems that if we simply use `user_avg` as the prediction for all pairs, RMSE < 1.07 on `yelp_test.csv`?!
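For Task 1, here is a minimal, self-contained min-hash + LSH sketch. Everything in it is an illustrative assumption (the hash family, the number of hash functions and bands, and the toy characteristic sets), not the parameters or code used in the homework:

```python
# minhash_lsh_sketch.py -- illustrative sketch only; hash functions, band/row
# counts and the toy data are arbitrary placeholders
import random
from collections import defaultdict
from itertools import combinations

random.seed(0)
NUM_HASHES, BANDS = 12, 6          # 12 signature rows split into 6 bands of 2 rows
ROWS_PER_BAND = NUM_HASHES // BANDS
PRIME = 4294967311                 # a prime larger than any row index we will see

# business_id -> set of user indices that rated it (toy data)
business_users = {
    'b1': {0, 1, 2, 5},
    'b2': {0, 1, 2, 6},
    'b3': {3, 4, 7},
}

# random hash functions of the form h(x) = (a*x + b) % PRIME
hash_funcs = [(random.randint(1, PRIME - 1), random.randint(0, PRIME - 1))
              for _ in range(NUM_HASHES)]

def minhash_signature(users):
    # for each hash function, keep the minimum hashed user index
    return [min((a * u + b) % PRIME for u in users) for a, b in hash_funcs]

signatures = {biz: minhash_signature(users) for biz, users in business_users.items()}

# LSH: businesses whose signatures agree on all rows of some band become candidates
buckets = defaultdict(list)
for biz, sig in signatures.items():
    for band in range(BANDS):
        chunk = tuple(sig[band * ROWS_PER_BAND:(band + 1) * ROWS_PER_BAND])
        buckets[(band, chunk)].append(biz)

candidates = set()
for members in buckets.values():
    for pair in combinations(sorted(members), 2):
        candidates.add(pair)

# verify each candidate pair with the exact Jaccard similarity
for b1, b2 in sorted(candidates):
    s1, s2 = business_users[b1], business_users[b2]
    print(b1, b2, round(len(s1 & s2) / len(s1 | s2), 3))
```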
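And for the Task 2 note about averages, a small sketch of the "global average" variant of Pearson-style user similarity (the toy ratings and function names are made up for illustration):

```python
# user_similarity_sketch.py -- illustrative only; the rating dicts are toy data
from math import sqrt

# user -> {business: rating} (toy data)
ratings = {
    'u1': {'b1': 5.0, 'b2': 3.0, 'b3': 4.0},
    'u2': {'b1': 4.0, 'b2': 2.0, 'b4': 5.0},
}

def global_avg(user):
    vals = ratings[user].values()
    return sum(vals) / len(vals)

def similarity(u, v):
    # Pearson-style similarity, centering each rating on the user's GLOBAL
    # average rather than the average over co-rated items only
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    au, av = global_avg(u), global_avg(v)
    num = sum((ratings[u][b] - au) * (ratings[v][b] - av) for b in common)
    den = (sqrt(sum((ratings[u][b] - au) ** 2 for b in common)) *
           sqrt(sum((ratings[v][b] - av) ** 2 for b in common)))
    return num / den if den else 0.0

print(round(similarity('u1', 'u2'), 3))
```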
| | Submission 1 | Submission 2 | Submission 3 |
|---|---|---|---|
Val RMSE | 1.019139997 | 1.002121513 | 0.9807033701 |
Test RMSE | 1.015982788 | 1.000238924 | 0.9793494612 |
Val Duration | 16s | 42s | 241s |
Method | Use a weighted average rating over users as well as businesses, then combine the two with 1:1 weights. | Use the global average rating over users as well as businesses, then combine the two with 1:1 weights. | Both user.json and business.json are used to generate user_features.csv and business_features.csv for the later model. Then the business features 'business_star', 'latitude', 'longitude', 'business_review_cnt' and the user features 'user_review_cnt', 'useful', 'cool', 'funny', 'fans', 'user_avg_star' are used to train a Gradient Boosting model. |
Error (>=0 and <1) | 96910 | 97892 | 102013 |
Error (>=1 and <2) | 37451 | 36978 | 32998 |
Error (>=2 and <3) | 7051 | 6682 | 6229 |
Error (>=3 and <4) | 632 | 492 | 804 |
Error (>=4) | 0 | 0 | 0 |
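For reference, a minimal sketch of the 1:1 combination idea behind the first two submissions (closest to Submission 2's simple averages). The CSV layout, file name, and the fallback to the global mean are assumptions for illustration, not the actual submission code:

```python
# avg_blend_sketch.py -- illustrative baseline only
import csv
from collections import defaultdict

def load_ratings(path):
    # assumed CSV layout: user_id, business_id, stars (with a header row)
    with open(path) as f:
        reader = csv.reader(f)
        next(reader)  # skip header
        return [(u, b, float(r)) for u, b, r in reader]

train = load_ratings('yelp_train.csv')  # hypothetical file name

user_sum, user_cnt = defaultdict(float), defaultdict(int)
biz_sum, biz_cnt = defaultdict(float), defaultdict(int)
for u, b, r in train:
    user_sum[u] += r; user_cnt[u] += 1
    biz_sum[b] += r; biz_cnt[b] += 1
global_avg = sum(r for _, _, r in train) / len(train)

def predict(u, b):
    # fall back to the global average for unseen users/businesses
    user_avg = user_sum[u] / user_cnt[u] if user_cnt[u] else global_avg
    biz_avg = biz_sum[b] / biz_cnt[b] if biz_cnt[b] else global_avg
    return 0.5 * user_avg + 0.5 * biz_avg  # 1:1 combination of the two averages

print(predict('some_user_id', 'some_business_id'))
```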
- Top 3 test RMSE in the class:
- 0.9750778569
- 0.9773295973
- 0.9784191539