The Spark version of the basemods pipeline in SMRT-Analysis (v2.3.0)
The operating system of every node must be Linux.
-
For now, we are using SMRT-Analysis v2.3.0.
2.1 Download smrtanalysis.tar.gz.
2.2 Copy smrtanalysis.tar.gz to each worker node of your Spark cluster, then decompress it to a desired directory (a sketch automating this for the whole cluster follows the notes below).
# Suppose you want to decompress smrtanalysis.tar.gz to /home/hadoop;
# use the following command:
tar -xhzvf smrtanalysis.tar.gz -C /home/hadoop
Notes:
(1). The decompressed location of smrtanalysis.tar.gz must be the same on all worker nodes. Don't forget to set the variable SMRT_ANALYSIS_HOME in parameters.conf. (For example, if you have decompressed smrtanalysis.tar.gz to /home/hadoop on all worker nodes, you must set SMRT_ANALYSIS_HOME=/home/hadoop/smrtanalysis in parameters.conf.)
(2). To preserve the symbolic links in the tar.gz file, the "-h" option must be used with the tar command.
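Copying and decompressing the archive by hand on every worker gets tedious. Below is a minimal sketch of one way to automate it from the master node, assuming passwordless SSH and hypothetical worker hostnames (worker1, worker2, worker3); substitute the nodes and target directory of your own cluster.
# Hypothetical worker hostnames; replace with your own.
for node in worker1 worker2 worker3; do
    scp smrtanalysis.tar.gz ${node}:/home/hadoop/
    # "-h" preserves the symbolic links in the archive (see note (2) above)
    ssh ${node} "tar -xhzvf /home/hadoop/smrtanalysis.tar.gz -C /home/hadoop"
done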
-
If the nodes (both master and workers) in your cluster don't have Python 2.x installed, you should install it. Install numpy, h5py, paramiko, and pbcore in your Python environment, and install py4j and pyspark as well if you need to (an example pip invocation follows the notes below).
Notes:
(1). It is better to use sudo when using pip install to install third-party Python packages.
(2). Python 2.7.13 (or higher) is strongly recommended (though not required) because of the bug described in issue #5.
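As a concrete example, installing the packages listed above with pip might look like this (sudo per note (1); drop it if you are installing into a user-level Python environment):
sudo pip install numpy h5py paramiko pbcore
# only needed if py4j/pyspark are not already provided by your Spark installation
sudo pip install py4j pyspark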
-
If the scripts in the basemods_spark code you downloaded don't have execute permissions, make them executable:
chmod +x basemods_spark/scripts/exec_sawriter.sh
chmod +x basemods_spark/scripts/baxh5_operations.sh
chmod +x basemods_spark/scripts/cmph5_operations.sh
chmod +x basemods_spark/scripts/mods_operations.sh
-
Copy your data to the master node of your Hadoop/Spark cluster.
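For example, if the data sits on another machine, a hedged scp invocation might look like this (the paths and hostname are hypothetical; adjust them to your setup):
scp -r /path/to/your/data hadoop@master-node:/home/hadoop/data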
-
Set the parameters in the configuration file 'parameters.conf'.
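As a minimal sketch, the part of parameters.conf discussed in step 2 would look like this under the /home/hadoop layout assumed above (the file itself lists and documents the remaining parameters):
# where smrtanalysis.tar.gz was decompressed on every worker node
SMRT_ANALYSIS_HOME=/home/hadoop/smrtanalysis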
-
(1) Start HDFS & YARN
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
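As a quick sanity check, jps (shipped with the JDK) should list the daemons once they are up; roughly, you would expect NameNode and ResourceManager on the master node, and DataNode and NodeManager on the workers:
# run on each node to list the running Hadoop daemons
jps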
(2) Start Spark
$SPARK_HOME/sbin/start-all.sh
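After start-all.sh, jps should additionally show a Master process on the master node and a Worker process on each worker:
# verify the Spark standalone daemons are running
jps
The standalone master also serves a web UI, on port 8080 by default, at http://<master-host>:8080.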
(3) Submit your job
Standalone mode (from master node):
$SPARK_HOME/bin/spark-submit basemods_spark_runner.py
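If your Spark configuration does not already point at the standalone master, you can pass the master URL explicitly; 7077 is the standalone master's default port, and <master-host> is a placeholder for your master's hostname:
$SPARK_HOME/bin/spark-submit --master spark://<master-host>:7077 basemods_spark_runner.py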
Yarn client mode (from master node):
$SPARK_HOME/bin/spark-submit --master yarn \
    --deploy-mode client \
    --driver-memory 50g \
    basemods_spark_runner_yarn_client.py
Yarn cluster mode:
$SPARK_HOME/bin/spark-submit --master yarn \
    --deploy-mode cluster \
    --driver-memory 50g \
    basemods_spark_runner_yarn_cluster.py
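In cluster mode the driver runs inside YARN rather than in your terminal, so to inspect its output afterwards you can use YARN's log aggregation; the application id is printed by spark-submit and shown in the ResourceManager UI:
# replace <application_id> with the id reported for your job
yarn logs -applicationId <application_id>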