basemods on spark

Overview

The Spark version of the basemods pipeline in SMRT-Analysis (v2.3.0).

Set up the environment

All nodes in the cluster must run Linux.

Hadoop/Spark

Set up a Hadoop/Spark cluster.

SMRT-Analysis

For now, we are using SMRT-Analysis v2.3.0.

2.1 Download smrtanalysis.tar.gz.

2.2 Copy smrtanalysis.tar.gz to each worker node of your Spark cluster, then decompress it to the desired directory.
```
# For example, to decompress smrtanalysis.tar.gz to /home/hadoop,
# use the following command:
tar -xhzvf smrtanalysis.tar.gz -C /home/hadoop
```
Notes:

(1). smrtanalysis.tar.gz must be decompressed to the same location on all worker nodes. Don't forget to set the variable SMRT_ANALYSIS_HOME in parameters.conf. For example, if you decompressed smrtanalysis.tar.gz to /home/hadoop on all worker nodes, set SMRT_ANALYSIS_HOME=/home/hadoop/smrtanalysis in parameters.conf.

(2). To preserve symbolic links in the tar.gz file, the "-h" option must be used with the tar command.

Python 2.x and required Python libraries

If the nodes (both master and workers) in your cluster don't have Python 2.x installed, install it. Then install numpy, h5py, paramiko, and pbcore in your Python environment. Also install py4j and pyspark in your Python environment if you need to.

Notes:

(1). It is better to use sudo when running pip install to install third-party Python packages.

(2). Python 2.7.13 (or higher) is strongly recommended, though not required, because of the bug described in issue #5.
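
One way to confirm that the libraries listed above are importable on a node is a small check script. The following is a minimal sketch and not part of basemods_spark: the helper name `missing_modules` is hypothetical, and it assumes each package's import name matches the name given in this README.

```python
from __future__ import print_function

# Packages named in this README; import names are assumed to match.
REQUIRED = ["numpy", "h5py", "paramiko", "pbcore", "py4j", "pyspark"]


def missing_modules(names):
    """Return the names that raise ImportError on import (hypothetical helper)."""
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_modules(REQUIRED)
    if absent:
        print("Missing modules: " + ", ".join(absent))
    else:
        print("All required modules are importable.")
```

Running the script with the same Python interpreter that Spark uses on each node should give the most relevant result.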

How to use basemods_spark

Make the scripts executable

If the scripts in your downloaded copy of basemods_spark don't have execute permission, make them executable:

chmod +x basemods_spark/scripts/exec_sawriter.sh

chmod +x basemods_spark/scripts/baxh5_operations.sh

chmod +x basemods_spark/scripts/cmph5_operations.sh

chmod +x basemods_spark/scripts/mods_operations.sh
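
The same permission change can be sketched in Python, which may be convenient when setting up several nodes from a script. This is only an illustration: the helper `make_executable` is hypothetical, the paths are the four from the chmod commands above, and unlike `chmod +x` it ignores the umask and always sets the execute bit for user, group, and others.

```python
from __future__ import print_function

import os
import stat

# Script paths taken from the chmod commands above.
SCRIPTS = [
    "basemods_spark/scripts/exec_sawriter.sh",
    "basemods_spark/scripts/baxh5_operations.sh",
    "basemods_spark/scripts/cmph5_operations.sh",
    "basemods_spark/scripts/mods_operations.sh",
]


def make_executable(path):
    """Add execute permission for user, group, and others (hypothetical helper)."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)


if __name__ == "__main__":
    for script in SCRIPTS:
        if os.path.isfile(script):
            make_executable(script)
        else:
            print("Not found, skipped: " + script)
```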

Copy your data

Copy your data to the master node of your Hadoop/Spark cluster.

Parameters in the configuration file

Set the parameters in the configuration file 'parameters.conf'.
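
Before submitting a job, it can help to check that required settings such as SMRT_ANALYSIS_HOME are actually present. The sketch below assumes parameters.conf contains `KEY=VALUE` lines, as the SMRT_ANALYSIS_HOME=/home/hadoop/smrtanalysis example suggests, and that lines starting with `#` are comments. Both are assumptions about the file format, and the function names are hypothetical. Lines without `=` are skipped.

```python
from __future__ import print_function

import os


def parse_conf_text(text):
    """Parse KEY=VALUE lines into a dict (assumed format, not confirmed)."""
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments (assumed '#'), and lines with no '='.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        params[key.strip()] = value.strip()
    return params


def check_smrt_home(params):
    """Return an error message, or None if SMRT_ANALYSIS_HOME is set."""
    if not params.get("SMRT_ANALYSIS_HOME", ""):
        return "SMRT_ANALYSIS_HOME is not set"
    return None


if __name__ == "__main__":
    if os.path.isfile("parameters.conf"):
        with open("parameters.conf") as handle:
            problem = check_smrt_home(parse_conf_text(handle.read()))
        print(problem or "SMRT_ANALYSIS_HOME is set.")
```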

Start Spark and use spark-submit to run the pipeline

(1) Start HDFS and YARN

$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh

(2) Start Spark

$SPARK_HOME/sbin/start-all.sh

(3) Submit your job

Standalone mode (from master node):

$SPARK_HOME/bin/spark-submit basemods_spark_runner.py

YARN client mode (from master node):

$SPARK_HOME/bin/spark-submit --master yarn \
--deploy-mode client \
--driver-memory 50g \
basemods_spark_runner_yarn_client.py

YARN cluster mode:

$SPARK_HOME/bin/spark-submit --master yarn \
--deploy-mode cluster \
--driver-memory 50g \
basemods_spark_runner_yarn_cluster.py
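
The three submit commands differ only in the runner script and the YARN options, so a wrapper script could build them from one function. The following is a sketch under stated assumptions: the function `build_submit_command` and the mode labels `standalone`, `yarn-client`, and `yarn-cluster` are hypothetical names chosen for illustration, while the options and script names are copied from the commands above.

```python
import os

# Runner script for each mode, as named in the commands above.
# The dictionary keys are hypothetical labels, not Spark options.
RUNNERS = {
    "standalone": "basemods_spark_runner.py",
    "yarn-client": "basemods_spark_runner_yarn_client.py",
    "yarn-cluster": "basemods_spark_runner_yarn_cluster.py",
}


def build_submit_command(mode, spark_home, driver_memory="50g"):
    """Return the spark-submit argument list for the given mode."""
    if mode not in RUNNERS:
        raise ValueError("unknown mode: %s" % mode)
    cmd = [os.path.join(spark_home, "bin", "spark-submit")]
    if mode != "standalone":
        # "yarn-client" -> "client", "yarn-cluster" -> "cluster"
        deploy_mode = mode.split("-", 1)[1]
        cmd += ["--master", "yarn",
                "--deploy-mode", deploy_mode,
                "--driver-memory", driver_memory]
    cmd.append(RUNNERS[mode])
    return cmd
```

The returned list can be passed to `subprocess.call` from the directory that contains the runner scripts, with `spark_home` taken from the `SPARK_HOME` environment variable.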
