osdataproc is a command-line tool for creating an OpenStack cluster with Apache Spark and Apache Hadoop configured. It comes with JupyterLab and Hail (a genomic data analysis library built on Spark) installed, as well as Netdata for monitoring.
- Create a Python virtual environment. For example:
  python3 -m venv env
- Download Terraform (0.13 or higher) and unzip it into a location on your path, e.g. into your venv. Make sure to download the appropriate version for your operating system and architecture.
  wget https://releases.hashicorp.com/terraform/0.13.5/terraform_0.13.5_linux_amd64.zip
  unzip terraform_0.13.5_linux_amd64.zip -d env/bin/
- Source the environment, clone this repository and install the requirements into the virtual environment:
  source env/bin/activate
  git clone https://github.com/wtsi-hgi/osdataproc.git
  cd osdataproc
  pip install -e .
- Make sure you have created an SSH keypair with ssh-keygen if you have not done so before. The default options are OK. Read the notes below if your private key has a passphrase.
- Download your OpenStack project's openrc.sh file. You can find the specific file for your project at Project > API Access, and then Download OpenStack RC File > OpenStack RC File on the right.
- Source your openrc.sh file:
  source <project-name>-openrc.sh
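Before running osdataproc, it can help to confirm that sourcing the RC file actually populated your shell. The following is a minimal sketch, not part of osdataproc: the helper name `check_openrc` is hypothetical, and the variable list (`OS_AUTH_URL`, `OS_USERNAME`, `OS_PASSWORD`) is an assumption based on the variables a password-based OpenStack RC file usually exports. Adjust the list if your project uses application credentials.

```shell
# Sketch: report which of the expected OpenStack variables are unset.
# ASSUMPTION: the variable names below match what your openrc.sh exports.
check_openrc() {
    missing=0
    for var in OS_AUTH_URL OS_USERNAME OS_PASSWORD; do
        # Look up the value of the variable whose name is stored in $var.
        eval "value=\${$var:-}"
        if [ -z "$value" ]; then
            echo "missing: $var" >&2
            missing=1
        fi
    done
    # 0 when every variable is set, 1 otherwise.
    return "$missing"
}
```

Running `check_openrc` after sourcing the file prints the name of any missing variable to standard error and returns a non-zero status in that case.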
You can then run the osdataproc command as shown in the examples below. osdataproc --help, osdataproc create --help, etc. will show all possible arguments.
Once run, it will ask you for a password. This is for access to the web interfaces, including Jupyter Lab. It is also the password for an encrypted NFS volume (see the NFS documentation). When you first access your cluster via a browser you will be asked for said password.
osdataproc create [--num-workers] <Number of desired worker nodes>
[--public-key] <Path to public key file>
[--flavour] <OpenStack flavour to use>
[--network-name] <OpenStack network to use>
[--lustre-network] <OpenStack Lustre provider network to use>
[--image-name] <OpenStack image to use - Ubuntu images only>
[--nfs-volume] <Name/ID of volume to attach or create as NFS shared volume>
[--volume-size] <Size of OpenStack volume to create>
[--device-name] <Device mountpoint name of volume>
[--floating-ip] <OpenStack floating IP to associate to master node - will automatically create one if not specified>
<cluster_name>
NOTE: Ensure that the image used has Python 3.8 as the default version of Python. The Ubuntu Focal images should work.
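Cluster names may only contain alphanumeric characters and dashes (no underscores), so it can be worth checking a name before passing it to osdataproc create. The sketch below is an illustration only: the function name `is_valid_cluster_name` is hypothetical, and rejecting an empty name is an assumption rather than documented osdataproc behaviour.

```shell
# Sketch: succeed only if the name consists of letters, digits and dashes.
# ASSUMPTION: an empty name is treated as invalid.
is_valid_cluster_name() {
    case "$1" in
        # Empty string, or any character outside A-Z, a-z, 0-9 and "-".
        "" | *[!A-Za-z0-9-]*) return 1 ;;
        *) return 0 ;;
    esac
}
```

For example, `is_valid_cluster_name my-cluster` succeeds, while `is_valid_cluster_name my_cluster` fails because of the underscore.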
osdataproc create will output the public IP of your master node once the node has been created. You can SSH into it with the private key that matches the public key provided:
ssh ubuntu@<public_ip>
Note that it will take a few minutes for the configuration to complete. The following services can then be accessed from your browser:
Service | URL |
---|---|
Jupyter Lab | https://<public_ip>/jupyter |
Spark | https://<public_ip>/spark |
Spark History | https://<public_ip>/sparkhist |
HDFS | https://<public_ip>/hdfs |
YARN | https://<public_ip>/yarn |
MapReduce History | https://<public_ip>/mapreduce |
Netdata Metrics | https://<public_ip>/netdata |
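If you want the full list of addresses for a particular cluster, the table above can be expanded with a small helper. This is a minimal sketch: the function name `service_urls` is hypothetical, and the paths are copied from the table rather than queried from the cluster.

```shell
# Sketch: print the web interface URLs for a given master node IP.
# The path list mirrors the service table in this README.
service_urls() {
    ip="$1"
    for path in jupyter spark sparkhist hdfs yarn mapreduce netdata; do
        printf 'https://%s/%s\n' "$ip" "$path"
    done
}
```

Calling `service_urls <public_ip>` prints one URL per line, in the same order as the table.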
You can attach a volume as an NFS share to your cluster, either by creating a new volume or by attaching an existing one. This mounts the volume on the data directory of your master node and shares that directory with all of the worker nodes as a shared volume over NFS. See the NFS documentation for details and creation options.
For Lustre support, you must provide the name of the Lustre provider network that exists in your tenant and an image configured to mount Lustre from this network. For Sanger users, please check with ISG for the details.
osdataproc destroy <cluster_name>
The vars.yml file is where default options for creating the cluster can be saved and where Spark and Hadoop configuration items can be tuned. Additional packages and Python modules to install on the cluster can also be specified here.
- Your cluster name should never contain underscore characters; valid choices are alphanumeric characters and dashes.
- If your private key has a passphrase, Ansible will not be able to connect to the created instances unless you add your key to ssh-agent first:
  eval $(ssh-agent)
  ssh-add
- You can check the provisioning status of the worker nodes by connecting via the master node and checking /var/log/user_data.log. For example:
  ssh -J ubuntu@<public_ip> \
      ubuntu@<user>-<cluster_name>-worker-<index> \
      tail -f /var/log/user_data.log
- osdataproc is configured to use Kryo serialization for use with Hail, for up to 10x faster data serialization. However, not all Serializable types are supported, so it may be necessary to change $SPARK_HOME/conf/spark-defaults.conf by commenting out or removing the spark.serializer configuration option. This option can also be removed by default in vars.yml when creating a cluster.
- For Sanger users, check the appropriate FCE capacity dashboard under the Sanger metrics.
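The Kryo note above suggests commenting out the spark.serializer option when a Serializable type is not supported. One way to do that on an existing cluster is sketched below. The function name `disable_spark_serializer` is hypothetical, and the sketch assumes the option appears as a key at the start of a line, followed by whitespace or `=`; check the file on your own cluster before relying on it.

```shell
# Sketch: comment out active spark.serializer lines in a Spark defaults file.
# ASSUMPTION: the key starts the line (optionally indented) and is followed
# by whitespace or "=". Lines that are already commented are left alone, as
# are other keys such as spark.serializer.objectStreamReset.
disable_spark_serializer() {
    conf_file="$1"
    tmp_file="$(mktemp)" || return 1
    # Write to a temporary file first, then copy the result back, so the
    # original file keeps its permissions and "sed -i" is not needed.
    sed 's/^\([[:space:]]*\)\(spark\.serializer[[:space:]=]\)/\1# \2/' "$conf_file" > "$tmp_file" &&
        cat "$tmp_file" > "$conf_file"
    status=$?
    rm -f "$tmp_file"
    return "$status"
}
```

On the master node this could be invoked as `disable_spark_serializer "$SPARK_HOME/conf/spark-defaults.conf"`, possibly with elevated permissions depending on who owns the file. Applications that are already running keep their existing serializer setting.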
You can contribute by submitting pull requests to this repository. If you create a fork you will need to update the REPO and BRANCH variables in terraform/user-data.sh.tpl to the new repository location for the changes you make to be reflected in the created cluster.
- Refactor/enhance Python CLI
  - Move from openrc.sh to clouds.yaml
  - Incorporate run script into osdataproc Python CLI
  - Allow setting of password by environment variable
  - Allow multiple public SSH keys
  - Allow setting of DNS name servers
  - Use exit codes productively if Terraform/Ansible fail
  - Machine-readable output to communicate status
  - Allow non-interactive destruction
  - Correct distribution to recognise multiple Python modules
- Refactor Ansible playbooks
- Update to JupyterLab 3
- Resize functionality (larger flavour/more nodes)
- Support for other Linux distributions