Added compute doc, adding other operations docs (e.g. backups) #167

Merged
merged 12 commits into from
Aug 24, 2023
195 changes: 195 additions & 0 deletions docs/compute.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,195 @@
.. _compute:

#######
Compute
#######

To execute ``Transformation``\s and obtain free energy estimates, you must deploy compute services to resources suited to these calculations.
This document details how to do this on several different types of compute resources.

There currently exists a single implementation of an ``alchemiscale`` compute service: the :py:class:`~alchemiscale.compute.service.SynchronousComputeService`.
Other variants will likely be created in the future, optimized for different use cases.
This documentation will expand over time as these variants become available; for now, it assumes use of this variant.

In all cases, you will need to define a configuration file for your compute services to consume on startup.
A template for this file is available at the URL below; replace ``$ALCHEMISCALE_VERSION`` with the version tag you have deployed for your server (e.g. ``v0.1.4``)::

    https://raw.githubusercontent.com/openforcefield/alchemiscale/$ALCHEMISCALE_VERSION/devtools/configs/synchronous-compute-settings.yaml
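
As a minimal sketch, assuming ``curl`` is available on the host, the template could be fetched for a given version tag as shown below. The helper name ``template_url`` is hypothetical and is not part of ``alchemiscale``; it only fills the version tag into the URL above.

```shell
# Hypothetical helper (not part of alchemiscale): print the template URL
# for the given version tag.
template_url() {
    local version="$1"
    printf 'https://raw.githubusercontent.com/openforcefield/alchemiscale/%s/devtools/configs/synchronous-compute-settings.yaml\n' "$version"
}

# Example use; set the tag to match your server deployment first:
# ALCHEMISCALE_VERSION=v0.1.4
# curl -fsSL "$(template_url "$ALCHEMISCALE_VERSION")" -o synchronous-compute-settings.yaml
```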


***********
Single-host
***********

To deploy a compute service (or multiple services) to a single host, we recommend one of two routes:

* installing all dependencies in a ``conda``/``mamba`` environment
* running the services as Docker containers, with all dependencies baked in


.. _compute_conda:

Deploying with conda/mamba
==========================

To deploy via ``conda``/``mamba``, first create an environment (we recommend ``mamba`` for its performance)::

    mamba env create -n alchemiscale-compute-$ALCHEMISCALE_VERSION -f https://raw.githubusercontent.com/openforcefield/alchemiscale/$ALCHEMISCALE_VERSION/devtools/conda-envs/alchemiscale-compute.yml

Once created, activate the environment in your current shell::

    conda activate alchemiscale-compute-$ALCHEMISCALE_VERSION

Then start a compute service, assuming your configuration file is in the current working directory, with::

    alchemiscale compute synchronous -c synchronous-compute-settings.yaml


.. _compute_docker:

Deploying with Docker
=====================

Assuming your configuration file is in the current working directory, to deploy with Docker, you might use::

    docker run --gpus all --rm -v $(pwd):/mnt ghcr.io/openforcefield/alchemiscale-compute:$ALCHEMISCALE_VERSION compute synchronous -c /mnt/synchronous-compute-settings.yaml


See the `official Docker documentation on GPU use`_ for details on how to specify individual GPUs for each container you launch.
It may also make sense to apply constraints to the number of CPUs available to each container to avoid oversubscription.
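
The following sketch illustrates one way to combine both suggestions: one container per GPU, each capped at a fixed number of CPUs. It only prints the commands so they can be reviewed before running. The function name ``print_docker_commands`` and the GPU and CPU counts are illustrative assumptions; ``--gpus device=N`` and ``--cpus`` are standard ``docker run`` options.

```shell
# Hypothetical helper: print (without running) one `docker run` command per
# GPU index, each pinned to a single GPU and limited to a fixed CPU count.
print_docker_commands() {
    local num_gpus="$1" cpus="$2" version="$3" gpu
    gpu=0
    while [ "$gpu" -lt "$num_gpus" ]; do
        printf 'docker run --gpus device=%s --cpus %s --rm -v "$(pwd)":/mnt ghcr.io/openforcefield/alchemiscale-compute:%s compute synchronous -c /mnt/synchronous-compute-settings.yaml\n' "$gpu" "$cpus" "$version"
        gpu=$((gpu + 1))
    done
}

# Illustrative values: 2 GPUs, 4 CPUs per service, version tag v0.1.4.
print_docker_commands 2 4 v0.1.4
```

Each printed command runs in the foreground, so you would typically start each one in its own terminal session or add ``-d`` to detach it.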


.. _official Docker documentation on GPU use: https://docs.docker.com/config/containers/resource_constraints/#gpu

***********
HPC cluster
***********

To deploy compute services to an HPC cluster, we recommend submitting them as individual jobs to the HPC cluster's scheduler.
Different clusters feature different schedulers (e.g. SLURM, LSF, TORQUE/PBS, etc.), and vary widely in their hardware and queue configurations.
You will need to tailor your specific approach to the constraints of the cluster you are targeting.

The following is an example of the *content* of a script submitted to an HPC cluster.
We have left off the top matter that is specific to the queueing system, and certain environment variables (e.g. ``JOBID``, ``JOBINDEX``) should be tailored to those presented by the queueing system.
Note that for this case we've made use of a ``conda``/``mamba``-based deployment, detailed above in :ref:`compute_conda`::

    # don't limit stack size
    ulimit -s unlimited

    # make scratch space
    mkdir -p /scratch/${USER}/${JOBID}-${JOBINDEX}

    # activate environment
    conda activate alchemiscale-compute-$ALCHEMISCALE_VERSION

    # create a YAML file with specific substitutions
    # each service in this job can share the same config
    envsubst < settings.yaml > configs/settings.${JOBID}-${JOBINDEX}.yaml

    # start up a single service
    alchemiscale compute synchronous -c configs/settings.${JOBID}-${JOBINDEX}.yaml

    # remove scratch space
    rm -r /scratch/${USER}/${JOBID}-${JOBINDEX}


The ``envsubst`` line in particular will make a config specific to this job, with environment variable substitutions.
A subset of the options used in the config file is given below::

    ---
    # options for service initialization
    init:

      # Filesystem path to use for `ProtocolDAG` `shared` space.
      shared_basedir: "/scratch/${USER}/${JOBID}-${JOBINDEX}/shared"

      # Filesystem path to use for `ProtocolUnit` `scratch` space.
      scratch_basedir: "/scratch/${USER}/${JOBID}-${JOBINDEX}/scratch"

      # Path to file for logging output; if not set, logging will only go to
      # STDOUT.
      logfile: /home/${USER}/logs/service.${JOBID}.log

    # options for service execution
    start:

      # Max number of Tasks to execute before exiting. If `null`, the service will
      # have no task limit.
      max_tasks: 1

      # Max number of seconds to run before exiting. If `null`, the service will
      # have no time limit.
      max_time: 300


For HPC job-based execution, we recommend limiting the number of ``Task``\s the compute service executes to a small number, preferably 1, and setting a time limit beyond which the compute service will shut down.
With this configuration, when a compute service comes up and claims a ``Task``, it will have nearly the full walltime of its job to execute it.
Any compute service that fails to claim a ``Task`` will shut itself down, and the job will exit, avoiding waste and a scenario where a ``Task`` is claimed without enough walltime left on the job to complete it.
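
Because ``envsubst`` replaces any unset variable with an empty string, a misnamed scheduler variable can silently produce wrong paths in the generated config. The sketch below shows one possible guard to run before the ``envsubst`` line of the job script. The function name ``require_vars`` is hypothetical, and ``JOBID``/``JOBINDEX`` stand in for whatever variables your queueing system provides.

```shell
# Hypothetical guard: return non-zero if any named environment variable is
# unset or empty, reporting each missing name on STDERR.
require_vars() {
    local name value status=0
    for name in "$@"; do
        # look up the variable whose name is stored in $name
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "required variable $name is not set" >&2
            status=1
        fi
    done
    return "$status"
}

# Example use inside the job script, before the envsubst line:
# require_vars USER JOBID JOBINDEX || exit 1
```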


******************
Kubernetes cluster
******************

To deploy compute services to a Kubernetes ("k8s") cluster, we make use of a similar approach to deployment with Docker detailed above in :ref:`compute_docker`.
We define a k8s `Deployment`_ featuring a single container spec as the file ``compute-services.yaml``::

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: alchemiscale-synchronouscompute
      labels:
        app: alchemiscale-synchronouscompute
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: alchemiscale-synchronouscompute
      template:
        metadata:
          labels:
            app: alchemiscale-synchronouscompute
        spec:
          containers:
          - name: alchemiscale-synchronous-container
            image: ghcr.io/openforcefield/alchemiscale-compute:$ALCHEMISCALE_VERSION
            args: ["compute", "synchronous", "-c", "/mnt/settings/synchronous-compute-settings.yaml"]
            resources:
              limits:
                cpu: 2
                memory: 12Gi
                ephemeral-storage: 48Gi
                nvidia.com/gpu: 1
              requests:
                cpu: 2
                memory: 12Gi
                ephemeral-storage: 48Gi
            volumeMounts:
              - name: alchemiscale-compute-settings-yaml
                mountPath: "/mnt/settings"
                readOnly: true
            env:
              - name: OPENMM_CPU_THREADS
                value: "2"
          volumes:
            - name: alchemiscale-compute-settings-yaml
              secret:
                secretName: alchemiscale-compute-settings-yaml


This assumes our configuration file has been defined as a *secret* in the cluster.
Assuming the file is in the current working directory, we can add it as a secret with::

    kubectl create secret generic alchemiscale-compute-settings-yaml --from-file=synchronous-compute-settings.yaml


We can then deploy the compute services with::

    kubectl apply -f compute-services.yaml

To scale up the number of compute services, increase the number of ``replicas`` to the number desired, and re-run the ``kubectl apply`` command above.
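
As an alternative sketch, ``kubectl scale`` can change the replica count of the running Deployment directly, without editing the file. The helper below only prints the command for review; its name ``print_scale_command`` is hypothetical, and the Deployment name is taken from ``compute-services.yaml`` above.

```shell
# Hypothetical helper: print (without running) the command that scales the
# Deployment defined in compute-services.yaml to the requested replica count.
print_scale_command() {
    local replicas="$1"
    printf 'kubectl scale deployment/alchemiscale-synchronouscompute --replicas=%s\n' "$replicas"
}

# Illustrative value: scale to 4 compute services.
print_scale_command 4
```

Note that scaling this way leaves ``replicas`` in ``compute-services.yaml`` unchanged, so a later ``kubectl apply`` of that file will reset the count to the value it contains.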

A more complete example of this type of deployment can be found in `alchemiscale-k8s`_.


.. _Deployment: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/
.. _alchemiscale-k8s: https://github.com/datryllic/alchemiscale-k8s/tree/main/compute
21 changes: 16 additions & 5 deletions docs/deployment.rst
Original file line number Diff line number Diff line change
@@ -14,7 +14,7 @@ Only Linux is supported as a platform for deploying ``alchemiscale`` services; W
.. _deploy-docker-compose:

******************************************
Single-Host Deployment with docker-compose
Single-host deployment with docker-compose
******************************************

An alchemiscale "server" deployment consists of a ``neo4j`` database (the "state store"), a client API endpoint, a compute API endpoint, and a reverse proxy (``traefik``).
@@ -27,7 +27,7 @@ The "server" also requires an object store; see :ref:`deploy-object-store`.

.. _deploy-docker-compose-instructions:

Deployment Instructions
Deployment instructions
=======================

Install `docker compose <https://docs.docker.com/compose/install/#scenario-two-install-the-compose-plugin>`_.
@@ -97,11 +97,22 @@ Once connected to the instance, run the following commands::
Now the instance has all of the dependencies required for ``docker-compose``-based deployment (:ref:`deploy-docker-compose-instructions`)


.. _deploy-kubernetes:

*************************************************
Kubernetes-based deployment with alchemiscale-k8s
*************************************************

To deploy ``alchemiscale`` to a Kubernetes cluster, review the resources defined and detailed in `alchemiscale-k8s`_.

.. _alchemiscale-k8s: https://github.com/datryllic/alchemiscale-k8s


.. _deploy-object-store:

************
Object Store
************
**************************
Setting up an object store
**************************

An "object store" is also needed for a complete server deployment.
Currently, the only supported object store is AWS S3.
1 change: 1 addition & 0 deletions docs/index.rst
Original file line number Diff line number Diff line change
@@ -30,6 +30,7 @@ in particular the `OpenForceField`_ and `OpenFreeEnergy`_ ecosystems.
./overview
./user_guide
./deployment
./compute
./operations
./API_docs

40 changes: 38 additions & 2 deletions docs/operations.rst
Original file line number Diff line number Diff line change
@@ -3,13 +3,13 @@ Operations
##########

*********
Add Users
Add users
*********

To add a new user identity, you will generally use the ``alchemiscale`` CLI::


$ export NEO4J_URL=bolt://<NEO4J_HOSTNAME>7687
$ export NEO4J_URL=bolt://<NEO4J_HOSTNAME>:7687
$ export NEO4J_USER=<NEO4J_USERNAME>
$ export NEO4J_PASS=<NEO4J_PASSWORD>
$
@@ -51,3 +51,39 @@ The important bits here are:
Backups
*******

Performing regular backups of the state store is an important operational component for any production deployment of ``alchemiscale``.
To do this, **first shut down the neo4j service so that no database processes are running**.

The instructions below assume a Docker-based deployment, perhaps via ``docker-compose`` as in :ref:`deploy-docker-compose`.
The same general principles apply to any deployment type, however.

Creating a database dump
========================

**With the neo4j service shut down**, navigate to the directory containing your database data, set ``$BACKUPS_DIR`` to the absolute path of your choice, then run::

    docker run --rm \
        -v $(pwd):/var/lib/neo4j/data \
        -v ${BACKUPS_DIR}:/tmp \
        --entrypoint /bin/bash \
        neo4j:4.4 \
        neo4j-admin dump --to /tmp/neo4j-$(date -I).dump

This will create a new database dump in the ``$BACKUPS_DIR`` directory.


Restoring from a database dump
==============================

To later restore from a database dump, navigate to the directory containing your current database data, and clear or move the current data to another location (Neo4j will not restore to a database that is already initialized).

**With the neo4j service shut down**, set ``$BACKUPS_DIR`` as above and ``$DUMP_DATE`` to the date of the dump you want to restore (in the ``YYYY-MM-DD`` form produced by ``date -I``), then run::

    docker run --rm \
        -v $(pwd):/var/lib/neo4j/data \
        -v ${BACKUPS_DIR}:/tmp \
        --entrypoint /bin/bash \
        neo4j:4.4 \
        neo4j-admin load --from /tmp/neo4j-${DUMP_DATE}.dump
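
If you want to restore the most recent dump, ``$DUMP_DATE`` can be derived from the file names in ``$BACKUPS_DIR``. The sketch below assumes the ``neo4j-YYYY-MM-DD.dump`` naming produced by the dump command above; the function name ``latest_dump_date`` is hypothetical.

```shell
# Hypothetical helper: print the date of the newest dump in a directory.
# Relies on the neo4j-YYYY-MM-DD.dump naming; glob results are sorted, and
# ISO dates sort chronologically, so the last match is the newest dump.
latest_dump_date() {
    local dir="$1" path newest=""
    for path in "$dir"/neo4j-????-??-??.dump; do
        if [ -e "$path" ]; then
            newest="$path"
        fi
    done
    # fail if no dump was found
    [ -n "$newest" ] || return 1
    newest="${newest##*/}"      # strip the directory
    newest="${newest#neo4j-}"   # strip the prefix
    printf '%s\n' "${newest%.dump}"
}

# Example use:
# DUMP_DATE="$(latest_dump_date "$BACKUPS_DIR")"
```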

Automating the backup process to perform regular backups without human intervention for your deployment is ideal, but out of scope for this document.