Merge pull request #167 from openforcefield/doc-operations
Added compute doc, adding other operations docs (e.g. backups)
dotsdl authored Aug 24, 2023
2 parents b831756 + b014715 commit 233bc3a
Showing 4 changed files with 257 additions and 7 deletions.
202 changes: 202 additions & 0 deletions docs/compute.rst
@@ -0,0 +1,202 @@
.. _compute:

#######
Compute
#######

To execute ``Transformation``\s and obtain free energy estimates, you must deploy compute services to resources suitable for running these calculations.
This document details how to do this on several different types of compute resources.

There currently exists a single implementation of an ``alchemiscale`` compute service: the :py:class:`~alchemiscale.compute.service.SynchronousComputeService`.
Other variants will likely be created in the future, optimized for different use cases.
This documentation will expand over time as these variants become available; for now, it assumes use of this variant.

In all cases, you will need to define a configuration file for your compute services to consume on startup.
A template for this file can be found at the URL below; replace ``$ALCHEMISCALE_VERSION`` with the version tag you have deployed for your server, e.g. ``v0.1.4``::

https://raw.githubusercontent.com/openforcefield/alchemiscale/$ALCHEMISCALE_VERSION/devtools/configs/synchronous-compute-settings.yaml
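
For example, assuming a ``v0.1.4`` deployment, you might fetch the template with ``curl`` and then edit it to point at your server; the downloaded filename matches the one used in the examples below::

export ALCHEMISCALE_VERSION=v0.1.4
curl -LO https://raw.githubusercontent.com/openforcefield/alchemiscale/$ALCHEMISCALE_VERSION/devtools/configs/synchronous-compute-settings.yaml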


***********
Single-host
***********

To deploy a compute service (or multiple services) to a single host, we recommend one of two routes:

* installing all dependencies in a ``conda``/``mamba`` environment
* running the services as Docker containers, with all dependencies baked in


.. _compute_conda:

Deploying with conda/mamba
==========================

To deploy via ``conda``/``mamba``, first create an environment (we recommend ``mamba`` for its performance)::

mamba env create -n alchemiscale-compute-$ALCHEMISCALE_VERSION \
    -f https://raw.githubusercontent.com/openforcefield/alchemiscale/$ALCHEMISCALE_VERSION/devtools/conda-envs/alchemiscale-compute.yml

Once created, activate the environment in your current shell::

conda activate alchemiscale-compute-$ALCHEMISCALE_VERSION

Then start a compute service, assuming your configuration file is in the current working directory, with::

alchemiscale compute synchronous -c synchronous-compute-settings.yaml
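
If the host has multiple GPUs, one option is to run one service per GPU, pinning each to a device with ``CUDA_VISIBLE_DEVICES``; the sketch below assumes your protocols use the CUDA platform (which respects this variable) and that you have prepared one config file per device (the per-device filenames here are hypothetical)::

for GPU in 0 1; do
    CUDA_VISIBLE_DEVICES=$GPU alchemiscale compute synchronous \
        -c synchronous-compute-settings-gpu${GPU}.yaml &
done
wait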


.. _compute_docker:

Deploying with Docker
=====================

Assuming your configuration file is in the current working directory, to deploy with Docker, you might use::

docker run --gpus all \
    --rm \
    -v $(pwd):/mnt ghcr.io/openforcefield/alchemiscale-compute:$ALCHEMISCALE_VERSION \
    compute synchronous -c /mnt/synchronous-compute-settings.yaml


See the `official Docker documentation on GPU use`_ for details on how to specify individual GPUs for each container you launch.
It may also make sense to apply constraints to the number of CPUs available to each container to avoid oversubscription.


.. _official Docker documentation on GPU use: https://docs.docker.com/config/containers/resource_constraints/#gpu
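
For example, a sketch that pins a container to a single GPU and caps its CPU usage; ``--gpus '"device=0"'`` and ``--cpus`` are standard Docker flags, and the values here are illustrative::

docker run --gpus '"device=0"' \
    --cpus=4 \
    --rm \
    -v $(pwd):/mnt ghcr.io/openforcefield/alchemiscale-compute:$ALCHEMISCALE_VERSION \
    compute synchronous -c /mnt/synchronous-compute-settings.yaml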

***********
HPC cluster
***********

To deploy compute services to an HPC cluster, we recommend submitting them as individual jobs to the HPC cluster's scheduler.
Different clusters feature different schedulers (e.g. SLURM, LSF, TORQUE/PBS, etc.), and vary widely in their hardware and queue configurations.
You will need to tailor your specific approach to the constraints of the cluster you are targeting.

The following is an example of the *content* of a script submitted to an HPC cluster.
We have omitted queuing system-specific options and flags, and certain environment variables (e.g. ``JOBID``, ``JOBINDEX``) should be tailored to those presented by the queuing system.
Note that for this case we've made use of a ``conda``/``mamba``-based deployment, detailed above in :ref:`compute_conda`::

# don't limit stack size
ulimit -s unlimited
# make scratch space (path will be HPC system dependent)
ALCHEMISCALE_SCRATCH=/scratch/${USER}/${JOBID}-${JOBINDEX}
mkdir -p $ALCHEMISCALE_SCRATCH
# activate environment
conda activate alchemiscale-compute-$ALCHEMISCALE_VERSION
# create a YAML file with specific substitutions
# each service in this job can share the same config
envsubst < settings.yaml > configs/settings.${JOBID}-${JOBINDEX}.yaml
# start up a single service
alchemiscale compute synchronous -c configs/settings.${JOBID}-${JOBINDEX}.yaml
# remove scratch space
rm -r $ALCHEMISCALE_SCRATCH
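
As a concrete illustration, on a SLURM cluster the script above might be wrapped in a job-array submission along these lines; this is a sketch only, with SLURM's ``$SLURM_JOB_ID`` and ``$SLURM_ARRAY_TASK_ID`` standing in for ``JOBID`` and ``JOBINDEX``, and the directives and resource names will differ between clusters::

#!/bin/bash
#SBATCH --job-name=alchemiscale-compute
#SBATCH --array=1-10
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=2
#SBATCH --time=12:00:00

export JOBID=$SLURM_JOB_ID
export JOBINDEX=$SLURM_ARRAY_TASK_ID

# ... body of the script shown above ...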


The ``envsubst`` line in particular produces a config specific to this job by substituting environment variables into the ``settings.yaml`` template.
A subset of the options used in the config file is given below::

---
# options for service initialization
init:
  # Filesystem path to use for `ProtocolDAG` `shared` space.
  shared_basedir: "/scratch/${USER}/${JOBID}-${JOBINDEX}/shared"
  # Filesystem path to use for `ProtocolUnit` `scratch` space.
  scratch_basedir: "/scratch/${USER}/${JOBID}-${JOBINDEX}/scratch"
  # Path to file for logging output; if not set, logging will only go to
  # STDOUT.
  logfile: /home/${USER}/logs/service.${JOBID}.log

# options for service execution
start:
  # Max number of Tasks to execute before exiting. If `null`, the service will
  # have no task limit.
  max_tasks: 1
  # Max number of seconds to run before exiting. If `null`, the service will
  # have no time limit.
  max_time: 300


For HPC job-based execution, we recommend limiting the number of ``Task``\s the compute service executes to a small number, preferably 1, and setting a time limit beyond which the compute service will shut down.
With this configuration, when a compute service comes up and claims a ``Task``, it will have nearly the full walltime of its job to execute it.
Any compute service that fails to claim a ``Task`` will shut itself down, and the job will exit, avoiding waste and a scenario where a ``Task`` is claimed without enough walltime left on the job to complete it.
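
For example, if your jobs request a 12-hour walltime, you might set ``max_time`` a bit below that, say 11.5 hours (41400 seconds), leaving headroom for a clean shutdown and the scratch cleanup at the end of the job script; the exact margin is a judgment call for your cluster::

start:
  max_tasks: 1
  # 11.5 hours, slightly under a 12-hour job walltime
  max_time: 41400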


******************
Kubernetes cluster
******************

To deploy compute services to a Kubernetes ("k8s") cluster, we use an approach similar to the Docker-based deployment detailed above in :ref:`compute_docker`.
We define a k8s `Deployment`_ featuring a single container spec as the file ``compute-services.yaml``::

apiVersion: apps/v1
kind: Deployment
metadata:
  name: alchemiscale-synchronouscompute
  labels:
    app: alchemiscale-synchronouscompute
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alchemiscale-synchronouscompute
  template:
    metadata:
      labels:
        app: alchemiscale-synchronouscompute
    spec:
      containers:
        - name: alchemiscale-synchronous-container
          image: ghcr.io/openforcefield/alchemiscale-compute:$ALCHEMISCALE_VERSION
          args: ["compute", "synchronous", "-c", "/mnt/settings/synchronous-compute-settings.yaml"]
          resources:
            limits:
              cpu: 2
              memory: 12Gi
              ephemeral-storage: 48Gi
              nvidia.com/gpu: 1
            requests:
              cpu: 2
              memory: 12Gi
              ephemeral-storage: 48Gi
          volumeMounts:
            - name: alchemiscale-compute-settings-yaml
              mountPath: "/mnt/settings"
              readOnly: true
          env:
            - name: OPENMM_CPU_THREADS
              value: "2"
      volumes:
        - name: alchemiscale-compute-settings-yaml
          secret:
            secretName: alchemiscale-compute-settings-yaml


This assumes our configuration file has been defined as a *secret* in the cluster.
With the file in the current working directory, we can add it as a secret with::

kubectl create secret generic alchemiscale-compute-settings-yaml \
    --from-file=synchronous-compute-settings.yaml
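
If you later change the settings file, one common pattern for updating the secret in place is to regenerate its manifest and pipe it through ``kubectl apply`` (``--dry-run=client`` and ``-o yaml`` are standard ``kubectl`` flags)::

kubectl create secret generic alchemiscale-compute-settings-yaml \
    --from-file=synchronous-compute-settings.yaml \
    --dry-run=client -o yaml | kubectl apply -f -

The compute service reads its configuration at startup, so existing pods keep their old settings until restarted (e.g. with ``kubectl rollout restart``).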


We can then deploy the compute services with::

kubectl apply -f compute-services.yaml

To scale up the number of compute services on the cluster, increase ``replicas`` to the number desired, and re-run the ``kubectl apply`` command above.
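
Alternatively, ``kubectl scale`` adjusts the replica count directly, without editing the manifest (keeping the manifest as the source of truth is generally preferable, though)::

kubectl scale deployment/alchemiscale-synchronouscompute --replicas=10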

A more complete example of this type of deployment can be found in `alchemiscale-k8s`_.


.. _Deployment: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/
.. _alchemiscale-k8s: https://github.com/datryllic/alchemiscale-k8s/tree/main/compute
21 changes: 16 additions & 5 deletions docs/deployment.rst
@@ -14,7 +14,7 @@ Only Linux is supported as a platform for deploying ``alchemiscale`` services; W
.. _deploy-docker-compose:

******************************************
Single-Host Deployment with docker-compose
Single-host deployment with docker-compose
******************************************

An alchemiscale "server" deployment consists of a ``neo4j`` database (the "state store"), a client API endpoint, a compute API endpoint, and a reverse proxy (``traefik``).
@@ -27,7 +27,7 @@ The "server" also requires an object store; see :ref:`deploy-object-store`.

.. _deploy-docker-compose-instructions:

Deployment Instructions
Deployment instructions
=======================

Install `docker compose <https://docs.docker.com/compose/install/#scenario-two-install-the-compose-plugin>`_.
@@ -97,11 +97,22 @@ Once connected to the instance, run the following commands::
Now the instance has all of the dependencies required for ``docker-compose``-based deployment (:ref:`deploy-docker-compose-instructions`).


.. _deploy-kubernetes:

*************************************************
Kubernetes-based deployment with alchemiscale-k8s
*************************************************

To deploy ``alchemiscale`` to a Kubernetes cluster, review the resources defined and detailed in `alchemiscale-k8s`_.

.. _alchemiscale-k8s: https://github.com/datryllic/alchemiscale-k8s


.. _deploy-object-store:

************
Object Store
************
**************************
Setting up an object store
**************************

An "object store" is also needed for a complete server deployment.
Currently, the only supported object store is AWS S3.
1 change: 1 addition & 0 deletions docs/index.rst
@@ -30,6 +30,7 @@ in particular the `OpenForceField`_ and `OpenFreeEnergy`_ ecosystems.
./overview
./user_guide
./deployment
./compute
./operations
./API_docs

40 changes: 38 additions & 2 deletions docs/operations.rst
@@ -3,13 +3,13 @@ Operations
##########

*********
Add Users
Add users
*********

To add a new user identity, you will generally use the ``alchemiscale`` CLI::


$ export NEO4J_URL=bolt://<NEO4J_HOSTNAME>7687
$ export NEO4J_URL=bolt://<NEO4J_HOSTNAME>:7687
$ export NEO4J_USER=<NEO4J_USERNAME>
$ export NEO4J_PASS=<NEO4J_PASSWORD>
$
@@ -51,3 +51,39 @@ The important bits here are:
Backups
*******

Performing regular backups of the state store is an important operational component for any production deployment of ``alchemiscale``.
To do this, **first shut down the Neo4j service so that no database processes are currently running**.

The instructions below assume a Docker-based deployment, perhaps via ``docker-compose`` as in :ref:`deploy-docker-compose`.
The same general principles apply to any deployment type, however.
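
In a ``docker-compose``-based deployment, stopping the database might look like the following, assuming the Neo4j service is named ``neo4j`` in your compose file::

docker compose stop neo4j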

Creating a database dump
========================

**With the Neo4j service shut down**, navigate to the directory containing your database data, set ``$BACKUPS_DIR`` to the absolute path of the directory where backups should be written and ``$NEO4J_VERSION`` to the version of Neo4j you are using, then run::

docker run --rm \
    -v $(pwd):/var/lib/neo4j/data \
    -v ${BACKUPS_DIR}:/tmp \
    --entrypoint /bin/bash \
    neo4j:${NEO4J_VERSION} \
    neo4j-admin dump --to /tmp/neo4j-$(date -I).dump

This will create a new database dump in the ``$BACKUPS_DIR`` directory.


Restoring from a database dump
==============================

To later restore from a database dump, navigate to the directory containing your current database data, and clear or move the current data to another location (Neo4j will not restore to a database that is already initialized).
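
For example, rather than deleting the existing data outright, you might move it aside so it can be recovered if the restore goes wrong; the destination path here is illustrative::

mkdir -p ../neo4j-data-pre-restore-$(date -I)
mv ./* ../neo4j-data-pre-restore-$(date -I)/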

**With the Neo4j service shut down**, set ``$DUMP_DATE`` to the date of the dump you wish to restore, ``$BACKUPS_DIR`` to the directory containing your dumps, and ``$NEO4J_VERSION`` to the version of Neo4j you are using, then run::

docker run --rm \
    -v $(pwd):/var/lib/neo4j/data \
    -v ${BACKUPS_DIR}:/tmp \
    --entrypoint /bin/bash \
    neo4j:${NEO4J_VERSION} \
    neo4j-admin load --from /tmp/neo4j-${DUMP_DATE}.dump
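
Once the load completes, bring the database back up; in a ``docker-compose``-based deployment this might be, again assuming the service is named ``neo4j``::

docker compose start neo4j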

Automating this process so that backups are taken regularly without human intervention is ideal, but out of scope for this document.
