Presented by:
Jon Wedell
Lead Software Developer, BioMagResBank
wedell@uchc.edu

## Background

During this tutorial we will walk through submitting jobs to the HTCondor
workload management system. We will then investigate how the jobs run in the HTCondor
environment and explore how to manage submitted jobs.

### The submit file

The submit file is the simplest way to submit one or more jobs to HTCondor. It contains
the information needed to specify what work to perform (which executable to call, which arguments
to provide, etc.) and any requirements your job needs in order to complete successfully (a GPU, a certain release of
NMRbox, etc.).

#### The basics

Let's look at a simple submit file in detail to see what arguments are necessary for a submission.


```
universe = vanilla
executable = /bin/ls
arguments = /
log = logs/basic.log
output = logs/basic.out
error = logs/basic.err
queue
```

Going through the lines in order:

* The universe line specifies which HTCondor universe to run in. These are discussed
in the slides. The most straightforward is the `vanilla` universe; as a beginner, this
universe should be suitable for all your jobs.
* The executable line specifies which command should actually be run when the job runs. You
must use an absolute path.
* The arguments line, while technically optional, will almost always be used. It allows you
to specify which command line arguments to provide to the executable.
* The log, output, and error lines specify where HTCondor will write files. The output and
error files will be populated with whatever the job writes to STDOUT and STDERR, respectively, while the
log will contain the HTCondor logs related to the scheduling and running of the job.
* The queue parameter is special. Each time this statement is encountered in the file,
HTCondor will submit a job to the queue with whatever parameters were specified earlier
in the file. You can specify queue multiple times on different lines, and optionally
change any of the parameters between the queue lines to customize the various runs
of the job. You can also provide a number after `queue` to submit multiple jobs
with the same parameters (e.g., if doing a Monte Carlo simulation), as shown in the sketch below.
You can see this behavior used in practice in the "cluster.sub" file.
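
As a hedged illustration (the executable and input file names here are hypothetical), a submit file can change `arguments` between `queue` statements and also queue several identical copies at once:

```
executable = /bin/echo
arguments = processing input_a.txt
queue

arguments = processing input_b.txt
queue

# Submit five identical copies of the last configuration
arguments = processing input_c.txt
queue 5
```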

#### Additional submit file options

Now let's look at additional parameters which can be specified in the submit file.

```
request_cpus = 1
request_disk = 1MB
request_memory = 4GB
```

These arguments are fairly self-explanatory: they tell the HTCondor matchmaker that, in order
to run your job, a machine must be able to provide the specified resources.

```
requirements = ((Target.Release == "2022.22") || (Target.Release == "2022.21"))
+Production = True
```

These arguments allow us to further refine which machines our jobs will execute on.
* The `requirements` line specifies that we only want to run our job on two specific
releases of NMRbox. This is useful if you require a specific version of an installed software
package which is not available on all releases of NMRbox. Remember - you can check which releases
of NMRbox include a given version of a software package using the NMRbox software registry on
the NMRbox web site. Here is an example [for NMRPipe](https://nmrbox.org/software/nmrpipe).
* The `+Production = True` line ensures that our job only runs on [production NMRbox machines](https://nmrbox.org/hardware) (normal
NMRbox computational machines). This line is automatically added if you submit your job from a production
NMRbox machine, but it's good to have it explicitly present. We have additional compute nodes in our pool
which you can access, but they don't have the full set of NMR software packages installed, so using them
requires extra work when setting up your submission. If you don't require any specific software to be installed
on the worker node (for example, if you are running a statically compiled binary executable or a pure-Python
program) you can specify `+Production = False` to get access to these additional compute nodes, as sketched below.
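
For instance, a hedged sketch of submit file lines for a job that can also run on those compute-only nodes (assuming your executable is fully self-contained and does not rely on NMRbox-installed software) might look like:

```
# Job does not require NMRbox-installed software, so it can match
# the additional compute-only nodes as well
+Production = False
request_cpus = 1
request_memory = 2GB
```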

```
transfer_executable = FALSE
should_transfer_files = NO
```

In most circumstances, HTCondor expects you to specify exactly which input files your job needs; it will
then transfer those files to the worker node for you and transfer any created output files back.
While this is maximally portable, it adds overhead and requires more work from the job submitter. On the
NMRbox computational pool, all the nodes share a filesystem, so it is possible to bypass this transfer mechanism
and rely on the shared filesystem instead, which saves you from having to specify which input files the job needs.

The only thing to be careful of is that your job must be located in your home directory (or a subdirectory of it)
and not in your scratch directory or another machine-specific temporary directory.

```
getenv = True
```

This line ensures that the HTCondor job runs with the same shell environment as existed in the shell
where you submitted the job. Without specifying this, your job may fail due to not being able to locate
a called executable in the path, or due to other issues. This is not a very portable option when submitting
to a heterogeneous computing environment, but it works well on production NMRbox machines, and simplifies
your life when developing your first submit files.

#### Variables!

It is possible to specify custom variables in your submit file, and have them interpreted in the appropriate
places. Here is an example of the concept:

```
arguments = example $(JobName).pdb
output = logs/$(JobName).stdout
error = logs/$(JobName).stderr
JobName = 2dog
queue
JobName = 2cow
queue
JobName = 1rcf
queue
```

You should be familiar with the `arguments`, `output`, and `error` lines from earlier in the tutorial, but now
you can see that they use the value of a variable called `JobName` to determine their values. Where is
this variable coming from? From us, later in the submit file! You can define any variables you want, other
than ones that would conflict with built-in HTCondor parameter names. In this example, we specify three
different JobNames and queue up three different jobs. Each of those three jobs will write to different
output and error files, and will use a different file for its input.

#### Variables, continued

In addition to defining your own variables, some are automatically available. In the
code above, the example shows running a given executable against three different input files,
and performing some action. But what if you have only a single input file, and you want to run a given
computation 1000 times? For example, imagine you are using Monte Carlo methods.

Fortunately it is easy to achieve this as well. Look at this example `.sub` file stub:

```
executable = monte_carlo_method.py
arguments = -seed $(PROCID)
log = logs/mc.log
output = logs/mc_$(PROCID).out
error = logs/mc_$(PROCID).err
queue 1000
```

In this case we've submitted 1000 processes with this single submit file, all of which belong to a single
cluster. `$(PROCID)` is interpreted automatically for each individual process, so each of the 1000
processes runs with a different seed.

When you submit such a job, Condor will print out the job ID. Each time queue is specified in the
file an additional process is created, but they all share the same job ID; it is the process ID that is
incremented. When looking at the queue, you can see these numbers in the form "jobID.procID". Running
`condor_rm` with a job ID will remove all of the processes with that ID, while running `condor_rm` with a
jobID.procID will only remove the specific process with the given procID, as in the example below.
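
For example, assuming (hypothetically) that the Monte Carlo submission above was assigned job ID 5:

```bash
# Remove every process in job (cluster) 5
condor_rm 5

# Remove only process 42 of job 5
condor_rm 5.42
```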

### The queue

Condor manages a queue of submitted jobs which are waiting to run. By default, the jobs run in
roughly the order they were submitted. To view your job queue, you can run the `condor_q` command.

You will get a result similar to this one:

```
-- Schedd: wiscdaily.nmrbox.org : <127.0.0.1:51028?...
 ID      OWNER     SUBMITTED     RUN_TIME ST PRI SIZE CMD
 1.0     jwedell   7/10 14:59   0+00:00:00 I  0   0.1 zenity --info --te
1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
```

This example shows that a single job (with ID 1) is idle. If your submit file has multiple "queue"
statements but you only see one entry, you can use `condor_q -nobatch` to see each individual job separately.

Here are some common states of jobs that you should be familiar
with:

* `I/Idle` - This means the job is idle. All jobs will be in this state for at least a brief period before
they begin running. If they remain in this state for an extended period of time, it may be the case
that no machines match your job requirements. To check why a job hasn't yet run, you can use the
`condor_q` command to investigate the status of the job. Adding `-better` will perform a "better analysis"
of the job. So for the example queue above, `condor_q -better 1` would explain why the job hasn't yet
started running.
* `R/Running` - This means the job is currently executing on a machine in the pool. Great!
* `H/Held` - This means that something went wrong when attempting to run the job. There are many reasons
this could happen - to check the error message, you can again use `condor_q -better` with the job ID, and
you can also look in the file you defined as the "log" for the job.

If you realize there is a problem with your requirements or submit file, you can remove the job with the
`condor_rm` command followed by the job ID. If you have a job with the queue statement used more than once,
you will remove all of them if you use the integer ID for the submission. You can then make any necessary
changes and resubmit the job.

### The pool

You can also investigate the Condor pool to see what resources are available. To see the pool,
use the `condor_status` command. Note that our pool uses something called "dynamic partitioning". What
this means is that a machine with 128 cores and 500GB of RAM will only appear as one resource when you
run this command. But that doesn't mean it can only run one job. Instead, what happens is that HTCondor automatically
divides up the resources on this machine according to what was requested in the submit file.

Therefore, such a machine may wind up running 100 jobs which only require 1 GB of RAM and a single core,
1 job which requires 20 cores and 10 GB of RAM, and 1 job which requires 1 CPU and 1 GPU. This ensures
that our resources can be used most effectively, and it's why it is important that you enter realistic numbers
for `request_memory` so that you don't ask for more memory than you'll need.

Here is an example of what you may see when you run the `condor_status` command:

```
Name OpSys Arch State Activity LoadAv Mem ActvtyTime
slot1@babbage.nmrbox.org LINUX X86_64 Unclaimed Idle 0.000 2063937 5+13:10:01
slot1@barium.nmrbox.org LINUX X86_64 Unclaimed Idle 0.000 193335 11+22:11:08
slot1@bismuth.nmrbox.org LINUX X86_64 Unclaimed Idle 0.000 364790 0+16:49:41
slot1@bromine.nmrbox.org LINUX X86_64 Unclaimed Idle 0.000 385574 11+22:24:05
slot1@calcium.nmrbox.org LINUX X86_64 Unclaimed Idle 0.000 191880 23+21:50:56
slot1@chlorine.nmrbox.org LINUX X86_64 Unclaimed Idle 0.000 191880 6+20:35:04
slot1@chromium.nmrbox.org LINUX X86_64 Unclaimed Idle 0.000 191880 45+12:21:35
slot1@cobalt.nmrbox.org LINUX X86_64 Unclaimed Idle 0.000 176292 24+17:32:17
slot1@compute-1.bmrb.io LINUX X86_64 Unclaimed Idle 0.000 515522 4+05:21:42
...
slot1@zinc.nmrbox.org LINUX X86_64 Unclaimed Idle 0.000 385576 1+00:39:18
Total Owner Claimed Unclaimed Matched Preempting Backfill Drain
X86_64/LINUX 50 0 9 41 0 0 0 0
Total 50 0 9 41 0 0 0 0
```


This shows the current slots in the pool and their status. Remember that a machine can spawn multiple
slots depending on what resources it has available and what jobs request.

If you want to check a given requirement against the pool to see which machines would be eligible
to run your job, you can do that using the `-const` argument to `condor_status`. Here is an example
to check which machines are on release 2022.22:

`condor_status -const '(Release == "2022.22")'`


### Submitting a job

With an understanding of the essential parts of a Condor submission file, let's go ahead and
actually submit a job to Condor.

In the HTCondor folder in your home directory (`~/EVENTS/2022-nmrbox-summer/HTCondor`) you should
have a file called `pdb.sub` which uses many of the options described above. The executable simply
echoes some text to STDOUT, but it illustrates how you could run a given computation
against a set of PDB structures.

To submit a `.sub` file to Condor, use the `condor_submit` command like so:

```bash
condor_submit pdb.sub
```

The moment you enter this command, your job will be in the Condor job queue. (If you have
multiple "queue" statements in your submit file, those are still considered one clustered job with
multiple processes.) The command will print the job ID:

```
Submitting job(s)...
3 job(s) submitted to cluster 1.
```

### Check the results

If you're able to run the `condor_q` command quickly enough before
your job completes running, you will see the job you just submitted in the queue.

```
-- Schedd: strontium.nmrbox.org : <155.37.253.57:9618?... @ 06/07/22 06:47:35
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
jwedell ID: 1 6/7 06:47 _ _ 3 3 1.0-2
Total for query: 3 jobs; 0 completed, 0 removed, 3 idle, 0 running, 0 held, 0 suspended
Total for jwedell: 3 jobs; 0 completed, 0 removed, 3 idle, 0 running, 0 held, 0 suspended
Total for all users: 3 jobs; 0 completed, 0 removed, 3 idle, 0 running, 0 held, 0 suspended
```

With `-nobatch`:

```
-- Schedd: strontium.nmrbox.org : <155.37.253.57:9618?... @ 06/07/22 06:48:54
 ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
 1.0     jwedell  6/7  06:47   0+00:00:00 I  0   0.0 echo example 2dog.pdb
 1.1     jwedell  6/7  06:47   0+00:00:00 I  0   0.0 echo example 2cow.pdb
 1.2     jwedell  6/7  06:47   0+00:00:00 I  0   0.0 echo example 1rcf.pdb
Total for query: 3 jobs; 0 completed, 0 removed, 3 idle, 0 running, 0 held, 0 suspended
Total for jwedell: 3 jobs; 0 completed, 0 removed, 3 idle, 0 running, 0 held, 0 suspended
Total for all users: 3 jobs; 0 completed, 0 removed, 3 idle, 0 running, 0 held, 0 suspended
```

Continue to use `condor_q` to watch and see when your job has completed. When it completes,
take a look at the output files in the `output` folder and the `log` folder.

### Interactive debugging

Let's explore a very useful technique for examining why a job might not be running the way
that you expect. Update the submit file to replace the executable with `/bin/sleep` and update
the `arguments` to be `1000`. This will just run the Linux sleep command for 1000 seconds, giving you
time to see your job running in the queue. The modified lines are sketched below; submit the file again
using `condor_submit` as before.
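
The two changed lines would look like this (the rest of the submit file stays the same):

```
executable = /bin/sleep
arguments = 1000
```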

After a few seconds, you should see that one or more of the processes in your submission have entered
the running state. (Remember, you can use `condor_q` to check.) Determine the job and process ID of one
you would like to explore further, and run

`condor_ssh_to_job clusterID.procID` where clusterID and procID are replaced with the value for your job,
which you can get from `condor_q` or `condor_q -nobatch`.

This will open an interactive SSH session on the exact machine, and in the exact location, where your job is running. You
can use this to manually step through the actions your job would take and explore any unexpected behavior.
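
For example, if `condor_q -nobatch` showed your sleep job running as process 0 of cluster 2 (a hypothetical ID), you would run:

```bash
condor_ssh_to_job 2.0
```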

#### File transfer

In the vanilla universe, HTCondor will automatically transfer back any files created
during the execution of a job to your local machine. (Though it will not transfer
back folders created or files within them automatically - you must manually specify
those using the `transfer_output_files` argument if you want them to be preserved.)
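
If you do use HTCondor's file transfer (i.e., you have not set `should_transfer_files = NO`), a hedged sketch of asking for a results directory and a summary file to be brought back (both names hypothetical) would be:

```
transfer_output_files = results_dir, summary.txt
```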

As mentioned before, you can avoid this entirely by using the `should_transfer_files` and `transfer_executable`
options and relying on the shared filesystem.

### Helpful hints

Here are some things to keep in mind when using Condor to take advantage of distributed computing:

1) Try to keep the runtime of an individual job process to less than 8 hours.
2) You'll generally make better use of the resources if you can break your work up into compute units
of a single core. This is because it is much easier to find a single available core in the pool to start
running your job than to find, for example, a machine with 48 simultaneously free cores. So rather than
run a multiprocess workflow with `request_cpus = 48`, see if you can instead run 48 Condor jobs/processes
which each require just one CPU.
3) To take advantage of our additional "compute-only" nodes, you can't count on NMR software being installed
and will need to ensure your executable can stand alone. A Python virtual environment would allow this, as would
a statically compiled binary (see the sketch after this list).
4) Don't be afraid to [reach out](mailto:support@nmrbox.org)! We're happy to help you get the most out of our compute resources to
advance your research.
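
As one hedged illustration of the virtual-environment approach from hint 3 (all paths and names here are hypothetical, and it assumes the shared filesystem is visible on the nodes where the job runs), the submit file's `executable` could point at a small wrapper script that activates a venv before running your program:

```bash
#!/bin/bash
# run_analysis.sh -- hypothetical wrapper used as the HTCondor executable.
# Activate a virtual environment stored in the home directory (reachable on
# every node via the shared filesystem), then run the analysis script with
# whatever arguments HTCondor passes along.
source "$HOME/venvs/analysis/bin/activate"
exec python "$HOME/projects/analysis/analyze.py" "$@"
```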


## Summary

This has been a very brief demonstration of the basic features of HTCondor job
submission and management. For more details on the submit file format, please
see the [HTCondor documentation](https://htcondor.readthedocs.io/en/latest/users-manual/quick-start-guide.html).
