Presented by:
Jon Wedell
Lead Software Developer, BioMagResBank
wedell@uchc.edu

## Background

During this tutorial we will walk through submitting jobs to the HTCondor
workload management system. We will then investigate how the jobs run in the HTCondor
environment and explore how to manage submitted jobs.

### The submit file

The submit file is the simplest way to submit one or more jobs to HTCondor. It contains
the information needed to specify what work to perform (which executable to call, which arguments
to provide, etc.) and any requirements your job needs in order to complete successfully (a GPU, a certain release of
NMRbox, etc.).

#### The basics

Let's look at a simple submit file in detail to see what arguments are necessary for a submission.


```
universe = vanilla
executable = /bin/ls
arguments = /
log = logs/basic.log
output = logs/basic.out
error = logs/basic.err
queue
```

Going through the lines in order:

* The universe line specifies which HTCondor universe to run in. These are discussed
in the slides. The most straightforward is the `vanilla` universe; as a beginner, this
universe should be suitable for all your jobs.
* The executable line specifies which command should actually be run when the job runs. You
must use an absolute path.
* The arguments line, while technically optional, will almost always be used. It allows you
to specify which command line arguments to provide to the executable.
* The log, output, and error lines specify where HTCondor will write files. The output and
error files will be populated with whatever the job writes to STDOUT and STDERR, respectively, while the
log will contain the HTCondor logs related to the scheduling and running of the job.
* The queue parameter is special. Each time this statement is encountered in the file,
HTCondor will submit a job to the queue with whatever parameters were specified earlier
in the file. You can specify queue multiple times on different lines, and optionally
change any of the parameters between the queue lines to customize the various runs
of the job. You can also provide a number after `queue` to submit multiple jobs
with the same parameters (e.g., if doing a Monte Carlo simulation), as shown in the sketch below.
You can see this behavior used in practice in the "cluster.sub" file.
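
As a hedged illustration (the executable and input file names here are hypothetical), a submit file can change `arguments` between `queue` statements and also queue several identical copies at once:

```
executable = /bin/echo
arguments = processing input_a.txt
queue

arguments = processing input_b.txt
queue

# Submit five identical copies of the last configuration
arguments = processing input_c.txt
queue 5
```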

#### Additional submit file options

Now let's look at additional parameters which can be specified in the submit file.

```
request_cpus = 1
request_disk = 1MB
request_memory = 4GB
```

These arguments are fairly self-explanatory: they tell the HTCondor matchmaker that, in order
to run your job, a machine must be able to provide the specified resources.

```
requirements = ((Target.Release == "2022.22") || (Target.Release == "2022.21"))
+Production = True
```

These arguments allow us to further refine which machines our jobs will execute on.
* The `requirements` line specifies that we only want to run our job on two specific
releases of NMRbox. This is useful if you require a specific version of an installed software
package which is not available on all releases of NMRbox. Remember - you can check which releases
of NMRbox include a given version of a software package using the NMRbox software registry on
the NMRbox web site. Here is an example [for NMRPipe](https://nmrbox.org/software/nmrpipe).
* The `+Production = True` line ensures that our job only runs on [production NMRbox machines](https://nmrbox.org/hardware) (normal
NMRbox computational machines). This line is automatically added if you submit your job from a production
NMRbox machine, but it's good to have it explicitly present. We have additional compute nodes in our pool
which you can access, but they don't have the full set of NMR software packages installed, so using them
requires extra work when setting up your submission. If you don't require any specific software to be installed
on the worker node (for example, if you are running a statically compiled binary executable or a pure-Python
program) you can specify `+Production = False` to get access to these additional compute nodes, as sketched below.
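
For instance, a hedged sketch of submit file lines for a job that can also run on those compute-only nodes (assuming your executable is fully self-contained and does not rely on NMRbox-installed software) might look like:

```
# Job does not require NMRbox-installed software, so it can match
# the additional compute-only nodes as well
+Production = False
request_cpus = 1
request_memory = 2GB
```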

```
transfer_executable = FALSE
should_transfer_files = NO
```

In most circumstances, HTCondor expects you to specify exactly which input files your job needs; it will
then transfer those files to the worker node for you and transfer any created output files back.
While this is maximally portable, it adds overhead and requires more work from the job submitter. On the
NMRbox computational pool, all the nodes share a filesystem, so it is possible to bypass this transfer mechanism
and rely on the shared filesystem instead, which saves you from having to specify which input files the job needs.

The only thing to be careful of is that your job must be located in your home directory (or a subdirectory of it)
and not in your scratch directory or another machine-specific temporary directory.

```
getenv = True
```

This line ensures that the HTCondor job runs with the same shell environment as existed in the shell
where you submitted the job. Without specifying this, your job may fail due to not being able to locate
a called executable in the path, or due to other issues. This is not a very portable option when submitting
to a heterogeneous computing environment, but it works well on production NMRbox machines, and simplifies
your life when developing your first submit files.

#### Variables!

It is possible to specify custom variables in your submit file, and have them interpreted in the appropriate
places. Here is an example of the concept:

```
arguments = example $(JobName).pdb
output = logs/$(JobName).stdout
error = logs/$(JobName).stderr
JobName = 2dog
queue
JobName = 2cow
queue
JobName = 1rcf
queue
```

You should be familiar with the `arguments`, `output`, and `error` lines from earlier in the tutorial, but now
you can see that they use the value of a variable called `JobName` to determine their values. Where is
this variable coming from? From us, later in the submit file! You can define any variables you want, other
than ones that would conflict with built-in HTCondor parameter names. In this example, we specify three
different JobNames and queue up three different jobs. Each of those three jobs will write to different
output and error files, and will use a different file for its input.

#### Variables, continued

In addition to defining your own variables, some are automatically available. In the
code above, the example shows running a given executable against three different input files,
and performing some action. But what if you have only a single input file, and you want to run a given
computation 1000 times? For example, imagine you are using Monte Carlo methods.

Fortunately it is easy to achieve this as well. Look at this example `.sub` file stub:

```
executable = monte_carlo_method.py
arguments = -seed $(PROCID)
log = logs/mc.log
output = logs/mc_$(PROCID).out
error = logs/mc_$(PROCID).err
queue 1000
```

In this case we've submitted 1000 processes with this single submit file, all of which belong to a single
cluster. `$(PROCID)` is interpreted automatically for each individual process, so each of the 1000
processes runs with a different seed.

When you submit such a job, Condor will print out the job ID. Each time queue is specified in the
file an additional process is created, but they all share the same job ID; it is the process ID that is
incremented. When looking at the queue, you can see these numbers in the form "jobID.procID". Running
`condor_rm` with a job ID will remove all of the processes with that ID, while running `condor_rm` with a
jobID.procID will only remove the specific process with the given procID, as in the example below.
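
For example, assuming (hypothetically) that the Monte Carlo submission above was assigned job ID 5:

```bash
# Remove every process in job (cluster) 5
condor_rm 5

# Remove only process 42 of job 5
condor_rm 5.42
```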

### The queue

Condor manages a queue of submitted jobs which are waiting to run. By default, the jobs run in
roughly the order they were submitted. To view your job queue, you can run the `condor_q` command.

You will get a result similar to this one:

```
-- Schedd: wiscdaily.nmrbox.org : <127.0.0.1:51028?...
 ID      OWNER     SUBMITTED     RUN_TIME ST PRI SIZE CMD
 1.0     jwedell   7/10 14:59   0+00:00:00 I  0   0.1 zenity --info --te
1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
```

This example shows that a single job (with ID 1) is idle. If your submit file has multiple "queue"
statements but you only see one entry, you can use `condor_q -nobatch` to see each individual job separately.

Here are some common states of jobs that you should be familiar
with:

* `I/Idle` - This means the job is idle. All jobs will be in this state for at least a brief period before
they begin running. If they remain in this state for an extended period of time, it may be the case
that no machines match your job requirements. To check why a job hasn't yet run, you can use the
`condor_q` command to investigate the status of the job. Adding `-better` will perform a "better analysis"
of the job. So for the example queue above, `condor_q -better 1` would explain why the job hasn't yet
started running.
* `R/Running` - This means the job is currently executing on a machine in the pool. Great!
* `H/Held` - This means that something went wrong when attempting to run the job. There are many reasons
this could happen - to check the error message, you can again use `condor_q -better` with the job ID, and
you can also look in the file you defined as the "log" for the job.

If you realize there is a problem with your requirements or submit file, you can remove the job with the
`condor_rm` command followed by the job ID. If you have a job with the queue statement used more than once,
you will remove all of them if you use the integer ID for the submission. You can then make any necessary
changes and resubmit the job.

### The pool

You can also investigate the Condor pool to see what resources are available. To see the pool,
use the `condor_status` command. Note that our pool uses something called "dynamic partitioning". What
this means is that a machine with 128 cores and 500GB of RAM will only appear as one resource when you
run this command. But that doesn't mean it can only run one job. Instead, what happens is that HTCondor automatically
divides up the resources on this machine according to what was requested in the submit file.

Therefore, such a machine may wind up running 100 jobs which only require 1 GB of RAM and a single core,
1 job which requires 20 cores and 10 GB of RAM, and 1 job which requires 1 CPU and 1 GPU. This ensures
that our resources can be used most effectively, and it's why it is important that you enter realistic numbers
for `request_memory` so that you don't ask for more memory than you'll need.

Here is an example of what you may see when you run the `condor_status` command:

```
Name OpSys Arch State Activity LoadAv Mem ActvtyTime
slot1@babbage.nmrbox.org LINUX X86_64 Unclaimed Idle 0.000 2063937 5+13:10:01
slot1@barium.nmrbox.org LINUX X86_64 Unclaimed Idle 0.000 193335 11+22:11:08
slot1@bismuth.nmrbox.org LINUX X86_64 Unclaimed Idle 0.000 364790 0+16:49:41
slot1@bromine.nmrbox.org LINUX X86_64 Unclaimed Idle 0.000 385574 11+22:24:05
slot1@calcium.nmrbox.org LINUX X86_64 Unclaimed Idle 0.000 191880 23+21:50:56
slot1@chlorine.nmrbox.org LINUX X86_64 Unclaimed Idle 0.000 191880 6+20:35:04
slot1@chromium.nmrbox.org LINUX X86_64 Unclaimed Idle 0.000 191880 45+12:21:35
slot1@cobalt.nmrbox.org LINUX X86_64 Unclaimed Idle 0.000 176292 24+17:32:17
slot1@compute-1.bmrb.io LINUX X86_64 Unclaimed Idle 0.000 515522 4+05:21:42
...
slot1@zinc.nmrbox.org LINUX X86_64 Unclaimed Idle 0.000 385576 1+00:39:18
Total Owner Claimed Unclaimed Matched Preempting Backfill Drain
X86_64/LINUX 50 0 9 41 0 0 0 0
Total 50 0 9 41 0 0 0 0
```


This shows the current slots in the pool and their status. Remember that a machine can spawn multiple
slots depending on what resources it has available and what jobs request.

If you want to check a given requirement against the pool to see which machines would be eligible
to run your job, you can do that using the `-const` argument to `condor_status`. Here is an example
to check which machines are on release 2022.22:

`condor_status -const '(Release == "2022.22")'`


### Submitting a job

With an understanding of the essential parts of a Condor submission file, let's go ahead and
actually submit a job to Condor.

In the HTCondor folder in your home directory (`~/EVENTS/2022-nmrbox-summer/HTCondor`) you should
have a file called `pdb.sub` which uses many of the options described above. The executable simply
echoes some text to STDOUT, but it illustrates how you could run a given computation
against a set of PDB structures.

To submit a `.sub` file to Condor, use the `condor_submit` command like so:

```bash
condor_submit pdb.sub
```

The moment you enter this command, your job will be in the Condor job queue. (If you have
multiple "queue" statements in your submit file, those are still considered one clustered job with
multiple processes.) The command will print the job ID:

```
Submitting job(s)...
3 job(s) submitted to cluster 1.
```

### Check the results

If you're able to run the `condor_q` command quickly enough before
your job completes running, you will see the job you just submitted in the queue.

```
-- Schedd: strontium.nmrbox.org : <155.37.253.57:9618?... @ 06/07/22 06:47:35
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
jwedell ID: 1 6/7 06:47 _ _ 3 3 1.0-2
Total for query: 3 jobs; 0 completed, 0 removed, 3 idle, 0 running, 0 held, 0 suspended
Total for jwedell: 3 jobs; 0 completed, 0 removed, 3 idle, 0 running, 0 held, 0 suspended
Total for all users: 3 jobs; 0 completed, 0 removed, 3 idle, 0 running, 0 held, 0 suspended
```

With `-nobatch`:

```
-- Schedd: strontium.nmrbox.org : <155.37.253.57:9618?... @ 06/07/22 06:48:54
 ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
 1.0     jwedell  6/7  06:47   0+00:00:00 I  0   0.0 echo example 2dog.pdb
 1.1     jwedell  6/7  06:47   0+00:00:00 I  0   0.0 echo example 2cow.pdb
 1.2     jwedell  6/7  06:47   0+00:00:00 I  0   0.0 echo example 1rcf.pdb
Total for query: 3 jobs; 0 completed, 0 removed, 3 idle, 0 running, 0 held, 0 suspended
Total for jwedell: 3 jobs; 0 completed, 0 removed, 3 idle, 0 running, 0 held, 0 suspended
Total for all users: 3 jobs; 0 completed, 0 removed, 3 idle, 0 running, 0 held, 0 suspended
```

Continue to use `condor_q` to watch and see when your job has completed. When it completes,
take a look at the output files in the `output` folder and the `log` folder.

### Interactive debugging

Let's explore a very useful technique for examining why a job might not be running the way
that you expect. Update the submit file to replace the executable with `/bin/sleep` and update
the `arguments` to be `1000`. This will just run the Linux sleep command for 1000 seconds, giving you
time to see your job running in the queue. The modified lines are sketched below; submit the file again
using `condor_submit` as before.
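
The two changed lines would look like this (the rest of the submit file stays the same):

```
executable = /bin/sleep
arguments = 1000
```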

After a few seconds, you should see that one or more of the processes in your submission have entered
the running state. (Remember, you can use `condor_q` to check.) Determine the job and process ID of one
you would like to explore further, and run

`condor_ssh_to_job clusterID.procID` where clusterID and procID are replaced with the value for your job,
which you can get from `condor_q` or `condor_q -nobatch`.

This will open an interactive SSH session on the exact machine, and in the exact location, where your job is running. You
can use this to manually step through the actions your job would take and explore any unexpected behavior.
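
For example, if `condor_q -nobatch` showed your sleep job running as process 0 of cluster 2 (a hypothetical ID), you would run:

```bash
condor_ssh_to_job 2.0
```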

#### File transfer

In the vanilla universe, HTCondor will automatically transfer back any files created
during the execution of a job to your local machine. (Though it will not transfer
back folders created or files within them automatically - you must manually specify
those using the `transfer_output_files` argument if you want them to be preserved.)
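
If you do use HTCondor's file transfer (i.e., you have not set `should_transfer_files = NO`), a hedged sketch of asking for a results directory and a summary file to be brought back (both names hypothetical) would be:

```
transfer_output_files = results_dir, summary.txt
```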

As mentioned before, you can avoid this entirely by using the `should_transfer_files` and `transfer_executable`
options and relying on the shared filesystem.

### Helpful hints

Here are some things to keep in mind when using Condor to take advantage of distributed computing:

1) Try to keep the runtime of an individual job process to less than 8 hours.
2) You'll generally make better use of the resources if you can break your work up into compute units
of a single core. This is because it is much easier to find a single available core in the pool to start
running your job than to find, for example, a machine with 48 simultaneously free cores. So rather than
run a multiprocess workflow with `request_cpus = 48`, see if you can instead run 48 Condor jobs/processes
which each require just one CPU.
3) To take advantage of our additional "compute-only" nodes, you can't count on NMR software being installed
and will need to ensure your executable can stand alone. A Python virtual environment would allow this, as would
a statically compiled binary (see the sketch after this list).
4) Don't be afraid to [reach out](mailto:support@nmrbox.org)! We're happy to help you get the most out of our compute resources to
advance your research.
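
As one hedged illustration of the virtual-environment approach from hint 3 (all paths and names here are hypothetical, and it assumes the shared filesystem is visible on the nodes where the job runs), the submit file's `executable` could point at a small wrapper script that activates a venv before running your program:

```bash
#!/bin/bash
# run_analysis.sh -- hypothetical wrapper used as the HTCondor executable.
# Activate a virtual environment stored in the home directory (reachable on
# every node via the shared filesystem), then run the analysis script with
# whatever arguments HTCondor passes along.
source "$HOME/venvs/analysis/bin/activate"
exec python "$HOME/projects/analysis/analyze.py" "$@"
```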


## Summary

This has been a very brief demonstration of the basic features of HTCondor job
submission and management. For more details on the submit file format, please
see the [HTCondor documentation](https://htcondor.readthedocs.io/en/latest/users-manual/quick-start-guide.html).
