Jobs are run through the Slurm cluster-management software.
Generally, jobs are run via either srun (for a binary) or sbatch (for executing a bash script).
See also the sbatch documentation or the srun documentation
srun --partition naples -n1 --mem 1G --pty bash
will give you an interactive bash shell with a single CPU and 1G of memory on the naples partition.
--mem
allocated memory, supports unit suffixes (e.g. --mem 15G for 15 gigabytes; a plain number is interpreted as megabytes)
--exclusive
allocate an entire node, allocates all the memory and cpus
--partition
use a specific partition (should be set to naples unless you know what you are doing)
--time
limits the execution time; use days-hours:minutes:seconds format (e.g. 1-12:00:00), or hours:minutes:seconds for shorter jobs.
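If you compute run times in seconds, a small helper can turn them into a string that --time accepts. The following is only a sketch: to_slurm_time is a hypothetical function name, not part of Slurm, and it assumes the days-hours:minutes:seconds form that the sbatch documentation lists among the accepted time formats.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm): convert a number of seconds
# into the days-hours:minutes:seconds form accepted by --time.
to_slurm_time() {
    local total=$1
    local days=$(( total / 86400 ))
    local hours=$(( (total % 86400) / 3600 ))
    local minutes=$(( (total % 3600) / 60 ))
    local seconds=$(( total % 60 ))
    printf '%d-%02d:%02d:%02d\n' "$days" "$hours" "$minutes" "$seconds"
}

to_slurm_time 3900     # 1 hour and 5 minutes
to_slurm_time 180000   # 2 days and 2 hours
```

The result could then be passed along as, for example, --time "$(to_slurm_time 3900)".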
You can set default options for sbatch at the top of your script by adding #SBATCH lines directly after the shebang, as follows:
#!/bin/bash
#SBATCH --time=1:05:00
#SBATCH --mail-user=pgj@cs.aau.dk
#SBATCH --mail-type=FAIL
#SBATCH --partition=naples
#SBATCH --mem=15000
##SBATCH --mem=64G
echo "hello world"
Assuming the previous script is called helloworld.sh, executing sbatch helloworld.sh will allocate 15G of memory on the naples partition and send pgj an email if the job fails. The job will be forcefully terminated after 1 hour and 5 minutes. Use a double # to comment out an #SBATCH line when experimenting with sbatch options.
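To double-check which options in a script are active and which are commented out, one option is to list the lines that begin with exactly #SBATCH. This is a minimal sketch that writes an illustrative script to a temporary file; on the cluster you would point grep at your own script instead.

```shell
#!/bin/bash
# Write a small example script to a temporary file (contents are illustrative).
script=$(mktemp)
cat > "$script" <<'EOF'
#!/bin/bash
#SBATCH --partition=naples
#SBATCH --mem=15000
##SBATCH --mem=64G
echo "hello world"
EOF

# Lines starting with exactly "#SBATCH" are active; "##SBATCH" does not match
# because the pattern is anchored at the start of the line.
active=$(grep '^#SBATCH' "$script")
echo "$active"
rm -f "$script"
```

Only the --partition and --mem=15000 lines are printed; the line with the double # is skipped.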
You can see all queued and running jobs with
squeue
To see only your jobs
squeue -u $(whoami)
To investigate more details about the job, use
scontrol show jobid=$JOBID
where $JOBID
is one of the IDs given by squeue.
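Instead of looking the ID up in squeue, it can also be captured when the job is submitted, since sbatch normally prints a line of the form "Submitted batch job <id>". The sketch below assumes that output format; job_id_from_sbatch_output is a hypothetical helper name, and the sbatch output is simulated with echo so the example runs without a cluster.

```shell
#!/bin/bash
# Hypothetical helper: pull the numeric ID out of sbatch's confirmation line.
job_id_from_sbatch_output() {
    # The ID is the 4th whitespace-separated field of "Submitted batch job <id>".
    awk '/^Submitted batch job/ {print $4}'
}

# Simulated sbatch output (12345 is a made-up job ID).
JOBID=$(echo "Submitted batch job 12345" | job_id_from_sbatch_output)
echo "$JOBID"

# On the cluster this would be, for example:
#   JOBID=$(sbatch helloworld.sh | job_id_from_sbatch_output)
#   scontrol show jobid=$JOBID
```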
You cancel a job by running
scancel $JOBID
where $JOBID is the ID given by, e.g., squeue.
If you want to cancel a range of jobs (say from job ID 100 to 900), you can do so with this one-liner, which relies on bash brace expansion:
scancel {100..900}
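The range is expanded by bash before scancel starts, so scancel simply receives one argument per job ID. The sketch below uses echo in place of scancel to show the expansion, and seq as an alternative when the bounds are held in variables (bash does not expand variables inside braces). The names first and last are example variables.

```shell
#!/bin/bash
# Brace expansion happens in the shell, before the command is run.
echo {100..105}

# With variable bounds, {$first..$last} is not expanded by bash,
# so generate the list with seq instead.
first=100
last=105
ids=$(seq "$first" "$last" | xargs)
echo "$ids"

# On the cluster: scancel $(seq "$first" "$last")
```

Both echo lines print the same space-separated list of IDs from 100 to 105.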
You can also cancel all of your jobs by
scancel --user=$(whoami)
You can conveniently use /usr/bin/time to measure the running time and memory use of your binary.
Just prepend /usr/bin/time and a format string to your call, for example:
/usr/bin/time -f "@@@%e,%M@@@" echo "hello timing"
This will output the following:
hello timing
@@@0.00,1960@@@
which is the elapsed wall-clock time in seconds (%e) and the maximum resident set size in kilobytes (%M), wrapped in @@@ markers.
We can dump this result into a file (time writes its report to stderr, so &> is used to capture both stdout and stderr):
/usr/bin/time -f "@@@%e,%M@@@" echo "hello timing" &> filename
You can conveniently pick this up with grep as follows:
grep -oP "(?<=@@@).*(?=@@@)" filename
which will give you 0.00,1960
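For further processing, the two comma-separated values can be split into separate shell variables. This is a minimal sketch that recreates the sample output shown above in a temporary file; the variable names seconds and max_kb are illustrative, and grep -P assumes GNU grep.

```shell
#!/bin/bash
# Recreate the sample output (program output plus the @@@ line).
log=$(mktemp)
printf 'hello timing\n@@@0.00,1960@@@\n' > "$log"

# Extract the text between the markers, then split it on the comma.
IFS=, read -r seconds max_kb < <(grep -oP '(?<=@@@).*(?=@@@)' "$log")
echo "elapsed: ${seconds}s, peak memory: ${max_kb} kB"
rm -f "$log"
```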
To cancel all of your jobs that are stuck with the reason DependencyNeverSatisfied:
squeue -u $(whoami) | grep DependencyNeverSatisfied | awk '{print $1}' | xargs scancel
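Before piping anything into scancel, it can be worth previewing which job IDs the filter selects. The sketch below is a dry run under stated assumptions: never_satisfied_ids is a hypothetical helper, and the sample text is made up in the shape of squeue's default output (job names, user and node01 are illustrative), so it runs without a cluster.

```shell
#!/bin/bash
# Hypothetical helper: print the job ID (first column) of every line
# that mentions DependencyNeverSatisfied.
never_satisfied_ids() {
    awk '/DependencyNeverSatisfied/ {print $1}'
}

# Made-up sample in the shape of squeue's default output.
sample='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
101 naples job_a pgj PD 0:00 1 (DependencyNeverSatisfied)
102 naples job_b pgj R 5:12 1 node01
103 naples job_c pgj PD 0:00 1 (DependencyNeverSatisfied)'

# Preview only; nothing is cancelled here.
echo "$sample" | never_satisfied_ids

# On the cluster: squeue -u $(whoami) | never_satisfied_ids | xargs scancel
```

Only the two pending jobs with the unsatisfiable dependency are listed; the running job is left alone.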