This post provides a general overview of tools for querying, filtering, cancelling and tracing Torque jobs. The suite of command-line utilities described in the article can be installed using the
torque-package-clients-linux-x86_64.sh self-extracting script.
Querying and Filtering Jobs
The status of batch jobs can be printed using the
qstat program. By default it displays a list of all jobs in the queue on the server:
qstat queue-name@server
# Output:
Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1.server.cluster.org      job.sh           vagrant                0 R batch
If queue-name and server are omitted, the jobs in the default queue on the default server are listed.
The S column in the output contains the current state of a job. The available states are listed in the table below.
| State | Description |
|---|---|
| C | Completed after having run |
| E | Exiting after having run |
| H | Held |
| Q | Queued, eligible to run or be routed |
| R | Running |
| T | Transferred to a new location |
| W | Waiting for execution conditions to be satisfied |
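As a rough illustration of how the S column can be used, the sketch below counts jobs per state from the default qstat listing. The function name count_by_state is made up for this example, and it assumes the two-line header and the column layout shown above, with the state in the second-to-last field.

```shell
# count_by_state: hypothetical helper, reads default `qstat` output on stdin
# and prints "<state> <count>" for each state found, sorted by state letter.
# Assumes a two-line header and the state in the second-to-last column.
count_by_state() {
    awk 'NR > 2 && NF >= 6 { count[$(NF - 1)]++ }
         END { for (s in count) print s, count[s] }' | sort
}

# Typical use on a host with the Torque client tools installed:
#   qstat | count_by_state
```

If the listing format differs on a given installation, the field index in the awk program has to be adjusted accordingly.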
The listing can be limited to jobs owned by a specific user by running
qstat with the -u flag:
qstat -u $USER
# Output:
server.cluster.org:
                                                                                  Req'd     Req'd       Elap
Job ID                  Username    Queue    Jobname          SessID NDS   TSK    Memory    Time      S Time
----------------------- ----------- -------- ---------------- ------ ----- ------ --------- --------- - ---------
1.server.cluster.org    vagrant     batch    test.sh          9038   1     24     --        00:10:00  C --
$USER is an environment variable that holds the name of the current user in the Bash shell.
The columns of the table above contain the following information:
- Job ID: the job identifier assigned by the batch server
- Username: the job owner
- Queue: the queue in which the job currently resides
- Jobname: the job name given by the submitter
- SessID: the session ID if the job is running
- NDS: the number of nodes requested by the job
- TSK: the number of CPUs or tasks requested by the job
- Req'd Memory: the amount of memory requested by the job
- Req'd Time: either the CPU time, if specified, or the wall time requested by the job (hh:mm:ss)
- S: the job's current state
- Elap Time: the amount of CPU time or wall time used by the job (hh:mm:ss)
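The column positions listed above make it straightforward to filter this listing with awk. The following is a minimal sketch under that assumption; jobs_in_state is a hypothetical name, and the script relies on data rows having exactly eleven whitespace-separated fields with the job ID first and the state tenth.

```shell
# jobs_in_state: hypothetical helper, reads `qstat -a` or `qstat -u` output on
# stdin and prints the IDs of jobs whose S column equals the given state.
# Assumes the 11-column layout described above (Job ID first, S tenth).
jobs_in_state() {
    awk -v state="$1" 'NF == 11 && $1 ~ /^[0-9]/ && $10 == state { print $1 }'
}

# Typical use on a host with the Torque client tools installed:
#   qstat -u "$USER" | jobs_in_state R
```

Header and separator lines are skipped because they either have a different field count or do not start with a digit.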
qstat output can also be narrowed to a specific job or a set of jobs and displayed in a more verbose form: the -a flag prints the alternative format shown above, and -f prints the full list of job attributes.
qstat -a first-job-id second-job-id
# Output:
server.cluster.org:
                                                                                  Req'd     Req'd       Elap
Job ID                  Username    Queue    Jobname          SessID NDS   TSK    Memory    Time      S Time
----------------------- ----------- -------- ---------------- ------ ----- ------ --------- --------- - ---------
1.server.cluster.org    vagrant     batch    test.sh          9091   1     24     --        00:10:00  C --
qstat -f 1.server.cluster.org
# Output:
Job Id: 1.server.cluster.org
    Job_Name = test.sh
    Job_Owner = firstname.lastname@example.org
    resources_used.cput = 00:00:00
    resources_used.energy_used = 0
    resources_used.mem = 780kb
    resources_used.vmem = 0kb
    resources_used.walltime = 00:00:00
    job_state = C
    queue = batch
    server = server.cluster.org
    Checkpoint = u
    ctime = Sun Mar 19 17:39:12 2017
    Error_Path = node1.cluster.org:/home/vagrant/test.sh.e25
    exec_host = server.cluster.org/0
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Sun Mar 19 17:39:21 2017
    Output_Path = node1.cluster.org:/home/vagrant/test.sh.o25
    Priority = 0
    qtime = Sun Mar 19 17:39:12 2017
    Rerunable = True
    Resource_List.mppwidth = 24
    Resource_List.walltime = 00:10:00
    Resource_List.mppnodect = -1
    Resource_List.nodes = 1
    Resource_List.nodect = 1
    Resource_List.neednodes = 1
    session_id = 9091
    substate = 59
    Variable_List = PBS_O_QUEUE=batch,PBS_O_HOME=/home/vagrant,
        PBS_O_LOGNAME=vagrant,
        PBS_O_PATH=/usr/local/bin:/usr/local/sbin:/usr/local/sbin:/usr/local/
        bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games,
        PBS_O_MAIL=/var/mail/vagrant,PBS_O_SHELL=/bin/bash,
        PBS_O_LANG=en_US.UTF-8,PBS_O_WORKDIR=/home/vagrant,
        PBS_O_HOST=node1.cluster.org,PBS_O_SERVER=server
    euser = vagrant
    egroup = vagrant
    hashname = 25.server.cluster.org
    queue_rank = 2
    queue_type = E
    sched_hint = Unable to copy files back - please see the mother superior's log for exact details.
    comment = Job started on Sun Mar 19 at 17:39
    etime = Sun Mar 19 17:39:12 2017
    exit_status = 0
    submit_args = test.sh
    start_time = Sun Mar 19 17:39:12 2017
    start_count = 1
    fault_tolerant = False
    comp_time = Sun Mar 19 17:39:21 2017
    job_radix = 0
    total_runtime = 8.200034
    submit_host = node1.cluster.org
    init_work_dir = /home/vagrant
    request_version = 1
    req_information.task_count.0 = 1
    req_information.lprocs.0 = 1
    req_information.thread_usage_policy.0 = allowthreads
    req_information.hostlist.0 = server.cluster.org:ppn=1
    req_information.task_usage.0.task.0.cpu_list = 0
    req_information.task_usage.0.task.0.mem_list = 0
    req_information.task_usage.0.task.0.cores = 0
    req_information.task_usage.0.task.0.threads = 1
    req_information.task_usage.0.task.0.host = server.cluster.org
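The full listing is long, and scripts usually need only one or two attributes from it. The sketch below shows one possible way to pull a single value out of the "name = value" lines. The helper name job_attr is an assumption made for this example, and wrapped multi-line values such as Variable_List are deliberately not handled.

```shell
# job_attr: hypothetical helper, reads `qstat -f` output on stdin and prints
# the value of one attribute. Only "name = value" lines are handled; wrapped
# continuation lines (as in Variable_List) are not joined.
job_attr() {
    awk -v name="$1" '!found && $1 == name && $2 == "=" {
        sub(/^[^=]*= /, "")   # drop everything up to and including "= "
        print
        found = 1
    }'
}

# Typical use on a host with the Torque client tools installed:
#   qstat -f 1.server.cluster.org | job_attr exit_status
```

Nothing is printed when the attribute is absent, so callers can test for an empty result.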
qstat can also print information about the status of a batch server, queues and allocated nodes, or display the results in a different format. For the full list of supported command-line options check the Administrator guide or Linux man pages.
Cancelling Jobs
Running and queued jobs are cancelled using the
qdel utility. A job can be cancelled by its owner, by batch server operators and by administrators.
If a job is already running, a SIGTERM signal is sent to the process to allow a graceful finish. After a preconfigured delay a SIGKILL signal is sent. The delay between these two signals is a configuration parameter of the execution queue. It can also be specified in the request using the
-W option, which takes precedence over the default value.
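A job script can take advantage of the gap between the two signals by trapping SIGTERM. The following is only a sketch of that idea in Bash: the cleanup and install_term_handler functions are hypothetical, and it assumes the signal actually reaches the shell running the job script and that the cleanup work fits within the configured delay.

```shell
# cleanup: hypothetical placeholder for whatever the job has to do before it
# stops (write a checkpoint, remove scratch files, ...).
cleanup() {
    echo "SIGTERM received, cleaning up"
}

# install_term_handler: call this near the top of a job script. When SIGTERM
# arrives, cleanup runs and the script exits with status 143 (128 + 15).
# This only helps if cleanup finishes before the follow-up SIGKILL.
install_term_handler() {
    trap 'cleanup; exit 143' TERM
}
```

Bash runs a trap only after the current foreground command returns, so a long-running program is often started in the background and waited on with wait to let the handler fire promptly.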
Cancel a single job and pass a message with the
-m option:
qdel -m "message" job-id
Cancel all jobs owned by the current user. If a job is running, send a SIGTERM signal and wait 10 seconds before attempting to kill the process:
qdel -W 10 all
For a detailed explanation of the flags supported by
qdel refer to the Administrator guide or Linux man pages.
Tracing Jobs
Full information about a Torque job's track record is available through the
tracejob utility. Running the program as a privileged user with the
-v flag makes it possible to locate the compute node assigned by the scheduler, which may be useful for debugging configuration issues.
sudo tracejob -v job-id
# Output:
/var/spool/torque/server_priv/accounting/20170320: Successfully located matching job records
/var/spool/torque/server_logs/20170320: Successfully located matching job records
/var/spool/torque/mom_logs/20170320: No such file or directory
/var/spool/torque/sched_logs/20170320: Successfully located matching job records
Job: server.torque.org
03/20/2017 14:26:52.274 S    enqueuing into batch, state 1 hop 1
03/20/2017 14:26:52.529 S    Job Modified at request of email@example.com
03/20/2017 14:26:52.556 L    Job Run
03/20/2017 14:26:52.530 S    Job Run at request of firstname.lastname@example.org
03/20/2017 14:26:52.556 S    Not sending email: User does not want mail of this type.
03/20/2017 14:26:52     A    queue=batch
03/20/2017 14:26:52     A    user=user group=user jobname=test.sh queue=batch ctime=1490020012 qtime=1490020012 etime=1490020012
                             start=1490020012 email@example.com exec_host=node.torque.ccds.org/0 Resource_List.nodes=1:ppn=1
                             Resource_List.mem=16mb Resource_List.walltime=00:10:00 Resource_List.host=XXX.XX.XX.XX
                             Resource_List.nodect=1 Resource_List.neednodes=1:ppn=1
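The accounting records in this output are plain key=value pairs, so the assigned node can be picked out with standard text tools. The sketch below assumes the space-separated layout shown above; acct_field is a hypothetical helper name, and only the first occurrence of a key is returned.

```shell
# acct_field: hypothetical helper, reads tracejob output on stdin and prints
# the value of the first key=value field whose key matches the argument.
# Assumes whitespace-separated key=value pairs as in the sample output above.
acct_field() {
    # Put every whitespace-separated token on its own line, then match the key.
    tr -s ' \t' '\n\n' |
        awk -F= -v key="$1" '!found && $1 == key {
            print substr($0, length(key) + 2)   # text after "key="
            found = 1
        }'
}

# Typical use on a host with the Torque client tools installed:
#   sudo tracejob -v job-id | acct_field exec_host
```

Taking the substring after the key instead of the second awk field keeps values that themselves contain an equals sign, such as 1:ppn=1, intact.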
Hopefully, you will not have to use this command often.