Job Management in Torque

This post provides a general overview of the tools for querying, filtering, cancelling and tracing Torque jobs. The suite of command line utilities described in this article can be installed using the torque-package-clients-linux-x86_64.sh self-extracting script.
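
A typical installation is a single command run as root. This is only a sketch assuming the package was produced by the standard make packages build; check the options accepted by your own copy of the script before running it:

sudo ./torque-package-clients-linux-x86_64.sh --install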

Querying and Filtering Jobs

The status of batch jobs can be printed using the qstat program. By default it displays a list of all jobs in the queue on the server:

qstat queue-name@server

# Output:
Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1.server.cluster.org      job.sh           vagrant                0 R batch

If queue-name and server are omitted, the jobs in the default queue on the default server are listed.
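
For example, on a submission node with a default server configured, the bare command is enough:

qstat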

The S column in the output contains the current status of a job. The list of all available states is presented in the table below.

Status  Description
------  -----------
C       Completed after having run
E       Exiting after having run
H       Held
Q       Queued, eligible to run or be routed
R       Running
T       Transferred to a new location
W       Waiting for execution conditions to be satisfied
S       Suspended
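
The S column also lends itself to quick command line filtering. A rough sketch that keeps only the running jobs, assuming the default layout shown above (this is a plain awk pipe, not a dedicated Torque feature):

# Keep rows whose fifth column (S) equals R, skipping the two header lines.
qstat | awk 'NR > 2 && $5 == "R"'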

The results can be limited to jobs owned by a specific user by executing qstat with the -u option.

qstat -u $USER

# Output:
server.cluster.org:
                                                                                  Req'd       Req'd       Elap
Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory      Time    S   Time
----------------------- ----------- -------- ---------------- ------ ----- ------ --------- --------- - ---------
1.server.cluster.org    vagrant     batch    test.sh            9038     1     24       --   00:10:00 C       --

$USER is an environment variable that the Bash shell sets to the name of the current user.

The columns of the table above contain the following information:

  1. the job identifier assigned by the batch server
  2. the job owner
  3. the queue in which the job currently resides
  4. the job name given by the submitter
  5. the session id if the job is running
  6. the number of nodes requested by the job
  7. the number of cpus or tasks requested by the job
  8. the amount of memory requested by the job
  9. either the cpu time, if specified, or wall time requested by the job (hh:mm:ss)
  10. the job's current state
  11. the amount of cpu time or wall time used by the job (hh:mm:ss)
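
When scripting around this output, the first column is usually the one of interest. A small sketch that extracts only the job identifiers of the current user; it relies on identifiers starting with a number followed by a dot, so adjust the pattern to your cluster's naming scheme:

# Print the first field of every row that looks like a job identifier.
qstat -u $USER | awk '$1 ~ /^[0-9]+\./ {print $1}'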

The qstat output can also be narrowed down to a specific job or a set of jobs, or displayed in a more verbose manner.

qstat -a first-job-id second-job-id

# Output:
server.cluster.org:
                                                                                  Req'd       Req'd       Elap
Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory      Time    S   Time
----------------------- ----------- -------- ---------------- ------ ----- ------ --------- --------- - ---------
1.server.cluster.org    vagrant     batch    test.sh            9091     1     24       --   00:10:00 C       --

The -f flag prints the full set of attributes of a single job:

qstat -f 1.server.cluster.org

# Output:
Job Id: 1.server.cluster.org
    Job_Name = test.sh
    Job_Owner = vagrant@node1.cluster.org
    resources_used.cput = 00:00:00
    resources_used.energy_used = 0
    resources_used.mem = 780kb
    resources_used.vmem = 0kb
    resources_used.walltime = 00:00:00
    job_state = C
    queue = batch
    server = server.cluster.org
    Checkpoint = u
    ctime = Sun Mar 19 17:39:12 2017
    Error_Path = node1.cluster.org:/home/vagrant/test.sh.e25
    exec_host = server.cluster.org/0
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Sun Mar 19 17:39:21 2017
    Output_Path = node1.cluster.org:/home/vagrant/test.sh.o25
    Priority = 0
    qtime = Sun Mar 19 17:39:12 2017
    Rerunable = True
    Resource_List.mppwidth = 24
    Resource_List.walltime = 00:10:00
    Resource_List.mppnodect = -1
    Resource_List.nodes = 1
    Resource_List.nodect = 1
    Resource_List.neednodes = 1
    session_id = 9091
    substate = 59
    Variable_List = PBS_O_QUEUE=batch,PBS_O_HOME=/home/vagrant,
        PBS_O_LOGNAME=vagrant,
        PBS_O_PATH=/usr/local/bin:/usr/local/sbin:/usr/local/sbin:/usr/local/
        bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games,
        PBS_O_MAIL=/var/mail/vagrant,PBS_O_SHELL=/bin/bash,
        PBS_O_LANG=en_US.UTF-8,PBS_O_WORKDIR=/home/vagrant,
        PBS_O_HOST=node1.cluster.org,PBS_O_SERVER=server
    euser = vagrant
    egroup = vagrant
    hashname = 25.server.cluster.org
    queue_rank = 2
    queue_type = E
    sched_hint = Unable to copy files back - please see the mother superior's
        log for exact details.
    comment = Job started on Sun Mar 19 at 17:39
    etime = Sun Mar 19 17:39:12 2017
    exit_status = 0
    submit_args = test.sh
    start_time = Sun Mar 19 17:39:12 2017
    start_count = 1
    fault_tolerant = False
    comp_time = Sun Mar 19 17:39:21 2017
    job_radix = 0
    total_runtime = 8.200034
    submit_host = node1.cluster.org
    init_work_dir = /home/vagrant
    request_version = 1
    req_information.task_count.0 = 1
    req_information.lprocs.0 = 1
    req_information.thread_usage_policy.0 = allowthreads
    req_information.hostlist.0 = server.cluster.org:ppn=1
    req_information.task_usage.0.task.0.cpu_list = 0
    req_information.task_usage.0.task.0.mem_list = 0
    req_information.task_usage.0.task.0.cores = 0
    req_information.task_usage.0.task.0.threads = 1
    req_information.task_usage.0.task.0.host = server.cluster.org

qstat can also be used to print information about the status of a batch server, queues and allocated nodes, or to display results in different formats. For the full list of supported command line options, check the Administrator Guide or the Linux man pages.
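
A few commonly used invocations of this kind are shown below; they rely on standard qstat options and are meant as a quick reference rather than an exhaustive list:

qstat -q    # summary of the queues configured on the server
qstat -B    # status of the batch server itself
qstat -n    # job listing extended with the nodes allocated to each job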

Job Cancellation

Running and queued jobs are cancelled using the qdel utility. Users allowed to perform this action on a job are its owner, the batch server operators and administrators.

If a job is already running, a SIGTERM signal is sent to the process to allow it to finish gracefully. After a preconfigured delay a SIGKILL signal follows. The time span between these two signals is a configuration parameter of the execution queue. It can also be specified in the request using the -W option, which takes precedence over the default value.

Cancel a single job and attach a message explaining the reason for the cancellation.

qdel job-id -m "message"

Cancel all jobs owned by the current user. If a job is running, send a SIGTERM signal and wait for 10 seconds before attempting to kill the process.

qdel all -W 10

For a detailed explanation of the flags supported by qdel, refer to the Administrator Guide or the Linux man pages.

Job Troubleshooting

The full track record of a Torque job is available through the tracejob utility. Executing the program as a privileged user with the -v flag makes it possible to locate the compute node assigned by the scheduler, which may be useful for debugging configuration issues.

sudo tracejob -v job-id
 
# Output:
/var/spool/torque/server_priv/accounting/20170320: Successfully located matching job records
/var/spool/torque/server_logs/20170320: Successfully located matching job records
/var/spool/torque/mom_logs/20170320: No such file or directory
/var/spool/torque/sched_logs/20170320: Successfully located matching job records

Job: server.torque.org

03/20/2017 14:26:52.274 S    enqueuing into batch, state 1 hop 1
03/20/2017 14:26:52.529 S    Job Modified at request of
                          root@server.torque.org
03/20/2017 14:26:52.556 L    Job Run
03/20/2017 14:26:52.530 S    Job Run at request of
                          root@server.torque.org
03/20/2017 14:26:52.556 S    Not sending email: User does not want mail of this
                          type.
03/20/2017 14:26:52  A    queue=batch
03/20/2017 14:26:52  A    user=user group=user jobname=test.sh
                          queue=batch ctime=1490020012 qtime=1490020012
                          etime=1490020012 start=1490020012
                          owner=user@server.torque.org
                          exec_host=node.torque.ccds.org/0
                          Resource_List.nodes=1:ppn=1 Resource_List.mem=16mb
                          Resource_List.walltime=00:10:00
                          Resource_List.host=XXX.XX.XX.XX
                          Resource_List.nodect=1 Resource_List.neednodes=1:ppn=1
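
By default tracejob only inspects the log files from the current day. If the job finished earlier, the -n option widens the search to the given number of past days:

sudo tracejob -n 3 -v job-id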

Hopefully, you will not have to use this command often.