Building Pyramids

Matt Polnik's blog

Torque Installation on Ubuntu

Torque is an open-source cluster management system deployed in many supercomputing facilities around the world. A quick Google search returns, on the first page alone, Torque user guides hosted on servers in the USA, the UK, India and Germany. Despite Torque's wide adoption and its extensive Administrator Guide, I found it difficult to set up a managed cluster of a dozen computing nodes with no prior experience. My problems stemmed from the lack of official Ubuntu support from Adaptive Computing, the company that maintains Torque, and from the absence of Ubuntu examples in the Administrator Guide. This article aims to fill that gap with detailed instructions for setting up a Torque cluster with two computing nodes running Ubuntu 14.04 LTS (Trusty Tahr).

The tutorial is complemented by Vagrant provisioning scripts for a Torque server and a computing node that automatically execute all steps explained in the article.

Torque Compilation

We are going to compile Torque with support for Control Groups, a Linux kernel mechanism for grouping, tracking and limiting resource usage. This feature makes it possible to set and enforce limits on hardware resources such as memory and CPU on the Torque computing nodes. To learn more about Control Groups in Ubuntu, refer to the Ubuntu Server Guide. The table below gives a short overview of the other software components required for Torque compilation.

Software Component | Usage
libtool | Generic library providing a consistent and portable interface for using shared libraries.
cpuset | Pseudo-filesystem interface to the kernel mechanism that controls the CPU and memory placement of processes.
openssl | Toolkit for the Transport Layer Security and Secure Sockets Layer protocols.
Tool Command Language (TCL) | Dynamic programming language suitable for web and desktop applications, networking, administration and testing.
Tk | Graphical user interface toolkit.
Libxml2 | XML C parser and toolkit.
Portable Hardware Locality (hwloc) | A portable, cross-platform abstraction of the hierarchical topology of modern architectures, including NUMA memory nodes, sockets, shared caches, cores and simultaneous multithreading. It also gathers various system attributes such as cache and memory information as well as the locality of I/O devices such as network interfaces or GPUs.
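
Before compiling with Control Groups support, it can help to confirm which controllers the running kernel has enabled. The sketch below reads a file in the /proc/cgroups format and reports the controllers that are missing. The function names are made up for this example, and the list of controllers passed in the usage comment is an assumption, not a requirement taken from the Torque documentation.

```shell
#!/usr/bin/env bash
# List the controllers marked as enabled in a /proc/cgroups-style file.
# Columns of that file: subsys_name hierarchy num_cgroups enabled
enabled_controllers() {
  local cgroups_file="${1:-/proc/cgroups}"
  awk '!/^#/ && $4 == 1 { print $1 }' "$cgroups_file"
}

# Print every wanted controller that is not enabled.
# Prints nothing when all wanted controllers are enabled.
missing_controllers() {
  local cgroups_file="$1"; shift
  local enabled wanted
  enabled="$(enabled_controllers "$cgroups_file")"
  for wanted in "$@"; do
    # -x matches whole lines, so "cpu" does not match "cpuset"
    grep -qx "$wanted" <<< "$enabled" || echo "$wanted"
  done
}

# Example (the controller list is an assumption, adjust it to your needs):
# missing_controllers /proc/cgroups cpuset cpuacct memory devices
```

An empty result means every controller you asked about is enabled.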

As of this writing, the latest version of the hwloc library available in the official Ubuntu Trusty repositories does not satisfy the minimum version required by Torque, so we are going to install a recent hwloc version from source.

Install packages required for Torque compilation.

apt-get update
apt-get install --assume-yes gcc g++ make cpuset openssl cgroup-bin \
    libtool tcl8.6-dev tk8.6-dev libxml2-dev libcgroup-dev \
    libhwloc-dev libboost-all-dev libssl-dev

pushd .

mkdir -p ~/Applications && cd ~/Applications
wget --quiet https://www.open-mpi.org/software/hwloc/v1.11/downloads/hwloc-1.11.4.tar.gz
tar -xvzf hwloc-1.11.4.tar.gz
rm hwloc-1.11.4.tar.gz
cd hwloc-1.11.4
./configure && make install

popd
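
Once hwloc is built, you may want to check that the installed version really meets the minimum. The following is a minimal sketch of a version comparison based on the version sort of GNU sort. The function name is hypothetical, and the 1.9.1 threshold in the usage comment is an assumption; check the Torque documentation for the value that applies to your release.

```shell
#!/usr/bin/env bash
# Succeed when version $1 is greater than or equal to version $2.
# Relies on the version sort (-V) provided by GNU sort.
version_at_least() {
  local have="$1" need="$2"
  # If the smaller of the two versions is the required one, we are fine.
  [ "$(printf '%s\n' "$need" "$have" | sort -V | head -n 1)" = "$need" ]
}

# Example usage (the 1.9.1 threshold is an assumption):
# installed="$(pkg-config --modversion hwloc)"
# version_at_least "$installed" 1.9.1 || echo "hwloc $installed is too old"
```

A plain string comparison would rank 1.9.1 above 1.11.4, which is why the sketch uses a version-aware sort.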

Download and compile Torque. The commands below download the sources of Torque 6.1.0 (the latest version as of this writing), enable a subset of non-default Torque features, compile the sources and build the Torque installation packages. The names of the switches that enable extra Torque features are self-explanatory, so we are not going to cover them here. For the full list of available features, refer to the Customizing the Install chapter in the Administrator Guide.

pushd .

mkdir -p ~/Applications && cd ~/Applications
wget --quiet http://wpfilebase.s3.amazonaws.com/torque/torque-6.1.0.tar.gz
tar -xvzf torque-6.1.0.tar.gz
rm torque-6.1.0.tar.gz

cd torque-6.1.0
./configure --enable-gcc-warnings --enable-shared --enable-static \
--enable-gui --enable-fast-install \
--enable-syslog --enable-cgroups --enable-unixsockets --enable-tcl-qstat --with-scp \
--with-server-home=/var/spool/torque \
--with-boost-path=/usr/share/doc/libboost-all-dev \
--with-tcl=/usr/lib/tcl8.6 --with-tclinclude=/usr/include/tcl8.6 \
--with-tk=/usr/lib/tk8.6 --with-tkinclude=/usr/include/tcl8.6/tk-private/generic

make
make packages

popd

Torque packages are distributed as self-extracting Bash scripts. Use the --listfiles switch to list the content of a specific package.

./torque-package-devel-linux-x86_64.sh --listfiles
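
Before publishing the packages, a quick check that the build produced all of them can save a failed installation later. This is a small sketch that looks for the five packages named in this article (clients, devel, doc, mom and server). The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Print the kind of every expected Torque package that is missing from the
# given build directory. Prints nothing when all packages are present.
missing_packages() {
  local build_dir="$1" kind
  for kind in clients devel doc mom server; do
    [ -f "$build_dir/torque-package-$kind-linux-x86_64.sh" ] || echo "$kind"
  done
}

# Example: missing_packages ~/Applications/torque-6.1.0
```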

Make the Torque packages available to other hosts in your network. We are going to set up the Apache HTTP server and publish the packages for download.

pushd .

cd ~/Applications/torque-6.1.0
mkdir -p /var/www/deb/amd64
cp -rf torque-package-* /var/www/deb/amd64

cd contrib/init.d
cp -rf debian.pbs_mom /var/www/deb/amd64/pbs_mom
cp -rf debian.pbs_server /var/www/deb/amd64/pbs_server
cp -rf debian.trqauthd /var/www/deb/amd64/trqauthd

apt-get install --assume-yes apache2
ln -sf /var/www/deb /var/www/html/deb

popd

The Torque packages can then be downloaded from http://{hostname}/deb/amd64, assuming no firewall blocks access to the server host and the Apache daemon listens on the default HTTP port.
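
On another host, fetching the packages could look roughly like the sketch below. It assumes the directory layout created above and the package naming used in this article. Both function names are hypothetical, and the address in the usage comment is only an example.

```shell
#!/usr/bin/env bash
# Build the download URL of a Torque package from the repository base URL
# and the package kind (for example devel, clients, mom or server).
package_url() {
  local base_url="${1%/}" kind="$2"   # drop a trailing slash, if any
  printf '%s/torque-package-%s-linux-x86_64.sh\n' "$base_url" "$kind"
}

# Download the listed packages into the current directory and make them
# executable. Stops at the first failed download.
fetch_packages() {
  local base_url="$1" kind file
  shift
  for kind in "$@"; do
    file="torque-package-$kind-linux-x86_64.sh"
    wget --quiet --output-document="$file" "$(package_url "$base_url" "$kind")" || return 1
    chmod +x "$file"
  done
}

# Example: fetch_packages http://192.168.90.2/deb/amd64 devel clients mom
```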

Server Setup

Installing and configuring the Torque server is a multistep process similar to setting up a database server. An administrator installs packages, configures Linux daemons, sets up the Torque batch server and finally grants permissions to users. To make the process more user-friendly, I developed a Bash script that performs all steps automatically.

Download the script, modify the source if necessary and execute on the server machine.

./torque-server-install-configure.sh

The remaining part of this section explains configuration actions executed by the setup script.

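If the Torque server daemons are already running, kill their processes.

# pgrep prints the PIDs of matching processes, so a non-empty
# result means the daemon is running and has to be stopped.
if [ -n "$(pgrep pbs_server)" ]; then
  killall pbs_server
fi

if [ -n "$(pgrep pbs_sched)" ]; then
  killall pbs_sched
fi

if [ -n "$(pgrep trqauthd)" ]; then
  killall trqauthd
fi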

Install the following Torque packages: torque-package-devel-linux-x86_64.sh, torque-package-clients-linux-x86_64.sh and torque-package-server-linux-x86_64.sh. Optionally, install the torque-package-doc-linux-x86_64.sh package, which adds the Linux man pages for the Torque command line utilities.

for package_name in devel clients server
do
/var/www/deb/amd64/torque-package-$package_name-linux-x86_64.sh --install
done

Refresh dynamic linker runtime bindings to include libraries installed in the previous step.

ldconfig

Set up the trqauthd, pbs_server and pbs_sched daemons, which authenticate the Torque users, accept jobs submitted to the batch server and schedule them across the computing nodes.

for service_name in trqauthd pbs_server pbs_sched
do
update-rc.d -f $service_name remove
cp -rf ~/Applications/torque-6.1.0/contrib/init.d/debian.$service_name /etc/init.d/$service_name
update-rc.d $service_name defaults
update-rc.d $service_name enable
done

Start the trqauthd daemon.

service trqauthd restart

Grant the current user manager permissions to the batch server. This set of permissions is required to set up initial configuration of the batch server.

pushd .
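# Fully qualified domain name of the batch server
server_fqdn=$(hostname -f)
cd /var/spool/torque
echo "$server_fqdn" > server_name

cd ./server_priv/acl_svr
echo "$server_fqdn" > acl_hosts
# Write the same entry to both the operators and the managers file
echo "$USER@$server_fqdn" | tee operators managers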

popd

Set up the initial configuration of the batch server. The commands below create a default queue named batch on the Torque server. The user vagrant, logged in from any host in the cluster.org domain, is given manager and operator permissions on the batch server.

pbs_server -ft create
# Starting in TORQUE 3.1 the server is multi-threaded.
# We need to pause a second to allow the server to finish coming
# up. If we go to qmgr right away it will fail.
sleep 2
pbs_server --about
pbs_server_pid=`pgrep pbs_server`
if [ -z "$pbs_server_pid" ] ; then
  echo "ERROR: pbs_server failed to start, check syslog and server logs for more information"
  exit 1;
fi

read -d '' server_config << EOF
set server managers += vagrant@*.cluster.org;
set server operators += vagrant@*.cluster.org;
set server scheduling = true;
set server keep_completed = 300;
set server mom_job_sync = true;
EOF

echo "$server_config" | qmgr -e
if [ "$?" -ne "0" ] ; then
  echo "ERROR: cannot configure server";
  qterm;
  exit 1;
fi

read -d '' default_queue << EOF 
create queue batch;
set queue batch queue_type = execution;
set queue batch started = true;
set queue batch enabled = true;
set queue batch resources_default.walltime = 1:00:00;
set queue batch resources_default.nodes = 1;
set server default_queue = batch;
EOF

echo "$default_queue" | qmgr -e
if [ "$?" -ne "0" ] ; then
  echo "ERROR: cannot configure default queue";
  qterm;
  exit 1;
fi
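
If you need more than one queue, the directives above can be generated instead of copied. The sketch below prints the same kind of qmgr directives for an arbitrary queue name and default walltime. The function name and the queue called short are made up for this example.

```shell
#!/usr/bin/env bash
# Print the qmgr directives that create an execution queue.
# $1 is the queue name, $2 the default walltime (HH:MM:SS).
queue_commands() {
  local queue="$1" walltime="$2"
  cat << EOF
create queue $queue;
set queue $queue queue_type = execution;
set queue $queue started = true;
set queue $queue enabled = true;
set queue $queue resources_default.walltime = $walltime;
EOF
}

# Example (requires a running pbs_server and manager permissions):
# queue_commands short 0:10:00 | qmgr -e
```

Printing the directives first also gives you a chance to review them before they reach the batch server.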

Start the pbs_sched daemon responsible for assigning submitted jobs to the computing nodes.

service pbs_sched start

Optionally, you can install the computing node daemon on the batch server host as well, using the torque-mom-install-configure.sh script.

./torque-mom-install-configure.sh

Mom Setup

Adding a new computing node to the Torque cluster requires administrative actions on both the batch server and the computing node. These actions have to be executed every time a new node is added. To make the process less error-prone, I recommend using a Bash script again.

Execute the torque-mom-install-configure.sh script on the computing node, passing the server hostname, the server IP address and the URL of the directory with the Torque packages.

./torque-mom-install-configure.sh server1.torque.org 192.168.90.2 http://192.168.90.2/deb/amd64

The script will execute the following actions:

  1. Install packages required for running Torque.
  2. Add the batch server address and hostname to the /etc/hosts file if necessary.
  3. Download and install Torque packages: torque-package-devel-linux-x86_64.sh, torque-package-clients-linux-x86_64.sh and torque-package-mom-linux-x86_64.sh.
  4. Set up the trqauthd and pbs_mom daemons.
  5. Configure pbs_mom to connect to the batch server.
  6. Start the trqauthd and pbs_mom daemons.
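
Steps 2 and 5 from the list above can be sketched as follows. This is not the content of torque-mom-install-configure.sh, only an illustration of the idea. The function names are hypothetical, and the default paths assume the /var/spool/torque server home used at compile time.

```shell
#!/usr/bin/env bash
# Append "IP HOSTNAME" to a hosts file unless the hostname is already listed.
# Commented-out lines are ignored, so running it twice adds a single entry.
ensure_hosts_entry() {
  local ip="$1" host="$2" hosts_file="${3:-/etc/hosts}"
  if ! awk -v h="$host" '
         !/^[ \t]*#/ { for (i = 2; i <= NF; i++) if ($i == h) found = 1 }
         END { exit !found }' "$hosts_file"; then
    printf '%s %s\n' "$ip" "$host" >> "$hosts_file"
  fi
}

# Write a minimal pbs_mom configuration that points at the batch server.
# The single quotes keep $pbsserver literal, it is a Torque directive.
write_mom_config() {
  local server="$1" config_file="${2:-/var/spool/torque/mom_priv/config}"
  printf '$pbsserver %s\n' "$server" > "$config_file"
}

# Example, run as root on the computing node:
# ensure_hosts_entry 192.168.90.2 server1.torque.org
# write_mom_config server1.torque.org
```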

Run the torque-register-mom.sh script on the batch server, passing the hostname and IP address of the computing node.

torque-register-mom.sh node1.torque.org 192.168.90.3
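
The content of torque-register-mom.sh is not reproduced in this article. As a rough sketch of what registering a node could involve, the function below adds the node to the nodes file of the batch server. The function name and the np value are assumptions, and the default path assumes the /var/spool/torque server home used at compile time.

```shell
#!/usr/bin/env bash
# Add "HOSTNAME np=N" to the server's nodes file unless the host is listed.
# np is the number of processors the node offers (assumed to be 1 if omitted).
register_node() {
  local host="$1" np="${2:-1}"
  local nodes_file="${3:-/var/spool/torque/server_priv/nodes}"
  touch "$nodes_file"
  if ! awk -v h="$host" '$1 == h { found = 1 } END { exit !found }' "$nodes_file"; then
    printf '%s np=%s\n' "$host" "$np" >> "$nodes_file"
  fi
}

# Example, run as root on the batch server:
# register_node node1.torque.org 2
```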

Restart the pbs_server to reload the configuration.

service pbs_server restart

Ensure the computing node is up and running by executing the following command on the batch server.

pbsnodes
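
With more than a handful of nodes, the pbsnodes output gets long. The sketch below condenses it to one line per node and lists the nodes that look unavailable. It assumes the usual layout of the output, a node name at the start of a line followed by indented "state = ..." attributes. Both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Read pbsnodes output on stdin and print "NODE STATE" for each node.
node_states() {
  awk '
    /^[^ \t]/         { node = $1 }         # unindented line: node name
    /^[ \t]+state = / { print node, $3 }    # indented "state = ..." line
  '
}

# Read pbsnodes output on stdin and print the nodes whose state
# contains down, offline or unknown.
unavailable_nodes() {
  node_states | awk '$2 ~ /down|offline|unknown/ { print $1 }'
}

# Example, run on the batch server:
# pbsnodes | unavailable_nodes
```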

Next Steps

I hope you reached the final section without troubleshooting.

Depending on the size of your computing cluster and user requirements you may consider adding extra capabilities to the Torque network. The following list contains a few example issues you may need to address. The list is by no means complete.

  • Consider upgrading to a commercial scheduler. Commercial schedulers are advertised to increase the job throughput of a supercomputing facility through better allocation of computing nodes. Commercial extensions may also come with extra command line utilities, for example to check how long a given job is expected to remain in the waiting queue.

  • Configure control groups and CPU dynamic scaling on the computing nodes.

  • Mount a network drive on the computing nodes.

If you found an error in the article or would like to propose an enhancement, please leave your comment below. Thank you for your time and suggestions!