Slurm/PBS Requirements
This document describes the specific requirements for deploying Determined on Slurm or PBS workload managers.
For general environment requirements, please refer to HPC Environment Requirements.
Slurm Requirements
Determined should function with your existing Slurm configuration. To optimize how Determined interacts with Slurm, we recommend the following steps:
Enable Slurm for GPU Scheduling.
Configure Slurm with SelectType=select/cons_tres. This enables Slurm to track GPU allocation instead of tracking only CPUs. When enabled, Determined submits batch jobs by specifying --gpus={slots_per_trial}. If this is not available, you must set the tres_supported option in the slurm section to false.
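For illustration, a minimal sketch of the relevant settings (the SelectTypeParameters value is only an example, and the master.yaml fragment shows just the option discussed here):

# slurm.conf: enable GPU-aware scheduling
SelectType=select/cons_tres
SelectTypeParameters=CR_Core

# master.yaml: only needed when select/cons_tres is not available
resource_manager:
  type: slurm
  tres_supported: false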
Configure GPU Generic Resources (GRES).
Determined works best when allocating GPUs, and Slurm learns which GPUs are available through GRES. You can use the AutoDetect feature to configure GPU GRES automatically. Otherwise, you should manually configure GRES GPUs so that Slurm can schedule nodes with the GPUs you want.
For the automatic selection of nodes with GPUs, Slurm must be configured with GresTypes=gpu, and nodes with GPUs must have properly configured GRES indicating the presence of those GPUs. When enabled, Determined can ensure GPUs are available by specifying --gres=gpu:1. If Slurm GRES cannot be properly configured, set the gres_supported option in the slurm section to false; it is then the user's responsibility to ensure that GPUs will be available on the nodes selected for the job by other means, such as targeting a resource pool with only GPU nodes or specifying a Slurm constraint in the experiment configuration.
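As a sketch, GRES for a node with two NVIDIA GPUs might be configured as follows, assuming typical device paths (node name and paths are illustrative):

# slurm.conf (other node attributes omitted)
GresTypes=gpu
NodeName=node001 Gres=gpu:2

# gres.conf on node001: list the GPU devices explicitly
Name=gpu File=/dev/nvidia[0-1]
# or let the AutoDetect feature discover them via NVML:
# AutoDetect=nvml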
Ensure homogeneous Slurm partitions.
Determined maps Slurm partitions to Determined resource pools. It is recommended that the nodes within a partition be homogeneous for Determined to effectively schedule GPU jobs.
A Slurm partition with GPUs is identified as a CUDA/ROCm resource pool. The type is inherited from the resource_manager.slot_type configuration. It can also be specified per partition using resource_manager.partition_overrides.
A Slurm partition with no GPUs is identified as an AUX resource pool.
The Determined default resource pool is set to the Slurm default partition. Override this default using the default_compute_resource_pool or default_aux_resource_pool option in the slurm section.
If a Slurm partition is not homogeneous, you may create a resource pool that provides homogeneous resources out of that partition using a custom resource pool. Configure a resource pool with provider_type: hpc, specify the underlying Slurm partition name to receive the job, and include a task_container_defaults section with the necessary slurm options to select the desired homogeneous set of resources from that partition.
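A minimal sketch of such a pool in master.yaml, assuming a hypothetical mixed partition whose A100 nodes carry a Slurm feature named a100 (all names are illustrative; check the exact schema against the cluster configuration reference):

resource_pools:
  - pool_name: a100_pool            # hypothetical pool name
    provider_type: hpc              # custom HPC resource pool, as described above
    partition: mixed_partition      # underlying Slurm partition receiving the job
    task_container_defaults:
      slurm:
        sbatch_args:
          - --constraint=a100       # select only the homogeneous A100 nodes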
Ensure the MaxNodes value for each partition is not less than the number of GPUs in the partition.
Determined delegates node selection for a job to Slurm by specifying a node range (1-slots_per_trial). If slots_per_trial exceeds the MaxNodes value for the partition, the job will remain in state PENDING with reason code PartitionNodeLimit. Make sure that all partitions that have MaxNodes specified use a value not less than the number of GPUs in the partition.
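For example, a partition of four nodes with two GPUs each should set MaxNodes to at least 8 (the total GPU count), because a job with slots_per_trial=8 is submitted with a node range of 1-8 even though four nodes can satisfy it (names are illustrative):

# slurm.conf
PartitionName=gpu_part Nodes=node[001-004] MaxNodes=8 State=UP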
Enable multiple jobs per compute node.
Determined submits GPU or CPU resource requests to Slurm. When Slurm schedules jobs, however, it also considers the memory requirements of the job. To enable multiple jobs to be scheduled on a node concurrently, configuration is required in slurm.conf.
The default memory allocated for a job is UNLIMITED, which prevents multiple jobs from executing on the same node unless this value is reduced. The default memory allocation for a job is derived from one of the slurm.conf configuration variables DefMemPerNode, DefMemPerGPU, or DefMemPerCPU. To enable scheduling of individual GPUs/CPUs by default, configure DefMemPerNode (which provides a total amount of memory for each job) or DefMemPerGPU and DefMemPerCPU (which derive the memory allocation from the number of GPUs or CPUs associated with the job). Configure one or more of these values to reduce the default memory allocation and enable jobs to divide up the available memory on compute nodes.
An alternative to changing the default memory configuration via slurm.conf is to provide explicit options on each job via the Determined configuration (task_container_defaults, the resource pool configuration, or the experiment configuration slurm.sbatch_args).
For details about how those requests are derived, see HPC Launching Architecture.
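As an illustration, either of the following slurm.conf settings reduces the default per-job memory so that jobs can share a node (sizes are in MB; choose values appropriate for your hardware):

# slurm.conf: derive the default job memory from the allocated resources
DefMemPerGPU=32768       # 32 GB per allocated GPU
# DefMemPerCPU=4096      # or 4 GB per allocated CPU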
Enable resource separation using cgroups.
While Slurm always allocates distinct resources for each job, by default there is no enforced separation when the resources are co-located on the same compute node. Such enforcement can be enabled using cgroups. GPU allocation is communicated to the application via the environment variables CUDA_VISIBLE_DEVICES or ROCR_VISIBLE_DEVICES. Determined uses those specifications to utilize only the GPU resources scheduled by Slurm for the job, but CPU and memory use have no such enforcement. If desired, you can enable enforcement with the Slurm cgroups configuration. Enable cgroups support in slurm.conf, then enable enforcement of specific resource classes in cgroup.conf (ConstrainCores for CPU, ConstrainDevices for GPU, and ConstrainRAMSpace for memory).
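A minimal sketch of those settings (consult the Slurm cgroup documentation for your version):

# slurm.conf: enable cgroup support
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup

# cgroup.conf: enforce separation per resource class
ConstrainCores=yes       # CPU
ConstrainDevices=yes     # GPU
ConstrainRAMSpace=yes    # memory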
Tune the Slurm configuration for Determined job preemption.
Slurm preempts jobs using signals. When a Determined job receives SIGTERM, it begins a checkpoint and graceful shutdown. To prevent unnecessary loss of work, it is recommended to set GraceTime (in seconds) high enough to permit the job to complete an entire Determined scheduling_unit.
To enable GPU job preemption, use PreemptMode=CANCEL or PreemptMode=REQUEUE, because PreemptMode=SUSPEND does not release GPUs and therefore does not allow a higher-priority job to access the allocated GPU resources. Determined manages the requeue of a successfully preempted job, so even with PreemptMode=REQUEUE, the Slurm job will be canceled and resubmitted.
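An illustrative partition-priority setup along these lines, with a ten-minute grace period (partition and node names are examples):

# slurm.conf
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE
PartitionName=gpu_low Nodes=node[001-004] PriorityTier=1 GraceTime=600
PartitionName=gpu_high Nodes=node[001-004] PriorityTier=2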
PBS Requirements
Determined should function with your existing PBS configuration. To optimize how Determined interacts with PBS, we recommend the following steps:
Enable PBS to store job history.
Job completion detection requires that the job history feature be enabled. PBS administrators can employ the following command to set the value of job_history_enable:

sudo qmgr -c "set server job_history_enable = True"
Configure PBS to manage GPU resources.
To optimize GPU allocation, Determined automatically selects compute nodes with GPUs by default using the -l select={slots_per_trial}:ngpus=1 option. If PBS cannot identify GPUs in this way, set the pbs section gres_supported option to false when configuring Determined. In this case, users must ensure GPU availability on nodes by other means, such as targeting GPU-only resource pools, or specifying a PBS constraint in the experiment configuration.
PBS should be configured to set the environment variable CUDA_VISIBLE_DEVICES (or ROCR_VISIBLE_DEVICES for ROCm) using a PBS cgroup hook, as explained in the PBS Administrator's Guide. If PBS is not configured to set CUDA_VISIBLE_DEVICES, Determined will utilize only a single GPU on each node. To fully utilize multiple GPUs, you must either manually configure CUDA_VISIBLE_DEVICES or set the pbs.slots_per_node setting in your experiment configuration file to indicate the desired number of GPU slots for Determined.
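For example, an experiment configuration fragment indicating four GPU slots per node (the value is illustrative and should match the GPUs available per node):

pbs:
  slots_per_node: 4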
Ensure the ngpus resource is defined with the correct values.
To ensure the successful operation of Determined, define the ngpus resource value for each node on the cluster. Additionally, the resource should have the appropriate flags to enable proper processing by PBS when scheduling jobs. You can check if the ngpus resource is defined with the appropriate flags using the following command:

[~]$ qmgr -c "list resource ngpus"
Resource ngpus
    type = long
    flag = hn
The output should indicate that ngpus is defined as a resource of type long with flags h (host-level resource) and n (consumable resource). If your HPC cluster nodes do not produce the same output, use the following commands to create and set the ngpus resource for each node:

[~]$ qmgr -c "create resource ngpus type=long, flag=hn"    # create the ngpus resource
[~]$ qmgr -c "set node <nodename> ngpus=<number of GPUs>"  # set the value for ngpus
If you use virtual nodes (vnodes), make sure the ngpus value is set only on the vnodes, not the parent node. Below are the commands and sample output to verify this:

[~]$ sudo qmgr -c "list node node002[0] resources_available"
Node node002[0]
    resources_available.arch = linux
    resources_available.host = node002
    resources_available.hpmem = 0b
    resources_available.mem = 45943mb
    resources_available.ncpus = 18
    resources_available.ngpus = 2        # ngpus value is set on vnode node002[0]
    resources_available.vmem = 46933mb
    resources_available.vnode = node002[0]

[~]$ sudo qmgr -c "list node node002 resources_available"
Node node002
    resources_available.accel_type = tesla
    resources_available.arch = linux
    resources_available.host = node002
    resources_available.hpmem = 0b
    resources_available.mem = 0b
    resources_available.ncpus = 0        # ngpus value is not set on parent node node002
    resources_available.Qlist = gpuQ,gpu_hi_priQ
    resources_available.vmem = 0b
    resources_available.vnode = node002
If the ngpus value is set on the parent node, use the following command to unset it:

[~]$ sudo qmgr -c "unset node <node_name> resources_available.ngpus"
Next, make sure that ngpus is listed as a resource in the <sched_priv_directory>/sched_config file:

[~]$ sudo cat <sched_priv_directory>/sched_config | grep "resources:"
resources: "ngpus, ncpus, mem, arch, host, vnode, ..., foo"
Finally, restart the PBS server to apply the changes.
[~]$ sudo systemctl restart pbs
Configure PBS to report GPU Accelerator type.
It is recommended that PBS administrators set the value for resources_available.accel_type on each node that contains an accelerator. Otherwise, the Cluster tab on the Determined WebUI will show unconfigured for the Accelerator field in the Resource Pool information.
PBS administrators can use the following set of commands to set the value of resources_available.accel_type on a single node:

Check if the resources_available.accel_type value is set.

pbsnodes -v node001 | grep resources_available.accel_type
If required, set the desired value for resources_available.accel_type.

sudo qmgr -c "set node node001 resources_available.accel_type=tesla"

When there are multiple types of GPUs on the node, use a comma-separated value.

sudo qmgr -c "set node node001 resources_available.accel_type=tesla,kepler"

Verify that the resources_available.accel_type value is now set.

pbsnodes -v node001 | grep resources_available.accel_type
Repeat the above steps to set the resources_available.accel_type value for every node containing GPUs. Once the resources_available.accel_type value is set for all the necessary nodes, admins can verify the Accelerator field on the Cluster pane of the WebUI.
Ensure homogeneous PBS queues.
Determined maps PBS queues to Determined resource pools. It is recommended that the nodes within a queue be homogeneous for Determined to effectively schedule GPU jobs.
A PBS queue with GPUs is identified as a CUDA/ROCm resource pool. The type is inherited from the resource_manager.slot_type configuration. It can also be specified per partition using resource_manager.partition_overrides.
.A PBS queue with no GPUs is identified as an AUX resource pool.
The Determined default resource pool is set to the PBS default queue. Override this default using the default_compute_resource_pool or default_aux_resource_pool option in the pbs section.
If a PBS queue is not homogeneous, you may create a resource pool that provides homogeneous resources out of that queue using a custom resource pool. Configure a resource pool with provider_type: hpc, specify the underlying PBS queue name to receive the job, and include a task_container_defaults section with the necessary pbs options to select the desired homogeneous set of resources from that queue.
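Analogous to the Slurm sketch shown earlier, the task_container_defaults for such a pool might pass a qsub selection; the pbsbatch_args option name and the selection expression are assumptions to be checked against your configuration reference:

task_container_defaults:
  pbs:
    pbsbatch_args:
      - "-l select=1:accel_type=v100"   # illustrative selection of homogeneous nodes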
Tune the PBS configuration for Determined job preemption.
PBS supports a wide variety of criteria to trigger job preemption, and you may use any of them per your system and job requirements. Once a job is identified for preemption, PBS supports four different preemption methods, which are specified via the preempt_order scheduling parameter. The default preemption order value is 'SCR'. The preemption methods are specified by the following letters:

S - Suspend the job. This is not applicable for GPU jobs.
C - Checkpoint the job. This requires that a custom checkpoint script be added to PBS.
R - Requeue the job. Determined does not support the re-queueing of a task; Determined jobs specify the -r n option to PBS to prevent this case.
D - Delete the job. Determined jobs support this option without configuration.
Given those options, the simplest path to enable Determined job preemption is to include D in the preempt_order. You may include R in the preempt_order, but it is disabled for Determined jobs. You may include C in the preempt_order if you additionally configure a checkpoint script; refer to the PBS documentation for details. If you choose to implement a checkpoint script, you may initiate a Determined checkpoint by sending a SIGTERM signal to the Determined job. When a Determined job receives a SIGTERM, it begins a checkpoint and graceful shutdown. To prevent unnecessary loss of work, it is recommended that you wait for at least one Determined scheduling_unit for the job to complete after sending the SIGTERM. If after that period of time the job has not terminated, send a SIGKILL to forcibly release all resources.
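For instance, to allow deletion as a last resort, append D to the scheduler's preemption order in <sched_priv_directory>/sched_config (a sketch; confirm the syntax for your PBS version):

preempt_order: "SCRD"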
Apptainer/Singularity Requirements
Determined supports Apptainer (formerly known as Singularity) for container runtime in HPC environments. Ensure that Apptainer or Singularity is properly installed and configured on all compute nodes of your cluster.
Note
In addition to the core Apptainer/Singularity installation package, the apptainer-suid or singularity-suid component is also required for full Determined functionality.
Singularity has numerous options that may be customized in the singularity.conf file. Determined has been verified using the default values and therefore does not require any special configuration on the compute nodes of the cluster.
Podman Requirements
When Determined is configured to use Podman, the containers are launched in rootless mode. Your HPC cluster administrator should have completed most of the configuration for you, but there may be additional per-user configuration that is required. Before attempting to launch Determined jobs, verify that you can run simple Podman containers on a compute node. For example:
podman run hello-world
If you are unable to do that successfully, then one or more of the following configuration changes may be required in your $HOME/.config/containers/storage.conf file:
Podman does not support rootless container storage on distributed file systems (e.g. NFS, Lustre, GPFS). On a typical HPC cluster, user directories are on a distributed file system, and the default container storage location of $HOME/.local/share/containers/storage is therefore not supported. If this is the case on your HPC cluster, configure the graphroot option in your storage.conf to specify a local file system available on compute nodes. Alternatively, you can request that your system administrator configure the rootless_storage_path in /etc/containers/storage.conf on all compute nodes.
Podman utilizes the directory specified by the environment variable XDG_RUNTIME_DIR. Normally, this is provided by the login process. Slurm and PBS, however, do not provide this variable when launching jobs on compute nodes. When XDG_RUNTIME_DIR is not defined, Podman attempts to create the directory /run/user/$UID for this purpose. If /run/user is not writable by a non-root user, then Podman commands will fail with a permission error. To avoid this problem, configure the runroot option in your storage.conf to a writable local directory available on all compute nodes. Alternatively, you can request that your system administrator configure /run/user to be user-writable on all compute nodes.
Create or update $HOME/.config/containers/storage.conf as required to resolve the issues above. The example storage.conf file below uses the file system /tmp, but there may be a more appropriate file system on your HPC cluster that you should specify for this purpose.
[storage]
driver = "overlay"
graphroot = "/tmp/$USER/storage"
runroot = "/tmp/$USER/run"
Any changes to your storage.conf should be applied using the command:
podman system migrate
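Afterwards, you can confirm the locations Podman will use, for example with the following command (output fields vary by Podman version):

podman info --format '{{.Store.GraphRoot}} {{.Store.RunRoot}}'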
Enroot Requirements
Install and configure Enroot on all compute nodes of your cluster as per the Enroot Installation instructions for your platform. There may be additional per-user configuration that is required.
Enroot utilizes the directory ${ENROOT_RUNTIME_PATH} (with default value ${XDG_RUNTIME_DIR}/enroot) for temporary files. Normally XDG_RUNTIME_DIR is provided by the login process, but Slurm and PBS do not provide this variable when launching jobs on compute nodes. When neither ENROOT_RUNTIME_PATH nor XDG_RUNTIME_DIR is defined, Enroot attempts to create the directory /run/enroot for this purpose. This typically fails with a permission error for any non-root user. Select one of the following alternatives to ensure that XDG_RUNTIME_DIR or ENROOT_RUNTIME_PATH is defined and points to a user-writable directory when Slurm/PBS jobs are launched on the cluster.

- Have your HPC cluster administrator configure Slurm/PBS to provide XDG_RUNTIME_DIR, or change the default ENROOT_RUNTIME_PATH defined in /etc/enroot/enroot.conf on each node in your HPC cluster.
- If using Slurm, provide an ENROOT_RUNTIME_PATH definition in task_container_defaults.environment_variables in master.yaml:

  task_container_defaults:
    environment_variables:
      - ENROOT_RUNTIME_PATH=/tmp/$(whoami)
- If using Slurm, provide an ENROOT_RUNTIME_PATH definition in your experiment configuration, as sketched below.
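A minimal sketch of that experiment configuration setting, placed in the environment_variables section (the /tmp path mirrors the master.yaml example above and is illustrative):

environment:
  environment_variables:
    - ENROOT_RUNTIME_PATH=/tmp/$(whoami)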
Unlike Singularity or Podman, you must manually download the Docker image file to the local file system (enroot import), and then each user must create an Enroot container using that image (enroot create). When the HPC launcher generates the enroot command for a job, it automatically applies the same transformation to the name that Enroot does on import (/ and : characters are replaced with +) to enable Docker image references to match the associated Enroot container. The following shell commands will download and then create an Enroot container for the current user. If other users have read access to /shared/enroot/images, they need only perform the enroot create step to make the container available for their use.

image=determinedai/pytorch-ngc:0.38.0
cd /shared/enroot/images
enroot import docker://$image
enroot create /shared/enroot/images/${image//[\/:]/\+}.sqsh
The Enroot container storage directory for the user, ${ENROOT_CACHE_PATH} (which defaults to $HOME/.local/share/enroot), must be accessible on all compute nodes.
A convenience script, /usr/bin/manage-enroot-cache, is provided by the HPC launcher installation to simplify the management of Enroot images.