HPC Environment Requirements#

This document describes how to prepare your environment for installing Determined on an HPC cluster managed by the Slurm or PBS workload manager.

Tip

Store your installation commands and flags in a shell script for future use, particularly for upgrading.

Environment Requirements#

Hardware Requirements#

The recommended requirements for the admin node are:

  • 1 admin node for the master, the database, and the launcher with the following specs:
    • 16 cores

    • 32 GB of memory

    • 1 TB of disk space (the exact amount depends on the database; see the “Database Requirements” section below)

The minimal requirements are:

  • 1 admin node with 8 cores, 16 GB of memory, and 200 GB of disk space

Note

While the node can be virtual, a physical one is preferred.
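The thresholds above can be checked with a short script. This is a minimal sketch and not part of Determined: the names `classify_admin_node` and `probe_local_node` and the tier labels are illustrative, and the memory probe assumes a Linux-like system where `os.sysconf` exposes `SC_PHYS_PAGES`.

```python
import os
import shutil

# Thresholds from the hardware requirements above: (cores, memory GB, disk GB).
RECOMMENDED = (16, 32, 1000)
MINIMAL = (8, 16, 200)


def classify_admin_node(cores, memory_gb, disk_gb):
    """Return 'recommended', 'minimal', or 'insufficient' for the given specs."""
    specs = (cores, memory_gb, disk_gb)
    if all(have >= need for have, need in zip(specs, RECOMMENDED)):
        return "recommended"
    if all(have >= need for have, need in zip(specs, MINIMAL)):
        return "minimal"
    return "insufficient"


def probe_local_node(path="/"):
    """Read core count, physical memory, and total disk size of this machine."""
    cores = os.cpu_count() or 0
    # Assumption: Linux-like platform that supports these sysconf names.
    memory_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1e9
    disk_gb = shutil.disk_usage(path).total / 1e9
    return cores, memory_gb, disk_gb
```

Calling `classify_admin_node(*probe_local_node())` on the candidate admin node gives a rough answer. Reported memory and disk sizes are usually slightly below the nominal hardware figures, so treat borderline results with care.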

Network Requirements#

The admin node requires the following network configurations:

Admin Node#

Ports: 8080, 8443
Type: TCP
Description: Provide HTTP(S) access to the master node for web UI access and agent API access

Note

Ensure these ports are open in your firewall settings to allow proper communication with the admin node.
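To confirm that the two ports are reachable from another machine, a plain TCP connection test is usually enough. The sketch below uses only the Python standard library; the function names are illustrative, and any host name you pass in (for example `admin-node.example.com`) is a placeholder for your own admin node.

```python
import socket

# Ports listed above for HTTP(S) access to the master.
MASTER_PORTS = (8080, 8443)


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_master_ports(host, ports=MASTER_PORTS):
    """Map each port to whether it accepted a TCP connection."""
    return {port: port_is_open(host, port) for port in ports}
```

A connection only succeeds when a service is listening, so a `False` result before the master is running does not necessarily mean the firewall is blocking the port.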

Additional Requirements:

  • The admin node must be able to access the HPC shared area (the scratch file system).

  • Recommended: 10 Gbps Ethernet link between the admin node and the HPC worker nodes.

  • Minimal: 1 Gbps Ethernet link.

Important

The admin node must be connected to the Internet to download container images and Python packages. If Internet access is not possible, the local container registry and package repository must be populated manually with the required images and packages.

Storage Requirements#

Determined requires shared storage for experiment checkpoints, container images, datasets, and pre-trained models. All worker nodes connected to the cluster must be able to access it. The storage can be a network file system (such as VAST, CephFS, GlusterFS, or Lustre) or an object storage bucket (in the cloud, or on-premises if it exposes an S3 API).

Space requirements depend on the model complexity/size:

  • 10-30 TB of HDD space for small models (up to 1 GB in size)

  • 20-60 TB of SSD space for medium to large models (more than 1 GB in size)

Software Requirements#

The following software components are required:

Component | Version | Installation Node
Operating System | RHEL 8.5+ or 9.0+, SLES 15 SP3+, Ubuntu 22.04+ | Admin
Java | >= 1.8 | Admin
Python | >= 3.8 | Admin
Podman | >= 4.0.0 | Admin
PostgreSQL | 10 (RHEL 8), 13 (RHEL 9), 14 (Ubuntu 22.04) or newer | Admin
HPC client packages | Same as login nodes | Admin
Container runtime | Singularity >= 3.7 (or Apptainer >= 1.0), Podman >= 3.3.1, Enroot >= 3.4.0 | Workers
HPC scheduler | Slurm >= 20.02 (excluding 22.05.5 - 22.05.8), PBS >= 2021.1.2 | Workers
NVIDIA drivers | >= 450.80 | Workers
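The Slurm entry is the only one with an excluded range, which is easy to overlook when checking versions by eye. The following sketch shows one way to encode that rule. It assumes purely numeric dotted version strings such as `22.05.6` and treats the excluded range as inclusive at both ends; the function names are illustrative and not part of Determined or Slurm.

```python
def parse_version(text):
    """Turn a dotted version such as '22.05.6' into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


def slurm_version_supported(text):
    """Check a Slurm version: >= 20.02, excluding 22.05.5 through 22.05.8."""
    version = parse_version(text)
    if version < (20, 2):
        return False
    # Assumption: the excluded range is inclusive at both ends.
    return not ((22, 5, 5) <= version <= (22, 5, 8))
```

Tuple comparison handles the ordering, so `22.05.4` and `22.05.9` pass while `22.05.6` is rejected. Version strings with non-numeric suffixes would need extra parsing before this check.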

Database Requirements#

The solution requires PostgreSQL 10 or newer, which will be installed on the admin node. The required disk space for the database is estimated as follows:

  • 200 GB on small systems (fewer than 15 workers), or on large systems if the experiment logs are sent to Elasticsearch

  • 16 GB per worker on large systems that store experiment logs in the database
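The sizing rule above reduces to a small calculation. The sketch below is illustrative only: the function and constant names are not part of Determined, and it assumes that a system with 15 or more workers counts as large.

```python
SMALL_SYSTEM_WORKERS = 15  # fewer workers than this counts as a small system
BASE_DISK_GB = 200         # flat estimate for small systems or Elasticsearch logging
PER_WORKER_GB = 16         # per-worker estimate when logs are stored in the database


def estimate_db_disk_gb(workers, logs_in_elasticsearch=False):
    """Estimate the database disk space in GB from the rule above."""
    if workers < SMALL_SYSTEM_WORKERS or logs_in_elasticsearch:
        return BASE_DISK_GB
    return PER_WORKER_GB * workers
```

For example, 50 workers with logs stored in the database come to 16 × 50 = 800 GB, while the same cluster sending logs to Elasticsearch stays at 200 GB.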

Installation Prerequisites#

Before proceeding with the installation, ensure that:

  • The operating system is installed along with the HPC client packages (you can clone an existing login node if its OS is the same or similar)

  • The node has Internet connectivity

  • The node has the shared file system mounted on /scratch

  • Java is installed

  • Podman is installed
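Part of this checklist can be verified programmatically. The sketch below covers only the mount point and the presence of the `java` and `podman` executables on `PATH`; it does not check the operating system, the HPC client packages, or Internet connectivity. The function name and the result keys are illustrative.

```python
import os
import shutil


def check_prerequisites(mount_point="/scratch", tools=("java", "podman")):
    """Return a dict mapping each checked item to True or False."""
    results = {f"mounted:{mount_point}": os.path.ismount(mount_point)}
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        results[f"on-path:{tool}"] = shutil.which(tool) is not None
    return results
```

Running `check_prerequisites()` on the admin node returns one entry per item, and any `False` value points at a prerequisite to revisit.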

A dedicated OS user named determined should be created on the admin node. This user should:

Note

All subsequent installation steps assume the use of the determined user or root access.

For detailed installation steps, including OS-specific instructions and configuration, refer to the Install Determined on Slurm/PBS document.

Internal Task Gateway#

As of version 0.34.0, Determined supports the Internal Task Gateway feature for Kubernetes. This feature enables Determined tasks running on remote Kubernetes clusters to be exposed to the Determined master and proxies. If you’re using a hybrid setup with both Slurm/PBS and Kubernetes, this feature might be relevant for your configuration.

Important

Enabling this feature exposes Determined tasks to the outside world. Implement appropriate security measures to restrict access to exposed tasks and secure communication between the external cluster and the main cluster.