Using Determined requires a training environment. Your training environment can be a local development machine, an on-premise GPU cluster, or cloud resources.
This checklist helps you get started setting up a new training environment for your organization. After completing these steps, your users will be able to see and access your Determined cluster.
To complete the items in this checklist, ensure your system meets Advanced Installation Requirements.
About Offline Installations#
If your master and compute nodes are offline, you’ll need a local private registry that can serve the necessary images (PostgreSQL and task container images).
You can install the Determined CLI package on your client machines and then take them offline again.
In addition, a local PyPI mirror is highly recommended so that your task environments can install packages without reaching the internet. See also: Infrastructure Considerations.
Set Up PostgreSQL#
Determined uses a PostgreSQL database to store experiment and trial metadata. Choose the installation method that best fits your environment and requirements.
If you are using Kubernetes, you can skip this step. Installing Determined on Kubernetes uses the Determined Helm Chart, which includes deployment of a PostgreSQL database.
Once PostgreSQL is set up, you’ll install Determined. This includes deploying the Determined master, configuring checkpoint storage, setting up resource pools, and configuring the cluster.
Deploy Determined Master#
To install Determined, decide if you want to deploy the Determined master on premises or on cloud.
If the Determined agent is your compute resource, you’ll install the Determined agent along with the Determined master. The preferred method for installing the agent is to use Linux packages. The recommended alternative to Linux packages is Docker.
To install the Determined master and agent on premises, you’ll first need to meet the installation requirements:
Once you’ve met the installation requirements, install the Determined master and agent:
These instructions include editing the YAML configuration files for the master and each agent and for configuring and starting the cluster.
Once you’ve met the installation requirements, select one of the following options:
To install the Determined master on premises with Kubernetes, follow the steps below:
To install the Determined master and agent on cloud, select one of the following options:
When using AWS or GCP, the det CLI manages the installation of the Determined agent.
To install the Determined master on cloud using Kubernetes, start here:
After completing the step above, select one of the following options:
Configure Checkpoint Storage#
A checkpoint contains the architecture and weights of the model being trained. If checkpoint_storage is not specified, the experiment will default to the checkpoint storage configured in the master configuration.
To learn more about configuring checkpoint storage, visit Checkpoint Storage.
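As a rough illustration of the cluster-wide default described above, the following sketch writes a master configuration fragment that uses shared file system storage. The scratch directory det-sketch, the file name, and the host path are hypothetical; consult Checkpoint Storage for the options that apply to your storage backend.

```shell
# Sketch only: write an example master-configuration fragment to a scratch directory.
mkdir -p det-sketch
cat > det-sketch/master-checkpoint-fragment.yaml <<'EOF'
# Cluster-wide default, used by experiments that do not set checkpoint_storage.
checkpoint_storage:
  type: shared_fs
  # Hypothetical path; it must be reachable at the same location on every agent.
  host_path: /mnt/determined-checkpoints
EOF
cat det-sketch/master-checkpoint-fragment.yaml
```

An experiment that defines its own checkpoint_storage section overrides this default for that experiment only.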
Configure Resource Pools#
When deploying the Determined master and compute resources (such as a Determined agent), you must also configure resource pools.
How Resource Pools Work
The Determined master and each compute resource, such as a Determined agent, have their own configuration files. Among other things, these files define the resource pools and specify how resources communicate and are allocated.
For instance, a Determined agent, which is a kind of compute resource, is part of a resource pool. Its configuration file not only helps it communicate with the Determined master but also dictates which resource pool it should connect to. By default, an agent will attempt to connect to the “default” pool. However, if the “default” pool doesn’t exist, the agent will remain unconnected.
Setting Up an On-Prem Determined Agent
For an on-prem Determined agent installation, the process involves the following steps:
Configure resource pools. These resource pools enable the segregation of tasks based on their resource requirements.
Configure the agents to establish a connection to the Determined master. Then link the agents with their respective resource pools. For reference, visit resource_pool under Agent Configuration Reference.
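The two steps above can be sketched as a pair of configuration fragments: one for the master that declares the pools, and one for an agent that names the master and the pool it joins. The pool name gpu-pool, the master address, and the file names are hypothetical, and the final check is only a convenience for catching a pool name that the master does not define.

```shell
# Sketch only: example master and agent fragments in a scratch directory.
mkdir -p det-sketch

# Master side: declare the pools that agents may join.
cat > det-sketch/master-pools-fragment.yaml <<'EOF'
resource_pools:
  - pool_name: default
  - pool_name: gpu-pool
EOF

# Agent side: where the master is, and which pool this agent joins.
# The address and pool name below are hypothetical.
cat > det-sketch/agent.yaml <<'EOF'
master_host: 192.0.2.10
master_port: 8080
resource_pool: gpu-pool
EOF

# Check that the pool named by the agent is declared in the master fragment.
pool="$(sed -n 's/^resource_pool:[[:space:]]*//p' det-sketch/agent.yaml)"
if grep -q "pool_name: ${pool}\$" det-sketch/master-pools-fragment.yaml; then
  echo "agent pool '${pool}' is declared on the master"
else
  echo "agent pool '${pool}' is missing from the master configuration" >&2
fi
```

If resource_pool is omitted from the agent configuration, the agent attempts to join the “default” pool, as described above.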
Configure the Cluster#
Once you have set up the necessary components for your environment, configure the cluster. When configuring your cluster, you’ll need to keep the following resources handy:
After installing Determined, set up your security features.
Security features, with the exception of TLS, are only available on Determined Enterprise Edition (Determined EE).
The use of Transport Layer Security (TLS) is highly recommended.
According to Wikipedia, mutual authentication, or two-way authentication, refers to two parties authenticating each other at the same time in an authentication protocol. To require that agent connections be verified using mutual TLS, use require_authentication (for more information, visit Master Configuration Reference).
In an agent-based installation, Determined is the resource manager. To set up TLS for Agents, visit Transport Layer Security (TLS) Agents Configuration.
User Authentication (SSO)#
Determined offers several options for user authentication:
Enable, list, and remove OAuth clients.
Integrate OpenID Connect, with an Okta example.
Integrate Security Assertion Markup Language (SAML) authentication to use single sign-on (SSO) with your organization’s identity provider (IdP).
Integrate System for Cross-domain Identity Management (SCIM) for administrators to easily and securely provision users and groups.
For Kubernetes deployments, you modify the master-related configurations through the Helm chart.
You can enhance security and limit potential malicious activity by running containers as non-root users. Determined allows you to run tasks as specific agent users and run unprivileged tasks by default.
Red Hat® OpenShift® users should not follow these instructions for configuring non-root containers, as OpenShift’s configuration conflicts with the approach described here.
To run containers as non-root users, you’ll first need to set up your non-root user:
Choose a Determined user for configuration, preferably one who has not undergone the det user link-with-agent-user process and one you plan to eventually link with an agent user. If no suitable Determined user exists, consider creating a test user for this purpose, one which can be disabled afterwards.
Link this user to the actual username/UID and groupname/GID. One way to do this is to use the following command (you can also use the WebUI):
det user link-with-agent-user \
    --agent-user $THE_USER \
    --agent-uid $THE_UID \
    --agent-group $THE_GROUP \
    --agent-gid $THE_GID \
    $THE_DETERMINED_USER
Start a shell as the specified user:
det -u $THE_DETERMINED_USER shell start
In the shell, verify that the username/UID and groupname/GID match the values you linked.
After confirming the non-root containers are operational, you’ll need to perform a test run of each training job you normally run as the modified Determined user. This ensures the training jobs run successfully without root privileges.
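One way to carry out the identity check in the shell step above is with the standard id utility. This is a sketch that assumes a POSIX userland inside the task container; THE_UID and THE_GID are the same placeholders used in the link-with-agent-user command and are compared only if they are set.

```shell
# Run inside the shell started with: det -u $THE_DETERMINED_USER shell start
uid="$(id -u)"
gid="$(id -g)"
echo "running as uid=${uid} gid=${gid}"

# A non-root container should not report UID 0.
if [ "$uid" -eq 0 ]; then
  echo "warning: this shell is running as root" >&2
fi

# Compare against the linked values when the placeholders are defined.
if [ -n "${THE_UID:-}" ] && [ "$uid" != "$THE_UID" ]; then
  echo "UID mismatch: expected ${THE_UID}, got ${uid}" >&2
fi
if [ -n "${THE_GID:-}" ] && [ "$gid" != "$THE_GID" ]; then
  echo "GID mismatch: expected ${THE_GID}, got ${gid}" >&2
fi
```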
For Kubernetes deployments, configure the security context for running containers as a non-root user.
Configure Role-Based Access Control (RBAC)#
Consider configuring role-based access control (RBAC) before creating workspaces and projects. To configure RBAC, visit RBAC.
RBAC is only available on Determined Enterprise Edition.
When setting up Determined, you can adjust certain configurations for enhanced security and performance. While these are particularly crucial for offline installations, they can also benefit online installations by ensuring faster package retrieval and increased security.
Configure Local Docker Image Repositories#
Configuring local Docker image repositories can enhance security and optimize performance. Learn how to configure local Docker image repositories in Customizing Your Environment.
Configure Local PyPI Mirrors#
Consider configuring local PyPI mirrors for:
Security: An airgapped cluster, isolated from the public internet, requires local mirrors to function. Local mirrors also guard against vulnerabilities associated with fetching packages from external sources.
Performance: Local mirrors can substantially reduce the time taken to fetch packages, eliminating potential lags due to network issues or external server overloads.
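As an illustration of pointing pip at a local mirror, the sketch below writes a pip configuration file and selects it through pip's PIP_CONFIG_FILE environment variable. The mirror URL and the det-sketch directory are hypothetical, and how the file reaches your task containers depends on how you customize your environment, which is not shown here.

```shell
# Sketch only: a pip configuration that resolves packages from a local mirror.
mkdir -p det-sketch
cat > det-sketch/pip.conf <<'EOF'
[global]
index-url = https://pypi.mirror.example.internal/simple
EOF

# Use this configuration for pip commands run from the current shell.
export PIP_CONFIG_FILE="$PWD/det-sketch/pip.conf"
echo "pip will read: ${PIP_CONFIG_FILE}"
```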
Create Workspaces and Projects#
Determined lets you organize and control access to your experiments by team or department. To do this, you can create Workspaces and Projects based on your RBAC groups. Once your workspaces are set up, you can bind resource pools to them.
Set Up Monitoring Tools#
To set up your monitoring tools, visit Prometheus & Grafana.
You may choose to configure InfiniBand when connecting multiple data streams in a single connection.
Set Up Clients#
Set up clients so that users can interact with the Determined master through the CLI and run tasks without going through the WebUI.
Test Your Setup#
Test your setup to ensure it is functioning correctly.
Test that you can run a single CPU/GPU training job.
Download the mnist_pytorch.tgz file to a local directory.
Open a terminal window, extract the files, and change into the mnist_pytorch directory:
tar xzvf mnist_pytorch.tgz
cd mnist_pytorch
In the mnist_pytorch directory, create an experiment specifying the const.yaml configuration file:
det experiment create const.yaml .
You should receive confirmation that the experiment is created:
Preparing files (.../mnist_pytorch) to send to master... 8.6KB and 7 files
Created experiment 1
Enter the cluster address in the browser address bar to view experiment progress in the WebUI.
You should be able to see your experiment ID and its status.
Test that you can run a remote distributed training job.
The distributed.yaml configuration file for this step is the same as the const.yaml file in the previous step, except that a resources.slots_per_trial field is defined and set to a value of 8:
resources:
  slots_per_trial: 8
This is the number of available GPU resources. The slots_per_trial value must be divisible by the number of GPUs per machine. You can change the value to match your hardware.
To connect to a Determined master running on a remote instance, set the remote IP address and port number in the
Create and run the experiment:
det experiment create distributed.yaml .
You can also use the -m option to specify a remote master IP address:
det -m http://<ipAddress>:8080 experiment create distributed.yaml .
To view the WebUI dashboard, enter the cluster address in your browser address bar, accept determined as the default username, and click Sign In. A password is not required.
Click the Experiment name to view the experiment’s trial display.
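For the remote-master step above, an alternative to passing -m on every command is to set the master address once for the shell session. The det CLI reads the address from the DET_MASTER environment variable; the IP address below is a placeholder, and the guard simply skips the CLI call on machines where det or distributed.yaml is not present.

```shell
# Placeholder address: replace with your master's IP address and port.
export DET_MASTER="http://192.0.2.10:8080"
echo "det will target: ${DET_MASTER}"

# Submit the experiment only if the CLI and the configuration file are available.
if command -v det >/dev/null 2>&1 && [ -f distributed.yaml ]; then
  det experiment create distributed.yaml .
else
  echo "skipping submission: det or distributed.yaml not found" >&2
fi
```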
Test that your users can access the cluster.
To view the WebUI dashboard, enter the cluster address in the browser address bar, accept the default username of determined, and click Sign In. A password is not required.
Congratulations! You have set up your Determined environment! Your users should be able to see and connect to the Determined master.