Your training environment can be a local development machine, an on-premise GPU cluster, or cloud resources. To set up your training environment, follow the steps below.
Step 1 - Set the DET_MASTER Environment Variable#
Set the DET_MASTER environment variable, which is the network address of the HPE Machine Learning Development Environment master. You can override this value on the command line using the -m (--master) flag.
Step 2 - Install the HPE Machine Learning Development Environment CLI#
The HPE Machine Learning Development Environment CLI is a command-line tool that lets you launch new experiments and interact with an HPE Machine Learning Development Environment cluster. The CLI can be installed on any machine you want to use to access HPE Machine Learning Development Environment. To install the CLI, follow the installation instructions.
The --master flag determines the network address of the HPE Machine Learning Development Environment master that the CLI connects to. If this flag is not specified, the value of the DET_MASTER environment variable is used; if that environment variable is not set, the default address is localhost. The master address can be specified in three different formats:
example.org:port: if the port is omitted, it defaults to 8080.
http://example.org:port: if the port is omitted, it defaults to 80.
https://example.org:port: if the port is omitted, it defaults to 443.
# Connect to localhost, port 8080.
$ det experiment list

# Connect to example.org, port 8888.
$ det -m example.org:8888 e list

# Connect to example.org, port 80.
$ det -m http://example.org e list

# Connect to example.org, port 443.
$ det -m https://example.org e list

# Connect to example.org, port 8080.
$ det -m example.org e list

# Set default Determined master address to example.org, port 8888.
$ export DET_MASTER="example.org:8888"
Step 3 - Set up Internet Access#
The HPE Machine Learning Development Environment Docker images are hosted on Docker Hub. HPE Machine Learning Development Environment agents need access to Docker Hub for tasks such as building new images for user workloads.
If packages, data, or other resources needed by user workloads are hosted on the public Internet, HPE Machine Learning Development Environment agents need to be able to access them. Note that agents can be configured to use proxies when accessing network resources.
For best performance, it is recommended that the HPE Machine Learning Development Environment master and agents use the same physical network or VPC. When using VPCs on a public cloud provider, additional steps might need to be taken to ensure that instances in the VPC can access the Internet:
On GCP, the instances need to have an external IP address, or a GCP Cloud NAT should be configured for the VPC.
On AWS, the instances need to have a public IP address, and a VPC Internet Gateway should be configured for the VPC.
Step 4 - Set up Firewall Rules#
The firewall rules must satisfy the following network access requirements for the master and agents.
Inbound TCP to the master’s network port from the HPE Machine Learning Development Environment agent instances, as well as from all machines where developers want to use the HPE Machine Learning Development Environment CLI or WebUI. The default port is 8443 if TLS is enabled and 8080 otherwise.
Outbound TCP to all ports on the HPE Machine Learning Development Environment agents.
Inbound TCP from all ports on the master to all ports on the agent.
Outbound TCP from all ports on the agent to the master’s network port.
Outbound TCP to the services that host the Docker images, packages, data, and other resources that need to be accessed by user workloads. For example, if your data is stored on Amazon S3, ensure the firewall rules allow access to this data.
Inbound and outbound TCP on all ports to and from each HPE Machine Learning Development Environment agent. The details are as follows:
Inbound and outbound TCP ports 1734 and 1750 are used for synchronization between trial containers.
Inbound and outbound TCP port 12350 is used for internal SSH-based communication between trial containers.
For DeepSpeedTrial, port 29500 is used for rendezvous between trial containers.
For PyTorchTrial with the “torch” distributed training backend, port 29400 is used for rendezvous between trial containers.
For all other distributed training modes, inbound and outbound TCP port 12355 is used for GLOO rendezvous between trial containers.
Inbound and outbound ephemeral TCP ports in the range 1024-65536 are used for communication between trials via GLOO.
For every GPU on each agent machine, an inbound and outbound ephemeral TCP port in the range 1024-65536 is used for communication between trials via NCCL.
Two additional ephemeral TCP ports in the range 1024-65536 are used for additional intra-trial communication between trial containers.
Each TensorBoard uses a port in the range 2600–2899.
Each notebook uses a port in the range 2900–3199.
Each shell uses a port in the range 3200–3599.
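The fixed per-agent port requirements above can be gathered into a short checklist script when auditing firewall rules. This is only a sketch of our own; the port numbers come from the rules listed above:

```shell
#!/bin/sh
# Fixed TCP ports that must be open between agents, per the rules above:
# 1734/1750 (synchronization), 12350 (SSH-based communication),
# 12355 (GLOO rendezvous), 29400 (torch rendezvous), 29500 (DeepSpeed).
fixed_ports="1734 1750 12350 12355 29400 29500"
for p in $fixed_ports; do
  echo "allow inbound/outbound TCP port $p between agents"
done
# GLOO, NCCL, and intra-trial traffic also use ephemeral ports.
echo "allow inbound/outbound ephemeral TCP ports 1024-65536 between agents"
```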
Step 5 - Transfer the Context Directory#
Use the -c <directory> option to transfer files from a directory on your local machine, called the context directory, to the container. The contents of the context directory are placed in the container's working directory before the command or shell runs, so files in the context can be accessed using relative paths.
$ mkdir context
$ echo 'print("hello world")' > context/run.py
$ det cmd run -c context python run.py
The total size of the files in the context directory must be less than 95 MB. Larger files, such as datasets, must be mounted into the container, downloaded after the container starts, or included in a custom Docker image.
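A quick pre-submission check can catch an oversized context directory before the CLI rejects it. The following sketch assumes a POSIX shell and uses the same example context as above; the limit value mirrors the 95 MB figure in the text:

```shell
#!/bin/sh
# Build a small example context, then verify it is under the 95 MB
# limit before submitting with `det cmd run -c context ...`.
mkdir -p context
echo 'print("hello world")' > context/run.py

limit_kb=$((95 * 1024))
size_kb=$(du -sk context | cut -f1)
if [ "$size_kb" -ge "$limit_kb" ]; then
  echo "context too large: ${size_kb} KB (limit ${limit_kb} KB)" >&2
  exit 1
fi
echo "context size OK: ${size_kb} KB"
```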
Step 6 - Install the HPE Machine Learning Development Environment Cluster#
An HPE Machine Learning Development Environment cluster comprises a master and one or more agents. The cluster can be installed on Amazon Web Services (AWS), Google Cloud Platform (GCP), on-premise, or on a local development machine.
Step 7 - Configure the Cluster#
Common configuration reference: Common Configuration Options
Master configuration reference: Master Configuration Reference
Agent configuration reference: Agent Configuration Reference
The behavior of the master and agent can be controlled by setting configuration variables; this can be done using a configuration file, environment variables, or command-line options. Although values from different sources will be merged, we generally recommend sticking to a single source for each service to keep things simple.
The master and the agent both accept an optional
--config-file command-line option, which
specifies the path of the configuration file to use. Note that when running the master or agent
inside a container, you will need to make the configuration file accessible inside the container
(e.g., via a bind mount). For example, this command starts the agent using a configuration file:
docker run \
    -v `pwd`/agent-config.yaml:/etc/determined/agent-config.yaml \
    determinedai/determined-agent --config-file /etc/determined/agent-config.yaml
For example, the agent-config.yaml file might contain
master_host: 127.0.0.1
master_port: 8080
to configure the address of the HPE Machine Learning Development Environment master that the agent will attempt to connect to.
Each option in the master or agent configuration file can also be specified as an environment
variable or a command-line option. To configure the behavior of the master or agent using
environment variables, specify an environment variable starting with
DET_ followed by the name
of the configuration variable. Underscores (_) should be used to indicate nested options: for example, the logging.type master configuration option can be specified via an environment variable named DET_LOGGING_TYPE.
The equivalent of the agent configuration file shown above can be specified by setting two environment variables, DET_MASTER_HOST and DET_MASTER_PORT. When starting the agent as a container, the environment variables can be specified as part of the docker run command:
docker run \
    -e DET_MASTER_HOST=127.0.0.1 \
    -e DET_MASTER_PORT=8080 \
    determinedai/determined-agent
The equivalent behavior can be achieved using command-line options:
determined-agent run --master-host=127.0.0.1 --master-port=8080
The same behavior applies to master configuration settings as well. For example, configuring the host where the Postgres database is running can be done via a configuration file containing:
db:
  host: the-db-host
Equivalent behavior can be achieved by setting the DET_DB_HOST=the-db-host environment variable or by using the --db-host the-db-host command-line option.
In the rest of this document, we will refer to options using their names in the configuration file. Periods (.) will be used to indicate nested options; for example, the option above would be referred to as db.host.
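The mapping between dotted configuration-file names and environment variable names can be sketched as a small shell helper. The function name here is ours, not part of the product; the mapping rule itself (DET_ prefix, dots become underscores, uppercase) is the one described above:

```shell
#!/bin/sh
# Convert a dotted configuration option name (e.g. db.host) into the
# corresponding DET_-prefixed environment variable name (DET_DB_HOST):
# uppercase the name and replace dots with underscores.
to_env_var() {
  printf 'DET_%s\n' "$(printf '%s' "$1" | tr '.a-z' '_A-Z')"
}

to_env_var logging.type   # DET_LOGGING_TYPE
to_env_var db.host        # DET_DB_HOST
```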
Additional configuration settings for both commands and shells can be set using the --config and --config-file options. Typical settings include:
bind_mounts: Specifies directories to be bind-mounted into the container from the host machine. (Due to the structured values required for this setting, it needs to be specified in a config file.)
resources.slots: Specifies the number of slots the container will have access to. (Distributed commands and shells are not supported; all slots will be on one machine and attempting to use more slots than are available on one machine will prevent the container from being scheduled.)
environment.image: Specifies a custom Docker image to use for the container.
description: Specifies a description for the command or shell to distinguish it from others.
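For instance, a command configuration file combining the settings above might look like the following sketch; the image name and host paths are illustrative, not defaults:

```yaml
# config.yaml -- example settings for a command or shell, passed via
# `det cmd run --config-file config.yaml ...`
description: preprocess-dataset
resources:
  slots: 2
environment:
  image: mycompany/my-custom-image:latest   # hypothetical custom image
bind_mounts:
  - host_path: /data/datasets               # hypothetical host directory
    container_path: /mnt/datasets
    read_only: true
```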