Create and Submit an Experiment#
Run training code by submitting your code to a cluster and running it as an experiment.
Configuring a Launcher#
To run an experiment, specify a launcher in the experiment configuration’s entrypoint field.
Launcher options include:
- A legacy bare Trial class
- An HPE Machine Learning Development Environment predefined launcher
- A custom launcher: a command with arguments, run in the container
For distributed training, separate the launcher that starts the distributed workers from the training script that each worker runs. The distributed training launcher must:
- Launch all workers, passing any required peer information (e.g., rank, chief IP address) to each worker.
- Monitor the workers: if a worker exits with a non-zero status, terminate the remaining workers and exit with a non-zero status.
- Exit with status zero after all workers exit with status zero.
These requirements prevent distributed training jobs from hanging after a single worker failure.
Setting Up the Launcher#
The entry point of a model is configured in the experiment configuration file, which specifies the model’s path and how it should be launched. Predefined launchers are provided in the determined.launch module, and custom scripts are also supported.
The launcher is configured in the experiment configuration entrypoint field. The entrypoint trial object or script launches the training code.
Example entry points:
entrypoint: python3 -m (LAUNCHER) (TRIAL_DEFINITION)
A custom script:
entrypoint: python3 script.py arg1 arg2
Preconfigured launcher entrypoint arguments can differ, but they share the same format:
python3 -m (LAUNCH_MODULE) (--trial TRIAL)|(SCRIPT...)
where (LAUNCH_MODULE) is an HPE Machine Learning Development Environment launcher and (--trial TRIAL)|(SCRIPT...) refers to the training script, which can be in a format recognized by the HPE Machine Learning Development Environment launcher or a custom script.
For a custom launcher, wrap each rank’s worker command in the python3 -m determined.launch.wrap_rank $RANK CMD [ARGS...] script to separate the final logs by rank in the WebUI.
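For example, a custom launcher entrypoint could run a shell script that starts each worker under wrap_rank. The script name launch_workers.sh, the train.py worker script, and the way $RANK is computed are illustrative assumptions:

```yaml
# Illustrative experiment configuration snippet: the entrypoint is a custom
# launcher script, which starts each worker as
#   python3 -m determined.launch.wrap_rank $RANK python3 train.py
# so that the WebUI separates final logs by rank.
entrypoint: sh launch_workers.sh
```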
Training Code Definition#
To launch a model or training script, either pass a trial class path to --trial
or run a custom
script that runs the training code. Only one of these can be used at the same time.
Trial Class#
Use the --trial TRIAL argument to specify a trial class:
--trial TRIAL
The launcher accepts the TRIAL argument in the following format:
filepath:ClassName
where filepath is the location of your training class file, and ClassName is the name of the Python training class.
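For example, if a file model_def.py defines a trial class named MnistTrial (names are illustrative), the trial is specified as:

```yaml
entrypoint: python3 -m determined.launch.torch_distributed --trial model_def:MnistTrial
```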
Custom Script#
Launch a custom script under a supported launcher with arguments as needed.
Example Python script command:
script.py [args...]
Predefined Launchers#
Horovod Launcher#
Format:
determined.launch.horovod [[HVD_OVERRIDES...] --] (--trial TRIAL)|(SCRIPT...)
The horovod launcher wraps horovodrun, automatically configuring workers for the trial. Pass arguments to horovodrun as HVD_OVERRIDES, ending with -- to separate the overrides from the normal arguments.
Example:
python3 -m determined.launch.horovod --fusion-threshold-mb 1 --cycle-time-ms 2 -- --trial model_def:MyTrial
PyTorch Distributed Launcher#
Format:
determined.launch.torch_distributed [[TORCH_OVERRIDES...] --] (--trial TRIAL)|(SCRIPT...)
This launcher wraps PyTorch’s native distributed training launcher, torch.distributed.run. Pass override arguments to torch.distributed.run, ending with -- before the trial specification.
Example:
python3 -m determined.launch.torch_distributed --rdzv_endpoint=$CUSTOM_RDZV_ADDR -- --trial model_def:MyTrial
DeepSpeed Launcher#
Format:
determined.launch.deepspeed [[DEEPSPEED_ARGS...] --] (--trial TRIAL)|(SCRIPT...)
The DeepSpeed launcher runs a training script under deepspeed, handling IP addresses, sshd containers, and shutdown.
Example:
python3 -m determined.launch.deepspeed --trial model_def:MyTrial
Use the -h option to get the latest usage:
python3 -m determined.launch.deepspeed -h
Legacy Launcher#
Format:
entrypoint: model_def:TrialClass
The entry point field supports legacy file and trial class definitions. When specifying a trial class, it must be a subclass of an HPE Machine Learning Development Environment trial class.
Each trial class supports one deep learning application framework. Training or validating models may require loading data from an external source, so the training code needs to define data loaders.
A TrialClass is located at the model_def filepath and launched automatically. This legacy configuration detects distributed training based on slot size and the number of machines, launching with Horovod for distributed training by default. In a distributed training context, the entry point is:
python3 -m determined.launch.torch_distributed --trial model_def:TrialClass
Nested Launchers#
The entry point supports nesting multiple launchers in a single script. This can be useful for tasks that need to be run before the training code starts, such as profiling tools (dlprof), custom memory management tools (numactl), or data preprocessing.
Example:
dlprof --mode=simple python3 -m determined.launch.torch_distributed --trial model_def:MnistTrial
Creating an Experiment#
The CLI is the recommended way to create an experiment, but you can also use the WebUI to create from an existing experiment or trial. To create an experiment:
$ det experiment create <configuration file> <context directory>
The Experiment Configuration file is a YAML file that controls your experiment.
The context directory contains relevant training code, which is uploaded to the master.
The total size of the files in the context cannot exceed 95 MB. Use data loaders to read data from an external source for larger datasets. See the Prepare Data section for data loading options.
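A minimal configuration sketch might look like the following; the exact required fields depend on your version, so treat the values as illustrative:

```yaml
name: mnist_example
entrypoint: python3 -m determined.launch.torch_distributed --trial model_def:MnistTrial
hyperparameters:
  learning_rate: 0.001
searcher:
  name: single
  metric: validation_loss
resources:
  slots_per_trial: 1
```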
Use a .detignore file at the top level to list file paths to be omitted from the model definition, using the same syntax as .gitignore. Byte-compiled Python files, including .pyc files and __pycache__ directories, are always ignored.
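For instance, a .detignore that excludes a local dataset directory and checkpoint files (the paths are illustrative) could contain:

```text
# .detignore uses the same syntax as .gitignore.
data/
*.ckpt
```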
Pre-Training Setup#
Trials are created to train the model. The Hyperparameter Tuning searcher specified in the experiment configuration file defines a set of hyperparameter configurations, each corresponding to a single trial.
After the context and experiment configuration reach the master, the experiment waits for the scheduler to assign slots. The master allocates necessary resources as defined in the cluster configuration.
When a trial is ready to run, the master communicates with the agent or distributed training agents, which create containers with the configured environment and training code. Default container images applicable to many deep learning tasks are provided, but you can specify a custom image. If the specified container images do not exist locally, the trial container fetches the images from the registry. See Organize Models in the Model Registry.
Note
To learn more about distributed training with Determined, visit the conceptual overview or the intro to implementing distributed training.
After starting the containers, each trial runs the startup-hook.sh script in the context directory. Pre-training activities may incur a delay before training begins, but typically take only a few seconds.
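As a sketch, a startup-hook.sh might install an extra dependency before training starts (the package name is a placeholder):

```shell
#!/bin/bash
# startup-hook.sh: runs in each trial container before training begins.
# Installing an extra package here is purely illustrative.
pip install --no-cache-dir <extra-package>
```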
Pause and Activate#
A trial can be paused and reactivated without losing training progress. Pausing a trial saves its progress by creating a checkpoint before exiting the cluster.
The scheduler can pause a trial to free its resources for another task. You can also manually pause an experiment, which pauses all trials in the experiment, freeing the slots used by the trial. When the trial resumes, either because more slots become available or because you activate an experiment, the saved checkpoint is loaded and training continues from the saved state.
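For example, using the CLI with a hypothetical experiment ID of 7:

```shell
# Pause the experiment (trials checkpoint and free their slots).
det experiment pause 7
# Resume training from the saved checkpoints.
det experiment activate 7
```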
See also: Manage the job queue.