Create Your First Experiment
Follow these steps to run your first experiment.
Prerequisites
You must have a running HPE Machine Learning Development Environment cluster with the CLI installed.
To set up a local cluster, visit Quick Installation.
To set up a remote cluster, visit the Installation Guide where you’ll find options for On Prem, AWS, GCP, Kubernetes, and Slurm.
Run an Experiment
Train a single model for a fixed number of batches, using constant values for all hyperparameters on a single slot. A slot is a CPU or GPU computing device on which the HPE Machine Learning Development Environment master schedules work.
Note
To run an experiment in a local training environment, your HPE Machine Learning Development Environment cluster requires only a single CPU or GPU. A cluster is made up of a master and one or more agents. A single machine can serve as both a master and an agent.
Create the Experiment
Download and extract the tar file: mnist_pytorch.tgz.
Open a terminal window and navigate to the directory where you extracted the tar file.
The const.yaml file is a YAML-formatted experiment configuration file that corresponds to an example experiment.
Create an experiment that specifies the const.yaml configuration file by typing the following CLI command:
    det experiment create const.yaml .
The final dot (.) argument uploads all of the files in the current directory as the context directory for your model. HPE Machine Learning Development Environment copies the model context directory contents to the trial container working directory.
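The steps above can be collected into a small shell function. This is a minimal sketch, not an official script: the function name create_first_experiment is hypothetical, and it assumes the archive unpacks into a directory named mnist_pytorch, which you should confirm after extracting.

```shell
# Extract the example archive and submit the experiment from inside it.
# Assumption: the archive unpacks into a directory named mnist_pytorch.
create_first_experiment() {
    archive="${1:-mnist_pytorch.tgz}"
    workdir="${2:-mnist_pytorch}"

    if [ ! -f "$archive" ]; then
        echo "archive not found: $archive" >&2
        return 1
    fi

    # Unpack into the current directory.
    tar -xzf "$archive" || return 1

    # Run in a subshell so the caller's working directory is unchanged.
    # The trailing dot uploads the current directory as the model context.
    ( cd "$workdir" && det experiment create const.yaml . )
}
```

Call it from the directory that contains the downloaded archive, for example `create_first_experiment mnist_pytorch.tgz`. Running the command from inside the extracted directory matters because the final dot uploads whatever directory you are in.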
View the Experiment
To view the experiment in your browser:
Enter the following URL: http://localhost:8080/. This is the cluster address for your local training environment.
Accept the default username of determined, and click Sign In. You’ll create a strong password in the next section.
Navigate to the home page and then visit your Uncategorized experiments.
Select the experiment to display the experiment’s details such as Metrics.
Create a Strong Password
Select your profile in the upper left corner and then choose Settings.
Edit the Password by typing a strong password.
Select the checkmark to save your changes.
If you are changing your password, the system asks you to confirm your change. The system lets you know your changes have been saved.
Run a Remote Distributed Training Job
Note
To run a remote distributed training job, you’ll need an HPE Machine Learning Development Environment cluster with multiple GPUs. In distributed training, a cluster is made up of a master and one or more agents. The master provides centralized management of the agent resources. By default, the Slots Per Trial value is set to 1, which disables distributed training.
Download and extract the tar file: mnist_pytorch.tgz.
Open a terminal window and navigate to the directory where you extracted the tar file.
Using your code editor, examine the distributed.yaml file. Notice the resources.slots_per_trial field is set to a value of 8:
    resources:
      slots_per_trial: 8
This is the number of GPUs (slots) the trial uses, so it should match the GPU resources you have available. The slots_per_trial value must be divisible by the number of GPUs per machine.
If necessary, use your code editor to change the value to match your hardware configuration.
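Before editing the file, you can sanity-check a candidate value against your hardware. The sketch below encodes only the divisibility rule stated above; the function name slots_fit_machines is hypothetical, and the GPU count per machine is something you supply yourself.

```shell
# Return 0 when slots_per_trial is a whole multiple of the GPUs per machine.
# Both arguments are plain integers that you provide.
slots_fit_machines() {
    slots_per_trial="$1"
    gpus_per_machine="$2"

    # Guard against division by zero and nonsensical input.
    if [ "$gpus_per_machine" -le 0 ]; then
        echo "gpus_per_machine must be positive" >&2
        return 2
    fi

    [ $((slots_per_trial % gpus_per_machine)) -eq 0 ]
}
```

For example, `slots_fit_machines 8 4 && echo ok` succeeds because 8 slots spread evenly over machines with 4 GPUs each, while `slots_fit_machines 8 3` fails.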
Sign in to your remote instance of HPE Machine Learning Development Environment:
Enter the URL of your remote instance: http://<ipAddress>:8080/.
Sign in using your username and password.
To connect to the HPE Machine Learning Development Environment master running on your remote instance, set the remote IP address and port number in the DET_MASTER environment variable:
    export DET_MASTER=<ipAddress>:8080
To create and run the experiment, run the following command, replacing <username> with your username:
    det -u <username> experiment create distributed.yaml .
The system will ask for your password.
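The two CLI steps above, setting DET_MASTER and creating the experiment, can be wrapped in one function so the master address and username are supplied together. This is an illustrative sketch: submit_distributed is a hypothetical name, and port 8080 is taken from the steps above.

```shell
# Point the CLI at a remote master and submit the distributed experiment.
# Usage: submit_distributed <ipAddress> <username>
submit_distributed() {
    master_ip="$1"
    username="$2"

    if [ -z "$master_ip" ] || [ -z "$username" ]; then
        echo "usage: submit_distributed <ipAddress> <username>" >&2
        return 1
    fi

    # The CLI reads the master address from DET_MASTER.
    export DET_MASTER="${master_ip}:8080"

    # Run from the extracted example directory; the trailing dot uploads it
    # as the model context. The CLI prompts for the password.
    det -u "$username" experiment create distributed.yaml .
}
```

Because the function exports DET_MASTER, later det commands in the same shell session keep targeting the remote master.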
In your browser, navigate to the home page and then visit Your Recent Submissions.
Select the experiment to display the experiment’s details such as Metrics. Notice that the loss curve is similar to the locally run, single-GPU experiment, but the time to complete the trial is reduced by about half.
Learn More
Want to learn how to adapt your existing model code to HPE Machine Learning Development Environment?
The behavior of an experiment is configured via an experiment configuration file written in YAML. The configuration file is typically passed as a command-line argument when an experiment is created with the CLI.
Visit the Experiment Configuration Reference for a complete description of the experiment configuration file.
Visit the Core API User Guide for a walk-through of how to adapt your existing model code to HPE Machine Learning Development Environment using the PyTorch MNIST model.
Deep Dive Quick Start
To learn more about how to change your configuration settings to run a distributed training job on multiple GPUs, visit the Quickstart for Model Developers.
More Tutorials
For more quick-start guides including API guides, visit the Tutorials.