Create Your First Experiment#

Follow these steps to run your first experiment.

Prerequisites#

You must have a running HPE Machine Learning Development Environment cluster with the CLI installed.

  • To set up a local cluster, visit Quick Installation.

  • To set up a remote cluster, visit the Installation Guide where you’ll find options for On Prem, AWS, GCP, Kubernetes, and Slurm.

Concepts#

  • Single-Trial Run: A single-trial experiment (or run) allows you to establish a baseline performance for your model. Running a single trial is useful for understanding how your model performs with a fixed set of hyperparameters. It serves as a benchmark against which you can compare results from more complex searches.

  • Multi-Trial Search: A multi-trial experiment (or search) allows you to optimize your model by exploring different configurations of hyperparameters automatically. A search systematically tests various hyperparameter combinations to find the best-performing configuration. This is more efficient than manually tuning each parameter.

  • Remote Distributed Training: Remote distributed training jobs enable you to train your model across multiple GPUs or nodes in a cluster, significantly reducing the time required for training large models or datasets. This approach allows for efficient scaling and management of resources, particularly in more demanding machine learning tasks.
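To make the difference between a single-trial run and a multi-trial search concrete, here is a minimal Python sketch. It does not use the HPE Machine Learning Development Environment API. The function `toy_validation_loss`, the hyperparameter names, and the candidate values are all hypothetical stand-ins for a real training job, and the exhaustive grid search shown is only one simple search strategy.

```python
from itertools import product


def toy_validation_loss(learning_rate: float, dropout: float) -> float:
    """Hypothetical stand-in for training a model and returning its validation loss."""
    # A made-up bowl-shaped surface with its minimum at lr=0.1, dropout=0.25.
    return (learning_rate - 0.1) ** 2 + (dropout - 0.25) ** 2


def single_trial(hparams: dict) -> float:
    """Single-trial run: one fixed set of hyperparameters, one result."""
    return toy_validation_loss(**hparams)


def grid_search(space: dict) -> tuple:
    """Multi-trial search: evaluate every combination and keep the best one."""
    names = list(space)
    best_hparams, best_loss = None, float("inf")
    for values in product(*(space[name] for name in names)):
        hparams = dict(zip(names, values))
        loss = toy_validation_loss(**hparams)
        if loss < best_loss:
            best_hparams, best_loss = hparams, loss
    return best_hparams, best_loss


# Baseline from a single trial with constant (assumed) hyperparameters.
baseline_loss = single_trial({"learning_rate": 1.0, "dropout": 0.5})

# Search over a small, assumed grid and compare against the baseline.
best_hparams, best_loss = grid_search(
    {"learning_rate": [0.01, 0.1, 1.0], "dropout": [0.25, 0.5]}
)
```

Comparing `best_loss` with `baseline_loss` mirrors the workflow in this guide: the single trial gives a reference point, and the search is judged against it.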

Execute and Compare Experiments#

In this section, we’ll first execute a single-trial run before running a search. This will establish the baseline performance of our model and will give us a reference point to compare the results of our multi-trial search. Finally, we’ll run a remote distributed training job.

Follow these steps to train a single model for a fixed number of batches, using constant values for all hyperparameters on a single slot. A slot is a CPU or GPU computing device that the HPE Machine Learning Development Environment master schedules workloads onto.

Note

To execute an experiment in a local training environment, your HPE Machine Learning Development Environment cluster requires only a single CPU or GPU. A cluster is made up of a master and one or more agents. A single machine can serve as both a master and an agent.
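As a mental model only, the relationship between a master, its agents, and their slots can be sketched with a few Python dataclasses. This is a toy illustration, not how the product's scheduler is implemented; the class names, the `schedule` method, and the first-free-slot policy are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Slot:
    device: str          # "cpu" or "gpu"
    busy: bool = False


@dataclass
class Agent:
    name: str
    slots: List[Slot] = field(default_factory=list)


@dataclass
class Master:
    agents: List[Agent] = field(default_factory=list)

    def schedule(self) -> Optional[str]:
        """Claim the first free slot and return "agent:index", or None if all are busy."""
        for agent in self.agents:
            for index, slot in enumerate(agent.slots):
                if not slot.busy:
                    slot.busy = True
                    return f"{agent.name}:{index}"
        return None


# A single machine acting as both master and agent, with one CPU slot.
local_cluster = Master(agents=[Agent(name="localhost", slots=[Slot(device="cpu")])])
```

With this one-slot cluster, a first call to `local_cluster.schedule()` claims the only slot, and a second call finds nothing free, which is why a single-slot setup runs one single-slot trial at a time in this toy model.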

Create the Experiment

  1. Download and extract the tar file: mnist_pytorch.tgz.

  2. Open a terminal window and navigate to the directory where you extracted the tar file.

    The const.yaml file is a YAML-formatted configuration file that corresponds to an example experiment.

  3. Create an experiment that specifies the const.yaml configuration file by typing the following CLI command.

    det experiment create const.yaml .
    

    The final dot (.) argument uploads all of the files in the current directory as the context directory for your model. HPE Machine Learning Development Environment copies the model context directory contents to the trial container working directory.
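If you prefer to script the steps above, the following Python sketch is one way to do it. It uses only the standard library plus the `det` CLI command shown in step 3. The helper names (`extract_example`, `create_command`, `create_experiment`) are hypothetical, and the sketch assumes the `det` executable is on your `PATH`.

```python
import subprocess
import tarfile
from pathlib import Path
from typing import List


def extract_example(archive: Path, dest: Path) -> List[str]:
    """Extract the downloaded tar file into dest and return the member names."""
    # Only extract archives that come from a source you trust.
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest)
    return names


def create_command(config: str = "const.yaml", context: str = ".") -> List[str]:
    """Build the CLI invocation from step 3: det experiment create <config> <context>."""
    return ["det", "experiment", "create", config, context]


def create_experiment(model_dir: Path) -> None:
    """Run the command from inside the extracted directory so that "." is the context."""
    subprocess.run(create_command(), cwd=model_dir, check=True)
```

For example, you might call `extract_example(Path("mnist_pytorch.tgz"), Path("."))` and then `create_experiment(Path("mnist_pytorch"))`. The directory name `mnist_pytorch` is an assumption about how the archive unpacks; check the names returned by `extract_example` and adjust accordingly.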

View the Run

  1. To view the run in your browser:

    • Enter the following URL: http://localhost:8080/. This is the cluster address for your local training environment.

    • Accept the default username of determined, and click Sign In. You’ll create a strong password in the next section.

  2. Navigate to the home page and then visit your Uncategorized experiments.

    • HPE Machine Learning Development Environment displays all runs in a flat view for ease of comparison.

    [Figure: Determined AI WebUI Dashboard showing a user's recent experiment submissions]

  3. Selecting the experiment displays more details such as metrics and checkpoints. With this baseline, we can now execute a multi-trial experiment, or “search”.
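If the page does not load in your browser, it can help to confirm that something is answering at the cluster address. The sketch below is a generic HTTP reachability check using Python's standard library; it is not part of the product, the function name `webui_reachable` is hypothetical, and a `True` result only shows that an HTTP server responded at that URL.

```python
import urllib.request


def webui_reachable(url: str = "http://localhost:8080/", timeout: float = 3.0) -> bool:
    """Return True if an HTTP request to the cluster address gets a successful response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:
        # Covers connection refused, timeouts, and HTTP error statuses
        # (urllib.error.URLError and HTTPError are OSError subclasses).
        return False
```

Calling `webui_reachable()` with no arguments checks the local training environment address used in this guide. For a remote cluster, pass that cluster's address instead.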

Create a Strong Password

  1. Select your profile in the upper left corner and then choose Settings.

  2. Edit the Password by typing a strong password.

  3. Select the checkmark to save your changes.

If you are changing an existing password, the system asks you to confirm the change. A notification confirms that your changes have been saved.
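This guide does not define what counts as a strong password, so treat the following as one possible approach rather than a product requirement. The sketch uses Python's `secrets` module to generate a random password; the default length of 20 and the rule that every character class must appear are assumptions chosen for the example.

```python
import secrets
import string


def generate_password(length: int = 20) -> str:
    """Generate a random password with lowercase, uppercase, digit, and symbol characters."""
    if length < 4:
        raise ValueError("length must be at least 4 to include every character class")
    classes = [
        string.ascii_lowercase,
        string.ascii_uppercase,
        string.digits,
        string.punctuation,
    ]
    alphabet = "".join(classes)
    while True:
        candidate = "".join(secrets.choice(alphabet) for _ in range(length))
        # Retry until every character class is represented.
        if all(any(ch in cls for ch in candidate) for cls in classes):
            return candidate
```

You can paste the result of `generate_password()` into the Password field and store it in a password manager.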

Learn More#

Want to learn how to adapt your existing model code to HPE Machine Learning Development Environment?

The behavior of an experiment is configured through an experiment configuration file written in YAML. The configuration file is typically passed as a command-line argument when an experiment is created with the CLI.

  • Visit the Experiment Configuration Reference for a complete description of the experiment configuration file.

  • Visit the Core API User Guide for a walk-through of how to adapt your existing model code to HPE Machine Learning Development Environment using the PyTorch MNIST model.

Deep Dive Quick Start

To learn more about how to change your configuration settings to run a distributed training job on multiple GPUs, visit the Quickstart for Model Developers.

More Tutorials

For more quick-start guides including API guides, visit the Tutorials.