How to View Epoch-Based Metrics

Sometimes, you want to analyze and visualize your model’s training progress and validation performance over multiple epochs.

In this article, we’ll show you how to view epoch-based metric data in the WebUI by reporting an epochs metric to the HPE Machine Learning Development Environment master via the Core API. To do this, we’ll define an epochs metric and use it as the X-Axis label in the WebUI.

Set Up Your Training Environment

To begin, you’ll need an HPE Machine Learning Development Environment cluster. If you are new to HPE Machine Learning Development Environment, you can install the library and start a cluster locally.

Ensure you have Docker running and then run the following command:

pip install determined

# If your machine has GPUs:
det deploy local cluster-up

# If your machine does not have GPUs:
det deploy local cluster-up --no-gpu

Note

When deploying locally, the system prompts you to set a strong password.

The command pip install determined installs the determined library, which includes the Determined command-line interface (CLI).

Step 1: Get the Tutorial Files & Run the Experiment

To follow along, you’ll need the tutorial files.

Create a new directory.
Access the tutorial files via the core_api_pytorch_mnist.tgz download link or directly from the GitHub repository.
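
If you downloaded the archive, the two steps above can also be done from Python. The following is a minimal sketch, not part of the tutorial files: the helper name extract_tutorial and the destination directory are assumptions for illustration, and the archive is assumed to already be on disk.

```python
import tarfile
from pathlib import Path


def extract_tutorial(archive_path, dest_dir):
    """Create dest_dir if needed and unpack a .tgz archive into it.

    Returns the sorted list of member names found in the archive.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        names = sorted(tar.getnames())
        # Use the safer "data" extraction filter where this Python supports it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(dest, **kwargs)
    return names
```

For example, extract_tutorial("core_api_pytorch_mnist.tgz", "tutorial") would unpack the archive into a new tutorial directory, assuming the file was saved under that name in the current directory.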

For this step, we’ll use our model_def.py script and its accompanying const.yaml experiment configuration file.

From the directory containing our files, we’ll begin by running the following command:

det e create const.yaml . -f

We don’t have any data to plot yet, but we’ll open the HPE Machine Learning Development Environment WebUI to see that our experiment is running.

Go to http://localhost:8080/.
Accept the default username of determined.
Click Sign In.

After signing in, create a strong password.

In the WebUI, we can select our experiment and visit the Logs tab.

Step 2: Report Epoch-Based Metrics

In this section, we’ll define our epochs metric.

To follow along, use the model_def_metrics.py script and its accompanying metrics.yaml experiment configuration file.

Our script, model_def_metrics.py, is a modification of the model_def.py script. It already reports training and validation metrics to the HPE Machine Learning Development Environment master and contains a steps_completed variable that is needed to plot metrics on a graph in the WebUI.

Note

For a full description of the Core API PyTorch MNIST Tutorial files, visit the Core API User Guide.

Step 2.1: Modify the Train and Validation Methods

Our script, model_def_metrics.py, uses core_context.train to report training and validation metrics.

However, we also want to report epoch-based metrics and to allow HPE Machine Learning Development Environment to keep track of the specific epoch for which training loss is being reported.

To do this, we’ll modify the train() method to include epochs as a metric. We will calculate fractional completed epochs based on batches_completed, since this training code reports more frequently than once per epoch:

# Fraction of the current epoch finished so far (between 0 and 1).
partial_epoch = batches_completed / len(train_loader)
core_context.train.report_training_metrics(
    # Total batches trained across all epochs.
    steps_completed=batches_completed + epoch_idx * len(train_loader),
    metrics={"train_loss": loss.item(), "epochs": epoch_idx + partial_epoch},
)
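
The arithmetic in the snippet above can be isolated as two small helpers. This is a minimal sketch, not code from the tutorial files: the function names fractional_epoch and global_step and the figure of 100 batches per epoch are assumptions chosen for illustration.

```python
def fractional_epoch(epoch_idx, batches_completed, batches_per_epoch):
    """Epochs completed so far, counting the partly finished current epoch."""
    if batches_per_epoch <= 0:
        raise ValueError("batches_per_epoch must be positive")
    return epoch_idx + batches_completed / batches_per_epoch


def global_step(epoch_idx, batches_completed, batches_per_epoch):
    """Total batches trained across all epochs, used as steps_completed."""
    return batches_completed + epoch_idx * batches_per_epoch


# Hypothetical run: halfway through the third epoch (epoch_idx == 2),
# with 100 batches per epoch.
print(fractional_epoch(2, 50, 100))  # 2.5
print(global_step(2, 50, 100))       # 250
```

Both values describe the same point in training: steps_completed counts batches, while epochs expresses the same progress in units of full passes over the training data.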

Similarly, we’ll include epochs as a metric in the reported validation metrics. This allows HPE Machine Learning Development Environment to track the specific epoch for which the validation loss is being reported:

epochs_completed = epoch_idx + 1
core_context.train.report_validation_metrics(
    steps_completed=steps_completed,
    metrics={"test_loss": test_loss, "epochs": epochs_completed},
)
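
To see how the two reports line up on a shared epochs axis, the loop can be dry-run without a cluster. The sketch below is an illustration under stated assumptions: RecordingTrainContext is a hypothetical stand-in that only mimics the two reporting methods named above, the loss values are placeholders, and the loop sizes (2 epochs, 4 batches per epoch, a report every 2 batches) are invented.

```python
class RecordingTrainContext:
    """Hypothetical stand-in for core_context.train; stores reports in lists."""

    def __init__(self):
        self.training = []
        self.validation = []

    def report_training_metrics(self, steps_completed, metrics):
        self.training.append((steps_completed, metrics))

    def report_validation_metrics(self, steps_completed, metrics):
        self.validation.append((steps_completed, metrics))


def run(train_ctx, num_epochs=2, batches_per_epoch=4, log_interval=2):
    steps_completed = 0
    for epoch_idx in range(num_epochs):
        for batch_idx in range(batches_per_epoch):
            batches_completed = batch_idx + 1
            steps_completed = batches_completed + epoch_idx * batches_per_epoch
            if batches_completed % log_interval == 0:
                # Fractional epoch: reported more often than once per epoch.
                partial_epoch = batches_completed / batches_per_epoch
                train_ctx.report_training_metrics(
                    steps_completed=steps_completed,
                    metrics={"train_loss": 0.0, "epochs": epoch_idx + partial_epoch},
                )
        # Validation runs once per epoch, so epochs is a whole number here.
        train_ctx.report_validation_metrics(
            steps_completed=steps_completed,
            metrics={"test_loss": 0.0, "epochs": epoch_idx + 1},
        )
    return train_ctx


ctx = run(RecordingTrainContext())
train_epochs = [m["epochs"] for _, m in ctx.training]
val_epochs = [m["epochs"] for _, m in ctx.validation]
print(train_epochs)  # [0.5, 1.0, 1.5, 2.0]
print(val_epochs)    # [1, 2]
```

Each validation point shares its epochs value with the last training report of the same epoch, which is why the two series align when Epochs is used as the X-Axis.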

Now that we’ve reported an epochs metric, Epochs will be an available option for the X-Axis when we view our metric data graph in the WebUI.

Step 2.2: Run the Experiment & View Epoch-Based Metrics

Our modified script is ready to report epoch-based metrics to the HPE Machine Learning Development Environment master.

To run our experiment, we’ll run the following command:

det e create metrics.yaml .

To view our epoch-based metrics, we’ll open the HPE Machine Learning Development Environment WebUI and select our experiment.

Our experiment opens in the Overview tab.

We’ll go to the Metrics tab, select the X-Axis menu and then choose Epochs.
If we scroll down, we’ll be able to see the epoch-based metrics graph.

Summary

In this article, you learned how to add a few lines of code to a script so that it reports an epochs metric alongside its training and validation metrics. You also learned how to view epoch-based metric data in the WebUI.

Next Steps

Now you can try editing your own script to report epoch-based metrics to the HPE Machine Learning Development Environment master.

For more tutorials, visit the Tutorials to learn the basics of working with HPE Machine Learning Development Environment and how to port your existing code to it.
