Using TensorBoard#

TensorBoard is a popular tool for visualizing and inspecting deep learning models. In HPE Machine Learning Development Environment, you can use TensorBoard to examine individual experiments or compare multiple experiments.

Getting Started#

Launch TensorBoard instances via the WebUI or the HPE Machine Learning Development Environment CLI. Before launching TensorBoard instances from the CLI, install the CLI on your development machine.

Analyzing Experiments#

Single Experiment Analysis#

To analyze a single HPE Machine Learning Development Environment experiment using TensorBoard, use det tensorboard start <experiment-id>:

$ det tensorboard start 7
Scheduling TensorBoard (rarely-cute-man) (id: aab49ba5-3357-4145-861c-7e6ff2d702c5)...
TensorBoard (rarely-cute-man) was assigned to an agent...
Scheduling tensorboard tensorboard (id: c68c9fc9-7eed-475b-a50f-fd78406d7c83)...
TensorBoard is running at: http://localhost:8080/proxy/c68c9fc9-7eed-475b-a50f-fd78406d7c83/
disconnecting websocket

The HPE Machine Learning Development Environment master schedules a TensorBoard instance within the cluster. Once the TensorBoard instance is running, the HPE Machine Learning Development Environment CLI opens the TensorBoard web interface in your local browser.

To view information about scheduled and running TensorBoard instances, use:

$ det tensorboard list
 Id                                   | Owner      | Description                         | State      | Experiment Id   | Trial Ids   | Exit Status
--------------------------------------+------------+-------------------------------------+------------+-----------------+-------------+--------------
 aab49ba5-3357-4145-861c-7e6ff2d702c5 | determined | TensorBoard (rarely-cute-man)       | RUNNING    | 7               | N/A         | N/A
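
If you want to post-process this listing in a script, the table can be split on its column separators. The following is a minimal sketch that assumes the output keeps the pipe-delimited layout shown above; the function name parse_tensorboard_list is hypothetical and is not part of the CLI.

```python
def parse_tensorboard_list(output: str) -> list[dict[str, str]]:
    """Parse the pipe-delimited table printed by `det tensorboard list`."""
    lines = [line for line in output.splitlines() if line.strip()]
    # Drop the separator row, which consists only of '-' and '+' characters.
    rows = [line for line in lines if set(line.strip()) - set("-+")]
    if not rows:
        return []
    header = [cell.strip() for cell in rows[0].split("|")]
    return [
        dict(zip(header, (cell.strip() for cell in row.split("|"))))
        for row in rows[1:]
    ]


# Sample text copied from the listing above.
sample = """\
 Id                                   | Owner      | Description                         | State      | Experiment Id   | Trial Ids   | Exit Status
--------------------------------------+------------+-------------------------------------+------------+-----------------+-------------+--------------
 aab49ba5-3357-4145-861c-7e6ff2d702c5 | determined | TensorBoard (rarely-cute-man)       | RUNNING    | 7               | N/A         | N/A
"""

instances = parse_tensorboard_list(sample)
# Collect the IDs of all instances that are currently running.
running_ids = [row["Id"] for row in instances if row["State"] == "RUNNING"]
```

Each row becomes a dictionary keyed by column header, so running_ids can be passed on to other det tensorboard subcommands that take a TensorBoard ID.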

Multiple Experiment Analysis#

To analyze multiple experiments, use det tensorboard start <experiment-id> <experiment-id> ....

Note

Metrics might not be available in TensorBoard as soon as the browser window opens. It can take up to five minutes for TensorBoard to receive data and display visualizations.

Customizing TensorBoard Instances#

HPE Machine Learning Development Environment allows you to initialize TensorBoard with a job configuration (YAML) file. This can be useful for running TensorBoard with a specific container image or for enabling access to additional data through a bind-mount.

Example job configuration file:

environment:
  image: determinedai/environments:cuda-11.3-pytorch-1.12-tf-2.8-gpu-0.20.1
bind_mounts:
  - host_path: /my/agent/path
    container_path: /my/container/path
    read_only: true
tensorboard_args: ['--samples_per_plugin=images=100']

The tensorboard_args field allows you to provide optional TensorBoard arguments. In the example above, we set the maximum number of images to display to 100, overriding TensorBoard’s default value.
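
If you generate job configuration files from a script, the flag string can be built from a mapping of plugin names to sample limits. This is a minimal sketch, not part of the product: the helper name samples_per_plugin_arg is hypothetical, and joining several plugins with commas is an assumption about TensorBoard's flag format that you should confirm against the TensorBoard documentation.

```python
def samples_per_plugin_arg(limits: dict[str, int]) -> str:
    """Format a --samples_per_plugin flag from plugin-name to sample-count pairs."""
    # Assumption: multiple plugins are passed as comma-separated name=count pairs.
    pairs = ",".join(f"{plugin}={count}" for plugin, count in limits.items())
    return f"--samples_per_plugin={pairs}"


# Reproduces the tensorboard_args value used in the example configuration.
tensorboard_args = [samples_per_plugin_arg({"images": 100})]
```

The resulting list has the same shape as the tensorboard_args field in the example job configuration file.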

For detailed configuration settings, refer to the Job Configuration Reference.

To launch TensorBoard with a job configuration file, use det tensorboard start <experiment-id> --config-file=my_config.yaml.

To view the configuration of a running TensorBoard instance, use det tensorboard config <tensorboard_id>.

Analyzing Specific Trials#

HPE Machine Learning Development Environment also supports analyzing specific trials from one or more experiments. This can be useful for comparing a small number of trials from an experiment with many trials, or for comparing trials from different experiments.

To analyze specific trials, use det tensorboard start --trial-ids <trial_id 1> <trial_id 2> ....
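
When launching TensorBoard from automation, it can help to assemble the command line in one place. The sketch below only uses the invocation forms described on this page (experiment IDs, --trial-ids, and --config-file). The helper name tensorboard_start_cmd is hypothetical, and the sketch deliberately builds one form per call, since combining experiment IDs and trial IDs in a single command is not covered here.

```python
import shlex
from typing import Optional, Sequence


def tensorboard_start_cmd(
    experiment_ids: Sequence[int] = (),
    trial_ids: Sequence[int] = (),
    config_file: Optional[str] = None,
) -> list[str]:
    """Build the argument list for `det tensorboard start`."""
    # This sketch accepts exactly one of the two forms per call.
    if bool(experiment_ids) == bool(trial_ids):
        raise ValueError("pass either experiment IDs or trial IDs")
    cmd = ["det", "tensorboard", "start"]
    cmd += [str(experiment_id) for experiment_id in experiment_ids]
    if trial_ids:
        cmd += ["--trial-ids", *(str(trial_id) for trial_id in trial_ids)]
    if config_file is not None:
        cmd.append(f"--config-file={config_file}")
    return cmd


# Render the commands as shell-safe strings for logging or review.
single = shlex.join(tensorboard_start_cmd(experiment_ids=[7]))
with_config = shlex.join(
    tensorboard_start_cmd(experiment_ids=[7, 8], config_file="my_config.yaml")
)
by_trial = shlex.join(tensorboard_start_cmd(trial_ids=[1, 2]))
```

The returned list can be passed directly to subprocess.run on a machine where the CLI is installed; returning a list rather than a string avoids shell quoting issues.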

Data in TensorBoard#

This section provides a brief overview of how HPE Machine Learning Development Environment captures data from TensorFlow models. For a more in-depth discussion of how TensorBoard visualizes data, consult the TensorBoard documentation.

TensorBoard visualizes data captured during model training and validation, which is stored in tfevent files. These files are generated by writing TensorFlow summary operations to disk using a tf.summary.FileWriter. Each deep learning framework has support for writing and uploading metrics as tfevent files.

FileWriters are configured to write log files, called tfevent files, to a directory known as the logdir. TensorBoard monitors this directory for changes and updates accordingly. The logdir supported by HPE Machine Learning Development Environment is /tmp/tensorboard. All tfevent files written to /tmp/tensorboard in a trial are uploaded to persistent storage when a trial is configured with HPE Machine Learning Development Environment TensorBoard support.
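
To check what a trial has written to the logdir before upload, you can list the event files it contains. This is a minimal sketch under one assumption: event files follow the usual naming convention and contain "tfevents" in the file name. The helper name find_tfevent_files is hypothetical, and the demonstration uses a temporary directory in place of /tmp/tensorboard.

```python
import tempfile
from pathlib import Path


def find_tfevent_files(logdir: str) -> list[Path]:
    """Return the event files found anywhere under logdir, sorted by path."""
    root = Path(logdir)
    if not root.is_dir():
        return []
    # Assumption: event file names contain "tfevents".
    return sorted(path for path in root.rglob("*tfevents*") if path.is_file())


# Demonstration with a temporary directory standing in for the logdir.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "events.out.tfevents.1700000000.worker-0").write_bytes(b"")
    (Path(tmp) / "notes.txt").write_text("not an event file")
    found = [path.name for path in find_tfevent_files(tmp)]
```

Inside a trial container you would call find_tfevent_files("/tmp/tensorboard") instead; a missing directory yields an empty list rather than an error.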

HPE Machine Learning Development Environment Batch Metrics#

At the end of every training workload, batch metrics are collected and stored in the database, providing a granular view of model metrics over time. Batch metrics will appear in TensorBoard under the HPE Machine Learning Development Environment group. The x-axis of each plot corresponds to the batch number.

For example, a point at step 5 of the plot is the metric associated with the fifth batch seen.

Framework-Specific Configuration#

To configure TensorBoard for a specific framework, follow the examples below:

TensorFlow Keras#

For models using TFKerasTrial, add a determined.keras.callbacks.TensorBoard callback to your trial class:

from determined.keras import TFKerasTrial
from determined.keras.callbacks import TensorBoard


class MyModel(TFKerasTrial):
    ...

    def keras_callbacks(self):
        return [TensorBoard()]

PyTorch#

For models using PyTorchTrial, see PyTorchTrialContext.get_tensorboard_writer().

TensorBoard Lifecycle Management#

HPE Machine Learning Development Environment automatically terminates idle TensorBoard instances. A TensorBoard instance is considered idle if it does not receive HTTP traffic (a TensorBoard instance that is still being viewed in a web browser is not considered idle). By default, an instance is terminated after it has been idle for 5 minutes; you can change the timeout duration by editing tensorboard_timeout in the master config file.
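
The idle rule can be pictured as a comparison between the time of the last HTTP request and the timeout. The sketch below only illustrates that rule; it is not the master's implementation, and the names is_idle and DEFAULT_IDLE_TIMEOUT are hypothetical.

```python
from datetime import datetime, timedelta

# Mirrors the 5-minute default described above (illustrative constant).
DEFAULT_IDLE_TIMEOUT = timedelta(minutes=5)


def is_idle(
    last_request_at: datetime,
    now: datetime,
    timeout: timedelta = DEFAULT_IDLE_TIMEOUT,
) -> bool:
    """Return True if no HTTP traffic has arrived within the timeout window."""
    return now - last_request_at >= timeout


now = datetime(2024, 1, 1, 12, 0, 0)
# An instance last viewed one minute ago is still active.
recently_viewed = is_idle(now - timedelta(minutes=1), now)
# An instance with no traffic for six minutes has passed the default timeout.
abandoned = is_idle(now - timedelta(minutes=6), now)
```

Because a browser tab that stays open keeps sending requests, its last-request time keeps advancing and the instance never crosses the threshold.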

You can also terminate TensorBoard instances manually by using det tensorboard kill <tensorboard-id>:

$ det tensorboard kill aab49ba5-3357-4145-861c-7e6ff2d702c5

To open a web browser window connected to a previously launched TensorBoard instance, use det tensorboard open. To view the logs of an existing TensorBoard instance, use det tensorboard logs.

Implementation Details#

HPE Machine Learning Development Environment schedules TensorBoard instances in containers that run on agent machines. The HPE Machine Learning Development Environment master proxies HTTP requests to and from the TensorBoard container. TensorBoard instances are hosted on agent machines, but they do not occupy GPUs.

Logging Additional TensorBoard Events#

Any additional TFEvent files that are written to the appropriate path during training are accessible to TensorBoard. The appropriate path varies by worker rank and can be obtained by one of the following functions:

For more details and examples, refer to the TensorBoard How-To Guide.