TensorBoard is a popular tool for visualizing and inspecting deep learning models. In HPE Machine Learning Development Environment, you can use TensorBoard to examine individual experiments or compare multiple experiments.
Launch TensorBoard instances via the WebUI or the HPE Machine Learning Development Environment CLI. Before launching TensorBoard instances from the CLI, install the CLI on your development machine.
Single Experiment Analysis#
To analyze a single HPE Machine Learning Development Environment experiment using TensorBoard, use
det tensorboard start <experiment-id>:
$ det tensorboard start 7
Scheduling TensorBoard (rarely-cute-man) (id: aab49ba5-3357-4145-861c-7e6ff2d702c5)...
TensorBoard (rarely-cute-man) was assigned to an agent...
Scheduling tensorboard tensorboard (id: c68c9fc9-7eed-475b-a50f-fd78406d7c83)...
TensorBoard is running at: http://localhost:8080/proxy/c68c9fc9-7eed-475b-a50f-fd78406d7c83/
disconnecting websocket
The HPE Machine Learning Development Environment master schedules a TensorBoard instance within the cluster. Once the TensorBoard instance is running, the HPE Machine Learning Development Environment CLI opens the TensorBoard web interface in your local browser.
To view information about scheduled and running TensorBoard instances, use:
$ det tensorboard list
 Id                                   | Owner      | Description                   | State   | Experiment Id | Trial Ids | Exit Status
--------------------------------------+------------+-------------------------------+---------+---------------+-----------+-------------
 aab49ba5-3357-4145-861c-7e6ff2d702c5 | determined | TensorBoard (rarely-cute-man) | RUNNING | 7             | N/A       | N/A
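If you want to use the listing above in a script, the table can be split on its pipe characters. The following is a minimal sketch, not part of the CLI: the helper name `parse_det_table` is hypothetical, and it assumes that no cell value contains a `|` character.

```python
def parse_det_table(text):
    """Parse pipe-delimited table output into a list of row dicts.

    Assumption of this sketch: cell values never contain a '|' character.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = [cell.strip() for cell in lines[0].split("|")]
    rows = []
    for line in lines[1:]:
        # Skip the separator row, which consists only of '-' and '+'.
        if set(line.strip()) <= set("-+"):
            continue
        cells = [cell.strip() for cell in line.split("|")]
        rows.append(dict(zip(header, cells)))
    return rows


# Sample text copied from the listing shown above.
sample = """\
 Id                                   | Owner      | Description                   | State   | Experiment Id | Trial Ids | Exit Status
--------------------------------------+------------+-------------------------------+---------+---------------+-----------+-------------
 aab49ba5-3357-4145-861c-7e6ff2d702c5 | determined | TensorBoard (rarely-cute-man) | RUNNING | 7             | N/A       | N/A
"""

running = [row["Id"] for row in parse_det_table(sample) if row["State"] == "RUNNING"]
print(running)
```

The resulting IDs can then be passed to other `det tensorboard` subcommands.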
Multiple Experiment Analysis#
To analyze multiple experiments, use
det tensorboard start <experiment-id> <experiment-id> ....
Metrics might not be immediately available in TensorBoard when the browser window opens. It can take up to five minutes for TensorBoard to receive data and display visualizations.
Customizing TensorBoard Instances#
HPE Machine Learning Development Environment allows you to initialize TensorBoard with a job configuration (YAML) file. This can be useful for running TensorBoard with a specific container image or for enabling access to additional data through a bind-mount.
Example job configuration file:
environment:
  image: determinedai/environments:cuda-11.3-pytorch-1.12-tf-2.8-gpu-0.20.1
bind_mounts:
  - host_path: /my/agent/path
    container_path: /my/container/path
    read_only: true
tensorboard_args: ['--samples_per_plugin=images=100']
The tensorboard_args field allows you to provide optional TensorBoard arguments. In the example above, we set the maximum number of images to display to 100, overriding TensorBoard’s default value.
For detailed configuration settings, refer to the Job Configuration Reference.
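TensorBoard's `--samples_per_plugin` flag takes a comma-separated list of `plugin=count` pairs, so the argument string can be generated when several plugins need limits. The following is a minimal sketch: the helper name `samples_per_plugin_arg` is hypothetical, and the `scalars` limit is an illustrative value only.

```python
def samples_per_plugin_arg(limits):
    """Render a dict of per-plugin sample limits as one TensorBoard flag."""
    pairs = ",".join(f"{plugin}={count}" for plugin, count in limits.items())
    return f"--samples_per_plugin={pairs}"


# Reproduces the value used in the example configuration.
single = samples_per_plugin_arg({"images": 100})

# Illustrative only: limits for two plugins in one flag.
combined = samples_per_plugin_arg({"images": 100, "scalars": 500})

print([single])
print([combined])
```

Each resulting string is one element of the tensorboard_args list.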
To launch TensorBoard with a job configuration file, use
det tensorboard start
To view the configuration of a running TensorBoard instance, use
det tensorboard config
Analyzing Specific Trials#
HPE Machine Learning Development Environment also supports analyzing specific trials from one or more experiments. This can be useful for comparing a small number of trials from an experiment with many trials, or for comparing trials from different experiments.
To analyze specific trials, use
det tensorboard start --trial-ids <trial_id 1> <trial_id 2> ....
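When launching TensorBoard from a script, it helps to build the command as an argument list. The following is a minimal sketch that uses only the two invocation forms shown in this document. The helper name `tensorboard_start_argv` and the IDs are hypothetical, and the sketch assumes that a single call selects either experiments or trials.

```python
import shlex


def tensorboard_start_argv(experiment_ids=(), trial_ids=()):
    """Build the argument list for `det tensorboard start`.

    Assumption of this sketch: a call selects either experiments or
    trials, never both and never neither.
    """
    if bool(experiment_ids) == bool(trial_ids):
        raise ValueError("pass exactly one of experiment_ids or trial_ids")
    argv = ["det", "tensorboard", "start"]
    if trial_ids:
        argv.append("--trial-ids")
        argv.extend(str(t) for t in trial_ids)
    else:
        argv.extend(str(e) for e in experiment_ids)
    return argv


# Hypothetical IDs, for illustration only.
experiments_cmd = shlex.join(tensorboard_start_argv(experiment_ids=[7, 8]))
trials_cmd = shlex.join(tensorboard_start_argv(trial_ids=[101, 102]))
print(experiments_cmd)
print(trials_cmd)
```

On a machine with the CLI installed, the list returned by the helper can be passed to `subprocess.run(argv, check=True)`.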
Data in TensorBoard#
This section provides a brief overview of how HPE Machine Learning Development Environment captures data from TensorFlow models. For a more in-depth discussion of how TensorBoard visualizes data, consult the TensorBoard documentation.
TensorBoard visualizes data captured during model training and validation, which is stored in tfevent files. These files are generated by writing TensorFlow summary operations to disk using a tf.summary.FileWriter. Each deep learning framework has support for writing and uploading metrics as tfevent files.
FileWriters are configured to write log files, called tfevent files, to a directory known as the logdir. TensorBoard monitors this directory for changes and updates accordingly. The logdir supported by HPE Machine Learning Development Environment is /tmp/tensorboard. All tfevent files written to /tmp/tensorboard in a trial are uploaded to persistent storage when the trial is configured with HPE Machine Learning Development Environment TensorBoard support.
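To check which files in a logdir would count as tfevent files, you can search for the `tfevents` marker that TensorFlow places in event file names (for example, `events.out.tfevents.<timestamp>.<hostname>`). The following is a minimal sketch: the helper name `find_tfevent_files` and the `run-0` subdirectory are hypothetical, and a temporary directory stands in for /tmp/tensorboard so the example runs anywhere.

```python
import tempfile
from pathlib import Path


def find_tfevent_files(logdir):
    """Return tfevent files found anywhere under logdir, sorted by path."""
    return sorted(p for p in Path(logdir).rglob("*tfevents*") if p.is_file())


# Demo against a temporary directory standing in for /tmp/tensorboard.
with tempfile.TemporaryDirectory() as logdir:
    run_dir = Path(logdir) / "run-0"  # hypothetical subdirectory
    run_dir.mkdir()
    (run_dir / "events.out.tfevents.1700000000.worker-0").write_bytes(b"")
    (run_dir / "notes.txt").write_text("not an event file")
    found = [p.name for p in find_tfevent_files(logdir)]

print(found)
```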
HPE Machine Learning Development Environment Batch Metrics#
At the end of every training workload, batch metrics are collected and stored in the database, providing a granular view of model metrics over time. Batch metrics will appear in TensorBoard under the HPE Machine Learning Development Environment group. The x-axis of each plot corresponds to the batch number.
For example, a point at step 5 of the plot is the metric associated with the fifth batch seen.
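The mapping between batches and plot steps can be illustrated with a short sketch. The loss values below are hypothetical, and the code only mirrors the numbering described above; it does not call any TensorBoard API.

```python
# Hypothetical per-batch training losses, in the order the batches were seen.
batch_losses = [0.92, 0.81, 0.77, 0.70, 0.66]

# Step N on the x-axis corresponds to the Nth batch seen (counting from 1).
points = list(enumerate(batch_losses, start=1))

step, loss = points[4]
print(step, loss)
```

Here the fifth entry in `points` is the pair plotted at step 5.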
To configure TensorBoard for a specific framework, follow the examples below:
For models using TFKerasTrial, add a determined.keras.callbacks.TensorBoard callback to your trial class:
from determined.keras import TFKerasTrial
from determined.keras.callbacks import TensorBoard

class MyModel(TFKerasTrial):
    ...

    def keras_callbacks(self):
        return [TensorBoard()]
TensorBoard Lifecycle Management#
HPE Machine Learning Development Environment automatically terminates idle TensorBoard instances. A TensorBoard instance is considered idle if it does not receive HTTP traffic (a TensorBoard that is still being viewed by a web browser is not considered idle). By default, TensorBoards are terminated after being idle for 5 minutes; however, you can change the timeout duration in the master config file.
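The idle rule can be pictured as a timer that restarts on every HTTP request. The following is a minimal conceptual sketch, not the product's implementation: the class name `IdleTracker` and its methods are hypothetical, and timestamps are plain seconds.

```python
from dataclasses import dataclass

# Default idle timeout stated in the text: 5 minutes.
IDLE_TIMEOUT_SECONDS = 5 * 60


@dataclass
class IdleTracker:
    """Tracks the last HTTP request time for one TensorBoard instance."""

    last_request_at: float
    timeout: float = IDLE_TIMEOUT_SECONDS

    def record_request(self, now):
        # Any HTTP traffic restarts the idle timer.
        self.last_request_at = now

    def is_idle(self, now):
        # Idle once no request has arrived for at least `timeout` seconds.
        return now - self.last_request_at >= self.timeout


tracker = IdleTracker(last_request_at=0.0)
print(tracker.is_idle(now=299.0))
print(tracker.is_idle(now=300.0))

tracker.record_request(now=250.0)
print(tracker.is_idle(now=400.0))
```

In this model, a browser tab that keeps polling TensorBoard keeps calling `record_request`, which is why an instance that is still being viewed never becomes idle.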
You can also terminate TensorBoard instances manually by using
det tensorboard kill
$ det tensorboard kill aab49ba5-3357-4145-861c-7e6ff2d702c5
To open a web browser window connected to a previously launched TensorBoard instance, use
det tensorboard open. To view the logs of an existing TensorBoard instance, use
det tensorboard logs.
HPE Machine Learning Development Environment schedules TensorBoard instances in containers that run on agent machines. The HPE Machine Learning Development Environment master will proxy HTTP requests to and from the TensorBoard container. TensorBoard instances are hosted on agent machines but they do not occupy GPUs.
Logging Additional TensorBoard Events#
Any additional TFEvent files that are written to the appropriate path during training are accessible to TensorBoard. The appropriate path varies by worker rank and can be obtained by one of the following functions:
For CoreAPI users:
For PyTorchTrial users:
For DeepSpeedTrial users:
For TFKerasTrial users:
For more details and examples, refer to the TensorBoard How-To Guide.