Using TensorBoard#
TensorBoard is a popular tool for visualizing and inspecting deep learning models. In HPE Machine Learning Development Environment, you can use TensorBoard to examine individual experiments or compare multiple experiments.
Getting Started#
Launch TensorBoard instances via the WebUI or the HPE Machine Learning Development Environment CLI. Before launching TensorBoard instances from the CLI, install the CLI on your development machine.
Analyzing Experiments#
Single Experiment Analysis#
To analyze a single HPE Machine Learning Development Environment experiment using TensorBoard, use det tensorboard start <experiment-id>:
$ det tensorboard start 7
Scheduling TensorBoard (rarely-cute-man) (id: aab49ba5-3357-4145-861c-7e6ff2d702c5)...
TensorBoard (rarely-cute-man) was assigned to an agent...
Scheduling tensorboard tensorboard (id: c68c9fc9-7eed-475b-a50f-fd78406d7c83)...
TensorBoard is running at: http://localhost:8080/proxy/c68c9fc9-7eed-475b-a50f-fd78406d7c83/
disconnecting websocket
The HPE Machine Learning Development Environment master schedules a TensorBoard instance within the cluster. Once the TensorBoard instance is running, the HPE Machine Learning Development Environment CLI opens the TensorBoard web interface in your local browser.
To view information about scheduled and running TensorBoard instances, use:
$ det tensorboard list
Id | Owner | Description | State | Experiment Id | Trial Ids | Exit Status
--------------------------------------+------------+-------------------------------------+------------+-----------------+-------------+--------------
aab49ba5-3357-4145-861c-7e6ff2d702c5 | determined | TensorBoard (rarely-cute-man) | RUNNING | 7 | N/A | N/A
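If you want to inspect this listing from a script, the table can be split on its pipe delimiters. The following is a minimal sketch, not part of the product: parse_det_table is a hypothetical helper, and it assumes the column layout shown above and that no cell contains a "|" character.

```python
from typing import Dict, List


def parse_det_table(text: str) -> List[Dict[str, str]]:
    """Parse a pipe-delimited table into a list of row dictionaries.

    Hypothetical helper: assumes the layout shown in the listing above.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = [cell.strip() for cell in lines[0].split("|")]
    rows = []
    for line in lines[1:]:
        # Skip the separator row, which contains only dashes and plus signs.
        if set(line.strip()) <= set("-+"):
            continue
        cells = [cell.strip() for cell in line.split("|")]
        rows.append(dict(zip(header, cells)))
    return rows


sample = """\
Id | Owner | Description | State | Experiment Id | Trial Ids | Exit Status
---+-------+-------------+-------+---------------+-----------+------------
aab49ba5-3357-4145-861c-7e6ff2d702c5 | determined | TensorBoard (rarely-cute-man) | RUNNING | 7 | N/A | N/A
"""

for row in parse_det_table(sample):
    print(row["Id"], row["State"], row["Experiment Id"])
```

Parsing human-readable output is fragile, so treat this only as an illustration of the column structure.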
Multiple Experiment Analysis#
To analyze multiple experiments, use det tensorboard start <experiment-id> <experiment-id> ....
Note
Metrics might not be immediately available in TensorBoard when the browser window opens. It can take up to five minutes for TensorBoard to receive data and display visualizations.
Customizing TensorBoard Instances#
HPE Machine Learning Development Environment allows you to initialize TensorBoard with a job configuration (YAML) file. This can be useful for running TensorBoard with a specific container image or for enabling access to additional data through a bind-mount.
Example job configuration file:
environment:
  image: determinedai/environments:cuda-11.3-pytorch-1.12-tf-2.8-gpu-0.20.1
bind_mounts:
  - host_path: /my/agent/path
    container_path: /my/container/path
    read_only: true
tensorboard_args: ['--samples_per_plugin=images=100']
The tensorboard_args field allows you to provide optional TensorBoard arguments. The example above sets the maximum number of images to display to 100, overriding TensorBoard's default value.
For detailed configuration settings, refer to the Job Configuration Reference.
To launch TensorBoard with a job configuration file, use det tensorboard start <experiment-id> --config-file=my_config.yaml.
To view the configuration of a running TensorBoard instance, use det tensorboard config <tensorboard_id>.
Analyzing Specific Trials#
HPE Machine Learning Development Environment also supports analyzing specific trials from one or more experiments. This can be useful for comparing a small number of trials from an experiment with many trials, or for comparing trials from different experiments.
To analyze specific trials, use det tensorboard start --trial-ids <trial_id 1> <trial_id 2> ....
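The command forms described so far (one or more experiment IDs, --trial-ids, and --config-file) can be assembled programmatically. The sketch below is illustrative only: tensorboard_start_command is a hypothetical helper, and whether experiment IDs and --trial-ids may be combined in a single invocation is an assumption that this page does not confirm.

```python
from typing import List, Optional, Sequence


def tensorboard_start_command(
    experiment_ids: Sequence[int] = (),
    trial_ids: Sequence[int] = (),
    config_file: Optional[str] = None,
) -> List[str]:
    """Build an argv list for `det tensorboard start` (hypothetical helper)."""
    if not experiment_ids and not trial_ids:
        raise ValueError("provide at least one experiment ID or trial ID")
    cmd = ["det", "tensorboard", "start"]
    # Experiment IDs are passed as positional arguments.
    cmd += [str(e) for e in experiment_ids]
    if trial_ids:
        # Assumption: --trial-ids may follow positional experiment IDs.
        cmd.append("--trial-ids")
        cmd += [str(t) for t in trial_ids]
    if config_file is not None:
        cmd.append(f"--config-file={config_file}")
    return cmd


# Example: compare experiments 7 and 8 using a custom job configuration.
print(" ".join(tensorboard_start_command([7, 8], config_file="my_config.yaml")))
```

The returned list can be passed to subprocess.run(cmd, check=True) on a machine where the CLI is installed.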
Data in TensorBoard#
This section provides a brief overview of how HPE Machine Learning Development Environment captures data from TensorFlow models. For a more in-depth discussion of how TensorBoard visualizes data, consult the TensorBoard documentation.
TensorBoard visualizes data captured during model training and validation, which is stored in tfevent files. These files are generated by writing TensorFlow summary operations to disk using a tf.summary.FileWriter. Each deep learning framework has support for writing and uploading metrics as tfevent files.
FileWriters are configured to write log files, called tfevent files, to a directory known as the logdir. TensorBoard monitors this directory for changes and updates accordingly. The logdir supported by HPE Machine Learning Development Environment is /tmp/tensorboard. All tfevent files written to /tmp/tensorboard in a trial are uploaded to persistent storage when a trial is configured with HPE Machine Learning Development Environment TensorBoard support.
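To check which tfevent files a trial has written locally, you can scan the logdir yourself. This is a minimal sketch and not the product's upload logic: find_tfevent_files is a hypothetical helper, and it assumes that file names contain "tfevents", as in TensorBoard's usual events.out.tfevents.<timestamp>.<hostname> naming.

```python
from pathlib import Path
from typing import List

# The logdir named in the text above.
LOGDIR = Path("/tmp/tensorboard")


def find_tfevent_files(logdir: Path = LOGDIR) -> List[Path]:
    """Return tfevent files found under `logdir`, sorted by path."""
    if not logdir.is_dir():
        return []
    # Assumption: tfevent file names contain the substring "tfevents".
    return sorted(p for p in logdir.rglob("*tfevents*") if p.is_file())


for path in find_tfevent_files():
    print(path, path.stat().st_size, "bytes")
```

An empty result inside a running trial suggests that no summary writer has flushed to the logdir yet.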
HPE Machine Learning Development Environment Batch Metrics#
At the end of every training workload, batch metrics are collected and stored in the database, providing a granular view of model metrics over time. Batch metrics will appear in TensorBoard under the HPE Machine Learning Development Environment group. The x-axis of each plot corresponds to the batch number.
For example, a point at step 5 of the plot is the metric associated with the fifth batch seen.
Framework-Specific Configuration#
To configure TensorBoard for a specific framework, follow the examples below:
TensorFlow Keras#
For models using TFKerasTrial, add a determined.keras.callbacks.TensorBoard callback to your trial class:
from determined.keras import TFKerasTrial
from determined.keras.callbacks import TensorBoard

class MyModel(TFKerasTrial):
    ...

    def keras_callbacks(self):
        return [TensorBoard()]
PyTorch#
TensorBoard Lifecycle Management#
HPE Machine Learning Development Environment automatically terminates idle TensorBoard instances. A TensorBoard instance is considered idle if it does not receive HTTP traffic (a TensorBoard that is still being viewed in a web browser is not considered idle). By default, idle TensorBoards are terminated after 5 minutes; you can change the timeout duration by editing tensorboard_timeout in the master config file.
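The idle rule can be pictured as a timer that resets on every HTTP request. The sketch below illustrates that idea only; it is not the master's actual implementation. IdleTracker is a hypothetical class, and expressing the 5-minute default as 300 seconds is an assumption about units.

```python
import time
from typing import Optional


class IdleTracker:
    """Track the time since the last HTTP request (illustrative only)."""

    def __init__(self, timeout_seconds: float = 300.0, now: Optional[float] = None) -> None:
        # 300 seconds mirrors the 5-minute default described above.
        self.timeout_seconds = timeout_seconds
        self.last_request = time.monotonic() if now is None else now

    def record_request(self, now: Optional[float] = None) -> None:
        """Reset the idle timer; call this whenever traffic is observed."""
        self.last_request = time.monotonic() if now is None else now

    def is_idle(self, now: Optional[float] = None) -> bool:
        """Return True once the timeout has elapsed with no traffic."""
        current = time.monotonic() if now is None else now
        return current - self.last_request >= self.timeout_seconds


tracker = IdleTracker(timeout_seconds=300.0, now=0.0)
tracker.record_request(now=120.0)   # a browser tab polls at t=120s
print(tracker.is_idle(now=400.0))   # 280s since last request
print(tracker.is_idle(now=420.0))   # 300s since last request
```

Because an open browser tab keeps sending requests, the timer keeps resetting, which matches the statement that a TensorBoard being viewed is not considered idle.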
You can also terminate TensorBoard instances manually by using det tensorboard kill <tensorboard-id>:
$ det tensorboard kill aab49ba5-3357-4145-861c-7e6ff2d702c5
To open a web browser window connected to a previously launched TensorBoard instance, use det tensorboard open. To view the logs of an existing TensorBoard instance, use det tensorboard logs.
Implementation Details#
HPE Machine Learning Development Environment schedules TensorBoard instances in containers that run on agent machines. The HPE Machine Learning Development Environment master will proxy HTTP requests to and from the TensorBoard container. TensorBoard instances are hosted on agent machines but they do not occupy GPUs.
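Because the master proxies the traffic, the URL you open points at the master rather than at the agent. The sketch below rebuilds the address printed in the earlier det tensorboard start output. The /proxy/<id>/ pattern is inferred from that sample output alone, and tensorboard_proxy_url is a hypothetical helper, so treat both as assumptions.

```python
def tensorboard_proxy_url(master_url: str, task_id: str) -> str:
    """Join a master address and an ID into a proxied TensorBoard URL.

    Assumption: the URL layout follows the sample CLI output on this page.
    """
    return f"{master_url.rstrip('/')}/proxy/{task_id}/"


print(tensorboard_proxy_url("http://localhost:8080", "c68c9fc9-7eed-475b-a50f-fd78406d7c83"))
```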
Logging Additional TensorBoard Events#
Any additional TFEvent files that are written to the appropriate path during training are accessible to TensorBoard. The appropriate path varies by worker rank and can be obtained by one of the following functions:
For CoreAPI users: get_tensorboard_path()
For PyTorchTrial users: get_tensorboard_path()
For DeepSpeedTrial users: get_tensorboard_path()
For TFKerasTrial users: get_tensorboard_path()
For more details and examples, refer to the TensorBoard How-To Guide.
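As a rough illustration, the sketch below writes a few custom scalars to the directory returned by get_tensorboard_path(). It is a sketch under stated assumptions rather than the documented method: log_extra_scalars and its writer_factory parameter are hypothetical, context stands for whichever object exposes get_tensorboard_path() in your API, and the default writer is PyTorch's torch.utils.tensorboard.SummaryWriter, which requires PyTorch and TensorBoard to be installed.

```python
from typing import Any, Callable, Mapping, Optional


def log_extra_scalars(
    context: Any,
    scalars: Mapping[str, float],
    step: int,
    writer_factory: Optional[Callable[..., Any]] = None,
) -> str:
    """Write scalars as TFEvent data under the rank-specific TensorBoard path.

    `context` is any object exposing get_tensorboard_path() (assumption).
    Returns the directory that was written to.
    """
    if writer_factory is None:
        # Imported lazily so the module loads even without PyTorch installed.
        from torch.utils.tensorboard import SummaryWriter

        writer_factory = SummaryWriter
    log_dir = str(context.get_tensorboard_path())
    writer = writer_factory(log_dir=log_dir)
    try:
        for tag, value in scalars.items():
            writer.add_scalar(tag, value, step)
    finally:
        # Closing flushes pending events to disk.
        writer.close()
    return log_dir
```

Inside a trial you might call log_extra_scalars(self.context, {"custom/grad_norm": 1.7}, step=batch_idx), where the tag name and value are placeholders.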