Profiling#
Optimizing a model’s performance is often a trade-off between accuracy, time, and resource requirements. Training deep learning models is a time- and resource-intensive process, where each iteration can take several hours and accumulate heavy hardware costs. Though some of this cost is inherent to the task, unnecessary resource consumption can also be caused by suboptimal code or bugs. Achieving optimal model performance therefore requires an understanding of how your model interacts with the system’s computational resources.
Profiling collects metrics on how computational resources like CPU, GPU, and memory are being utilized during a training job. It can reveal patterns in resource utilization that indicate performance bottlenecks and pinpoint areas of the code or pipeline that are causing slowdowns or inefficiencies.
A training job can be profiled at many different layers, from generic system-level metrics to individual model operators and GPU kernels. HPE Machine Learning Development Environment provides a few options for profiling, each targeting a different layer in a training job at various levels of detail:
HPE Machine Learning Development Environment system metrics profiler collects general system-level metrics and provides an overview of hardware usage during an experiment.
Native profiler integration enables framework-native profiling within the training APIs, providing fine-grained metrics specific to your model.
Prometheus/Grafana integration can be set up to track detailed hardware metrics and monitor overall cluster health.
HPE Machine Learning Development Environment Profiler#
HPE Machine Learning Development Environment comes with a built-in profiler that provides out-of-the-box tracking for system-level metrics. System metrics are statistics around hardware usage, such as GPU utilization, disk usage, and network throughput.
These metrics provide a general overview of resource usage during a training run and can be useful for quickly identifying inefficient use of computational resources. When the system metrics reported for an experiment do not match hardware expectations, that is a sign that the software could likely be optimized to make better use of the hardware.
The HPE Machine Learning Development Environment profiler collects a set of system metrics throughout an experiment which can be visualized in the WebUI under the experiment’s “Profiler” tab. It is supported for all training APIs, but is not enabled by default.
Visit Core API Profiler to find out how to enable and configure the HPE Machine Learning Development Environment profiler for your experiment.
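As a rough illustration, the sketch below shows how a Core API training script might toggle system-metrics collection around the part of the run you want to profile. The `core_context.profiler.on()` / `off()` calls are an assumption about the Core API profiler interface; the Core API Profiler page is the authoritative reference for the exact names and configuration options.

```python
# Hypothetical sketch: toggling system-metrics collection from a Core API
# training script. The profiler.on()/off() method names are assumptions;
# see the Core API Profiler documentation for the exact interface.
import determined as det


def train(core_context: det.core.Context) -> None:
    # Begin collecting system metrics (GPU, CPU, disk, network) for this run.
    core_context.profiler.on()

    for batch_idx in range(100):
        ...  # run one training step here

    # Stop collection once the phase of interest has been profiled.
    core_context.profiler.off()


if __name__ == "__main__":
    with det.core.init() as core_context:
        train(core_context)
```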
The following system metrics are tracked:
GPU utilization (percent): utilization of a GPU device
GPU free memory (bytes): amount of free memory available on a GPU device
Network throughput - sent (bytes/s): bytes sent per second system-wide
Network throughput - received (bytes/s): bytes received per second system-wide
Disk IOPS (operations/s): number of reads and writes per second system-wide
Disk throughput - reads (bytes/s): bytes read per second system-wide
Disk throughput - writes (bytes/s): bytes written per second system-wide
Host available memory (bytes): amount of memory available (not including swap) system-wide
CPU utilization (percent): utilization of CPU cores, averaged across all cores in the system
For distributed training, these metrics are collected for every agent. The data is broken down by agent, and GPU metrics can be further broken down by GPU.
Note
System metrics are recorded at the agent level, so when multiple experiments share the same agent, it can be difficult to attribute resource usage to a particular experiment.
Native Training Profilers#
Sometimes system-level profiling doesn’t capture enough data to help debug bottlenecks in model training code. Identifying inefficiencies in individual training operations or steps requires finer-grained context than generic system metrics can provide. For this level of profiling, HPE Machine Learning Development Environment supports integration with the profilers native to each training framework:
PyTorch Profiler (PyTorch API)
DeepSpeed Profiler (DeepSpeed API)
TensorFlow Keras Profiler (Keras API)
Please see your framework’s profiler documentation and the HPE Machine Learning Development Environment Training API guide for usage details.
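For a sense of what framework-native profiling looks like, the sketch below uses the standard PyTorch Profiler API on a plain training loop. It makes no assumptions about HPE Machine Learning Development Environment integration points; how the profiler is enabled through a Training API, and where its traces are surfaced, is described in the linked guides. The output directory name here is an arbitrary placeholder.

```python
# Illustrative use of the native PyTorch Profiler on a plain training loop.
import torch
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

model = torch.nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

# Profile CUDA kernels only when a GPU is actually available.
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(
    activities=activities,
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=tensorboard_trace_handler("./profiler_traces"),  # placeholder output dir
) as prof:
    for step in range(8):
        inputs = torch.randn(32, 128)
        targets = torch.randint(0, 10, (32,))
        loss = loss_fn(model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        prof.step()  # advance the profiler schedule once per training step
```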
Prometheus and Grafana Integration#
For a more resource-centric view of HPE Machine Learning Development Environment jobs, HPE Machine Learning Development Environment provides a Prometheus endpoint along with a pre-configured Grafana dashboard. These can be set up to track detailed hardware usage metrics for an HPE Machine Learning Development Environment cluster, and can be helpful for alerting and monitoring cluster health.
The Prometheus endpoint aggregates system metrics and associates them with HPE Machine Learning Development Environment concepts such as experiments, tags, and resource pools, which can then be viewed in Grafana. HPE Machine Learning Development Environment provides a Grafana dashboard that shows real-time resource metrics across an entire cluster, as well as broken down by experiment, container, and resource pool.
Visit Prometheus & Grafana to find out how to enable this functionality.
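As a hedged example of what you can do once the integration is running, the snippet below queries a Prometheus server over its standard HTTP API. The server address and the metric name are placeholders (the metric shown is the GPU-utilization gauge exposed by NVIDIA's DCGM exporter, which may differ from your setup); the Prometheus & Grafana page documents which exporters and metrics the integration actually uses.

```python
# Illustrative instant query against a Prometheus server that scrapes the
# HPE Machine Learning Development Environment endpoint. The address and
# metric name below are assumptions for the sake of the example.
import requests

PROMETHEUS_URL = "http://prometheus.example.com:9090"  # placeholder address

response = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query",
    params={"query": "avg(DCGM_FI_DEV_GPU_UTIL)"},  # assumed metric name
    timeout=10,
)
response.raise_for_status()

for result in response.json()["data"]["result"]:
    timestamp, value = result["value"]
    print(f"average GPU utilization: {value}% at {timestamp}")
```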