AMD ROCm Support#
Overview#
Note
AMD ROCm support in HPE Machine Learning Development Environment is experimental. Features and configurations may change in future releases. We recommend testing thoroughly in a non-production environment before deploying to production.
HPE Machine Learning Development Environment provides experimental support for AMD ROCm GPUs in Kubernetes deployments. It also provides prebuilt Docker images for ROCm, including a ROCm 6.1 image with DeepSpeed support for MI300X users.
You can build these images locally based on the Dockerfiles found in the environments repository.
For more information about configuration, see the Helm Chart Configuration Reference. For current limitations and troubleshooting, see Known Issues and Limitations.
Configuring Kubernetes for AMD ROCm GPUs#
To use AMD ROCm GPUs in your Kubernetes deployment:
Ensure your Kubernetes cluster has nodes with ROCm-capable GPUs and the necessary drivers installed.
In your Helm chart values or HPE Machine Learning Development Environment configuration, set the following:
resourceManager:
  defaultComputeResourcePool: rocm-pool
resourcePools:
  - pool_name: rocm-pool
    gpu_type: rocm
    max_slots: <number_of_rocm_gpus>
When submitting experiments or launching tasks, specify slot_type: rocm in your experiment configuration.
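A typo in the pool name or GPU type is easy to miss, so the following is a minimal sketch of a consistency check over the values shown above. It assumes the Helm values have already been parsed into a plain Python mapping and that resourcePools sits at the top level next to resourceManager. The function name check_rocm_pool and the example slot count of 8 are hypothetical and not part of the product.

```python
def check_rocm_pool(values: dict) -> list:
    """Return a list of problems found in a parsed Helm values mapping."""
    problems = []
    default_pool = values.get("resourceManager", {}).get("defaultComputeResourcePool")
    # Index the declared pools by name (top-level placement is an assumption).
    pools = {p.get("pool_name"): p for p in values.get("resourcePools", [])}

    if default_pool is None:
        problems.append("resourceManager.defaultComputeResourcePool is not set")
    elif default_pool not in pools:
        problems.append(f"default pool {default_pool!r} is not defined in resourcePools")
    else:
        pool = pools[default_pool]
        if pool.get("gpu_type") != "rocm":
            problems.append(f"pool {default_pool!r} does not set gpu_type: rocm")
        slots = pool.get("max_slots")
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(slots, bool) or not isinstance(slots, int) or slots < 1:
            problems.append(f"pool {default_pool!r} needs a positive integer max_slots")
    return problems


# Hypothetical values mirroring the configuration above, with 8 GPUs assumed.
example_values = {
    "resourceManager": {"defaultComputeResourcePool": "rocm-pool"},
    "resourcePools": [
        {"pool_name": "rocm-pool", "gpu_type": "rocm", "max_slots": 8},
    ],
}
print(check_rocm_pool(example_values))
```

An empty list means the default compute pool exists, is marked as a ROCm pool, and declares a usable slot count.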
Using AMD ROCm Images in Experiments#
To use AMD ROCm images in your experiments, specify the image in your experiment configuration:
environment:
  image: determinedai/pytorch-infinityhub-dev:rocm6.1-pytorch2.1-deepspeed0.10.0
Ensure that your experiment configuration also specifies slot_type: rocm to use ROCm GPUs.
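To confirm that a task started from a ROCm image can actually see the GPUs, a short check like the sketch below may help. It relies on the fact that ROCm builds of PyTorch expose AMD GPUs through the regular torch.cuda API and report a HIP version in torch.version.hip. The helper name describe_gpu_runtime is hypothetical, and the snippet assumes it runs inside the task container.

```python
def describe_gpu_runtime() -> dict:
    """Summarize what PyTorch can see; does not raise if PyTorch is missing."""
    try:
        import torch
    except (ImportError, OSError):
        # PyTorch is absent or its native libraries failed to load.
        return {"torch_installed": False}
    return {
        "torch_installed": True,
        # None on non-ROCm builds of PyTorch.
        "hip_version": getattr(torch.version, "hip", None),
        # ROCm devices are reported through the torch.cuda namespace.
        "gpu_available": torch.cuda.is_available(),
        "gpu_count": torch.cuda.device_count(),
    }


print(describe_gpu_runtime())
```

If hip_version is None, the image is probably not a ROCm build. If it is set but gpu_available is False, the drivers or device access on the node are the more likely cause.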
Known Issues and Limitations#
Agent Deprecation: Agent-based deployments are deprecated for ROCm support. Use Kubernetes with ROCm support for your deployments.
HIP GPU Errors: Launching experiments with slot_type: rocm may fail with the error RuntimeError: No HIP GPUs are available. Ensure compute nodes have compatible ROCm drivers and libraries installed and available in default locations or added to the PATH and/or LD_LIBRARY_PATH.
Boost Filesystem Errors: You may encounter the error boost::filesystem::remove: Directory not empty during ROCm operations. A workaround is to disable per-container /tmp using bind mounts in your experiment configuration or globally using the task_container_defaults section in your master configuration:
bind_mounts:
  - host_path: /tmp
    container_path: /tmp
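When troubleshooting the No HIP GPUs are available error, it can help to check whether the HIP runtime library is reachable and whether the ROCm device nodes exist. The sketch below uses only the Python standard library. It assumes the runtime library is named libamdhip64.so, that /opt/rocm/lib is the default install location, and that /dev/kfd and /dev/dri are the relevant device nodes; adjust these for your installation. The function names are hypothetical.

```python
import os
from pathlib import Path


def find_hip_runtime(env=None, lib_name="libamdhip64.so", default_dirs=("/opt/rocm/lib",)):
    """Return the directories that contain a file whose name starts with lib_name."""
    env = os.environ if env is None else env
    # Directories from LD_LIBRARY_PATH first, then the assumed default locations.
    candidates = [d for d in env.get("LD_LIBRARY_PATH", "").split(os.pathsep) if d]
    candidates.extend(default_dirs)
    found = []
    for directory in candidates:
        path = Path(directory)
        if directory not in found and path.is_dir() and any(path.glob(lib_name + "*")):
            found.append(directory)
    return found


def rocm_device_nodes(nodes=("/dev/kfd", "/dev/dri")):
    """Report whether each assumed ROCm device node exists."""
    return {node: os.path.exists(node) for node in nodes}


print("HIP runtime found in:", find_hip_runtime())
print("Device nodes:", rocm_device_nodes())
```

An empty result from find_hip_runtime suggests the libraries are missing from both LD_LIBRARY_PATH and the assumed default location. A missing device node suggests a driver or container device access problem.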