AMD ROCm Support#

Overview#

Note

AMD ROCm support in HPE Machine Learning Development Environment is experimental. Features and configurations may change in future releases. We recommend testing thoroughly in a non-production environment before deploying to production.

HPE Machine Learning Development Environment provides experimental support for AMD ROCm GPUs in Kubernetes deployments. HPE Machine Learning Development Environment provides prebuilt Docker images for ROCm, including the latest ROCm 6.1 version with DeepSpeed support for MI300x users:

You can build these images locally based on the Dockerfiles found in the environments repository.

For more detailed information about configuration, visit the Helm Chart Configuration Reference or visit Known Issues and Limitations for details on current limitations and troubleshooting.

Configuring Kubernetes for AMD ROCm GPUs#

To use AMD ROCm GPUs in your Kubernetes deployment:

  1. Ensure your Kubernetes cluster has nodes with ROCm-capable GPUs and the necessary drivers installed.

  2. In your Helm chart values or HPE Machine Learning Development Environment configuration, set the following:

    resourceManager:
      defaultComputeResourcePool: rocm-pool
    
    resourcePools:
      - pool_name: rocm-pool
        gpu_type: rocm
        max_slots: <number_of_rocm_gpus>
    
  3. When submitting experiments or launching tasks, specify slot_type: rocm in your experiment configuration.

Using AMD ROCm Images in Experiments#

To use AMD ROCm images in your experiments, specify the image in your experiment configuration:

environment:
  image: determinedai/pytorch-infinityhub-dev:rocm6.1-pytorch2.1-deepspeed0.10.0

Ensure that your experiment configuration also specifies slot_type: rocm to use ROCm GPUs.

Known Issues and Limitations#

  • Agent Deprecation: Agent-based deployments are deprecated for ROCm support. Use Kubernetes with ROCm support for your deployments.

  • HIP GPU Errors: Launching experiments with slot_type: rocm may fail with the error RuntimeError: No HIP GPUs are available. Ensure compute nodes have compatible ROCm drivers and libraries installed and available in default locations or added to the PATH and/or LD_LIBRARY_PATH.

  • Boost Filesystem Errors: You may encounter the error boost::filesystem::remove: Directory not empty during ROCm operations. A workaround is to disable per-container /tmp using bind mounts in your experiment configuration or globally using the task_container_defaults section in your master configuration:

    bind_mounts:
       - host_path: /tmp
         container_path: /tmp