Troubleshooting

Important

TensorFlow users must set the environment image in their experiment configuration file before submitting an experiment:
environment:
   image:
      cpu: determinedai/tensorflow-ngc-dev:f17151a
      gpu: determinedai/tensorflow-ngc-dev:f17151a

Error messages

docker: Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:407: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --compute --utility --require=cuda>=10.0 brand=tesla,driver>=384,driver<385 brand=tesla,driver>=410,driver<411 --pid=35777 /var/lib/docker/devicemapper/mnt/7b5b6d59cd4fe9307b7523f1cc9ce3bc37438cc793ff4a5a18a0c0824ec03982/rootfs]\\\\nnvidia-container-cli: requirement error: unsatisfied condition: brand = tesla\\\\n\\\"\"": unknown.

This error message indicates that the GPU hardware or the NVIDIA driver installed on the agent is not compatible with CUDA 10, while the Docker image you are trying to run depends on CUDA 10.
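When this message shows up in logs, the failed requirement sits near the end of a long, heavily escaped string. As a minimal sketch, assuming the message keeps the wording shown above, the reported condition could be extracted with a regular expression. The function name `unsatisfied_condition` is hypothetical and not part of any Determined or NVIDIA tool.

```python
import re
from typing import Optional


def unsatisfied_condition(error_message: str) -> Optional[str]:
    """Return the text after 'unsatisfied condition:', or None if absent.

    Hypothetical helper: it assumes the message follows the wording of the
    error shown above.
    """
    # Stop at the first backslash, double quote, or newline, because the
    # Docker daemon wraps the nvidia-container-cli output in escaped quotes.
    match = re.search(r'unsatisfied condition:\s*([^\\"\n]+)', error_message)
    return match.group(1).strip() if match else None
```

For the message above, this returns `brand = tesla`, which is the requirement the agent's GPU or driver did not meet.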

To diagnose this issue, run the following commands. If the first succeeds and the second fails, you should be able to use HPE Machine Learning Development Environment as long as you use Docker images based on CUDA 9.

docker run --gpus all --rm nvidia/cuda:9.0-runtime nvidia-smi
docker run --gpus all --rm nvidia/cuda:10.0-runtime nvidia-smi
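The two checks above can also be scripted. The following is a sketch only: the helper names are hypothetical, the image tags are copied from the commands above, and the messages for the cases the text does not cover (both commands succeed, or both fail) are assumptions rather than documented behavior.

```python
import subprocess

# Image tags copied from the two commands above.
CUDA9_IMAGE = "nvidia/cuda:9.0-runtime"
CUDA10_IMAGE = "nvidia/cuda:10.0-runtime"


def can_run_nvidia_smi(image: str) -> bool:
    """Return True if `nvidia-smi` exits with status 0 inside `image`."""
    cmd = ["docker", "run", "--gpus", "all", "--rm", image, "nvidia-smi"]
    try:
        return subprocess.run(cmd, capture_output=True).returncode == 0
    except FileNotFoundError:
        # The docker executable was not found on PATH.
        return False


def interpret(cuda9_ok: bool, cuda10_ok: bool) -> str:
    """Map the two check results to a short diagnosis."""
    if cuda9_ok and not cuda10_ok:
        # The case described in the text above.
        return "Use Docker images based on CUDA 9."
    if cuda10_ok:
        # Assumption: if the CUDA 10 image runs, CUDA 10 is not the blocker.
        return "CUDA 10 images run on this agent."
    # Assumption: neither image runs, so the problem is broader than CUDA 10.
    return "Neither image runs; check the GPU container setup."


def diagnose() -> str:
    """Run both checks on the current machine (requires Docker and GPUs)."""
    return interpret(
        can_run_nvidia_smi(CUDA9_IMAGE),
        can_run_nvidia_smi(CUDA10_IMAGE),
    )
```

Calling `print(diagnose())` on the agent runs both containers in turn. Keeping `interpret` separate from the Docker calls makes the decision logic easy to check without GPU hardware.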

Debug Database Migration Failures

Dirty database version <a long number>. Fix and force version.

If you see the above error message, a database migration was likely interrupted while running and the database is now in a dirty state.

Make sure you back up the database and temporarily shut down the master before proceeding further.

To fix this error, locate the up migration file whose name ends in .up.sql and begins with the long number from the error message. The migration files are in this directory: https://github.com/determined-ai/determined/tree/main/master/static/migrations. Carefully run the SQL in that file manually against the database used by HPE Machine Learning Development Environment. For convenience, all the information needed to connect, except the password, can be found with:

det master config | jq .db
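Matching the number in the error message to a migration file can be sketched as follows. The function names are hypothetical, and the code assumes that migration files are named `<version>_<title>.up.sql`; check this against the actual directory listing before relying on it.

```python
import re
from typing import Iterable, Optional


def dirty_version(error_message: str) -> Optional[str]:
    """Extract the version number from a 'Dirty database version' error."""
    match = re.search(r"Dirty database version (\d+)", error_message)
    return match.group(1) if match else None


def find_up_migration(filenames: Iterable[str], version: str) -> Optional[str]:
    """Return the .up.sql file whose version prefix equals `version`.

    Assumption: file names look like '<version>_<title>.up.sql', so the
    version is everything before the first underscore.
    """
    for name in filenames:
        if name.endswith(".up.sql") and name.split("_", 1)[0] == version:
            return name
    return None
```

For example, given a local checkout, `find_up_migration(os.listdir("master/static/migrations"), version)` would return the file to run, or `None` if no file matches. Comparing the whole prefix rather than using `startswith` avoids accidentally matching a longer version number that begins with the same digits.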

If this proceeds successfully, then mark the migration as successful by running the following SQL:

UPDATE schema_migrations SET dirty = false;

Then restart the master. If the migration SQL does not run successfully, seek assistance in the community Slack.

Validate NVIDIA Container Toolkit

To verify that an HPE Machine Learning Development Environment agent instance can run containers that use GPUs, run:

docker run --gpus all --rm debian:10-slim nvidia-smi

You should see output that describes the GPUs available on the agent instance, such as:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:05:00.0 Off |                  N/A |
| 56%   84C    P2   177W / 250W |  10729MiB / 11176MiB |     76%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:06:00.0 Off |                  N/A |
| 28%   62C    P0    56W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:09:00.0 Off |                  N/A |
| 31%   64C    P0    57W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  TITAN Xp            Off  | 00000000:0A:00.0 Off |                  N/A |
| 20%   36C    P0    57W / 250W |      0MiB / 12196MiB |      6%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      4638      C   python3                                    10719MiB |
+-----------------------------------------------------------------------------+
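If this check is automated, the driver version, CUDA version, and GPU names can be read from the captured output. The sketch below assumes the table layout shown above, which varies between driver releases, so the regular expressions may need adjusting. The function names are hypothetical.

```python
import re
from typing import List, Optional, Tuple


def parse_versions(output: str) -> Optional[Tuple[str, str]]:
    """Return (driver_version, cuda_version) from the nvidia-smi header."""
    match = re.search(
        r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)", output
    )
    return (match.group(1), match.group(2)) if match else None


def gpu_names(output: str) -> List[str]:
    """Return GPU names from rows like '|   0  <name>  Off  | ...'.

    Assumption: each GPU row starts with its index, followed by the name
    and the persistence mode (On or Off) before the first column separator.
    """
    pattern = re.compile(r"^\|\s+\d+\s+(.+?)\s+(?:On|Off)\s+\|", re.MULTILINE)
    return pattern.findall(output)
```

For the sample output above, `parse_versions` returns `("418.39", "10.1")` and `gpu_names` returns four entries, the last being `TITAN Xp`. An empty list from `gpu_names` suggests the container could not see any GPUs or that the layout differs from what this sketch expects.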
