Debugging Models
Using HPE Machine Learning Development Environment to debug models depends on your environment.
Your code on an HPE Machine Learning Development Environment cluster differs from typical training scripts in the following ways:
The code conforms to the Trial APIs as a subclass of the HPE Machine Learning Development Environment Trial class, indirectly, by using one of the concrete subclasses, such as DeepSpeedTrial.
The code runs in a Docker container on another machine.
Your model can run many times in a hyperparameter search.
Your model can be distributed across multiple GPUs or machines.
These debugging steps introduce code changes incrementally, working toward a fully functioning HPE Machine Learning Development Environment model. Follow the nine steps as applicable to your environment:
Prerequisite
Ensure you have successfully installed an HPE Machine Learning Development Environment cluster. The cluster can be installed on a local development machine, on-prem, or on cloud. For installation guides, visit Install and Set Up HPE Machine Learning Development Environment.
Step 1 - Verify that your code runs locally
This step assumes you have ported (converted) your model from code outside of HPE Machine Learning Development Environment. Otherwise, skip to Step 2.
Confirm that your code works as expected before continuing.
Step 2 - Verify that each Trial subclass method works locally
This step assumes you have a working local environment for training. If you typically run your code in a Docker environment, skip to Step 4. This step also ensures that your class performs as expected by calling its methods and verifying the output.
PyTorchTrial supports a fully local training mode, which can be useful for debugging. See PyTorch Trainer for usage details.
Create simple tests to verify each Trial subclass method.
Examples of what these tests might look like for DeepSpeedTrial and TFKerasTrial can be found in the determined.TrialContext.from_config() documentation, but only you can verify what is reasonable for your test.
Diagnose failures:
If you experience issues running the Trial subclass methods locally, it is likely that there are errors in your trial class or in the hyperparameters section of your configuration file. Evaluating the class method by method makes these issues easier to find and solve.
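A per-method check can be as small as a few assertions. The sketch below is illustrative only: ToyTrial is a hypothetical stand-in that mimics the shape of a trial class. It does not import or subclass anything from the determined library, and the method names build_batches, train_batch, and evaluate_batch are assumptions chosen for the example. Substitute your own Trial subclass and the methods it actually defines.

```python
"""Sketch of method-by-method checks against a hypothetical trial-like class."""
import math


class ToyTrial:
    """Hypothetical stand-in for a Trial subclass; NOT the determined API."""

    def __init__(self, hparams):
        self.lr = hparams["learning_rate"]
        self.weight = 0.0  # single-parameter model: y = weight * x

    def build_batches(self):
        # Tiny synthetic dataset for y = 2x, split into batches of (x, y) pairs.
        return [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]

    def train_batch(self, batch):
        # One gradient-descent step on mean squared error.
        grad = sum(2 * (self.weight * x - y) * x for x, y in batch) / len(batch)
        self.weight -= self.lr * grad
        return {"loss": self._mse(batch)}

    def evaluate_batch(self, batch):
        return {"validation_loss": self._mse(batch)}

    def _mse(self, batch):
        return sum((self.weight * x - y) ** 2 for x, y in batch) / len(batch)


def check_trial_methods(trial):
    """Call each method once and verify the shape of what it returns."""
    batches = trial.build_batches()
    assert batches, "data loader returned no batches"

    before = trial.evaluate_batch(batches[0])["validation_loss"]
    train_metrics = trial.train_batch(batches[0])
    after = trial.evaluate_batch(batches[0])["validation_loss"]

    assert "loss" in train_metrics
    assert math.isfinite(train_metrics["loss"])
    assert after < before, "one training step should reduce loss on the same batch"
    return before, after


if __name__ == "__main__":
    print(check_trial_methods(ToyTrial({"learning_rate": 0.05})))
```

Checking that a single training step lowers the loss on the batch it just trained on is a cheap signal that the optimizer, loss, and data are wired together. It is not a guarantee for every model, so choose assertions that are reasonable for yours.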
Step 3 - Verify local test mode
Step 2 validated that your Trial API calls work as expected. This step uses your code to run an actual HPE Machine Learning Development Environment training loop with abbreviated workloads to make sure that it meets HPE Machine Learning Development Environment requirements.
This step assumes you have a working local environment for training. If you do not, skip to Step 4.
Create an experiment using the following command:
det experiment create myconfig.yaml my_model_dir --local --test
The --local argument specifies that training occurs where you launched the experiment instead of occurring on a cluster. The --test argument runs abbreviated workloads to try to detect bugs sooner and exits immediately.
The test is considered to have passed if the command completes successfully.
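If you want to script this check, for example in a pre-commit hook, "completes successfully" can be read as "exits with status 0". The sketch below makes that assumption explicit. The helper names local_test_command and passed are hypothetical, and only the command-line arguments shown above are used.

```python
"""Sketch: run the local test command and report pass/fail by exit status."""
import subprocess


def local_test_command(config_path, model_dir):
    # Same arguments as the command shown above.
    return ["det", "experiment", "create", config_path, model_dir, "--local", "--test"]


def passed(command, runner=subprocess.run):
    """Return True when the command exits with status 0.

    `runner` is injectable so the logic can be exercised without a det install.
    """
    try:
        completed = runner(command)
    except FileNotFoundError:
        # The executable named in command[0] is not on PATH.
        return False
    return completed.returncode == 0


if __name__ == "__main__":
    ok = passed(local_test_command("myconfig.yaml", "my_model_dir"))
    print("local test mode passed" if ok else "local test mode failed")
```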
Diagnose failures:
Local test mode performs the following actions:
Builds a model.
Runs a single batch of training data.
Evaluates the model.
Saves a checkpoint to a dummy location.
If your per-method checks in Step 2 passed but local test mode fails, your Trial subclass might not be implemented correctly. Double-check the documentation. It is also possible that you have found a bug or an invalid assumption in the HPE Machine Learning Development Environment software and should file a GitHub issue or contact HPE Machine Learning Development Environment on Slack.
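To narrow down which of the four actions is failing, it can help to reproduce the same sequence by hand. The following is a minimal sketch of that sequence using a toy one-weight model stored in a dict. It is not how local test mode is implemented, and every function name in it is hypothetical. Replace each step with the equivalent call into your own code.

```python
"""Sketch of the four local-test-mode actions, using a toy one-weight model."""
import json
import tempfile
from pathlib import Path


def build_model():
    return {"w": 0.0}  # toy model: y = w * x


def mean_squared_error(model, batch):
    return sum((model["w"] * x - y) ** 2 for x, y in batch) / len(batch)


def train_one_batch(model, batch, lr=0.1):
    # One gradient-descent step on mean squared error.
    grad = sum(2 * (model["w"] * x - y) * x for x, y in batch) / len(batch)
    model["w"] -= lr * grad
    return {"loss": mean_squared_error(model, batch)}


def smoke_test(batch):
    model = build_model()                          # 1. build a model
    train_metrics = train_one_batch(model, batch)  # 2. run a single training batch
    val_metrics = {"validation_loss": mean_squared_error(model, batch)}  # 3. evaluate
    with tempfile.TemporaryDirectory() as tmp:     # 4. checkpoint to a throwaway location
        path = Path(tmp) / "checkpoint.json"
        path.write_text(json.dumps(model))
        restored = json.loads(path.read_text())
    return train_metrics, val_metrics, restored == model


if __name__ == "__main__":
    print(smoke_test([(1.0, 2.0), (2.0, 4.0)]))
```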
Step 4 - Verify that the original code runs in a notebook or shell
This step is the same as Step 1, except the original code runs on the HPE Machine Learning Development Environment cluster instead of locally.
Launch a notebook or shell on the cluster:
Pass the root directory containing your model and training scripts in the --context argument:
If you prefer a Jupyter notebook, enter:
det notebook start --context my_model_dir # Your browser should automatically open the notebook.
If you prefer to use SSH to interact with your model, enter:
det shell start --context my_model_dir # Your terminal should automatically connect to the shell.
Note that changes made to the --context directory while inside the notebook or shell do not affect the original files outside of the notebook or shell. See Save and Restore Notebook State for more information.
Verify code execution:
After you are on the cluster, testing is the same as Step 1.
Diagnose failures:
If you are unable to start the container and receive a message about the context directory exceeding the maximum allowed size, it is because the --context directory cannot be larger than 95MB. If you need larger model definition files, consider setting up a bind mount using the bind_mounts field of the task configuration. The Prepare Data document lists additional strategies for accessing files inside a containerized environment.
You might be referencing files that exist locally but are outside of the --context directory. If the files are small, you may be able to copy them into the --context directory. Otherwise, bind mounting the files can be an option.
If you get dependency errors, dependencies might be installed locally that are not installed in the Docker environment used on the cluster. See Customize Your Environment and Custom Images for available options.
If you need environment variables to be set for your model to work, see Job Configuration Reference.
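For the context-size failure in particular, measuring the directory before launching can save a round trip. The sketch below totals the file sizes under a directory and lists the largest files. It assumes that "95MB" means 95 * 1024 * 1024 bytes and that every regular file under the directory counts toward the limit. Neither assumption is confirmed here, so treat the result as an estimate. The names file_sizes and context_report are hypothetical.

```python
"""Sketch: measure a context directory before passing it to --context."""
import os
import tempfile

# Assumption: "95MB" is read here as 95 * 1024 * 1024 bytes.
MAX_CONTEXT_BYTES = 95 * 1024 * 1024


def file_sizes(root):
    """Yield (size_in_bytes, path) for every regular file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                yield os.path.getsize(path), path


def context_report(root, limit=MAX_CONTEXT_BYTES, top=5):
    """Summarize total size, whether it fits the limit, and the largest files."""
    sizes = sorted(file_sizes(root), reverse=True)
    total = sum(size for size, _path in sizes)
    return {"total_bytes": total, "fits": total <= limit, "largest": sizes[:top]}


if __name__ == "__main__":
    # Demo on a throwaway directory; point this at your model directory instead.
    with tempfile.TemporaryDirectory() as demo_dir:
        with open(os.path.join(demo_dir, "model_def.py"), "wb") as handle:
            handle.write(b"x" * 2048)
        print(context_report(demo_dir))
```

The largest entries are usually datasets or old checkpoints that were left in the model directory, which are good candidates for a bind mount.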
Step 5 - Verify that each Trial subclass method works in a notebook or shell
This step is the same as Step 2, except the original code runs on the HPE Machine Learning Development Environment cluster instead of locally.
Launch a notebook or shell:
If you prefer a Jupyter notebook, enter:
det notebook start --context my_model_dir # Your browser should automatically open the notebook.
If you prefer to use SSH to interact with your model, enter:
det shell start --context my_model_dir # Your terminal should automatically connect to the shell.
When interacting with the shell or notebook, testing is the same as Step 2.
Diagnose failures:
Combine the failure diagnosis steps used in Step 2 and Step 4.
Step 6 - Verify that local test mode works in a notebook or shell
This step is the same as Step 3, except the original code runs on the HPE Machine Learning Development Environment cluster instead of locally.
Launch a notebook or shell as described in Step 4.
On the cluster, testing is the same as Step 3, except that the second argument of the det experiment create command, the model definition, should be /run/determined/workdir, or . if you have not changed the working directory after connecting to the cluster. This is because the --context directory specified when creating the shell or notebook is copied to the /run/determined/workdir directory inside the container, just as the model definition argument to det experiment create is copied.
Diagnose failures following the same steps described in Step 3 and Step 4.
Step 7 - Verify that cluster test mode works with slots_per_trial set to 1
This step is similar to Step 6, except instead of launching the command from an interactive environment, it is submitted to the cluster and managed by HPE Machine Learning Development Environment.
Apply customizations:
If you customized your command environment in testing Step 3, Step 4, or Step 5, make sure to apply the same customizations in your experiment configuration file.
Set resources.slots_per_trial:
Confirm that your experiment config does not specify resources.slots_per_trial or that it is set to 1. For example:
resources:
  slots_per_trial: 1
Create an experiment with the --test argument, omitting the --local argument:
det experiment create myconfig.yaml my_model_dir --test
Diagnose failures:
If you can run local test mode inside a notebook or shell but are unable to successfully submit an experiment, make sure that notebook or shell customizations you might have made are replicated in your experiment configuration, such as:
If required, a custom Docker image is set in the experiment configuration.
pip install or apt install commands needed in the interactive environment are built into a custom Docker image or included in the startup-hook.sh file in the model definition directory root. See Startup Hooks for more information.
Custom bind mounts required in the interactive environment are specified in the experiment configuration.
Environment variables are correctly set in the experiment configuration.
If no customizations are missing, the following new layers introduced with a cluster-managed experiment could be the cause of the problem:
The checkpoint_storage settings are used for cluster-managed training. If checkpoint_storage is not configured in the experiment configuration or the master configuration, an error can occur during experiment configuration validation, before the experiment or trials are created. Correct this by providing a checkpoint_storage configuration in either the experiment configuration or the master configuration.
For a cluster-based experiment, configured checkpoint_storage settings are validated before training starts. The message Checkpoint storage validation failed indicates that you should review the checkpoint_storage setting values.
The experiment configuration is more strictly validated for cluster-managed experiments than for --local --test mode. Errors related to invalid experiment configuration when attempting to submit the experiment to the cluster indicate that the experiment configuration has errors. Review the experiment configuration.
If you are unable to identify the cause of the problem, contact HPE Machine Learning Development Environment community support!
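Two of the checks in this step can be scripted before you submit. The sketch below works on configurations that have already been loaded into Python dicts. It inspects only the keys named in this step (resources.slots_per_trial and checkpoint_storage) and does not reproduce the cluster's own validation. The function name precheck is hypothetical.

```python
"""Sketch: pre-submission checks that mirror two items from this step."""


def precheck(experiment_config, master_config=None):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    master_config = master_config or {}

    # slots_per_trial must be unset or 1 for this step.
    slots = experiment_config.get("resources", {}).get("slots_per_trial", 1)
    if slots != 1:
        problems.append(f"resources.slots_per_trial is {slots}; expected 1 or unset")

    # checkpoint_storage must be present in at least one of the two configurations.
    if "checkpoint_storage" not in experiment_config and "checkpoint_storage" not in master_config:
        problems.append("checkpoint_storage is missing from both configurations")

    return problems


if __name__ == "__main__":
    # Placeholder dicts; real configurations would be loaded from your YAML files.
    print(precheck({"resources": {"slots_per_trial": 2}}))
```

An empty result means only that these two items look right. The cluster can still reject the configuration for other reasons.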
Step 8 - Verify that a single-GPU experiment works
This step is similar to Step 7, except that it introduces hyperparameter search and executes full training for each trial.
Configure your system the same as Step 7:
Confirm that your experiment configuration does not specify resources.slots_per_trial or that it is set to 1. For example:
resources:
  slots_per_trial: 1
Create an experiment without the --test or --local arguments:
You might find the --follow, or -f, argument helpful:
det experiment create myconfig.yaml my_model_dir -f
Diagnose failures:
If Step 7 worked but this step does not, check the following:
Check whether the error happens when the experiment configuration has searcher.source_trial_id set. One possibility in an actual experiment that does not occur in a --test experiment is the loading of a previous checkpoint. Errors when loading from a checkpoint can be caused by architectural changes, where the new model code is not architecturally compatible with the old model code.
Generally, issues in this step are caused by doing training and evaluation continuously. Focus on how that change can cause issues with your code.
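When you suspect an architectural mismatch, comparing parameter names and shapes between the saved checkpoint and the current model usually pinpoints it. The sketch below is framework-agnostic and assumes you have already extracted a mapping from parameter name to shape on each side. How you obtain those mappings depends on your framework and is not shown. The function names and the example parameter names are hypothetical.

```python
"""Sketch: compare parameter names and shapes between a checkpoint and new model code."""


def compare_parameters(checkpoint_shapes, model_shapes):
    """Report names present on only one side and names whose shapes differ."""
    saved, current = set(checkpoint_shapes), set(model_shapes)
    return {
        "only_in_checkpoint": sorted(saved - current),
        "only_in_model": sorted(current - saved),
        "shape_changed": sorted(
            name for name in saved & current
            if checkpoint_shapes[name] != model_shapes[name]
        ),
    }


def is_compatible(report):
    """A checkpoint is treated as loadable only when every category is empty."""
    return not any(report.values())


if __name__ == "__main__":
    old = {"encoder.weight": (128, 64), "head.weight": (10, 128)}
    new = {"encoder.weight": (256, 64), "head.weight": (10, 128), "head.bias": (10,)}
    report = compare_parameters(old, new)
    print(report, is_compatible(report))
```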
Step 9 - Verify that a multi-GPU experiment works
This step is similar to Step 8, except that it introduces distributed training. This step only applies if you have multiple GPUs and want to use distributed training.
Configure your system the same as Step 7:
Set resources.slots_per_trial to a number greater than 1. For example:
resources:
  slots_per_trial: 2
Create your experiment:
det experiment create myconfig.yaml my_model_dir -f
Diagnose failures:
If you are using the determined library APIs correctly, distributed training should work without error. Otherwise, common problems might be:
If your experiment is not being scheduled on the cluster, ensure that the slots_per_trial setting is valid for your cluster. For example:
If you have four HPE Machine Learning Development Environment agents running with four GPUs each, your slots_per_trial could be 1, 2, 3, or 4, which fits on a single machine.
A slots_per_trial value of 8, 12, or 16 completely utilizes a number of agent machines.
A slots_per_trial value of 5 implies more than one agent, but it is not a multiple of the agent size, so this is an invalid case.
A slots_per_trial value of 32 is too large for the cluster and is also an invalid case.
Ensure that there are no other notebooks, shells, or experiments on the cluster that might consume too many resources and prevent the experiment from starting.
HPE Machine Learning Development Environment is designed to control the details of distributed training for you. If you also try to control those details, such as by calling tf.config.set_visible_devices() in a TFKerasTrial, it is likely to cause issues.
Some classes of metrics must be specially calculated during distributed training. Most metrics, such as loss or accuracy, can be calculated piecemeal on each worker in a distributed training job and averaged afterward. Those metrics are handled automatically by HPE Machine Learning Development Environment and do not need special handling. Other metrics, such as F1 score, cannot be averaged from individual worker F1 scores. HPE Machine Learning Development Environment has tooling for handling these metrics. See the documentation for using custom metric reducers with PyTorch.