DeepSpeed Autotune: User Guide#
Getting the most out of DeepSpeed (DS) requires aligning the many DS parameters with the specific properties of your hardware and model. Determined AI’s DeepSpeed Autotune (dsat) helps you optimize these settings through an easy-to-use API that requires very few changes to user code, as described in the remainder of this user guide.
dsat can be used with DeepSpeedTrial, Core API, and the HuggingFace Trainer.
How it Works#
You do not need to create a special configuration file to use
dsat. Assuming you have DeepSpeed
code which already functions, autotuning is as easy as inserting one or two helper functions into
your code and modifying the launch command.
For instance, let’s say your directory contains DeepSpeed code and a corresponding experiment configuration file deepspeed.yaml. Then, after inserting a line or two of dsat-specific code per the instructions in the following sections, launching the experiments is as easy as replacing the usual experiment-launching command:

```bash
det experiment create deepspeed.yaml .
```

with

```bash
python3 -m determined.pytorch.dsat asha deepspeed.yaml .
```
The above uses Determined AI’s DeepSpeed Autotune with the
asha algorithm, one of three
available search methods:
asha: Adaptively searches over randomly selected DeepSpeed configurations, allocating more compute resources to well-performing configurations. See this introduction to ASHA for more details.
binary: Performs a simple binary search over the batch size for randomly-generated DS configurations.
random: Conducts a search over random DeepSpeed configurations with an aggressive early-stopping criterion based on domain knowledge of DeepSpeed and the search history.
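To use a different search method, swap the corresponding subcommand into the launch command, for example:

```bash
python3 -m determined.pytorch.dsat binary deepspeed.yaml .
```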
DeepSpeed Autotune is built on top of Custom Searcher (see Custom Search Methods) which starts up two separate experiments:
single Search Runner Experiment: This experiment coordinates and schedules the trials that run the model code.
custom Experiment: This experiment contains the trials referenced above, whose results are reported back to the search runner.
Initially, a profiling trial is created to gather information regarding the model and computational resources. The search runner experiment takes this initial profiling information and creates a series of trials to search for the DS settings which optimize throughput (samples/second) or latency timing information. The results of all such trials can be viewed in the custom experiment above. The search is informed both by the initial profiling trial and the results of each subsequent trial, all of which are fed back to the search runner.
Determined’s DeepSpeed Autotune is not compatible with pipeline or model parallelism. The to-be-trained model must be a DeepSpeedEngine instance (not a pipeline-parallel engine).
User Code Changes#
Whether you are using DeepSpeedTrial, Core API, or the HuggingFace Trainer, specific changes must be made to your user code. In the following sections, we describe specific use cases and the changes needed for each.
To use Determined’s DeepSpeed Autotune with DeepSpeedTrial, you must meet the following requirements.
First, it is assumed that a base DeepSpeed configuration exists in a file (written following the DeepSpeed documentation). We then require that your yaml experiment configuration points to the location of that file through a deepspeed_config key in its hyperparameters section. For example, if your default DeepSpeed configuration is stored in ds_config.json at the top level of your model directory, your hyperparameters section should include:
```yaml
hyperparameters:
  deepspeed_config: ds_config.json
```
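For reference, a minimal base DeepSpeed configuration file might look like the following sketch (the field values here are illustrative; consult the DeepSpeed configuration documentation for the full set of options):

```json
{
    "train_batch_size": 16,
    "fp16": {
        "enabled": true
    },
    "zero_optimization": {
        "stage": 1
    }
}
```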
Second, your DeepSpeedTrial code must use our get_ds_config_from_hparams() helper function to get the DeepSpeed configuration dictionary which is generated by DeepSpeed Autotune for each trial. These dictionaries are generated by overwriting certain fields in the base DeepSpeed configuration referenced in the step above. The returned dictionary can then be passed to deepspeed.initialize as usual:
```python
import deepspeed

from determined.pytorch import dsat
from determined.pytorch.deepspeed import DeepSpeedTrial, DeepSpeedTrialContext


class MyDeepSpeedTrial(DeepSpeedTrial):
    def __init__(self, context: DeepSpeedTrialContext) -> None:
        self.context = context
        self.hparams = self.context.get_hparams()
        # The DS config dictionary generated by DeepSpeed Autotune for this trial.
        config = dsat.get_ds_config_from_hparams(self.hparams)
        model = ...
        model_parameters = ...
        model_engine, optimizer, train_loader, lr_scheduler = deepspeed.initialize(
            model=model, model_parameters=model_parameters, config=config
        )
```
Using Determined’s DeepSpeed Autotune with a DeepSpeedTrial instance requires no further changes to your code.
For a complete example of how to use DeepSpeed Autotune with DeepSpeedTrial, visit the Determined GitHub Repo and navigate to the corresponding example.
To find out more about DeepSpeedTrial, visit the Usage Guide.
When using DeepSpeed Autotune with a Core API experiment, one additional change is needed beyond the steps in the DeepSpeedTrial section above. The forward, backward, and step methods of the DeepSpeedEngine class need to be wrapped in the dsat_reporting_context() context manager. This addition ensures that the autotuning metrics from each trial are captured and reported back to the Determined master.
Here is an example sketch of
dsat code with Core API:
```python
for op in core_context.searcher.operations():
    for (inputs, labels) in trainloader:
        with dsat.dsat_reporting_context(core_context, op):  # <-- The new code
            outputs = model_engine(inputs)
            loss = criterion(outputs, labels)
            model_engine.backward(loss)
            model_engine.step()
```
In this code snippet, core_context is the Context instance returned by determined.core.init(). The context manager requires access to both core_context and the current SearcherOperation instance (op above) in order to appropriately report results. Outside of a dsat experiment, dsat_reporting_context is a no-op, so there is no need to remove the context manager after the dsat trials have completed.
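For context, here is a minimal sketch of how the pieces above fit together (model_engine, trainloader, and criterion are assumed to be defined elsewhere in your training script):

```python
import determined as det
from determined.pytorch import dsat

with det.core.init() as core_context:
    for op in core_context.searcher.operations():
        for (inputs, labels) in trainloader:
            # dsat_reporting_context needs both core_context and the current
            # searcher operation in order to report results appropriately.
            with dsat.dsat_reporting_context(core_context, op):
                outputs = model_engine(inputs)
                loss = criterion(outputs, labels)
                model_engine.backward(loss)
                model_engine.step()
        # A full Core API script would also report progress and completion on
        # `op`; see the Core API documentation for the complete pattern.
```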
For a complete example of how to use DeepSpeed Autotune with Core API, visit the Determined GitHub Repo and navigate to the corresponding example.
You can also use Determined’s DeepSpeed Autotune with the HuggingFace (HF) Trainer and Determined’s
DetCallback callback object to optimize your DeepSpeed parameters.
Similar to the previous case (Core API), you need to add a deepspeed_config field to the hyperparameters section of your experiment configuration file, specifying the relative path to the json config file.
Reporting results back to the Determined master requires both the dsat_reporting_context() context manager and the DetCallback callback object. Since dsat performs a search over different batch sizes and HuggingFace expects these parameters to be specified as command-line arguments, an additional helper function, get_hf_args_with_overwrites(), is needed to create consistent command-line arguments.
Here is an example code snippet from a HuggingFace Trainer script that contains key pieces of relevant code:
```python
import sys

from determined.pytorch import dsat
from determined.transformers import DetCallback
from transformers import HfArgumentParser, Trainer, TrainingArguments

hparams = self.context.get_hparams()
parser = HfArgumentParser(TrainingArguments)
args = sys.argv[1:]
# Overwrite the relevant command-line arguments with the dsat-generated values.
args = dsat.get_hf_args_with_overwrites(args, hparams)
(training_args,) = parser.parse_args_into_dataclasses(args, look_for_args_file=False)

det_callback = DetCallback(core_context, ...)
trainer = Trainer(args=training_args, ...)
with dsat.dsat_reporting_context(core_context, op=det_callback.current_op):
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
```
In this snippet, the dsat_reporting_context context manager shares the same initial searcher operation as the DetCallback instance through its op=det_callback.current_op argument, and the train method of the HuggingFace trainer is wrapped in the dsat_reporting_context context manager.
To find examples that use DeepSpeed Autotune with the HuggingFace Trainer, visit the Determined GitHub Repo.
The command-line entrypoint to dsat has various available options, some of them search-algorithm-specific. All available options for any given search method can be found through its --help flag:

```bash
python3 -m determined.pytorch.dsat asha --help
```

and similarly for the binary and random search methods. Flags that are particularly important are detailed below.
The following options are available for every search method.
--max-trials: The maximum number of trials to run. Default:
--max-concurrent-trials: The maximum number of trials that can run concurrently. Default:
--max-slots: The maximum number of slots that can be used concurrently. Defaults to
None, i.e., there is no limit by default.
--metric: The metric to be optimized. Defaults to
FLOPS-per-gpu. Other available options are
--run-full-experiment: If specified, after the dsat experiment has completed, a single experiment will be launched using the specifications in deepspeed.yaml, overwritten with the best-found DS configuration parameters.
--zero-stages: This flag allows the user to limit the search to a subset of the stages by providing a space-separated list, as in
--zero-stages 2 3. Default:
1 2 3.
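For example, the following command (the flag values are illustrative, not recommendations) combines several of the general options above:

```bash
python3 -m determined.pytorch.dsat asha deepspeed.yaml . --max-trials 64 --zero-stages 2 3 --run-full-experiment
```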
The asha search algorithm randomly generates various DeepSpeed configurations and attempts to tune the batch size for each configuration through a binary search. asha adaptively allocates resources to explore each configuration (providing more resources to promising lineages), where the resource is the number of steps taken in each binary search (i.e., the number of trials).
asha can be configured with the following flags:
--max-rungs: The maximum total number of rungs to use in the ASHA algorithm. Larger values allow for longer binary searches. Default:
--min-binary-search-trials: The minimum number of trials to use for each binary search. The r parameter in the ASHA paper. Default:
--divisor: Factor controlling the increased computational allotment across rungs, and the decrease in their population size. The eta parameter in the ASHA paper. Default:
--search-range-factor: The inclusive, initial hi bound on the binary search is set by an approximate computation (the lo bound is always initialized to 1). This parameter adjusts the hi bound by the given factor.
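As another illustrative sketch (again with arbitrary values), the asha-specific flags can be supplied alongside the general options:

```bash
python3 -m determined.pytorch.dsat asha deepspeed.yaml . --max-rungs 4 --min-binary-search-trials 2 --divisor 2
```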
The binary search algorithm performs a straightforward search over the batch size for a collection of randomly drawn DS configurations. A single option is available for this search: --search-range-factor, which plays precisely the same role as in the asha options above.
The random search algorithm performs a search over randomly drawn DS configurations and uses a semi-random search over the batch size.
random can be configured with the following flags:
--trials-per-random-config: The maximum number of trials that will be used to test batch sizes for a given DS configuration. Default:
--early-stopping: If provided, the experiment will terminate if a new best configuration has not been found in the last --early-stopping trials. Defaults to None, corresponding to no such early stopping.
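An illustrative random invocation (values again arbitrary) might look like:

```bash
python3 -m determined.pytorch.dsat random deepspeed.yaml . --trials-per-random-config 5 --early-stopping 10
```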