Usage Guide#
This usage guide introduces DeepSpeed and guides you through how to train a PyTorch model with the
DeepSpeed engine. To implement DeepSpeedTrial
, you need to
overwrite specific functions corresponding to common training aspects. It is helpful to work from a
skeleton trial to keep track of what is required, as the following example template shows:
from typing import Any, Dict, Iterator, Optional, Union
from attrdict import AttrDict
import torch
import deepspeed
import determined.pytorch import DataLoader, TorchData
from determined.pytorch.deepspeed import DeepSpeedTrial, DeepSpeedTrialContext
class MyTrial(DeepSpeedTrial):
def __init__(self, context: DeepSpeedTrialContext) -> None:
self.context = context
self.args = AttrDict(self.context.get_hparams())
def build_training_data_loader(self) -> DataLoader:
return DataLoader()
def build_validation_data_loader(self) -> DataLoader:
return DataLoader()
def train_batch(
self,
dataloader_iter: Optional[Iterator[TorchData]],
epoch_idx: int,
batch_idx: int,
) -> Union[torch.Tensor, Dict[str, Any]]:
return {}
def evaluate_batch(
self, dataloader_iter: Optional[Iterator[TorchData]], batch_idx: int
) -> Dict[str, Any]:
return {}
The DeepSpeed API organizes training routines into common steps like creating the data loaders and training and evaluating the model. The provided template shows the function signatures, including the expected return types, for these methods.
Because DeepSpeed is built on top of PyTorch, there are many similarities between the API for
PyTorchTrial
and DeepSpeedTrial
.
The following steps show you how to implement each of the
DeepSpeedTrial
methods beginning with training objects
initialization.
Step 1- Configure and Initialize Training Objects#
DeepSpeed training initialization consists of two steps:
Initialize the distributed backend.
Create the DeepSpeed model engine.
Refer to the DeepSpeed Getting Started guide for more information.
Outside of Determined, this is typically done in the following way:
deepspeed.init_distributed(dist_backend=args.backend)
net = ...
model_engine, optimizer, lr_scheduler, _ = deepspeed.initialize(args=args, net=net, ...)
DeepSpeedTrial
automatically initializes the distributed
training backend so all you need to do is initialize the model engine and other training objects in
the DeepSpeedTrial
__init__()
method.
Configuration#
DeepSpeed behavior during training is configured by passing arguments when initializing the model engine. This can be done in two ways:
Using a configuration file specified as an argument with a field named
deepspeed_config
.Using a dictionary, which is passed in directly when initializing a model engine.
Both approaches can be used in combination with the Determined experiment configuration. See the DeepSpeed documentation for more information on what can be specified in the configuration.
If you want to use a DeepSpeed configuration file, the hyperparameters section can be used as
arguments to pass to deepspeed.initialize
. For example, if the DeepSpeed configuration file is
named ds_config.json
, the hyperparameter section of the Determined experiment configuration is:
hyperparameters:
deepspeed_config: ds_config.json
...
If you want to overwrite some values in an existing DeepSpeed configuration file, use
overwrite_deepspeed_config()
and an experiment configuration
similar to:
hyperparameters:
deepspeed_config: ds_config.json
overwrite_deepspeed_args:
train_batch_size: 16
optimizer:
params:
lr: 0.005
...
If you want to use a dictionary directly, specify a DeepSpeed configuration dictionary in the hyperparameters section:
hyperparameters:
optimizer:
type: Adam
params:
betas:
- 0.8
- 0.999
eps: 1.0e-08
lr: 0.001
weight_decay: 3.0e-07
train_batch_size: 16
zero_optimization:
stage: 0
allgather_bucket_size: 50000000
allgather_partitions: true
contiguous_gradients: true
cpu_offload: false
overlap_comm: true
reduce_bucket_size: 50000000
reduce_scatter: true
Initialization#
After configuration, you can initialize the model engine in the
DeepSpeedTrial
. The following example corresponds to the
experiment configuration above, with a field in the hyperparameters
section named
overwrite_deepspeed_args
.
class MyTrial(DeepSpeedTrial):
def __init__(self, context: DeepSpeedTrialContext) -> None:
self.context = context
self.args = AttrDict(self.context.get_hparams())
model = Net(self.args)
ds_config = overwrite_deepspeed_config(
self.args.deepspeed_config, self.args.get("overwrite_deepspeed_args", {})
)
parameters = filter(lambda p: p.requires_grad, model.parameters())
model_engine, __, __, __ = deepspeed.initialize(
model=model, model_parameters=parameters, config=ds_config
)
self.model_engine = self.context.wrap_model_engine(model_engine)
After the model engine is initialized, you need to register it with Determined by calling
wrap_model_engine()
. Differing from
PyTorchTrial, you do not need to register the optimizer or learning rate scheduler with Determined
because both are attributes of the model engine.
If you want to use pipeline parallelism with a given model, pass layers of the model for partitioning to the DeepSpeed PipelineModule before creating the pipeline model engine:
net = ...
net = deepspeed.PipelineModule(
layers=get_layers(net),
loss_fn=torch.nn.CrossEntropyLoss(),
num_stages=args.pipeline_parallel_size,
...,
)
Step 2 - Load Data#
The next step is to build the data loader used for training and validation. The same process is used to download the data for PyTorchTrial. Building the data loaders is also similar, except for the batch size specification for the returned data loaders, which differs because the DeepSpeed model engines automatically handle gradient aggregation.
Automatic gradient aggregation in DeepSpeed is specified in configuration fields:
train_batch_size
train_micro_batch_size
gradient_accumulation_steps
which are related as follows:
train_batch_size = train_micro_batch_size * gradient_accumulation_steps * data_parallel_size,
where data_parallel_size
is the number of model replicas across all GPUs used during training.
Therefore, a single train batch consists of multiple micro batches, specified by the
gradient_accumulation_steps
argument. Given a model parallelization scheme, you can specify two
fields and the third can be inferred.
The DeepSpeed model engines assume the model is processing micro batches and automatically handle
stepping the optimizer and learning rate scheduler every gradient_accumulation_steps
micro
batches. This means that the build_training_data_loader
should return batches of size
train_micro_batch_size_per_gpu
. In most cases, build_validation_data_loader
also returns
batches of size train_micro_batch_size_per_gpu
.
If you want exact epoch boundaries to be respected, the number of micro batches in the training data
loader should be divisible by gradient_accumulation_steps
.
If you are using pipeline parallelism, the validation data loader needs to have at least
gradient_accumulation_steps
worth of batches.
Step 3 - Training and Evaluation#
This step covers the training and evaluation routine for the standard data parallel model engine and the pipeline parallel engine available in DeepSpeed.
After you create the DeepSpeed model engine and data loaders, define the training and evaluation
routines for the DeepSpeedTrial
. Differing from
PyTorchTrial
,
train_batch()
and
evaluate_batch()
take an iterator over the
corresponding data loader built from
build_training_data_loader()
and
build_validation_dataloader()
instead of a batch.
Data Parallel Training#
For data parallel training, only, the training and evaluation routines are:
def train_batch(
self,
dataloader_iter: Optional[Iterator[TorchData]],
epoch_idx: int,
batch_idx: int,
) -> Union[torch.Tensor, Dict[str, Any]]:
inputs = self.context.to_device(next(dataloader_iter))
loss = self.model_engine(inputs)
self.model_engine.backward(loss)
self.model_engine.step()
return {"loss": loss}
def evaluate_batch(
self, dataloader_iter: Optional[Iterator[TorchData]], batch_idx: int
) -> Dict[str, Any]:
inputs = self.context.to_device(next(dataloader_iter))
loss = self.model_engine(inputs)
return {"loss": loss}
You need to manually get a batch from the iterator and move it to the GPU using the provided
to_device()
helper function, which knows
the GPU assigned to a given distributed training process.
Pipeline Parallel Training#
When using pipeline parallelism, the forward and backward steps during training are combined into a
single function call because DeepSpeed automatically interleaves multiple micro batches for
processing in a single training step. In this case, there is no need to manually get a batch from
the dataloader_iter
iterator because the pipeline model engine requests it as needed while
interleaving micro batches:
def train_batch(
self,
dataloader_iter: Optional[Iterator[TorchData]],
epoch_idx: int,
batch_idx: int,
) -> Union[torch.Tensor, Dict[str, Any]]:
loss = self.model_engine.train_batch()
return {"loss": loss}
def evaluate_batch(
self, dataloader_iter: Optional[Iterator[TorchData]], batch_idx: int
) -> Dict[str, Any]:
loss = self.model_engine.eval_batch()
return {"loss": loss}
Fully Sharded Data Parallelism (FSDP)#
To use FSDP, use the PyTorch FSDP package as usual along with the Core API. To see an example of how this works, visit the FSDP + Core API for LLM Training example.
Note
PytorchTrial
API does not support FSDP.
For more info about the PyTorch FSDP package, visit PyTorch tutorials > Getting Started with Fully Sharded Data Parallel (FSDP).
Profiling#
Deepspeed experiments can be profiled using PyTorch Profiler and results will automatically be
uploaded to TensorBoard (accessible via the Determined UI). To configure profiling, call
set_profiler()
on the
DeepSpeedTrialContext
class in the trial’s __init__
method.
set_profiler()
is a thin wrapper around PyTorch profiler, torch-tb-profiler. It overrides the
on_trace_ready
parameter to the Determined TensorBoard path, while all other arguments are
passed directly into torch.profiler.profile
. Stepping the profiler will be handled automatically
during the training loop.
See the PyTorch profiler plugin for details.
The snippet below will profile GPU and CPU usage, skipping batch 1, warming up on batch 2, and profiling batches 3 and 4.
class MyTrial(DeepSpeedTrial):
def __init__(self, context: DeepSpeedTrialContext) -> None:
self.context = context
...
self.context.set_profiler(
activities=[
torch.profiler.ProfilerActivity.CPU,
torch.profiler.ProfilerActivity.CUDA,
],
schedule=torch.profiler.schedule(
wait=1,
warmup=1,
active=2
),
)
Note
Though configuring a profiling schedule torch.profiler.schedule
is optional, profiling every
batch may cause a large amount of data to be uploaded to TensorBoard. This may result in long
rendering times for TensorBoard and memory issues. For long-running experiments, it is
recommended to configure a profiling schedule.
Known DeepSpeed Constraints#
Some DeepSpeed constraints are inherited concerning supported feature compatibility:
Pipeline parallelism can only be combined with Zero Redundancy Optimizer (ZeRO) stage 1.
Parameter offloading is only supported with ZeRO stage 3.
Optimizer offloading is only supported with ZeRO stage 2 and 3.