Keras API#

In this guide, you’ll learn how to use Determined’s keras.DeterminedCallback while training your Keras model.

For full details, see the det.keras API Reference.

This document guides you through training a Keras model in Determined. You will need to update your model.fit() call to include a DeterminedCallback and submit it to a Determined cluster.

To learn about this API, you can start by reading the train.py script in the Iris categorization example.

Configure Entrypoint#

Determined requires you to launch training jobs by submitting them with an experiment configuration (see the Experiment Configuration Reference), which tells the Determined master how to start your container. For Keras training, you should always wrap your training script in Determined’s TensorFlow launcher:

entrypoint: >-
  python3 -m determined.launch.tensorflow --
  python3 my_train.py --my-arg...

Determined’s TensorFlow launcher will automatically configure your training script with the right TF_CONFIG environment variable for distributed training when distributed resources are available, and will safely do nothing when they are not.
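For orientation, TF_CONFIG is a JSON document that TensorFlow reads to learn the cluster layout and the role of the current process. The sketch below parses a hand-written example using only the standard library. The addresses and the helper name describe_tf_config are invented for illustration, and the exact value the launcher sets on your cluster may differ.

```python
import json
import os


def describe_tf_config(env=None):
    """Return (task_type, task_index, num_workers) from TF_CONFIG, or None if unset.

    Hypothetical helper for illustration; not part of Determined or TensorFlow.
    """
    env = os.environ if env is None else env
    raw = env.get("TF_CONFIG")
    if not raw:
        # No TF_CONFIG means a single-process (non-distributed) run.
        return None
    config = json.loads(raw)
    task = config.get("task", {})
    workers = config.get("cluster", {}).get("worker", [])
    return task.get("type"), task.get("index"), len(workers)


# Example value in the format TensorFlow documents for multi-worker training.
# The addresses are made up for this sketch.
example_env = {
    "TF_CONFIG": json.dumps(
        {
            "cluster": {"worker": ["10.0.0.1:12345", "10.0.0.2:12345"]},
            "task": {"type": "worker", "index": 1},
        }
    )
}

print(describe_tf_config(example_env))  # ('worker', 1, 2)
print(describe_tf_config({}))           # None
```

Because the launcher sets this variable for you, your training script normally never needs to parse it; the sketch only shows what kind of information it carries.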

Obtain a det.core.Context and a tf.distribute.Strategy#

When using distributed training, TensorFlow requires you to create your Strategy early in the process lifetime, before creating your model.

Since you wrapped your training script in Determined’s TensorFlow launcher, you can use Determined’s core.DistributedContext.from_tf_config() helper, which creates both a suitable DistributedContext and a Strategy for your training job’s environment. Pass that DistributedContext to det.core.init() to get a core.Context, then pass the core.Context and the Strategy to your main() function (or equivalent) in your training script:

import determined as det

if __name__ == "__main__":
    # Build the DistributedContext and tf.distribute.Strategy from TF_CONFIG.
    distributed, strategy = det.core.DistributedContext.from_tf_config()
    with det.core.init(distributed=distributed) as core_context:
        main(core_context, strategy)

Build the Model#

Building a distributed-capable model is easy in Keras: wrap your model building and compiling in strategy.scope(). See the TensorFlow documentation for more details.

def main(core_context, strategy):
    with strategy.scope():
        model = my_build_model()
        model.compile(...)

Create the DeterminedCallback#

The DeterminedCallback automatically integrates your training with the Determined cluster. It reports both train and test metrics, reports progress, saves checkpoints, and uploads them to checkpoint storage. Additionally, it manages preemption signals from the Determined master (for example, when you pause your experiment), gracefully halting training and later resuming from where it left off.

The DeterminedCallback has only three required inputs:
  • the core_context you already created
  • a checkpoint UUID to start training from, or None
  • a continue_id used to decide how to treat the checkpoint

In training jobs, an easy value for checkpoint is det.get_cluster_info().latest_checkpoint, which will automatically be populated with the latest checkpoint saved by this trial, or None. If, for example, you wanted to start training from a checkpoint and support pausing and resuming, you could use info.latest_checkpoint or my_starting_checkpoint.
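The fallback expression above relies on Python's `or` returning its first truthy operand. A minimal sketch of that selection follows; the function name pick_checkpoint and the UUID-like strings are hypothetical, chosen only to show the precedence.

```python
from typing import Optional


def pick_checkpoint(
    latest_checkpoint: Optional[str], my_starting_checkpoint: Optional[str]
) -> Optional[str]:
    # Prefer the checkpoint this trial saved itself (resuming after a pause);
    # otherwise fall back to the user-chosen starting checkpoint, if any.
    return latest_checkpoint or my_starting_checkpoint


# First run of the trial: nothing saved yet, so the starting checkpoint is used.
print(pick_checkpoint(None, "start-uuid"))          # start-uuid
# After a pause and resume: the trial's own latest checkpoint wins.
print(pick_checkpoint("latest-uuid", "start-uuid"))  # latest-uuid
# No checkpoint at all: training starts from scratch.
print(pick_checkpoint(None, None))                   # None
```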

The continue_id helps the DeterminedCallback decide whether the provided checkpoint represents just the starting weights, in which case training should begin at epoch=0, or a partially complete training run that should pick up where it left off (at epoch > 0). The provided continue_id is saved along with every checkpoint. When the starting checkpoint is loaded, if the continue_id matches the one stored in the checkpoint, the training state is also loaded from the checkpoint. In training jobs, an easy value for continue_id is det.get_cluster_info().trial.trial_id.
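The decision described above can be pictured with a small stand-alone sketch. This is not Determined's implementation: the function resolve_start and the metadata keys "continue_id" and "epoch" are assumptions made up for illustration, and the real callback may store and compare this state differently.

```python
from typing import Any, Dict, Optional, Tuple


def resolve_start(
    checkpoint_metadata: Optional[Dict[str, Any]], continue_id: Any
) -> Tuple[bool, int]:
    """Return (load_weights, initial_epoch) for a hypothetical checkpoint record."""
    if checkpoint_metadata is None:
        # No checkpoint provided: fresh weights, start at epoch 0.
        return False, 0
    if checkpoint_metadata.get("continue_id") == continue_id:
        # Same continue_id: this is our own partial run, so resume its epoch.
        return True, checkpoint_metadata.get("epoch", 0)
    # Different continue_id: use the weights only and restart at epoch 0.
    return True, 0


saved = {"continue_id": 17, "epoch": 5}
print(resolve_start(None, 17))   # (False, 0)
print(resolve_start(saved, 17))  # (True, 5)
print(resolve_start(saved, 42))  # (True, 0)
```

Using the trial ID as the continue_id gives exactly this behavior: a paused trial resumes its own progress, while a new trial started from the same checkpoint begins again at epoch 0.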

See the reference for DeterminedCallback for details on its optional parameters.

info = det.get_cluster_info()
assert info and info.task_type == "TRIAL", "this example only runs as a trial on the cluster"

det_cb = det.keras.DeterminedCallback(
    core_context,
    checkpoint=info.latest_checkpoint,
    continue_id=info.trial.trial_id,
)

Load Data#

Loading data is done as usual, though additional considerations may arise if your existing data-loading code is not container-ready. For more details, see Prepare Data.

If you want to take advantage of Determined’s distributed training, you may need to ensure that your input data is properly sharded so that each worker reads a distinct portion of the dataset. See the TensorFlow documentation for details.
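To make "sharded" concrete, the sketch below shows the selection rule in plain Python: worker `index` out of `num_shards` keeps every `num_shards`-th example, which is the same rule that tf.data.Dataset.shard(num_shards, index) applies. The helper name shard and the toy data are illustrative assumptions; a real pipeline would shard the tf.data dataset itself, using the worker count and index of your training job.

```python
from itertools import islice


def shard(examples, num_shards, index):
    """Return an iterator over every num_shards-th example, starting at position index."""
    if not 0 <= index < num_shards:
        raise ValueError("index must be in [0, num_shards)")
    return islice(examples, index, None, num_shards)


examples = list(range(10))
# Simulate three workers, each taking its own shard of the same dataset.
shards = [list(shard(examples, 3, i)) for i in range(3)]
print(shards)  # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```

Every example lands in exactly one shard, so no two workers train on the same record within an epoch.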

Note

To learn more about distributed training with Determined, visit the conceptual overview or the intro to implementing distributed training.

TensorBoard Integration#

Optionally, you can use Determined’s TensorBoard callback, which extends Keras’ TensorBoard callback with the ability to automatically upload metrics to Determined’s checkpoint storage. Determined’s TensorBoard callback is configured identically to Keras’ except it takes an additional core_context initial argument:

tb_cb = det.keras.TensorBoard(core_context, ...)

Then simply include it in your model.fit() as normal.

Calling model.fit()#

The only remaining step is to pass your callbacks to your model.fit():

model.fit(
    ...,
    callbacks=[det_cb, tb_cb],
)