Deploy on GCP

This page describes how Determined runs on Google Cloud Platform (GCP). For installation, see Install Determined.

Important

This deployment does not automatically provision a Google Kubernetes Engine (GKE) cluster. If you intend to use Kubernetes, refer to Set up and Manage a Google Kubernetes Engine (GKE) Cluster.

Determined uses Google Compute Engine (GCE) instances as its base unit. The cluster is managed by a master node (a single non-GPU instance), which provisions and deprovisions agent nodes (GPU instances) according to the volume of experiments currently running on the cluster.

For example, if only the master node is running, charges are incurred only for the master. When an experiment starts, the master dynamically provisions GPU instances as agents. Once the experiment concludes, the master shuts these agents down to avoid unnecessary charges.
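The scale-up and scale-down behavior described above can be illustrated with a small sketch. This is not Determined's provisioning logic; the function names, the `slots_per_agent` and `max_agents` parameters, and the hourly rates are all hypothetical and chosen only to show how the agent count, and therefore the cost, follows experiment demand.

```python
import math


def agents_needed(requested_slots: int, slots_per_agent: int, max_agents: int) -> int:
    """Return how many agent instances would cover the requested GPU slots.

    Hypothetical helper: the real master applies its own scheduling and
    provisioning rules, which are not described on this page.
    """
    if slots_per_agent <= 0:
        raise ValueError("slots_per_agent must be positive")
    if requested_slots <= 0:
        # No running experiments: only the master stays up.
        return 0
    return min(max_agents, math.ceil(requested_slots / slots_per_agent))


def hourly_cost(num_agents: int, master_rate: float, agent_rate: float) -> float:
    """Total hourly charge: the master is always billed, agents only while they exist."""
    return master_rate + num_agents * agent_rate


# Illustrative rates only (not real GCP prices).
idle = hourly_cost(agents_needed(0, 4, 5), master_rate=0.5, agent_rate=2.0)
busy = hourly_cost(agents_needed(9, 4, 5), master_rate=0.5, agent_rate=2.0)
```

With no experiments running, `idle` covers the master alone; with nine requested slots and four slots per agent, three agents are added to the bill until the work finishes.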

In addition, the master maintains experiment metadata in a dedicated database, which is accessible via the Determined WebUI or CLI. All nodes in the cluster communicate with each other internally within the Virtual Private Cloud (VPC), and the user interacts with the master via a designated port configured during installation.

The diagram below depicts a Determined cluster in GCP.

Diagram showing Determined Cloud Deployment Architecture on GCP

Following the diagram, a standard execution would be:

  1. User submits experiment to master

  2. Master creates one or more agents (depending on experiment) if they don’t exist

  3. Agent accesses required data, images, etc.

  4. Agent completes experiment and communicates completion to master

  5. Master shuts down agents that are no longer needed
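The five steps above can be traced with a toy simulation. The `ToyMaster` class, its methods, and the agent naming are hypothetical stand-ins invented for illustration; they are not part of Determined's API and only mirror the order of events in the list.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMaster:
    """Hypothetical stand-in for the master node; not Determined's API."""

    agents: set = field(default_factory=set)
    log: list = field(default_factory=list)
    next_id: int = 0

    def submit(self, experiment: str, agents_required: int) -> list:
        # Step 1: the user submits an experiment to the master.
        self.log.append(f"submitted {experiment}")
        # Step 2: the master creates agents only if not enough exist yet.
        while len(self.agents) < agents_required:
            name = f"agent-{self.next_id}"
            self.next_id += 1
            self.agents.add(name)
            self.log.append(f"provisioned {name}")
        assigned = sorted(self.agents)[:agents_required]
        # Step 3: each assigned agent accesses the data and images it needs.
        for agent in assigned:
            self.log.append(f"{agent} fetched data and images")
        return assigned

    def complete(self, experiment: str, assigned: list) -> None:
        # Step 4: the agents report completion to the master.
        self.log.append(f"{experiment} completed")
        # Step 5: the master shuts down agents that are no longer needed.
        for agent in assigned:
            self.agents.discard(agent)
            self.log.append(f"terminated {agent}")


master = ToyMaster()
assigned = master.submit("exp-1", agents_required=2)
master.complete("exp-1", assigned)
```

After `complete` returns, `master.agents` is empty again, which corresponds to the state in which only the master is running.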

There are two types of resources used to run Determined: core resources that enable the Determined platform, and periphery resources that add optional functionality. The sections below provide additional detail on these resources. You can deploy these resources in GCP by following the Install Determined guide.

Core Resources

  • Master Node: A single Google Compute Engine (GCE) instance is designated as the master. The master’s primary functions are to:

    • host the cluster’s browser-based WebUI, where users monitor their experiments

    • respond to commands from the Determined CLI installed by users locally

    • schedule experiments

    • manage other GCE instances (agents) which run experiments

  • Agent Node(s): For most Determined clusters in GCP, the volume of active experiments dictates the number of agents. All agents are managed by the master, and users need not interact with the agents directly.

  • Database: Determined uses a Cloud SQL (PostgreSQL) database to store all experiment metadata.

  • Service Account: A service account is used to manage the creation of compute (GCE) resources and access to Google Cloud Storage (GCS) buckets for checkpoints, TensorBoards, and other data storage as needed.

  • Firewall Rules: Firewall rules are set so that all nodes in the cluster can communicate with one another.

Periphery Resources

  • Network/Subnetwork: The Determined cluster can be configured inside an existing VPC or be set to create a new VPC.

  • Static IP: For production clusters, a static IP is recommended for the master; otherwise an ephemeral IP is automatically generated by GCP.

  • Google Filestore: The Determined cluster can leverage an existing Google Filestore instance (assuming it has the correct associated permissions), or the Terraform script can create a Filestore instance with the cluster.

  • Google Cloud Storage (GCS) bucket: The Determined cluster can leverage an existing GCS bucket (assuming it has the correct associated permissions), or the Terraform script can create a bucket with the cluster.
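Several periphery resources follow the same pattern: reuse an existing resource if one is supplied, otherwise create one with the cluster. A minimal sketch of that decision is shown below. The `plan_resource` function, its parameters, and the generated naming scheme are hypothetical; they do not reflect how the Terraform script names or configures resources.

```python
from typing import Optional


def plan_resource(kind: str, existing_name: Optional[str], cluster_id: str) -> dict:
    """Decide whether to reuse an existing resource or create a new one.

    Hypothetical helper for illustration only. Reusing a resource assumes
    it already has the correct associated permissions.
    """
    if existing_name:
        return {"kind": kind, "name": existing_name, "action": "reuse"}
    # Made-up naming scheme for a resource created alongside the cluster.
    return {"kind": kind, "name": f"{cluster_id}-{kind}", "action": "create"}


plans = [
    plan_resource("network", "shared-vpc", "det-demo"),  # existing VPC
    plan_resource("filestore", None, "det-demo"),        # created with the cluster
    plan_resource("bucket", "team-checkpoints", "det-demo"),
]
```

Here the VPC and bucket are reused while a Filestore instance is planned for creation, which matches the mix-and-match choice the list above describes.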