Elasticsearch-Backed Logging

Use this guide as a reference when considering a shift from the default logging backend to Elasticsearch for optimized log storage and analysis.

We’ll discuss the limitations of the default logging backend and provide tips and guidelines for migrating to Elasticsearch, including how to tune Elasticsearch to work best with HPE Machine Learning Development Environment.

Elasticsearch is a search engine commonly used for storing application logs for search and analytics. HPE Machine Learning Development Environment supports using Elasticsearch as the storage backend for task logs. Configuring HPE Machine Learning Development Environment to use Elasticsearch is simple; however, managing an Elasticsearch cluster at scale is an involved task, so this guide is recommended for users who have hit the limitations of the default logging backend.

With the default logging backend in a standard deployment using det deploy, the cluster can ingest logs about as fast as Postgres can persist them. For example, with det deploy aws using Aurora Serverless with 2 capacity units, ingestion speed maxes out around 10-15 MB/s (where the database’s CPU hits ~90%). To get a little more mileage from the default backend, we recommend increasing the capacity of the database. At a certain point, the master instance itself becomes the bottleneck, since it has limited incoming network bandwidth for the HTTP requests that deliver logs and limited resources to process them. The master instance size can be increased, but vertical scaling is likely to top out at a log throughput on the order of hundreds of megabytes per second; we recommend moving to Elasticsearch to get past that limit.
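To judge whether a workload is approaching that ceiling, a rough estimate of the aggregate log rate can help. The sketch below is only a back-of-the-envelope calculation; the function name and the example figures (container count, lines per second, average line size) are hypothetical and should be replaced with measurements from your own cluster.

```python
# Rough estimate of cluster-wide log throughput.
# All input figures below are illustrative assumptions, not measurements.

def aggregate_log_rate_mb_s(containers: int, lines_per_sec: float, avg_line_bytes: int) -> float:
    """Estimated total log rate in MB/s (1 MB = 10**6 bytes)."""
    return containers * lines_per_sec * avg_line_bytes / 1e6

# Lower end of the 10-15 MB/s range quoted above for the example Aurora setup.
DEFAULT_BACKEND_CEILING_MB_S = 10.0

rate = aggregate_log_rate_mb_s(containers=400, lines_per_sec=200, avg_line_bytes=150)
near_limit = rate >= DEFAULT_BACKEND_CEILING_MB_S

print(f"estimated rate: {rate:.1f} MB/s, at or above ceiling: {near_limit}")
```

If the estimate sits near or above the ceiling observed for your database, that is a signal to either increase database capacity or start planning the move to Elasticsearch.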

HPE Machine Learning Development Environment offers some additional recommendations for the Elasticsearch cluster configuration based on how the cluster will be used:

  • Tune the default shards per index to your expected throughput (or use index templates). HPE Machine Learning Development Environment ships logs in Logstash format, rolling over to a new index each day. Depending on your log volume, the default number of shards could be too high or too low. The general rule of thumb is not to exceed 50 GB per shard while minimizing the number of shards per index. For high-utilization clusters, this may entail increasing the shards per index and periodically rotating indices older than a few months out of the cluster, to avoid the overhead that accumulates from having too many shards. A more in-depth guide can be found here.

  • Though it may increase latency for end users, increasing the refresh interval may help increase total throughput.

  • Apply the following index template to optimize the mappings in HPE Machine Learning Development Environment log indices for ingest speed. This turns off analysis and in some cases indexing on properties for which HPE Machine Learning Development Environment does not use these features.

{
  "index_patterns": ["determined-tasklogs-*"],
  "mappings": {
    "properties": {
        "task_id": {"type": "keyword", "index": true},
        "allocation_id": {"type": "keyword", "index": true},
        "agent_id": {"type": "keyword", "index": true},
        "container_id": {"type": "keyword", "index": true},
        "level": {"type": "keyword", "index": true},
        "log": {"type": "text", "index": false},
        "message": {"type": "text", "index": false},
        "source": {"type": "keyword", "index": true},
        "stdtype": {"type": "keyword", "index": true}
    }
  }
}
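The shard and refresh-interval recommendations above can be turned into concrete template settings. The following is a minimal sketch, assuming one index per day and the 50 GB-per-shard rule of thumb; the helper names, the 120 GB daily volume, and the 30s refresh interval are illustrative assumptions, not recommended values. The number_of_shards and refresh_interval keys are standard Elasticsearch index settings.

```python
import json
import math

MAX_SHARD_GB = 50  # rule-of-thumb ceiling per shard from the guidance above

def primary_shards(daily_log_gb: float, max_shard_gb: float = MAX_SHARD_GB) -> int:
    """Fewest primary shards that keep each shard of a daily index under the ceiling."""
    return max(1, math.ceil(daily_log_gb / max_shard_gb))

def template_settings(daily_log_gb: float, refresh_interval: str = "30s") -> dict:
    """Build a template body fragment carrying shard count and refresh interval."""
    return {
        "index_patterns": ["determined-tasklogs-*"],
        "settings": {
            "number_of_shards": primary_shards(daily_log_gb),
            # Longer than Elasticsearch's 1s default: fewer refreshes, but logs
            # take longer to become searchable.
            "refresh_interval": refresh_interval,
        },
    }

# Hypothetical volume: 120 GB of task logs per day -> ceil(120 / 50) = 3 shards.
print(json.dumps(template_settings(daily_log_gb=120), indent=2))
```

The resulting "settings" object is intended to sit alongside the "mappings" object in the same index template, so that each new daily index picks up both.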

The configuration settings to enable Elasticsearch as the task log backend are described in the cluster configuration reference.