Model Engine Architecture¶
Audience: Service owners and deployment engineers installing, operating, or debugging model engine in a customer environment.
Scope: This document covers system structure, lifecycle flows, cross-cutting concerns, and component deep-dives. Configuration reference is in helm-values.md. Per-cloud behavior differences are in cloud-matrix.md.
1. System Structure¶
1.1 Architecture Overview¶
Model engine consists of five core pods and a set of external dependencies. The control plane (Gateway, Service Builder, K8s Cacher) runs in the model engine namespace. Inference pods run in a separate endpoint namespace, typically llm-engine.
┌──────────────────────────────────────────────────────────────────────────────────┐
│ Control Plane Namespace (e.g. model-engine) │
│ │
│ ┌─────────────┐ REST ┌─────────────────┐ │
│ │ Gateway │──────────▶│ Service Builder │ │
│ │ (FastAPI) │ │ (Celery worker) │ │
│ └──────┬──────┘ └────────┬─────────┘ │
│ │ read │ write K8s │
│ │ endpoint │ resources │
│ ▼ status ▼ │
│ ┌─────────────┐ ┌─────────────────┐ │
│ │ K8s Cacher │──────────▶│ Redis │ │
│ │ (Deployment)│ write │ (cache store) │ │
│ └─────────────┘ TTL 60s └─────────────────┘ │
│ │
│ ┌──────────────────┐ │
│ │ Celery Autoscaler│ (scales async endpoint workers by queue depth) │
│ │ (StatefulSet) │ │
│ └──────────────────┘ │
│ │
│ ┌──────────────┐ │
│ │ Balloon Pods │ (low-priority GPU placeholder pods, one Deployment per GPU) │
│ └──────────────┘ │
└──────────────────────────────────────────────────────────────────────────────────┘
External Dependencies
┌────────────────┐ ┌──────────────┐ ┌──────────────────────┐ ┌────────────────┐
│ PostgreSQL │ │ Redis │ │ Message Broker │ │ Object Storage │
│ (endpoint DB) │ │ (K8s cache) │ │ SQS / ASB / Redis │ │ S3 / GCS / ABS │
└────────────────┘ └──────────────┘ └──────────────────────┘ └────────────────┘
┌──────────────────────────────────────────────────────────────────────────────────┐
│ Endpoint Namespace (e.g. llm-engine) │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Sync/Streaming Endpoint (Deployment) │ │
│ │ ┌──────────────┐ ┌──────────────────────────────┐ │ │
│ │ │ HTTP Forwarder│ │ vLLM / inference process │ │ │
│ │ └──────────────┘ └──────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Async Endpoint (Deployment) │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ Celery worker (reads from SQS/ASB/Redis) │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Multi-node Endpoint (LeaderWorkerSet / LWS) │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Leader pod │ │ Worker pods │ (no HPA/KEDA) │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ └─────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────────────┘
Data flow summary:
- Endpoint creation: Client → Gateway REST → broker queue → Service Builder Celery worker → K8s API
- Sync inference: Client → Gateway → HTTP forward to inference pod → response
- Async inference: Client → Gateway → broker queue → Celery worker in inference pod → result stored → Client polls
- Streaming inference: Client → Gateway → SSE stream from inference pod
- Status reads: Gateway → Redis (written by K8s Cacher, not K8s API directly)
1.2 Kubernetes Resource Inventory¶
Resources created and managed by the helm chart (control plane):
| Resource | Kind | Notes |
|---|---|---|
| `model-engine-gateway` | Deployment | FastAPI server; replicas configured via `replicaCount.gateway` |
| `model-engine-builder` | Deployment | Celery worker for endpoint creation; replicas via `replicaCount.builder` |
| `model-engine-cacher` | Deployment | K8s cache loop; typically 1 replica (`replicaCount.cacher`) |
| `model-engine-celery-autoscaler` | StatefulSet | Scales async endpoint workers; shards via `celery_autoscaler.num_shards` |
| `model-engine-gateway` | HPA | Autoscales gateway replicas based on concurrency |
| `model-engine-config` | ConfigMap | Runtime config mounted into all control plane pods |
| `model-engine` | ServiceAccount | Used by control plane pods |
| `model-engine` | ClusterRole + ClusterRoleBinding | K8s API access for Service Builder and Cacher |
| Balloon Deployments | Deployment (one per GPU type) | Low-priority placeholder pods; see `balloons` in values |
Resources created per inference endpoint (in endpoint namespace):
| Resource | Kind | Condition |
|---|---|---|
| Inference Deployment | Deployment | All non-LWS endpoints |
| LeaderWorkerSet | LeaderWorkerSet (CRD) | Multi-node endpoints only |
| K8s Service | Service | Sync and streaming endpoints |
| HPA | HorizontalPodAutoscaler | Sync/streaming, min_workers > 0 |
| KEDA ScaledObject | ScaledObject (CRD) | Sync/streaming, min_workers == 0 |
| PodDisruptionBudget | PodDisruptionBudget | All endpoints (configurable) |
| Istio VirtualService | VirtualService | Sync/streaming, istio_enabled: true |
| Istio DestinationRule | DestinationRule | Sync/streaming, istio_enabled: true |
| Istio ServiceEntry | ServiceEntry | Multi-node + istio_enabled: true |
| SQS Queue / ASB Topic | Cloud resource | Async endpoints and all endpoints on async clouds |
HPA and KEDA are mutually exclusive
The Service Builder enforces this: when creating or updating an endpoint, it deletes the KEDA ScaledObject before creating an HPA (if min_workers > 0), or deletes the HPA before creating a KEDA ScaledObject (if min_workers == 0). Both never coexist on the same endpoint.
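The delete-then-create ordering described above can be sketched as follows. This is a minimal illustration with stand-in names (`FakeK8s`, `reconcile_autoscaler`), not the Service Builder's actual API:

```python
# Hedged sketch of the HPA/KEDA mutual-exclusion rule. The class and method
# names here are illustrative stand-ins, not the real Service Builder code.
class FakeK8s:
    """Records resource operations instead of calling the K8s API."""

    def __init__(self):
        self.calls = []

    def delete_scaled_object(self, name):
        self.calls.append(("delete_scaled_object", name))

    def create_hpa(self, name):
        self.calls.append(("create_hpa", name))

    def delete_hpa(self, name):
        self.calls.append(("delete_hpa", name))

    def create_scaled_object(self, name):
        self.calls.append(("create_scaled_object", name))


def reconcile_autoscaler(k8s, endpoint: str, min_workers: int) -> str:
    """Ensure exactly one of HPA / KEDA ScaledObject exists for an endpoint."""
    if min_workers > 0:
        k8s.delete_scaled_object(endpoint)  # drop any KEDA object first
        k8s.create_hpa(endpoint)
        return "hpa"
    k8s.delete_hpa(endpoint)                # drop any HPA first
    k8s.create_scaled_object(endpoint)      # KEDA enables scale-to-zero
    return "keda"
```

Because the stale scaler is always deleted before the new one is created, an endpoint update that flips `min_workers` across the zero boundary swaps mechanisms cleanly.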
1.3 External Dependencies and Prerequisites¶
The following must exist and be reachable from the cluster before helm install:
| Dependency | Required For | Notes |
|---|---|---|
| PostgreSQL | All operations | Endpoint metadata, bundle records, batch job records |
| Redis | Gateway routing, cacher, async metrics | Two logical roles: K8s cache and inference autoscaling metrics |
| Message broker (SQS / ASB / Redis) | Async endpoints; endpoint creation queue | Cloud-dependent; see §3.3 |
| Object storage (S3 / GCS / ABS) | LLM artifacts, fine-tune repos, batch job progress | Cloud-dependent |
| Image registry (ECR / ACR / GAR) | All image pulls | Must be mirrored from public.ecr.aws/b2z8n5q1/ in customer envs |
| Prometheus | KEDA scale-to-zero | Required if any sync endpoint uses min_workers == 0; see §3.1 |
| KEDA | Scale-to-zero | Must be installed in cluster if any endpoint uses min_workers == 0 |
| Istio | VirtualService routing, mTLS | Optional but strongly recommended; set istio_enabled: true/false to match actual state |
| NVIDIA GPU Operator | GPU inference | Required for GPU workloads; nodes must be labeled and driver-ready |
Image registry mirroring
In customer environments, all model engine images must be mirrored from the public ECR source (public.ecr.aws/b2z8n5q1/) to the customer registry before installation. The vllm_repository value defaults to a relative path that resolves to Scale's internal ECR in many deployment configurations and must be overridden. Failing to mirror is the most common silent deployment failure: endpoint creation returns HTTP 200 but the endpoint stays INITIALIZING indefinitely.
2. Lifecycle Flows¶
2.1 Generic Endpoint Creation Flow¶
The endpoint creation path is identical for all endpoint types (sync, async, streaming, multi-node). The LLM API layer (§2.3) is a higher-level wrapper that feeds into the same flow.
Client
│
│ POST /v1/model-endpoints
▼
Gateway (FastAPI)
│ Validates request, writes endpoint record to PostgreSQL (status: PENDING)
│ Enqueues Celery task to endpoint creation queue (SQS / ASB / Redis)
│
▼
Message Broker
│ Task sits in queue (SQS queue / ASB topic / Redis queue)
│
▼
Service Builder (Celery worker)
│ Dequeues task
│ Calls K8s API to create/update:
│ - Deployment or LeaderWorkerSet
│ - HPA or KEDA ScaledObject (sync/streaming, non-LWS only)
│ - K8s Service (sync/streaming only)
│ - Istio VirtualService + DestinationRule (if istio_enabled, non-LWS)
│ - Istio ServiceEntry (if istio_enabled, LWS only)
│ - PodDisruptionBudget
│ Updates endpoint record in PostgreSQL (status: INITIALIZING → READY)
│
▼
K8s Cacher (background loop, every 15s)
│ Reads endpoint state from K8s API
│ Writes to Redis with 60s TTL
│
▼
Gateway
│ Reads endpoint status from Redis (not K8s API directly)
│ Returns status to client via GET /v1/model-endpoints/{id}
Timing constraints:
- The Celery task has a 30-minute hard timeout. Endpoint creation that exceeds this ceiling (e.g., very large image pulls on cold nodes) will fail with no retry, and the endpoint will be stuck INITIALIZING.
- The K8s Cacher runs on a 15-second poll cycle. After the Service Builder marks an endpoint READY in PostgreSQL, there is a brief window (up to 15s) before the Gateway's Redis cache reflects the new state. During this window, status reads may lag.
Celery task timeout is a hard ceiling
The 30-minute Celery task timeout applies to the entire endpoint creation operation, including image pull time. For large model images on cold nodes, image pull alone can approach this limit. Plan capacity accordingly and ensure balloon pods keep GPU nodes warm so image pulls start quickly.
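A client waiting on endpoint creation should account for both constraints: poll up to slightly past the 30-minute ceiling, and treat a briefly stale status as normal. The sketch below is a hedged client-side illustration; `get_status` is an injected callable (e.g., wrapping `GET /v1/model-endpoints/{id}`), and the default numbers are assumptions, not shipped values:

```python
import time

# Hedged polling sketch. timeout_s is set a little above the 30-minute
# (1800 s) Celery build ceiling to absorb the cacher's up-to-15 s status lag.
def wait_until_ready(get_status, timeout_s=1900, poll_s=5, sleep=time.sleep):
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status()
        if status == "READY":
            return True
        if status == "FAILED":
            raise RuntimeError("endpoint build failed")
        sleep(poll_s)  # PENDING / INITIALIZING / stale cache: keep polling
    return False  # ceiling exceeded; endpoint is likely stuck INITIALIZING
```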
2.2 Inference Flows¶
Synchronous Inference¶
Client
│ POST /v1/model-endpoints/{id}/predict
▼
Gateway
│ Looks up endpoint URL from Redis cache
│ HTTP POST directly to inference pod's HTTP forwarder
▼
Inference Pod (HTTP Forwarder + vLLM / model process)
│ Processes request, returns response
▼
Gateway → Client (response forwarded synchronously)
The Gateway does not queue synchronous requests. The inference pod must be reachable at the time of the request. If the pod is not yet ready or has been evicted, the client receives an error immediately.
Asynchronous Inference¶
Client
│ POST /v1/model-endpoints/{id}/predict (async endpoint)
▼
Gateway
│ Enqueues Celery task to inference queue
│ (per-endpoint SQS queue / ASB topic / Redis queue)
│ Returns task_id immediately (HTTP 200)
▼
Message Broker (per-endpoint queue)
│
▼
Celery Worker (inside inference pod)
│ Dequeues task
│ Runs inference
│ Stores result in Celery result backend (Redis / SQS)
▼
Client polls GET /v1/tasks/{task_id}
│
▼
Gateway
│ Reads task result from Celery result backend
│ Returns status: PENDING / SUCCESS / FAILURE
Each async endpoint has its own dedicated queue: one SQS queue per endpoint on AWS, one ASB topic per endpoint on Azure. The Celery Autoscaler monitors queue depth and scales the Deployment's replica count accordingly (see §3.1).
Streaming Inference¶
Streaming follows the same routing path as synchronous inference. The Gateway establishes a Server-Sent Events (SSE) connection to the inference pod and streams response chunks back to the client as they arrive. The inference pod must support streaming — vLLM does natively via its /v1/chat/completions and /v1/completions endpoints with stream=true.
2.3 LLM API Layer¶
Model engine exposes two API surfaces for LLM inference:
| API Surface | Routes | Description |
|---|---|---|
| Generic endpoint API | `GET/POST /v1/model-endpoints`, `/v1/model-endpoints/{id}/predict` | Low-level; caller specifies image, resources, and all parameters explicitly |
| LLM endpoint API v1 | `/v1/llms/...` | Higher-level; opinionated defaults, auto-selects vLLM image and hardware |
| LLM endpoint API v2 | `/v2/...` | OpenAI-compatible; same infrastructure as v1 LLM API |
v1 vs v2:
- v1 (`/v1/llms/...`): Model engine's native LLM API. Returns model engine response format.
- v2 (`/v2/...`): OpenAI-compatible API. Accepts and returns the same request/response format as OpenAI's API, including `stream=true` for SSE streaming. Pydantic models are generated from OpenAI's official OpenAPI spec. Endpoints: `POST /v2/chat/completions`, `POST /v2/completions`.
How LLM endpoints use Service Builder:
The LLM endpoint API (LiveLLMModelEndpointService) is a thin wrapper over the generic LiveModelEndpointService. When a client calls POST /v1/llms to create an LLM endpoint, the service translates a CreateLLMModelEndpointV1Request into a CreateModelEndpointV1Request with opinionated defaults — vLLM image from vllm_repository, resource sizing from recommendedHardware, GPU type selection — and then delegates to the same Service Builder queue path described in §2.1. There is no separate infrastructure for LLM endpoints. They are regular model endpoints with a curated configuration. All failure modes from §2.1 apply equally.
recommendedHardware auto-selection:
The recommendedHardware helm value contains a lookup table keyed by GPU memory requirement (byGpuMemoryGb) and by model name (byModelName). When an LLM endpoint is created without explicit resource specifications, the service queries this table to select GPU type, GPU count, CPU, memory, storage, and nodes_per_worker. When nodes_per_worker > 1, the service creates a multi-node (LWS) endpoint instead of a regular Deployment. See §3.4 for details.
3. Cross-cutting Concerns¶
3.1 Autoscaling¶
Model engine uses three distinct autoscaling mechanisms depending on endpoint type and configuration. They are not interchangeable, and only one mechanism applies to any given endpoint at a time.
Sync and Streaming Endpoints: HPA (min_workers > 0)¶
When min_workers > 0, the Service Builder creates a HorizontalPodAutoscaler targeting the endpoint's Deployment. The HPA scales based on CPU and memory metrics. The autoscaling API version is selected based on cluster version: autoscaling/v2 for Kubernetes >= 1.26, autoscaling/v2beta2 for Kubernetes 1.23–1.25.
Sync and Streaming Endpoints: KEDA (min_workers == 0)¶
When min_workers == 0, the Service Builder creates a KEDA ScaledObject instead of an HPA. KEDA uses request concurrency metrics sourced from Prometheus to decide when to scale the endpoint from 0 replicas to 1 replica.
KEDA requires prometheus_server_address
KEDA-based scale-to-zero requires config.values.infra.prometheus_server_address to be set in helm values. Without it, the can_scale_http_endpoint_from_zero_flag is False and scale-to-zero silently does not work. This is enforced at startup in dependencies.py.
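The gating is, in essence, a presence check on the Prometheus address. The following is a hedged reconstruction with illustrative names; the actual code in dependencies.py may differ in detail:

```python
# Hedged reconstruction of the scale-to-zero gating. The function name and
# argument are illustrative; only the behavior (flag is False when no
# Prometheus address is configured) is taken from this document.
def can_scale_http_endpoint_from_zero(prometheus_server_address) -> bool:
    # No Prometheus address means KEDA has no request-concurrency metric
    # source, so scale-from-zero cannot function.
    return bool(prometheus_server_address)
```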
This is one of the most non-obvious configuration dependencies in the system. The endpoint creation will succeed and the KEDA ScaledObject will be created, but scaling will not function.
Known limitation: KEDA only scales 0→1, not 1→N
As of the current codebase, KEDA ScaledObjects only support scaling a sync endpoint from 0 replicas to 1 replica. Scaling from 1 to N is not implemented. This is a documented TODO in k8s_endpoint_resource_delegate.py:
# Right now, keda only will support scaling from 0 to 1
# TODO support keda scaling from 1 to N as well
if request.build_endpoint_request.min_workers > 0:
# ... create HPA
else: # min workers == 0, use keda
# ... create KEDA ScaledObject
For endpoints that need to scale beyond 1 replica, use min_workers >= 1 (which triggers HPA instead of KEDA).
Async Endpoints: Celery Autoscaler¶
Async endpoints are scaled by the Celery Autoscaler StatefulSet, not by HPA or KEDA. The Celery Autoscaler monitors the depth of each endpoint's message queue (SQS queue on AWS, ASB topic on Azure, Redis queue on GCP/on-prem) and adjusts the Deployment's replica count by patching the K8s API directly.
The number of autoscaler shards is configured via celery_autoscaler.num_shards. Multiple shards distribute the monitoring load across many concurrent endpoints. The Celery Autoscaler is enabled via celery_autoscaler.enabled: true.
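Conceptually, the autoscaler maps queue depth to a replica count clamped between the endpoint's worker bounds. The sketch below is illustrative only; `tasks_per_worker` and the exact policy are assumptions, not the shipped defaults:

```python
import math

# Hedged sketch of queue-depth-based scaling as the Celery Autoscaler does it
# conceptually. The real policy, parameter names, and defaults may differ.
def desired_replicas(queue_depth: int, min_workers: int, max_workers: int,
                     tasks_per_worker: int = 10) -> int:
    # One worker per tasks_per_worker queued tasks, clamped to [min, max].
    want = math.ceil(queue_depth / tasks_per_worker) if queue_depth > 0 else 0
    return max(min_workers, min(max_workers, want))
```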
Multi-node (LWS) Endpoints: No Autoscaling¶
LeaderWorkerSet endpoints do not support autoscaling. min_workers must equal max_workers. No HPA or KEDA ScaledObject is created. Capacity changes require deleting and recreating the endpoint.
Autoscaling Summary¶
| Endpoint Type | `min_workers` | Scaler | Metric Source |
|---|---|---|---|
| Sync / Streaming | > 0 | HPA | CPU / memory |
| Sync / Streaming | == 0 | KEDA ScaledObject | Prometheus (request concurrency) |
| Async | any | Celery Autoscaler StatefulSet | Queue depth (SQS / ASB / Redis) |
| Multi-node (LWS) | must equal `max_workers` | None | — |
3.2 Observability¶
Structured logging:
All control plane components emit structured JSON logs. Log verbosity is controlled via debug_mode in helm values.
Datadog APM (optional):
Enabled by setting dd_trace_enabled: true in config.values.launch and installing the Datadog agent in the cluster. When enabled, the DatadogMonitoringMetricsGateway is used instead of FakeMonitoringMetricsGateway. This gates distributed tracing and APM metrics. The top-level datadog.enabled helm value controls Datadog agent sidecar injection.
# from dependencies.py
if hmi_config.dd_trace_enabled:
monitoring_metrics_gateway = DatadogMonitoringMetricsGateway()
else:
monitoring_metrics_gateway = FakeMonitoringMetricsGateway()
Prometheus metrics:
Request concurrency metrics are exposed and consumed by KEDA for scale-to-zero. The Prometheus server must be reachable at the address configured in prometheus_server_address. See §3.1 for the dependency.
OpenTelemetry tracing:
An OTel-based telemetry design is in progress and not yet in production. Current tracing is provided via the TracingGateway abstraction, with Datadog as the primary production implementation.
K8s Cacher readiness probe:
The K8s Cacher writes a readiness file (READYZ_FPATH) after its first successful loop iteration. This gates the cacher pod's readinessProbe, ensuring the Redis cache has at least one warm cycle before the pod is considered ready.
3.3 Cloud Backend Abstraction¶
The config.values.infra.cloud_provider value is the single switch that drives selection of broker, storage, registry, and auth implementations at runtime. This selection happens in dependencies.py and k8s_cache.py on startup. Changing this value without corresponding infrastructure changes will cause runtime failures.
Broker (message queue) selection¶
| `cloud_provider` | Endpoint creation queue | Async inference queue | Queue delegate |
|---|---|---|---|
| `aws` (default) | SQS | SQS | `SQSQueueEndpointResourceDelegate` |
| `azure` | Azure Service Bus | Azure Service Bus | `ASBQueueEndpointResourceDelegate` |
| `gcp` | Redis (Memorystore) | Redis (Memorystore) | `RedisQueueEndpointResourceDelegate` |
| `onprem` | Redis | Redis | `OnPremQueueEndpointResourceDelegate` |
Redis broker is the legacy path
Redis was the original broker for all clouds; AWS and Azure later moved to SQS and Azure Service Bus because Redis-as-broker hit reliability and scale limits. GCP and on-prem still use Redis as the broker and inherit those limitations.
Azure Service Bus idle connection drops
Azure Service Bus drops idle AMQP connections after approximately 300 seconds. This manifests as random 503 errors on async inference with no obvious configuration cause. The fix is broker_pool_limit=0 (disables connection pooling, forcing reconnection on each use). This was resolved in a recent commit — verify your deployment includes the fix before deploying to Azure.
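`broker_pool_limit` is a standard Celery configuration key. A hedged sketch of the override (the dict name is illustrative; how model engine actually wires this into its Celery app is not shown here):

```python
# broker_pool_limit is a real Celery setting; the dict name and how it is
# applied are illustrative, not model engine's actual configuration code.
CELERY_ASB_OVERRIDES = {
    # 0 disables the broker connection pool, forcing a reconnect per use and
    # sidestepping Azure Service Bus dropping idle AMQP connections (~300 s).
    "broker_pool_limit": 0,
}
# Typically applied via app.conf.update(**CELERY_ASB_OVERRIDES) on the
# Celery app object.
```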
Storage selection¶
| `cloud_provider` | Filesystem gateway | LLM artifact gateway | File storage gateway |
|---|---|---|---|
| `aws` / `onprem` | `S3FilesystemGateway` | `S3LLMArtifactGateway` | `S3FileStorageGateway` |
| `azure` | `ABSFilesystemGateway` | `ABSLLMArtifactGateway` | `ABSFileStorageGateway` |
| `gcp` | `GCSFilesystemGateway` | `GCSLLMArtifactGateway` | `GCSFileStorageGateway` |
On-prem uses S3-compatible storage (MinIO or equivalent) via the same S3 gateways as AWS.
Registry selection¶
| `cloud_provider` | Docker repository class |
|---|---|
| `aws` (default) | `ECRDockerRepository` |
| `azure` | `ACRDockerRepository` |
| `gcp` | `GARDockerRepository` |
| `onprem` | `OnPremDockerRepository` |
Inference autoscaling metrics gateway selection¶
| `cloud_provider` | Autoscaling metrics gateway |
|---|---|
| `azure` | `ASBInferenceAutoscalingMetricsGateway` |
| all others | `RedisInferenceAutoscalingMetricsGateway` |
Fine-tune repository selection¶
| `cloud_provider` | Fine-tune repository | Fine-tune events repository |
|---|---|---|
| `aws` / `onprem` | `S3FileLLMFineTuneRepository` | `S3FileLLMFineTuneEventsRepository` |
| `azure` | `ABSFileLLMFineTuneRepository` | `ABSFileLLMFineTuneEventsRepository` |
| `gcp` | `GCSFileLLMFineTuneRepository` | `GCSFileLLMFineTuneEventsRepository` |
3.4 GPU and Hardware Configuration¶
Node selectors and GPU labels¶
Inference pods are scheduled to GPU nodes using the k8s.amazonaws.com/accelerator node label. This label must be present on GPU nodes before endpoints can be created. The GPU types referenced across model engine configuration:
| Label value | GPU |
|---|---|
| `nvidia-ampere-a10` | NVIDIA A10 |
| `nvidia-ampere-a100` | NVIDIA A100 |
| `nvidia-tesla-t4` | NVIDIA T4 |
| `nvidia-hopper-h100` | NVIDIA H100 (full) |
| `nvidia-hopper-h100-1g20gb` | NVIDIA H100 (MIG 1g.20gb) |
| `nvidia-hopper-h100-3g40gb` | NVIDIA H100 (MIG 3g.40gb) |
GPU nodes must have the nvidia.com/gpu: NoSchedule taint that GPU inference pods tolerate. The NVIDIA GPU Operator must be installed and the driver must be functional on every GPU node (nvidia-smi must succeed).
Balloon pods and GPU node warming¶
The balloons helm value creates one low-priority Deployment per accelerator type. Each balloon Deployment occupies a configurable number of replicas (replicaCount) on the corresponding node type, requesting GPU resources to prevent the cluster autoscaler from scaling down GPU nodes between inference workloads.
The balloonConfig.reserveHighPriority: true flag restricts eviction to only high-priority pods. When a real inference pod is scheduled, it evicts balloon pods to claim GPU resources. Setting replicaCount: 0 for a GPU type disables warming for that node type.
# Example: keep 2 H100 nodes and 1 A10 node warm
balloonConfig:
reserveHighPriority: true
balloons:
- acceleratorName: nvidia-hopper-h100
replicaCount: 2
gpuCount: 4
- acceleratorName: nvidia-ampere-a10
replicaCount: 1
- acceleratorName: cpu
replicaCount: 0 # disabled
recommendedHardware auto-selection¶
The recommendedHardware helm value provides two lookup tables used by the LLM endpoint service:
- `byGpuMemoryGb`: Matches on `gpu_memory_le` (less-than-or-equal GB of model GPU memory). Selects GPU type, GPU count, CPU, memory, storage, and `nodes_per_worker`.
- `byModelName`: Named overrides that take precedence over the `byGpuMemoryGb` table for specific models.
recommendedHardware:
byGpuMemoryGb:
- gpu_memory_le: 24
cpus: 10
gpus: 1
memory: 24Gi
storage: 80Gi
gpu_type: nvidia-ampere-a10
nodes_per_worker: 1
- gpu_memory_le: 180
cpus: 20
gpus: 2
memory: 160Gi
storage: 160Gi
gpu_type: nvidia-hopper-h100
nodes_per_worker: 1
- gpu_memory_le: 640
cpus: 80
gpus: 8
memory: 800Gi
storage: 640Gi
gpu_type: nvidia-hopper-h100
nodes_per_worker: 2 # triggers LWS creation
byModelName:
- name: deepseek-coder-v2
cpus: 160
gpus: 8
memory: 800Gi
storage: 640Gi
gpu_type: nvidia-hopper-h100
nodes_per_worker: 1
When nodes_per_worker > 1, the LLM endpoint service creates a multi-node (LWS) endpoint instead of a regular Deployment. This is the mechanism by which large models are automatically placed on multi-node configurations without requiring the caller to specify resource details.
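The lookup can be sketched as follows. The function and argument names are illustrative, not the LLM endpoint service's actual code; only the precedence rule (named overrides win, then the smallest fitting `gpu_memory_le` bound) is taken from this document:

```python
# Hedged sketch of the recommendedHardware lookup. Names are illustrative.
def select_hardware(model_name, gpu_memory_gb, by_model_name, by_gpu_memory_gb):
    # byModelName overrides take precedence over the memory-bucket table.
    for row in by_model_name:
        if row["name"] == model_name:
            return row
    # Otherwise: first row (ascending by bound) whose gpu_memory_le fits.
    for row in sorted(by_gpu_memory_gb, key=lambda r: r["gpu_memory_le"]):
        if gpu_memory_gb <= row["gpu_memory_le"]:
            return row
    raise ValueError(f"no hardware profile fits {gpu_memory_gb} GB")


# Entries condensed from the example values above.
BY_GPU_MEMORY_GB = [
    {"gpu_memory_le": 24, "gpu_type": "nvidia-ampere-a10", "nodes_per_worker": 1},
    {"gpu_memory_le": 180, "gpu_type": "nvidia-hopper-h100", "nodes_per_worker": 1},
    {"gpu_memory_le": 640, "gpu_type": "nvidia-hopper-h100", "nodes_per_worker": 2},
]
BY_MODEL_NAME = [
    {"name": "deepseek-coder-v2", "gpu_type": "nvidia-hopper-h100", "nodes_per_worker": 1},
]

profile = select_hardware("some-large-model", 300, BY_MODEL_NAME, BY_GPU_MEMORY_GB)
# profile["nodes_per_worker"] == 2, so the service would create an LWS endpoint.
```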
imageCache¶
The imageCache helm value defines per-node-type image pre-pulling configuration. Each entry specifies a nodeSelector and optional tolerations matching a GPU node pool. Pre-pulling model images onto nodes reduces inference pod startup time. This is distinct from balloon pods: balloon pods keep nodes allocated; imageCache keeps images warm on those nodes.
4. Component Reference¶
4.1 K8s Cacher¶
What it does:
The K8s Cacher is a standalone Deployment (typically 1 replica) that runs a continuous polling loop. Every sleep_interval_seconds (default: 15 seconds), it:
- Reads the current state of all model endpoint Deployments and LeaderWorkerSets from the K8s API
- Writes endpoint status records to Redis with a TTL of `ttl_seconds` (default: 60 seconds)
- Updates the image cache state (for the `imageCache` feature)
Why it exists: Direct K8s API calls from Gateway pods were unreliable at scale — requests would time out under load. The Cacher decouples Gateway reads from K8s API polling, with Redis as the intermediary. The Gateway reads exclusively from Redis for endpoint status; it never calls the K8s API for status lookups at request time.
Code path:
k8s_cache.py (main loop, --sleep-interval-seconds)
└─ ModelEndpointCacheWriteService.execute()
├─ LiveEndpointResourceGateway → K8s API (reads Deployments / LWS)
└─ RedisModelEndpointCacheRepository.write(ttl=60s)
Startup behavior:
The cacher calls load_incluster_config() first (for in-cluster operation), falling back to load_kube_config() for local development. It writes a readiness file after the first successful loop iteration to gate its readinessProbe — the pod is not considered ready until at least one cache cycle has completed successfully.
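The loop plus readiness-file behavior can be sketched as below. This is a hedged illustration: the real implementation lives in k8s_cache.py, and the function names, arguments, and `cycles` knob here are stand-ins:

```python
import time
from pathlib import Path

# Hedged sketch of the K8s Cacher loop. read_k8s and write_redis are injected
# callables standing in for the K8s API read and Redis write; cycles bounds
# the loop for demonstration (the real cacher runs forever).
def cache_loop(read_k8s, write_redis, readyz_path, ttl_s=60, sleep_s=15,
               cycles=1, sleep=time.sleep):
    wrote_readyz = False
    for i in range(cycles):
        state = read_k8s()               # endpoint state from the K8s API
        write_redis(state, ttl=ttl_s)    # cache entries expire after ttl_s
        if not wrote_readyz:
            # Readiness file gates the pod's readinessProbe: the pod is not
            # ready until one full cache cycle has succeeded.
            Path(readyz_path).write_text("ok")
            wrote_readyz = True
        if i + 1 < cycles:
            sleep(sleep_s)
```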
Failure mode: Redis auth broken → endpoint status unknown
If the cacher cannot write to Redis — due to misconfigured Redis auth, network partition, or expired credentials — it fails silently from the Gateway's perspective. The Gateway reads stale or absent Redis entries and returns endpoint status as "unknown", not an error and not INITIALIZING.
This is the most deceptive failure mode in model engine. An endpoint may be fully READY and serving traffic, but the status API returns "unknown" indefinitely because the cacher-to-Redis path is broken.
How to diagnose: check cacher pod logs for Redis connection errors. Verify Redis auth credentials and network reachability from the cacher pod. In smoke tests, the signature is: Service Builder logs show the endpoint reached READY, but GET /v1/model-endpoints/{id} returns "unknown" without ever transitioning.
Parameters (configurable via CLI args, set in helm Deployment spec):
| Parameter | Default | Description |
|---|---|---|
| `--ttl-seconds` | 60 | Redis TTL for cache entries |
| `--sleep-interval-seconds` | 15 | Poll interval between K8s API reads |
| `--redis-url-override` | None | Override the Redis URL from `hmi_config.cache_redis_url` |
TTL must be greater than sleep interval
If ttl_seconds < sleep_interval_seconds, cache entries expire between writes, causing cache misses on every Gateway status request. The cacher logs a warning if this condition is detected, but does not fail or exit. The default values (60s TTL, 15s interval) satisfy this requirement with a 4x margin.
4.2 Balloon Pods¶
What they do:
Balloon pods are low-priority Deployments that run an ubuntu container with an infinite sleep command. One Deployment exists per GPU type, configured via the balloons helm value. They request GPU resources, causing the cluster autoscaler to provision GPU nodes and keep them allocated even when no inference pods are running.
Why they exist: GPU nodes are expensive to run continuously but slow to provision (5–15 minutes for a new node to join and be ready). Without balloon pods, the cluster autoscaler scales GPU nodes down during idle periods. When a new endpoint is created, the cluster must provision a fresh GPU node, and the 30-minute Celery task timeout (§2.1) starts counting during this wait. Balloon pods eliminate this cold-start delay.
How eviction works:
Balloon pods are created with a low PriorityClass. When a real inference pod needs to be scheduled on a node occupied by a balloon pod, Kubernetes evicts the balloon pod (preemption). The balloonConfig.reserveHighPriority: true setting restricts preemption to only high-priority pods, preventing lower-priority workloads from accidentally evicting balloons and defeating the warming strategy.
Configuration:
balloonConfig:
reserveHighPriority: true
balloons:
- acceleratorName: nvidia-ampere-a10
replicaCount: 1
- acceleratorName: nvidia-ampere-a100
replicaCount: 0 # disabled — no A100 node warming
- acceleratorName: nvidia-hopper-h100
replicaCount: 2
gpuCount: 4 # request 4 GPUs per balloon pod
- acceleratorName: cpu
replicaCount: 0
replicaCount: 0 disables a balloon type
Setting replicaCount: 0 for a GPU type disables node warming for that type. Cold-start delays will occur on the first endpoint creation after a period of inactivity on that GPU type. This is the default for all GPU types in values_sample.yaml — production deployments should set non-zero counts for GPU types in active use.
4.3 Multi-node Endpoints (LWS)¶
What they are:
Multi-node endpoints use LeaderWorkerSet (LWS), a Kubernetes CRD designed for distributed inference workloads that span multiple nodes. LWS is required for models too large to fit on a single node's GPU memory (e.g., 70B+ parameter models requiring more than 8 GPUs).
How they differ from regular Deployments:
| Aspect | Regular Deployment | LeaderWorkerSet |
|---|---|---|
| K8s resource kind | `Deployment` | `LeaderWorkerSet` (CRD) |
| Autoscaling | HPA or KEDA | None |
| `min_workers` vs `max_workers` | Can differ | Must be equal |
| Istio resources created | VirtualService + DestinationRule | ServiceEntry only |
| K8s Service template | `service.yaml` | `lws-service.yaml` |
| Scale-to-zero | Supported (via KEDA) | Not supported |
| Capacity change | Update `min_workers`/`max_workers` | Delete and recreate |
When LWS is used:
The LLM endpoint service selects LWS automatically when nodes_per_worker > 1 in the matched recommendedHardware entry. It can also be specified explicitly in a CreateModelEndpointV1Request by setting nodes_per_worker > 1.
Resource creation differences in Service Builder: For LWS endpoints, the Service Builder takes a different code branch:
- Creates a `LeaderWorkerSet` resource instead of a `Deployment`
- Creates the K8s Service from `lws-service.yaml` (not the standard `service.yaml`)
- If `istio_enabled: true`, creates a `ServiceEntry` (not a `VirtualService` or `DestinationRule`); required because LWS routing uses direct IP address resolution rather than Istio's standard hostname-based VirtualService routing
- Does not create an HPA or KEDA ScaledObject
Istio and LWS routing:
LWS endpoints require a workaround for Istio. The Gateway manually resolves the K8s Service cluster IP and sends requests directly to that IP, bypassing Istio's standard VirtualService routing. A ServiceEntry is created to allow this direct IP traffic to pass through Istio's policy enforcement. See live_sync_model_endpoint_inference_gateway.py and live_streaming_model_endpoint_inference_gateway.py for the implementation details.
No autoscaling for LWS endpoints
LeaderWorkerSet endpoints cannot be autoscaled. min_workers must equal max_workers at creation time. If you need different capacity, delete the endpoint and recreate it with the desired worker count. This is a known limitation with no current workaround.
Appendix: Key Configuration Values Quick Reference¶
The values below have the highest operational impact. Full reference is in helm-values.md.
| Value | Default | Risk | Impact if wrong |
|---|---|---|---|
| `db.runDbMigrationScript` | `false` | HIGH | Schema errors on first deploy; no clear error surface |
| `config.values.infra.prometheus_server_address` | unset | HIGH | KEDA scale-to-zero silently broken |
| `config.values.launch.vllm_repository` | `vllm` (relative) | HIGH | Resolves to Scale's internal ECR in many envs; image pull fails silently |
| `celeryBrokerType` | `sqs` | HIGH | Wrong broker for cloud → async endpoints broken |
| `config.values.infra.cloud_provider` | `aws` | HIGH | Wrong storage, broker, and auth clients loaded for cloud |
| `balloons[*].replicaCount` | `0` | MEDIUM | No GPU node warming → cold-start delays; risks hitting 30-min Celery timeout |
| `celery_autoscaler.enabled` | `true` | MEDIUM | Async endpoints never scale if disabled |
| `config.values.launch.istio_enabled` | `true` | MEDIUM | Must match actual cluster Istio installation state exactly |
db.runDbMigrationScript defaults to false
On first install, the database schema must be initialized. The default false means the migration job does not run, resulting in schema errors at runtime that have no clear error surface. Set db.runDbMigrationScript: true on every first install into a new environment. There is an open TODO to change this default to true.