Model Engine Architecture¶
Audience: Service owners and deployment engineers installing, operating, or debugging model engine in a customer environment.
Scope: This document covers system structure, lifecycle flows, cross-cutting concerns, and component deep-dives. Configuration reference is in helm-values.md. Per-cloud behavior differences are in cloud-matrix.md.
1. System Structure¶
1.1 Architecture Overview¶
Model engine consists of five core pods and a set of external dependencies. The control plane (Gateway, Service Builder, K8s Cacher) runs in the model engine namespace. Inference pods run in a separate endpoint namespace, typically llm-engine.
┌──────────────────────────────────────────────────────────────────────────────────┐
│ Control Plane Namespace (e.g. model-engine) │
│ │
│ ┌─────────────┐ REST ┌─────────────────┐ │
│ │ Gateway │──────────▶│ Service Builder │ │
│ │ (FastAPI) │ │ (Celery worker) │ │
│ └──────┬──────┘ └────────┬─────────┘ │
│ │ read │ write K8s │
│ │ endpoint │ resources │
│ ▼ status ▼ │
│ ┌─────────────┐ ┌─────────────────┐ │
│ │ K8s Cacher │──────────▶│ Redis │ │
│ │ (Deployment)│ write │ (cache store) │ │
│ └─────────────┘ TTL 60s └─────────────────┘ │
│ │
│ ┌──────────────────┐ │
│ │ Celery Autoscaler│ (scales async endpoint workers by queue depth) │
│ │ (StatefulSet) │ │
│ └──────────────────┘ │
│ │
│ ┌──────────────┐ │
│ │ Balloon Pods │ (low-priority GPU placeholder pods, one Deployment per GPU) │
│ └──────────────┘ │
└──────────────────────────────────────────────────────────────────────────────────┘
External Dependencies
┌────────────────┐ ┌──────────────┐ ┌──────────────────────┐ ┌────────────────┐
│ PostgreSQL │ │ Redis │ │ Message Broker │ │ Object Storage │
│ (endpoint DB) │ │ (K8s cache) │ │ SQS / ASB / Redis │ │ S3 / GCS / ABS │
└────────────────┘ └──────────────┘ └──────────────────────┘ └────────────────┘
┌──────────────────────────────────────────────────────────────────────────────────┐
│ Endpoint Namespace (e.g. llm-engine) │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Sync/Streaming Endpoint (Deployment) │ │
│ │ ┌──────────────┐ ┌──────────────────────────────┐ │ │
│ │ │ HTTP Forwarder│ │ vLLM / inference process │ │ │
│ │ └──────────────┘ └──────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Async Endpoint (Deployment) │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ Celery worker (reads from SQS/ASB/Redis) │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Multi-node Endpoint (LeaderWorkerSet / LWS) │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Leader pod │ │ Worker pods │ (no HPA/KEDA) │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ └─────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────────────┘
Data flow summary:
- Endpoint creation: Client → Gateway REST → broker queue → Service Builder Celery worker → K8s API
- Sync inference: Client → Gateway → HTTP forward to inference pod → response
- Async inference: Client → Gateway → broker queue → Celery worker in inference pod → result stored → Client polls
- Streaming inference: Client → Gateway → SSE stream from inference pod
- Status reads: Gateway → Redis (written by K8s Cacher, not K8s API directly)
1.2 Kubernetes Resource Inventory¶
Resources created and managed by the helm chart (control plane):
| Resource | Kind | Notes |
|---|---|---|
| `model-engine-gateway` | Deployment | FastAPI server; replicas configured via `replicaCount.gateway` |
| `model-engine-builder` | Deployment | Celery worker for endpoint creation; replicas via `replicaCount.builder` |
| `model-engine-cacher` | Deployment | K8s cache loop; typically 1 replica (`replicaCount.cacher`) |
| `model-engine-celery-autoscaler` | StatefulSet | Scales async endpoint workers; shards via `celery_autoscaler.num_shards` |
| `model-engine-gateway` | HPA | Autoscales gateway replicas based on concurrency |
| `model-engine-config` | ConfigMap | Runtime config mounted into all control plane pods |
| `model-engine` | ServiceAccount | Used by control plane pods |
| `model-engine` | ClusterRole + ClusterRoleBinding | K8s API access for Service Builder and Cacher |
| Balloon Deployments | Deployment (one per GPU type) | Low-priority placeholder pods; see `balloons` in values |
Resources created per inference endpoint (in endpoint namespace):
| Resource | Kind | Condition |
|---|---|---|
| Inference Deployment | Deployment | All non-LWS endpoints |
| LeaderWorkerSet | LeaderWorkerSet (CRD) | Multi-node endpoints only |
| K8s Service | Service | Sync and streaming endpoints |
| HPA | HorizontalPodAutoscaler | Sync/streaming, min_workers > 0 |
| KEDA ScaledObject | ScaledObject (CRD) | Sync/streaming, min_workers == 0 |
| PodDisruptionBudget | PodDisruptionBudget | All endpoints (configurable) |
| Istio VirtualService | VirtualService | Sync/streaming, istio_enabled: true |
| Istio DestinationRule | DestinationRule | Sync/streaming, istio_enabled: true |
| Istio ServiceEntry | ServiceEntry | Multi-node + istio_enabled: true |
| SQS Queue / ASB Topic | Cloud resource | Async endpoints and all endpoints on async clouds |
HPA and KEDA are mutually exclusive
The Service Builder enforces this: when creating or updating an endpoint, it deletes the KEDA ScaledObject before creating an HPA (if min_workers > 0), or deletes the HPA before creating a KEDA ScaledObject (if min_workers == 0). Both never coexist on the same endpoint.
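The delete-then-create ordering described above can be sketched as follows. This is a minimal illustration with stand-in names (`FakeK8s`, `reconcile_autoscaler`), not the Service Builder's actual API:

```python
# Hedged sketch of the HPA/KEDA mutual-exclusion rule. The class and method
# names here are illustrative stand-ins, not the real Service Builder code.
class FakeK8s:
    """Records resource operations instead of calling the K8s API."""

    def __init__(self):
        self.calls = []

    def delete_scaled_object(self, name):
        self.calls.append(("delete_scaled_object", name))

    def create_hpa(self, name):
        self.calls.append(("create_hpa", name))

    def delete_hpa(self, name):
        self.calls.append(("delete_hpa", name))

    def create_scaled_object(self, name):
        self.calls.append(("create_scaled_object", name))


def reconcile_autoscaler(k8s, endpoint: str, min_workers: int) -> str:
    """Ensure exactly one of HPA / KEDA ScaledObject exists for an endpoint."""
    if min_workers > 0:
        k8s.delete_scaled_object(endpoint)  # drop any KEDA object first
        k8s.create_hpa(endpoint)
        return "hpa"
    k8s.delete_hpa(endpoint)                # drop any HPA first
    k8s.create_scaled_object(endpoint)      # KEDA enables scale-to-zero
    return "keda"
```

Because the stale scaler is always deleted before the new one is created, an endpoint update that flips `min_workers` across the zero boundary swaps mechanisms cleanly.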
1.3 External Dependencies and Prerequisites¶
The following must exist and be reachable from the cluster before helm install:
| Dependency | Required For | Notes |
|---|---|---|
| PostgreSQL | All operations | Endpoint metadata, bundle records, batch job records |
| Redis | Gateway routing, cacher, async metrics | Two logical roles: K8s cache and inference autoscaling metrics |
| Message broker (SQS / ASB / Redis) | Async endpoints; endpoint creation queue | Cloud-dependent; see §3.3 |
| Object storage (S3 / GCS / ABS) | LLM artifacts, fine-tune repos, batch job progress | Cloud-dependent |
| Image registry (ECR / ACR / GAR) | All image pulls | Must be mirrored from public.ecr.aws/b2z8n5q1/ in customer envs |
| Prometheus | KEDA scale-to-zero | Required if any sync endpoint uses min_workers == 0; see §3.1 |
| KEDA | Scale-to-zero | Must be installed in cluster if any endpoint uses min_workers == 0 |
| Istio | VirtualService routing, mTLS | Optional but strongly recommended; set istio_enabled: true/false to match actual state |
| NVIDIA GPU Operator | GPU inference | Required for GPU workloads; nodes must be labeled and driver-ready |
Image registry mirroring
In customer environments, all model engine images must be mirrored from the public ECR source (public.ecr.aws/b2z8n5q1/) to the customer registry before installation. The vllm_repository value defaults to a relative path that resolves to Scale's internal ECR in many deployment configurations and must be overridden. Failing to mirror is the most common silent deployment failure: endpoint creation returns HTTP 200 but the endpoint stays INITIALIZING indefinitely.
2. Lifecycle Flows¶
2.1 Generic Endpoint Creation Flow¶
The endpoint creation path is identical for all endpoint types (sync, async, streaming, multi-node). The LLM API layer (§2.3) is a higher-level wrapper that feeds into the same flow.
Client
│
│ POST /v1/model-endpoints
▼
Gateway (FastAPI)
│ Validates request, writes endpoint record to PostgreSQL (status: PENDING)
│ Enqueues Celery task to endpoint creation queue (SQS / ASB / Redis)
│
▼
Message Broker
│ Task sits in queue (SQS queue / ASB topic / Redis queue)
│
▼
Service Builder (Celery worker)
│ Dequeues task
│ Calls K8s API to create/update:
│ - Deployment or LeaderWorkerSet
│ - HPA or KEDA ScaledObject (sync/streaming, non-LWS only)
│ - K8s Service (sync/streaming only)
│ - Istio VirtualService + DestinationRule (if istio_enabled, non-LWS)
│ - Istio ServiceEntry (if istio_enabled, LWS only)
│ - PodDisruptionBudget
│ Updates endpoint record in PostgreSQL (status: INITIALIZING → READY)
│
▼
K8s Cacher (background loop, every 15s)
│ Reads endpoint state from K8s API
│ Writes to Redis with 60s TTL
│
▼
Gateway
│ Reads endpoint status from Redis (not K8s API directly)
│ Returns status to client via GET /v1/model-endpoints/{id}
Timing constraints:
- The Celery task has a 30-minute hard timeout. Endpoint creation that exceeds this ceiling (e.g., very large image pulls on cold nodes) will fail with no retry, and the endpoint will be stuck INITIALIZING.
- The K8s Cacher runs on a 15-second poll cycle. After the Service Builder marks an endpoint READY in PostgreSQL, there is a brief window (up to 15s) before the Gateway's Redis cache reflects the new state. During this window, status reads may lag.
Celery task timeout is a hard ceiling
The 30-minute Celery task timeout applies to the entire endpoint creation operation, including image pull time. For large model images on cold nodes, image pull alone can approach this limit. Plan capacity accordingly and ensure balloon pods keep GPU nodes warm so image pulls start quickly.
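A client waiting on endpoint creation should account for both constraints: poll up to slightly past the 30-minute ceiling, and treat a briefly stale status as normal. The sketch below is a hedged client-side illustration; `get_status` is an injected callable (e.g., wrapping `GET /v1/model-endpoints/{id}`), and the default numbers are assumptions, not shipped values:

```python
import time

# Hedged polling sketch. timeout_s is set a little above the 30-minute
# (1800 s) Celery build ceiling to absorb the cacher's up-to-15 s status lag.
def wait_until_ready(get_status, timeout_s=1900, poll_s=5, sleep=time.sleep):
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status()
        if status == "READY":
            return True
        if status == "FAILED":
            raise RuntimeError("endpoint build failed")
        sleep(poll_s)  # PENDING / INITIALIZING / stale cache: keep polling
    return False  # ceiling exceeded; endpoint is likely stuck INITIALIZING
```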
2.2 Inference Flows¶
Synchronous Inference¶
Client
│ POST /v1/model-endpoints/{id}/predict
▼
Gateway
│ Looks up endpoint URL from Redis cache
│ HTTP POST directly to inference pod's HTTP forwarder
▼
Inference Pod (HTTP Forwarder + vLLM / model process)
│ Processes request, returns response
▼
Gateway → Client (response forwarded synchronously)
The Gateway does not queue synchronous requests. The inference pod must be reachable at the time of the request. If the pod is not yet ready or has been evicted, the client receives an error immediately.
Asynchronous Inference¶
Client
│ POST /v1/model-endpoints/{id}/predict (async endpoint)
▼
Gateway
│ Enqueues Celery task to inference queue
│ (per-endpoint SQS queue / ASB topic / Redis queue)
│ Returns task_id immediately (HTTP 200)
▼
Message Broker (per-endpoint queue)
│
▼
Celery Worker (inside inference pod)
│ Dequeues task
│ Runs inference
│ Stores result in Celery result backend (Redis / SQS)
▼
Client polls GET /v1/tasks/{task_id}
│
▼
Gateway
│ Reads task result from Celery result backend
│ Returns status: PENDING / SUCCESS / FAILURE
Each async endpoint has its own dedicated queue: one SQS queue per endpoint on AWS, one ASB topic per endpoint on Azure. The Celery Autoscaler monitors queue depth and scales the Deployment's replica count accordingly (see §3.1).
Streaming Inference¶
Streaming follows the same routing path as synchronous inference. The Gateway establishes a Server-Sent Events (SSE) connection to the inference pod and streams response chunks back to the client as they arrive. The inference pod must support streaming — vLLM does natively via its /v1/chat/completions and /v1/completions endpoints with stream=true.
2.3 LLM API Layer¶
Model engine exposes two API surfaces for LLM inference:
| API Surface | Routes | Description |
|---|---|---|
| Generic endpoint API | `GET/POST /v1/model-endpoints`, `/v1/model-endpoints/{id}/predict` | Low-level; caller specifies image, resources, and all parameters explicitly |
| LLM endpoint API v1 | `/v1/llms/...` | Higher-level; opinionated defaults, auto-selects vLLM image and hardware |
| LLM endpoint API v2 | `/v2/...` | OpenAI-compatible; same infrastructure as v1 LLM API |
v1 vs v2:
- v1 (`/v1/llms/...`): Model engine's native LLM API. Returns model engine response format.
- v2 (`/v2/...`): OpenAI-compatible API. Accepts and returns the same request/response format as OpenAI's API, including `stream=true` for SSE streaming. Pydantic models are generated from OpenAI's official OpenAPI spec. Endpoints: `POST /v2/chat/completions`, `POST /v2/completions`.
How LLM endpoints use Service Builder:
The LLM endpoint API (LiveLLMModelEndpointService) is a thin wrapper over the generic LiveModelEndpointService. When a client calls POST /v1/llms to create an LLM endpoint, the service translates a CreateLLMModelEndpointV1Request into a CreateModelEndpointV1Request with opinionated defaults — vLLM image from vllm_repository, resource sizing from recommendedHardware, GPU type selection — and then delegates to the same Service Builder queue path described in §2.1. There is no separate infrastructure for LLM endpoints. They are regular model endpoints with a curated configuration. All failure modes from §2.1 apply equally.
recommendedHardware auto-selection:
The recommendedHardware helm value contains a lookup table keyed by GPU memory requirement (byGpuMemoryGb) and by model name (byModelName). When an LLM endpoint is created without explicit resource specifications, the service queries this table to select GPU type, GPU count, CPU, memory, storage, and nodes_per_worker. When nodes_per_worker > 1, the service creates a multi-node (LWS) endpoint instead of a regular Deployment. See §3.4 for details.
3. Cross-cutting Concerns¶
3.1 Autoscaling¶
Model engine uses three distinct autoscaling mechanisms depending on endpoint type and configuration. They are not interchangeable, and only one mechanism applies to any given endpoint at a time.
Sync and Streaming Endpoints: HPA (min_workers > 0)¶
When min_workers > 0, the Service Builder creates a HorizontalPodAutoscaler targeting the endpoint's Deployment. The HPA scales based on CPU and memory metrics. The autoscaling API version is selected based on cluster version: autoscaling/v2 for Kubernetes >= 1.26, autoscaling/v2beta2 for Kubernetes 1.23–1.25.
Sync and Streaming Endpoints: KEDA (min_workers == 0)¶
When min_workers == 0, the Service Builder creates a KEDA ScaledObject instead of an HPA. KEDA uses request concurrency metrics sourced from Prometheus to decide when to scale the endpoint from 0 replicas to 1 replica.
KEDA requires prometheus_server_address
KEDA-based scale-to-zero requires config.values.infra.prometheus_server_address to be set in helm values. Without it, the can_scale_http_endpoint_from_zero_flag is False and scale-to-zero silently does not work. This is enforced at startup in dependencies.py.
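The gating is, in essence, a presence check on the Prometheus address. The following is a hedged reconstruction with illustrative names; the actual code in dependencies.py may differ in detail:

```python
# Hedged reconstruction of the scale-to-zero gating. The function name and
# argument are illustrative; only the behavior (flag is False when no
# Prometheus address is configured) is taken from this document.
def can_scale_http_endpoint_from_zero(prometheus_server_address) -> bool:
    # No Prometheus address means KEDA has no request-concurrency metric
    # source, so scale-from-zero cannot function.
    return bool(prometheus_server_address)
```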
This is one of the most non-obvious configuration dependencies in the system. The endpoint creation will succeed and the KEDA ScaledObject will be created, but scaling will not function.
Known limitation: KEDA only scales 0→1, not 1→N
As of the current codebase, KEDA ScaledObjects only support scaling a sync endpoint from 0 replicas to 1 replica. Scaling from 1 to N is not implemented. This is a documented TODO in k8s_endpoint_resource_delegate.py:
# Right now, keda only will support scaling from 0 to 1
# TODO support keda scaling from 1 to N as well
if request.build_endpoint_request.min_workers > 0:
# ... create HPA
else: # min workers == 0, use keda
# ... create KEDA ScaledObject
For endpoints that need to scale beyond 1 replica, use min_workers >= 1 (which triggers HPA instead of KEDA).
Async Endpoints: Celery Autoscaler¶
Async endpoints are scaled by the Celery Autoscaler StatefulSet, not by HPA or KEDA. The Celery Autoscaler monitors the depth of each endpoint's message queue (SQS queue on AWS, ASB topic on Azure, Redis queue on GCP/on-prem) and adjusts the Deployment's replica count by patching the K8s API directly.
The number of autoscaler shards is configured via celery_autoscaler.num_shards. Multiple shards distribute the monitoring load across many concurrent endpoints. The Celery Autoscaler is enabled via celery_autoscaler.enabled: true.
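Conceptually, the autoscaler maps queue depth to a replica count clamped between the endpoint's worker bounds. The sketch below is illustrative only; `tasks_per_worker` and the exact policy are assumptions, not the shipped defaults:

```python
import math

# Hedged sketch of queue-depth-based scaling as the Celery Autoscaler does it
# conceptually. The real policy, parameter names, and defaults may differ.
def desired_replicas(queue_depth: int, min_workers: int, max_workers: int,
                     tasks_per_worker: int = 10) -> int:
    # One worker per tasks_per_worker queued tasks, clamped to [min, max].
    want = math.ceil(queue_depth / tasks_per_worker) if queue_depth > 0 else 0
    return max(min_workers, min(max_workers, want))
```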
Multi-node (LWS) Endpoints: No Autoscaling¶
LeaderWorkerSet endpoints do not support autoscaling. min_workers must equal max_workers. No HPA or KEDA ScaledObject is created. Capacity changes require deleting and recreating the endpoint.
Autoscaling Summary¶
| Endpoint Type | `min_workers` | Scaler | Metric Source |
|---|---|---|---|
| Sync / Streaming | > 0 | HPA | CPU / memory |
| Sync / Streaming | == 0 | KEDA ScaledObject | Prometheus (request concurrency) |
| Async | any | Celery Autoscaler StatefulSet | Queue depth (SQS / ASB / Redis) |
| Multi-node (LWS) | must equal `max_workers` | None | — |
3.2 Observability¶
Structured logging:
All control plane components emit structured JSON logs. Log verbosity is controlled via debug_mode in helm values.
Datadog APM (optional):
Enabled by setting dd_trace_enabled: true in config.values.launch and installing the Datadog agent in the cluster. When enabled, the DatadogMonitoringMetricsGateway is used instead of FakeMonitoringMetricsGateway. This gates distributed tracing and APM metrics. The top-level datadog.enabled helm value controls Datadog agent sidecar injection.
# from dependencies.py
if hmi_config.dd_trace_enabled:
monitoring_metrics_gateway = DatadogMonitoringMetricsGateway()
else:
monitoring_metrics_gateway = FakeMonitoringMetricsGateway()
Prometheus metrics:
Request concurrency metrics are exposed and consumed by KEDA for scale-to-zero. The Prometheus server must be reachable at the address configured in prometheus_server_address. See §3.1 for the dependency.
OpenTelemetry tracing:
An OTel-based telemetry design is in progress and not yet in production. Current tracing is provided via the TracingGateway abstraction, with Datadog as the primary production implementation.
K8s Cacher readiness probe:
The K8s Cacher writes a readiness file (READYZ_FPATH) after its first successful loop iteration. This gates the cacher pod's readinessProbe, ensuring the Redis cache has at least one warm cycle before the pod is considered ready.
3.3 Cloud Backend Abstraction¶
The config.values.infra.cloud_provider value is the single switch that drives selection of broker, storage, registry, and auth implementations at runtime. This selection happens in dependencies.py and k8s_cache.py on startup. Changing this value without corresponding infrastructure changes will cause runtime failures.
Broker (message queue) selection¶
| `cloud_provider` | Endpoint creation queue | Async inference queue | Queue delegate |
|---|---|---|---|
| `aws` (default) | SQS | SQS | `SQSQueueEndpointResourceDelegate` |
| `azure` | Azure Service Bus | Azure Service Bus | `ASBQueueEndpointResourceDelegate` |
| `gcp` | Redis (Memorystore) | Redis (Memorystore) | `RedisQueueEndpointResourceDelegate` |
| `onprem` | Redis | Redis | `OnPremQueueEndpointResourceDelegate` |
Redis broker is the legacy path
Redis was the original broker for all clouds; AWS and Azure later moved to SQS and Azure Service Bus because Redis-as-broker hit reliability and scale limits. GCP and on-prem still use Redis as the broker and inherit those limitations.
Azure Service Bus idle connection drops
Azure Service Bus drops idle AMQP connections after approximately 300 seconds. This manifests as random 503 errors on async inference with no obvious configuration cause. The fix is broker_pool_limit=0 (disables connection pooling, forcing reconnection on each use). This was resolved in a recent commit — verify your deployment includes the fix before deploying to Azure.
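`broker_pool_limit` is a standard Celery configuration key. A hedged sketch of the override (the dict name is illustrative; how model engine actually wires this into its Celery app is not shown here):

```python
# broker_pool_limit is a real Celery setting; the dict name and how it is
# applied are illustrative, not model engine's actual configuration code.
CELERY_ASB_OVERRIDES = {
    # 0 disables the broker connection pool, forcing a reconnect per use and
    # sidestepping Azure Service Bus dropping idle AMQP connections (~300 s).
    "broker_pool_limit": 0,
}
# Typically applied via app.conf.update(**CELERY_ASB_OVERRIDES) on the
# Celery app object.
```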
Storage selection¶
| `cloud_provider` | Filesystem gateway | LLM artifact gateway | File storage gateway |
|---|---|---|---|
| `aws` / `onprem` | `S3FilesystemGateway` | `S3LLMArtifactGateway` | `S3FileStorageGateway` |
| `azure` | `ABSFilesystemGateway` | `ABSLLMArtifactGateway` | `ABSFileStorageGateway` |
| `gcp` | `GCSFilesystemGateway` | `GCSLLMArtifactGateway` | `GCSFileStorageGateway` |
On-prem uses S3-compatible storage (MinIO or equivalent) via the same S3 gateways as AWS.
Registry selection¶
| `cloud_provider` | Docker repository class |
|---|---|
| `aws` (default) | `ECRDockerRepository` |
| `azure` | `ACRDockerRepository` |
| `gcp` | `GARDockerRepository` |
| `onprem` | `OnPremDockerRepository` |
Inference autoscaling metrics gateway selection¶
| `cloud_provider` | Autoscaling metrics gateway |
|---|---|
| `azure` | `ASBInferenceAutoscalingMetricsGateway` |
| all others | `RedisInferenceAutoscalingMetricsGateway` |
Fine-tune repository selection¶
| `cloud_provider` | Fine-tune repository | Fine-tune events repository |
|---|---|---|
| `aws` / `onprem` | `S3FileLLMFineTuneRepository` | `S3FileLLMFineTuneEventsRepository` |
| `azure` | `ABSFileLLMFineTuneRepository` | `ABSFileLLMFineTuneEventsRepository` |
| `gcp` | `GCSFileLLMFineTuneRepository` | `GCSFileLLMFineTuneEventsRepository` |
3.4 GPU and Hardware Configuration¶
Node selectors and GPU labels¶
Inference pods are scheduled to GPU nodes using the k8s.amazonaws.com/accelerator node label. This label must be present on GPU nodes before endpoints can be created. The GPU types referenced across model engine configuration:
| Label value | GPU |
|---|---|
| `nvidia-ampere-a10` | NVIDIA A10 |
| `nvidia-ampere-a100` | NVIDIA A100 |
| `nvidia-tesla-t4` | NVIDIA T4 |
| `nvidia-hopper-h100` | NVIDIA H100 (full) |
| `nvidia-hopper-h100-1g20gb` | NVIDIA H100 (MIG 1g.20gb) |
| `nvidia-hopper-h100-3g40gb` | NVIDIA H100 (MIG 3g.40gb) |
GPU nodes must have the nvidia.com/gpu: NoSchedule taint that GPU inference pods tolerate. The NVIDIA GPU Operator must be installed and the driver must be functional on every GPU node (nvidia-smi must succeed).
Balloon pods and GPU node warming¶
The balloons helm value creates one low-priority Deployment per accelerator type. Each balloon Deployment occupies a configurable number of replicas (replicaCount) on the corresponding node type, requesting GPU resources to prevent the cluster autoscaler from scaling down GPU nodes between inference workloads.
The balloonConfig.reserveHighPriority: true flag restricts eviction to only high-priority pods. When a real inference pod is scheduled, it evicts balloon pods to claim GPU resources. Setting replicaCount: 0 for a GPU type disables warming for that node type.
# Example: keep 2 H100 nodes and 1 A10 node warm
balloonConfig:
reserveHighPriority: true
balloons:
- acceleratorName: nvidia-hopper-h100
replicaCount: 2
gpuCount: 4
- acceleratorName: nvidia-ampere-a10
replicaCount: 1
- acceleratorName: cpu
replicaCount: 0 # disabled
recommendedHardware auto-selection¶
The recommendedHardware helm value provides two lookup tables used by the LLM endpoint service:
- `byGpuMemoryGb`: Matches on `gpu_memory_le` (less-than-or-equal GB of model GPU memory). Selects GPU type, GPU count, CPU, memory, storage, and `nodes_per_worker`.
- `byModelName`: Named overrides that take precedence over the `byGpuMemoryGb` table for specific models.
recommendedHardware:
byGpuMemoryGb:
- gpu_memory_le: 24
cpus: 10
gpus: 1
memory: 24Gi
storage: 80Gi
gpu_type: nvidia-ampere-a10
nodes_per_worker: 1
- gpu_memory_le: 180
cpus: 20
gpus: 2
memory: 160Gi
storage: 160Gi
gpu_type: nvidia-hopper-h100
nodes_per_worker: 1
- gpu_memory_le: 640
cpus: 80
gpus: 8
memory: 800Gi
storage: 640Gi
gpu_type: nvidia-hopper-h100
nodes_per_worker: 2 # triggers LWS creation
byModelName:
- name: deepseek-coder-v2
cpus: 160
gpus: 8
memory: 800Gi
storage: 640Gi
gpu_type: nvidia-hopper-h100
nodes_per_worker: 1
When nodes_per_worker > 1, the LLM endpoint service creates a multi-node (LWS) endpoint instead of a regular Deployment. This is the mechanism by which large models are automatically placed on multi-node configurations without requiring the caller to specify resource details.
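The lookup can be sketched as follows. The function and argument names are illustrative, not the LLM endpoint service's actual code; only the precedence rule (named overrides win, then the smallest fitting `gpu_memory_le` bound) is taken from this document:

```python
# Hedged sketch of the recommendedHardware lookup. Names are illustrative.
def select_hardware(model_name, gpu_memory_gb, by_model_name, by_gpu_memory_gb):
    # byModelName overrides take precedence over the memory-bucket table.
    for row in by_model_name:
        if row["name"] == model_name:
            return row
    # Otherwise: first row (ascending by bound) whose gpu_memory_le fits.
    for row in sorted(by_gpu_memory_gb, key=lambda r: r["gpu_memory_le"]):
        if gpu_memory_gb <= row["gpu_memory_le"]:
            return row
    raise ValueError(f"no hardware profile fits {gpu_memory_gb} GB")


# Entries condensed from the example values above.
BY_GPU_MEMORY_GB = [
    {"gpu_memory_le": 24, "gpu_type": "nvidia-ampere-a10", "nodes_per_worker": 1},
    {"gpu_memory_le": 180, "gpu_type": "nvidia-hopper-h100", "nodes_per_worker": 1},
    {"gpu_memory_le": 640, "gpu_type": "nvidia-hopper-h100", "nodes_per_worker": 2},
]
BY_MODEL_NAME = [
    {"name": "deepseek-coder-v2", "gpu_type": "nvidia-hopper-h100", "nodes_per_worker": 1},
]

profile = select_hardware("some-large-model", 300, BY_MODEL_NAME, BY_GPU_MEMORY_GB)
# profile["nodes_per_worker"] == 2, so the service would create an LWS endpoint.
```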
imageCache¶
The imageCache helm value defines per-node-type image pre-pulling configuration. Each entry specifies a nodeSelector and optional tolerations matching a GPU node pool. Pre-pulling model images onto nodes reduces inference pod startup time. This is distinct from balloon pods: balloon pods keep nodes allocated; imageCache keeps images warm on those nodes.
4. Component Reference¶
4.1 K8s Cacher¶
What it does:
The K8s Cacher is a standalone Deployment (typically 1 replica) that runs a continuous polling loop. Every sleep_interval_seconds (default: 15 seconds), it:
- Reads the current state of all model endpoint Deployments and LeaderWorkerSets from the K8s API
- Writes endpoint status records to Redis with a TTL of `ttl_seconds` (default: 60 seconds)
- Updates the image cache state (for the `imageCache` feature)
Why it exists: Direct K8s API calls from Gateway pods were unreliable at scale — requests would time out under load. The Cacher decouples Gateway reads from K8s API polling, with Redis as the intermediary. The Gateway reads exclusively from Redis for endpoint status; it never calls the K8s API for status lookups at request time.
Code path:
k8s_cache.py (main loop, --sleep-interval-seconds)
└─ ModelEndpointCacheWriteService.execute()
├─ LiveEndpointResourceGateway → K8s API (reads Deployments / LWS)
└─ RedisModelEndpointCacheRepository.write(ttl=60s)
Startup behavior:
The cacher calls load_incluster_config() first (for in-cluster operation), falling back to load_kube_config() for local development. It writes a readiness file after the first successful loop iteration to gate its readinessProbe — the pod is not considered ready until at least one cache cycle has completed successfully.
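The loop plus readiness-file behavior can be sketched as below. This is a hedged illustration: the real implementation lives in k8s_cache.py, and the function names, arguments, and `cycles` knob here are stand-ins:

```python
import time
from pathlib import Path

# Hedged sketch of the K8s Cacher loop. read_k8s and write_redis are injected
# callables standing in for the K8s API read and Redis write; cycles bounds
# the loop for demonstration (the real cacher runs forever).
def cache_loop(read_k8s, write_redis, readyz_path, ttl_s=60, sleep_s=15,
               cycles=1, sleep=time.sleep):
    wrote_readyz = False
    for i in range(cycles):
        state = read_k8s()               # endpoint state from the K8s API
        write_redis(state, ttl=ttl_s)    # cache entries expire after ttl_s
        if not wrote_readyz:
            # Readiness file gates the pod's readinessProbe: the pod is not
            # ready until one full cache cycle has succeeded.
            Path(readyz_path).write_text("ok")
            wrote_readyz = True
        if i + 1 < cycles:
            sleep(sleep_s)
```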
Failure mode: Redis auth broken → endpoint status unknown
If the cacher cannot write to Redis — due to misconfigured Redis auth, network partition, or expired credentials — it fails silently from the Gateway's perspective. The Gateway reads stale or absent Redis entries and returns endpoint status as "unknown", not an error and not INITIALIZING.
This is the most deceptive failure mode in model engine. An endpoint may be fully READY and serving traffic, but the status API returns "unknown" indefinitely because the cacher-to-Redis path is broken.
How to diagnose: check cacher pod logs for Redis connection errors. Verify Redis auth credentials and network reachability from the cacher pod. In smoke tests, the signature is: Service Builder logs show the endpoint reached READY, but GET /v1/model-endpoints/{id} returns "unknown" without ever transitioning.
Parameters (configurable via CLI args, set in helm Deployment spec):
| Parameter | Default | Description |
|---|---|---|
| `--ttl-seconds` | 60 | Redis TTL for cache entries |
| `--sleep-interval-seconds` | 15 | Poll interval between K8s API reads |
| `--redis-url-override` | None | Override the Redis URL from `hmi_config.cache_redis_url` |
TTL must be greater than sleep interval
If ttl_seconds < sleep_interval_seconds, cache entries expire between writes, causing cache misses on every Gateway status request. The cacher logs a warning if this condition is detected, but does not fail or exit. The default values (60s TTL, 15s interval) satisfy this requirement with a 4x margin.
4.2 Balloon Pods¶
What they do:
Balloon pods are low-priority Deployments that run an ubuntu container with an infinite sleep command. One Deployment exists per GPU type, configured via the balloons helm value. They request GPU resources, causing the cluster autoscaler to provision GPU nodes and keep them allocated even when no inference pods are running.
Why they exist: GPU nodes are expensive to run continuously but slow to provision (5–15 minutes for a new node to join and be ready). Without balloon pods, the cluster autoscaler scales GPU nodes down during idle periods. When a new endpoint is created, the cluster must provision a fresh GPU node, and the 30-minute Celery task timeout (§2.1) starts counting during this wait. Balloon pods eliminate this cold-start delay.
How eviction works:
Balloon pods are created with a low PriorityClass. When a real inference pod needs to be scheduled on a node occupied by a balloon pod, Kubernetes evicts the balloon pod (preemption). The balloonConfig.reserveHighPriority: true setting restricts preemption to only high-priority pods, preventing lower-priority workloads from accidentally evicting balloons and defeating the warming strategy.
Configuration:
balloonConfig:
reserveHighPriority: true
balloons:
- acceleratorName: nvidia-ampere-a10
replicaCount: 1
- acceleratorName: nvidia-ampere-a100
replicaCount: 0 # disabled — no A100 node warming
- acceleratorName: nvidia-hopper-h100
replicaCount: 2
gpuCount: 4 # request 4 GPUs per balloon pod
- acceleratorName: cpu
replicaCount: 0
replicaCount: 0 disables a balloon type
Setting replicaCount: 0 for a GPU type disables node warming for that type. Cold-start delays will occur on the first endpoint creation after a period of inactivity on that GPU type. This is the default for all GPU types in values_sample.yaml — production deployments should set non-zero counts for GPU types in active use.
4.3 Multi-node Endpoints (LWS)¶
What they are:
Multi-node endpoints use LeaderWorkerSet (LWS), a Kubernetes CRD designed for distributed inference workloads that span multiple nodes. LWS is required for models too large to fit on a single node's GPU memory (e.g., 70B+ parameter models requiring more than 8 GPUs).
How they differ from regular Deployments:
| Aspect | Regular Deployment | LeaderWorkerSet |
|---|---|---|
| K8s resource kind | `Deployment` | `LeaderWorkerSet` (CRD) |
| Autoscaling | HPA or KEDA | None |
| `min_workers` vs `max_workers` | Can differ | Must be equal |
| Istio resources created | VirtualService + DestinationRule | ServiceEntry only |
| K8s Service template | `service.yaml` | `lws-service.yaml` |
| Scale-to-zero | Supported (via KEDA) | Not supported |
| Capacity change | Update `min_workers`/`max_workers` | Delete and recreate |
When LWS is used:
The LLM endpoint service selects LWS automatically when nodes_per_worker > 1 in the matched recommendedHardware entry. It can also be specified explicitly in a CreateModelEndpointV1Request by setting nodes_per_worker > 1.
Resource creation differences in Service Builder: For LWS endpoints, the Service Builder takes a different code branch:
- Creates a `LeaderWorkerSet` resource instead of a `Deployment`
- Creates the K8s Service from `lws-service.yaml` (not the standard `service.yaml`)
- If `istio_enabled: true`, creates a `ServiceEntry` (not a `VirtualService` or `DestinationRule`); required because LWS routing uses direct IP address resolution rather than Istio's standard hostname-based VirtualService routing
- Does not create an HPA or KEDA ScaledObject
Istio and LWS routing:
LWS endpoints require a workaround for Istio. The Gateway manually resolves the K8s Service cluster IP and sends requests directly to that IP, bypassing Istio's standard VirtualService routing. A ServiceEntry is created to allow this direct IP traffic to pass through Istio's policy enforcement. See live_sync_model_endpoint_inference_gateway.py and live_streaming_model_endpoint_inference_gateway.py for the implementation details.
No autoscaling for LWS endpoints
LeaderWorkerSet endpoints cannot be autoscaled. min_workers must equal max_workers at creation time. If you need different capacity, delete the endpoint and recreate it with the desired worker count. This is a known limitation with no current workaround.
Appendix: Key Configuration Values Quick Reference¶
The values below have the highest operational impact. Full reference is in helm-values.md.
| Value | Default | Risk | Impact if wrong |
|---|---|---|---|
| `db.runDbMigrationScript` | `false` | HIGH | Schema errors on first deploy; no clear error surface |
| `config.values.infra.prometheus_server_address` | unset | HIGH | KEDA scale-to-zero silently broken |
| `config.values.launch.vllm_repository` | `vllm` (relative) | HIGH | Resolves to Scale's internal ECR in many envs; image pull fails silently |
| `celeryBrokerType` | `sqs` | HIGH | Wrong broker for cloud → async endpoints broken |
| `config.values.infra.cloud_provider` | `aws` | HIGH | Wrong storage, broker, and auth clients loaded for cloud |
| `balloons[*].replicaCount` | `0` | MEDIUM | No GPU node warming → cold-start delays; risks hitting 30-min Celery timeout |
| `celery_autoscaler.enabled` | `true` | MEDIUM | Async endpoints never scale if disabled |
| `config.values.launch.istio_enabled` | `true` | MEDIUM | Must match actual cluster Istio installation state exactly |
db.runDbMigrationScript defaults to false
On first install, the database schema must be initialized. The default false means the migration job does not run, resulting in schema errors at runtime that have no clear error surface. Set db.runDbMigrationScript: true on every first install into a new environment. There is an open TODO to change this default to true.