Helm Values Reference

Audience: Deployment engineers installing model engine into a customer environment. Purpose: Reference for every configurable value, organized by deployment concern. Full chart source: charts/model-engine/values_sample.yaml


High-Risk Values

Read this before starting your installation

The following values have non-obvious defaults or silent failure modes. Getting them wrong causes hard-to-diagnose issues.

| Value | Default | Risk | Impact if wrong |
|---|---|---|---|
| db.runDbMigrationScript | false | HIGH | Schema not initialized on first install; model creation fails with cryptic DB errors |
| config.values.infra.prometheus_server_address | unset | HIGH | KEDA scale-to-zero silently broken for sync endpoints with min_workers=0 |
| config.values.launch.vllm_repository | vllm (resolves to Scale's ECR) | HIGH | Endpoint creation succeeds but pods stay INITIALIZING; image pull fails silently |
| celeryBrokerType | sqs | HIGH | Wrong broker for non-AWS clouds; all async endpoints broken |
| config.values.infra.cloud_provider | aws | HIGH | Wrong storage/auth/registry clients loaded for non-AWS environments |
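
A pre-flight override covering all five values. This is a sketch: the registry path and Prometheus address are placeholders taken from the examples later on this page; substitute your own.

db:
  runDbMigrationScript: true  # first install only

celeryBrokerType: sqs  # elasticache for Azure / on-prem

config:
  values:
    infra:
      cloud_provider: aws
      prometheus_server_address: "http://prometheus-server.istio-system.svc.cluster.local:80"
    launch:
      vllm_repository: "111122223333.dkr.ecr.us-east-1.amazonaws.com/vllm"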

TODO

Change db.runDbMigrationScript default to true in values.yaml.


1. Minimum Viable Config

These are the values you must set for the service to start. All other values have safe defaults. Getting any of these wrong will prevent the control plane from coming up or prevent any endpoint from being created successfully.

| Value | Type | Default | Required | Description |
|---|---|---|---|---|
| tag | string | | Yes | LLM Engine Docker image tag to deploy |
| image.gatewayRepository | string | | Yes | Docker repository for the gateway image |
| image.builderRepository | string | | Yes | Docker repository for the endpoint builder image |
| image.cacherRepository | string | | Yes | Docker repository for the cacher image |
| image.forwarderRepository | string | | Yes | Docker repository for the forwarder image |
| secrets.kubernetesDatabaseSecretName | string | llm-engine-postgres-credentials | Yes (one of two) | Kubernetes Secret name containing DATABASE_URL. Mutually exclusive with secrets.cloudDatabaseSecretName |
| secrets.cloudDatabaseSecretName | string | | Yes (one of two) | Cloud-provider secret name (e.g., AWS Secrets Manager) containing database credentials |
| serviceAccount.annotations | map | | Yes | Annotations to apply to the control-plane service account. On EKS, set eks.amazonaws.com/role-arn |
| config.values.infra.cloud_provider | string | aws | Yes | Cloud provider: aws, azure, or onprem |
| config.values.infra.k8s_cluster_name | string | main_cluster | Yes | Kubernetes cluster name used for resource tagging and lookups |
| config.values.infra.dns_host_domain | string | llm-engine.domain.com | Yes | Base domain for endpoint hostnames |
| config.values.infra.default_region | string | us-east-1 | Yes | Default cloud region for all resource operations |
| config.values.infra.ml_account_id | string | "000000000000" | Yes | Cloud account/subscription ID |
| config.values.infra.docker_repo_prefix | string | 000000000000.dkr.ecr.us-east-1.amazonaws.com | Yes | Prefix prepended to all inference image repositories |
| config.values.infra.redis_host | string | | Yes (if not using secret) | Hostname of the Redis cluster used by the inference control plane |
| config.values.infra.s3_bucket | string | llm-engine | Yes | S3 bucket (or equivalent) for storing fine-tuning artifacts and other assets |
| config.values.launch.endpoint_namespace | string | llm-engine | Yes | Kubernetes namespace where inference endpoint pods are created |
| config.values.launch.cache_redis_aws_url | string | | Yes (one of) | Full Redis URL used by the cacher. Set exactly one of cache_redis_aws_url, cache_redis_azure_host, cache_redis_aws_secret_name, or (on-prem) cache_redis_onprem_url |

Minimal Working YAML

tag: "abc123def456"

image:
  gatewayRepository: public.ecr.aws/b2z8n5q1/model-engine
  builderRepository: public.ecr.aws/b2z8n5q1/model-engine
  cacherRepository: public.ecr.aws/b2z8n5q1/model-engine
  forwarderRepository: public.ecr.aws/b2z8n5q1/model-engine
  pullPolicy: Always

secrets:
  kubernetesDatabaseSecretName: llm-engine-postgres-credentials

serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/k8s-main-llm-engine

db:
  runDbMigrationScript: true  # required on first install
  runDbInitScript: false

config:
  values:
    infra:
      cloud_provider: aws
      k8s_cluster_name: my-cluster
      dns_host_domain: llm-engine.example.com
      default_region: us-east-1
      ml_account_id: "111122223333"
      docker_repo_prefix: "111122223333.dkr.ecr.us-east-1.amazonaws.com"
      redis_host: my-redis.use1.cache.amazonaws.com
      s3_bucket: my-llm-engine-bucket
    launch:
      endpoint_namespace: llm-engine
      cache_redis_aws_url: redis://my-redis.use1.cache.amazonaws.com:6379/15
      s3_file_llm_fine_tuning_job_repository: "s3://my-llm-engine-bucket/llm-ft-job-repository"
      hf_user_fine_tuned_weights_prefix: "s3://my-llm-engine-bucket/fine_tuned_weights"
      vllm_repository: "111122223333.dkr.ecr.us-east-1.amazonaws.com/vllm"
      tensorrt_llm_repository: "111122223333.dkr.ecr.us-east-1.amazonaws.com/tensorrt-llm"
      batch_inference_vllm_repository: "111122223333.dkr.ecr.us-east-1.amazonaws.com/llm-engine/batch-infer-vllm"
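
If the database credentials live in a cloud secret store rather than a Kubernetes Secret, set the other of the two mutually exclusive fields. A sketch, assuming an AWS Secrets Manager secret (the secret name is a placeholder):

secrets:
  # Mutually exclusive with kubernetesDatabaseSecretName
  cloudDatabaseSecretName: prod/llm-engine/db-credentials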

2. Cloud-Specific Config

AWS (Reference Configuration)

AWS is the default. The values below represent the full set of AWS-specific fields.

| Value | Type | Default | Required | Description |
|---|---|---|---|---|
| config.values.infra.cloud_provider | string | aws | Yes | Set to aws |
| config.values.infra.default_region | string | us-east-1 | Yes | AWS region for ECR, SQS, and other resources |
| config.values.infra.ml_account_id | string | | Yes | AWS account ID (12 digits, quoted as a string) |
| config.values.infra.docker_repo_prefix | string | | Yes | ECR registry prefix: <account>.dkr.ecr.<region>.amazonaws.com |
| config.values.infra.redis_host | string | | Yes (or use secret) | ElastiCache hostname |
| config.values.infra.redis_aws_secret_name | string | | No | AWS Secrets Manager secret name containing Redis connection info. Fields: scheme, host, port, auth_token (optional), query_params (optional) |
| config.values.infra.s3_bucket | string | llm-engine | Yes | S3 bucket name for artifacts |
| config.values.launch.cache_redis_aws_url | string | | Yes (one of) | Full Redis URL: redis://<host>:<port>/<db> |
| config.values.launch.cache_redis_aws_secret_name | string | | Yes (one of) | AWS Secrets Manager secret with field cache-url containing the full Redis URL |
| config.values.launch.sqs_profile | string | default | No | AWS profile for SQS operations |
| config.values.launch.sqs_queue_policy_template | string | | Yes (for async) | IAM policy template for per-endpoint SQS queues. Must grant sqs:* to the LLM Engine role |
| config.values.launch.sqs_queue_tag_template | string | | No | JSON template for SQS queue tags |
| celeryBrokerType | string | sqs | Yes | Use sqs for AWS async endpoints |
| serviceAccount.annotations."eks.amazonaws.com/role-arn" | string | | Yes | IRSA role ARN for the control-plane service account |

# AWS reference config diff
config:
  values:
    infra:
      cloud_provider: aws
      default_region: us-east-1
      ml_account_id: "111122223333"
      docker_repo_prefix: "111122223333.dkr.ecr.us-east-1.amazonaws.com"
      redis_host: my-redis.use1.cache.amazonaws.com
      s3_bucket: my-llm-engine-bucket
    launch:
      cache_redis_aws_url: redis://my-redis.use1.cache.amazonaws.com:6379/15
      sqs_profile: default
      sqs_queue_policy_template: >
        {
          "Version": "2012-10-17",
          "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::111122223333:role/k8s-main-llm-engine"},
            "Action": "sqs:*",
            "Resource": "arn:aws:sqs:us-east-1:111122223333:${queue_name}"
          }]
        }

celeryBrokerType: sqs

serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/k8s-main-llm-engine
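
If redis_aws_secret_name is used instead of redis_host, the Secrets Manager payload carries the connection fields listed in the table above. A sketch with placeholder values (auth_token and query_params are optional and omitted here):

{
  "scheme": "redis://",
  "host": "my-redis.use1.cache.amazonaws.com",
  "port": 6379
}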

Azure (Diff from AWS)

| Value | Type | Default | Required | Description |
|---|---|---|---|---|
| config.values.infra.cloud_provider | string | | Yes | Set to azure |
| config.values.infra.default_region | string | | Yes | Azure region (e.g., eastus) |
| config.values.launch.cache_redis_azure_host | string | | Yes | Azure Cache for Redis hostname: <name>.redis.cache.windows.net:6380 |
| keyvaultName | string | llm-engine-keyvault | Yes | Azure Key Vault name for secret retrieval |
| celeryBrokerType | string | | Yes | Set to elasticache for the Azure Service Bus-backed broker |

Azure Service Bus: broker_pool_limit

When using Azure Service Bus as the Celery broker, do not set broker_pool_limit=0. This was previously thought to help with connection management but actually causes idle AMQP connections to drop, resulting in 503 errors on async endpoints. The fix (removing broker_pool_limit=0) is tracked in commit 9deb59f1. Leave this at the library default.

# Azure diff
config:
  values:
    infra:
      cloud_provider: azure
      default_region: eastus
      ml_account_id: "your-subscription-id"
      docker_repo_prefix: "myregistry.azurecr.io"
    launch:
      cache_redis_azure_host: my-llm-engine-cache.redis.cache.windows.net:6380
      # Do NOT set cache_redis_aws_url for Azure

keyvaultName: my-llm-engine-keyvault
celeryBrokerType: elasticache  # Azure Service Bus-backed

GCP / On-Premises (Diff from AWS)

| Value | Type | Default | Required | Description |
|---|---|---|---|---|
| config.values.infra.cloud_provider | string | | Yes | Set to onprem |
| config.values.launch.cache_redis_onprem_url | string | | Yes (one of) | Explicit Redis URL for on-prem: redis://redis:6379/0. Highest priority; takes precedence over all other Redis URL fields |
| celeryBrokerType | string | | Yes | Set to elasticache to use Redis as the Celery broker instead of SQS |
| celery_broker_type_redis | bool/null | null | No | Alternative override flag to force the Redis broker regardless of celeryBrokerType |

# On-prem / GCP diff
config:
  values:
    infra:
      cloud_provider: onprem
      default_region: us-central1
      ml_account_id: "my-gcp-project"
      docker_repo_prefix: "gcr.io/my-gcp-project"
    launch:
      cache_redis_onprem_url: redis://redis.llm-engine.svc.cluster.local:6379/0

celeryBrokerType: elasticache
celery_broker_type_redis: true

Cloud matrix

For a full per-cloud capability and limitation matrix, see cloud-matrix.md.


3. GPU / Hardware Config

Balloon Pods

Balloon pods are low-priority placeholder deployments that keep GPU nodes warm. When real inference pods need to be scheduled, they preempt the balloon pods, eliminating cold-start node provisioning time.

| Value | Type | Default | Required | Description |
|---|---|---|---|---|
| balloons[].acceleratorName | string | | Yes | GPU type identifier. Must match node labels. Supported: nvidia-ampere-a10, nvidia-ampere-a100, nvidia-tesla-t4, nvidia-hopper-h100, cpu |
| balloons[].replicaCount | integer | 0 | Yes | Number of balloon pods to maintain for this GPU type. Set to 0 to disable warming for that type |
| balloons[].gpuCount | integer | 1 | No | Number of GPUs each balloon pod requests. Relevant for multi-GPU balloon pods (e.g., gpuCount: 4 for H100 nodes) |
| balloonConfig.reserveHighPriority | bool | true | No | If true, only high-priority pods can preempt balloon pods. If false, any pod can preempt balloons, which may cause unintended disruption |
| balloonNodeSelector | map | {node-lifecycle: normal} | No | Node selector applied to all balloon pod deployments. Restricts balloons to on-demand (non-spot) nodes by default |

balloonConfig:
  reserveHighPriority: true

balloonNodeSelector:
  node-lifecycle: normal

balloons:
  - acceleratorName: nvidia-ampere-a10
    replicaCount: 2
  - acceleratorName: nvidia-ampere-a100
    replicaCount: 1
  - acceleratorName: nvidia-hopper-h100
    replicaCount: 1
    gpuCount: 4
  - acceleratorName: nvidia-tesla-t4
    replicaCount: 0
  - acceleratorName: cpu
    replicaCount: 0

Image Cache

Image caching pre-pulls large inference images onto GPU nodes so that endpoint scale-up does not spend time pulling multi-GB images. Each device entry specifies a node selector (and optional tolerations) to target a specific GPU node pool.

| Value | Type | Default | Required | Description |
|---|---|---|---|---|
| imageCache.devices[].name | string | | Yes | Logical name for this device pool (e.g., a10, h100) |
| imageCache.devices[].nodeSelector | map | | Yes | Label selector targeting nodes in this device pool |
| imageCache.devices[].tolerations | list | [] | No | Tolerations for the GPU taint. Required for GPU node pools with the nvidia.com/gpu:NoSchedule taint |

imageCache:
  devices:
    - name: cpu
      nodeSelector:
        cpu-only: "true"
    - name: a10
      nodeSelector:
        k8s.amazonaws.com/accelerator: nvidia-ampere-a10
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
    - name: a100
      nodeSelector:
        k8s.amazonaws.com/accelerator: nvidia-ampere-a100
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
    - name: t4
      nodeSelector:
        k8s.amazonaws.com/accelerator: nvidia-tesla-t4
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
    - name: h100
      nodeSelector:
        k8s.amazonaws.com/accelerator: nvidia-hopper-h100
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
    - name: h100-1g20gb
      nodeSelector:
        k8s.amazonaws.com/accelerator: nvidia-hopper-h100-1g20gb
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
    - name: h100-3g40gb
      nodeSelector:
        k8s.amazonaws.com/accelerator: nvidia-hopper-h100-3g40gb
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"

Recommended Hardware

These tables let LLM Engine suggest an appropriate hardware configuration when a user creates an endpoint without specifying one. The engine first checks byModelName for an exact match on the model name slug, then falls back to byGpuMemoryGb based on the model's estimated weight footprint in GPU memory.

byGpuMemoryGb

Tiers are evaluated in ascending gpu_memory_le order. The first tier where the model's estimated GPU memory requirement is less than or equal to gpu_memory_le is selected.

| gpu_memory_le (GB) | CPUs | GPUs | Memory | Storage | GPU Type | nodes_per_worker |
|---|---|---|---|---|---|---|
| 24 | 10 | 1 | 24Gi | 80Gi | nvidia-ampere-a10 | 1 |
| 48 | 20 | 2 | 48Gi | 80Gi | nvidia-ampere-a10 | 1 |
| 96 | 40 | 4 | 96Gi | 96Gi | nvidia-ampere-a10 | 1 |
| 180 | 20 | 2 | 160Gi | 160Gi | nvidia-hopper-h100 | 1 |
| 320 | 40 | 4 | 320Gi | 320Gi | nvidia-hopper-h100 | 1 |
| 640 | 80 | 8 | 800Gi | 640Gi | nvidia-hopper-h100 | 1 |
| 640 | 80 | 8 | 800Gi | 640Gi | nvidia-hopper-h100 | 2 |

byModelName

Exact overrides by model name slug. Takes precedence over byGpuMemoryGb.

| Model Name | CPUs | GPUs | Memory | Storage | GPU Type | nodes_per_worker |
|---|---|---|---|---|---|---|
| llama-3-8b-instruct-262k | 20 | 2 | 40Gi | 40Gi | nvidia-hopper-h100 | 1 |
| deepseek-coder-v2 | 160 | 8 | 800Gi | 640Gi | nvidia-hopper-h100 | 1 |
| deepseek-coder-v2-instruct | 160 | 8 | 800Gi | 640Gi | nvidia-hopper-h100 | 1 |

recommendedHardware:
  byGpuMemoryGb:
    - gpu_memory_le: 24
      cpus: 10
      gpus: 1
      memory: 24Gi
      storage: 80Gi
      gpu_type: nvidia-ampere-a10
      nodes_per_worker: 1
    - gpu_memory_le: 48
      cpus: 20
      gpus: 2
      memory: 48Gi
      storage: 80Gi
      gpu_type: nvidia-ampere-a10
      nodes_per_worker: 1
    # ... additional tiers
  byModelName:
    - name: deepseek-coder-v2-instruct
      cpus: 160
      gpus: 8
      memory: 800Gi
      storage: 640Gi
      gpu_type: nvidia-hopper-h100
      nodes_per_worker: 1

Control-Plane Node Selector, Tolerations, and Affinity

These apply to the LLM Engine control-plane deployments (gateway, cacher, builder) — not to inference endpoint pods.

| Value | Type | Default | Required | Description |
|---|---|---|---|---|
| nodeSelector | map | {node-lifecycle: normal} | No | Node selector for control-plane pods. Default pins to on-demand nodes |
| tolerations | list | [] | No | Tolerations for control-plane pods |
| affinity | map | {} | No | Affinity rules for control-plane pods |

nodeSelector:
  node-lifecycle: normal
  kubernetes.io/arch: amd64

tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "llm-engine"
    effect: "NoSchedule"

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: llm-engine-gateway
          topologyKey: kubernetes.io/hostname

4. Autoscaling

Gateway Horizontal Pod Autoscaler

The HPA governs the number of gateway replicas based on concurrent request load. This applies only to the control-plane gateway deployment, not to inference endpoint pods.

| Value | Type | Default | Required | Description |
|---|---|---|---|---|
| autoscaling.horizontal.enabled | bool | true | No | Enable the HPA for the gateway deployment |
| autoscaling.horizontal.minReplicas | integer | 2 | No | Minimum number of gateway replicas |
| autoscaling.horizontal.maxReplicas | integer | 10 | No | Maximum number of gateway replicas |
| autoscaling.horizontal.targetConcurrency | integer | 50 | No | Target average concurrent requests per replica before scaling out |
| autoscaling.vertical.enabled | bool | false | No | Enable the Vertical Pod Autoscaler (VPA) for control-plane deployments. Requires the VPA operator to be installed in the cluster |
| autoscaling.prewarming.enabled | bool | false | No | Enable endpoint pre-warming (reserved for future use) |

Celery Autoscaler

For async (queue-backed) endpoints, a separate Celery autoscaler process monitors queue depth and scales inference pods accordingly.

| Value | Type | Default | Required | Description |
|---|---|---|---|---|
| celery_autoscaler.enabled | bool | true | No | Enable the Celery autoscaler for async endpoint scaling |
| celery_autoscaler.num_shards | integer | 3 | No | Number of autoscaler shard instances. Additional shards reduce the per-shard queue-watching load at high endpoint counts |

KEDA (Scale-to-Zero for Sync Endpoints)

KEDA enables sync/streaming endpoints to scale from zero replicas to one when the first request arrives. This is distinct from the HPA (which cannot scale below minReplicas).

| Value | Type | Default | Required | Description |
|---|---|---|---|---|
| keda.cooldownPeriod | integer | 300 | No | Seconds KEDA waits after the last request before scaling a sync endpoint down to zero |
| config.values.infra.prometheus_server_address | string | unset | Yes (for KEDA) | Address of the Prometheus server that KEDA queries for endpoint request metrics |

KEDA requires Prometheus

config.values.infra.prometheus_server_address must be set for KEDA scale-to-zero to function. If it is unset, sync endpoints with min_workers=0 will silently fail to scale up from zero — the endpoint will appear healthy but all requests will hang until manually scaled.

KEDA vs HPA: mutual exclusivity

KEDA and the HPA are mutually exclusive per endpoint replica range. When an endpoint has min_workers=0, KEDA handles scaling from 0 to 1; once at 1+ replicas, the HPA (if configured) takes over scaling above 1. Do not configure both to manage the same range: KEDA only scales between 0 and 1 and does not replace the HPA beyond 1 replica.

autoscaling:
  horizontal:
    enabled: true
    minReplicas: 2
    maxReplicas: 10
    targetConcurrency: 50
  vertical:
    enabled: false

celery_autoscaler:
  enabled: true
  num_shards: 3

keda:
  cooldownPeriod: 300

config:
  values:
    infra:
      prometheus_server_address: "http://prometheus-server.istio-system.svc.cluster.local:80"

5. Networking

Istio VirtualService and DestinationRule

LLM Engine uses Istio for traffic routing when config.values.launch.istio_enabled is true. The VirtualService routes external traffic to the gateway service; the DestinationRule configures connection pool and outlier detection.

| Value | Type | Default | Required | Description |
|---|---|---|---|---|
| virtualservice.enabled | bool | true | No | Create an Istio VirtualService for the gateway |
| virtualservice.hostDomains | list | [llm-engine.domain.com] | Yes (if enabled) | List of hostnames this VirtualService responds to. Must match your Istio gateway configuration |
| virtualservice.gateways | list | [default/internal-gateway] | Yes (if enabled) | Istio Gateway resources to attach this VirtualService to. Format: <namespace>/<gateway-name> |
| virtualservice.annotations | map | {} | No | Additional annotations for the VirtualService resource |
| destinationrule.enabled | bool | true | No | Create an Istio DestinationRule for the gateway service |
| destinationrule.annotations | map | {} | No | Additional annotations for the DestinationRule resource |
| hostDomain.prefix | string | http:// | No | URL scheme prefix used when constructing endpoint host URLs. Set to https:// for TLS-terminated clusters |
| service.type | string | ClusterIP | No | Kubernetes Service type for the gateway. Use ClusterIP with Istio; change to LoadBalancer only if managing ingress outside Istio |
| service.port | integer | 80 | No | Port exposed by the gateway Kubernetes Service |
| config.values.launch.istio_enabled | bool | true | No | Whether the Istio service mesh is active. When false, VirtualService/DestinationRule resources are not used and direct service routing applies |

virtualservice:
  enabled: true
  hostDomains:
    - llm-engine.example.com
  gateways:
    - default/internal-gateway

destinationrule:
  enabled: true

hostDomain:
  prefix: https://

service:
  type: ClusterIP
  port: 80

config:
  values:
    launch:
      istio_enabled: true

Redis TLS and Authentication

These values control TLS and authentication for the Redis connection used by KEDA and endpoint metrics.

| Value | Type | Default | Required | Description |
|---|---|---|---|---|
| redis.enableTLS | bool | false | No | Enable TLS for the Redis connection. Required for Azure Cache for Redis (port 6380) and any Redis with TLS enforced |
| redis.enableAuth | bool | false | No | Enable password/token authentication for Redis. Required when the Redis cluster has AUTH configured |
| redis.auth | string | null | No | Redis AUTH password or token. Only used when enableAuth: true. Store this in a Kubernetes Secret rather than directly in values |
| redis.kedaSecretName | string | "" | No | Name of a Kubernetes Secret containing Redis credentials for KEDA's ScaledObject. KEDA reads this directly; leave empty to use unauthenticated Redis |
| redis.unsafeSsl | bool | false | No | Skip TLS certificate verification. Use only in development environments with self-signed certificates |

redis:
  enableTLS: true
  enableAuth: true
  auth: ""  # set via --set redis.auth=$REDIS_TOKEN or from a secret
  kedaSecretName: "keda-redis-secret"
  unsafeSsl: false

6. Observability

Datadog Integration

LLM Engine supports two Datadog toggles that must both be set consistently.

| Value | Type | Default | Required | Description |
|---|---|---|---|---|
| datadog.enabled | bool | false | No | Mount the Datadog agent socket and inject Datadog environment variables into control-plane pods. Requires the Datadog agent DaemonSet to be running on the cluster |
| dd_trace_enabled | bool | true | No | Top-level Helm toggle that controls whether the DD_TRACE_ENABLED environment variable is set to true in control-plane containers |
| config.values.launch.dd_trace_enabled | bool | false | No | Service-config-level toggle that controls whether the application code initializes the ddtrace library at startup. Must match dd_trace_enabled to avoid partial tracing |

Two toggles, one feature

dd_trace_enabled (top-level) and config.values.launch.dd_trace_enabled are independent toggles that together control Datadog APM. Setting only one of them produces a broken state: traces may be emitted but not received, or the agent socket may be mounted but no spans generated. Always set both to the same value.

datadog:
  enabled: true

dd_trace_enabled: true

config:
  values:
    launch:
      dd_trace_enabled: true

Logging

| Value | Type | Default | Required | Description |
|---|---|---|---|---|
| config.values.launch.sensitive_log_mode | bool | false | No | When true, suppresses logging of request/response payloads and other PII-containing fields. Enable in customer environments that process sensitive data |
| debug_mode | bool/null | null | No | Enables verbose debug logging across infrastructure components (gateway, cacher, builder). Produces high log volume; use only for troubleshooting |

config:
  values:
    launch:
      sensitive_log_mode: true

debug_mode: null  # set to true only during active debugging

7. Security / Compliance

Pod Security Context

The pod security context applies to all containers within a pod and controls user/group identity and filesystem permissions. Uncomment and set these values when using a hardened base image (e.g., Chainguard).

| Value | Type | Default | Required | Description |
|---|---|---|---|---|
| podSecurityContext.runAsUser | integer | unset | No | UID to run all containers as. Chainguard images use 65532 (nonroot) |
| podSecurityContext.runAsGroup | integer | unset | No | GID to run all containers as |
| podSecurityContext.runAsNonRoot | bool | unset | No | Enforce that no container runs as UID 0. Set to true for all production deployments |
| podSecurityContext.fsGroup | integer | unset | No | GID for volume mounts. Set to match runAsGroup so mounted secrets and configmaps are readable |

podSecurityContext:
  runAsUser: 65532
  runAsGroup: 65532
  runAsNonRoot: true
  fsGroup: 65532

Container Security Context

The container security context applies to each individual container and controls Linux capabilities and filesystem access.

| Value | Type | Default | Required | Description |
|---|---|---|---|---|
| containerSecurityContext.allowPrivilegeEscalation | bool | unset | No | Whether a process can gain more privileges than its parent via setuid/setgid binaries. Set to false in all production deployments |
| containerSecurityContext.readOnlyRootFilesystem | bool | unset | No | Mount the container root filesystem as read-only. Set to false if the application writes to /tmp or other paths on the root fs |
| containerSecurityContext.capabilities.drop | list | unset | No | Linux capabilities to drop. Set to ["ALL"] to remove all capabilities, then add back only what is needed |

containerSecurityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: false
  capabilities:
    drop:
      - ALL

Inference Pod Security (serviceTemplate)

These values apply to the inference endpoint pods created by the builder — not to the control-plane pods. They are injected into each endpoint's pod spec via the service template.

| Value | Type | Default | Required | Description |
|---|---|---|---|---|
| serviceTemplate.securityContext.capabilities.drop | list | ["all"] | No | Linux capabilities to drop from inference containers. Default drops all capabilities |
| serviceTemplate.mountInfraConfig | bool | true | No | Mount the infra ConfigMap into inference pods. Required for the endpoint to read cloud configuration |
| serviceTemplate.createServiceAccount | bool | true | No | Create a dedicated Kubernetes ServiceAccount for inference pods in the endpoint namespace |
| serviceTemplate.serviceAccountName | string | model-engine | No | Name of the ServiceAccount created for inference pods |
| serviceTemplate.serviceAccountAnnotations | map | | No | Annotations for the inference pod ServiceAccount. On EKS, set eks.amazonaws.com/role-arn to the inference IAM role |

serviceTemplate:
  securityContext:
    capabilities:
      drop:
        - all
  mountInfraConfig: true
  createServiceAccount: true
  serviceAccountName: model-engine
  serviceAccountAnnotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/llm-engine
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-weight": "-2"

FIPS / Federal Compliance

| Value | Type | Default | Required | Description |
|---|---|---|---|---|
| celery_enable_sha256 | bool/null | null | No | When true, forces Celery to use SHA-256 message signing instead of the default SHA-1. Required in FIPS-mode environments and any environment with federal compliance mandates (FedRAMP, IL4/IL5) |

Coordinated rollout required for celery_enable_sha256

Changing celery_enable_sha256 requires a coordinated rollout. In-flight Celery tasks signed with SHA-1 cannot be verified by workers expecting SHA-256, and vice versa. During the transition window, drain all queues before deploying new workers. Rolling updates without draining will cause task signature verification failures and silently dropped async requests.

celery_enable_sha256: true

8. Replica and Resource Tuning

Replica Counts

| Value | Type | Default | Required | Description |
|---|---|---|---|---|
| replicaCount.gateway | integer | 2 | No | Number of gateway replicas. Minimum 2 for production HA. Overridden by the HPA when autoscaling.horizontal.enabled: true |
| replicaCount.cacher | integer | 1 | No | Number of cacher replicas. The cacher maintains a local cache of Kubernetes state (endpoint pods, services). A single replica is usually sufficient |
| replicaCount.builder | integer | 1 | No | Number of builder replicas. The builder handles endpoint creation and image build jobs. A single replica is usually sufficient |

Resources

| Value | Type | Default | Required | Description |
|---|---|---|---|---|
| resources.requests.cpu | string/integer | 2 | No | CPU request for control-plane pods (gateway, cacher, builder) |
| resources.requests.memory | string | unset | No | Memory request for control-plane pods |
| resources.limits.cpu | string/integer | unset | No | CPU limit for control-plane pods |
| resources.limits.memory | string | unset | No | Memory limit for control-plane pods |

replicaCount:
  gateway: 2
  cacher: 1
  builder: 1

resources:
  requests:
    cpu: 2
    memory: 4Gi
  limits:
    cpu: 4
    memory: 8Gi

Pod Disruption Budget

| Value | Type | Default | Required | Description |
|---|---|---|---|---|
| podDisruptionBudget.enabled | bool | true | No | Create a PodDisruptionBudget for the gateway deployment to ensure availability during node drains and rolling updates |
| podDisruptionBudget.minAvailable | integer/string | 1 | No | Minimum number (or percentage) of gateway pods that must remain available during voluntary disruptions |
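
For example, to keep at least one gateway pod available through node drains (values mirror the defaults above):

podDisruptionBudget:
  enabled: true
  minAvailable: 1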

Database Initialization

| Value | Type | Default (values.yaml) | Default (values_sample.yaml) | Required | Description |
|---|---|---|---|---|---|
| db.runDbMigrationScript | bool | true | false | Yes on first install | Run Alembic schema migrations as a pre-install/pre-upgrade Job. Must be true on first install or the database schema will not be initialized |
| db.runDbInitScript | bool | false | false | No | Run the database initialization script (seed data). Only needed on fresh installs that require initial seed data |

First install: set runDbMigrationScript: true

values_sample.yaml ships with db.runDbMigrationScript: false. On a brand-new install, the database schema does not exist yet. Without migrations, model creation will fail with cryptic PostgreSQL errors about missing tables. Always override this to true on first install. After initial migration, subsequent upgrades will apply incremental migrations automatically when set to true.
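
A first-install snippet, matching the Minimal Working YAML above:

db:
  runDbMigrationScript: true  # required on first install
  runDbInitScript: false      # enable only if seed data is needed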

Database Engine Tuning

These values tune the SQLAlchemy connection pool. Defaults are appropriate for most deployments. Increase pool_size and max_overflow only when you observe connection exhaustion errors under high gateway concurrency.

| Value | Type | Default | Required | Description |
|---|---|---|---|---|
| config.values.infra.db_engine_pool_size | integer | 10 | No | Number of persistent connections in the SQLAlchemy connection pool per process |
| config.values.infra.db_engine_max_overflow | integer | 10 | No | Maximum number of connections allowed above pool_size. Total max connections = pool_size + max_overflow |
| config.values.infra.db_engine_echo | bool | false | No | Log all SQL statements. Produces extremely high log volume; use only for debugging SQL query issues |
| config.values.infra.db_engine_echo_pool | bool | false | No | Log all connection pool events (checkout, checkin, overflow). Use only for debugging connection pool exhaustion |
| config.values.infra.db_engine_disconnect_strategy | string | pessimistic | No | Strategy for detecting stale/broken connections. pessimistic tests the connection before each use (safe but adds a small latency); optimistic assumes connections are valid until proven otherwise |

config:
  values:
    infra:
      db_engine_pool_size: 10
      db_engine_max_overflow: 10
      db_engine_echo: false
      db_engine_echo_pool: false
      db_engine_disconnect_strategy: "pessimistic"

LLM Inference Image Repositories

These values specify the Docker repository paths for each supported inference backend. They are combined with config.values.infra.docker_repo_prefix at endpoint creation time to form the full image URI.

vllm_repository: always override in customer environments

The default value vllm is a short relative path that resolves to Scale's internal ECR registry when combined with Scale's docker_repo_prefix. In customer environments with a different registry prefix, endpoint pods will attempt to pull from a non-existent or inaccessible image path. The pods will appear to be INITIALIZING with no clear error. Always set vllm_repository to the full repository path or a prefix-relative path that exists in your registry.

| Value | Type | Default | Required | Description |
|---|---|---|---|---|
| config.values.launch.vllm_repository | string | vllm | Yes | Repository path for vLLM inference images. The most commonly used inference backend |
| config.values.launch.tensorrt_llm_repository | string | tensorrt-llm | No | Repository path for TensorRT-LLM inference images |
| config.values.launch.batch_inference_vllm_repository | string | llm-engine/batch-infer-vllm | No | Repository path for batch inference images (used by batch completion endpoints) |
| config.values.launch.tgi_repository | string | text-generation-inference | No | Repository path for HuggingFace Text Generation Inference images |
| config.values.launch.lightllm_repository | string | lightllm | No | Repository path for LightLLM inference images |
| config.values.launch.sglang_repository | string | null | No | Repository path for SGLang inference images. Optional; leave unset if SGLang is not used |
| config.values.launch.user_inference_base_repository | string | launch/inference | No | Base repository for custom user-defined inference images |
| config.values.launch.user_inference_pytorch_repository | string | launch/inference/pytorch | No | Repository for custom PyTorch inference images |
| config.values.launch.user_inference_tensorflow_repository | string | launch/inference/tf | No | Repository for custom TensorFlow inference images |
| config.values.launch.docker_image_layer_cache_repository | string | launch-docker-build-cache | No | Repository used as a layer cache during Docker image builds for custom endpoints |

config:
  values:
    launch:
      # Always override these in customer environments
      vllm_repository: "111122223333.dkr.ecr.us-east-1.amazonaws.com/vllm"
      tensorrt_llm_repository: "111122223333.dkr.ecr.us-east-1.amazonaws.com/tensorrt-llm"
      batch_inference_vllm_repository: "111122223333.dkr.ecr.us-east-1.amazonaws.com/llm-engine/batch-infer-vllm"
      tgi_repository: "111122223333.dkr.ecr.us-east-1.amazonaws.com/text-generation-inference"
      lightllm_repository: "111122223333.dkr.ecr.us-east-1.amazonaws.com/lightllm"
      user_inference_base_repository: "111122223333.dkr.ecr.us-east-1.amazonaws.com/launch/inference"
      user_inference_pytorch_repository: "111122223333.dkr.ecr.us-east-1.amazonaws.com/launch/inference/pytorch"
      user_inference_tensorflow_repository: "111122223333.dkr.ecr.us-east-1.amazonaws.com/launch/inference/tf"
      docker_image_layer_cache_repository: "111122223333.dkr.ecr.us-east-1.amazonaws.com/launch-docker-build-cache"

Fine-Tuning Storage

| Value | Type | Default | Required | Description |
|---|---|---|---|---|
| config.values.launch.s3_file_llm_fine_tuning_job_repository | string | s3://llm-engine/llm-ft-job-repository | Yes | S3 URI (or equivalent) where fine-tuning job artifacts (checkpoints, adapters) are stored |
| config.values.launch.hf_user_fine_tuned_weights_prefix | string | s3://llm-engine/fine_tuned_weights | Yes | S3 URI prefix for storing user-uploaded fine-tuned model weights |
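
For example, pointing both at the bucket from the Minimal Working YAML (bucket name is a placeholder):

config:
  values:
    launch:
      s3_file_llm_fine_tuning_job_repository: "s3://my-llm-engine-bucket/llm-ft-job-repository"
      hf_user_fine_tuned_weights_prefix: "s3://my-llm-engine-bucket/fine_tuned_weights"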

Image Pull Policy

| Value | Type | Default | Required | Description |
|---|---|---|---|---|
| image.pullPolicy | string | Always | No | Kubernetes image pull policy for all control-plane images. Always re-pulls the image on every pod start, which is required for mutable tags. Set to IfNotPresent to avoid redundant pulls when using immutable tags |
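
For example, when deploying with immutable tags:

image:
  pullPolicy: IfNotPresent  # keep Always if tags are mutable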

AWS ConfigMap

| Value | Type | Default | Required | Description |
|---|---|---|---|---|
| aws.configMap.name | string | default-config | No | Name of the Kubernetes ConfigMap containing the AWS CLI configuration |
| aws.configMap.create | bool | true | No | Whether to create the AWS ConfigMap as part of the Helm release |
| aws.profileName | string | default | No | AWS profile name to use from the ConfigMap |
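
A sketch mirroring the defaults:

aws:
  configMap:
    name: default-config
    create: true
  profileName: default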

Image Builder Service Account

| Value | Type | Default | Required | Description |
|---|---|---|---|---|
| imageBuilderServiceAccount.create | bool | true | No | Create a dedicated ServiceAccount for the image builder. This account needs ECR push/pull permissions |
| imageBuilderServiceAccount.annotations | map | | No | Annotations for the image builder ServiceAccount. On EKS, set eks.amazonaws.com/role-arn to a role with ECR permissions |

imageBuilderServiceAccount:
  create: true
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/k8s-main-llm-engine-image-builder

Miscellaneous

| Value | Type | Default | Required | Description |
|---|---|---|---|---|
| spellbook.enabled | bool | false | No | Enable Spellbook integration. Reserved for Scale internal use |
| context | string | production | No | Deployment context tag, used for labeling and log correlation. Set to a meaningful environment name (e.g., staging, production, customer-prod) |
| celery_broker_type_redis | bool/null | null | No | When true, forces the Celery broker to use Redis regardless of the celeryBrokerType value. Useful for on-prem and GCP deployments where SQS is unavailable |
| keyvaultName | string | llm-engine-keyvault | No | Azure Key Vault name. Only used when cloud_provider: azure |
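
For example, labeling a customer production deployment (the context value is illustrative):

spellbook:
  enabled: false

context: customer-prod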