Smoke Tests¶
Post-deploy validation checklist for Model Engine. Run these checks after every helm install or helm upgrade to verify the deployment is working correctly.
Last verified: [DATE] Verified by: [AUTHOR]
Test Tiers¶
Two tiers are defined to allow partial validation when GPU nodes are unavailable:
| Tier | Hardware | Model | What it covers |
|---|---|---|---|
| Tier A — CPU only | Any node, no GPU required | model_engine_server.inference.forwarding.echo_server (built into the model-engine image) | Service Builder flow, endpoint lifecycle, sync inference, async inference |
| Tier B — GPU + LLM | GPU node with NVIDIA driver | llama-3.1-8b via vLLM | vLLM image pull from customer registry, GPU scheduling, POST /v2/chat/completions sync and streaming |
Run Tier A first. If Tier A passes, the control plane, broker, database, and Service Builder are all functioning. Tier B additionally validates GPU scheduling and the LLM API layer.
v2 API only
Tier B only tests POST /v2/chat/completions. Do not use v1 completions endpoints in these smoke tests.
Required Environment Variables¶
Set these before running any commands in this document:
export NAMESPACE="{your-namespace}"
export GATEWAY_URL="http://model-engine.${NAMESPACE}.svc.cluster.local" # k8s cluster DNS
export AUTH_TOKEN="{your-auth-token}"
Phase 1: Pre-flight¶
Run these checks before helm install.
1.1 Redis reachability¶
- [ ] Redis responds to ping:
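A minimal sketch from a throwaway pod; {REDIS_HOST} and {REDIS_PORT} are placeholders, and you may need -a {REDIS_PASSWORD} or --tls depending on how your Redis is configured:
kubectl run redis-ping-test \
--image=redis:7 \
--restart=Never \
-n $NAMESPACE \
--command -- redis-cli -h {REDIS_HOST} -p {REDIS_PORT} ping
kubectl logs redis-ping-test -n $NAMESPACE
kubectl delete pod redis-ping-test -n $NAMESPACE
# Expected: PONG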
1.2 Database reachability¶
- [ ] Database accepts connections:
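One option, assuming a Postgres database; {DB_HOST} and {DB_PORT} are placeholders (pg_isready tests reachability only, not credentials):
kubectl run db-ping-test \
--image=postgres:15 \
--restart=Never \
-n $NAMESPACE \
--command -- pg_isready -h {DB_HOST} -p {DB_PORT}
kubectl logs db-ping-test -n $NAMESPACE
kubectl delete pod db-ping-test -n $NAMESPACE
# Expected: "{DB_HOST}:{DB_PORT} - accepting connections"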
1.3 vLLM image pullable from customer registry¶
Warning
Verify the image is accessible from your mirrored registry, not Scale's internal ECR. This is the #1 silent deployment failure.
- [ ] vLLM image is pullable:
kubectl run vllm-pull-test \
--image={VLLM_REPOSITORY}:{VLLM_TAG} \
--restart=Never \
--command -- echo "pull ok" \
-n $NAMESPACE
kubectl logs vllm-pull-test -n $NAMESPACE
kubectl delete pod vllm-pull-test -n $NAMESPACE
# Expected: "pull ok"
1.4 GPU driver functional on GPU nodes¶
- [ ] NVIDIA driver is working on GPU nodes:
# List GPU nodes
kubectl get nodes -l k8s.amazonaws.com/accelerator={GPU_TYPE}
# Run nvidia-smi on a GPU node
# nvidia-smi ships with the host driver, so chroot into the host filesystem
kubectl debug node/{GPU_NODE_NAME} \
-it \
--image=ubuntu:22.04 \
-- chroot /host nvidia-smi
# Expected: GPU device listed with driver version and CUDA version
1.5 GPU nodes labeled correctly¶
- [ ] Nodes carry the expected accelerator label:
kubectl get nodes --show-labels | grep accelerator
# Expected: nodes labeled with k8s.amazonaws.com/accelerator={GPU_TYPE}
1.6 Broker reachability¶
- [ ] IAM/IRSA SQS access works from the service account:
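A sketch using the AWS CLI under the model-engine service account; {MODEL_ENGINE_SERVICE_ACCOUNT} and {AWS_REGION} are placeholders, so check your helm values for the actual service account name:
kubectl run sqs-access-test \
--image=amazon/aws-cli:latest \
--restart=Never \
-n $NAMESPACE \
--overrides='{"spec": {"serviceAccountName": "{MODEL_ENGINE_SERVICE_ACCOUNT}"}}' \
--command -- aws sqs list-queues --region {AWS_REGION}
kubectl logs sqs-access-test -n $NAMESPACE
kubectl delete pod sqs-access-test -n $NAMESPACE
# Expected: a QueueUrls JSON object (possibly empty), with no AccessDenied error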
- [ ] Service Bus secret exists:
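The secret name depends on your helm values; a generic lookup:
kubectl get secrets -n $NAMESPACE | grep -i "servicebus\|service-bus"
# Expected: a secret containing the Service Bus connection string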
Warning
Azure Service Bus drops idle AMQP connections after 300 seconds. Ensure broker_pool_limit=0 is set in helm values. See Cloud Support Matrix.
Phase 2: Post-install Infrastructure¶
Run these checks immediately after helm install completes.
2.1 All pods Running and Ready¶
- [ ] All model-engine pods are Running and Ready:
kubectl get pods -n $NAMESPACE -l app=model-engine-gateway
kubectl get pods -n $NAMESPACE -l app=model-engine-builder
kubectl get pods -n $NAMESPACE -l app=model-engine-cacher
kubectl get pods -n $NAMESPACE -l app=model-engine-celery-autoscaler
- [ ] No pods in error states:
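A quick filter for the usual failure states:
kubectl get pods -n $NAMESPACE \
| grep -E "CrashLoopBackOff|ImagePullBackOff|Error|Pending"
# Expected: no output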
2.2 Gateway HPA exists¶
- [ ] HPA is configured for the gateway:
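HPA names vary by chart version; list them and look for the gateway entry:
kubectl get hpa -n $NAMESPACE
# Expected: an HPA targeting the gateway deployment, with TARGETS populated (not <unknown>)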
2.3 Celery Autoscaler StatefulSet ready¶
- [ ] StatefulSet shows desired replicas ready:
kubectl get statefulset -n $NAMESPACE -l app=model-engine-celery-autoscaler
# Expected: READY column matches DESIRED (e.g. 1/1)
2.4 Cacher log check¶
- [ ] Cacher is completing cache cycles (should appear within 15 seconds):
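The exact log wording varies by image version; look for periodic cache-update lines recurring at least every 15 seconds:
kubectl logs -n $NAMESPACE -l app=model-engine-cacher --tail=50 | grep -i cache
# Expected: recent, repeating cache-cycle log lines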
- [ ] No Redis auth errors:
kubectl logs -n $NAMESPACE -l app=model-engine-cacher --tail=100 \
| grep -iE "auth|error|exception|redis"
# Expected: no authentication errors
2.5 Builder Celery worker ready¶
- [ ] Builder shows Celery worker connected and ready:
kubectl logs -n $NAMESPACE -l app=model-engine-builder --tail=100 \
| grep -iE "ready|celery|connected|worker"
- [ ] No broker connection errors:
kubectl logs -n $NAMESPACE -l app=model-engine-builder --tail=100 \
| grep -iE "error|refused|timeout|cannot connect"
# Expected: no output
2.6 ConfigMap non-empty¶
- [ ] model-engine-config ConfigMap exists and has a populated data section:
kubectl get configmap model-engine-config -n $NAMESPACE -o yaml
# Expected: data section is non-empty
- [ ] Embedded kubeconfig has valid clusters and users:
kubectl get configmap model-engine-config -n $NAMESPACE \
-o jsonpath='{.data.kubeconfig}' \
| python3 -c "
import sys, yaml
c = yaml.safe_load(sys.stdin)
print('clusters:', len(c.get('clusters', [])), 'users:', len(c.get('users', [])))
"
# Expected: clusters: 1 users: 1
Phase 3: Control Plane¶
Commands use k8s cluster DNS. To run from outside the cluster, port-forward first:
kubectl port-forward svc/model-engine 8080:80 -n $NAMESPACE
# Then: export GATEWAY_URL=http://localhost:8080
3.1 Gateway health check¶
- [ ] Gateway returns 200:
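The health path shown here is an assumption; substitute your deployment's health endpoint if it differs:
curl -s -o /dev/null -w "%{http_code}\n" "$GATEWAY_URL/healthz"
# Expected: 200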
3.2 Authenticated list endpoints¶
- [ ] Returns 200 with valid list (empty is fine):
curl -sf \
-H "Authorization: Bearer $AUTH_TOKEN" \
"$GATEWAY_URL/v1/model-endpoints"
# Expected: HTTP 200, body: {"model_endpoints": []}
Phase 4: Endpoint Lifecycle — Tier A (CPU Only)¶
The echo server (model_engine_server.inference.forwarding.echo_server) is built into the model-engine image. It echoes request bodies unchanged — no GPU or model weights required.
4.1 Create sync CPU echo endpoint¶
- [ ] Create endpoint:
curl -sf -X POST \
-H "Authorization: Bearer {AUTH_TOKEN}" \
-H "Content-Type: application/json" \
-d '{
"name": "smoke-test-echo-sync",
"model_bundle": {
"flavor": "streaming_enhanced_runnable_image",
"repository": "{MODEL_ENGINE_IMAGE_REPOSITORY}",
"tag": "{MODEL_ENGINE_IMAGE_TAG}",
"command": [
"python", "-m",
"model_engine_server.inference.forwarding.echo_server",
"--port", "5005"
],
"streaming_command": [
"python", "-m",
"model_engine_server.inference.forwarding.echo_server",
"--port", "5005"
],
"env": {
"HTTP_HOST": "0.0.0.0",
"ML_INFRA_SERVICES_CONFIG_PATH": "/workspace/model-engine/model_engine_server/core/configs/default.yaml"
},
"protocol": "http",
"readiness_initial_delay_seconds": 20
},
"endpoint_type": "sync",
"cpus": "1",
"memory": "1Gi",
"storage": "2Gi",
"gpus": 0,
"min_workers": 1,
"max_workers": 1,
"per_worker": 1,
"labels": {"team": "infra", "product": "smoke-test"},
"metadata": {}
}' \
"{GATEWAY_URL}/v1/model-endpoints"
# Expected: HTTP 200, {"endpoint_creation_task_id": "..."}
4.2 Poll until READY¶
- [ ] Wait for status READY (typically 3–8 minutes for CPU):
for i in $(seq 1 50); do
STATUS=$(curl -sf \
-H "Authorization: Bearer $AUTH_TOKEN" \
"$GATEWAY_URL/v1/model-endpoints?name=smoke-test-echo-sync" \
| python3 -c "
import sys, json
eps = json.load(sys.stdin)['model_endpoints']
print(eps[0]['status'] if eps else 'NOT_FOUND')
")
echo "[$(date)] Status: $STATUS"
[ "$STATUS" = "READY" ] && break
sleep 30
done
Stuck INITIALIZING >15 min
Service Builder cannot reach the broker. Check builder logs:
kubectl logs -n $NAMESPACE -l app=model-engine-builder --tail=200
Status unknown
Redis auth is broken in the cacher. The endpoint may be running but the Gateway cannot read its state. Check cacher logs:
kubectl logs -n $NAMESPACE -l app=model-engine-cacher --tail=200
4.3 Send echo request and verify response¶
- [ ] Get endpoint ID and send request:
ENDPOINT_ID=$(curl -sf \
-H "Authorization: Bearer $AUTH_TOKEN" \
"$GATEWAY_URL/v1/model-endpoints?name=smoke-test-echo-sync" \
| python3 -c "import sys,json; print(json.load(sys.stdin)['model_endpoints'][0]['id'])")
curl -sf -X POST \
-H "Authorization: Bearer $AUTH_TOKEN" \
-H "Content-Type: application/json" \
-d '{"args": {"hello": "world"}}' \
"$GATEWAY_URL/v1/sync-tasks?model_endpoint_id=$ENDPOINT_ID"
# Expected: {"status": "SUCCESS", "result": {"hello": "world"}, ...}
4.4 Create async CPU echo endpoint¶
- [ ] Create async endpoint (same payload, endpoint_type: async):
curl -sf -X POST \
-H "Authorization: Bearer {AUTH_TOKEN}" \
-H "Content-Type: application/json" \
-d '{
"name": "smoke-test-echo-async",
"model_bundle": {
"flavor": "streaming_enhanced_runnable_image",
"repository": "{MODEL_ENGINE_IMAGE_REPOSITORY}",
"tag": "{MODEL_ENGINE_IMAGE_TAG}",
"command": [
"python", "-m",
"model_engine_server.inference.forwarding.echo_server",
"--port", "5005"
],
"streaming_command": [
"python", "-m",
"model_engine_server.inference.forwarding.echo_server",
"--port", "5005"
],
"env": {
"HTTP_HOST": "0.0.0.0",
"ML_INFRA_SERVICES_CONFIG_PATH": "/workspace/model-engine/model_engine_server/core/configs/default.yaml"
},
"protocol": "http",
"readiness_initial_delay_seconds": 20
},
"endpoint_type": "async",
"cpus": "1",
"memory": "1Gi",
"storage": "2Gi",
"gpus": 0,
"min_workers": 1,
"max_workers": 1,
"per_worker": 1,
"labels": {"team": "infra", "product": "smoke-test"},
"metadata": {}
}' \
"{GATEWAY_URL}/v1/model-endpoints"
- [ ] Poll until smoke-test-echo-async is READY (same pattern as 4.2).
4.5 Send async request and poll for SUCCESS¶
- [ ] Submit task and poll:
ASYNC_ENDPOINT_ID=$(curl -sf \
-H "Authorization: Bearer $AUTH_TOKEN" \
"$GATEWAY_URL/v1/model-endpoints?name=smoke-test-echo-async" \
| python3 -c "import sys,json; print(json.load(sys.stdin)['model_endpoints'][0]['id'])")
TASK_ID=$(curl -sf -X POST \
-H "Authorization: Bearer $AUTH_TOKEN" \
-H "Content-Type: application/json" \
-d '{"args": {"y": 1}, "url": null}' \
"$GATEWAY_URL/v1/async-tasks?model_endpoint_id=$ASYNC_ENDPOINT_ID" \
| python3 -c "import sys,json; print(json.load(sys.stdin)['task_id'])")
echo "Task ID: $TASK_ID"
for i in $(seq 1 30); do
RESULT=$(curl -sf \
-H "Authorization: Bearer $AUTH_TOKEN" \
"$GATEWAY_URL/v1/async-tasks/$TASK_ID")
STATUS=$(echo "$RESULT" | python3 -c "import sys,json; print(json.load(sys.stdin)['status'])")
echo "[$(date)] Task status: $STATUS"
[ "$STATUS" = "SUCCESS" ] && echo "Result: $RESULT" && break
[ "$STATUS" = "FAILURE" ] && echo "Task failed: $RESULT" && break
sleep 10
done
# Expected final status: SUCCESS
4.6 Cleanup — Tier A¶
- [ ] Delete both endpoints:
for NAME in smoke-test-echo-sync smoke-test-echo-async; do
ID=$(curl -sf \
-H "Authorization: Bearer $AUTH_TOKEN" \
"$GATEWAY_URL/v1/model-endpoints?name=$NAME" \
| python3 -c "import sys,json; eps=json.load(sys.stdin)['model_endpoints']; print(eps[0]['id'] if eps else '')")
[ -n "$ID" ] && curl -sf -X DELETE \
-H "Authorization: Bearer $AUTH_TOKEN" \
"$GATEWAY_URL/v1/model-endpoints/$ID"
echo "Deleted $NAME"
done
# Expected: {"deleted": true} for each
Phase 5: LLM Inference — Tier B (GPU + LLM)¶
GPU required
Complete Phase 1 checks 1.3–1.5 before proceeding. GPU nodes must be available and the vLLM image must be mirrored to the customer registry.
5.1 Create llama-3.1-8b LLM endpoint¶
- [ ] Create endpoint with min_workers=1:
curl -sf -X POST \
-H "Authorization: Bearer {AUTH_TOKEN}" \
-H "Content-Type: application/json" \
-d '{
"name": "smoke-test-llama-3-1-8b",
"model_name": "llama-3.1-8b",
"source": "hugging_face",
"inference_framework": "vllm",
"inference_framework_image_tag": "{VLLM_TAG}",
"endpoint_type": "streaming",
"cpus": 20,
"gpus": 1,
"gpu_type": "{GPU_TYPE}",
"memory": "20Gi",
"storage": "40Gi",
"optimize_costs": false,
"min_workers": 1,
"max_workers": 1,
"per_worker": 1,
"labels": {"team": "infra", "product": "smoke-test"},
"metadata": {},
"public_inference": false
}' \
"{GATEWAY_URL}/v1/llm/model-endpoints"
# Expected: HTTP 200
5.2 Poll until READY¶
- [ ] Allow up to 30 minutes for GPU image pull and model load:
for i in $(seq 1 60); do
STATUS=$(curl -sf \
-H "Authorization: Bearer $AUTH_TOKEN" \
"$GATEWAY_URL/v1/llm/model-endpoints/smoke-test-llama-3-1-8b" \
| python3 -c "import sys,json; print(json.load(sys.stdin).get('status','NOT_FOUND'))")
echo "[$(date)] Status: $STATUS"
[ "$STATUS" = "READY" ] && break
sleep 30
done
Status unknown
Redis authentication is broken in the K8s Cacher. Check cacher logs immediately:
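kubectl logs -n $NAMESPACE -l app=model-engine-cacher --tail=200 \
| grep -iE "auth|noauth"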
Stuck INITIALIZING — vLLM pod Pending
Check whether the vLLM pod has a scheduling or image pull issue:
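# Pod naming varies by chart; matching on the endpoint name usually finds the worker pod
kubectl get pods -n $NAMESPACE | grep smoke-test-llama
kubectl describe pod {VLLM_POD_NAME} -n $NAMESPACE
# Check Events for FailedScheduling (no schedulable GPU or missing accelerator label)
# or ImagePullBackOff (image not mirrored to the customer registry)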
5.3 POST /v2/chat/completions — sync¶
- [ ] Send sync chat completions request and verify non-empty response:
curl -sf -X POST \
-H "Authorization: Bearer {AUTH_TOKEN}" \
-H "Content-Type: application/json" \
-d '{
"model": "smoke-test-llama-3-1-8b",
"messages": [{"role": "user", "content": "Say hello in one sentence."}],
"max_tokens": 50,
"temperature": 0.0
}' \
"{GATEWAY_URL}/v2/chat/completions" \
| python3 -c "
import sys, json
r = json.load(sys.stdin)
content = r['choices'][0]['message']['content']
assert content, 'Empty response content'
print('Response:', content)
"
# Expected: non-empty assistant message
5.4 POST /v2/chat/completions — streaming (SSE)¶
- [ ] Verify SSE chunks arrive:
curl -sf -X POST \
-H "Authorization: Bearer {AUTH_TOKEN}" \
-H "Content-Type: application/json" \
-H "Accept: text/event-stream" \
-d '{
"model": "smoke-test-llama-3-1-8b",
"messages": [{"role": "user", "content": "Count from one to five."}],
"max_tokens": 50,
"temperature": 0.0,
"stream": true
}' \
"{GATEWAY_URL}/v2/chat/completions" \
| grep "^data:" | grep -v "\[DONE\]" | head -5
# Expected: one or more lines of the form:
# data: {"id":"...","object":"chat.completion.chunk","choices":[{"delta":{"content":"..."}}]}
5.5 Cleanup — Tier B¶
- [ ] Delete the LLM endpoint:
curl -sf -X DELETE \
-H "Authorization: Bearer $AUTH_TOKEN" \
"$GATEWAY_URL/v1/llm/model-endpoints/smoke-test-llama-3-1-8b"
# Expected: {"deleted": true}
- [ ] Verify GPU worker pods are terminated:
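Worker pod names vary by chart; matching on the endpoint name is usually enough:
kubectl get pods -n $NAMESPACE | grep smoke-test-llama
# Expected: no output (pods may show Terminating briefly after deletion)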
Common Failure Signatures¶
Seeded from real deployment incidents.
| Symptom | Likely cause | Where to look |
|---|---|---|
| Endpoint stuck INITIALIZING >15 min | Service Builder cannot reach message broker | kubectl logs -n $NAMESPACE -l app=model-engine-builder --tail=200 |
| Endpoint status unknown | Redis auth failure in K8s Cacher — Gateway cannot read endpoint state | kubectl logs -n $NAMESPACE -l app=model-engine-cacher --tail=200 — look for AUTH or NOAUTH errors |
| ImagePullBackOff on vLLM pods | vllm_repository points to Scale's internal ECR; image not mirrored | kubectl describe pod {POD} -n $NAMESPACE Events; check vllm_repository in helm values |
| GPU worker pods Pending indefinitely | NVIDIA driver not initialized, driver image not pullable, or no matching accelerator label | kubectl describe pod {POD} Events; kubectl debug node/{NODE} -- chroot /host nvidia-smi |
| Random 503s on inference after idle period | Azure Service Bus drops idle AMQP connections after 300s | Gateway logs for 503/timeout; verify broker_pool_limit=0 in helm values |
| Permission errors / empty kubeconfig on endpoint creation | model-engine-config ConfigMap has empty kubeconfig due to race condition at startup | kubectl get configmap model-engine-config -n $NAMESPACE -o yaml — check kubeconfig key |
| GET /v1/model-endpoints returns 500 | DB unreachable or schema not migrated (db.runDbMigrationScript: false on first install) | Gateway pod logs — look for psycopg2 or sqlalchemy errors |
| Builder logs show KombuError or OperationalError | celeryBrokerType mismatch or broker credentials wrong | Check celeryBrokerType in helm values matches deployed broker |