Completions
Language Models are trained to predict natural language and provide text outputs as a response to their inputs. The inputs are called prompts, and the outputs are referred to as completions. To process a prompt and generate language, LLMs chunk the input into smaller units called tokens; tokens may include trailing spaces and even sub-words, and this tokenization process is language dependent.
Scale's LLM Engine provides access to open source language models (see Model Zoo) that can be used for producing completions to prompts.
Completion API call¶
An example API call looks as follows:
from llmengine import Completion

response = Completion.create(
    model="llama-2-7b",
    prompt="Hello, my name is",
    max_new_tokens=10,
    temperature=0.2,
)

print(response.json())
# '{"request_id": "c4bf0732-08e0-48a8-8b44-dfe8d4702fb0", "output": {"text": "________ and I am a ________", "num_completion_tokens": 10}}'

print(response.output.text)
# ________ and I am a ________
- model: The LLM you want to use (see Model Zoo).
- prompt: The main input for the LLM to respond to.
- max_new_tokens: The maximum number of tokens to generate in the completion.
- temperature: The sampling temperature to use. Higher values make the output more random, while lower values make it more focused and deterministic. When temperature is 0, greedy search is used.
See the full Completion API reference documentation to learn more.
Completion API response¶
An example Completion API response looks as follows:
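The fields below mirror the output of the call shown above (the request_id and completion text are illustrative):

{
  "request_id": "c4bf0732-08e0-48a8-8b44-dfe8d4702fb0",
  "output": {
    "text": "________ and I am a ________",
    "num_completion_tokens": 10
  }
}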
Token streaming¶
The Completions API supports token streaming to reduce perceived latency for certain applications. When streaming, tokens are sent as data-only server-sent events.

To enable token streaming, pass stream=True to either Completion.create or Completion.acreate.
Streaming Error Handling¶
Note: Error handling semantics are mixed for streaming calls:
- Errors that arise before streaming begins are returned to the user as HTTP errors with the appropriate status code.
- Errors that arise after streaming begins, within an HTTP 200 response, are returned to the user as plain-text messages and currently need to be handled by the client.
An example of token streaming using the synchronous Completions API looks as follows:
import sys

from llmengine import Completion

# errors occurring before streaming begins will be thrown here
stream = Completion.create(
    model="llama-2-7b",
    prompt="Give me a 200 word summary on the current economic events in the US.",
    max_new_tokens=1000,
    temperature=0.2,
    stream=True,
)

for response in stream:
    if response.output:
        print(response.output.text, end="")
        sys.stdout.flush()
    else:  # an error occurred after streaming began
        print(response.error)  # print the error message out
        break
Async requests¶
The Python client supports asyncio for creating Completions. Use Completion.acreate instead of Completion.create to utilize async processing. The function signatures are otherwise identical.
An example of async Completions looks as follows:
import asyncio

from llmengine import Completion

async def main():
    response = await Completion.acreate(
        model="llama-2-7b",
        prompt="Hello, my name is",
        max_new_tokens=10,
        temperature=0.2,
    )
    print(response.json())

asyncio.run(main())
Batch completions¶
The Python client also supports batch completions. Batch completions distribute data across multiple workers to accelerate inference and aim to maximize throughput, so completions should finish considerably faster than issuing the same requests individually over HTTP. Use Completion.batch_create to utilize batch completions.
Some examples of batch completions:
from llmengine import Completion
from llmengine.data_types import CreateBatchCompletionsModelConfig, CreateBatchCompletionsRequestContent

content = CreateBatchCompletionsRequestContent(
    prompts=["What is deep learning", "What is a neural network"],
    max_new_tokens=10,
    temperature=0.0,
)

response = Completion.batch_create(
    output_data_path="s3://my-path",
    model_config=CreateBatchCompletionsModelConfig(
        model="llama-2-7b",
        checkpoint_path="s3://checkpoint-path",
        labels={"team": "my-team", "product": "my-product"},
    ),
    content=content,
)
print(response.job_id)
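Request content can also be stored in an input file rather than passed inline, and inference can be distributed across multiple workers with data_parallelism: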
from llmengine import Completion
from llmengine.data_types import CreateBatchCompletionsModelConfig, CreateBatchCompletionsRequestContent

# Store CreateBatchCompletionsRequestContent data into input file "s3://my-input-path"

response = Completion.batch_create(
    input_data_path="s3://my-input-path",
    output_data_path="s3://my-output-path",
    model_config=CreateBatchCompletionsModelConfig(
        model="llama-2-7b",
        checkpoint_path="s3://checkpoint-path",
        labels={"team": "my-team", "product": "my-product"},
    ),
    data_parallelism=2,
)
print(response.job_id)
Batch completions also accept a tool_config. For how to properly use the tool, please see the Completion.batch_create tool_config documentation.
from llmengine import Completion
from llmengine.data_types import CreateBatchCompletionsModelConfig, CreateBatchCompletionsRequestContent, ToolConfig

# Store CreateBatchCompletionsRequestContent data into input file "s3://my-input-path"

response = Completion.batch_create(
    input_data_path="s3://my-input-path",
    output_data_path="s3://my-output-path",
    model_config=CreateBatchCompletionsModelConfig(
        model="llama-2-7b",
        checkpoint_path="s3://checkpoint-path",
        labels={"team": "my-team", "product": "my-product"},
    ),
    data_parallelism=2,
    tool_config=ToolConfig(
        name="code_evaluator",
    ),
)
print(response.json())
Guided decoding¶
Guided decoding is supported by vLLM and backed by Outlines. It enforces specific token generation patterns by constraining the sampling logits so that only tokens matching the requested pattern can be produced.
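To constrain the output with a regular expression, pass guided_regex: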
from llmengine import Completion

response = Completion.create(
    model="llama-2-7b",
    prompt="Hello, my name is",
    max_new_tokens=10,
    temperature=0.2,
    guided_regex="Sean.*",
)

print(response.json())
# {"request_id":"c19f0fae-317e-4f69-8e06-c04189299b9c","output":{"text":"Sean. I'm a 2","num_prompt_tokens":6,"num_completion_tokens":10,"tokens":null}}
from llmengine import Completion

response = Completion.create(
    model="llama-2-7b",
    prompt="Hello, my name is",
    max_new_tokens=10,
    temperature=0.2,
    guided_choice=["Sean", "Brian", "Tim"],
)

print(response.json())
# {"request_id":"641e2af3-a3e3-4493-98b9-d38115ba0d22","output":{"text":"Sean","num_prompt_tokens":6,"num_completion_tokens":4,"tokens":null}}
from llmengine import Completion

response = Completion.create(
    model="llama-2-7b",
    prompt="Hello, my name is",
    max_new_tokens=10,
    temperature=0.2,
    guided_json={"properties": {"myString": {"type": "string"}}, "required": ["myString"]},
)

print(response.json())
# {"request_id":"5b184654-96b6-4932-9eb6-382a51fdb3d5","output":{"text":"{\"myString\" : \"John Doe","num_prompt_tokens":6,"num_completion_tokens":10,"tokens":null}}
from llmengine import Completion

response = Completion.create(
    model="llama-2-7b",
    prompt="Hello, my name is",
    max_new_tokens=10,
    temperature=0.2,
    guided_grammar="start: \"John\"",
)

print(response.json())
# {"request_id": "34621b44-c655-402c-a459-f108b3e49b12", "output": {"text": "John", "num_prompt_tokens": 6, "num_completion_tokens": 4, "tokens": null}}
Which model should I use?¶
See the Model Zoo for more information and best practices on which model to use for Completions.