Completions
Language Models are trained to predict natural language and provide text outputs as a response to their inputs. The inputs are called prompts, and the outputs are referred to as completions. To process a prompt and generate language, LLMs chunk the input into smaller units called tokens; tokens may include trailing spaces and even sub-words, and this tokenization process is language dependent.
Scale's LLM Engine provides access to open source language models (see Model Zoo) that can be used for producing completions to prompts.
Completion API call¶
An example API call looks as follows:
from llmengine import Completion

response = Completion.create(
    model="llama-2-7b",
    prompt="Hello, my name is",
    max_new_tokens=10,
    temperature=0.2,
)

print(response.json())
# '{"request_id": "c4bf0732-08e0-48a8-8b44-dfe8d4702fb0", "output": {"text": "________ and I am a ________", "num_completion_tokens": 10}}'

print(response.output.text)
# ________ and I am a ________
- model: The LLM you want to use (see Model Zoo).
- prompt: The main input for the LLM to respond to.
- max_new_tokens: The maximum number of tokens to generate in the completion.
- temperature: The sampling temperature to use. Higher values make the output more random, while lower values make it more focused and deterministic. When temperature is 0, greedy search is used.
See the full Completion API reference documentation to learn more.
Completion API response¶
An example Completion API response looks as follows:
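The fields below mirror the output of the call shown above (the request_id and completion text are illustrative):

{
  "request_id": "c4bf0732-08e0-48a8-8b44-dfe8d4702fb0",
  "output": {
    "text": "________ and I am a ________",
    "num_completion_tokens": 10
  }
}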
Token streaming¶
The Completions API supports token streaming to reduce perceived latency for certain applications. When streaming, tokens are sent as data-only server-sent events.

To enable token streaming, pass stream=True to either Completion.create or Completion.acreate.
Streaming Error Handling¶
Note: Error handling semantics are mixed for streaming calls:
- Errors that arise before streaming begins are returned to the user as HTTP errors with the appropriate status code.
- Errors that arise after streaming begins, within an HTTP 200 response, are returned to the user as plain-text messages and currently need to be handled by the client.
An example of token streaming using the synchronous Completions API looks as follows:
import sys

from llmengine import Completion

# errors occurring before streaming begins will be thrown here
stream = Completion.create(
    model="llama-2-7b",
    prompt="Give me a 200 word summary on the current economic events in the US.",
    max_new_tokens=1000,
    temperature=0.2,
    stream=True,
)

for response in stream:
    if response.output:
        print(response.output.text, end="")
        sys.stdout.flush()
    else:  # an error occurred after streaming began
        print(response.error)  # print the error message out
        break
Async requests¶
The Python client supports asyncio for creating Completions. Use Completion.acreate instead of Completion.create to utilize async processing. The function signatures are otherwise identical.
An example of async Completions looks as follows:
import asyncio

from llmengine import Completion

async def main():
    response = await Completion.acreate(
        model="llama-2-7b",
        prompt="Hello, my name is",
        max_new_tokens=10,
        temperature=0.2,
    )
    print(response.json())

asyncio.run(main())
Batch completions¶
The Python client also supports batch completions. Batch completions distribute data across multiple workers to accelerate inference and aim to maximize throughput, so completions should finish considerably faster than issuing the same requests individually over HTTP. Use Completion.batch_create to utilize batch completions.
Some examples of batch completions:
from llmengine import Completion
from llmengine.data_types import CreateBatchCompletionsModelConfig, CreateBatchCompletionsRequestContent

content = CreateBatchCompletionsRequestContent(
    prompts=["What is deep learning", "What is a neural network"],
    max_new_tokens=10,
    temperature=0.0,
)

response = Completion.batch_create(
    output_data_path="s3://my-path",
    model_config=CreateBatchCompletionsModelConfig(
        model="llama-2-7b",
        checkpoint_path="s3://checkpoint-path",
        labels={"team": "my-team", "product": "my-product"},
    ),
    content=content,
)
print(response.job_id)
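Request content can also be stored in an input file rather than passed inline, and inference can be distributed across multiple workers with data_parallelism: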
from llmengine import Completion
from llmengine.data_types import CreateBatchCompletionsModelConfig, CreateBatchCompletionsRequestContent

# Store CreateBatchCompletionsRequestContent data into input file "s3://my-input-path"

response = Completion.batch_create(
    input_data_path="s3://my-input-path",
    output_data_path="s3://my-output-path",
    model_config=CreateBatchCompletionsModelConfig(
        model="llama-2-7b",
        checkpoint_path="s3://checkpoint-path",
        labels={"team": "my-team", "product": "my-product"},
    ),
    data_parallelism=2,
)
print(response.job_id)
Batch completions also accept a tool_config. For how to properly use the tool, please see the Completion.batch_create tool_config documentation.
from llmengine import Completion
from llmengine.data_types import CreateBatchCompletionsModelConfig, CreateBatchCompletionsRequestContent, ToolConfig

# Store CreateBatchCompletionsRequestContent data into input file "s3://my-input-path"

response = Completion.batch_create(
    input_data_path="s3://my-input-path",
    output_data_path="s3://my-output-path",
    model_config=CreateBatchCompletionsModelConfig(
        model="llama-2-7b",
        checkpoint_path="s3://checkpoint-path",
        labels={"team": "my-team", "product": "my-product"},
    ),
    data_parallelism=2,
    tool_config=ToolConfig(
        name="code_evaluator",
    ),
)
print(response.json())
Guided decoding¶
Guided decoding is supported by vLLM and backed by Outlines. It enforces specific token generation patterns by constraining the sampling logits so that only tokens matching the requested pattern can be produced.
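To constrain the output with a regular expression, pass guided_regex: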
from llmengine import Completion

response = Completion.create(
    model="llama-2-7b",
    prompt="Hello, my name is",
    max_new_tokens=10,
    temperature=0.2,
    guided_regex="Sean.*",
)

print(response.json())
# {"request_id":"c19f0fae-317e-4f69-8e06-c04189299b9c","output":{"text":"Sean. I'm a 2","num_prompt_tokens":6,"num_completion_tokens":10,"tokens":null}}
from llmengine import Completion

response = Completion.create(
    model="llama-2-7b",
    prompt="Hello, my name is",
    max_new_tokens=10,
    temperature=0.2,
    guided_choice=["Sean", "Brian", "Tim"],
)

print(response.json())
# {"request_id":"641e2af3-a3e3-4493-98b9-d38115ba0d22","output":{"text":"Sean","num_prompt_tokens":6,"num_completion_tokens":4,"tokens":null}}
from llmengine import Completion

response = Completion.create(
    model="llama-2-7b",
    prompt="Hello, my name is",
    max_new_tokens=10,
    temperature=0.2,
    guided_json={"properties": {"myString": {"type": "string"}}, "required": ["myString"]},
)

print(response.json())
# {"request_id":"5b184654-96b6-4932-9eb6-382a51fdb3d5","output":{"text":"{\"myString\" : \"John Doe","num_prompt_tokens":6,"num_completion_tokens":10,"tokens":null}}
from llmengine import Completion

response = Completion.create(
    model="llama-2-7b",
    prompt="Hello, my name is",
    max_new_tokens=10,
    temperature=0.2,
    guided_grammar="start: \"John\"",
)

print(response.json())
# {"request_id": "34621b44-c655-402c-a459-f108b3e49b12", "output": {"text": "John", "num_prompt_tokens": 6, "num_completion_tokens": 4, "tokens": null}}
Which model should I use?¶
See the Model Zoo for more information and best practices on which model to use for Completions.