🐍 Python Client API Reference

Completion

Bases: APIEngine

Completion API. This API is used to generate text completions.

Language models are trained to understand natural language and predict text outputs as a response to their inputs. The inputs are called prompts and the outputs are referred to as completions. LLMs take the input prompts and chunk them into smaller units called tokens to process and generate language. Tokens may include trailing spaces and even sub-words; this process is language dependent.

The Completion API can be run either synchronously or asynchronously (via Python asyncio). In either mode, you can also choose whether to stream token responses.

create classmethod

create(
    model: str,
    prompt: str,
    max_new_tokens: int = 20,
    temperature: float = 0.2,
    stop_sequences: Optional[List[str]] = None,
    return_token_log_probs: Optional[bool] = False,
    presence_penalty: Optional[float] = None,
    frequency_penalty: Optional[float] = None,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    include_stop_str_in_output: Optional[bool] = False,
    guided_json: Optional[Dict[str, Any]] = None,
    guided_regex: Optional[str] = None,
    guided_choice: Optional[List[str]] = None,
    timeout: int = COMPLETION_TIMEOUT,
    stream: bool = False,
) -> Union[
    CompletionSyncResponse,
    Iterator[CompletionStreamResponse],
]

Creates a completion for the provided prompt and parameters synchronously.

This API can be used to get the LLM to generate a completion synchronously. It takes as parameters the model (see Model Zoo) and the prompt. Optionally it takes max_new_tokens, temperature, timeout and stream. It returns a CompletionSyncResponse if stream=False, or an iterator of CompletionStreamResponse if stream=True; both carry request_id and output fields.

Parameters:

Name Type Description Default
model str

Name of the model to use. See Model Zoo for a list of Models that are supported.

required
prompt str

The prompt to generate completions for, encoded as a string.

required
max_new_tokens int

The maximum number of tokens to generate in the completion.

The token count of your prompt plus max_new_tokens cannot exceed the model's context length. See Model Zoo for information on each supported model's context length.

20
temperature float

What sampling temperature to use, in the range [0, 1]. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. When temperature is 0 greedy search is used.

0.2
stop_sequences Optional[List[str]]

One or more sequences where the API will stop generating tokens for the current completion.

None
return_token_log_probs Optional[bool]

Whether to return the log probabilities of generated tokens. When True, the response will include a list of tokens and their log probabilities.

False
presence_penalty Optional[float]

Only supported in vllm and lightllm. Penalizes new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. See https://platform.openai.com/docs/guides/gpt/parameter-details. Range: [0.0, 2.0]. Higher values encourage the model to use new tokens.

None
frequency_penalty Optional[float]

Only supported in vllm and lightllm. Penalizes new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. See https://platform.openai.com/docs/guides/gpt/parameter-details. Range: [0.0, 2.0]. Higher values encourage the model to use new tokens.

None
top_k Optional[int]

Integer that controls the number of top tokens to consider. Range: [1, infinity). -1 means consider all tokens.

None
top_p Optional[float]

Float that controls the cumulative probability of the top tokens to consider. Range: (0.0, 1.0]. 1.0 means consider all tokens.

None
include_stop_str_in_output Optional[bool]

Whether to include the stop sequence in the output. Defaults to False.

False
guided_json Optional[Dict[str, Any]]

If specified, the output will follow the JSON schema.

None
guided_regex Optional[str]

If specified, the output will follow the regex pattern.

None
guided_choice Optional[List[str]]

If specified, the output will be exactly one of the choices.

None
timeout int

Timeout in seconds. This is the maximum amount of time you are willing to wait for a response.

COMPLETION_TIMEOUT
stream bool

Whether to stream the response. If true, the return type is an Iterator[CompletionStreamResponse]. Otherwise, the return type is a CompletionSyncResponse. When streaming, tokens will be sent as data-only server-sent events.

False
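
The documented ranges above can be checked locally before a request is sent. The following is a minimal sketch, not part of the llmengine client: check_sampling_params and fits_context are hypothetical helper names, and the token counts passed to fits_context are assumed to come from whatever tokenizer you use for the chosen model.

```python
from typing import List, Optional


def check_sampling_params(
    temperature: float = 0.2,
    presence_penalty: Optional[float] = None,
    frequency_penalty: Optional[float] = None,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
) -> List[str]:
    """Return a list of problems; an empty list means every value is in its documented range."""
    problems = []
    if not 0.0 <= temperature <= 1.0:
        problems.append("temperature must be in [0, 1]")
    for name, value in (
        ("presence_penalty", presence_penalty),
        ("frequency_penalty", frequency_penalty),
    ):
        if value is not None and not 0.0 <= value <= 2.0:
            problems.append(f"{name} must be in [0.0, 2.0]")
    # -1 is the documented "consider all tokens" value for top_k.
    if top_k is not None and top_k != -1 and top_k < 1:
        problems.append("top_k must be >= 1, or -1 to consider all tokens")
    if top_p is not None and not 0.0 < top_p <= 1.0:
        problems.append("top_p must be in (0.0, 1.0]")
    return problems


def fits_context(prompt_tokens: int, max_new_tokens: int, context_length: int) -> bool:
    """Prompt tokens plus max_new_tokens must not exceed the model's context length."""
    return prompt_tokens + max_new_tokens <= context_length


print(check_sampling_params(temperature=0.2, top_k=-1, top_p=0.9))
print(fits_context(prompt_tokens=4090, max_new_tokens=20, context_length=4096))
```

Running such checks client-side only saves a round trip; the server remains the authority on what it accepts.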

Returns:

Name Type Description
response Union[CompletionSyncResponse, Iterator[CompletionStreamResponse]]

The generated response (if stream=False) or iterator of response chunks (if stream=True)

from llmengine import Completion

response = Completion.create(
    model="llama-2-7b",
    prompt="Hello, my name is",
    max_new_tokens=10,
    temperature=0.2,
)
print(response.json())
{
    "request_id": "8bbd0e83-f94c-465b-a12b-aabad45750a9",
    "output": {
        "text": "_______ and I am a _______",
        "num_completion_tokens": 10
    }
}

Token streaming can be used to reduce perceived latency for applications. Here is how applications can use streaming:

from llmengine import Completion

stream = Completion.create(
    model="llama-2-7b",
    prompt="why is the sky blue?",
    max_new_tokens=5,
    temperature=0.2,
    stream=True,
)

for response in stream:
    if response.output:
        print(response.json())
{"request_id": "ebbde00c-8c31-4c03-8306-24f37cd25fa2", "output": {"text": "\n", "finished": false, "num_completion_tokens": 1 } }
{"request_id": "ebbde00c-8c31-4c03-8306-24f37cd25fa2", "output": {"text": "I", "finished": false, "num_completion_tokens": 2 } }
{"request_id": "ebbde00c-8c31-4c03-8306-24f37cd25fa2", "output": {"text": " don", "finished": false, "num_completion_tokens": 3 } }
{"request_id": "ebbde00c-8c31-4c03-8306-24f37cd25fa2", "output": {"text": "’", "finished": false, "num_completion_tokens": 4 } }
{"request_id": "ebbde00c-8c31-4c03-8306-24f37cd25fa2", "output": {"text": "t", "finished": true, "num_completion_tokens": 5 } }
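
Each streamed chunk in the output above carries one piece of text, so an application usually joins the pieces back together. The sketch below assumes only the output.text and output.finished fields shown in that output; collect_stream_text is a hypothetical helper, and the SimpleNamespace objects are offline stand-ins for CompletionStreamResponse so the example runs without an endpoint.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream_text(stream: Iterable) -> str:
    """Concatenate the text of streamed chunks, stopping at the chunk marked finished."""
    parts = []
    for response in stream:
        output = response.output
        if not output:
            # Chunks without an output are skipped, as in the loop above.
            continue
        parts.append(output.text)
        if output.finished:
            break
    return "".join(parts)


# Offline stand-ins that mirror the five chunks printed above.
fake_stream = [
    SimpleNamespace(output=SimpleNamespace(text=text, finished=finished))
    for text, finished in [("\n", False), ("I", False), (" don", False), ("’", False), ("t", True)]
]
full_text = collect_stream_text(fake_stream)
print(repr(full_text))
```

In real use, the iterator returned by Completion.create(..., stream=True) would be passed in place of fake_stream.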

acreate async classmethod

acreate(
    model: str,
    prompt: str,
    max_new_tokens: int = 20,
    temperature: float = 0.2,
    stop_sequences: Optional[List[str]] = None,
    return_token_log_probs: Optional[bool] = False,
    presence_penalty: Optional[float] = None,
    frequency_penalty: Optional[float] = None,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    include_stop_str_in_output: Optional[bool] = False,
    guided_json: Optional[Dict[str, Any]] = None,
    guided_regex: Optional[str] = None,
    guided_choice: Optional[List[str]] = None,
    timeout: int = COMPLETION_TIMEOUT,
    stream: bool = False,
) -> Union[
    CompletionSyncResponse,
    AsyncIterable[CompletionStreamResponse],
]

Creates a completion for the provided prompt and parameters asynchronously (with asyncio).

This API can be used to get the LLM to generate a completion asynchronously. It takes as parameters the model (see Model Zoo) and the prompt. Optionally it takes max_new_tokens, temperature, timeout and stream. It returns a CompletionSyncResponse if stream=False, or an async iterator of CompletionStreamResponse if stream=True; both carry request_id and output fields.

Parameters:

Name Type Description Default
model str

Name of the model to use. See Model Zoo for a list of Models that are supported.

required
prompt str

The prompt to generate completions for, encoded as a string.

required
max_new_tokens int

The maximum number of tokens to generate in the completion.

The token count of your prompt plus max_new_tokens cannot exceed the model's context length. See Model Zoo for information on each supported model's context length.

20
temperature float

What sampling temperature to use, in the range [0, 1]. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. When temperature is 0 greedy search is used.

0.2
stop_sequences Optional[List[str]]

One or more sequences where the API will stop generating tokens for the current completion.

None
return_token_log_probs Optional[bool]

Whether to return the log probabilities of generated tokens. When True, the response will include a list of tokens and their log probabilities.

False
presence_penalty Optional[float]

Only supported in vllm and lightllm. Penalizes new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. See https://platform.openai.com/docs/guides/gpt/parameter-details. Range: [0.0, 2.0]. Higher values encourage the model to use new tokens.

None
frequency_penalty Optional[float]

Only supported in vllm and lightllm. Penalizes new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. See https://platform.openai.com/docs/guides/gpt/parameter-details. Range: [0.0, 2.0]. Higher values encourage the model to use new tokens.

None
top_k Optional[int]

Integer that controls the number of top tokens to consider. Range: [1, infinity). -1 means consider all tokens.

None
top_p Optional[float]

Float that controls the cumulative probability of the top tokens to consider. Range: (0.0, 1.0]. 1.0 means consider all tokens.

None
include_stop_str_in_output Optional[bool]

Whether to include the stop sequence in the output. Defaults to False.

False
guided_json Optional[Dict[str, Any]]

If specified, the output will follow the JSON schema. For examples see https://json-schema.org/learn/miscellaneous-examples.

None
guided_regex Optional[str]

If specified, the output will follow the regex pattern.

None
guided_choice Optional[List[str]]

If specified, the output will be exactly one of the choices.

None
timeout int

Timeout in seconds. This is the maximum amount of time you are willing to wait for a response.

COMPLETION_TIMEOUT
stream bool

Whether to stream the response. If true, the return type is an AsyncIterable[CompletionStreamResponse]. Otherwise, the return type is a CompletionSyncResponse. When streaming, tokens will be sent as data-only server-sent events.

False
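
To make the three guided_* parameters concrete, here is a sketch of the kind of value each one takes, plus local sanity checks on a returned completion. The schema, pattern, choices and helper names are illustrative assumptions; has_required_keys only checks that the text parses as a JSON object with the schema's required keys and is not a full JSON Schema validator.

```python
import json
import re
from typing import Any, Dict, List

# Example values one might pass as guided_choice, guided_regex and guided_json.
sentiment_choices: List[str] = ["positive", "negative", "neutral"]
date_regex: str = r"\d{4}-\d{2}-\d{2}"
person_schema: Dict[str, Any] = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}


def matches_choice(text: str, choices: List[str]) -> bool:
    """guided_choice: the output should be exactly one of the choices."""
    return text in choices


def matches_regex(text: str, pattern: str) -> bool:
    """guided_regex: the whole output should match the pattern."""
    return re.fullmatch(pattern, text) is not None


def has_required_keys(text: str, schema: Dict[str, Any]) -> bool:
    """guided_json: a light check that the output is a JSON object with the required keys."""
    try:
        value = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(value, dict) and all(key in value for key in schema.get("required", []))


print(matches_choice("positive", sentiment_choices))
print(matches_regex("2023-07-17", date_regex))
print(has_required_keys('{"name": "Ada", "age": 36}', person_schema))
```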

Returns:

Name Type Description
response Union[CompletionSyncResponse, AsyncIterable[CompletionStreamResponse]]

The generated response (if stream=False) or async iterator of response chunks (if stream=True)

import asyncio
from llmengine import Completion

async def main():
    response = await Completion.acreate(
        model="llama-2-7b",
        prompt="Hello, my name is",
        max_new_tokens=10,
        temperature=0.2,
    )
    print(response.json())

asyncio.run(main())
{
    "request_id": "9cfe4d5a-f86f-4094-a935-87f871d90ec0",
    "output": {
        "text": "_______ and I am a _______",
        "num_completion_tokens": 10
    }
}

Token streaming can be used to reduce perceived latency for applications. Here is how applications can use streaming:

import asyncio
from llmengine import Completion

async def main():
    stream = await Completion.acreate(
        model="llama-2-7b",
        prompt="why is the sky blue?",
        max_new_tokens=5,
        temperature=0.2,
        stream=True,
    )

    async for response in stream:
        if response.output:
            print(response.json())

asyncio.run(main())
{"request_id": "9cfe4d5a-f86f-4094-a935-87f871d90ec0", "output": {"text": "\n", "finished": false, "num_completion_tokens": 1}}
{"request_id": "9cfe4d5a-f86f-4094-a935-87f871d90ec0", "output": {"text": "I", "finished": false, "num_completion_tokens": 2}}
{"request_id": "9cfe4d5a-f86f-4094-a935-87f871d90ec0", "output": {"text": " think", "finished": false, "num_completion_tokens": 3}}
{"request_id": "9cfe4d5a-f86f-4094-a935-87f871d90ec0", "output": {"text": " the", "finished": false, "num_completion_tokens": 4}}
{"request_id": "9cfe4d5a-f86f-4094-a935-87f871d90ec0", "output": {"text": " sky", "finished": true, "num_completion_tokens": 5}}

batch_create classmethod

batch_create(
    output_data_path: str,
    model_config: CreateBatchCompletionsModelConfig,
    content: Optional[
        CreateBatchCompletionsRequestContent
    ] = None,
    input_data_path: Optional[str] = None,
    data_parallelism: int = 1,
    max_runtime_sec: int = 24 * 3600,
    tool_config: Optional[ToolConfig] = None,
) -> CreateBatchCompletionsResponse

Creates a batch completion for the provided input data. The job runs offline and does not depend on an existing model endpoint.

Prompts can be passed in from an input file, or as a part of the request.

Parameters:

Name Type Description Default
output_data_path str

The path to the output file. The output file will be a JSON file containing the completions.

required
model_config CreateBatchCompletionsModelConfig

The model configuration to use for the batch completion.

required
content Optional[CreateBatchCompletionsRequestContent]

The content to use for the batch completion. Either one of content or input_data_path must be provided.

None
input_data_path Optional[str]

The path to the input file. The input file should be a JSON file with data of type CreateBatchCompletionsRequestContent. Either one of content or input_data_path must be provided.

None
data_parallelism int

The number of parallel jobs to run. Data will be evenly distributed to the jobs. Defaults to 1.

1
max_runtime_sec int

The maximum runtime of the batch completion in seconds. Defaults to 24 hours.

24 * 3600
tool_config Optional[ToolConfig]

Configuration for tool use. NOTE: this config is highly experimental and its signature will change significantly in future iterations. Currently only the Python code evaluator is supported. Python code context starts with "```python\n" and ends with "\n>>>\n"; data before "\n```\n" and content end will be replaced by the Python execution results. Please format prompts accordingly and provide examples so that LLMs can properly generate Python code.

None

Returns:

Name Type Description
response CreateBatchCompletionsResponse

The response containing the job id.

from llmengine import Completion
from llmengine.data_types import CreateBatchCompletionsModelConfig, CreateBatchCompletionsRequestContent

response = Completion.batch_create(
    output_data_path="s3://my-path",
    model_config=CreateBatchCompletionsModelConfig(
        model="llama-2-7b",
        checkpoint_path="s3://checkpoint-path",
        labels={"team":"my-team", "product":"my-product"}
    ),
    content=CreateBatchCompletionsRequestContent(
        prompts=["What is deep learning", "What is a neural network"],
        max_new_tokens=10,
        temperature=0.0
    )
)
print(response.json())

from llmengine import Completion
from llmengine.data_types import CreateBatchCompletionsModelConfig, CreateBatchCompletionsRequestContent

# Store CreateBatchCompletionsRequestContent data into input file "s3://my-input-path"

response = Completion.batch_create(
    input_data_path="s3://my-input-path",
    output_data_path="s3://my-output-path",
    model_config=CreateBatchCompletionsModelConfig(
        model="llama-2-7b",
        checkpoint_path="s3://checkpoint-path",
        labels={"team":"my-team", "product":"my-product"}
    ),
    data_parallelism=2
)
print(response.json())

from llmengine import Completion
from llmengine.data_types import CreateBatchCompletionsModelConfig, CreateBatchCompletionsRequestContent, ToolConfig

# Store CreateBatchCompletionsRequestContent data into input file "s3://my-input-path"

response = Completion.batch_create(
    input_data_path="s3://my-input-path",
    output_data_path="s3://my-output-path",
    model_config=CreateBatchCompletionsModelConfig(
        model="llama-2-7b",
        checkpoint_path="s3://checkpoint-path",
        labels={"team":"my-team", "product":"my-product"}
    ),
    data_parallelism=2,
    tool_config=ToolConfig(
        name="code_evaluator",
    )
)
print(response.json())
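
The examples that read from input_data_path assume the request content has already been stored as a JSON file. The following sketch writes such a file locally. It assumes the file holds a single JSON object with the same fields used in the inline content example above (prompts, max_new_tokens, temperature); write_batch_input is a hypothetical helper, and copying the file to a location such as "s3://my-input-path" is left to your own tooling.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def write_batch_input(
    path: Path,
    prompts: List[str],
    max_new_tokens: int = 10,
    temperature: float = 0.0,
) -> Dict[str, Any]:
    """Write the batch request content to a local JSON file and return what was written."""
    content = {
        "prompts": prompts,
        "max_new_tokens": max_new_tokens,
        "temperature": temperature,
    }
    path.write_text(json.dumps(content, indent=2), encoding="utf-8")
    return content


with tempfile.TemporaryDirectory() as tmp_dir:
    local_file = Path(tmp_dir) / "batch_input.json"
    written = write_batch_input(local_file, ["What is deep learning", "What is a neural network"])
    # Read the file back to confirm it round-trips as JSON.
    loaded = json.loads(local_file.read_text(encoding="utf-8"))

print(loaded["prompts"])
```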

FineTune

Bases: APIEngine

FineTune API. This API is used to fine-tune models.

Fine-tuning is a process where the LLM is further trained on a task-specific dataset, allowing the model to adjust its parameters to better align with the task at hand. Fine-tuning is a supervised training phase, where prompt/response pairs are provided to optimize the performance of the LLM. LLM Engine currently uses LoRA for fine-tuning. Support for additional fine-tuning methods is upcoming.

LLM Engine provides APIs to create fine-tunes on a base model with training & validation datasets. APIs are also provided to list, cancel and retrieve fine-tuning jobs.

Creating a fine-tune will end with the creation of a Model, which you can view using Model.get(model_name) or delete using Model.delete(model_name).

create classmethod

create(
    model: str,
    training_file: str,
    validation_file: Optional[str] = None,
    hyperparameters: Optional[
        Dict[str, Union[str, int, float]]
    ] = None,
    wandb_config: Optional[Dict[str, Any]] = None,
    suffix: Optional[str] = None,
) -> CreateFineTuneResponse

Creates a job that fine-tunes a specified model with a given dataset.

This API can be used to fine-tune a model. The model is the name of the base model to fine-tune (see Model Zoo for available models). The training and validation files should consist of prompt and response pairs. training_file and validation_file must be either publicly accessible HTTP or HTTPS URLs, or file IDs of files uploaded to LLM Engine's Files API (these will have the file- prefix). The referenced files must be CSV files that include two columns: prompt and response. A maximum of 100,000 rows of data is currently supported. At least 200 rows of data are recommended to start seeing benefits from fine-tuning. Sequences longer than the native max_seq_length of the model will be truncated.

A fine-tuning job can take roughly 30 minutes for a small dataset (~200 rows) and several hours for larger ones.

Parameters:

Name Type Description Default
model `str`

The name of the base model to fine-tune. See Model Zoo for the list of available models to fine-tune.

required
training_file `str`

Publicly accessible URL or file ID referencing a CSV file for training. When no validation_file is provided, one will automatically be created using a 10% split of the training_file data.

required
validation_file `Optional[str]`

Publicly accessible URL or file ID referencing a CSV file for validation. The validation file is used to compute metrics which let LLM Engine pick the best fine-tuned checkpoint, which will be used for inference when fine-tuning is complete.

None
hyperparameters `Optional[Dict[str, Union[str, int, float, Dict[str, Any]]]]`

A dict of hyperparameters to customize fine-tuning behavior.

Currently supported hyperparameters:

  • lr: Peak learning rate used during fine-tuning. It decays with a cosine schedule afterward. (Default: 2e-3)
  • warmup_ratio: Ratio of training steps used for learning rate warmup. (Default: 0.03)
  • epochs: Number of fine-tuning epochs. This should be less than 20. (Default: 5)
  • weight_decay: Regularization penalty applied to learned weights. (Default: 0.001)
  • peft_config: A dict of parameters for the PEFT algorithm. See LoraConfig for more information.
None
wandb_config `Optional[Dict[str, Any]]`

A dict of configuration parameters for Weights & Biases. See Weights & Biases for more information. Set hyperparameters["report_to"] to "wandb" to enable automatic fine-tune metrics logging. Must include an api_key field containing the wandb API key. Also supports setting base_url to use a custom Weights & Biases server.

None
suffix `Optional[str]`

A string that will be added to your fine-tuned model name. If present, the entire fine-tuned model name will be formatted like "[model].[suffix].[YYMMDD-HHMMSS]". If absent, the fine-tuned model name will be formatted "[model].[YYMMDD-HHMMSS]". For example, if suffix is "my-experiment", the fine-tuned model name could be "llama-2-7b.my-experiment.230717-230150". Note: suffix must be between 1 and 28 characters long, and can only contain alphanumeric characters and hyphens.

None
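
The lr and warmup_ratio hyperparameters listed above describe a warmup phase followed by cosine decay. The sketch below illustrates that shape with the documented defaults. It is an illustration only, not LLM Engine's training code: the linear form of the warmup, the decay to zero, and the function name lr_at are assumptions.

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float = 2e-3, warmup_ratio: float = 0.03) -> float:
    """Learning rate at a given step: linear warmup to peak_lr, then cosine decay to 0."""
    warmup_steps = max(1, round(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Assumed linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Fraction of the post-warmup schedule that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 15, 30, 515, 1000):
    print(step, lr_at(step, total_steps=1000))
```

With 1000 total steps and the default warmup_ratio, the first 30 steps ramp up and the remaining 970 follow the cosine curve.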

Returns:

Name Type Description
CreateFineTuneResponse CreateFineTuneResponse

an object that contains the ID of the created fine-tuning job
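
The naming rule for suffix can be sketched as follows. Both helpers are hypothetical and only mirror the format and limits described above; "alphanumeric" is assumed to mean ASCII letters and digits, and the timestamp is supplied by the caller because the real one is assigned by the service.

```python
import re
from datetime import datetime
from typing import Optional

# 1 to 28 characters, ASCII letters, digits and hyphens only (assumed reading of the rule).
_SUFFIX_PATTERN = re.compile(r"[A-Za-z0-9-]{1,28}")


def is_valid_suffix(suffix: str) -> bool:
    return _SUFFIX_PATTERN.fullmatch(suffix) is not None


def expected_model_name(model: str, created_at: datetime, suffix: Optional[str] = None) -> str:
    """Build "[model].[suffix].[YYMMDD-HHMMSS]", or "[model].[YYMMDD-HHMMSS]" without a suffix."""
    stamp = created_at.strftime("%y%m%d-%H%M%S")
    if suffix is None:
        return f"{model}.{stamp}"
    if not is_valid_suffix(suffix):
        raise ValueError("suffix must be 1-28 characters: letters, digits and hyphens only")
    return f"{model}.{suffix}.{stamp}"


example_name = expected_model_name("llama-2-7b", datetime(2023, 7, 17, 23, 1, 50), "my-experiment")
print(example_name)
```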

Here is an example script to create a 5-row CSV of properly formatted data for fine-tuning an airline question answering bot:

import csv

# Define data
data = [
  ("What is your policy on carry-on luggage?", "Our policy allows each passenger to bring one piece of carry-on luggage and one personal item such as a purse or briefcase. The maximum size for carry-on luggage is 22 x 14 x 9 inches."),
  ("How can I change my flight?", "You can change your flight through our website or mobile app. Go to 'Manage my booking' section, enter your booking reference and last name, then follow the prompts to change your flight."),
  ("What meals are available on my flight?", "We offer a variety of meals depending on the flight's duration and route. These can range from snacks and light refreshments to full-course meals on long-haul flights. Specific meal options can be viewed during the booking process."),
  ("How early should I arrive at the airport before my flight?", "We recommend arriving at least two hours before domestic flights and three hours before international flights."),
  ("Can I select my seat in advance?", "Yes, you can select your seat during the booking process or afterwards via the 'Manage my booking' section on our website or mobile app."),
  ]

# Write data to a CSV file
with open('customer_service_data.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(["prompt", "response"])
    writer.writerows(data)
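
Before uploading, it can help to confirm that a CSV meets the documented requirements: prompt and response columns, at most 100,000 rows, and ideally at least 200 rows. The check below is a sketch with a hypothetical helper name; it accepts any iterable of text lines, so an open file object works as well as the in-memory list used here.

```python
import csv
from typing import Iterable, List

MAX_ROWS = 100_000          # documented maximum
RECOMMENDED_MIN_ROWS = 200  # documented recommendation, not a hard limit


def check_finetune_csv(lines: Iterable[str]) -> List[str]:
    """Return notes about the data; an empty list means nothing to flag."""
    reader = csv.DictReader(lines)
    columns = reader.fieldnames or []
    missing = [name for name in ("prompt", "response") if name not in columns]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]
    row_count = sum(1 for _ in reader)
    notes = []
    if row_count > MAX_ROWS:
        notes.append(f"{row_count} rows exceeds the maximum of {MAX_ROWS}")
    if row_count < RECOMMENDED_MIN_ROWS:
        notes.append(f"{row_count} rows is below the recommended {RECOMMENDED_MIN_ROWS}")
    return notes


sample = ["prompt,response\n", "How can I change my flight?,Use the website or mobile app.\n"]
print(check_finetune_csv(sample))
```

For a file on disk, something like `with open("customer_service_data.csv", newline="") as file: check_finetune_csv(file)` applies the same check.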

Currently, data needs to be uploaded to either a publicly accessible web URL or to LLM Engine's private file server so that it can be read for fine-tuning. Publicly accessible HTTP and HTTPS URLs are currently supported.

To privately share data with the LLM Engine API, use LLM Engine's File.upload API. You can upload data from a local file to LLM Engine's private file server and then use the returned file ID to reference your data in the FineTune API. The file ID generally has the form file-<random_string>, e.g. "file-7DLVeLdN2Ty4M2m".

Example code for fine-tuning:

from llmengine import FineTune

response = FineTune.create(
    model="llama-2-7b",
    training_file="file-7DLVeLdN2Ty4M2m",
)

print(response.json())
{
    "fine_tune_id": "ft-cir3eevt71r003ks6il0"
}

get classmethod

get(fine_tune_id: str) -> GetFineTuneResponse

Get status of a fine-tuning job.

This API can be used to get the status of an already running fine-tuning job. It takes as a single parameter the fine_tune_id and returns a GetFineTuneResponse object with the id and status (PENDING, STARTED, UNDEFINED, FAILURE or SUCCESS).

Parameters:

Name Type Description Default
fine_tune_id `str`

ID of the fine-tuning job

required

Returns:

Name Type Description
GetFineTuneResponse GetFineTuneResponse

an object that contains the ID and status of the requested job

from llmengine import FineTune

response = FineTune.get(
    fine_tune_id="ft-cir3eevt71r003ks6il0",
)

print(response.json())
{
    "fine_tune_id": "ft-cir3eevt71r003ks6il0",
    "status": "STARTED"
}
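
Because a fine-tuning job can run for a long time, callers often poll the status until the job finishes. The loop below is a sketch under stated assumptions: wait_for_fine_tune is a hypothetical helper, SUCCESS and FAILURE are treated as the only terminal statuses, and the status lookup is passed in as a callable so the example runs offline with a stand-in. In real use, FineTune.get would be passed as the first argument.

```python
import time
from types import SimpleNamespace
from typing import Any, Callable

TERMINAL_STATUSES = {"SUCCESS", "FAILURE"}  # assumption: no other status ends a job


def wait_for_fine_tune(
    get: Callable[..., Any],
    fine_tune_id: str,
    poll_seconds: float = 30.0,
    max_polls: int = 240,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job reaches a terminal status and return that status."""
    for attempt in range(max_polls):
        response = get(fine_tune_id=fine_tune_id)
        # Accept either a plain string or an enum-like object with a .value attribute.
        status = str(getattr(response.status, "value", response.status))
        if status in TERMINAL_STATUSES:
            return status
        if attempt < max_polls - 1:
            sleep(poll_seconds)
    raise TimeoutError(f"{fine_tune_id} did not finish after {max_polls} polls")


# Offline stand-in that walks through three statuses.
_statuses = iter(["PENDING", "STARTED", "SUCCESS"])


def _fake_get(fine_tune_id: str) -> SimpleNamespace:
    return SimpleNamespace(fine_tune_id=fine_tune_id, status=next(_statuses))


final_status = wait_for_fine_tune(_fake_get, "ft-cir3eevt71r003ks6il0", sleep=lambda _: None)
print(final_status)
```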

get_events classmethod

get_events(fine_tune_id: str) -> GetFineTuneEventsResponse

Get events of a fine-tuning job.

This API can be used to get the list of detailed events for a fine-tuning job. It takes the fine_tune_id as a parameter and returns a response object which has a list of events that have happened for the fine-tuning job. Two events are logged periodically: an evaluation of the training loss, and an evaluation of the eval loss. This API will return all events for the fine-tuning job.

Parameters:

Name Type Description Default
fine_tune_id `str`

ID of the fine-tuning job

required

Returns:

Name Type Description
GetFineTuneEventsResponse GetFineTuneEventsResponse

an object that contains the list of events for the fine-tuning job

from llmengine import FineTune

response = FineTune.get_events(fine_tune_id="ft-cir3eevt71r003ks6il0")
print(response.json())
{
    "events":
    [
        {
            "timestamp": 1689665099.6704428,
            "message": "{'loss': 2.108, 'learning_rate': 0.002, 'epoch': 0.7}",
            "level": "info"
        },
        {
            "timestamp": 1689665100.1966307,
            "message": "{'eval_loss': 1.67730712890625, 'eval_runtime': 0.2023, 'eval_samples_per_second': 24.717, 'eval_steps_per_second': 4.943, 'epoch': 0.7}",
            "level": "info"
        },
        {
            "timestamp": 1689665105.6544185,
            "message": "{'loss': 1.8961, 'learning_rate': 0.0017071067811865474, 'epoch': 1.39}",
            "level": "info"
        },
        {
            "timestamp": 1689665106.159139,
            "message": "{'eval_loss': 1.513688564300537, 'eval_runtime': 0.2025, 'eval_samples_per_second': 24.696, 'eval_steps_per_second': 4.939, 'epoch': 1.39}",
            "level": "info"
        }
    ]
}
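
In the output above, each message is a string that looks like a Python dict literal (single-quoted keys), so json.loads will not parse it directly. One way to extract the metrics is ast.literal_eval, as sketched here. eval_loss_by_epoch is a hypothetical helper, it assumes events are plain dicts such as those obtained from json.loads(response.json())["events"], and it assumes the message format shown above stays stable.

```python
import ast
from typing import Dict, List


def eval_loss_by_epoch(events: List[dict]) -> Dict[float, float]:
    """Map epoch to eval_loss for every event whose message carries an eval_loss."""
    losses = {}
    for event in events:
        try:
            metrics = ast.literal_eval(event["message"])
        except (ValueError, SyntaxError):
            # Skip messages that are not dict literals, such as free-form log lines.
            continue
        if isinstance(metrics, dict) and "eval_loss" in metrics:
            losses[metrics["epoch"]] = metrics["eval_loss"]
    return losses


sample_events = [
    {"timestamp": 1689665099.6704428, "level": "info",
     "message": "{'loss': 2.108, 'learning_rate': 0.002, 'epoch': 0.7}"},
    {"timestamp": 1689665100.1966307, "level": "info",
     "message": "{'eval_loss': 1.67730712890625, 'eval_runtime': 0.2023, 'epoch': 0.7}"},
    {"timestamp": 1689665106.159139, "level": "info",
     "message": "{'eval_loss': 1.513688564300537, 'eval_runtime': 0.2025, 'epoch': 1.39}"},
]
eval_losses = eval_loss_by_epoch(sample_events)
print(eval_losses)
```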

list classmethod

list() -> ListFineTunesResponse

List fine-tuning jobs.

This API can be used to list all the fine-tuning jobs. It returns a list of pairs of fine_tune_id and status for all existing jobs.

Returns:

Name Type Description
ListFineTunesResponse ListFineTunesResponse

an object that contains a list of all fine-tuning jobs and their statuses

from llmengine import FineTune

response = FineTune.list()
print(response.json())
{
    "jobs": [
        {
            "fine_tune_id": "ft-cir3eevt71r003ks6il0",
            "status": "STARTED"
        },
        {
            "fine_tune_id": "ft_def456",
            "status": "SUCCESS"
        }
    ]
}

cancel classmethod

cancel(fine_tune_id: str) -> CancelFineTuneResponse

Cancel a fine-tuning job.

This API can be used to cancel an existing fine-tuning job if it's no longer required. It takes the fine_tune_id as a parameter and returns a response object which has a success field confirming if the cancellation was successful.

Parameters:

Name Type Description Default
fine_tune_id `str`

ID of the fine-tuning job

required

Returns:

Name Type Description
CancelFineTuneResponse CancelFineTuneResponse

an object that contains whether the cancellation was successful

from llmengine import FineTune

response = FineTune.cancel(fine_tune_id="ft-cir3eevt71r003ks6il0")
print(response.json())
{
    "success": true
}

Model

Bases: APIEngine

Model API. This API is used to get, list, and delete models. Models include both base models built into LLM Engine, and fine-tuned models that you create through the FineTune.create() API.

See Model Zoo for the list of publicly available base models.

create classmethod

create(
    name: str,
    model: str,
    inference_framework_image_tag: str,
    source: LLMSource = LLMSource.HUGGING_FACE,
    inference_framework: LLMInferenceFramework = LLMInferenceFramework.VLLM,
    num_shards: int = 1,
    quantize: Optional[Quantization] = None,
    checkpoint_path: Optional[str] = None,
    cpus: int = 8,
    memory: str = "24Gi",
    storage: str = "40Gi",
    gpus: int = 1,
    min_workers: int = 0,
    max_workers: int = 1,
    per_worker: int = 2,
    endpoint_type: ModelEndpointType = ModelEndpointType.STREAMING,
    gpu_type: Optional[str] = "nvidia-ampere-a10",
    high_priority: Optional[bool] = False,
    post_inference_hooks: Optional[
        List[PostInferenceHooks]
    ] = None,
    default_callback_url: Optional[str] = None,
    public_inference: Optional[bool] = True,
    labels: Optional[Dict[str, str]] = None,
) -> CreateLLMEndpointResponse

Create an LLM model. Note: This API is only available for self-hosted users.

Parameters:

Name Type Description Default
name `str`

Name of the endpoint

required
model `str`

Name of the base model

required
inference_framework_image_tag `str`

Image tag for the inference framework. Use "latest" for the most recent image

required
source `LLMSource`

Source of the LLM. Currently only HuggingFace is supported

HUGGING_FACE
inference_framework `LLMInferenceFramework`

Inference framework for the LLM. Current supported frameworks are LLMInferenceFramework.DEEPSPEED, LLMInferenceFramework.TEXT_GENERATION_INFERENCE, LLMInferenceFramework.VLLM and LLMInferenceFramework.LIGHTLLM

VLLM
num_shards `int`

Number of shards for the LLM. When greater than 1, the LLM will be sharded across multiple GPUs. The number of GPUs must be equal to or larger than num_shards.

1
quantize `Optional[Quantization]`

Quantization method for the LLM. text_generation_inference supports bitsandbytes and vllm supports awq.

None
checkpoint_path `Optional[str]`

Remote path to the checkpoint for the LLM. LLM engine must have permission to access the given path. Can be either a folder or a tar file. Folder is preferred since we don't need to untar and model loads faster. For model weights, safetensors are preferred but PyTorch checkpoints are also accepted (model loading will be longer).

None
cpus `int`

Number of cpus each worker should get, e.g. 1, 2, etc. This must be greater than or equal to 1. We recommend setting it to 8 * GPU count.

8
memory `str`

Amount of memory each worker should get, e.g. "4Gi", "512Mi", etc. This must be a positive amount of memory. We recommend setting it to 24Gi * GPU count.

'24Gi'
storage `str`

Amount of local ephemeral storage each worker should get, e.g. "4Gi", "512Mi", etc. This must be a positive amount of storage. We recommend 40Gi for 7B models, 80Gi for 13B models and 200Gi for 70B models.

'40Gi'
gpus `int`

Number of gpus each worker should get, e.g. 0, 1, etc.

1
min_workers `int`

The minimum number of workers. Must be greater than or equal to 0. This should be determined by computing the minimum throughput of your workload and dividing it by the throughput of a single worker. When this number is 0, max_workers must be 1, and the endpoint will autoscale between 0 and 1 pods. When this number is greater than 0, max_workers can be any number greater than or equal to min_workers.

0
max_workers `int`

The maximum number of workers. Must be greater than or equal to 0, and also greater than or equal to min_workers. This should be determined by computing the maximum throughput of your workload and dividing it by the throughput of a single worker.

1
per_worker `int`

The maximum number of concurrent requests that an individual worker can service. LLM engine automatically scales the number of workers for the endpoint so that each worker is processing per_worker requests, subject to the limits defined by min_workers and max_workers:

  • If the average number of concurrent requests per worker is lower than per_worker, then the number of workers will be reduced.
  • Otherwise, if the average number of concurrent requests per worker is higher than per_worker, then the number of workers will be increased to meet the elevated traffic.

Here is our recommendation for computing per_worker:

  1. Compute min_workers and max_workers per your minimum and maximum throughput requirements.
  2. Determine a value for the maximum number of concurrent requests in the workload. Divide this number by max_workers. Doing this ensures that the number of workers will "climb" to max_workers.

2
endpoint_type `ModelEndpointType`

Currently only "streaming" endpoints are supported.

STREAMING
gpu_type `Optional[str]`

If specifying a non-zero number of gpus, this controls the type of gpu requested. Here are the supported values:

  • nvidia-tesla-t4
  • nvidia-ampere-a10
  • nvidia-ampere-a100
  • nvidia-ampere-a100e
'nvidia-ampere-a10'
high_priority `Optional[bool]`

Either True or False. Enabling this will allow the created endpoint to leverage the shared pool of prewarmed nodes for faster spinup time

False
post_inference_hooks `Optional[List[PostInferenceHooks]]`

List of hooks to trigger after inference tasks are served

None
default_callback_url `Optional[str]`

The default callback url to use for sync completion requests. This can be overridden in the task parameters for each individual task. post_inference_hooks must contain "callback" for the callback to be triggered

None
public_inference `Optional[bool]`

If True, this endpoint will be available to all user IDs for inference

True
labels `Optional[Dict[str, str]]`

An optional dictionary of key/value pairs to associate with this endpoint

None

Returns: CreateLLMEndpointResponse: creation task ID of the created Model. Currently not used.

from llmengine import Model
from llmengine.data_types import LLMInferenceFramework, ModelEndpointType

response = Model.create(
    name="llama-2-7b-test",
    model="llama-2-7b",
    inference_framework_image_tag="0.2.1.post1",
    inference_framework=LLMInferenceFramework.VLLM,
    num_shards=1,
    checkpoint_path="s3://path/to/checkpoint",
    cpus=8,
    memory="24Gi",
    storage="40Gi",
    gpus=1,
    min_workers=0,
    max_workers=1,
    per_worker=10,
    endpoint_type=ModelEndpointType.STREAMING,
    gpu_type="nvidia-ampere-a10",
    public_inference=False,
)

print(response.json())
from llmengine import Model
from llmengine.data_types import LLMInferenceFramework, ModelEndpointType

response = Model.create(
    name="llama-2-13b-test",
    model="llama-2-13b",
    inference_framework_image_tag="0.2.1.post1",
    inference_framework=LLMInferenceFramework.VLLM,
    num_shards=2,
    checkpoint_path="s3://path/to/checkpoint",
    cpus=16,
    memory="48Gi",
    storage="80Gi",
    gpus=2,
    min_workers=0,
    max_workers=1,
    per_worker=10,
    endpoint_type=ModelEndpointType.STREAMING,
    gpu_type="nvidia-ampere-a10",
    public_inference=False,
)

print(response.json())
from llmengine import Model
from llmengine.data_types import LLMInferenceFramework, ModelEndpointType

response = Model.create(
    name="llama-2-70b-test",
    model="llama-2-70b",
    inference_framework_image_tag="0.9.4",
    inference_framework=LLMInferenceFramework.TEXT_GENERATION_INFERENCE,
    num_shards=4,
    quantize="bitsandbytes",
    checkpoint_path="s3://path/to/checkpoint",
    cpus=40,
    memory="96Gi",
    storage="200Gi",
    gpus=4,
    min_workers=0,
    max_workers=1,
    per_worker=10,
    endpoint_type=ModelEndpointType.STREAMING,
    gpu_type="nvidia-ampere-a10",
    public_inference=False,
)

print(response.json())
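
The sizing guidance for min_workers, max_workers, and per_worker can be sketched as plain arithmetic. This is a minimal illustration and not part of llmengine: plan_workers and its throughput arguments are hypothetical names, and the rounding choices (ceiling for worker counts, floor for per_worker) are assumptions.

```python
import math


def plan_workers(min_rps, max_rps, worker_rps, max_concurrent_requests):
    """Hypothetical helper: derive autoscaling settings from workload estimates.

    min_rps / max_rps: lowest and highest expected workload throughput.
    worker_rps: measured throughput of a single worker (same unit).
    max_concurrent_requests: peak number of in-flight requests.
    """
    if worker_rps <= 0:
        raise ValueError("worker_rps must be positive")

    # Workers needed = workload throughput / single-worker throughput.
    min_workers = math.ceil(min_rps / worker_rps)
    max_workers = max(1, math.ceil(max_rps / worker_rps))

    # Documented constraint: min_workers == 0 requires max_workers == 1.
    # Assumption: if peak load needs several workers, keep one always running.
    if min_workers == 0 and max_workers > 1:
        min_workers = 1

    # Spread peak concurrency across max_workers (rounded down, at least 1).
    per_worker = max(1, max_concurrent_requests // max_workers)

    return {
        "min_workers": min_workers,
        "max_workers": max_workers,
        "per_worker": per_worker,
    }


print(plan_workers(min_rps=20, max_rps=95, worker_rps=10, max_concurrent_requests=100))
```

With these example numbers the sketch yields 2 minimum workers, 10 maximum workers, and a per_worker of 10. The returned keys match the Model.create parameter names, so the dictionary could be unpacked into the call.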

get classmethod

get(model: str) -> GetLLMEndpointResponse

Get information about an LLM model.

This API can be used to get information about a Model's source and inference framework. For self-hosted users, it returns additional information about the number of shards, quantization, infra settings, etc. The function takes a single parameter, the model name, and returns a GetLLMEndpointResponse object.

Parameters:

Name Type Description Default
model `str`

Name of the model

required

Returns:

Name Type Description
GetLLMEndpointResponse GetLLMEndpointResponse

object representing the LLM and configurations

from llmengine import Model

response = Model.get("llama-2-7b.suffix.2023-07-18-12-00-00")

print(response.json())
{
    "id": null,
    "name": "llama-2-7b.suffix.2023-07-18-12-00-00",
    "model_name": null,
    "source": "hugging_face",
    "status": "READY",
    "inference_framework": "text_generation_inference",
    "inference_framework_tag": null,
    "num_shards": null,
    "quantize": null,
    "spec": null
}

list classmethod

list() -> ListLLMEndpointsResponse

List LLM models available to call inference on.

This API can be used to list all available models, including both publicly available models and user-created fine-tuned models. It returns a list of GetLLMEndpointResponse objects for all models. The most important field is the model name.

Returns:

Name Type Description
ListLLMEndpointsResponse ListLLMEndpointsResponse

list of models

from llmengine import Model

response = Model.list()
print(response.json())
{
    "model_endpoints": [
        {
            "id": null,
            "name": "llama-2-7b.suffix.2023-07-18-12-00-00",
            "model_name": null,
            "source": "hugging_face",
            "inference_framework": "text_generation_inference",
            "inference_framework_tag": null,
            "num_shards": null,
            "quantize": null,
            "spec": null
        },
        {
            "id": null,
            "name": "llama-2-7b",
            "model_name": null,
            "source": "hugging_face",
            "inference_framework": "text_generation_inference",
            "inference_framework_tag": null,
            "num_shards": null,
            "quantize": null,
            "spec": null
        },
        {
            "id": null,
            "name": "llama-13b-deepspeed-sync",
            "model_name": null,
            "source": "hugging_face",
            "inference_framework": "deepspeed",
            "inference_framework_tag": null,
            "num_shards": null,
            "quantize": null,
            "spec": null
        },
        {
            "id": null,
            "name": "falcon-40b",
            "model_name": null,
            "source": "hugging_face",
            "inference_framework": "text_generation_inference",
            "inference_framework_tag": null,
            "num_shards": null,
            "quantize": null,
            "spec": null
        }
    ]
}
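
Since the model name is the field most callers need, a small helper can pull the names out of the JSON shown above. This is a minimal sketch that works on the JSON text only. model_names is a hypothetical helper, and the sample is a trimmed copy of the example output.

```python
import json

# Trimmed sample in the shape of the example output above.
sample = json.dumps({
    "model_endpoints": [
        {"name": "llama-2-7b", "inference_framework": "text_generation_inference"},
        {"name": "llama-13b-deepspeed-sync", "inference_framework": "deepspeed"},
        {"name": "falcon-40b", "inference_framework": "text_generation_inference"},
    ]
})


def model_names(response_json, inference_framework=None):
    """Hypothetical helper: names from a list response, optionally filtered by framework."""
    endpoints = json.loads(response_json)["model_endpoints"]
    return [
        endpoint["name"]
        for endpoint in endpoints
        if inference_framework is None
        or endpoint["inference_framework"] == inference_framework
    ]


print(model_names(sample))
print(model_names(sample, inference_framework="deepspeed"))
```

In practice the input would be the string returned by response.json() in the example above.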

update classmethod

update(
    name: str,
    model: Optional[str] = None,
    inference_framework_image_tag: Optional[str] = None,
    source: Optional[LLMSource] = None,
    num_shards: Optional[int] = None,
    quantize: Optional[Quantization] = None,
    checkpoint_path: Optional[str] = None,
    cpus: Optional[int] = None,
    memory: Optional[str] = None,
    storage: Optional[str] = None,
    gpus: Optional[int] = None,
    min_workers: Optional[int] = None,
    max_workers: Optional[int] = None,
    per_worker: Optional[int] = None,
    endpoint_type: Optional[ModelEndpointType] = None,
    gpu_type: Optional[str] = None,
    high_priority: Optional[bool] = None,
    post_inference_hooks: Optional[
        List[PostInferenceHooks]
    ] = None,
    default_callback_url: Optional[str] = None,
    public_inference: Optional[bool] = None,
    labels: Optional[Dict[str, str]] = None,
) -> UpdateLLMEndpointResponse

Update an LLM model. Note: This API is only available for self-hosted users.

Parameters:

Name Type Description Default
name `str`

Name of the endpoint

required
model `Optional[str]`

Name of the base model

None
inference_framework_image_tag `Optional[str]`

Image tag for the inference framework. Use "latest" for the most recent image

None
source `Optional[LLMSource]`

Source of the LLM. Currently only HuggingFace is supported

None
num_shards `Optional[int]`

Number of shards for the LLM. When greater than 1, the LLM will be sharded across multiple GPUs. The number of GPUs must be greater than or equal to num_shards.

None
quantize `Optional[Quantization]`

Quantization method for the LLM. text_generation_inference supports bitsandbytes and vllm supports awq.

None
checkpoint_path `Optional[str]`

Remote path to the checkpoint for the LLM. LLM engine must have permission to access the given path. Can be either a folder or a tar file. A folder is preferred, since no untarring is needed and the model loads faster. For model weights, safetensors are preferred, but PyTorch checkpoints are also accepted (model loading will take longer).

None
cpus `Optional[int]`

Number of cpus each worker should get, e.g. 1, 2, etc. This must be greater than or equal to 1. The recommendation is to set it to 8 * GPU count.

None
memory `Optional[str]`

Amount of memory each worker should get, e.g. "4Gi", "512Mi", etc. This must be a positive amount of memory. The recommendation is to set it to 24Gi * GPU count.

None
storage `Optional[str]`

Amount of local ephemeral storage each worker should get, e.g. "4Gi", "512Mi", etc. This must be a positive amount of storage. The recommendation is 40Gi for 7B models, 80Gi for 13B models and 200Gi for 70B models.

None
gpus `Optional[int]`

Number of gpus each worker should get, e.g. 0, 1, etc.

None
min_workers `Optional[int]`

The minimum number of workers. Must be greater than or equal to 0. This should be determined by computing the minimum throughput of your workload and dividing it by the throughput of a single worker. When this number is 0, max_workers must be 1, and the endpoint will autoscale between 0 and 1 pods. When this number is greater than 0, max_workers can be any number greater than or equal to min_workers.

None
max_workers `Optional[int]`

The maximum number of workers. Must be greater than or equal to 0, and also greater than or equal to min_workers. This should be determined by computing the maximum throughput of your workload and dividing it by the throughput of a single worker.

None
per_worker `Optional[int]`

The maximum number of concurrent requests that an individual worker can service. LLM engine automatically scales the number of workers for the endpoint so that each worker is processing per_worker requests, subject to the limits defined by min_workers and max_workers:

  • If the average number of concurrent requests per worker is lower than per_worker, then the number of workers will be reduced.
  • Otherwise, if the average number of concurrent requests per worker is higher than per_worker, then the number of workers will be increased to meet the elevated traffic.

Here is our recommendation for computing per_worker:

  1. Compute min_workers and max_workers per your minimum and maximum throughput requirements.
  2. Determine a value for the maximum number of concurrent requests in the workload. Divide this number by max_workers. Doing this ensures that the number of workers will "climb" to max_workers.

None
endpoint_type `Optional[ModelEndpointType]`

Currently only "streaming" endpoints are supported.

None
gpu_type `Optional[str]`

If specifying a non-zero number of gpus, this controls the type of gpu requested. Here are the supported values:

  • nvidia-tesla-t4
  • nvidia-ampere-a10
  • nvidia-ampere-a100
  • nvidia-ampere-a100e
None
high_priority `Optional[bool]`

Either True or False. Enabling this will allow the created endpoint to leverage the shared pool of prewarmed nodes for faster spinup time

None
post_inference_hooks `Optional[List[PostInferenceHooks]]`

List of hooks to trigger after inference tasks are served

None
default_callback_url `Optional[str]`

The default callback url to use for sync completion requests. This can be overridden in the task parameters for each individual task. post_inference_hooks must contain "callback" for the callback to be triggered

None
public_inference `Optional[bool]`

If True, this endpoint will be available to all user IDs for inference

None
labels `Optional[Dict[str, str]]`

An optional dictionary of key/value pairs to associate with this endpoint

None

Returns: UpdateLLMEndpointResponse: creation task ID of the updated Model. Currently not used.
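
The per-worker resource recommendations in the table above (8 cpus and 24Gi of memory per GPU, storage by model size) can be collected into one helper. This is a minimal sketch, not part of llmengine: recommended_resources and the "7b" / "13b" / "70b" size keys are hypothetical names chosen for illustration.

```python
# Storage recommendations quoted in the parameter table, keyed by model size.
RECOMMENDED_STORAGE = {"7b": "40Gi", "13b": "80Gi", "70b": "200Gi"}


def recommended_resources(gpus, model_size):
    """Hypothetical helper: per-worker resources from the documented rules of thumb."""
    if gpus < 1:
        raise ValueError("these recommendations assume at least one GPU")
    if model_size not in RECOMMENDED_STORAGE:
        raise ValueError(f"no storage recommendation for model size {model_size!r}")
    return {
        "gpus": gpus,
        "cpus": 8 * gpus,            # 8 cpus per GPU
        "memory": f"{24 * gpus}Gi",  # 24Gi per GPU
        "storage": RECOMMENDED_STORAGE[model_size],
    }


print(recommended_resources(gpus=2, model_size="13b"))
```

The keys match the gpus, cpus, memory, and storage parameters of update, so the result could be unpacked into a call such as Model.update(name="llama-2-13b-test", **recommended_resources(2, "13b")). Treat the values as starting points: the 70B create example uses cpus=40 rather than the 32 this rule would give.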

delete classmethod

delete(
    model_endpoint_name: str,
) -> DeleteLLMEndpointResponse

Deletes an LLM model.

This API can be used to delete a fine-tuned model. It takes the name of the model as a parameter and returns a response object with a deleted field confirming whether the deletion was successful. If called on a base model included with LLM Engine, an error will be thrown.

Parameters:

Name Type Description Default
model_endpoint_name `str`

Name of the model endpoint to be deleted

required

Returns:

Name Type Description
response DeleteLLMEndpointResponse

whether the model endpoint was successfully deleted

from llmengine import Model

response = Model.delete("llama-2-7b.suffix.2023-07-18-12-00-00")
print(response.json())
{
    "deleted": true
}

download classmethod

download(
    model_name: str, download_format: str = "hugging_face"
) -> ModelDownloadResponse

Download a fine-tuned model.

This API can be used to download the resulting model from a fine-tuning job. It takes the model_name and download_format as parameters and returns a response object that contains a dictionary of filename, url pairs associated with the fine-tuned model. The user can then download these urls to obtain the fine-tuned model. If called on a nonexistent model, an error will be thrown.

Parameters:

Name Type Description Default
model_name `str`

name of the fine-tuned model

required
download_format `str`

download format requested (default=hugging_face)

'hugging_face'

Returns: ModelDownloadResponse: an object that contains a dictionary of filename, url pairs from which to download the model weights. The urls are presigned urls that grant temporary access and expire after an hour.

from llmengine import Model

response = Model.download("llama-2-7b.suffix.2023-07-18-12-00-00", download_format="hugging_face")
print(response.json())
{
    "urls": {"my_model_file": "https://url-to-my-model-weights"}
}
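
Fetching the returned urls can be done with the standard library. The following is a minimal sketch, not part of llmengine: save_model_files is a hypothetical helper that takes the filename-to-url mapping from the "urls" field and writes each file into a local directory.

```python
import urllib.request
from pathlib import Path


def save_model_files(urls, dest_dir):
    """Hypothetical helper: fetch each presigned url into dest_dir.

    urls: the {filename: url} mapping from the "urls" field of the response.
    Returns the list of local paths that were written.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for filename, url in urls.items():
        # Keep only the final path component so a key cannot escape dest_dir.
        target = dest / Path(filename).name
        urllib.request.urlretrieve(url, target)
        saved.append(target)
    return saved
```

Assuming the response object exposes the urls field shown above as an attribute, usage might look like save_model_files(response.urls, "my-fine-tuned-model"). Because the presigned urls expire after an hour, the files should be fetched soon after calling Model.download.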

File

Bases: APIEngine

File API. This API is used to upload private files to LLM engine so that fine-tunes can access them for training and validation data.

Functions are provided to upload, get, list, and delete files, as well as to get the contents of a file.

upload classmethod

upload(file: BufferedReader) -> UploadFileResponse

Uploads a file to LLM engine.

For use in FineTune creation, this should be a CSV file with two columns: prompt and response. A maximum of 100,000 rows of data is currently supported.

Parameters:

Name Type Description Default
file `BufferedReader`

A local file opened in binary mode with open(file_path, "rb"), which returns a BufferedReader

required

Returns:

Name Type Description
UploadFileResponse UploadFileResponse

an object that contains the ID of the uploaded file

from llmengine import File

response = File.upload(open("training_dataset.csv", "rb"))

print(response.json())
{
    "id": "file-abc123"
}
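
A training file in the expected shape can be produced with the standard csv module. This is a minimal sketch under stated assumptions: write_finetune_csv is a hypothetical helper, and it assumes the column names prompt and response appear as a header row.

```python
import csv

MAX_ROWS = 100_000  # documented row limit for fine-tuning files


def write_finetune_csv(path, examples):
    """Hypothetical helper: write (prompt, response) pairs as a two-column CSV.

    Returns the number of data rows written (the header is not counted).
    """
    rows = list(examples)
    if len(rows) > MAX_ROWS:
        raise ValueError(f"{len(rows)} rows exceeds the {MAX_ROWS}-row limit")
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["prompt", "response"])  # assumed header row
        writer.writerows(rows)
    return len(rows)
```

For example, write_finetune_csv("training_dataset.csv", [("What is 2+2?", "4")]) writes a header plus one data row. The resulting file can then be passed to File.upload as in the example above.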

get classmethod

get(file_id: str) -> GetFileResponse

Get file metadata, including filename and size.

Parameters:

Name Type Description Default
file_id `str`

ID of the file

required

Returns:

Name Type Description
GetFileResponse GetFileResponse

an object that contains the ID, filename, and size of the requested file

from llmengine import File

response = File.get(
    file_id="file-abc123",
)

print(response.json())
{
    "id": "file-abc123",
    "filename": "training_dataset.csv",
    "size": 100
}

download classmethod

download(file_id: str) -> GetFileContentResponse

Get contents of a file, as a string. (If the uploaded file is in binary, a string encoding will be returned.)

Parameters:

Name Type Description Default
file_id `str`

ID of the file

required

Returns:

Name Type Description
GetFileContentResponse GetFileContentResponse

an object that contains the ID and content of the file

from llmengine import File

response = File.download(file_id="file-abc123")
print(response.json())
{
    "id": "file-abc123",
    "content": "Hello world!"
}
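
When the downloaded file is a fine-tuning CSV, the returned content string can be parsed back into rows. This is a minimal sketch: parse_prompt_response is a hypothetical helper, and the sample string stands in for the "content" field of a file with prompt and response columns.

```python
import csv
import io


def parse_prompt_response(content):
    """Hypothetical helper: parse CSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(content)))


# Stand-in for the "content" field of a downloaded training file.
sample = "prompt,response\nWhat is 2+2?,4\n"

for row in parse_prompt_response(sample):
    print(row["prompt"], "->", row["response"])
```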

list classmethod

list() -> ListFilesResponse

List metadata about all files, e.g. their filenames and sizes.

Returns:

Name Type Description
ListFilesResponse ListFilesResponse

an object that contains a list of all files and their filenames and sizes

from llmengine import File

response = File.list()
print(response.json())
{
    "files": [
        {
            "id": "file-abc123",
            "filename": "training_dataset.csv",
            "size": 100
        },
        {
            "id": "file-def456",
            "filename": "validation_dataset.csv",
            "size": 50
        }
    ]
}

delete classmethod

delete(file_id: str) -> DeleteFileResponse

Deletes a file.

Parameters:

Name Type Description Default
file_id `str`

ID of the file

required

Returns:

Name Type Description
DeleteFileResponse DeleteFileResponse

an object that contains whether the deletion was successful

from llmengine import File

response = File.delete(file_id="file-abc123")
print(response.json())
{
    "deleted": true
}