Foundation model REST API reference

This article provides general API information for Databricks Foundation Model APIs and the models they support. The Foundation Model APIs are designed to be similar to OpenAI's REST API to make migrating existing projects easier. Both the pay-per-token and provisioned throughput endpoints accept the same REST API request format.

Endpoints

Foundation Model APIs support pay-per-token endpoints and provisioned throughput endpoints.

A preconfigured endpoint is available in your workspace for each supported pay-per-token model, and users can interact with these endpoints using HTTP POST requests. See Supported foundation models on Mosaic AI Model Serving for supported models.
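
For example, the following is a minimal sketch of querying a pay-per-token chat endpoint with an HTTP POST request. It assumes your workspace URL and a personal access token are available in environment variables, and uses the databricks-meta-llama-3-3-70b-instruct endpoint mentioned later in this article:

```python
import os

import requests

# Assumptions: DATABRICKS_HOST is the workspace URL (https://...) and
# DATABRICKS_TOKEN is a personal access token with access to the endpoint.
host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

resp = requests.post(
    f"{host}/serving-endpoints/databricks-meta-llama-3-3-70b-instruct/invocations",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "messages": [
            {"role": "user", "content": "What is a mixture-of-experts model?"}
        ],
        "max_tokens": 256,
    },
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```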

Provisioned throughput endpoints can be created using the API or the Serving UI. These endpoints support multiple models per endpoint for A/B testing, as long as all served models expose the same API format (for example, both are chat models). See POST /api/2.0/serving-endpoints for endpoint configuration parameters.
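
The following is a minimal sketch of creating a provisioned throughput endpoint through that API. The entity name, version, and throughput bounds shown are placeholders; consult the endpoint configuration parameters for the exact fields your model requires:

```python
import os

import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

# Assumption: the Unity Catalog model name, version, and throughput values
# below are illustrative placeholders; substitute your own.
config = {
    "name": "my-llama-endpoint",
    "config": {
        "served_entities": [
            {
                "entity_name": "system.ai.meta_llama_v3_3_70b_instruct",
                "entity_version": "1",
                "min_provisioned_throughput": 0,
                "max_provisioned_throughput": 9500,
            }
        ]
    },
}

resp = requests.post(
    f"{host}/api/2.0/serving-endpoints",
    headers={"Authorization": f"Bearer {token}"},
    json=config,
)
resp.raise_for_status()
```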

Requests and responses use JSON; the exact JSON structure depends on an endpoint's task type. Chat and completion endpoints support streaming responses.

Usage

Responses include a usage sub-message that reports the number of tokens in the request and response. The format of this sub-message is the same across all task types.

| Field | Type | Description |
|---|---|---|
| completion_tokens | Integer | Number of generated tokens. Not included in embedding responses. |
| prompt_tokens | Integer | Number of tokens from the input prompt(s). |
| total_tokens | Integer | Total number of tokens. |
| reasoning_tokens | Integer | Number of reasoning (thinking) tokens. Only applicable to reasoning models. |

For models like databricks-meta-llama-3-3-70b-instruct, a user prompt is transformed using a prompt template before being passed into the model. For pay-per-token endpoints, a system prompt might also be added. prompt_tokens includes all text added by the server.
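
For example, a chat response might carry a usage sub-message shaped like the following (values are illustrative):

```python
# Illustrative usage sub-message from a chat response; values are examples only.
usage = {
    "prompt_tokens": 42,       # tokens in the templated input, including any added system prompt
    "completion_tokens": 128,  # tokens generated by the model
    "total_tokens": 170,
}
```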

Responses API

Important

The Responses API is only compatible with OpenAI models.

The Responses API enables multi-turn conversations with a model. Unlike Chat Completions, the Responses API uses input instead of messages.

Responses API request

| Field | Default | Type | Description |
|---|---|---|---|
| model | | String | Required. Model ID used to generate the response. |
| input | | String or List[ResponsesInput] | Required. Text, image, or file inputs to the model, used to generate a response. Conversation content is specified in this field rather than in a messages field. |
| instructions | null | String | A system (or developer) message inserted into the model's context. |
| max_output_tokens | null | null, which means no limit, or an integer greater than zero | An upper bound for the number of tokens that can be generated for a response, including visible output tokens and reasoning tokens. |
| temperature | 1.0 | Float in [0,2] | The sampling temperature. 0 is deterministic and higher values introduce more randomness. |
| top_p | 1.0 | Float in (0,1] | The probability threshold used for nucleus sampling. |
| stream | false | Boolean | If set to true, the model response data is streamed to the client as it is generated, using server-sent events. |
| stream_options | null | StreamOptions | Options for streaming responses. Only set this when stream is true. |
| text | null | TextConfig | Configuration options for a text response from the model. Can be plain text or structured JSON data. |
| reasoning | null | ReasoningConfig | Reasoning configuration for gpt-5 and o-series models. |
| tool_choice | "auto" | String or ToolChoiceObject | How the model selects which tool (or tools) to use when generating a response. See the tools parameter for how to specify which tools the model can call. |
| tools | null | List[ToolObject] | An array of tools the model may call while generating a response. Note: Code interpreter and web search tools are not supported by Databricks. |
| parallel_tool_calls | true | Boolean | Whether to allow the model to run tool calls in parallel. |
| max_tool_calls | null | Integer greater than zero | The maximum number of total calls to built-in tools that can be processed in a response. |
| metadata | null | Object | Set of 16 key-value pairs that can be attached to an object. |
| prompt_cache_key | null | String | Used to cache responses for similar requests to optimize cache hit rates. Replaces the user field. |
| prompt_cache_retention | null | String | The retention policy for the prompt cache. Set to "24h" to enable extended prompt caching, which keeps cached prefixes active for longer, up to a maximum of 24 hours. |
| safety_identifier | null | String | A stable identifier used to help detect users of your application that might be violating usage policies. |
| user | null | String | Deprecated. Use safety_identifier and prompt_cache_key instead. |
| truncation | null | String | The truncation strategy to use for the model response. |
| top_logprobs | null | Integer | An integer between 0 and 20 specifying the number of most likely tokens to return at each token position, each with an associated log probability. |
| include | null | List[String] | Specifies additional output data to include in the model response. |
| prompt | null | Object | Reference to a prompt template and its variables. |

Unsupported parameters: The following parameters are not supported by Databricks and will return a 400 error if specified:

  • background - Background processing is not supported
  • store - Stored responses are not supported
  • conversation - The Conversation API is not supported
  • service_tier - Service tier selection is managed by Databricks
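
Putting the pieces together, the following is a minimal sketch of a Responses API request. The endpoint name databricks-gpt-5 is a placeholder assumption; substitute whichever OpenAI model endpoint is available in your workspace:

```python
import os

import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

# Assumption: "databricks-gpt-5" stands in for an OpenAI model endpoint
# available in your workspace.
endpoint = "databricks-gpt-5"
payload = {
    "model": endpoint,
    "input": [
        {"role": "user", "content": "Summarize the plot of Hamlet in two sentences."}
    ],
    "instructions": "You are a terse literary assistant.",
    "max_output_tokens": 200,
}

resp = requests.post(
    f"{host}/serving-endpoints/{endpoint}/invocations",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json()["output"])
```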

ResponsesInput

The input field accepts either a string or a list of input message objects with role and content.

| Field | Type | Description |
|---|---|---|
| role | String | Required. The role of the message author. Can be "user" or "assistant". |
| content | String or List[ResponsesContentBlock] | Required. The content of the message, either as a string or an array of content blocks. |

ResponsesContentBlock

Content blocks define the type of content in input and output messages. The content block type is determined by the type field.

InputText

| Field | Type | Description |
|---|---|---|
| type | String | Required. Must be "input_text". |
| text | String | Required. The text content. |

OutputText

| Field | Type | Description |
|---|---|---|
| type | String | Required. Must be "output_text". |
| text | String | Required. The text content. |
| annotations | List[Object] | Optional annotations for the text content. |

InputImage

| Field | Type | Description |
|---|---|---|
| type | String | Required. Must be "input_image". |
| image_url | String | Required. URL or base64-encoded data URI of the image. |

InputFile

| Field | Type | Description |
|---|---|---|
| type | String | Required. Must be "input_file". |
| file_id | String | File identifier if using uploaded files. |
| filename | String | The name of the file. |
| file_data | String | Base64-encoded data URI with format prefix. For example, PDF files use the format data:application/pdf;base64,<base64 data>. |

FunctionCall

| Field | Type | Description |
|---|---|---|
| type | String | Required. Must be "function_call". |
| id | String | Required. Unique identifier for the function call. |
| call_id | String | Required. The call identifier. |
| name | String | Required. The name of the function being called. |
| arguments | Object or String | Required. The function arguments as a JSON object or string. |

FunctionCallOutput

| Field | Type | Description |
|---|---|---|
| type | String | Required. Must be "function_call_output". |
| call_id | String | Required. The call identifier this output corresponds to. |
| output | String or Object | Required. The function output as a string or JSON object. |
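
The FunctionCall and FunctionCallOutput blocks work together in a tool-calling round trip: the model's tool-call suggestion is echoed back in input alongside your application's result. A sketch, using a hypothetical get_weather function and illustrative IDs:

```python
# Hypothetical get_weather function; the endpoint name and IDs are illustrative.
payload = {
    "model": "databricks-gpt-5",  # placeholder endpoint name
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
    "input": [
        {"role": "user", "content": "What's the weather in Paris?"},
        # The model's earlier function_call block, echoed back verbatim:
        {
            "type": "function_call",
            "id": "fc_123",
            "call_id": "call_123",
            "name": "get_weather",
            "arguments": '{"city": "Paris"}',
        },
        # Your application's result for that call:
        {
            "type": "function_call_output",
            "call_id": "call_123",
            "output": '{"temperature_c": 18, "conditions": "cloudy"}',
        },
    ],
}
```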

StreamOptions

Configuration for streaming responses. Only used when stream: true.

| Field | Type | Description |
|---|---|---|
| include_usage | Boolean | If true, include token usage information in the stream. Default is false. |

TextConfig

Configuration for text output, including structured outputs.

| Field | Type | Description |
|---|---|---|
| format | ResponsesFormatObject | The format specification for the text output. |

ResponsesFormatObject

Specifies the output format for text responses.

| Field | Type | Description |
|---|---|---|
| type | String | Required. The type of format: "text" for plain text, "json_object" for JSON, or "json_schema" for structured JSON. |
| json_schema | Object | Required when type is "json_schema". The JSON schema object that defines the structure of the output. |

The json_schema object has the same structure as the JsonSchemaObject documented in the Chat Completions API section of this article.
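
For example, a sketch of requesting structured JSON output through the text field; the schema shown is illustrative:

```python
# Illustrative schema; the nested json_schema object follows JsonSchemaObject.
payload = {
    "model": "databricks-gpt-5",  # placeholder endpoint name
    "input": "Extract the name and year from: 'Grace Hopper, born 1906.'",
    "text": {
        "format": {
            "type": "json_schema",
            "json_schema": {
                "name": "person",
                "schema": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "year": {"type": "integer"},
                    },
                    "required": ["name", "year"],
                },
                "strict": True,
            },
        }
    },
}
```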

ReasoningConfig

Configuration for reasoning behavior in reasoning models (o-series and gpt-5 models).

| Field | Type | Description |
|---|---|---|
| effort | String | The reasoning effort level: "low", "medium", or "high". Default is "medium". |
| encrypted_content | String | Encrypted reasoning content for stateless mode. Provided by the model in previous responses. |

ToolObject

See Function calling on Azure Databricks.

| Field | Type | Description |
|---|---|---|
| type | String | Required. The type of the tool. Currently, only "function" is supported. |
| function | FunctionObject | Required. The function definition associated with the tool. |

FunctionObject

| Field | Type | Description |
|---|---|---|
| name | String | Required. The name of the function to be called. |
| description | String | Required. The detailed description of the function. The model uses this description to understand the relevance of the function to the prompt and generate tool calls with higher accuracy. |
| parameters | Object | The parameters the function accepts, described as a valid JSON schema object. If the tool is called, the tool call conforms to the JSON schema provided. Omitting parameters defines a function without any parameters. The number of properties is limited to 15 keys. |
| strict | Boolean | Whether to enable strict schema adherence when generating the function call. If set to true, the model follows the exact schema defined in the parameters field. Only a subset of JSON schema is supported when strict is true. |

ToolChoiceObject

See Function calling on Azure Databricks.

| Field | Type | Description |
|---|---|---|
| type | String | Required. The type of the tool. Currently, only "function" is supported. |
| function | Object | Required. An object defining which tool to call, of the form {"type": "function", "function": {"name": "my_function"}}, where "my_function" is the name of a FunctionObject in the tools field. |

Responses API response

For non-streaming requests, the response is a single response object. For streaming requests, the response is a text/event-stream where each event is a response chunk.

| Field | Type | Description |
|---|---|---|
| id | String | Unique identifier for the response. Note: Databricks encrypts this ID for security. |
| object | String | The object type. Equal to "response". |
| created_at | Integer | The Unix timestamp (in seconds) when the response was created. |
| status | String | The status of the response. One of: completed, failed, in_progress, cancelled, queued, or incomplete. |
| model | String | The model version used to generate the response. |
| output | List[ResponsesMessage] | The output generated by the model, typically containing message objects. |
| usage | Usage | Token usage metadata. |
| error | Error | Error information if the response failed. |
| incomplete_details | IncompleteDetails | Details about why the response is incomplete, if applicable. |
| instructions | String | The instructions provided in the request. |
| max_output_tokens | Integer | The maximum output tokens specified in the request. |
| temperature | Float | The temperature used for generation. |
| top_p | Float | The top_p value used for generation. |
| tools | List[ToolObject] | The tools specified in the request. |
| tool_choice | String or ToolChoiceObject | The tool_choice setting from the request. |
| parallel_tool_calls | Boolean | Whether parallel tool calls were enabled. |
| store | Boolean | Whether the response was stored. |
| metadata | Object | The metadata attached to the response. |
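
An abridged, illustrative non-streaming response body might look like the following (IDs, timestamps, and token counts are examples only):

```python
# Abridged, illustrative non-streaming Responses API response body.
response_body = {
    "id": "resp_abc123",
    "object": "response",
    "created_at": 1730000000,
    "status": "completed",
    "model": "databricks-gpt-5",
    "output": [
        {
            "type": "message",
            "id": "msg_abc123",
            "role": "assistant",
            "status": "completed",
            "content": [
                {
                    "type": "output_text",
                    "text": "Hamlet delays avenging his father...",
                    "annotations": [],
                }
            ],
        }
    ],
    "usage": {"prompt_tokens": 24, "completion_tokens": 96, "total_tokens": 120},
}
```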

ResponsesMessage

Message objects in the output field containing the model's response content.

| Field | Type | Description |
|---|---|---|
| id | String | Required. Unique identifier for the message. |
| role | String | Required. The role of the message. Either "user" or "assistant". |
| content | List[ResponsesContentBlock] | Required. The content blocks in the message. |
| status | String | The status of the message processing. |
| type | String | Required. The object type. Equal to "message". |

Error

Error information when a response fails.

| Field | Type | Description |
|---|---|---|
| code | String | Required. The error code. |
| message | String | Required. A human-readable error message. |
| param | String | The parameter that caused the error, if applicable. |
| type | String | Required. The error type. |

IncompleteDetails

Details about why a response is incomplete.

| Field | Type | Description |
|---|---|---|
| reason | String | Required. The reason the response is incomplete. |

Chat Completions API

The Chat Completions API enables multi-turn conversations with a model. The model response provides the next assistant message in the conversation. See POST /serving-endpoints/{name}/invocations for querying endpoint parameters.

Chat request

| Field | Default | Type | Description |
|---|---|---|---|
| messages | | List[ChatMessage] | Required. A list of messages representing the current conversation. |
| max_tokens | null | null, which means no limit, or an integer greater than zero | The maximum number of tokens to generate. |
| stream | true | Boolean | Stream responses back to a client in order to allow partial results for requests. If this parameter is included in the request, responses are sent using the server-sent events standard. |
| temperature | 1.0 | Float in [0,2] | The sampling temperature. 0 is deterministic and higher values introduce more randomness. |
| top_p | 1.0 | Float in (0,1] | The probability threshold used for nucleus sampling. |
| top_k | null | null, which means no limit, or an integer greater than zero | Defines the number of k most likely tokens to use for top-k filtering. Set this value to 1 to make outputs deterministic. |
| stop | [] | String or List[String] | The model stops generating further tokens when any one of the sequences in stop is encountered. |
| n | 1 | Integer greater than zero | The API returns n independent chat completions when n is specified. Recommended for workloads that generate multiple completions on the same input for additional inference efficiency and cost savings. Only available for provisioned throughput endpoints. |
| tool_choice | "none" | String or ToolChoiceObject | Used only in conjunction with the tools field. Supports the keyword strings "auto", "required", and "none". "auto" lets the model decide which (if any) tool is relevant to use; if the model doesn't believe any of the tools in tools are relevant, it generates a standard assistant message instead of a tool call. "required" means the model picks the most relevant tool in tools and must generate a tool call. "none" means the model does not generate any tool calls and instead must generate a standard assistant message. To force a tool call with a specific tool defined in tools, use a ToolChoiceObject. By default, if the tools field is populated, tool_choice is "auto"; otherwise, it defaults to "none". |
| tools | null | List[ToolObject] | A list of tools that the model can call. Currently, function is the only supported tool type, and a maximum of 32 functions is supported. |
| response_format | null | ResponseFormatObject | An object specifying the format that the model must output. Accepted types are text, json_schema, and json_object. Setting to { "type": "json_schema", "json_schema": {...} } enables structured outputs, which ensures the model follows your supplied JSON schema. Setting to { "type": "json_object" } ensures the responses the model generates are valid JSON, but does not ensure that responses follow a specific schema. |
| logprobs | false | Boolean | Whether to return the log probability of each sampled token. |
| top_logprobs | null | Integer | The number of most likely token candidates to return log probabilities for at each sampling step. Can be 0-20. logprobs must be true to use this field. |
| reasoning_effort | "medium" | String | Controls the level of reasoning effort the model applies when generating responses. Accepted values are "low", "medium", and "high". Higher reasoning effort can produce more thoughtful and accurate responses but can increase latency and token usage. This parameter is only accepted by a limited set of models, including databricks-gpt-oss-120b and databricks-gpt-oss-20b. |

ChatMessage

| Field | Type | Description |
|---|---|---|
| role | String | Required. The role of the author of the message. Can be "system", "user", "assistant", or "tool". |
| content | String | The content of the message. Required for chat tasks that do not involve tool calls. |
| tool_calls | List[ToolCall] | The list of tool_calls that the model generated. Requires role to be "assistant" and no specification for the content field. |
| tool_call_id | String | When role is "tool", the ID associated with the ToolCall that the message is responding to. Must be empty for other role options. |

The system role can only be used once, as the first message in a conversation. It overrides the model's default system prompt.

ToolCall

A tool call action suggestion by the model. See Function calling on Azure Databricks.

| Field | Type | Description |
|---|---|---|
| id | String | Required. A unique identifier for this tool call suggestion. |
| type | String | Required. Only "function" is supported. |
| function | FunctionCallCompletion | Required. A function call suggested by the model. |
| cache_control | String | Enables caching for your request. This parameter is only accepted by Databricks-hosted Claude models. See Prompt caching for an example. |

FunctionCallCompletion

| Field | Type | Description |
|---|---|---|
| name | String | Required. The name of the function the model recommended. |
| arguments | Object | Required. Arguments to the function as a serialized JSON dictionary. |

Note: ToolChoiceObject, ToolObject, and FunctionObject are defined in the Responses API section and are shared between both APIs.
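
A sketch of a chat tool-calling round trip using these objects: the model's ToolCall is echoed back in an assistant message, followed by a tool message carrying your result. The get_weather function and the IDs are hypothetical:

```python
# Hypothetical get_weather function; IDs are illustrative.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

messages = [
    {"role": "user", "content": "What's the weather in Paris?"},
    # Assistant message carrying the model's tool call (content is omitted):
    {
        "role": "assistant",
        "tool_calls": [
            {
                "id": "call_123",
                "type": "function",
                "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'},
            }
        ],
    },
    # Tool message returning your application's result for that call:
    {"role": "tool", "tool_call_id": "call_123", "content": "18 C, cloudy"},
]

payload = {"messages": messages, "tools": tools}
```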

ResponseFormatObject

See Structured outputs on Azure Databricks.

| Field | Type | Description |
|---|---|---|
| type | String | Required. The type of response format being defined: text for unstructured text, json_object for unstructured JSON objects, or json_schema for JSON objects adhering to a specific schema. |
| json_schema | JsonSchemaObject | Required. The JSON schema to adhere to if type is set to json_schema. |

JsonSchemaObject

See Structured outputs on Azure Databricks.

| Field | Type | Description |
|---|---|---|
| name | String | Required. The name of the response format. |
| description | String | A description of what the response format is for, used by the model to determine how to respond in the format. |
| schema | Object | Required. The schema for the response format, described as a JSON schema object. |
| strict | Boolean | Whether to enable strict schema adherence when generating the output. If set to true, the model follows the exact schema defined in the schema field. Only a subset of JSON schema is supported when strict is true. |
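
For example, a sketch of a structured-outputs chat request built from these objects; the schema is illustrative:

```python
# Illustrative schema for a structured-outputs chat request.
payload = {
    "messages": [
        {
            "role": "user",
            "content": "Extract the name and year from: 'Grace Hopper, born 1906.'",
        }
    ],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "year": {"type": "integer"},
                },
                "required": ["name", "year"],
            },
            "strict": True,
        },
    },
}
```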

Chat response

For non-streaming requests, the response is a single chat completion object. For streaming requests, the response is a text/event-stream where each event is a completion chunk object. The top-level structure of completion and chunk objects is almost identical: only choices has a different type.

| Field | Type | Description |
|---|---|---|
| id | String | Unique identifier for the chat completion. |
| choices | List[ChatCompletionChoice] or List[ChatCompletionChunk] (streaming) | List of chat completion texts. n choices are returned if the n parameter is specified. |
| object | String | The object type. Equal to "chat.completion" for non-streaming or "chat.completion.chunk" for streaming. |
| created | Integer | The Unix timestamp (in seconds) when the chat completion was generated. |
| model | String | The model version used to generate the response. |
| usage | Usage | Token usage metadata. Might not be present on streaming responses. |
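
An abridged, illustrative non-streaming chat completion body (IDs, timestamps, and token counts are examples only):

```python
# Abridged, illustrative non-streaming chat completion response body.
chat_response = {
    "id": "chatcmpl_abc123",
    "object": "chat.completion",
    "created": 1730000000,
    "model": "databricks-meta-llama-3-3-70b-instruct",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Embeddings are used for ..."},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 30, "completion_tokens": 100, "total_tokens": 130},
}
```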

ChatCompletionChoice

| Field | Type | Description |
|---|---|---|
| index | Integer | The index of the choice in the list of generated choices. |
| message | ChatMessage | A chat completion message returned by the model. The role will be assistant. |
| finish_reason | String | The reason the model stopped generating tokens. |
| extra_fields | String | When using proprietary models from external model providers, the provider's APIs might include additional metadata in responses. Databricks filters these responses and returns only a subset of the provider's original fields. safetyRating is the only extra field supported at this time; see the Gemini documentation for more details. |

ChatCompletionChunk

| Field | Type | Description |
|---|---|---|
| index | Integer | The index of the choice in the list of generated choices. |
| delta | ChatMessage | A chat completion message part of generated streamed responses from the model. Only the first chunk is guaranteed to have role populated. |
| finish_reason | String | The reason the model stopped generating tokens. Only the last chunk has this populated. |

Embeddings API

Embedding tasks map input strings into embedding vectors. Many inputs can be batched together in each request. See POST /serving-endpoints/{name}/invocations for querying endpoint parameters.

Embedding request

| Field | Type | Description |
|---|---|---|
| input | String or List[String] | Required. The input text to embed. Can be a string or a list of strings. |
| instruction | String | An optional instruction to pass to the embedding model. |

Instructions are optional and highly model-specific. For instance, the BGE authors recommend no instruction when indexing chunks, and recommend using the instruction "Represent this sentence for searching relevant passages:" for retrieval queries. Other models, like Instructor-XL, support a wide range of instruction strings.
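
A sketch of a batched embedding request with that retrieval instruction, assuming the pay-per-token BGE endpoint databricks-bge-large-en is available in your workspace:

```python
import os

import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

# Batch of retrieval queries, using the instruction the BGE authors recommend.
payload = {
    "input": [
        "What is provisioned throughput?",
        "How do I enable streaming responses?",
    ],
    "instruction": "Represent this sentence for searching relevant passages:",
}

resp = requests.post(
    f"{host}/serving-endpoints/databricks-bge-large-en/invocations",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
resp.raise_for_status()
vectors = [item["embedding"] for item in resp.json()["data"]]
```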

Embeddings response

| Field | Type | Description |
|---|---|---|
| id | String | Unique identifier for the embedding. |
| object | String | The object type. Equal to "list". |
| model | String | The name of the embedding model used to create the embedding. |
| data | List[EmbeddingObject] | The list of embedding objects. |
| usage | Usage | Token usage metadata. |

EmbeddingObject

| Field | Type | Description |
|---|---|---|
| object | String | The object type. Equal to "embedding". |
| index | Integer | The index of the embedding in the list of embeddings generated by the model. |
| embedding | List[Float] | The embedding vector. Each model returns a fixed-size vector (1024 dimensions for BGE-Large). |

Completions API

Text completion tasks are for generating responses to a single prompt. Unlike Chat, this task supports batched inputs: multiple independent prompts can be sent in one request. See POST /serving-endpoints/{name}/invocations for querying endpoint parameters.

Completion request

| Field | Default | Type | Description |
|---|---|---|---|
| prompt | | String or List[String] | Required. The prompt(s) for the model. |
| max_tokens | null | null, which means no limit, or an integer greater than zero | The maximum number of tokens to generate. |
| stream | true | Boolean | Stream responses back to a client in order to allow partial results for requests. If this parameter is included in the request, responses are sent using the server-sent events standard. |
| temperature | 1.0 | Float in [0,2] | The sampling temperature. 0 is deterministic and higher values introduce more randomness. |
| top_p | 1.0 | Float in (0,1] | The probability threshold used for nucleus sampling. |
| top_k | null | null, which means no limit, or an integer greater than zero | Defines the number of k most likely tokens to use for top-k filtering. Set this value to 1 to make outputs deterministic. |
| error_behavior | "error" | "truncate" or "error" | For timeouts and context-length-exceeded errors. One of "truncate" (return as many tokens as possible) or "error" (return an error). This parameter is only accepted by pay-per-token endpoints. |
| n | 1 | Integer greater than zero | The API returns n independent completions when n is specified. Recommended for workloads that generate multiple completions on the same input for additional inference efficiency and cost savings. Only available for provisioned throughput endpoints. |
| stop | [] | String or List[String] | The model stops generating further tokens when any one of the sequences in stop is encountered. |
| suffix | "" | String | A string that is appended to the end of every completion. |
| echo | false | Boolean | Returns the prompt along with the completion. |
| use_raw_prompt | false | Boolean | If true, pass the prompt directly into the model without any transformation. |
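
A sketch of a batched completion request: two independent prompts in a single call. The endpoint name is a placeholder for a completions-capable endpoint in your workspace:

```python
import os

import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

# Two independent prompts batched in one request.
payload = {
    "prompt": [
        "Write a haiku about autumn.",
        "Write a haiku about winter.",
    ],
    "max_tokens": 64,
    "temperature": 0.8,
}

resp = requests.post(
    f"{host}/serving-endpoints/<completions-endpoint>/invocations",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
resp.raise_for_status()
for choice in resp.json()["choices"]:
    print(choice["index"], choice["text"])
```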

Completion response

| Field | Type | Description |
|---|---|---|
| id | String | Unique identifier for the text completion. |
| choices | List[CompletionChoice] | A list of text completions. For every prompt passed in, n choices are generated if n is specified. Default n is 1. |
| object | String | The object type. Equal to "text_completion". |
| created | Integer | The Unix timestamp (in seconds) when the completion was generated. |
| usage | Usage | Token usage metadata. |

CompletionChoice

| Field | Type | Description |
|---|---|---|
| index | Integer | The index of the prompt in the request. |
| text | String | The generated completion. |
| finish_reason | String | The reason the model stopped generating tokens. |

Additional resources