LLM Calls

Endpoints for making LLM calls with multi-provider support, auto-routing, and structured output.

Authentication

Two authentication methods are available:

Method	Header	Use Case
JWT	`Authorization: Bearer <jwt>` + `x-tenant-id`	Frontend / Dashboard
API Key	`X-Api-Key: YOUR_API_KEY`	External integrations (n8n, webhooks, MCP)
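The two methods in the table can be captured in a small helper that builds the matching header set. This is a minimal sketch: `build_headers` and its parameter names are hypothetical and not part of any official SDK; only the header names come from the table above.

```python
from typing import Optional


def build_headers(
    api_key: Optional[str] = None,
    jwt: Optional[str] = None,
    tenant_id: Optional[str] = None,
) -> dict:
    """Return HTTP headers for one of the two documented auth methods.

    Hypothetical helper: only the header names are taken from the docs.
    """
    headers = {"Content-Type": "application/json"}
    if api_key is not None:
        # API Key auth, intended for external integrations.
        headers["X-Api-Key"] = api_key
    elif jwt is not None and tenant_id is not None:
        # JWT auth needs the bearer token plus the tenant header.
        headers["Authorization"] = f"Bearer {jwt}"
        headers["x-tenant-id"] = tenant_id
    else:
        raise ValueError("provide api_key, or both jwt and tenant_id")
    return headers
```

Passing only `api_key` yields the API Key header set; passing `jwt` and `tenant_id` yields the JWT header set.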

POST /api/v3/llm/call

Execute an LLM call with full multi-tenant context.

Authentication: JWT + Multi-tenant headers required

Headers:

Authorization: Bearer <jwt-token>
x-tenant-id: <uuid>
x-agent-id: <uuid>
Content-Type: application/json

Request Body:

{
  "model": "auto",
  "prompt": "What is the capital of France?",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hello"}
  ],
  "options": {
    "temperature": 0.7,
    "maxTokens": 4096,
    "stop": ["END"]
  },
  "context": {
    "task": "chat_general"
  },
  "response_format": { "type": "json_object" }
}

Field	Type	Required	Description
`model`	string	Yes	Model ID (e.g. `openai.gpt-4.1-nano`), or `auto` to let the router choose; see Models
`prompt`	string	Yes*	The prompt/question (*or `messages`)
`messages`	array	Yes*	Conversation history (*or `prompt`)
`options.temperature`	number	No	Response randomness (0-1). Default: 0.7
`options.maxTokens`	number	No	Maximum response length. Default: 1000
`options.stop`	string[]	No	Stop sequences — the model stops generating when it encounters any of these strings (see Stop Sequences below)
`context.task`	string	No	Task type hint (e.g. `chat_general`, `code`, `reasoning_analysis`)
`response_format`	object	No	Structured output format (see Structured Output below)

Parameter Format

Use options.maxTokens (camelCase), not max_tokens at root level
Use context.task, not context.task_type
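The field rules above (either `prompt` or `messages`, `maxTokens` nested under `options`, `task` nested under `context`) can be enforced client-side before sending a request. The following is a sketch under those rules only; `build_payload` is a hypothetical helper name, and the server may apply further validation not described here.

```python
from typing import Optional


def build_payload(
    model: str,
    prompt: Optional[str] = None,
    messages: Optional[list] = None,
    temperature: Optional[float] = None,
    max_tokens: Optional[int] = None,
    task: Optional[str] = None,
) -> dict:
    """Assemble a request body using the documented field names."""
    if not prompt and not messages:
        raise ValueError("either prompt or messages is required")
    payload = {"model": model}
    if prompt:
        payload["prompt"] = prompt
    if messages:
        payload["messages"] = messages
    options = {}
    if temperature is not None:
        options["temperature"] = temperature
    if max_tokens is not None:
        # camelCase and nested under "options", not a root-level max_tokens
        options["maxTokens"] = max_tokens
    if options:
        payload["options"] = options
    if task is not None:
        # context.task, not context.task_type
        payload["context"] = {"task": task}
    return payload
```

Keeping the Python argument names in snake_case and mapping them to the wire format in one place avoids sending `max_tokens` or `task_type` by accident.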

Response:

{
  "success": true,
  "response": "The capital of France is Paris.",
  "model": "gpt-4.1-nano",
  "provider": "openai",
  "usage": {
    "input_tokens": 15,
    "output_tokens": 8,
    "total_tokens": 23
  },
  "cost": 0.00015,
  "latency_ms": 450,
  "routing": {
    "is_auto_routed": true,
    "model_chosen": "openai.gpt-4.1-nano",
    "confidence": 0.85
  }
}
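A client will usually pull a few fields out of this response for logging. The sketch below reads only fields shown in the sample above; `summarize_call` is a hypothetical helper, and treating `success: false` as an error is an assumption, since the error response shape is not documented in this section.

```python
def summarize_call(data: dict) -> str:
    """Build a one-line summary from a parsed /llm/call response."""
    if not data.get("success"):
        # Assumption: a falsy "success" signals a failed call.
        raise RuntimeError("LLM call did not succeed")
    usage = data.get("usage", {})
    routing = data.get("routing", {})
    routed = " (auto-routed)" if routing.get("is_auto_routed") else ""
    return (
        f"{data['provider']}/{data['model']}{routed}: "
        f"{usage.get('total_tokens', 0)} tokens, "
        f"{data.get('latency_ms', 0)} ms"
    )


# The sample response from the documentation, trimmed to the fields used.
sample = {
    "success": True,
    "response": "The capital of France is Paris.",
    "model": "gpt-4.1-nano",
    "provider": "openai",
    "usage": {"input_tokens": 15, "output_tokens": 8, "total_tokens": 23},
    "latency_ms": 450,
    "routing": {"is_auto_routed": True, "model_chosen": "openai.gpt-4.1-nano"},
}

print(summarize_call(sample))
# openai/gpt-4.1-nano (auto-routed): 23 tokens, 450 ms
```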

POST /api/v3/llm/public/call

Execute an LLM call using API Key authentication.

Authentication: API Key required

Headers:

X-Api-Key: YOUR_API_KEY
Content-Type: application/json

Request/Response: Same as POST /api/v3/llm/call

Quick Examples

curl:

curl -X POST https://llm.zihin.ai/api/v3/llm/public/call \
  -H "X-Api-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Summarize this contract", "model": "auto"}'

Node.js:

const response = await fetch('https://llm.zihin.ai/api/v3/llm/public/call', {
  method: 'POST',
  headers: {
    'X-Api-Key': 'YOUR_API_KEY',
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    prompt: 'Summarize this contract',
    model: 'auto',
    options: { temperature: 0.7, maxTokens: 4096 },
  }),
});

// Fail early on HTTP errors instead of parsing an error body as a result
if (!response.ok) {
  throw new Error(`LLM call failed with HTTP ${response.status}`);
}

const data = await response.json();
console.log(data.response);

Python:

import requests

response = requests.post(
    "https://llm.zihin.ai/api/v3/llm/public/call",
    headers={
        "X-Api-Key": "YOUR_API_KEY",
        "Content-Type": "application/json",
    },
    json={
        "prompt": "Summarize this contract",
        "model": "auto",
        "options": {"temperature": 0.7, "maxTokens": 4096},
    },
)

# Raise on HTTP errors before reading the body
response.raise_for_status()

data = response.json()
print(data["response"])

Structured Output (response_format)

Force the LLM to return a specific JSON structure.

Simple JSON Object:

{
  "model": "openai.gpt-4o",
  "prompt": "List 3 countries and their capitals",
  "options": { "maxTokens": 500 },
  "response_format": { "type": "json_object" }
}

JSON Schema (Strict):

{
  "model": "openai.gpt-4o",
  "prompt": "Extract person data from: John Doe, 30 years old, john@email.com",
  "options": { "temperature": 0.1, "maxTokens": 200 },
  "context": { "task": "extract_structure" },
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "person_data",
      "strict": true,
      "schema": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "age": { "type": "integer" },
          "email": { "type": "string" }
        },
        "required": ["name", "age", "email"],
        "additionalProperties": false
      }
    }
  }
}

Provider Support:

Provider	json_object	json_schema	Notes
OpenAI	Yes	Yes	Full support (strict mode)
Grok	Yes	Yes	OpenAI-compatible API
Google	Yes	Yes	Translated to responseMimeType/responseSchema
Anthropic	No	No	Uses tool_use for structured output
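The support table can be checked on the client before attaching `response_format` to a request. This is a sketch, not the API's own logic: `supports_response_format` is a hypothetical helper, and it assumes the provider is the prefix before the first dot in the model ID, as in the `openai.gpt-4o` examples. An `auto` model ID has no such prefix, so the helper conservatively returns False for it.

```python
# Transcribed from the Provider Support table above.
SUPPORT = {
    "openai": {"json_object", "json_schema"},
    "grok": {"json_object", "json_schema"},
    "google": {"json_object", "json_schema"},
    "anthropic": set(),
}


def supports_response_format(model_id: str, format_type: str) -> bool:
    """Return True if the model's provider is listed as supporting format_type."""
    # Assumption: model IDs look like "<provider>.<model-name>".
    provider = model_id.split(".", 1)[0].lower()
    return format_type in SUPPORT.get(provider, set())


print(supports_response_format("openai.gpt-4o", "json_schema"))
print(supports_response_format("anthropic.claude-haiku-4-5-20251001", "json_object"))
```

The first call prints True and the second prints False, matching the table rows for OpenAI and Anthropic.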

Structured Output and Agent Mode

response_format is not supported in Agent Mode (/api/v2/agents/:id/stream).

Agent Mode requires the LLM to call tools freely, ask clarifying questions, and decide its next steps. Forcing response_format would break that agentic cycle. For structured responses, call /api/v3/llm/call directly with response_format.

Stop Sequences

Stop sequences cause the model to stop generating text when it encounters any of the specified strings. The stop sequence itself is not included in the response, and finish_reason returns "stop".

Example:

{
  "model": "openai.gpt-4.1-nano",
  "prompt": "List numbers 1 to 20, one per line. Write DONE after number 5.",
  "options": {
    "maxTokens": 500,
    "stop": ["DONE"]
  }
}

The model will generate 1\n2\n3\n4\n5\n and stop before emitting DONE.
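The truncation behaviour described above can be reproduced locally on a plain string. This is only an illustration of the semantics (cut at the earliest match, exclude the stop sequence itself); `apply_stop` is a hypothetical function, and real providers apply the rule during generation rather than afterwards.

```python
def apply_stop(text: str, stop: list) -> str:
    """Cut text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for seq in stop:
        if not seq:
            continue  # ignore empty strings, which would match at index 0
        idx = text.find(seq)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


# What the model would have produced without a stop sequence.
full = "1\n2\n3\n4\n5\nDONE\n6\n7\n"
print(repr(apply_stop(full, ["DONE"])))
# '1\n2\n3\n4\n5\n'
```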

Provider mapping:

The API normalizes the stop parameter to each provider's native format automatically:

Provider	Native Field	Notes
OpenAI	`stop`	Up to 4 sequences
Anthropic	`stop_sequences`	Array of strings
Google	`generationConfig.stopSequences`	Array of strings
Grok	`stop`	OpenAI-compatible
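To make the mapping in the table concrete, the sketch below places a normalized `stop` list under each provider's native field. It illustrates the table only and is not the API's actual implementation; `to_native_stop` is a hypothetical name, and the provider keys are lowercase versions of the names in the table.

```python
def to_native_stop(provider: str, stop: list) -> dict:
    """Return the request fragment carrying stop sequences for a provider."""
    if provider in ("openai", "grok"):
        # Grok is listed as OpenAI-compatible, so both use "stop".
        return {"stop": stop}
    if provider == "anthropic":
        return {"stop_sequences": stop}
    if provider == "google":
        # Google nests the field under generationConfig.
        return {"generationConfig": {"stopSequences": stop}}
    raise ValueError(f"unknown provider: {provider}")


print(to_native_stop("google", ["DONE"]))
# {'generationConfig': {'stopSequences': ['DONE']}}
```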

Reasoning Models

Reasoning models (OpenAI o-series, GPT-5, Grok reasoning variants) do not support stop sequences. The API silently skips the parameter for these models to avoid errors. If you need stop sequences, use a standard (non-reasoning) model.

Multimodal Content (Images)

Send images alongside text using the OpenAI image_url format. The API automatically converts to each provider's native format.

Example with URL:

{
  "model": "openai.gpt-4o",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "Describe this image."},
      {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
    ]
  }],
  "options": { "maxTokens": 500 }
}

Example with Base64:

{
  "model": "anthropic.claude-haiku-4-5-20251001",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "What do you see in this image?"},
      {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQ..."}}
    ]
  }],
  "options": { "maxTokens": 500 }
}

Provider Support:

Provider	Support	Native Format	Auto Conversion
OpenAI	Yes	`image_url` (native)	Not needed
Anthropic	Yes	`image` with `source.type: url/base64`	Yes
Google	Yes	`inlineData` (base64 only)	Yes (downloads URL)
Grok	Yes*	`image_url` (OpenAI-compatible)	Not needed

* Support depends on the specific model (e.g. grok-2-vision-1212)

When using model: "auto" with multimodal content, the router automatically selects a vision-capable model.

Supported Image Formats

JPEG, PNG, GIF, WebP — via public URLs (https://) or data URIs (base64).
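For local files, the image bytes have to be wrapped in a data URI before they go into an `image_url` part. The sketch below shows one way to do that with the standard library; `image_part` and `image_message` are hypothetical helper names, and the MIME types are the standard ones for the four formats listed above.

```python
import base64

# Standard MIME types for the four documented formats.
MIME_TYPES = {
    "jpeg": "image/jpeg",
    "png": "image/png",
    "gif": "image/gif",
    "webp": "image/webp",
}


def image_part(data: bytes, fmt: str = "jpeg") -> dict:
    """Wrap raw image bytes as an image_url content part with a data URI."""
    if fmt not in MIME_TYPES:
        raise ValueError(f"unsupported image format: {fmt}")
    encoded = base64.b64encode(data).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{MIME_TYPES[fmt]};base64,{encoded}"},
    }


def image_message(text: str, data: bytes, fmt: str = "jpeg") -> dict:
    """Build a user message carrying a text part followed by an image part."""
    return {
        "role": "user",
        "content": [{"type": "text", "text": text}, image_part(data, fmt)],
    }
```

A typical call would read the file in binary mode, for example `image_message("Describe this image.", open("photo.jpg", "rb").read())`, and place the result in the `messages` array of the request body.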

POST /api/v3/llm/test-connection

Test connection to a specific provider.

Authentication: JWT + Multi-tenant headers required

Request Body:

{
  "provider": "openai",
  "model": "gpt-4o"
}

Response:

{
  "success": true,
  "provider": "openai",
  "latency_ms": 250,
  "status": "connected"
}
