Streaming

SSE streaming across all providers — proxied without buffering, with per-chunk extension hooks.

CrabLLM supports Server-Sent Events (SSE) streaming for chat completions across all providers. Streams are proxied without buffering, so tokens arrive incrementally as the provider generates them.

Usage

Set "stream": true in the request body:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Write a haiku."}],
    "stream": true
  }'

The response is a stream of SSE events:

data: {"id":"...","object":"chat.completion.chunk","choices":[{"delta":{"content":"An"}}]}

data: {"id":"...","object":"chat.completion.chunk","choices":[{"delta":{"content":" old"}}]}

data: [DONE]
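
On the client side, a stream like this can be consumed line by line. The following is a minimal sketch, not part of CrabLLM itself: the function name collect_stream_text is hypothetical, and it assumes each event arrives as a single "data:" line, as in the sample above.

```python
import json

def collect_stream_text(lines):
    """Concatenate the delta content from an iterable of SSE lines."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)

# The sample events shown above, as they would arrive on the wire.
sample = [
    'data: {"id":"...","object":"chat.completion.chunk","choices":[{"delta":{"content":"An"}}]}',
    '',
    'data: {"id":"...","object":"chat.completion.chunk","choices":[{"delta":{"content":" old"}}]}',
    '',
    'data: [DONE]',
]
print(collect_stream_text(sample))
```

In a real client the lines would come from the HTTP response body rather than a list; the parsing logic stays the same.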

Provider translation

For non-OpenAI providers, CrabLLM translates the provider's native streaming format to OpenAI-compatible SSE chunks:

Anthropic — message_start, content_block_delta events translated to chat.completion.chunk format.
Google Gemini — streamGenerateContent response parts translated to OpenAI chunks.
Bedrock — AWS event-stream binary frames decoded and translated.
Azure — same SSE format as OpenAI, no translation needed.
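
To make the translation step concrete, here is an illustrative sketch of how Anthropic stream events could be mapped to OpenAI-style chunks. It is not CrabLLM's actual implementation: the function name, the chunk_id and model parameters, and the choice to emit a role-only delta for message_start are all assumptions made for the example.

```python
def anthropic_event_to_openai_chunk(event, chunk_id, model):
    """Map one Anthropic stream event to an OpenAI-style chunk dict, or None."""
    etype = event.get("type")
    if etype == "message_start":
        # Assumption: announce the assistant role in the first chunk.
        delta = {"role": "assistant"}
    elif (etype == "content_block_delta"
          and event.get("delta", {}).get("type") == "text_delta"):
        delta = {"content": event["delta"]["text"]}
    else:
        return None  # event types this sketch does not handle
    return {
        "id": chunk_id,
        "object": "chat.completion.chunk",
        "model": model,
        "choices": [{"index": 0, "delta": delta, "finish_reason": None}],
    }

event = {
    "type": "content_block_delta",
    "index": 0,
    "delta": {"type": "text_delta", "text": "An"},
}
chunk = anthropic_event_to_openai_chunk(event, "chunk-1", "example-model")
print(chunk["choices"][0]["delta"])
```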

Extension hooks

Extensions can observe each streaming chunk via the on_chunk hook. The rate limiter and budget extensions use this hook to count tokens in real time as chunks arrive.
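
The idea of a per-chunk hook can be illustrated with a small sketch. This does not reproduce CrabLLM's real extension interface, which is not shown on this page: the TokenCounter class, the on_chunk signature with a request_id argument, and the word-based token estimate are all hypothetical.

```python
class TokenCounter:
    """Hypothetical extension that tallies streamed content per request."""

    def __init__(self):
        self.approx_tokens = {}

    def on_chunk(self, request_id, chunk):
        # Called once per chunk; the chunk is passed through unchanged.
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content") or ""
            # Crude estimate: count whitespace-separated words.
            # A real counter would use the model's tokenizer.
            previous = self.approx_tokens.get(request_id, 0)
            self.approx_tokens[request_id] = previous + len(text.split())

counter = TokenCounter()
counter.on_chunk("req-1", {"choices": [{"delta": {"content": "An"}}]})
counter.on_chunk("req-1", {"choices": [{"delta": {"content": " old"}}]})
print(counter.approx_tokens["req-1"])
```

Because the hook runs as each chunk passes through, a running total like this is available before the stream finishes, which is what makes mid-stream budget enforcement possible.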

Keep-alive

SSE connections include automatic keep-alive pings so that proxies and load balancers do not time out the connection during long pauses in generation.
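
This page does not specify what form the pings take. In the SSE format, lines beginning with a colon are comments and are commonly used for keep-alives, so a client that parses the stream by hand should skip them. The sketch below assumes that convention; the function name data_payloads is hypothetical.

```python
def data_payloads(lines):
    """Yield the value of each 'data:' line, skipping comments and blanks."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line or line.startswith(":"):
            continue  # blank separator, or SSE comment (assumed keep-alive ping)
        if line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format strips a single leading space from the value.
            if value.startswith(" "):
                value = value[1:]
            yield value

stream = [": ping", "", 'data: {"choices":[]}', "", ": ping", "", "data: [DONE]"]
print(list(data_payloads(stream)))
```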

Error handling

If an error occurs mid-stream (after the first chunk has been sent), it is delivered as an SSE event with an error payload, and the stream then terminates. Retry and fallback apply only before the stream starts.
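
Because a mid-stream error arrives as an ordinary event, clients need to check each payload before treating it as a chunk. The exact shape of the error payload is not documented here; the sketch below assumes an OpenAI-style object with a top-level "error" key, and the StreamError and parse_event names are hypothetical.

```python
import json

class StreamError(Exception):
    """Raised when the stream delivers an error payload."""

def parse_event(payload):
    """Return the decoded chunk, or raise StreamError for an error payload."""
    obj = json.loads(payload)
    if isinstance(obj, dict) and "error" in obj:
        err = obj["error"]
        # Assumption: the error is either an object with a message, or a string.
        message = err.get("message", "unknown error") if isinstance(err, dict) else str(err)
        raise StreamError(message)
    return obj

try:
    parse_event('{"error": {"message": "upstream timeout"}}')
except StreamError as exc:
    caught = str(exc)
    print("stream failed:", caught)
```

Since retry and fallback no longer apply once the stream has started, the caller decides whether to discard the partial output or reissue the request.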
