Many projects and products implement network proxies that sit between clients and vLLM, including:
- Content filtering and guardrail systems - Safety and compliance enforcement
- Traffic management and routing layers - Load balancing and service mesh integration
- Model-as-a-Service platforms - Multi-tenancy, billing, and resource management
- API gateways - Authentication, rate limiting, logging, and access control
However, these proxies frequently introduce compatibility issues that break complex agentic applications:
- Missing API coverage: Don't support all vLLM inference APIs
- Parameter filtering: Don't accept or forward all request parameters, especially model-specific extensions
- Response mangling: Don't preserve all response fields, breaking client expectations
- Streaming corruption: Break Server-Sent Events (SSE) format or introduce excessive latency
- Payload modification: Edit requests/responses in ways incompatible with vLLM's expectations
These issues cause silent failures that are hard to debug, particularly in agentic workloads that rely on specific parameters like tool calling, structured outputs, or reasoning features.
These guidelines define requirements for proxy implementations that sit between clients and vLLM servers. They cover same-format proxies — proxies where the API format is identical on both the client-facing and server-facing sides. They apply to:
- HTTP reverse proxies (Nginx, Envoy, HAProxy)
- Python middleware (FastAPI, Starlette, Flask)
- Kubernetes sidecars and service meshes
- API gateways and management platforms
- Any software that forwards requests/responses between clients and vLLM
Out of scope: Proxies that translate between API formats (e.g., converting OpenAI Chat Completions to a provider-native format such as Anthropic Messages API or Vertex GenerateContent) are explicitly out of scope. API translation is inherently lossy and cannot meet the transparency requirements defined here. See Non-Conformant Proxies for what this means in practice.
These guidelines were developed for vLLM but apply to any OpenAI-compatible inference server. The core principles — preserve all parameters, maintain streaming format, forward errors — are universal for transparent proxying.
For vLLM users: This document defines requirements for production proxy deployments.
For other inference servers: The same principles apply. Where this document references vLLM-specific features (e.g., disaggregated serving parameters), treat them as examples of the general principle: proxies must preserve unknown parameters regardless of the inference server implementation.
This document uses RFC 2119 keywords to indicate requirement levels:
- MUST / REQUIRED / SHALL: Absolute requirements. Non-compliance breaks compatibility.
- MUST NOT / SHALL NOT: Absolute prohibitions. Violating these breaks compatibility.
- SHOULD / RECOMMENDED: Strong recommendations. Deviation requires justification.
- SHOULD NOT / NOT RECOMMENDED: Strong warnings against certain behavior.
- MAY / OPTIONAL: Truly optional. Implementers can choose based on their needs.
A proxy that follows these guidelines can claim zero impact on inference quality — applications produce identical results with or without the proxy in the request path. This is the core value proposition of conformance: no additional evaluation, certification, or benchmarking is needed to prove the proxy does not degrade the application.
A proxy that does not follow these guidelines — whether it translates between API formats, modifies request parameters, restructures responses, or alters payloads in any way — cannot make this claim. Such a proxy effectively becomes its own inference client (to the backend) and inference server (to the application). It takes on the full burden of proving that its modifications do not degrade application quality, including:
- Evaluation: Running comprehensive evals against every model the proxy handles to quantify accuracy and quality impact
- Compatibility testing: Validating behavior across the full matrix of API features (tool calling, structured outputs, streaming, reasoning, multimodal, etc.) for every backend
- Benchmarking: Measuring latency, throughput, and resource overhead introduced by the proxy
- Ongoing maintenance: Re-running all of the above whenever the proxy, the backend APIs, or the supported models change
This is not a judgment on whether payload modification or API translation is useful — there are legitimate reasons to do both. It is a statement about where the burden of proof falls. Conformant proxies inherit the inference server's guarantees. Non-conformant proxies must establish their own.
Proxies MUST support the core vLLM inference endpoints and SHOULD support additional endpoints based on use cases.
Proxies MUST support these endpoints for basic compatibility:
- Standard: OpenAI-compatible API, supported by most inference servers
- Purpose: Chat-based text generation (OpenAI Chat Completions API)
- Streaming: MUST support both
stream=trueandstream=false - Content-Type:
application/jsonfor requests,text/event-streamfor streaming responses - Use Cases: Chat applications, agentic workflows, tool calling, reasoning
- Standard: OpenAI-compatible API, supported by most inference servers
- Purpose: List available models
- Rationale: Clients regularly query this for model discovery and validation
- Response Format: JSON list of model metadata
Proxies SHOULD support these endpoints for broader compatibility:
- Standard: OpenAI-compatible API
- Purpose: Raw text completion (OpenAI Completions API)
- Streaming: MUST support both
stream=trueandstream=false - Use Cases: Legacy applications, simple completion tasks
- Note: Less critical for modern agentic applications but important for backward compatibility
- Extension: Agentic workflows (vLLM, some other servers)
- Purpose: Agentic workflows and multi-turn interactions
- Streaming: MUST support both
stream=trueandstream=false - Use Cases: Complex agentic applications with tool use and reasoning
- Extension: Anthropic Messages API compatibility (vLLM)
- Purpose: Anthropic Messages API compatibility
- Streaming: MUST support both
stream=trueandstream=false - Use Cases: Applications using Anthropic-compatible clients
- Standard: OpenAI-compatible API
- Purpose: Text embedding generation
- Use Cases: RAG (Retrieval-Augmented Generation), semantic search
- Note: Only relevant if proxy handles embedding models
Proxies MAY support additional specialized endpoints based on specific deployment requirements. These include endpoints for tokenization, audio processing, pooling model tasks, and other model-specific features.
Note: Most proxy deployments will not need to front these endpoints. Only implement support if your use case explicitly requires them.
For each supported endpoint, proxies MUST:
- Accept all valid request parameters defined in the inference server's protocol schemas
- Forward requests to the inference server backend without filtering parameters
- Preserve response format exactly as returned by the inference server
- Maintain HTTP semantics: status codes, headers, content types
- Handle errors by forwarding inference server error responses to clients
Proxies MUST NOT:
- Return 404 for supported endpoints
- Filter or validate parameters beyond basic HTTP/JSON parsing
- Modify response structure or remove fields
- Convert between streaming and non-streaming formats
Proxies MUST preserve all request parameters, including those not recognized by the proxy implementation. This is critical for supporting model-specific features, future parameters, and custom extensions.
Core Requirement: Proxies MUST forward all request fields from client to the inference server, even if the proxy does not understand their meaning.
Rationale: Many inference servers use Pydantic's ConfigDict(extra="allow") pattern to accept unknown fields. For example, vLLM's OpenAIBaseModel (in vllm/entrypoints/openai/engine/protocol.py) enables:
- Model-specific parameters not in the OpenAI spec
- Custom extensions via
vllm_xargs - Future vLLM parameters without proxy updates
- Experimental features during development
While proxies MUST preserve ALL request parameters (known and unknown), it's helpful to understand the categories of features that inference servers commonly expose:
Many modern inference servers support these parameters:
-
Tool calling: Function definitions and calling strategies
- Example impact if filtered: Breaks agentic applications entirely
- Supported by: vLLM, many other inference servers
-
Structured outputs: JSON schema constraints, regex patterns, choice lists
- Example impact if filtered: Breaks constrained generation workflows
- Supported by: vLLM, OpenAI, many other inference servers
-
Reasoning: Control over reasoning/thinking output generation
- Example impact if filtered: Breaks reasoning model workflows
- Supported by: vLLM, inference servers for reasoning models
-
Advanced sampling: Top-K, Min-P, repetition penalties beyond OpenAI spec
- Example impact if filtered: Degrades generation quality for models tuned with these params
- Supported by: vLLM, text-generation-inference, llama.cpp, others
-
Multimodal: Image/audio/video processing overrides
- Example impact if filtered: Breaks or degrades multimodal inputs
- Supported by: vLLM, servers supporting multimodal models
Some parameters are specific to particular inference server implementations. Examples from vLLM:
-
Disaggregated serving: Parameters for KV cache transfer between prefill/decode instances
- Example impact if filtered: Breaks distributed deployments entirely
- vLLM-specific feature
-
Security: Prefix cache salting for multi-tenant environments
- Example impact if filtered: Security vulnerability allowing prompt guessing
- vLLM-specific feature
-
Custom extensions: Model-specific or experimental parameters via
vllm_xargs- Example impact if filtered: Breaks custom model features
- vLLM extension mechanism (other servers may have similar extension patterns)
Why this matters:
Even for server-specific parameters, proxies cannot maintain a "known good" allowlist because:
- Different inference servers support different feature sets
- New parameters are added regularly for emerging capabilities
- Custom extensions enable experimentation and model-specific features
The only safe approach is to preserve ALL parameters and let the inference server handle validation.
Note: Specific parameter names are not enumerated here to avoid documentation drift. Consult your inference server's API documentation, OpenAPI schemas, protocol definitions, or SDKs to understand supported parameters. The principle remains the same: preserve all parameters regardless of server.
Core Principle: Proxies should only validate parameters they specifically need for their own operation. Let the inference server handle all other validation.
Proxies MAY perform validation for:
- HTTP/JSON correctness: Valid JSON syntax, correct content types
- Parameters the proxy needs: Only validate parameters required for proxy functionality
- Example: Reading
modelfield for semantic routing - Example: Reading
streamfield to determine response handling - Important: Only validate values the proxy must act upon; forward everything else
- Example: Reading
Why This Limitation?
Different inference servers have different validation rules. If a proxy validates parameters beyond its own needs:
- Proxy stricter than server: Blocks valid requests the inference server would accept
- Proxy looser than server: Allows invalid requests that fail downstream anyway
- Version drift: Validation rules change; proxies can't stay synchronized with all backends
Security Considerations:
While preserving all parameters, proxies SHOULD implement security controls:
- Size limits: Request body and parameter size limits prevent resource exhaustion
- Rate limiting: Enhanced rate limiting for requests with suspicious patterns or many unknown parameters
- Attack pattern detection: Scanning for injection patterns can trigger additional rate limiting without filtering valid parameters
Important: Security validation MUST NOT filter parameters from requests. Proxies may block entire requests for security/policy reasons (e.g., guardrails), but must not selectively remove parameters before forwarding.
Proxies MUST NOT:
- Validate inference server requirements: Don't check for required fields like
messagesorprompt— the inference server will do this - Validate parameter types or ranges: Don't check if
temperatureis a number or within range — the inference server will do this - Validate nested field structures: Don't parse or validate message contents, tool schemas, etc.
- Filter unknown parameters: Forward all fields, even if unrecognized
- Enforce strict schema compliance: Allow extra fields beyond defined schemas
- Reject requests solely for unknown fields: Unknown fields are expected and valid
Proxies MAY inspect parameter values for:
- Logging: Record parameters for debugging/auditing
- Security scanning: Detect attack patterns, anomalous values, or suspicious parameter combinations
- Routing: Route requests based on parameter values (e.g., by model)
- Policy enforcement: Check compliance with organizational policies
- Billing/metering: Track usage based on parameters
However, proxies MUST:
- Forward original values: Don't modify parameters based on inspection
- Preserve all fields: Including those used for routing/logging
- Maintain structure: Don't flatten, restructure, or rename fields
Security Note: When logging requests, redact authorization headers and API keys to prevent credential leakage.
Streaming is critical for agentic applications and user-facing chat interfaces. Proxies MUST handle streaming correctly to avoid breaking real-time user experiences.
When stream=true is set in the request, the inference server returns responses using the Server-Sent Events (SSE) protocol.
SSE Protocol Requirements:
Proxies MUST preserve the SSE protocol format exactly as received from the inference server:
data: <single-line JSON object>
data: <single-line JSON object>
data: <single-line JSON object>
data: [DONE]
SSE Protocol Rules:
- Each chunk is prefixed with
data:(literal string "data:" followed by space) - Each chunk payload is a JSON object serialized as a single line (no newlines within the JSON itself)
- Each chunk is followed by two newlines (
\n\n) - Stream terminates with
data: [DONE]\n\n
JSON Structure:
The JSON structure within each data: line varies by endpoint (/v1/chat/completions, /v1/responses, /v1/messages, etc.). Proxies MUST:
- NOT modify the JSON structure of chunks
- Consult the OpenAI API documentation for the endpoint you are proxying to understand the expected chunk schema
- Preserve all fields in each chunk, including unknown or extension fields (e.g., vLLM-specific fields like
token_ids,reasoning)
Client Disconnect Handling:
- If client disconnects mid-stream, proxy SHOULD close the upstream connection to the inference server promptly
- This prevents the inference server from wasting resources generating output no one will receive
Proxies MUST:
- Preserve
Content-Type: text/event-streamheader - Maintain
data:prefix for each chunk - Preserve two newlines (
\n\n) after each chunk - Forward
data: [DONE]terminator - NOT modify JSON structure within chunks
Proxies MUST NOT convert between streaming and non-streaming formats:
PROHIBITED:
- Converting
stream=truerequest tostream=falseto the inference server - Converting
stream=falserequest tostream=trueto the inference server - Buffering entire streaming response and returning as single response
- Splitting non-streaming response into fake streaming chunks
ALLOWED:
- Forwarding
stream=trueasstream=truewith SSE chunks passed through - Forwarding
stream=falseasstream=falsewith JSON response passed through
Rationale: Clients rely on streaming for real-time feedback. Converting streaming to non-streaming introduces unacceptable latency and breaks user experience. Converting non-streaming to streaming adds complexity without benefit.
Proxies SHOULD NOT introduce significant latency in streaming responses.
Best Practice: Forward chunks immediately upon receipt from the inference server.
Limited Buffering: Proxies MAY buffer chunks briefly for:
- Semantic inspection: Analyzing chunk content for logging or policy enforcement
- Header injection: Adding proxy-specific headers or metadata
- Format validation: Ensuring SSE format correctness
- Multi-token context operations: Processing that requires analyzing multiple tokens together (e.g., guardrail evaluation, PII redaction, safety filtering) before forwarding. This "look-ahead" buffering is acceptable as long as the response remains streaming.
However, proxies MUST:
- Emit chunks in order: Never reorder chunks
- Minimize added latency: Keep added latency per chunk under 10-50ms
- Note: This is added latency per chunk, not total latency
- Example: For 100 chunks, acceptable added latency is 1-5 seconds total
- NOT buffer entire response: Don't accumulate all chunks before forwarding
- NOT convert to non-streaming: Even if buffering for inspection or multi-token analysis, the final delivery must remain streaming
Latency Measurement:
- Measure time from receiving chunk from the inference server to forwarding to client
- This should be milliseconds, not seconds per chunk
Proxies MUST preserve the structure of streaming response chunks exactly as received from the inference server.
General Requirements:
Proxies MUST:
- Preserve all fields in each chunk, including:
- Incremental content fields (structure varies by endpoint)
- Completion status indicators (e.g.,
finish_reason) - Usage/billing information (if present in any chunk)
- Extension fields (e.g., vLLM-specific fields like
token_ids,reasoning) - Nested objects and arrays
- NOT modify field values within chunks
- NOT move fields between chunks (e.g., moving usage data to a different chunk)
- NOT strip or filter any fields, even if their purpose is unclear
Usage Information:
If usage information appears in streaming chunks (controlled by stream_options.include_usage or similar parameters), proxies MUST:
- Preserve the usage object exactly as received
- Include all nested fields (e.g.,
cached_tokens,prompt_tokens_details) - NOT move usage to a different chunk than where the inference server placed it
- NOT strip usage information (important for billing/metering)
Rationale: Different endpoints have different chunk schemas. The proxy's job is to forward chunks faithfully, not to understand or validate their structure. Consult the OpenAI API documentation for the endpoint you are proxying to understand the expected chunk format.
Proxies MUST forward stream_options parameters:
{
"stream": true,
"stream_options": {
"include_usage": true,
"continuous_usage_stats": true
}
}Parameters:
include_usage- Include usage in final chunkcontinuous_usage_stats- Include usage in every chunk (vLLM extension)
Many endpoints stream data incrementally, where a complete data structure (e.g., tool call arguments, structured output, reasoning chains) is built across multiple chunks.
Proxies MUST preserve the incremental structure of streaming data:
General Requirements:
Proxies MUST:
- Forward incremental data exactly as received without buffering to reconstruct complete structures
- Preserve partial or incomplete data in chunks (e.g., partial JSON strings, incomplete arrays)
- NOT validate or parse incremental data mid-stream (it may be syntactically invalid until the final chunk)
- NOT reformat or restructure incremental data before streaming completes
Proxies MUST NOT:
- Buffer chunks to return "complete" data structures (e.g., complete JSON objects, complete tool calls)
- Reformat partial data before streaming completes
- Filter or skip chunks containing partial/incomplete data
- Validate syntax of incremental data mid-stream
Rationale: Clients rely on streaming to receive incremental updates as they arrive. Buffering to reconstruct complete structures defeats the purpose of streaming and introduces unacceptable latency.
Endpoint-Specific Behavior: Different endpoints stream different types of incremental data (tool calls, structured outputs, reasoning chains, etc.). Consult the OpenAI API documentation for the endpoint you are proxying to understand what data may be streamed incrementally.
Proxies must handle HTTP headers and request/response metadata correctly to support authentication, tracing, and CORS.
Proxy Flexibility: Proxies MAY inject, modify, or terminate authorization headers for policy enforcement or authentication mediation.
Use Cases:
- Authentication termination: Proxy validates client auth, sends new auth to the inference server
- Example: Client sends OAuth token, proxy validates, sends API key to vLLM
- Appropriate for: External-facing proxies, multi-tenant systems, security boundaries
- Auth injection: Proxy adds authentication that clients don't provide
- Example: Internal network proxy adds service account credentials
- Appropriate for: Internal proxies, service mesh, trusted network segments
- Auth forwarding: Proxy forwards client auth unchanged
- Example: Simple reverse proxy passes
Authorization: Bearer <token>through - Appropriate for: Internal routing only, when proxy is NOT a security boundary
- Example: Simple reverse proxy passes
Requirements:
- If proxy terminates client auth, it MUST establish auth with the inference server backend (e.g., vLLM via
--api-key) - Proxies MUST NOT forward client auth to the inference server if proxy is responsible for authentication
- Proxies MUST NOT expose inference server credentials to clients
Security Considerations:
When handling authorization, be aware of:
- Credential protection: Redact authorization headers in logs; never include credential values in error messages
- Timing attacks: Use constant-time comparison when validating API keys
- Brute force protection: Rate limit authentication failures
- Credential storage: Store backend credentials in environment variables or secrets manager, not in code
Example Scenarios:
Client: Authorization: Bearer client-token-123
Proxy: Authorization: Bearer client-token-123 (forwarded unchanged)
vLLM: Validates client-token-123
Note: Only appropriate for internal routing where proxy is not a security boundary.
Client: Authorization: Bearer client-oauth-token
Proxy: (validates OAuth token)
Proxy → Inference Server: Authorization: Bearer vllm-api-key-456
vLLM: Validates vllm-api-key-456
Note: Recommended pattern when proxy acts as security boundary.
Client: (no Authorization header)
Proxy → Inference Server: Authorization: Bearer vllm-api-key-789
vLLM: Validates vllm-api-key-789
Note: Only appropriate for internal proxies in trusted network segments.
Proxies SHOULD forward X-Request-Id back to the client for request tracing across system boundaries.
Proxies MAY generate an X-Request-Id if one is not already present in the client request.
Purpose:
- Trace requests across multiple hops (client → proxy → vLLM)
- Correlate logs across services
- Debug request flow in distributed systems
Example:
Client → Proxy: X-Request-Id: req-client-12345
Proxy → Inference Server: X-Request-Id: req-client-12345 (preserved)
Inference Server → Proxy: X-Request-Id: req-client-12345
Proxy → Client: X-Request-Id: req-client-12345 (returned)
Proxies MUST preserve custom headers unless required for security policy.
Allowed Filtering:
- Remove headers that expose internal infrastructure
- Remove security-sensitive headers per organizational policy
- Remove headers explicitly marked for proxy consumption (e.g.,
X-Proxy-*) - Normalize
Serverheader to avoid version disclosure
Prohibited Filtering:
- Remove unknown headers "just in case"
- Remove headers because proxy doesn't understand them
- Remove headers to "clean up" the request
- Remove application-level custom headers that don't expose infrastructure
Proxies SHOULD propagate CORS headers for browser-based clients.
CORS Headers to Preserve:
Access-Control-Allow-OriginAccess-Control-Allow-MethodsAccess-Control-Allow-HeadersAccess-Control-Allow-CredentialsAccess-Control-Allow-Expose-HeadersAccess-Control-Max-Age
Rationale: Browser-based clients (JavaScript frontends) require CORS headers to make cross-origin requests. If proxy strips these headers, browser requests will fail.
Security Warning - CORS Misconfiguration:
Never use wildcard CORS (Access-Control-Allow-Origin: *) with authenticated APIs. This allows any malicious website to call your API from a user's browser. Instead, validate the Origin header against an allowlist of trusted domains. Wildcard CORS is only acceptable for truly public, unauthenticated APIs.
Note: Browsers forbid the combination of wildcard origin with Access-Control-Allow-Credentials: true.
Alternative - Proxy-Managed CORS:
Proxy MAY add its own CORS headers instead of forwarding the inference server's, if proxy is responsible for CORS policy enforcement. If proxy adds CORS headers, it MUST strip the inference server's CORS headers to avoid duplicate/conflicting headers.
Proxies MUST forward errors from the inference server to clients without modification, and MAY add proxy-specific errors when appropriate.
Proxies MUST preserve HTTP status codes from the inference server unchanged.
Proxies MUST NOT:
- Convert specific error codes to generic ones (e.g., 400 → 500)
- Use proxy-specific status codes for inference server errors
- Swallow errors and return 200 OK
vLLM returns errors in OpenAI-compatible format:
{
"error": {
"message": "Invalid value for 'temperature': must be between 0 and 2",
"type": "invalid_request_error",
"param": "temperature",
"code": 400
}
}Proxies MUST:
- Preserve the entire
errorobject structure - Forward
type,param, andcodefields unchanged - NOT add/remove fields from the error object
- NOT reformat or translate error messages, except for security-sensitive modifications such as:
- PII redaction (e.g., removing echoed user input containing personal information)
- Credential or secret redaction
- Internal path or hostname sanitization
When modifying error messages for security purposes, proxies SHOULD:
- Preserve the general meaning and diagnostic value of the error
- Use clear redaction markers (e.g.,
[REDACTED],***) to indicate modification - Document the redaction policy for operators and users
Proxies MAY add their own errors for proxy-level failures.
Use Cases:
- Proxy auth validation fails (before forwarding to vLLM)
- Proxy rate limiting triggered
- Proxy policy violation detected
- Proxy cannot connect to the inference server backend
Requirements:
- Use distinct error types to differentiate from inference server errors
- Good:
"type": "proxy_auth_error","type": "proxy_rate_limit" - Bad:
"type": "invalid_request_error"(conflicts with inference server)
- Good:
- Include clear
messageindicating error source- Good:
"message": "Proxy: Request exceeds rate limit" - Bad:
"message": "Rate limit exceeded"(ambiguous source)
- Good:
- Use appropriate HTTP status codes (400, 401, 403, 429, 500, 503)
Security Note - Information Disclosure:
Error messages should not leak sensitive information that could help attackers. Avoid including in error messages:
- Internal URLs, IP addresses, or hostnames
- File paths or directory structures
- Software versions or build numbers
- Stack traces in production
- Credential format hints
- Detailed validation failure reasons
Use generic messages like "Proxy: Upstream service unavailable" or "Proxy: Authentication failed" instead of detailed error information.
Consider rate limiting error responses (especially 401/403) to prevent brute force and enumeration attacks.
Proxies MUST NOT swallow errors.
PROHIBITED:
- Returning
200 OKwhen the inference server returns error - Replacing inference server error with generic "service unavailable" message
- Logging error but not forwarding to client
- Retrying internally without informing client
ALLOWED:
- Forwarding inference server error unchanged
- Adding proxy error context while preserving inference server error
- Logging errors for debugging (in addition to forwarding)
- Modifying error messages for scenarios like PII redaction while preserving general meaning and diagnostic value
For streaming requests, errors may occur mid-stream.
Error in SSE Stream: If vLLM encounters error during streaming, it MAY send error chunk:
data: {"error":{"message":"Generation error","type":"server_error","code":500}}
Proxies MUST:
- Forward error chunks in SSE format
- NOT buffer error until end of stream
- NOT convert streaming error to non-streaming response
- Terminate stream after error chunk
If proxy cannot connect to the inference server backend:
Proxies SHOULD:
- Return
503 Service Unavailablestatus - Include generic error message indicating upstream unavailability
- Use proxy-specific error type (e.g.,
"proxy_upstream_error") - NOT leak internal infrastructure details (URLs, IPs, ports, timeouts)
Example:
{
"error": {
"message": "Proxy: Upstream service unavailable",
"type": "proxy_upstream_error",
"param": null,
"code": 503
}
}Note: Avoid including internal URLs, IP addresses, ports, or timeout details in error messages.
Production proxies should consider implementing additional security controls:
- Size limits: Request body and parameter size limits help prevent resource exhaustion (DoS attacks)
- Streaming limits: Stream duration and idle timeouts prevent hung connections
- Rate limiting: Per-client rate limits protect against abuse; lower limits for suspicious requests or authentication failures
- Connection limits: Limit concurrent connections per client and to the inference server backend
- Transport security: External-facing proxies should use HTTPS/TLS; internal proxies should use HTTPS or operate in trusted network segments
These guidelines define requirements for proxies that sit between clients and vLLM (or other OpenAI-compatible inference servers). The core principle is transparent proxying: preserve all parameters and responses, maintain streaming format, and forward errors faithfully.
To build a compatible proxy:
- Support required endpoints:
/v1/chat/completions,/v1/models - Preserve all parameters: Use
extra="allow"pattern, forward unknown fields - Preserve all response fields: Don't filter response, forward everything
- Handle streaming correctly: Maintain SSE format, don't convert to non-streaming
- Manage headers appropriately: Handle auth, preserve X-Request-Id, forward custom headers
- Forward errors: Don't swallow errors, preserve status codes and error format
- Validate implementation: Test that your proxy preserves parameters, responses, streaming format, and errors as specified
For questions or issues, refer to the vLLM Documentation: https://docs.vllm.ai/