Gaurang Mathur mathur-exe

Created December 3, 2025 20:15
Langgraph Inference Client Latency
### vLLM + LangChain/LangGraph ChatCompletion Notes
1. **Use the OpenAI Chat wrapper** – LangChain’s `ChatOpenAI` (and anything else that speaks the OpenAI chat-completions protocol) can point at any OpenAI-compatible base URL. Run the vLLM server with its OpenAI-compatible frontend on the standard `/v1` endpoint and pass that as `base_url` (or set the `OPENAI_BASE_URL` environment variable) when instantiating the client. Example:
```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="chat-model-name",            # must match the model name vLLM was launched with
    api_key="placeholder",              # vLLM doesn’t enforce a key by default
    base_url="http://localhost:8000/v1",