Speaker: Bohan — Member of Technical Staff, OpenAI
Event: PGConf.dev Talk / https://www.youtube.com/watch?v=Ni1SGhNu-Q4
Topic: How OpenAI scales PostgreSQL to the next level
Note: Transcription is imperfect. Speaker has a heavy Asian (chinese?) accent which made it hard to transcribe correctly.
I think it's 3 p.m. So I think we can get started. Yeah, hi. Hi, everyone. I'm very happy to get here to talk about how we scale Postgres at OpenAI. And I'm Bohan, and a member of technical staff from OpenAI.
And yeah, so this talk I will title talk is about how we scale Postgres to the next level at OpenAI. And before this talk, I want to appreciate my team members, Su Chong, Charming, Chong Long, Dimitri, Kai, Chi, Ben, and Venka, and many more at OpenNet Infra team. And thanks for them to make Postgres super reliable and scalable. And also we also thanks the Azure Postgres team and thanks for their support.
So I will talk about:
- the background: how we use Postgres at OpenAI
- the optimizations we did: how we scale to millions of QPS in the unsharded Postgres
- case studies about past outages
- some feature requests we have for the community — where Postgres could do better.
Postgres is a backbone of our most critical system at OpenAI.
So if Postgres goes down, many of our key features will become unavailable.
In the past, we actually have several related incidents about Postgres, which had a significant impact to the service that ChatGPT.
However, scaling Postgres to meet OpenAI’s demands is not an easy task.
We basically operate on a single primary instance in Azure without sharding for a very, very long time until we encounter write scalability limits.
Because for the single primary instance, you cannot scale writes. You can add read replicas to scale reads, but not writes.
- Single primary, multi-replica architecture
- Write scalability remains a bottleneck
- We moved some write-heavy workloads that are shardable to other systems
- Read-heavy workloads stay on the single-primary Postgres
- No new workloads/tables allowed on the unsharded instance
We also did optimizations to ensure the current architecture has sufficient runway to support existing read-heavy workloads and future growth.
Overall, we think Postgres is not ideal for write-heavy workloads, but for OpenAI’s read-heavy workloads, it can scale very well, even with the current unsharded architecture.
ChatGPT-comment: OpenAI intentionally keeps a single Azure primary for read-dominant workloads and forbids new writes-heavy workloads to preserve stability.
There are some known issues in Postgres MVCC design (multi-version concurrency control).
Andy and I wrote a blog article about that in 2023.
Basically, the current MVCC design can have inefficiency:
- table and index bloat
- complex, hard-to-tune autovacuum
- version churn from tuple copying (updates copy the entire row)
- increased index maintenance overhead
It’s also difficult to scale replicas for write-heavy workloads because write amplification increases WAL volume and replica lag.
With many replicas (20-30), network bandwidth can become a bottleneck.
We want to reduce the load for both writes and reads.
- Mitigate write spikes by migrating shardable workloads away
- Reduce unnecessary writes at application level (bugs or redundant writes)
- Use daisy writes to smooth spikes
- Set strict rate limits on backfills
- Shift reads to replicas as much as possible
- Keep only transactional reads on primary
- Optimize queries on primary to ensure minimal CPU cost
- Avoid long-running idle queries (they block autovacuum)
- Set timeouts:
idle_in_transaction_session_timeout(~30s)statement_timeout- client-side cancel timeouts
- Avoid OLTP anti-patterns, especially multi-way joins
- Handle joins at application level when possible
- ORM usage must be monitored closely, as it can generate inefficient queries
ChatGPT-comment: They set strict timeout policies and replaced ORM multi-joins with application-level joins to keep latency consistent.
Since there’s only one primary, it’s a single point of failure.
Read replicas can survive individual failures, but the primary is critical.
- Most critical requests are read-only → served by replicas
- If primary fails, most critical requests still succeed (become Sev2, not Sev0)
- Categorize requests into high- and low-priority
- Dedicate replicas for high-priority traffic to isolate from expensive queries
- Set read limits on multiple places to prevent a single query spike from cascading
- Limit new connection creation to prevent pool exhaustion
- Apply query-digest specific rate limits
- Previously had issues running out of connections
- Introduced PgBouncer as connection proxy
- Connection creation reduced from 50 ms → 5 ms
- Enables reuse, reducing active connection count
- Primary has a hard limit of 5 000 connections
- Connection pooling is critical for stability
- Failover handled at proxy layer — automatic replica rerouting
Schema is critical — many issues stem from schema migrations.
- Only backward-compatible schema changes permitted
- No new tables or new workloads allowed
- Only adding/removing columns, adding indexes
- 5-second timeout on DDL operations (usually finish < 1 s)
- Avoid full-table rewrites (e.g., adding columns with dynamic defaults)
- Require nullable columns when adding new ones
- Allow index addition with controlled concurrency
- Schema changes can be blocked by long-running queries
- They run periodic
SELECT * FROM pg_stat_activity WHERE query LIKE '%table%'- If long-running queries persist > 1 s, user must fix or move them off primary
With all those optimizations, we are currently pretty happy about Postgres.
We scaled Azure Postgres to millions of QPS powering OpenAI’s critical services.
- Dozens of replicas, no increase in replication lag
- Low latency across geo-distributed replicas (Asia, Australia)
- Only one Sev0 incident in nine months (write spike)
- Sufficient headroom for future growth
ChatGPT-comment: impressive claim — millions of QPS on a single unsharded Azure primary, mostly read-heavy, stable for months.
- Redis layer failure → cache misses → more direct Postgres queries
- Increased latency → app timeouts → retries → load explosion → Postgres collapse
Mitigations:
- Rate-limit at Postgres and proxy layer
- Drop excess requests early
- Rate-limit at application level to break retry storms
- Multi-table joins (up to 12 tables)
- Low baseline volume but sudden spikes after feature rollouts
- Retries amplified load further
- CPU jumped to 95 %+ → degraded primary → impacted ChatGPT
Fixes:
- Refactor queries (move joins to app level)
- Add query-level rate limiting and blocking
- Serve read-only queries from replicas
- Unsharded design sensitive to unexpected write bursts
- CPU hit 90-95 %, replicas lagged > 10 minutes
- Replica queries returned stale data
- Writes blocked new queries
Remediations:
- Optimize applications to cut redundant writes
- Move more reads to replicas
- Remove unused indexes (to reduce WAL volume)
- Scale instance/network
- Found Azure bug in
synchronous_standby_names— caused WAL sender spin CPU bug
- Fixed by Azure team → CPU dropped from 90 % → < 10 %
- Hard to get P95/P99 latency metrics natively
pg_stat_statementsonly reports averages, not histograms- Desire: percentile latency per query digest
- Would like native logging of DDL operations (add/drop column/index)
- Today use external monitoring
- Needed for accurate tracking and debugging performance regressions
- Need ability to disable indexes before dropping them
- Current workaround: drop → recreate if needed (expensive)
- A “disable index” feature would allow safe experimentation
- Sometimes see active queries stuck on
client readfor hours- Idle-in-transaction timeout doesn’t kill them since state = active
- Might be TCP keepalive defaults or Azure bug
ChatGPT-comment: this is a known area — idle client connections marked “active” can persist indefinitely on Azure due to network timeout defaults.
- PostgreSQL default knobs are “notoriously bad”
- Would like adaptive knob tuning (especially for autovacuum)
- Reference to research (Ottertune) and rule-based tuning
At OpenAI, we've proven that Postgres can scale very well for read-heavy workloads, even without sharding, using a single primary writer.
- For write-heavy: Postgres struggles
- For read-heavy: Postgres can scale massively
- Startups should “just use Postgres” for a long time
- If you hit scaling issues, that’s a good problem — it means you’re growing
Q: Regarding the stale query that is still active — could this be TCP keepalive?
A: Possibly. Defaults are high, Azure sets them by default. Similar issues seen on GitHub, no clear fix.
Q: Example of application-level write optimization?
A: Some apps inserted or updated twice unnecessarily. Fixed that.
Also use “daisy writes” — delay writes until features are used, smoothing spikes.
End of transcript.
- OpenAI runs a single unsharded Azure PostgreSQL primary with dozens of replicas.
- Achieves millions of QPS — mostly read-heavy, low latency.
- Write spikes remain the only major failure mode.
- Heavy reliance on rate limiting, PgBouncer, query control, and strict schema governance.
- Feature wishlist focuses on observability, autovacuum tuning, and index disablement.