Skip to content

Instantly share code, notes, and snippets.

@stanislavkozlovski
Last active October 18, 2025 09:37
Show Gist options
  • Select an option

  • Save stanislavkozlovski/7853a1c0a73caba81de53f4e36c618f5 to your computer and use it in GitHub Desktop.

Select an option

Save stanislavkozlovski/7853a1c0a73caba81de53f4e36c618f5 to your computer and use it in GitHub Desktop.

Scaling Postgres to the Next Level at OpenAI

Speaker: Bohan — Member of Technical Staff, OpenAI

Event: POSETTE 2025 / https://www.youtube.com/watch?v=NvY2kvi1Fa0

Topic: How OpenAI scales PostgreSQL to the next level

Note: Transcription may be imperfect. Speaker has a heavy Asian (chinese?) accent which makes it hard to parse


Part 1

Everyone, I'm Wohan Zhang. I'm a member of technical staff at OpenAI.
In this talk, I will discuss how we scale Postgres to the next level at OpenAI. Specifically, we will talk about how we operate Postgres and the license feeder and the challenge we face and how we address them at OpenAI.
If you are a DBA operating production Postgres at a very large scale, you might be interested in this talk.
Okay, let's get started. Let's see what we did at OpenAI to scale Postgres.


Acknowledgements

Before starting the talk, I would like to acknowledge the hard work from my teammates.
Thanks, Su Cheng Liu, Charming Yu, Cheng Long Hao, Ben Rice, and more people in the OpenAI infra team.
Thanks for all their hard work to make Postgres super reliable and scalable at OpenAI.


Background

A little bit about myself.
I am a member of technical staff at OpenAI.
Before that, I was a co-founder at Olerton, which is Samuel Spinoff, based on many years of research from Carnegie Manuel Database Group.
So we basically applied the machine learning and AI approach to automatically optimize your databases, specifically optimize Postgres and My Mexico.
And before Otterton, I was a researcher at Carnegie Manu Database Group, working with Professor Andy Pawdle on the other research project.


Postgres at OpenAI

Postgres is the backbone of our most critical system at OpenAI.
So if Postgres goes down, many of our key features will become available.
So Postgres feeders have previously that to multiple self-0vers, which take down the entire ChatGPT service.
So Postgres is super important at OpenAI.
So scaling Postgres to meet OpenAI demand is also super important, but it's also non-trivial. It's not an easy task.


Initial Architecture

Historically, we operate on a single primary instance in Idris without sharing at all for a very long time until we encounter the RAS scalability limits.
Since if you only have one single primary, you cannot scale RIS horizontally.
So in a single primary multiple replicas architecture, RAS scalability can become some bottleneck.

For the requirements, you can add more replicas to scale reads.
But for the RAS, we only have one single primary.
So this architecture actually, in this architecture, the RASDB is a bottleneck.


Initial Optimizations

We did lots of optimization to make it work.
First of all, we move most of the read-heavy workloads to other systems.
So we migrate off the RAT-Heavy workloads from Postgres.
We also make sure new tables and new workloads are not allowed, which means that we only support the current workload.
We don't want the new route traffic to the current setup.
We also did lots of optimizations to ensure the current architecture has sufficient runway to support existing read-heavy workloads and future growth.
We'll discuss that what optimization we did in the next few slides.
But right now, we have a sufficient runway in the current single primary setup.


MVCC and Write Challenges

What we found is that Postgres is actually not ideal for RAT heavy workloads.
I've discussed this in details later, but for the OpenAS read-heavy workloads, it can scale really well.

So there are some known challenges in the right-heavy workloads.
First of all, I think Postgres MVCC design, which is the multi-version concurrency control.
Postgres MVCC design has some known issues. It's not very efficient.

There are several issues. The first thing is it will cause the table and index bloat.
Another thing is that it will also increase the auto vacuum tuning capacity.
And also the version churn. They also have the version churn from triple coding.
And also it increases the index maintenance overhead.

If you want to learn more about that, Andy and I wrote a blank article in 2023. The title is “The part of Postgres we hate the most.”
We discussed the MVC design in details and the issues it has.


Part 2

Write-heavy Replication Issues

For the right-heavy workloads, it's also difficult to scale to read replicas.
One thing is that right-heavy workloads can generate lots of red hat logs.
So those red-hat logs will be shipped to the replicas.
If you have lots of the reds and generate lots of red-hat log, the high volumes you need to ship more data to the read replicas, it can potentially increase the replica lag in the read replicas.

Another thing is that the problem even gets worse when you have more replicas because the primary need to send the same copy of write-hat log to many replicas.
If you have many lots of read replicas, the network bandwidth can be a bottleneck.

That's why the heavy workload can change in Postgres.


Optimizations to Reduce Load

So after we margin most of the red-heavy workload from Postgres, currently the read-heavy workloads are still served by the unshortly Postgres in error.
The problem is how. The question is how? How can we just serve the opening as read-heavy workload in a cluster with just a single primary instance?
We did lots of optimizations here.


Write Optimization Techniques

The first thing is that we try to reduce the load on the primary because we only have the one primary. We want to reduce the load as much as possible.
Particularly, we want to reduce the load from both writes and reads.

For the rights, we want to mitigate the rights of bugs in primary.
So the first thing we did is that we migrated write-heavy workloads from Postgres to other systems.
And also we reduce the number of writes at the application level because we also identify some bugs in the application side, which generate unnecessary writes.
So we carefully optimize the application to make sure it only writes to the Postgres when necessary.

And there will not be any duplicates or unnecessary writes in the application.
The start thing we did is that we also use daisy writes where possible to smooth out the right spots to avoid the spikes and smooth it out.

The last thing we did is that another thing we did is that we also, when backfielding a field, we also straight limit the backfielding.
It means that it may take a week to backfill a field for the RS table, but it's really necessary because if you don't read limit the backfield, it will cause the right spike in Postgres, which can cause lots of issues.
So when backfielding a field in Postgres tables, we will set the read limit on that.
That's what we did for the writes.


Read Optimizations

But for the reads, we also upload requirements from primary to read replicas because we want to offload the requirement from primary whenever possible to also reduce the primary load.
Because we only have one primary, if there is a read queries, I think if we can move to read replicas, we will not read from the primary.

Some reads cannot be moved due to transactions.
So for those queries which are required to stay on the primary, we want to make sure those queries are efficient in primary.

Another thing we did is the query optimization.
We optimize the queries a lot to avoid anti-query anti patterns and also avoid on running queries.

So downrunning addle queries can cause lots of issues in Postgres. It can be auto-vacuum and consume lots of resources.
We avoid downrunning queries by setting timeout on different levels.

We set the add-o-intrusting session timeout as a primitive in Postgres, which will kill the queries if it's idle in the transaction for downtime.
We also set the statement timeout because sometimes we find some queries actually idle in a transaction, but the status is active for some reason.
So which means the addle interesting session timeout will now kill those queries. That's why we also need to set the statement timeout.
In the meantime, we also set the client-side timeout in the application.


Part 3

Avoiding Query Anti-patterns

If there are some request is longer than the threshold, in application, it will also kill that request.
We also avoid OLTP query anti-patterns because we observe multi-wave joints in Postgres queries.
And the most expensive queries will join actual 12 tables together. It's super expensive.

And here previously, we had a search on such queries and it's actually there to serve in the past.
So we want to avoid those expensive multi-table joints queries in Postgres because it's not a typical LTP queries.

To avoid those expensive multi-way joint queries, we actually modify the application code to make sure it to break down the joints in the application side.
So instead of joining 12 tables in the Postgres server, we just did it join in the application so that Postgres queries will be still efficient.

And in the application, I think we optimize on application level to make sure we don't have such expensive query in Postgres.

Another thing we noticed is that users don't write raw query, single queries in Postgres.
They usually use ORM. But developing with an ORM can easily lead to inefficient queries because when writing with ORM, you don't know what's the query generated by the ORM.
So usually it has some bad queries. So this requires the developers to carefully write the code using ORM to make sure it won't generate the bad queries.


Handling Single Point of Failure

Another challenge we face is the single point of failure.
Because the primary instance, there is only one primary instance. It can be a single point of failure.
So if that single primary is done, if that instance goes down, no writes will be performed because we only have one writer.
So no writes can be performed anymore.

But we have many re-replicas. So if one read replica fails, applicant can still read from others.
So we have single point failure on primary, but not for re-replica.

So in application side, we make sure that most critical requests actually read only, and they only need to read from real replicas as they don't, and they don't need to write a read from primary, which means that even if the writer goes down, those critical requests can still operate by fetching data from read replica.

So it usually will not cause the severe issues to our users.
So if the primary goes down, it will be a sub two instead of a sub zero or seven because of the most critical requests can still operate by reading data from re-replicas.

So in this way, we can mitigate the impact of the single point of failure from primary.


Prioritization and Replica Segmentation

Besides that, we also categorize the request by priority.
So the idea is that some requires have high priority and some requires have low priority.

So high priority requests will have a greater impact on users when they are unavailable. So it will cost sub zero compared to the low priority, which will only cause sub two.
So we categorize the request by the high priority versus low priority.

And the next step is that for the high priority request, we want to make sure we have the dedicated replicas for those high priority requests to prevent them from being impacted by other low priority requests.

So the idea is that if there are some expensive queries in a low priority request, it will not impact the high priority request.
So for example, if there's a spike in the expensive query in the low priority request, it only affects the low priority request because we have a dedicated reactive card for high priority requests.
So this can help us.


Part 4

Rate Limiting and Connection Pooling

We also need to read the image. We also have read limit in different places.
This is because a search from a single expensive query, as we discussed before, can bring down the entire service.

So previously we have some expensive queries running on primary. For example, there was a 12-way joint queries and typically the volume was low. So it's okay in the most of the time.
But the issue is that sometimes there will be a sudden spike involved those expensive queries and it can took down the entire instance.

So in the past, we had such a spike in the 12-way joint query and the CPU usage jumped from 20% to 99%.
And eventually the whole instance was down because of this single query.
So we want to have some read limit to help to mitigate this case.

So we have read limited on different levels.
The first level is that we have is the application level. So we read limit the application functions to reduce the load during the big traffic.
So the idea is that for the application request, we should have read limit so that we don't have a spike from specific users or specific region.

Another level is the new is on a new connection. We also read limit on the creation of the new connection because it's also super important for us to prevent connection pool exhaustion.

So in the past, we saw some like the huge number of connections. It caused several lots of issues.
So we also read limit the creation of new connection to make sure the connection pool is not exhausted.

We also read limit for specific query digest.
This is because the past incidents we had, so the whole service was took down by a specific expensive query.
So we have a feature to read limit on specific query digest.
So if some queries, if that issue happen again, we can block the query entirely or sell on specific queries so that it will not eat a lot of results and take down the whole service.


Connection Pooling (PgBouncer)

Connection pooling is also super important because PVC, we have some cells like running out of the connections from Postgres.
So yeah, in the old days, we even don't have a connection pool and it's the is very easy to have the connection these options.

So right now we are using PG bouncer as a Postgres proxy.
It acts as a connection pool which enables a connection reuse.
The idea is that the connection can be reused by different applications.
You don't need to close and open the connection again and again.
So the connection can be reused.

This can significantly reduce connection latency.
From our experiment, it can basically reduce the connection latency from 50 milliseconds to less than 5 milliseconds.
That's a huge improvement for creating connections.

It also reduces the number of connections because those connections can be reused across different applications.
This is super important for us given we have that 5,000 connection limit on primary and it's a hard limit on the Postgres instance and we cannot increase that limit.

So we need to use those connections carefully and connection pool and PG bouncer help us.
And it's also responsible for to fill over. So if the re-replica fails, sorry, not fillover, but if a re-replica fuse, the traffic will be automatically routed to other available replicas through this through the PG.


Part 5

Schema Management

Another super important aspect is about the schema management.
Because if you don't do it correctly, you might have some expensive full rewrite for the entire table, which can cause lots of issues.

So for schema, we need to do it very carefully. So at OpenAI, only that with schema changes are allowed.

So first of all, we don't allow creating new tables or introducing new workloads in Postgres because we don't want to add the new direct traffic in the current setup.
So because the current title does not scale the REST very well, so we don't want the new right traffic in this setup.

We also allow adding or removing columns in tables with the five seconds timeout. So any changes that require a table rewrite are not allowed.

So for some schema changes, for example, when you order a column type, it might need to adopt an entire table and rewrite the whole table.
It will basically interrupt your production workloads. We definitely don't allow that.

So we only allow some that which might change, like adding nullable columns or removing column, which will typically finish within a second. We only allow those operations.

So for adding columns, we only allow adding nullable columns. Otherwise, it may cause some full rewrite for the entire table.
For example, if you set for the column, if you set the default to a dynamic variable, it might trigger rewrite for the entire table.
So for adding a column, we also require the column be nullable.

We also allow adding dropping indexes concurrently.
One thing we found is that schema change can be blocked by some consistent queries.

So in a long running queries, when the query is running longer than one second, if those queries are consistent present on the target table, the migration may fail because those queries will block the schema migration.

What we found is that to find those queries, we basically run the cycle on the pgStat activity.
The query is the stack star from PGStack activity. We are query your target table name and the duration is longer than one second.
We run this query periodically.

And if some queries are consistent present, it means that those queries are actually blocking the schema migration for that on the target table.
So if we find those queries, we can either fix those queries in the application to optimize queries to avoid those query consistent to avoid those long-running queries, or we can move them to re-replicas so that it will unblock the schema migration on the primary immediately.


Conclusion

That's it from my talk.
So overall, I think after all the optimization we did, we are super happy with the Postgres right now for our read-heavy workloads.
It's really scalable and reliable right now.

I think in the past six months, we don't have any sub-zero regarding to Postgres.
Sorry, we do have one self-0, but it is because we have a sudden spike of the right traffic.

And besides that, we don't see any that zero on Postgres instance.

So I think the lesson is that for write-heavy workload, Postgres might not be suitable, but for read-habbed workload at OpenAI, if you optimize it carefully, I think it can scale the OpenS workload, read-heavy workload to a very large scale, even with a single primary multi-replayer architecture without sharing at all.
Thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment