Skip to content

Instantly share code, notes, and snippets.

Created Nov 16, 2022
What would you like to do?
Scaling up a Mastodon server to 128K active users

Scaling up a Mastodon server to 128K active users

pgBouncer is your earliest friend. Connections to Postgres are expensive--each one is a separate process that Postgres maintains. Mastodon is a high-concurrency application, and as such needs a connection to the database for every thread in the application. Threads and processes is how you make Mastodon utilize the full potential of the hardware that it's on, and how you keep scaling it beyond a single machine. What pgBouncer does is proxy connections to Postgres through a transaction pool, allowing you to support a vast number of client-side (in this case, client meaning Mastodon) connections with only a few actual database connections. When you reach the point where it makes sense to move Postgres to its own physical machine, I recommend maintaining pgBouncer on each machine that wants to connect to it, rather than putting pgBouncer on the same machine as Postgres.

Postgres is the first component that it makes sense to move to a separate machine. It's the database, everything wants to connect to it, and it's the one piece in the machinery that only scales vertically (i.e. with more powerful hardware). You want the database server to care about only one thing: the database. SSD helps. CPUs help. RAM isn't as useful as you might think. pgTune is your friend. Remember that max_connections on Postgres doesn't need to be very high since you're using pgBouncer, only need to think about how many different pgBouncers you're going to have and how many real connections they will open (it's 20 by default). Tuning Postgres is a mystical art on its own and there is no magic max_connections value that will fit all situations, but at least there seems to be a consensus that too many connections are a bad thing. Amazon's highest Postgres offerings list their max_connection values at 1000.

You can do read-replicas of the database, by the way, but we're not there yet. Your standalone database will serve you for quite a bit before that becomes necessary. The thing that will actually give you trouble first is Sidekiq, because the default configuration is not fit for anything past a couple users. Sidekiq is the background jobs processing system. There is a lot of processing that needs to be done in Mastodon whenever users perform any action. For example, when you post, we need to retrieve a list of your followers (which can be huge) and write the new post into each of their home feeds while applying filters. And for each of your followers on other servers, we actually need to send the post to another server through a HTTP request. If you had to wait for this while the page was loading upon pressing the publish button, well, that wouldn't work. So instead all of this is queued in the background.

These jobs are split into as many tiny jobs as we can manage, because that's how you can make parallelize them best and thus make the most optimal use of hardware and horizontal scaling. But if you've got 10 threads and 22,000 followers, do not be surprised that there are delays. In fact, that is how the need for scaling Sidekiq shows itself: the dreaded backlog. When there are more jobs being enqueued than are being processed, the backlog will grow. Actually, there are more reasons that the backlog can grow, such as if there's a technical issue causing individual jobs to take longer than they normally would, or getting stuck indefinitely reducing the effective number of threads available for processing. But here we're talking about scaling. You need to increase the number of threads that process these jobs, so that more jobs can be processed at the same time.

At this point it's worth mentioning that if you want to go further, you'll need to be using object storage (S3 or similar) for user file uploads, or else manually figure out a shared filesystem between all of the machines in your cluster (very likely possible, but probably not worth it compared to even just self-hosting Minio). Tip: Use our recommended configuration for proxying requests to your object storage bucket, both to save massively on bandwidth costs and to keep consistent URLs even in the case that you need to move buckets or S3 providers in the future.

But it's not enough to stuff more threads into a single Sidekiq process. According to the author of Sidekiq, the optimal value is 25 threads per process, and no more than 50. I've previously used 90 and even 100 threads per process, and I can confirm that the throughput is higher with two processes of 50 instead of one of 100. There are reasons for that that we don't need to go into, but it's important to understand the general trade-offs here. Both threads and processes are ways to parallelize execution (i.e. you should be aiming at engaging all the CPU cores you have), threads are cheaper in terms of RAM, but there are things that threads still have to wait for that separate processes don't (the GIL). If RAM isn't an issue (and generally speaking if you compile Ruby with libjemalloc as our docs recommend, Ruby processes don't consume that much RAM), aim for more processes.

But Sidekiq is a big topic, so we're not done here. Scaling Mastodon is a lot about tuning Sidekiq. All of the background processing jobs actually go into separate queues. There's the default queue, where all the stuff concerning local user experience goes, such as writing new posts into home feeds and delivering notifications. There's the mailers queue, which renders and sends e-mails, such as confirmation e-mails, password resets, and so on. There's push, which is concerned with making HTTP requests to other servers to deliver posts to remote followers. pull, which does stuff like pull down link previews, and other things that could be slow and not super crucial to the user experience. Since 4.0, there's also ingress, processing all incoming deliveries from other servers. And finally, the special scheduler queue, which acts as an embedded cron and performs periodic tasks such as recalculating trends, publishing scheduled posts, or removing unconfirmed user accounts.

By the default, your single Sidekiq process does all of that, and this can be less than ideal. Because your 5 threads could all be busy processing incoming federation activities while a new user waits for their confirmation e-mail to be sent. But we're past a single Sidekiq process, so now it's a matter of tuning how many threads should go to which queues. The default mastodon-sidekiq.service can be copied and modified to have different services (processes) working different queues by passing the queue as the -q argument in the ExecStart line. The -c argument control concurrency, i.e. number of threads, i.e. probably 25 or 50. One thing I've learned to do recently is that you can use substitution in SystemD files to create more instances of the same service at runtime, and thus scale more easily. For example, you could have mastodon-sidekiq-default@.service with -q default, and then you could start 5 of such services with systemctl enable --now mastodon-sidekiq-default@{0,1,2,3,4}. That's 250 threads in one go!

One thing to remember is that there should only be one scheduler in your entire cluster, and it doesn't need many threads (5 is fine). Another thing to keep in mind is that while it's great to have processes dedicated to only a specific queue, this can be a little wasteful if that queue is empty but another one is getting backlogged. A solution to this is having processes work multiple queues, but with different weights (priorities). A weight simply controls the likelihood that the next job will be picked from that queue vs another. As command-line arguments, it looks like -q default,2 -q ingress -- this process will work both default and ingress, but default will have twice the chance of being worked. Whether it's better to have every single process work every single queue (except scheduler) with weights, or have different "types" of processes that only work 2-3 each, I am not sure. It's just that default is the most important one, with push and ingress being close second. mailers is also important but even just 25 threads will get you very far because the rate of sending e-mails isn't that high.

If you've got Sidekiq figured out, you're almost there. You also need to process HTTP requests from your users, which is what Puma does. On thread can process one request at the same time, so if you need to support a higher request throughput, you need more threads--same logic about processes vs. threads applies, except that Puma has a built-in way of managing processes (WEB_CONCURRENCY) and threads (MAX_THREADS) (that's threads per process), and at some point you will definitely want Puma to be on a separate machine from Sidekiq, and then have more machines with Puma, and more machines with Sidekiq. Just don't forget that once your Puma isn't on the same machine as your nginx (or whatever you use as your reverse proxy/load balancer), you will need to specify TRUSTED_PROXY_IP with the internal IP of the load balancer so that Puma can correctly parse users' IP addresses for stuff like rate limiting. Anyway, once you have more than one Puma in the cluster, you use an upstream block in your nginx configuration (or equivalent in your software) to list these Pumas and nginx will do the load balancing between them.

The streaming API will get you pretty far on default configuration, but at some point it too will not be able to answer all of the connections that your users will want to have, and you'll need to either increase the STREAMING_CLUSTER_NUM or make more machines with the streaming API on them and load balancing through nginx. The moment when this becomes necessary can be difficult to detect, because for people who've already connected, the streaming API will continue to work, it's new connections that will be rejected.

Speaking of connections, you will run into the open files limit with nginx sooner or later, which you will notice through errors like "too many open files" in /var/log/nginx/error.log, so you will have to tune nginx by increasing its worker_rlimit_nofile and worker_connections values.

But that's not all! Although we are getting close to the end. Mastodon uses Redis, an in-memory key-value store for storing the background processing queues (Sidekiq), cache, and important cache (like home feeds). Redis is very fast, but even it can reach its limits. Mastodon supports using different Redis instances for these 3 different use cases. Respectively, that is SIDEKIQ_REDIS_URL, CACHE_REDIS_URL and just REDIS_URL. (Actually, Mastodon supports REDIS_HOST, REDIS_PORT etc variants separately for all three). So in the long term, using separate Redis instances can give you more breathing room.

At last, one way to unload the primary database a little is to use read-replicas. That way read queries can go to a different machine and the primary doesn't have to work on them. Additionally, a read-replica serves as a failover and "hot backup", i.e. if the primary goes down you can switch to the read-replica and promote it to new primary much faster than spinning up a new machine from your nightly database backup with x hours of data loss since. Although it should be noted that currently, it only makes sense to connect to read-replicas from Puma, as doing so from Sidekiq may silently discard a lot of jobs due to replication lag. That's a problem still to be solved.

Copy link

deepy commented Nov 16, 2022

an extra "the" in By the default?

Copy link

mrvanes commented Nov 18, 2022

Have you considered CockroachDB as a horizontally scaling replacement for postgres?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment