usegalaxy.org uses a five-host setup:
- 2 x hosts for serving web requests
- 2 x hosts for handling jobs and Pulsar staging
- 1 x host for the database
Web requests are balanced only with a DNS round-robin (i.e. there are two A
records for usegalaxy.org), which is not ideal since it does not do any load balancing or health checks. A better setup would include a 6th host with nginx proxying the web hosts.
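Such a proxy host could look something like the following minimal nginx sketch (the hostnames and upstream parameters here are illustrative, not our actual config; note that open source nginx only does passive failure detection via `max_fails`/`fail_timeout`, not active health checks):

```nginx
# minimal load-balancing sketch for a dedicated proxy host
upstream galaxy_web {
    # mark a backend as down for 10s after 3 consecutive failures
    server galaxy-web-01.example.org:443 max_fails=3 fail_timeout=10s;
    server galaxy-web-02.example.org:443 max_fails=3 fail_timeout=10s;
}

server {
    listen 443 ssl;
    server_name usegalaxy.example.org;

    location / {
        proxy_pass https://galaxy_web;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```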
There are two primary benefits to this setup. Four hosts are not strictly necessary; the big benefits come from separating the web and job functions:
- Any performance issues relating to staging Pulsar jobs and job handling in general will not affect the general responsiveness of the UI as long as the common file store is responsive.
- Determining the cause of Galaxy issues is much easier when these functions are kept separate: if the UI is slow, you can be reasonably sure it's not due to job handling, and vice versa. This makes the admin's job of diagnosing problems somewhat easier.
Additionally, this method is compatible with uWSGI Zerg Mode, which the uWSGI Mule job handlers are not.
You can't use uWSGI Mules as job handlers in this setup, since mules rely on a message passing scheme between processes on the same host. Instead, you will need to use "dynamic" job handlers and the DB SKIP LOCKED job handler assignment method, which uses a row locking scheme on the `job` table in the Galaxy database in place of message passing. Thankfully, we've been using this method in production for a number of years without issue, so it's well tested.
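The underlying mechanism is the database's `SELECT ... FOR UPDATE SKIP LOCKED` support (available in PostgreSQL 9.5+). A simplified sketch of the kind of query involved — not Galaxy's actual SQL, and the column values are illustrative:

```sql
-- Each handler atomically claims up to N unassigned new jobs. Rows already
-- locked by another handler's in-flight transaction are skipped rather than
-- waited on, so handlers never block each other.
UPDATE job
SET handler = 'handler0'
WHERE id IN (
    SELECT id FROM job
    WHERE state = 'new' AND handler = '_default_'
    ORDER BY id
    LIMIT 8
    FOR UPDATE SKIP LOCKED
);
```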
As with everything, the setup for usegalaxy.org is in the usegalaxy-playbook Ansible playbook. However, because that playbook can be a bit dense, here are the general steps:
- Configure the job config for the uWSGI + Webless strategy.

  To begin, make sure there are no `farm` or `mule` options set in the `uwsgi` section of `galaxy.yml`. Then, you'll simply need a `<handlers>` section in `job_conf.xml` without any individual `<handler>`s defined and without a default, like so:

  ```xml
  <handlers assign_with="db-skip-locked" max_grab="8" />
  ```

  The `max_grab` parameter controls how many new jobs a handler can "grab" (assign to itself to handle) at a time. Too many and one handler can be overloaded while others are waiting; too few and it can negatively impact throughput. We're using 8; your mileage may vary. The default is to grab all new jobs, which is probably not ideal.

  If you want to map any tools to specific handlers, you can do so as described in the statically defined handlers documentation, and the usegalaxy.org config can be used as an example. We run 2 x "default" handlers and 1 x "multi" handler per job host, because in the past, handling the large outputs of multicore tools could affect the throughput of the more common/shorter running tools.

  If memory serves, you do not actually need to statically define those handlers in the job config the way that we do for usegalaxy.org, if all you want to do is statically map some tools to different handlers. I believe in our case, the only reason we do this is because only the "multi" handlers load the Pulsar plugins. This means that you probably just want to have a single `<handlers ... />` tag (as shown above) without any `<handler ...>` children tags.
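  For reference, a minimal sketch of what the `uwsgi` section might look like with this strategy (the socket and process/thread values here are illustrative, not our production settings):

  ```yaml
  uwsgi:
    socket: 127.0.0.1:4001
    master: true
    processes: 4
    threads: 4
    # note: no `mule` or `farm` options - job handling is done by
    # separate webless handler processes, not uWSGI mules
  ```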
- Configure nginx for job staging

  nginx on both the web and job hosts should be configured with `X-Accel-Redirect` (to send Galaxy datasets with nginx), and the job hosts should additionally be configured with the nginx_upload_module, since Pulsar staging doesn't use the fancy JS client chunked upload method. `X-Accel-Redirect` is configured the same way on both types of host (as shown in the Galaxy docs), but the `job_files` API upload module setup for the job hosts is slightly different:

  ```nginx
  location /_job_files {
      if ($request_method != POST) {
          rewrite "" /api/jobs/$arg_job_id/files last;
      }
      upload_store /galaxy-repl/main/upload_job_files;
      upload_store_access user:rw;
      upload_pass_form_field "";
      upload_set_form_field "__${upload_field_name}_path" "$upload_tmp_path";
      upload_pass_args on;
      upload_pass /_upload_job_files_done;
  }

  location /_upload_job_files_done {
      internal;
      rewrite "" /api/jobs/$arg_job_id/files;
  }
  ```

  In order to ensure no one uses the job handlers for standard web requests (unlikely, I know), I also redirect any non-Pulsar-staging requests to usegalaxy.org. See my nginx staging server config for reference.
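  For completeness, the `X-Accel-Redirect` side is typically just an internal alias location, as in this sketch based on the Galaxy nginx documentation (the location name must match what Galaxy is configured to emit in the `X-Accel-Redirect` header):

  ```nginx
  # internal-only location used by nginx to serve dataset files on
  # Galaxy's behalf when Galaxy responds with an X-Accel-Redirect header
  location /_x_accel_redirect {
      internal;
      alias /;
  }
  ```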
- Configure Galaxy for job staging

  In the `galaxy` section of `galaxy.yml`, you will need to set `nginx_upload_job_files_store`, ideally to a directory on the same filesystem as your datasets, and `nginx_upload_job_files_path` to the upload module path (`location`) in the previous step, `/_job_files` in our case.

  Next, in your `job_conf.xml`, for any AMQP-based Pulsar plugins, set the `galaxy_url` param to the URL of your job handler(s). In my case, I use the hostname of the handler that picked the job up, but you could use a load balancer, DNS round-robin, or hardcode a single job handler host here:

  ```xml
  <param id="galaxy_url">https://{{ inventory_hostname_short }}.galaxyproject.org</param>
  ```
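  Putting the two options together, the relevant fragment of `galaxy.yml` would look something like the following (the store path here matches the `upload_store` from the nginx config in the previous step; adjust both for your own filesystem layout):

  ```yaml
  galaxy:
    # must be writable by nginx and readable by Galaxy, ideally on the
    # same filesystem as your datasets (matches nginx's upload_store)
    nginx_upload_job_files_store: /galaxy-repl/main/upload_job_files
    # must match the nginx `location` for the upload module
    nginx_upload_job_files_path: /_job_files
  ```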
- Configure uWSGI startup

  Currently we're doing this with supervisord, simply because I haven't had the time to switch to systemd, but systemd is probably preferable these days and there is a nice Ansible role for it. One consideration is that I don't have a great method for doing Zerg Mode under systemd (here's the hacky solution I have so far: you can `systemctl reload galaxy` to initiate the Zerg Mode restart, but the reload will "time out" despite succeeding). I think @hexylena has a different method for handling zergling restarts under systemd.

  Anyway, however you do it, you just need to start uWSGI the standard way, effectively:

  ```console
  $ cd /srv/galaxy/server
  $ /srv/galaxy/venv/bin/uwsgi --yaml /srv/galaxy/config/galaxy.yml
  ```

  This is done in my supervisor config using the `directory` and `command` options. It is the same on both the web and job hosts.
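  If you do go the systemd route instead, a minimal service unit might look like the following sketch (untested here, and it does not address the Zerg Mode reload caveat above; the user/group names are illustrative):

  ```ini
  # /etc/systemd/system/galaxy.service - minimal sketch
  [Unit]
  Description=Galaxy uWSGI
  After=network.target

  [Service]
  Type=simple
  User=galaxy
  Group=galaxy
  WorkingDirectory=/srv/galaxy/server
  ExecStart=/srv/galaxy/venv/bin/uwsgi --yaml /srv/galaxy/config/galaxy.yml
  Restart=always

  [Install]
  WantedBy=multi-user.target
  ```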
- Configure job handler startup

  Dynamic job handlers are started "webless" (aka just the Galaxy application, without a web stack) and instructed to attach to a "pool". What this means is that they will (using the locking scheme described above) poll the database for new jobs to "grab". When the `DB-SKIP-LOCKED` assignment method is configured, new jobs will be created with their state set to `new` and their handler set to `_default_`. Job handlers should be attached to the "pool" named `job-handlers` in order to find these jobs.

  The "webless" entry point is the galaxy-main script, and the important options are `--server-name` and `--attach-to-pool`. A handler grabs jobs by updating the handler column of the grabbed rows to its own server name. After this they can be picked up by the regular job loop, which looks for jobs in state `new` with handler equal to the `server_name`. The `server_name` must be unique, and that handler is responsible for its jobs from start to finish, so if a handler is shut down before all of its assigned jobs are terminal, those jobs will never complete, as described in the warning here.

  I believe that the grabber filter query will include:

  - Any tags of any defined `<handler>` XML tags with `id` matching the handler's `server_name`
  - Any pools defined in `--attach-to-pool=job-handlers[.<pool>]` (multiple `--attach-to-pool` options can be specified)

  If started with just `--attach-to-pool=job-handlers` without a `.<pool>`, the handler will attach to the `_default_` pool.

  Our format for the server_name, since we have multiple job hosts, is `main_{hostname_alias}_handler{instance_number}`, where `hostname_alias` is just a shorthand for the VM's hostname (e.g. `w3` for `galaxy-web-03`) and `instance_number` is 0, 1, or 2. See the supervisord config for reference.

  In the case of usegalaxy.org, mainly for legacy reasons (I only realized it was still the case when I started writing this HOWTO!), we have called the default handler pool `handlers` by setting `default="handlers"` on the `<handlers>` tag and `tags="handlers"` on the individual handler definitions. That means that in our case, new jobs are created with state `new` and handler `handlers`, which is consequently the same thing the "job grabber" query filters on. I say all of this just to give some insight into how job grabbing works. Don't set a default handler and this should be irrelevant to you.

  tl;dr: if you want to start 2 handlers attached to the `_default_` pool, you would define no individual `<handler>` tags in `job_conf.xml` as described in step 1, then start the handlers with `cwd` in `/srv/galaxy/server` with:

  ```console
  $ /srv/galaxy/venv/bin/python3 /srv/galaxy/server/scripts/galaxy-main -c /srv/galaxy/config/galaxy.yml --server-name=handler0 --attach-to-pool=job-handlers
  $ /srv/galaxy/venv/bin/python3 /srv/galaxy/server/scripts/galaxy-main -c /srv/galaxy/config/galaxy.yml --server-name=handler1 --attach-to-pool=job-handlers
  ```
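  For reference, a supervisor program entry for one such handler might look like this sketch (the program name and option values are illustrative, not our exact production config):

  ```ini
  ; one webless job handler attached to the default pool
  [program:galaxy_handler0]
  directory = /srv/galaxy/server
  command = /srv/galaxy/venv/bin/python3 ./scripts/galaxy-main -c /srv/galaxy/config/galaxy.yml --server-name=handler0 --attach-to-pool=job-handlers
  user = galaxy
  autostart = true
  autorestart = true
  stopasgroup = true
  ```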
If, like me, you want to map certain tools to different handlers, don't use my complicated config. I believe you could do all of that with a `job_conf.xml` like:
```xml
<?xml version="1.0"?>
<job_conf>
    <plugins>
        <!-- usual stuff here ... -->
    </plugins>
    <handlers assign_with="db-skip-locked" max_grab="8" />
    <destinations>
        <!-- usual stuff here ... -->
    </destinations>
    <tools>
        <tool id="bwa" handler="mappers"/>
        <tool id="hisat2" handler="mappers"/>
    </tools>
</job_conf>
```
And then start your handlers (if you want, say, 2 to handle default jobs and 2 to handle mapper jobs) like:
```console
# start 2 handlers grabbing from the '_default_' pool
$ /srv/galaxy/venv/bin/python3 /srv/galaxy/server/scripts/galaxy-main -c /srv/galaxy/config/galaxy.yml --server-name=handler_default_0 --attach-to-pool=job-handlers
$ /srv/galaxy/venv/bin/python3 /srv/galaxy/server/scripts/galaxy-main -c /srv/galaxy/config/galaxy.yml --server-name=handler_default_1 --attach-to-pool=job-handlers
# start 2 handlers grabbing from the 'mappers' pool
$ /srv/galaxy/venv/bin/python3 /srv/galaxy/server/scripts/galaxy-main -c /srv/galaxy/config/galaxy.yml --server-name=handler_mappers_0 --attach-to-pool=job-handlers.mappers
$ /srv/galaxy/venv/bin/python3 /srv/galaxy/server/scripts/galaxy-main -c /srv/galaxy/config/galaxy.yml --server-name=handler_mappers_1 --attach-to-pool=job-handlers.mappers
```