Multi-host Galaxy Server

usegalaxy.org uses a five host setup:

  • 2 x hosts for serving web requests
  • 2 x hosts for handling jobs and Pulsar staging
  • 1 x host for the database

Web requests are balanced only via DNS round-robin (i.e. there are two A records for usegalaxy.org), which is not ideal since DNS alone provides no real load balancing or health checks. A better setup would add a 6th host running nginx to proxy the web hosts.
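
For illustration, that 6th host could run a simple nginx reverse proxy in front of the two web hosts. This is just a sketch, not part of the usegalaxy.org setup; hostnames are placeholders and TLS details are glossed over:

# Hypothetical load-balancing proxy on a separate host
upstream galaxy_web {
    server galaxy-web-01.example.org:443;
    server galaxy-web-02.example.org:443;
}

server {
    listen 443 ssl;
    server_name usegalaxy.example.org;
    # ssl_certificate / ssl_certificate_key omitted

    location / {
        proxy_pass https://galaxy_web;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}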

Benefits

There are two primary benefits to this setup. Four hosts aren't strictly required; the big benefits come from separating the web and job functions onto different hosts:

  1. Any performance issues relating to staging Pulsar jobs and job handling in general will not affect the general responsiveness of the UI as long as the common file store is responsive.
  2. Determining the cause of Galaxy issues is much easier when these functions are kept separate: if the UI is slow, you can be reasonably sure it's not due to job handling, and vice versa. This makes the admin's job of diagnosing problems somewhat easier.

Additionally, this method is compatible with uWSGI Zerg Mode, which the uWSGI Mule job handlers are not.

Drawbacks

You can't use uWSGI Mules as Job Handlers, since that strategy relies on message passing between processes on the same host. Instead, you will need to use "dynamic" job handlers and the DB SKIP LOCKED Job Handler Assignment Method, which uses a locking scheme on the Galaxy database's job table to assign jobs to handlers. Thankfully, we've been using this method in production for a number of years without issue, so it's well tested.

HOWTO

As with everything, the setup for usegalaxy.org is in the usegalaxy-playbook Ansible Playbook. However, because that playbook can be a bit dense, here are the general steps:

  1. Configure the job config for the uWSGI + Webless strategy.

    To begin, make sure there are no farm or mule options set in the uwsgi section of galaxy.yml. Then, you'll simply need a <handlers> section in job_conf.xml without any individual <handler>s defined and without a default, like so:

    <handlers assign_with="db-skip-locked" max_grab="8" />

    The max_grab parameter controls how many new jobs a handler can "grab" (assign to itself to handle) at a time. Too many and one handler can be overloaded while others are waiting; too few and it can negatively impact throughput. We're using 8; your mileage may vary. The default is to grab all new jobs, which is probably not ideal.

    If you want to map any tools to specific handlers, you can do so as described in the statically defined handlers documentation, and the usegalaxy.org config can be used as an example. We run 2 x "default" handlers and 1 x "multi" handler per job host, because in the past, handling the large outputs of multicore tools could affect the throughput of the more common/shorter running tools.

    If memory serves, you do not actually need to statically define those handlers in the job config the way that we do for usegalaxy.org if all you want to do is statically map some tools to different handlers. I believe in our case, the only reason we do this is that only the "multi" handlers load the Pulsar plugins. This means that you probably just want a single <handlers ... /> tag (as shown above) without any <handler ...> child tags.
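
    Regarding the galaxy.yml side of this step, the uwsgi section for this kind of setup looks roughly like the following. This is a sketch using typical values (only the virtualenv path matches the paths used elsewhere in this HOWTO); the key point is simply that there are no mule or farm entries:

    uwsgi:
      socket: 127.0.0.1:4001
      buffer-size: 16384
      processes: 2
      threads: 4
      offload-threads: 2
      master: true
      virtualenv: /srv/galaxy/venv
      pythonpath: lib
      module: galaxy.webapps.galaxy.buildapp:uwsgi_app()
      thunder-lock: true
      die-on-term: true
      hook-master-start:
        - unix_signal:2 gracefully_kill_them_all
        - unix_signal:15 gracefully_kill_them_all
      enable-threads: true
      # note: no mule or farm options here - job handling is done by separate
      # webless handler processes on the job hosts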

  2. Configure nginx for job staging

    nginx on both the web and job hosts should be configured with X-Accel-Redirect (send Galaxy datasets with nginx), and the job hosts should additionally be configured with the nginx_upload_module, since Pulsar staging doesn't use the fancy JS client chunked upload method.

    X-Accel-Redirect is configured the same way for both hosts (as shown in the Galaxy docs), but the job_files API upload module setup for the job hosts is slightly different:

    location /_job_files {
        if ($request_method != POST) {
            rewrite "" /api/jobs/$arg_job_id/files last;
        }
        upload_store /galaxy-repl/main/upload_job_files;
        upload_store_access user:rw;
        upload_pass_form_field "";
        upload_set_form_field "__${upload_field_name}_path" "$upload_tmp_path";
        upload_pass_args on;
        upload_pass /_upload_job_files_done;
    }
    
    location /_upload_job_files_done {
        internal;
        rewrite "" /api/jobs/$arg_job_id/files;
    }

    In order to ensure no one uses the job handlers for standard web requests (unlikely, I know), I also redirect any non-Pulsar-staging requests to usegalaxy.org. See my nginx staging server config for reference.
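
    For reference, the X-Accel-Redirect piece (the part that is the same on both host types) typically looks something like the following sketch, modeled on the Galaxy docs, with nginx_x_accel_redirect_base in the galaxy section of galaxy.yml pointing at the matching location:

    # in galaxy.yml: nginx_x_accel_redirect_base: /_x_accel_redirect
    location /_x_accel_redirect {
        internal;
        alias /;
    }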

  3. Configure Galaxy for job staging

    In the galaxy section of galaxy.yml, you will need to set nginx_upload_job_files_store, ideally to a directory on the same filesystem as your datasets, and nginx_upload_job_files_path to the upload module path (location) in the previous step, /_job_files in our case.
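
    For example (the store path below is the same one used as the nginx upload_store in step 2):

    galaxy:
      # ideally on the same filesystem as your datasets
      nginx_upload_job_files_store: /galaxy-repl/main/upload_job_files
      # must match the nginx upload module location from step 2
      nginx_upload_job_files_path: /_job_files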

    Next, in your job_conf.xml, for any AMQP-based Pulsar plugins, set the galaxy_url param to the URL of your job handler(s). In my case, I use the hostname of the handler that picked the job up, but you could use a load balancer, DNS round-robin, or hardcode a single job handler host here:

    <param id="galaxy_url">https://{{ inventory_hostname_short }}.galaxyproject.org</param>
  4. Configure uWSGI startup

    Currently we're doing this with supervisord, simply because I haven't had the time to switch to systemd, but systemd is probably preferable these days and there is a nice Ansible role for it. One consideration is that I don't have a great method for doing Zerg Mode under systemd (here's the hacky solution I have so far: you can systemctl reload galaxy to initiate the Zerg Mode restart, but the reload will "time out" despite succeeding). I think @hexylena has a different method for handling zergling restarts under systemd.

    Anyway, however you do it, you just need to start uWSGI the standard way, effectively:

    $ cd /srv/galaxy/server
    $ /srv/galaxy/venv/bin/uwsgi --yaml /srv/galaxy/config/galaxy.yml

    This is done in my supervisor config using the directory and command options, and it is the same on both the web and job hosts.
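
    A minimal supervisord program for this might look like the following sketch (the program name and user are assumptions; the paths are the ones used above). stopsignal is set to INT because uWSGI treats SIGTERM as a reload rather than a shutdown:

    [program:galaxy_uwsgi]
    command         = /srv/galaxy/venv/bin/uwsgi --yaml /srv/galaxy/config/galaxy.yml
    directory       = /srv/galaxy/server
    user            = galaxy
    autostart       = true
    autorestart     = true
    stopsignal      = INT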

  5. Configure job handler startup

    Dynamic job handlers are started "webless" (aka just the Galaxy application, without a web stack) and instructed to attach to a "pool". What this means is that they will (using the locking scheme described above) poll the database for new jobs to "grab". When the DB-SKIP-LOCKED assignment method is configured, new jobs will be created with their state set to new and their handler set to _default_. Job handlers should be attached to the "pool" named job-handlers in order to find these jobs.

    The "webless" entry point is the galaxy-main script, and the important options are --server-name and --attach-to-pool. When a handler grabs a job, it does so by updating the handler column of any grabbed jobs to the server name. After this they can be picked up by the regular job loop, which looks for jobs in state new and handler the same as the server_name. The server_name must be unique and is responsible for that job from start to finish, so if that handler is shut down before all its assigned jobs are terminal, those jobs will never complete as described in the warning here.

    I believe that the grabber filter query will include:

    • Any tags on the defined <handler> XML elements whose id matches the handler's server_name
    • Any pools defined in --attach-to-pool=job-handlers[.<pool>] (multiple --attach-to-pool options can be specified)

    If started with just --attach-to-pool=job-handlers without a .<pool>, the handler will attach to the _default_ pool.

    Our format for the server_name, since we have multiple job hosts, is main_{hostname_alias}_handler{instance_number} where hostname_alias is just a shorthand for the VM's hostname (e.g. w3 for galaxy-web-03) and instance_number is 0, 1, or 2. See the supervisord config for reference.
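
    As a sketch, a supervisord program for the handlers on one job host (using the "w3" alias as an example; the program name and user are assumptions) could use numprocs to start all three:

    [program:galaxy_handler]
    command         = /srv/galaxy/venv/bin/python3 /srv/galaxy/server/scripts/galaxy-main -c /srv/galaxy/config/galaxy.yml --server-name=main_w3_handler%(process_num)s --attach-to-pool=job-handlers
    directory       = /srv/galaxy/server
    process_name    = main_w3_handler%(process_num)s
    numprocs        = 3
    user            = galaxy
    autostart       = true
    autorestart     = true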

    In the case of usegalaxy.org, mainly for legacy reasons (I only realized it was still the case when I started writing this HOWTO!), we have called the default handler pool handlers by setting default="handlers" on the <handlers> tag and tags="handlers" on the individual handler definitions. That means that in our case, new jobs are created with state new and handler handlers, which is consequently the same thing the "job grabber" query filters on. I say all of this just to give some insight into how job grabbing works. Don't set a default handler and this should be irrelevant to you.
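
    To make the grabbing mechanism a bit more concrete, the DB-SKIP-LOCKED idea boils down to something like the following PostgreSQL pattern. This is a conceptual sketch, not Galaxy's actual query; it just shows how SKIP LOCKED lets multiple handlers claim disjoint sets of new jobs without blocking each other:

    BEGIN;
    -- claim up to max_grab (8) unassigned jobs, skipping rows another handler
    -- has already locked, and record this handler's server_name on them
    UPDATE job SET handler = 'main_w3_handler0'
    WHERE id IN (
        SELECT id FROM job
        WHERE state = 'new' AND handler = '_default_'
        ORDER BY id
        LIMIT 8
        FOR UPDATE SKIP LOCKED
    );
    COMMIT;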

    tl;dr, if you want to start 2 handlers attached to the _default_ pool, you would define no individual <handler> tags in job_conf.xml as described in step 1, then start the handlers with cwd in /srv/galaxy/server with:

    $ /srv/galaxy/venv/bin/python3 /srv/galaxy/server/scripts/galaxy-main -c /srv/galaxy/config/galaxy.yml --server-name=handler0 --attach-to-pool=job-handlers
    $ /srv/galaxy/venv/bin/python3 /srv/galaxy/server/scripts/galaxy-main -c /srv/galaxy/config/galaxy.yml --server-name=handler1 --attach-to-pool=job-handlers

Mapping Tools

If, like me, you want to map certain tools to different handlers, don't use my complicated config. I believe you could do all of that with a job_conf.xml like:

<?xml version="1.0"?>
<job_conf>
    <plugins>
        <!-- usual stuff here ... -->
    </plugins>
    <handlers assign_with="db-skip-locked" max_grab="8" />
    <destinations>
        <!-- usual stuff here ... -->
    </destinations>
    <tools>
        <tool id="bwa" handler="mappers"/>
        <tool id="hisat2" handler="mappers"/>
    </tools>
</job_conf>

And then start your handlers (if you want, say, 2 to handle default jobs and 2 to handle mapper jobs) like:

# start 2 handlers grabbing from '_default_' pool
$ /srv/galaxy/venv/bin/python3 /srv/galaxy/server/scripts/galaxy-main -c /srv/galaxy/config/galaxy.yml --server-name=handler_default_0 --attach-to-pool=job-handlers
$ /srv/galaxy/venv/bin/python3 /srv/galaxy/server/scripts/galaxy-main -c /srv/galaxy/config/galaxy.yml --server-name=handler_default_1 --attach-to-pool=job-handlers

# start 2 handlers grabbing from 'mappers' pool
$ /srv/galaxy/venv/bin/python3 /srv/galaxy/server/scripts/galaxy-main -c /srv/galaxy/config/galaxy.yml --server-name=handler_mappers_0 --attach-to-pool=job-handlers.mappers
$ /srv/galaxy/venv/bin/python3 /srv/galaxy/server/scripts/galaxy-main -c /srv/galaxy/config/galaxy.yml --server-name=handler_mappers_1 --attach-to-pool=job-handlers.mappers