natefoo/tutorial.md

## tutorial.md

      
Reference Genomes - Exercise

Adapted from Oslo training
Learning Outcomes

By the end of this tutorial, you should:

Have an understanding of the way in which Galaxy stores and uses reference data
Be able to download and use data managers to add a reference genome and its pre-calculated indices into the Galaxy reference data system

Data managers 101

The problem
Without Data Managers, the Galaxy server administrator needed to know how to update each type of reference data, how to run the index builds, and where to get the data from.
Data managers to the rescue
Data Managers are a special class of Galaxy tool which allows for the download and/or creation of data that is stored within Tool Data Tables and their underlying flat (e.g. .loc) files. These tools handle the creation of indices and the addition of entries/lines to the data table / .loc file via the Galaxy admin interface.
Data Managers can be defined locally or installed through the Tool Shed.
They are a flexible framework for adding reference data to Galaxy (not just genomic data). They are workflow compatible and can run via the Galaxy API.
More detailed background information on data managers can be found at: https://galaxyproject.org/admin/tools/data-managers/ (A summary of which appears below.)
Details on how to define a data manager for a tool can be found here: https://galaxyproject.org/admin/tools/data-managers/how-to/define/
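To make the idea of a flat .loc file concrete, here is a minimal sketch of reading one in Python. It assumes the tab-separated layout and `#` comment character that the all_fasta table uses later in this tutorial. The function name `parse_loc` and the sample path are illustrative only and are not part of Galaxy.

```python
# Column order as declared for the all_fasta data table in this tutorial.
ALL_FASTA_COLUMNS = ["value", "dbkey", "name", "path"]


def parse_loc(text, columns, comment_char="#"):
    """Parse tab-separated .loc content into a list of dicts, one per entry."""
    entries = []
    for line in text.splitlines():
        # Skip blank lines and comment lines.
        if not line.strip() or line.startswith(comment_char):
            continue
        fields = line.split("\t")
        entries.append(dict(zip(columns, fields)))
    return entries


# Hypothetical file content; the path below is made up for illustration.
sample = (
    "# value\tdbkey\tname\tpath\n"
    "sacCer2\tsacCer2\tS. cerevisiae June 2008 (SGD/sacCer2) (sacCer2)"
    "\t/srv/galaxy/tool-data/sacCer2/seq/sacCer2.fa\n"
)

entries = parse_loc(sample, ALL_FASTA_COLUMNS)
print(entries[0]["dbkey"], entries[0]["path"])
```

A Data Manager writes lines of this shape for you, which is why you normally never need to edit the file by hand.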
Using Data Managers
Data Managers are composed of two components:

Data Manager configuration (e.g. data_manager_conf.xml)
Data Manager tool

Data Manager Configuration
We need to tell Galaxy where to find the Data Managers and their configuration.
In your group_vars/galaxyservers.yml:


Comment out the existing definition of tool_data_table_config_path


Add:
galaxy_config:
  galaxy:
    enable_data_manager_user_view: "True"
    galaxy_data_manager_data_path: "{{ galaxy_mutable_data_dir }}/tool-data"


Where:

enable_data_manager_user_view allows non-admin users to view the available data that has been managed.
galaxy_data_manager_data_path defines the location to use for storing the files created by Data Managers. When not configured it defaults to the value of tool_data_path.
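The fallback described above can be sketched in a few lines of Python. This is only an illustration of the rule, not Galaxy's configuration loader: the function name `data_manager_data_path` and the plain dict standing in for the config are assumptions.

```python
def data_manager_data_path(galaxy_config):
    """Return the directory Data Managers write to.

    Uses galaxy_data_manager_data_path when it is set, otherwise falls back
    to tool_data_path, as described in the text.
    """
    explicit = galaxy_config.get("galaxy_data_manager_data_path")
    if explicit:
        return explicit
    return galaxy_config["tool_data_path"]


# Hypothetical configurations; the paths are examples only.
configured = {
    "tool_data_path": "/srv/galaxy/server/tool-data",
    "galaxy_data_manager_data_path": "/srv/galaxy/var/tool-data",
}
unconfigured = {"tool_data_path": "/srv/galaxy/server/tool-data"}

print(data_manager_data_path(configured))
print(data_manager_data_path(unconfigured))
```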

Details on Data Manager Tools and their definition can be found at: https://galaxyproject.org/admin/tools/data-managers/how-to/define/
Exercise: Install a DataManager from the ToolShed

In this exercise we will install a data manager that can fetch the various genome sequences from multiple sources, as well as the bwa index data manager from the Galaxy toolshed.
Part 1: Install a data manager to download reference genome sequences
Make sure you are logged in as an Admin user on your Galaxy server. Then, from the Galaxy Admin page:


Install data_manager_fetch_genome_dbkeys_all_fasta from Galaxy main tool shed

Click Search Tool Shed
Search for "fetch"
Install the data_manager_fetch_genome_dbkeys_all_fasta data manager.


Look in the file system to see where the various elements landed. Have a look at the configuration files located in the config directory.


config/shed_data_manager_conf.xml
<?xml version="1.0"?>
<data_managers>
    <data_manager guid="toolshed.g2.bx.psu.edu/repos/devteam/data_manager_fetch_genome_dbkeys_all_fasta/data_manager/fetch_genome_all_fasta_dbkeys/0.0.1" id="fetch_genome_all_fasta_dbkeys" shed_conf_file="./config/shed_tool_conf.xml">
        <tool file="toolshed.g2.bx.psu.edu/repos/devteam/data_manager_fetch_genome_dbkeys_all_fasta/b1bc53e9bbc5/data_manager_fetch_genome_dbkeys_all_fasta/data_manager/data_manager_fetch_genome_all_fasta_dbkeys.xml" guid="toolshed.g2.bx.psu.edu/repos/devteam/data_manager_fetch_genome_dbkeys_all_fasta/data_manager_fetch_genome_all_fasta_dbkey/0.0.2"><tool_shed>toolshed.g2.bx.psu.edu</tool_shed><repository_name>data_manager_fetch_genome_dbkeys_all_fasta</repository_name><repository_owner>devteam</repository_owner><installed_changeset_revision>b1bc53e9bbc5</installed_changeset_revision><id>toolshed.g2.bx.psu.edu/repos/devteam/data_manager_fetch_genome_dbkeys_all_fasta/data_manager_fetch_genome_all_fasta_dbkey/0.0.2</id><version>0.0.2</version></tool><data_table name="all_fasta">
            <output>
                <column name="value" />
                <column name="dbkey" />
                <column name="name" />
                <column name="path" output_ref="out_file">
                    <move type="file">
                        <source>${path}</source>
                        <target base="${GALAXY_DATA_MANAGER_DATA_PATH}">${dbkey}/seq/${path}</target>
                    </move>
                    <value_translation>${GALAXY_DATA_MANAGER_DATA_PATH}/${dbkey}/seq/${path}</value_translation>
                    <value_translation type="function">abspath</value_translation>
                </column>
            </output>
        </data_table>
        <data_table name="__dbkeys__">
            <output>
                <column name="value" />
                <column name="name" />
                <column name="len_path" output_ref="out_file">
                    <move type="file">
                        <source>${len_path}</source>
                        <target base="${GALAXY_DATA_MANAGER_DATA_PATH}">${value}/len/${len_path}</target>
                    </move>
                    <value_translation>${GALAXY_DATA_MANAGER_DATA_PATH}/${value}/len/${len_path}</value_translation>
                    <value_translation type="function">abspath</value_translation>
                </column>
            </output>
        </data_table>
    </data_manager>


</data_managers>
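The `<target>` and `<value_translation>` elements above are templates that get filled in with the column values of each new entry. The sketch below uses Python's `string.Template` to show how such a template expands. It is an illustration only and does not claim to be the templating code Galaxy uses internally; the base directory and the file name `sacCer2.fa` are assumed example values.

```python
from string import Template

# Templates copied from the all_fasta <move> and <value_translation> elements.
target = Template("${dbkey}/seq/${path}")
translation = Template("${GALAXY_DATA_MANAGER_DATA_PATH}/${dbkey}/seq/${path}")

values = {
    # Assumed value of galaxy_data_manager_data_path for this example.
    "GALAXY_DATA_MANAGER_DATA_PATH": "/srv/galaxy/tool-data",
    "dbkey": "sacCer2",
    # Assumed name of the downloaded FASTA file.
    "path": "sacCer2.fa",
}

# Where the file is moved, relative to the base directory.
relative_target = target.substitute(values)
# The path that ends up in the "path" column of the data table.
final_path = translation.substitute(values)

print(relative_target)
print(final_path)
```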
shed_tool_data_table_conf.xml
<?xml version="1.0"?>
<tables>
<table comment_char="#" name="all_fasta">
        <columns>value, dbkey, name, path</columns>
        <file path="/srv/galaxy/tool-data/toolshed.g2.bx.psu.edu/repos/devteam/data_manager_fetch_genome_dbkeys_all_fasta/b1bc53e9bbc5/all_fasta.loc" />
        <tool_shed_repository>
            <tool_shed>toolshed.g2.bx.psu.edu</tool_shed>
            <repository_name>data_manager_fetch_genome_dbkeys_all_fasta</repository_name>
            <repository_owner>devteam</repository_owner>
            <installed_changeset_revision>b1bc53e9bbc5</installed_changeset_revision>
            </tool_shed_repository>
    </table>
<table comment_char="#" name="__dbkeys__">
        <columns>value, name, len_path</columns>
        <file path="/srv/galaxy/tool-data/toolshed.g2.bx.psu.edu/repos/devteam/data_manager_fetch_genome_dbkeys_all_fasta/b1bc53e9bbc5/dbkeys.loc" />
        <tool_shed_repository>
            <tool_shed>toolshed.g2.bx.psu.edu</tool_shed>
            <repository_name>data_manager_fetch_genome_dbkeys_all_fasta</repository_name>
            <repository_owner>devteam</repository_owner>
            <installed_changeset_revision>b1bc53e9bbc5</installed_changeset_revision>
            </tool_shed_repository>
    </table>
</tables>
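If you want to inspect a data table configuration like the one above without reading the XML by eye, a short script can list each table with its columns and .loc file. The following is a minimal sketch using only the Python standard library. The embedded sample is a trimmed-down stand-in for the real file, and its .loc paths are hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample in the same shape as shed_tool_data_table_conf.xml.
# The .loc paths are hypothetical.
SAMPLE = """<?xml version="1.0"?>
<tables>
    <table comment_char="#" name="all_fasta">
        <columns>value, dbkey, name, path</columns>
        <file path="/srv/galaxy/tool-data/example/all_fasta.loc" />
    </table>
    <table comment_char="#" name="__dbkeys__">
        <columns>value, name, len_path</columns>
        <file path="/srv/galaxy/tool-data/example/dbkeys.loc" />
    </table>
</tables>
"""


def summarize_tables(xml_text):
    """Map each data table name to its column list and .loc file path."""
    root = ET.fromstring(xml_text)
    summary = {}
    for table in root.findall("table"):
        columns = [c.strip() for c in table.findtext("columns").split(",")]
        loc_file = table.find("file").get("path")
        summary[table.get("name")] = {"columns": columns, "loc_file": loc_file}
    return summary


summary = summarize_tables(SAMPLE)
for name, info in summary.items():
    print(name, info["columns"], info["loc_file"])
```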
Part 2: Download and install a reference genome sequence
Use the Galaxy Admin page and the data_manager_fetch_genome_all_fasta_dbkey data manager to install some reference data. We will grab sacCer2 (version 2 of the Saccharomyces cerevisiae genome).
From the Galaxy Admin page:

Click on Local data

You should see something like this:


Click on all_fasta under View Tool Data Table Entries

You should see the current contents of tool-data/all_fasta.loc, which will be empty.

Under Run Data Manager Tools, click Create DBKey and Reference Genome - fetching. The Reference Genome tool form from data_manager_fetch_genome_all_fasta_dbkey is displayed. NOTE: If you receive the error "Uncaught exception in exposed API method:", you will need to restart Galaxy first.

From the DBKEY to assign to data: list choose: sacCer2
Enter S. cerevisiae June 2008 (SGD/sacCer2) (sacCer2) for the Name of sequence field
Leave the ID for sequence field empty
Click Execute


In your history, you will see a new dataset for the data manager run. When the job has finished, go back to the Data Manager view on the Galaxy Admin page (click Local data).

Click on all_fasta under View Tool Data Table Entries

You should see that sacCer2 has been added to all_fasta.

Part 3: Download and install the BWA data manager
In this part we will repeat the process from Part 1, except that this time we will install the bwa data manager.

Install the bwa data manager from the toolshed.

From the Admin page, click Search Toolsheds and then search for bwa.
Install the data_manager_bwa_mem_index_builder by the devteam.


Part 4: Build the BWA index for sacCer2
In this part we will actually build the BWA index for sacCer2. It will automatically be added to our list of available reference genomes in the BWA tool.

From the Galaxy Admin page, click Local data
Click on BWA-MEM index - builder under Run Data Manager Tools

Select sacCer2 for Source Fasta Sequence
Put sacCer2 into the other two blank fields.
Click Execute. NOTE: If you receive the error "Parameter all_fasta_source requires a value, but has no legal values defined.", you will need to restart Galaxy first.


The new BWA index for sacCer2 will now be built and the .loc file will be filled in.
To check:

From the Galaxy Admin page -> Local data, click on the bwa mem indexes under View Tool Data Table Entries

S. cerevisiae sacCer2 should now appear in the list!
Part 5: Run BWA with the new reference data!
Now we will run the BWA tool and check to see if the reference data is listed and the tool works with it!

Run the BWA tool on the 2 fastq files we loaded earlier, using sacCer2 as the reference.

How cool is that? No editing .loc files, no making sure you've got TABs instead of spaces. Fully auto!
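For comparison, this is the kind of check you would otherwise need when editing .loc files by hand. The sketch below flags lines whose tab-separated field count does not match the table's column count, which is what happens when spaces are typed instead of TABs. The function name `find_malformed_lines` and the sample entries are illustrative assumptions.

```python
def find_malformed_lines(text, expected_columns, comment_char="#"):
    """Return the 1-based numbers of lines with the wrong number of TAB-separated fields."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Blank lines and comments are not entries, so skip them.
        if not line.strip() or line.startswith(comment_char):
            continue
        if len(line.split("\t")) != expected_columns:
            problems.append(number)
    return problems


# Hypothetical entries: the first uses TABs, the second uses spaces by mistake.
good = "sacCer2\tsacCer2\tS. cerevisiae (sacCer2)\t/data/sacCer2.fa"
bad = "sacCer2 sacCer2 S. cerevisiae (sacCer2) /data/sacCer2.fa"

bad_lines = find_malformed_lines(good + "\n" + bad + "\n", expected_columns=4)
print(bad_lines)
```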
So, what did we learn?

Hopefully, you now understand:

how Galaxy stores and uses its reference data,
how to manually add a reference genome and tool indices if required,
and how to use data managers to make all of this much much easier.

Further reading

If you want to know more about data managers including how to write a data manager tool, details can be found at: https://galaxyproject.org/admin/tools/data-managers/
Suggestions and comments are welcome.
Addendum: Installing and Running a DM with ephemeris

Create a config file for run-data-managers named fetch-sacCer3.yml:
data_managers:
    # Data manager ID
    - id: toolshed.g2.bx.psu.edu/repos/devteam/data_manager_fetch_genome_dbkeys_all_fasta/data_manager_fetch_genome_all_fasta_dbkey/0.0.4
      # tool parameters, nested parameters should be specified using a pipe (|)
      params:
        - 'dbkey_source|dbkey': '{{ item }}'
        - 'reference_source|reference_source_selector': 'ucsc'
        - 'reference_source|requested_dbkey': '{{ item }}'
      # Items refer to a list of values you want to run this data manager with. You can use them inside the params field with {{ item }}.
      # For genomes, for example, you can run this DM with multiple genomes, or you could give multiple URLs.
      items:
        - sacCer3
        #- dm3
        #- mm9
      # Names of the data tables you want to reload after your DMs have finished. This can be important for subsequent data managers
      data_table_reload:
        - all_fasta
        - __dbkeys__
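To see what the `{{ item }}` placeholder in the config above amounts to, here is a small sketch that expands the params once per entry in items. It uses plain string replacement purely for illustration and is not ephemeris's own implementation; the function name `expand_items` is hypothetical.

```python
def expand_items(params, items):
    """Build one flat parameter dict per item, filling in the {{ item }} placeholder."""
    runs = []
    for item in items:
        merged = {}
        # params is a list of single-key dicts, mirroring the YAML layout.
        for param in params:
            for key, value in param.items():
                merged[key] = value.replace("{{ item }}", item)
        runs.append(merged)
    return runs


# Same params and items as in fetch-sacCer3.yml.
params = [
    {"dbkey_source|dbkey": "{{ item }}"},
    {"reference_source|reference_source_selector": "ucsc"},
    {"reference_source|requested_dbkey": "{{ item }}"},
]

runs = expand_items(params, ["sacCer3"])
print(runs[0])
```

Uncommenting further entries under items would simply produce one more parameter set, and therefore one more data manager job, per entry.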
Install the fetch DM:
$ shed-tools install -g https://gat-0.student.galaxy.training --name data_manager_fetch_genome_dbkeys_all_fasta --owner devteam -a abbacadabba
Storing log file in: /tmp/ephemeris_2qpg_hrq
(1/1) repository data_manager_fetch_genome_dbkeys_all_fasta already installed at revision 4d3eff1bc421. Skipping.
Installed repositories (0): []
Skipped repositories (1): [('data_manager_fetch_genome_dbkeys_all_fasta', '4d3eff1bc421')]
Errored repositories (0): []
All repositories have been installed.
Total run time: 0:00:00.770248
Run the fetch DM:
$ run-data-managers --config fetch-sacCer3.yml -g https://gat-0.student.galaxy.training -a abbacadabba
Storing log file in: /tmp/ephemeris_kfsmjk2a
Running data managers that populate the following source data tables: ['all_fasta']
Dispatched job 1. Running DM: "toolshed.g2.bx.psu.edu/repos/devteam/data_manager_fetch_genome_dbkeys_all_fasta/data_manager_fetch_genome_all_fasta_dbkey/0.0.4" with parameters: {'dbkey_source|dbkey': 'sacCer3', 'reference_source|reference_source_selector': 'ucsc', 'reference_source|requested_dbkey': 'sacCer3'}
Job 1 finished with state ok.
Running data managers that index sequences.
Finished running data managers. Results:
Successful jobs: 1
Skipped jobs: 0
Failed jobs: 0