kozo2/basics.md

## basics.md

      
    Raw
  

              basics.md
            
          
    この記事はWorkflow Advent Calendar 2020の4日目の記事です。
4日に分けてSnakemake Tutorialを和訳します。
これはその4日のうちの第2日目です。
Basics: An example workflow

Please make sure that you have activated the environment we created before, and that you have an open terminal in the working directory you have created.
以前に作成した環境をアクティブ化したこと、加えて作成した作業ディレクトリでターミナルを開いていることを確認してください。
A Snakemake workflow is defined by specifying rules in a Snakefile.
Rules decompose the workflow into small steps (e.g., the application of a single tool) by specifying how to create sets of output files from sets of input files.
Snakemake automatically determines the dependencies between the rules by matching file names.
Snakemake のワークフローは、Snakefile 中でルールを指定することによって定義されます。
ルールは、入力ファイルのセットから出力ファイルのセットを作成する方法を指定することにより、ワークフローを小さなステップ (例えば、一つのツールだけから成るアプリケーション) に分解します。
Snakemakeは、ファイル名を照合することにより、ルール間の依存関係を自動的に判別します。
The Snakemake language extends the Python language, adding syntactic structures for rule definition and additional controls.
All added syntactic structures begin with a keyword followed by a code block that is either in the same line or indented and consisting of multiple lines.
The resulting syntax resembles that of original Python constructs.
Snakemake 言語は Python 言語を拡張し、ルール定義と追加のコントロールの構文構造を追加します。
追加されたすべての構文構造は、キーワードで始まり、同じ行にあるか、インデントされて複数の行で構成されるコードブロックが続きます。
結果の構文は、元の Python 構造の構文に似ています。
In the following, we will introduce the Snakemake syntax by creating an example workflow.
The workflow comes from the domain of genome analysis.
It maps sequencing reads to a reference genome and call variants on the mapped reads.
The tutorial does not require you to know what this is about.
Nevertheless, we provide some background in the following paragraph.
以下では、例のワークフローを作成することで Snakemake の構文を紹介します。
ワークフローは、ゲノム分析分野由来のものです。
シーケンシングリードをリファレンスゲノムにマップし、マップされたリードの変異を検出します。
チュートリアルでは、これが何であるかを知っている必要はありません。
とは言っても、次の段落でいくつかの背景を説明します。
Background

The genome of a living organism encodes its hereditary information.
It serves as a blueprint for proteins, which form living cells, carry information and drive chemical reactions.
Differences between populations, species, cancer cells and healthy tissue, as well as syndromes or diseases can be reflected and sometimes caused by changes in the genome.
This makes the genome a major target of biological and medical research.
Today, it is often analyzed with DNA sequencing, producing gigabytes of data from a single biological sample (e.g. a biopsy of some tissue).
For technical reasons, DNA sequencing cuts the DNA of a sample into millions of small pieces, called reads.
In order to recover the genome of the sample, one has to map these reads against a known reference genome (e.g., the human one obtained during the famous human genome project).
This task is called read mapping.
Often, it is of interest where an individual genome is different from the species-wide consensus represented with the reference genome.
Such differences are called variants.
They are responsible for harmless individual differences (like eye color), but can also cause diseases like cancer.
By investigating the differences between the all mapped reads and the reference sequence at one position, variants can be detected.
This is a statistical challenge, because they have to be distinguished from artifacts generated by the sequencing process.
生物のゲノムは、その遺伝情報をコード化しています。
これは、生体細胞を形成し、情報を伝達し、化学反応を促進するタンパク質の青写真として機能します。
集団、種、癌細胞、健康な組織の違い、さらには症候群や病気が、ゲノムに反映され、またその変化によって引き起こされることもあります。
これにより、ゲノムは生物学的および医学的研究の主要なターゲットになります。
現代では、それはしばしばDNAシーケンシングで分析され、単一の生物学的サンプル(例えば、いくつかの組織の生検)からギガバイトのデータを生成します。
技術的な理由から、DNAシーケンシングはサンプルのDNAをリードと呼ばれる数百万の小さな断片に切断します。
サンプルのゲノムを回復するには、これらの読み取りを既知のリファレンスゲノム（たとえば、有名なヒトゲノムプロジェクト中に取得されたヒトゲノム）に対してマッピングする必要があります。
Step 1: Mapping reads

Our first Snakemake rule maps reads of a given sample to a given reference genome (see Background).
For this, we will use the tool bwa, specifically the subcommand bwa mem.
In the working directory, create a new file called Snakefile with an editor of your choice.
We propose to use the Atom editor, since it provides out-of-the-box syntax highlighting for Snakemake.
In the Snakefile, define the following rule:
最初のSnakemakeルールは、与えられたサンプルの読み取りを与えられたリファレンスゲノムにマッピングします (背景を参照)。
このために、ツール bwa、特にサブコマンド bwa memを使用します。
作業ディレクトリに、好きなエディタを使用してSnakefileという新しいファイルを作成します。
Atomエディターを使うことを提案します。AtomはSnakemakeのためのすぐに使えるシンタックスハイライトを提供しているためです。
Snakefileで、次のルールを定義しましょう:
rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/A.fastq"
    output:
        "mapped_reads/A.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"

A Snakemake rule has a name (here bwa_map) and a number of directives, here input, output and shell.
The input and output directives are followed by lists of files that are expected to be used or created by the rule.
In the simplest case, these are just explicit Python strings.
The shell directive is followed by a Python string containing the shell command to execute.
In the shell command string, we can refer to elements of the rule via braces notation (similar to the Python format function).
Here, we refer to the output file by specifying {output} and to the input files by specifying {input}.
Since the rule has multiple input files, Snakemake will concatenate them separated by a whitespace.
In other words, Snakemake will replace {input} with data/genome.fa data/samples/A.fastq before executing the command.
The shell command invokes bwa mem with reference genome and reads, and pipes the output into samtools which creates a compressed BAM file containing the alignments.
The output of samtools is piped into the output file defined by the rule.
Snakemakeルールには、名前 (ここでは bwa_map) といくつかのディレクティブ、ここではinput、output、shell）があります。
When a workflow is executed, Snakemake tries to generate given target files.
Target files can be specified via the command line. By executing
$ snakemake -np mapped_reads/A.bam

in the working directory containing the Snakefile, we tell Snakemake to generate the target file mapped_reads/A.bam.
Since we used the -n (or --dry-run) flag, Snakemake will only show the execution plan instead of actually perform the steps.
The -p flag instructs Snakemake to also print the resulting shell command for illustration.
To generate the target files, Snakemake applies the rules given in the Snakefile in a top-down way.
The application of a rule to generate a set of output files is called job.
For each input file of a job, Snakemake again (i.e. recursively) determines rules that can be applied to generate it.
This yields a directed acyclic graph (DAG) of jobs where the edges represent dependencies.
So far, we only have a single rule, and the DAG of jobs consists of a single node.
Nevertheless, we can execute our workflow with
$ snakemake --cores 1 mapped_reads/A.bam

Whenever executing a workflow, you need to specify the number of cores to use.
For this tutorial, we will use a single core for now.
Later you will see how parallelization works.
Note that, after completion of above command, Snakemake will not try to create mapped_reads/A.bam again, because it is already present in the file system.
Snakemake only re-runs jobs if one of the input files is newer than one of the output files or one of the input files will be updated by another job.
Note

A common error is to forget the comma between the input or output items.
Since Python concatenates subsequent strings, this can lead to unexpected behavior.
一般的なエラーは、入力項目または出力項目の間のコンマを忘れることです。
Pythonは後続の文字列を連結するため、これにより予期しない動作が発生する可能性があります。
Step 2: Generalizing the read mapping rule

Obviously, the rule will only work for a single sample with reads in the file data/samples/A.fastq.
However, Snakemake allows to generalize rules by using named wildcards.
Simply replace the A in the second input file and in the output file with the wildcard {sample}, leading to
明らかに、ルールはファイル data/samples/A.fastq 内の読み取りがある単一のサンプルに対してのみ機能します。
ただし、Snakemakeでは、名前付きワイルドカードを使用してルールを一般化できます。
2番目の入力ファイルと出力ファイルのAをワイルドカード{sample}に置き換えるだけで、次のようになります。
rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"

Note

Note that if a rule has multiple output files, Snakemake requires them to all have exactly the same wildcards. Otherwise, it could happen that two jobs from the same rule want to write the same file.