There are two ways to parse samples with MetaSnek: you can supply a directory containing your reads, or you can supply a tab-separated values (TSV) file listing your sample names and file paths.
If given a directory, read files will be identified with the pattern r".(fasta|fastq|fq)(.gz)?$", i.e. fasta or fastq files, optionally gzipped.
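To make the file matching concrete, here is a minimal sketch of filtering file names by extension. It is not MetaSnek's own code: `READ_FILE_PATTERN` and `is_read_file` are hypothetical names, and the dots are escaped on the assumption that the documented pattern means them literally.

```python
import re

# Assumption: the dots in the documented pattern are literal, so they are escaped here.
READ_FILE_PATTERN = re.compile(r"\.(fasta|fastq|fq)(\.gz)?$")


def is_read_file(file_name: str) -> bool:
    """Return True if the file name ends in a fasta/fastq extension, optionally gzipped."""
    return READ_FILE_PATTERN.search(file_name) is not None


for name in ["sample1_1.fastq.gz", "sample3.fasta.gz", "sample2_S.fq", "notes.txt"]:
    print(name, is_read_file(name))
```

Anchoring the pattern with `$` means a file such as `reads.fastq.bz2` is not picked up, because the recognised extension has to be the end of the name.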
Sample names will be derived from the read file name.
The sample name consists of everything up to the first match for the R1 and R2 tags (and optionally the S tag for singleton reads) in the file name (not including the file path), or everything up to the file extension for single-end reads or orphaned paired reads.
Supported R1/R2/S tags are: _R1_/_R2_/_S_, _R1./_R2./_S., .R1./.R2./.S., .R1_/.R2_/.S_, _1_/_2_/_S_, _1./_2./_S., .1./.2./.S., and .1_/.2_/.S_.
The module checks for any orphaned R1 or R2 files and raises a warning if found.
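The naming rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the module's implementation: `TAG_PATTERN`, `EXTENSION_PATTERN`, `split_sample_name` and `find_orphans` are hypothetical names, the tag list is condensed into one regular expression, and the sketch only reports orphans rather than renaming them.

```python
import re
import warnings

# Assumption: every supported tag is a delimiter (. or _), then R1/R2/1/2/S, then a delimiter.
TAG_PATTERN = re.compile(r"[._](R?[12]|S)[._]")
# Assumption: literal dots in the extension pattern.
EXTENSION_PATTERN = re.compile(r"\.(fasta|fastq|fq)(\.gz)?$")


def split_sample_name(file_name):
    """Return (sample_name, slot); slot is 'R1', 'R2', 'S', or None for untagged files."""
    match = TAG_PATTERN.search(file_name)
    if match is None:
        # No tag found: the sample name is everything up to the file extension.
        return EXTENSION_PATTERN.sub("", file_name), None
    tag = match.group(1)
    slot = "S" if tag == "S" else "R" + tag[-1]
    # The sample name is everything before the first tag match.
    return file_name[: match.start()], slot


def find_orphans(file_names):
    """Return sample names that have exactly one of R1/R2, warning about each."""
    slots_by_sample = {}
    for file_name in file_names:
        sample, slot = split_sample_name(file_name)
        slots_by_sample.setdefault(sample, set()).add(slot)
    orphans = sorted(
        sample
        for sample, slots in slots_by_sample.items()
        if ("R1" in slots) != ("R2" in slots)
    )
    for sample in orphans:
        warnings.warn(f"orphaned paired read file for sample {sample!r}")
    return orphans


print(split_sample_name("sample1_1.fastq.gz"))
print(split_sample_name("sample3.fasta.gz"))
```

Using `search` rather than `match` lets the tag appear anywhere in the file name, and slicing at `match.start()` keeps only the part before the first tag.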
If given a TSV file, the module will expect 2-4 columns, tab-separated:
Column | Value |
---|---|
1 | Sample name |
2 | Filepath (reads or R1 reads if paired) |
3 | Optional: filepath for R2 reads if paired |
4 | Optional: filepath for singleton reads |
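As a rough illustration of this column layout, the sketch below reads such a file with Python's standard `csv` module. It is not MetaSnek's parser: `read_samples_tsv` is a hypothetical helper, and the `'R1'`/`'R2'`/`'S'` keys are an assumption about how the columns might be stored.

```python
import csv


def read_samples_tsv(tsv_path):
    """Map each sample name to a dict of its R1, R2 and S file paths (None if absent)."""
    samples = {}
    with open(tsv_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            # Skip blank lines.
            if not row or not row[0].strip():
                continue
            if not 2 <= len(row) <= 4:
                raise ValueError(f"expected 2-4 columns, got {len(row)}: {row!r}")
            name, *paths = row
            # Pad the missing optional columns so every sample has three slots.
            paths += [None] * (3 - len(paths))
            samples[name] = dict(zip(("R1", "R2", "S"), (p or None for p in paths)))
    return samples
```

Padding the optional columns with `None` means downstream code can test `entry["R2"]` without first checking how many columns a row had.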
For example, consider the following directory of reads:

reads
├── sample1_1.fastq.gz
├── sample1_2.fastq.gz
├── sample2_R1.fastq
├── sample2_R2.fastq
├── sample2_S.fastq
└── sample3.fasta.gz
sample1	reads/sample1_1.fastq.gz	reads/sample1_2.fastq.gz
sample2	reads/sample2_R1.fastq	reads/sample2_R2.fastq	reads/sample2_S.fastq
sample3	reads/sample3.fasta.gz
{
'sample1':
{
'R1': 'reads/sample1_1.fastq.gz',
'R2': 'reads/sample1_2.fastq.gz',
'S': None
},
'sample2':
{
'R1': 'reads/sample2_R1.fastq',
'R2': 'reads/sample2_R2.fastq',
'S': 'reads/sample2_S.fastq',
},
'sample3':
{
'R1': 'reads/sample3.fasta.gz',
'R2': None,
'S': None
}
}
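from metasnek import fastq_finder

# Parse samples from a directory of reads...
samples = fastq_finder.parse_samples_to_dictionary("reads")
# ...or from a TSV file of sample names and file paths
samples = fastq_finder.parse_samples_to_dictionary("samples.tsv")
samples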
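Once you have a dictionary of this shape, downstream code typically branches on which slots are filled. The sketch below assumes the `'R1'`/`'R2'`/`'S'` layout shown above and uses a hand-written dictionary in place of a real MetaSnek call; `describe_layout` is a hypothetical helper, not part of the package.

```python
def describe_layout(entry):
    """Classify one sample entry by which of its 'R1'/'R2'/'S' paths are set."""
    if entry.get("R2"):
        return "paired+singletons" if entry.get("S") else "paired"
    return "single"


# Hand-written stand-in for the parsed samples dictionary.
samples = {
    "sample1": {"R1": "reads/sample1_1.fastq.gz", "R2": "reads/sample1_2.fastq.gz", "S": None},
    "sample2": {"R1": "reads/sample2_R1.fastq", "R2": "reads/sample2_R2.fastq", "S": "reads/sample2_S.fastq"},
    "sample3": {"R1": "reads/sample3.fasta.gz", "R2": None, "S": None},
}

for name, entry in samples.items():
    print(name, describe_layout(entry))
```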