@beardymcjohnface
Last active October 16, 2023 01:00
MetaSnek fastq_finder sample parsing

Parsing sample reads

There are two ways to parse samples with MetaSnek: you can supply a directory containing your reads, or a Tab-Separated Values (TSV) file with your sample names and file paths.

Samples from reads directory

If given a directory, read files are identified like so: pattern = r".(fasta|fastq|fq)(.gz)?$", i.e. fasta or fastq files, optionally gzipped. Sample names are derived from the read file name. The sample name is everything up to the first match for an R1 or R2 tag (or, optionally, an S tag for singleton reads) in the file name (not including the file path), or everything up to the file extension for single-end reads or orphaned paired reads. Supported R1/R2/S tags are: _R1_/_R2_/_S_, _R1./_R2./_S., .R1./.R2./.S., .R1_/.R2_/.S_, _1_/_2_/_S_, _1./_2./_S., .1./.2./.S., and .1_/.2_/.S_. The module checks for orphaned R1 or R2 files and raises a warning if any are found.
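The naming rule above can be sketched roughly as follows. This is an illustration only, not MetaSnek's actual code: `EXT`, `TAG`, and `split_name` are hypothetical names, the dots in the extension pattern are escaped here on the assumption that literal dots are intended, and orphaned-read handling is left out.

```python
import re

# Assumed patterns that mirror the description above (not taken from MetaSnek's source)
EXT = re.compile(r"\.(fasta|fastq|fq)(\.gz)?$")
TAG = re.compile(r"[._](R1|R2|1|2|S)[._]")


def split_name(filename):
    """Return (sample_name, slot) for a reads file, or None if it is not a reads file.

    slot is one of 'R1', 'R2', or 'S'.
    """
    if not EXT.search(filename):
        return None
    match = TAG.search(filename)
    if match is None:
        # No read tag: treat as single-end, name is everything before the extension
        return EXT.sub("", filename), "R1"
    tag = match.group(1)
    # Normalise the bare 1/2 tags to R1/R2
    slot = {"1": "R1", "2": "R2"}.get(tag, tag)
    # Sample name is everything before the first tag match
    return filename[:match.start()], slot


for fn in ["sample1_1.fastq.gz", "sample2_R2.fastq", "sample2_S.fastq", "sample3.fasta.gz"]:
    print(fn, split_name(fn))
```

Because the name is cut at the first tag match, a sample whose own name contains something like `_1_` would be truncated early in this sketch; supplying a TSV file avoids that ambiguity.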

Samples from TSV file

If given a TSV file, the module will expect 2-4 columns, tab-separated:

Column Value
1 Sample name
2 Filepath (reads or R1 reads if paired)
3 Optional: filepath for R2 reads if paired
4 Optional: filepath for singleton reads
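As a rough sketch of how the column layout above maps onto one dictionary entry per sample, the following reads a TSV with the standard `csv` module. The function name `parse_tsv` and the error handling are assumptions made for illustration; the module's real behaviour may differ.

```python
import csv
import io


def parse_tsv(handle):
    """Hypothetical helper: read 2-4 tab-separated columns into a samples dictionary."""
    samples = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if not 2 <= len(row) <= 4:
            raise ValueError(f"expected 2-4 columns, got {len(row)}: {row}")
        # Pad the optional columns so every sample has R1, R2 and S keys
        name, r1, r2, s = (row + [None] * 4)[:4]
        samples[name] = {"R1": r1, "R2": r2 or None, "S": s or None}
    return samples


example = (
    "sample1\treads/sample1_1.fastq.gz\treads/sample1_2.fastq.gz\n"
    "sample3\treads/sample3.fasta.gz\n"
)
print(parse_tsv(io.StringIO(example)))
```

Empty optional columns are stored as `None` here so that paired and single-end samples share the same set of keys.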

Example directory

reads
  ├── sample1_1.fastq.gz
  ├── sample1_2.fastq.gz
  ├── sample2_R1.fastq
  ├── sample2_R2.fastq
  ├── sample2_S.fastq
  └── sample3.fasta.gz

Example TSV

sample1	reads/sample1_1.fastq.gz	reads/sample1_2.fastq.gz
sample2	reads/sample2_R1.fastq	reads/sample2_R2.fastq	reads/sample2_S.fastq
sample3	reads/sample3.fasta.gz

Example result supplied to the pipeline

{
'sample1': 
    {
    'R1': 'reads/sample1_1.fastq.gz', 
    'R2': 'reads/sample1_2.fastq.gz',
    'S': None
    }, 
'sample2': 
    {
    'R1': 'reads/sample2_R1.fastq', 
    'R2': 'reads/sample2_R2.fastq',
    'S': 'reads/sample2_S.fastq'
    },
'sample3': 
    {
    'R1': 'reads/sample3.fasta.gz', 
    'R2': None,
    'S': None
    }
}
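A dictionary in this shape is straightforward to consume downstream. The sketch below assumes only the R1/R2/S keys shown in the example; `read_layout` and `input_files` are hypothetical helpers, not part of MetaSnek.

```python
# Example dictionary in the shape shown above (values copied from the example)
samples = {
    "sample1": {"R1": "reads/sample1_1.fastq.gz", "R2": "reads/sample1_2.fastq.gz", "S": None},
    "sample2": {"R1": "reads/sample2_R1.fastq", "R2": "reads/sample2_R2.fastq", "S": "reads/sample2_S.fastq"},
    "sample3": {"R1": "reads/sample3.fasta.gz", "R2": None, "S": None},
}


def read_layout(entry):
    """Hypothetical helper: label a sample as paired-end or single-end."""
    return "paired" if entry["R2"] is not None else "single"


def input_files(entry):
    """Hypothetical helper: list only the files that are present for a sample."""
    return [path for path in (entry["R1"], entry["R2"], entry["S"]) if path is not None]


for name, entry in samples.items():
    print(name, read_layout(entry), input_files(entry))
```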

To use this module in your own pipelines

from metasnek import fastq_finder

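# Option 1: parse samples from a directory of reads
samples = fastq_finder.parse_samples_to_dictionary("reads")

# Option 2: parse samples from a TSV file of sample names and file paths
samples = fastq_finder.parse_samples_to_dictionary("samples.tsv")

# Inspect the resulting dictionary
print(samples)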