beardymcjohnface/README.md

## README.md

      
Parsing sample reads

There are two ways to parse samples with MetaSnek: you can supply a directory containing your reads,
or you can supply a Tab-Separated Values (TSV) file with your sample names and file paths.
Samples from reads directory

If given a directory, reads files will be identified like so: pattern = r".(fasta|fastq|fq)(.gz)?$",
i.e. fasta or fastq files, optionally gzipped.
Sample names are derived from the read file name (not including the filepath).
For paired reads, the sample name is everything up to the first match for an R1 or R2 tag (or an S tag for singleton reads);
for single-end reads or orphaned paired reads, it is everything up to the file extension.
Supported R1/R2/S tags are: _R1_/_R2_/_S_, _R1./_R2./_S., .R1./.R2./.S., .R1_/.R2_/.S_, _1_/_2_/_S_, _1./_2./_S., .1./.2./.S., and .1_/.2_/.S_.
The module checks for any orphaned R1 or R2 files and raises a warning if any are found.
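
The directory rules above can be sketched in a few lines of Python. This is an illustration only, not MetaSnek's actual implementation. `READ_PATTERN` is the pattern as quoted above; `TAG_PATTERN`, `split_tag`, and `group_reads` are hypothetical names, and the tag regex is an assumed encoding of the supported tag list. For simplicity, the sketch only warns about orphaned paired reads and keeps their tag-derived sample name.

```python
import re
import warnings
from pathlib import Path

# Pattern copied as quoted in the text above.
READ_PATTERN = re.compile(r".(fasta|fastq|fq)(.gz)?$")
# Assumption: "_" or "." on either side of R1, R2, 1, 2, or S.
TAG_PATTERN = re.compile(r"[_.](R1|R2|1|2|S)[_.]")


def split_tag(file_name):
    """Return (sample_name, slot); slot is "R1", "R2", "S", or None if untagged."""
    match = TAG_PATTERN.search(file_name)
    if match is None:
        # Untagged file: the name is everything up to the file extension.
        return READ_PATTERN.sub("", file_name), None
    # Normalise the bare 1/2 tags to R1/R2; "R1", "R2", and "S" pass through.
    slot = {"1": "R1", "2": "R2"}.get(match.group(1), match.group(1))
    return file_name[: match.start()], slot


def group_reads(paths):
    """Group read file paths by sample name and warn about orphaned R1/R2 files."""
    samples = {}
    tagged = set()  # samples that had at least one R1 or R2 tagged file
    for path in sorted(paths):
        file_name = Path(path).name  # the file path itself is not part of the name
        if READ_PATTERN.search(file_name) is None:
            continue  # not a reads file
        name, slot = split_tag(file_name)
        if slot in ("R1", "R2"):
            tagged.add(name)
        entry = samples.setdefault(name, {"R1": None, "R2": None, "S": None})
        entry[slot or "R1"] = path  # untagged single-end reads go in the R1 slot
    for name in sorted(tagged):
        if samples[name]["R1"] is None or samples[name]["R2"] is None:
            warnings.warn(f"Orphaned paired reads for sample {name!r}")
    return samples
```

Calling `group_reads` on a list of file paths returns a dictionary keyed by sample name, with `R1`, `R2`, and `S` entries for each sample.
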
Samples from TSV file

If given a TSV file, the module will expect 2-4 columns, tab-separated:


| Column | Value |
| --- | --- |
| 1 | Sample name |
| 2 | Filepath (reads or R1 reads if paired) |
| 3 | Optional: filepath for R2 reads if paired |
| 4 | Optional: filepath for singleton reads |
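
As a rough illustration of the column layout, the following sketch reads such a file with Python's standard `csv` module. The function name `parse_tsv` is hypothetical, and the handling of blank lines and empty cells is an assumption; MetaSnek's own parser may behave differently.

```python
import csv


def parse_tsv(tsv_path):
    """Read a 2-4 column TSV into {sample: {"R1": ..., "R2": ..., "S": ...}}."""
    samples = {}
    with open(tsv_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not row:
                continue  # assumption: blank lines are skipped
            if not 2 <= len(row) <= 4:
                raise ValueError(f"Expected 2-4 columns, got {len(row)}: {row}")
            name, *paths = row
            # Assumption: empty cells count as missing; pad absent columns with None.
            paths = [p or None for p in paths]
            r1, r2, s = (paths + [None, None])[:3]
            samples[name] = {"R1": r1, "R2": r2, "S": s}
    return samples
```

A row with only two columns yields `None` for both `R2` and `S`, and a row with three columns yields `None` for `S`.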


Example directory

reads
  ├── sample1_1.fastq.gz
  ├── sample1_2.fastq.gz
  ├── sample2_R1.fastq
  ├── sample2_R2.fastq
  ├── sample2_S.fastq
  └── sample3.fasta.gz

Example TSV

sample1	reads/sample1_1.fastq.gz	reads/sample1_2.fastq.gz
sample2	reads/sample2_R1.fastq	reads/sample2_R2.fastq	reads/sample2_S.fastq
sample3	reads/sample3.fasta.gz

Example result supplied to the pipeline

{
'sample1': 
    {
    'R1': 'reads/sample1_1.fastq.gz', 
    'R2': 'reads/sample1_2.fastq.gz',
    'S': None
    }, 
'sample2': 
    {
    'R1': 'reads/sample2_R1.fastq', 
    'R2': 'reads/sample2_R2.fastq',
    'S': 'reads/sample2_S.fastq',
    },
'sample3': 
    {
    'R1': 'reads/sample3.fasta.gz', 
    'R2': None,
    'S': None
    }
}
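
A pipeline consuming this dictionary typically needs to know whether each sample is paired or single-end. The sketch below shows one possible way to branch on the `R1`, `R2`, and `S` entries. `describe_layout` is a hypothetical helper and not part of MetaSnek.

```python
def describe_layout(reads):
    """Classify one sample entry by which of R1, R2, and S are set."""
    if reads["R1"] and reads["R2"]:
        return "paired+singletons" if reads["S"] else "paired"
    return "single"


# The example result from above.
samples = {
    "sample1": {"R1": "reads/sample1_1.fastq.gz", "R2": "reads/sample1_2.fastq.gz", "S": None},
    "sample2": {"R1": "reads/sample2_R1.fastq", "R2": "reads/sample2_R2.fastq", "S": "reads/sample2_S.fastq"},
    "sample3": {"R1": "reads/sample3.fasta.gz", "R2": None, "S": None},
}

layouts = {name: describe_layout(reads) for name, reads in samples.items()}
print(layouts)
```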

To use this module in your own pipelines

from metasnek import fastq_finder

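# Parse samples from a directory of reads ...
samples = fastq_finder.parse_samples_to_dictionary("reads")
# ... or from a TSV file (this second call overwrites the result above)
samples = fastq_finder.parse_samples_to_dictionary("samples.tsv")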
samples