RepeatModeler is poorly suited to sample sequencing data: it takes a long time to run and creates copious intermediate files. It clearly wasn't designed for the small fragments and reads that sample sequencing produces. The main difficulties are:
- The subsampling steps for each round take a long time (hours in later rounds) and run on a single core. The full script depends on this subsetting to run properly, so there isn't really a way around it.
- Parallelization occurs during the RECON analyses of rounds 2 to N, so it makes little sense to parallelize heavily: a major bottleneck is the single-core subsetting step (see the first point).
- Huge amounts of intermediate files are produced, and they grow rapidly with each round. Most of these are the `batch-*` files used for parallelization during the RECON rounds. In later rounds (5+), the output size inflates to over 200 GB, m