@ewels
Created August 7, 2013 16:28
In bioinformatics, raw ASCII text files can get massive. This script finds large uncompressed text files and prints their paths to STDOUT, so the list can be redirected to a file or piped to a compression command.
#!/usr/bin/perl
use warnings;
use strict;
use Cwd;
use File::Find;
####
# FIND UNCOMPRESSED FILES
# Prints the full path of every file larger than 50 MB that `file` reports as ASCII text
# One file path per line, so output can be piped to other tools, e.g.:
# perl find_uncompressed_files.pl | xargs gzip
# perl find_uncompressed_files.pl | grep sra
####
my $dir = $ARGV[0];
unless (defined $dir) {
    $dir = getcwd();
}
find(\&print_large_uncompressed, $dir);
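sub print_large_uncompressed {
    # File::Find chdirs into each directory, so $_ is the bare file name
    return unless -f $_ && (-s _) > 52428800; # regular file larger than 50 MB
    # List-form pipe open bypasses the shell, so names with spaces or quotes are safe;
    # -b drops the file name from the output so it cannot cause a false match
    open(my $fh, '-|', 'file', '-b', '--', $_) or return;
    my $type = do { local $/; <$fh> };
    close $fh;
    if ( defined $type && index( $type, "ASCII text" ) != -1 ) {
        print $File::Find::name . "\n";
    }
}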
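The usage lines in the header comment pipe the output straight into xargs, which splits on any whitespace, so a path containing a space would be broken into several arguments. A minimal sketch of a safer pipeline follows. It assumes an xargs that supports -0 and -r (GNU and recent BSD versions do), and compress_listed is a hypothetical helper name, not part of the original script. Paths containing newlines are still not handled, because the script emits one path per line.

```shell
# Hypothetical helper (not part of the gist): gzip every path read from stdin,
# one path per line. Newlines are converted to NUL bytes so that xargs -0
# passes names with spaces or quotes through intact.
# Assumption: xargs supports -r, which skips running gzip on empty input.
compress_listed() {
    tr '\n' '\0' | xargs -0 -r gzip --
}

# Example use (the directory is a placeholder):
#   perl find_uncompressed_files.pl /path/to/data | compress_listed
```

Saving the list to a file first and reviewing it before compressing is the more cautious option, since gzip replaces each original file with its .gz version.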