netsensei/columsinperl.md

Comparing two files each containing a single column dataset with Perl

Warning! One-liners such as these are basically hacks. Please look into the comm program, which is part of GNU Coreutils. Given sorted input, it does all of this without any of the complexity below. See: https://www.gnu.org/software/coreutils/manual/html_node/comm-invocation.html

You have 2 text files, each containing rows of data with a single column (e.g. e-mail addresses, UUIDs, names, MD5 hashes, ...).
You want to compare them very quickly, spending the least amount of time and energy, looking for ...

... rows both files have in common (intersection).
... rows which are in file A but not in file B.
... rows which are in file B but not in file A.

Here are several one-liners to accomplish just that with Perl.
Given two datasets ...


| file A.txt | file B.txt |
| --- | --- |
| apple | pear |
| orange | orange |
| lemon | lemon |
| ananas | |


... rows both files have in common (intersection)
perl -e 'open(I1,"<",$ARGV[0]) or die $!;open(I2,"<",$ARGV[1]) or die $!;while(<I1>){chomp;$a{$_}=1}while(<I2>){chomp;print "$_\n" if exists $a{$_}}' A.txt B.txt
Yields:


| result |
| --- |
| orange |
| lemon |


... rows which are in A but not in B (relative complement)
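perl -e 'open(I1,"<",$ARGV[0]) or die $!;open(I2,"<",$ARGV[1]) or die $!;while(<I1>){chomp;$a{$_}=1}while(<I2>){chomp;print "$_\n" if not exists $a{$_}}' B.txt A.txt
B.txt comes first because the first file is loaded into the lookup hash and the second file is filtered against it. Yields: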


| result |
| --- |
| apple |
| ananas |


... rows which are in B but not in A (relative complement)
perl -e 'open(I1,"<",$ARGV[0]) or die $!;open(I2,"<",$ARGV[1]) or die $!;while(<I1>){chomp;$a{$_}=1}while(<I2>){chomp;print "$_\n" if not exists $a{$_}}' A.txt B.txt
Yields:


| result |
| --- |
| pear |
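
For comparison, the comm program recommended in the warning at the top can produce all three results. The following is a minimal sketch using the same sample data; the scratch directory, the `.sorted` file names and the shell variables are illustrative. Since comm expects sorted input, both files are sorted first, with `LC_ALL=C` set so that sort and comm agree on the ordering.

```shell
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)"
printf '%s\n' apple orange lemon ananas > A.txt
printf '%s\n' pear orange lemon > B.txt

# comm compares sorted files; use one fixed locale for sort and comm.
export LC_ALL=C
sort A.txt > A.sorted
sort B.txt > B.sorted

# comm prints three columns: only in file 1, only in file 2, in both.
# The options -1, -2 and -3 suppress the corresponding column.
common=$(comm -12 A.sorted B.sorted)   # rows both files have in common
only_a=$(comm -23 A.sorted B.sorted)   # rows in A but not in B
only_b=$(comm -13 A.sorted B.sorted)   # rows in B but not in A

printf 'common:\n%s\nonly in A:\n%s\nonly in B:\n%s\n' "$common" "$only_a" "$only_b"
```

Unlike the Perl one-liners, which keep the row order of the second file, this prints each result in sorted order.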


Why Perl?


Alongside Python, Perl is available almost by default as a command-line interpreter on most *nix systems.
Regardless of language preferences, Perl is well suited to munging textual data on the command line.
And it's slightly more sane than using AWK or piping Bash commands.
Comparing lists is a rote exercise. Perl is a good enough tool for that in the vast majority of use cases.
Scaling issues with vast datasets are a specific problem that requires a specific approach.

When is this useful?


You are in an environment with restrictions on what you can install (e.g. no Python pandas).
It's a very quick, one-off, ad hoc task. You don't even want to bother pulling the data into a spreadsheet program.

When not?


Any use case that involves 2 or more datasets with multiple columns. It's possible to write a Perl one-liner to solve more complex classes of use cases, but beware of inadvertently inventing yet another crude database management system. It's far less painful to use existing lightweight tools such as SQLite.
Any architecture that needs to be robust and maintainable. One-liners such as these are basically hacks.