@netsensei
Last active January 1, 2023 09:25
Compare columns in 2 files with Perl

Comparing two files each containing a single column dataset with Perl

Warning! One-liners such as these are basically hacks. Consider the comm program, which is part of GNU Coreutils, first: given sorted input, it does all of this without any of the complexity below. See: https://www.gnu.org/software/coreutils/manual/html_node/comm-invocation.html
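As a rough sketch of what the comm route looks like: comm expects both inputs to be sorted, so the example sorts each file first. The file names A.txt and B.txt and the sample rows match the example dataset used in this document; the scratch directory and the .sorted file names are assumptions made purely for illustration.

```shell
#!/bin/sh
# Sketch: the three comparisons done with comm instead of Perl.
# The scratch directory and sample rows exist only for illustration.
dir=$(mktemp -d)
printf '%s\n' apple orange lemon ananas > "$dir/A.txt"
printf '%s\n' pear orange lemon > "$dir/B.txt"

# comm requires sorted input, so sort both files first.
# LC_ALL=C makes sort and comm agree on a plain byte-wise order.
LC_ALL=C sort "$dir/A.txt" > "$dir/A.sorted"
LC_ALL=C sort "$dir/B.txt" > "$dir/B.sorted"

# comm prints three columns: only in file 1, only in file 2, in both.
# -1, -2 and -3 each suppress the corresponding column.
both=$(LC_ALL=C comm -12 "$dir/A.sorted" "$dir/B.sorted")    # intersection
only_a=$(LC_ALL=C comm -23 "$dir/A.sorted" "$dir/B.sorted")  # in A, not in B
only_b=$(LC_ALL=C comm -13 "$dir/A.sorted" "$dir/B.sorted")  # in B, not in A

printf 'both:\n%s\n' "$both"
printf 'only in A:\n%s\n' "$only_a"
printf 'only in B:\n%s\n' "$only_b"
rm -rf "$dir"
```

Note that comm emits rows in sorted order, not in the order they appear in the original files.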

You have 2 text files, each holding a single column of data, one value per row (e.g. e-mail addresses, UUIDs, names, MD5 hashes, ...). You want to compare them quickly, spending the least amount of time and energy, looking for ...

  • ... rows both files have in common (intersection)
  • ... rows which are in file A but not in file B.
  • ... rows which are in file B but not in file A.

Here are several one-liners to accomplish just that with Perl.

Given two datasets ...

file A.txt    file B.txt
apple         pear
orange        orange
lemon         lemon
ananas

... rows both files have in common (intersection)

perl -e 'open(I1,"<",$ARGV[0]) or die $!;open(I2,"<",$ARGV[1]) or die $!;while(<I1>){chomp;$a{$_}=1};while(<I2>){chomp;print "$_\n" if exists $a{$_}}' A.txt B.txt

Yields:

result
orange
lemon

... rows which are in A but not in B (relative complement)

perl -e 'open(I1,"<",$ARGV[0]) or die $!;open(I2,"<",$ARGV[1]) or die $!;while(<I1>){chomp;$a{$_}=1};while(<I2>){chomp;print "$_\n" if not exists $a{$_}}' B.txt A.txt

Yields:

result
apple
ananas

... rows which are in B but not in A (relative complement)

perl -e 'open(I1,"<",$ARGV[0]) or die $!;open(I2,"<",$ARGV[1]) or die $!;while(<I1>){chomp;$a{$_}=1};while(<I2>){chomp;print "$_\n" if not exists $a{$_}}' A.txt B.txt

Yields:

result
pear

Why Perl?

  • Like Python, Perl is available as a command-line interpreter on most *nix systems, almost by default.
  • Regardless of language preferences, Perl is well suited to munging textual data on the command line, and it's slightly saner than using AWK or piping Bash commands together.
  • Comparing lists is a rote exercise. Perl is a good enough tool for that in the vast majority of use cases. Scaling issues with vast datasets are a separate problem that requires its own approach.

When is this useful?

  • You are in an environment that restricts what you can install (e.g. no Python pandas).
  • It's a very quick, one-off, ad hoc task. You don't even want to bother pulling the data into a spreadsheet program.

When not?

  • Any use case that involves 2 or more datasets with multiple columns. It's possible to write a Perl one-liner for more complex classes of use cases, but beware of inadvertently inventing yet another crude database management system. It's far less painful to use existing lightweight tools such as SQLite.
  • Any architecture that needs to be robust and maintainable. One-liners such as these are basically hacks.