[Deduplicate lines with awk] from #awk #commands
awk '!visited[$0]++' your_file > deduplicated_file
  • The awk "script" !visited[$0]++ is executed for each line of the input file.
  • visited[] is a variable of type associative array (a.k.a. Map). We don't have to initialize it because awk will do it the first time we access it.
  • The $0 variable holds the contents of the line currently being processed.
  • visited[$0] accesses the value stored in the map under the key $0 (the line being processed), i.e. the number of times this line has occurred so far (the ++ described below maintains this count).
  • The ! negates that count: In awk, any nonzero numeric value or any nonempty string value is true. By default, variables are initialized to the empty string, which is zero if converted to a number. That being said:
    • If visited[$0] holds a number greater than zero, the negation evaluates to false.
    • If visited[$0] holds zero or an empty string, the negation evaluates to true.
  • The ++ operation increases the variable's value (visited[$0]) by one.
    • If the value is empty, awk converts it to 0 (number) automatically before increasing it.
    • Note: ++ is used here as a post-increment, so the increment happens after the variable's value has been read; the ! therefore sees the count from before the current line.
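
To watch the counter and the negation change from line to line, here is a minimal sketch. The three-line sample input and the names before and keep are made up for illustration; they are not part of the one-liner.

```shell
# Hypothetical input: the line "a" appears twice.
trace=$(printf 'a\nb\na\n' | awk '{
    before = visited[$0] + 0   # occurrences seen so far (empty counts as 0)
    keep   = !visited[$0]++    # negate the old value, then increment
    print $0, "before=" before, "keep=" keep
}')
printf '%s\n' "$trace"
```

The second "a" should be the only line reported with keep=0, because its counter was already 1 when the expression was evaluated.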

Summing up, the whole expression evaluates to:

  • true if the occurrences are zero/empty string
  • false if the occurrences are greater than zero

awk statements (also called rules) consist of a pattern expression and an associated action:

<pattern/expression> { <action> }

If the pattern evaluates to true, the associated action is executed. If we omit the action, awk prints the current input line: an omitted action is equivalent to { print $0 }.
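
A small sketch of that default, assuming a made-up three-line sample and a hypothetical pattern /an/ (any pattern would do). Both commands should print the same lines:

```shell
# Hypothetical sample input piped in via printf.
implicit=$(printf 'apple\nbanana\nmango\n' | awk '/an/')              # action omitted
explicit=$(printf 'apple\nbanana\nmango\n' | awk '/an/ { print $0 }')  # action spelled out
printf '%s\n' "$implicit"
[ "$implicit" = "$explicit" ] && echo "same output"
```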

Our script consists of one awk statement with an expression, omitting the action. So this:

awk '!visited[$0]++' your_file > deduplicated_file

is equivalent to this:

awk '!visited[$0]++ { print $0 }' your_file > deduplicated_file

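For every line of the file, if the expression succeeds (that is, the line has not been seen before), the line is printed to the output. Otherwise, the action is not executed, and nothing is printed. As a result, only the first occurrence of each line is kept, in its original order.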
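
As a quick check of this behavior, here is a sketch on a made-up five-line input (fed through printf instead of your_file). Both forms of the command should agree, and repeated lines should appear only once:

```shell
# Hypothetical sample input with two repeated lines.
input='apple
banana
apple
cherry
banana'
short=$(printf '%s\n' "$input" | awk '!visited[$0]++')
long=$(printf '%s\n' "$input" | awk '!visited[$0]++ { print $0 }')
printf '%s\n' "$short"
```

Unlike sort | uniq, this does not require sorting the input, at the cost of keeping every distinct line in memory.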

