awk '!visited[$0]++' your_file > deduplicated_file
- The awk "script" !visited[$0]++ is executed for each line of the input file.
- visited[] is a variable of type associative array (a.k.a. map). We don't have to initialize it because awk will do it the first time we access it.
- The $0 variable holds the contents of the line currently being processed.
- visited[$0] accesses the value stored in the map with a key equal to $0 (the line being processed), a.k.a. the occurrences (which we set below).
- The ! negates the occurrences' value:
In awk, any nonzero numeric value or any nonempty string value is true.
By default, variables are initialized to the empty string, which is zero if
converted to a number. That being said:
- If visited[$0] returns a number greater than zero, this negation is resolved to false.
- If visited[$0] returns a number equal to zero or an empty string, this negation is resolved to true.
- The ++ operation increases the variable's value (visited[$0]) by one.
  - If the value is empty, awk converts it to 0 (number) automatically and then it gets increased.
  - Note: ++ is a post-increment here, so the increment happens after the variable's value has been read for the negation.
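The conversion and post-increment rules above can be checked in isolation. This is a minimal sketch run in a BEGIN block, so it needs no input file; the array name count and the key "x" are made up for illustration.

```shell
awk 'BEGIN {
    # count["x"] has never been assigned, so it is the empty string (false)
    if (!count["x"]) print "unset element is false"
    # Post-increment yields the old value (0) and then stores 1
    old = count["x"]++
    print "old:", old, "new:", count["x"]
    # A nonzero number is true
    if (count["x"]) print "element is now true"
}'
# Prints:
# unset element is false
# old: 0 new: 1
# element is now true
```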
Summing up, the whole expression evaluates to:
- true if the occurrences are zero/empty string
- false if the occurrences are greater than zero
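To watch the expression change from line to line, the following sketch prints each input line together with its occurrence count before the increment and the result of the expression (1 for true, 0 for false). The variable names before and keep and the sample input are assumptions for illustration only.

```shell
printf 'a\nb\na\nc\nb\n' | awk '{
    before = visited[$0] + 0      # occurrences seen so far, forced to a number
    keep   = !visited[$0]++       # the deduplication expression itself
    print $0, before, keep
}'
# Prints:
# a 0 1
# b 0 1
# a 1 0
# c 0 1
# b 1 0
```

Only the lines whose last column is 1 would be printed by the original one-liner.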
awk statements consist of a pattern-expression and an associated action.
<pattern/expression> { <action> }
If the pattern succeeds, then the associated action is executed.
If we don't provide an action, awk, by default, prints the input.
An omitted action is equivalent to { print $0 }.
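As a small illustration of the pattern/action rule, unrelated to deduplication, the first command below has only a pattern and relies on the default print, while the second attaches an explicit action to the same pattern. The NR > 1 pattern and the sample input are chosen here just as an example.

```shell
# Pattern only: every line after the first is printed as-is
printf 'one\ntwo\nthree\n' | awk 'NR > 1'
# Prints:
# two
# three

# Same pattern with an explicit action that prefixes the line number
printf 'one\ntwo\nthree\n' | awk 'NR > 1 { print NR ": " $0 }'
# Prints:
# 2: two
# 3: three
```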
Our script consists of one awk statement with an expression, omitting the action. So this:
awk '!visited[$0]++' your_file > deduplicated_file
is equivalent to this:
awk '!visited[$0]++ { print $0 }' your_file > deduplicated_file
For every line of the file, if the expression succeeds, the line is printed to the output. Otherwise, the action is not executed, and nothing is printed.
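A quick way to confirm that the two forms behave the same is to run both on the same input and compare the results. This sketch feeds a small made-up sample through a pipe instead of your_file; the shell variable names are illustrative.

```shell
input=$(printf 'b\na\nb\nc\na\n')

# Implicit action (default print) versus explicit { print $0 }
implicit=$(printf '%s\n' "$input" | awk '!visited[$0]++')
explicit=$(printf '%s\n' "$input" | awk '!visited[$0]++ { print $0 }')

# Both keep the first occurrence of each line, in the original order
[ "$implicit" = "$explicit" ] && printf '%s\n' "$implicit"
# Prints:
# b
# a
# c
```

Note that, unlike sort -u, this keeps the lines in their original order.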