@devster31
Last active January 23, 2022 00:51
[Deduplicate lines with awk] from https://opensource.com/article/19/10/remove-duplicate-lines-files-awk #awk #commands
awk '!visited[$0]++' your_file > deduplicated_file
  • The awk "script" !visited[$0]++ is executed for each line of the input file.
  • visited[] is a variable of type associative array (a.k.a. Map). We don't have to initialize it because awk will do it the first time we access it.
  • The $0 variable holds the contents of the line currently being processed.
  • visited[$0] accesses the value stored in the map under the key $0 (the line being processed). That value is the number of times the line has been seen so far (the occurrences, which the ++ described below maintains).
  • The ! negates the occurrence count. In awk, any nonzero numeric value or any nonempty string value is true. Uninitialized variables hold the empty string, which converts to zero when used as a number. Therefore:
    • If visited[$0] is a number greater than zero, the negation evaluates to false.
    • If visited[$0] is zero or the empty string, the negation evaluates to true.
  • The ++ operation increases the value of visited[$0] by one.
    • If the value is empty, awk converts it to the number 0 automatically and then increments it.
    • Note: ++ is used here as a post-increment, so the expression yields the value from before the increment. The ! therefore sees the count as it was before the current line was counted.
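
The truthiness and auto-initialization rules above can be checked directly. This is a minimal sketch, assuming any POSIX awk; the names count and old are made up for the illustration and do not appear in the one-liner:

```shell
# count is never assigned before use, so it starts as the empty string (numerically 0).
result=$(awk 'BEGIN {
    if (!count) print "unset is false"   # empty string is false, so !count is true
    old = count++                        # post-increment: old gets the value before the increment
    print old + 0, count                 # old is 0, count is now 1
    if (!count) print "not reached"      # 1 is true, so !count is false
}')
printf '%s\n' "$result"
# unset is false
# 0 1
```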

Summing up, the whole expression evaluates to:

  • true if the occurrences are zero/empty string
  • false if the occurrences are greater than zero
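
To see this per line, the following sketch prints every input line together with its count before the increment and the value of the whole expression. The sample input and the names seen and keep are assumptions made for the illustration:

```shell
# Print each line, its occurrences so far, and the result of !visited[$0]++.
trace=$(printf 'a\nb\na\nc\nb\n' | awk '{
    seen = visited[$0] + 0        # occurrences so far (0 on first sight)
    keep = !visited[$0]++         # same expression as in the one-liner
    print $0, seen, keep
}')
printf '%s\n' "$trace"
# a 0 1
# b 0 1
# a 1 0
# c 0 1
# b 1 0
```

Only the lines whose last column is 1 would be printed by the one-liner.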

Each awk statement consists of a pattern expression and an associated action:

<pattern/expression> { <action> }

If the pattern evaluates to true for a line, the associated action is executed. If we don't provide an action, awk prints the current input line by default: an omitted action is equivalent to { print $0 }.
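
The default action is not specific to the deduplication script. As a small sketch with an unrelated pattern (NR is awk's built-in record number; the three-line sample input is made up), both forms below produce the same output:

```shell
# Same pattern, once with the action omitted and once with it spelled out.
implicit=$(printf 'one\ntwo\nthree\n' | awk 'NR != 2')
explicit=$(printf 'one\ntwo\nthree\n' | awk 'NR != 2 { print $0 }')
printf '%s\n' "$implicit"
# one
# three
[ "$implicit" = "$explicit" ] && echo "identical output"
```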

Our script consists of one awk statement with an expression, omitting the action. So this:

awk '!visited[$0]++' your_file > deduplicated_file

is equivalent to this:

awk '!visited[$0]++ { print $0 }' your_file > deduplicated_file

For every line of the file, if the expression succeeds, the line is printed to the output. Otherwise, the action is not executed, and nothing is printed.
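
An end-to-end run might look like the sketch below. The temporary directory and the sample contents are assumptions for the illustration; the file names follow the command above:

```shell
# Create a small sample file in a scratch directory.
workdir=$(mktemp -d)
printf 'apple\nbanana\napple\ncherry\nbanana\n' > "$workdir/your_file"

# Keep only the first occurrence of each line.
awk '!visited[$0]++' "$workdir/your_file" > "$workdir/deduplicated_file"

deduplicated=$(cat "$workdir/deduplicated_file")
printf '%s\n' "$deduplicated"
# apple
# banana
# cherry

rm -r "$workdir"
```

The first occurrence of each line is kept and the original order is preserved, which is the main difference from sorting the file and dropping adjacent duplicates.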
