Linux Text Processing

Cheatsheet of typical use cases for sed, grep and other text utilities.


1.   General
1.1.   Viewers/editors options
1.2.   Delimiters
1.3.   Units
1.4.   Minimize buffering
1.5.   Color control
1.6.   less runtime
1.7.   Extended regexp


2.   Removing
2.1.   Remove line(s) starting with
2.2.   Remove first N line(s)
2.3.   Remove last N line(s)


3.   Advanced addressing
3.1.   Substitute all matches
3.2.   Substitute N-th line only
3.3.   Substitute M-th match separately for each line
3.4.   Substitute lines from N to N2
3.5.   Conditional substitution


4.   Selective Output
4.1.   Print affected lines, hide unchanged
4.2.   Additionally print affected lines into a file
4.3.   Print line number before each line


5.   Flow control
5.1.   Jump to L if the last substitution resulted in a replacement
5.2.   Make sed exit with code E


6.   Misc
6.1.   Remove line separators \n
6.2.   Replace line separators \n with NUL-bytes
6.3.   Remove SGRs (ANSI escape sequences for text formatting)
6.4.   Fast search of text files / filter binary files
6.5.   [ -n ... ] and [ -z ... ] aspects
6.6.   Exit-code-dependent conditions


1.   General


1.1.   Viewers/editors options


                                   less       nano      other
Show whitespace                               Alt+P     cat -A
Interpret ANSI control sequences   less -R
Toggle line wrapping               less -S    Alt+S
Line numbers                       less -N    Alt+N     cat -n, grep -n, sed =
String search                      /          Ctrl+W
String replace                                Ctrl+\    "${var//1/2}", sed s


1.2.   Delimiters


Tool     Delimiter option (in)   Default delimiter (as a regexp)   Escaped character setup [1]   Example
cut      -d                      \t                                $'\n'                         cut -d$'\n'
sort     -t                      \S\s [2]                          $'\n'                         sort -t$'\t'
column   -s                      \s+                               $'\n'                         column -t -s$'\t'
xargs    -d                      \s+                               '\n' or '\x0a'                xargs -d'\n'
grep     n/a                                                       $'\t'                         grep -Ee $'\t'
tr       n/a                                                       '\n', '\t', but $'\x1b'       tr '\t' ' '


[1] Results may vary between implementations: e.g. '\x1b' works for sed but not for grep, whereas '\e' is recognized by grep and ignored by sed. $'\x1b' or $'\e' work everywhere because they are expanded by the shell (assuming bash is in use).


[2] The default delimiter of uniq is the same as that of sort, except that it cannot be changed by an option.
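
As a minimal sketch of passing a tab delimiter when $'...' quoting is not available (plain sh rather than bash), the tab can be produced with printf. The variable name tab, the helper by_second_field and the sample data are illustrative assumptions.

```shell
# Portable way to get a literal tab when $'\t' is unavailable (e.g. plain sh).
tab=$(printf '\t')

# Sort tab-separated records numerically by the 2nd field, then keep the 1st field.
by_second_field() {
    sort -t"$tab" -k2,2n | cut -d"$tab" -f1
}

printf 'beta\t2\nalpha\t1\ngamma\t3\n' | by_second_field
```

In bash the same commands can be written directly as sort -t$'\t' and cut -d$'\t'.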


1.3.   Units


Tool         bytes   chars    lines   words (fields)
cut          -b      -c               -f
wc           -c      -m       -l      -w
head, tail   -c               -n
uniq                 -s, -w           -f


1.4.   Minimize buffering

sed --unbuffered
grep --line-buffered
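
These options matter mostly when the input is a stream that never ends. A small sketch, assuming GNU grep and GNU sed; the helper name filter_errors and the log path in the comment are hypothetical.

```shell
# Filter a stream line by line without waiting for buffers to fill.
# Both long options are GNU-specific.
filter_errors() {
    grep --line-buffered 'ERROR' | sed --unbuffered 's/^/[err] /'
}

# Typical use with a growing file (hypothetical path):
#   tail -f /var/log/app.log | filter_errors
printf 'INFO ok\nERROR disk full\n' | filter_errors
```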

1.5.   Color control

ls --color[=always|never|auto]
grep --color[=always|never|auto]
git diff --color[=always|never|auto]
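
The difference between the modes shows up once the output goes to a pipe instead of a terminal. The following sketch assumes GNU grep; the sample text is arbitrary.

```shell
# With --color=always grep emits SGR sequences even when stdout is a pipe;
# cat -v makes the escape character visible as ^[.
printf 'foo bar\n' | grep --color=always 'foo' | cat -v

# With --color=never the output is plain text regardless of the terminal.
printf 'foo bar\n' | grep --color=never 'foo'
```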

1.6.   less runtime

less options can be passed as command-line arguments or typed directly into a running less session (e.g. - R Enter).

1.7.   Extended regexp

Both grep and sed support extended regular expressions, which are enabled with the -E option.
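
A short comparison of the two syntaxes, as a sketch with made-up input; the backslash form of alternation in basic regexps is assumed to be the GNU one.

```shell
# Basic regexp (default): alternation needs a backslash (GNU syntax).
printf 'cat\ndog\nbird\n' | grep 'cat\|dog'

# Extended regexp: the same pattern without backslashes.
printf 'cat\ndog\nbird\n' | grep -E 'cat|dog'

# sed -E: unescaped groups and the + quantifier.
printf 'aaab\n' | sed -E 's/(a+)b/\1/'
```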

2.   Removing


2.1.   Remove line(s) starting with

grep -v ^NANI
sed /^NANI/d

2.1.1.   Keep line(s) starting with

grep ^NANI

sed '/^NANI/!d'

Single quotes prevent the shell's history expansion of !
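
Both forms applied to the same input, as a small sketch; the sample data and the variable name input are illustrative.

```shell
# Split a config-like stream into comment lines and everything else.
input=$(printf '# comment\nkey=value\n# another\n')

printf '%s\n' "$input" | sed '/^#/d'     # drop lines starting with "#"
printf '%s\n' "$input" | sed '/^#/!d'    # keep only lines starting with "#"
```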

2.2.   Remove first N line(s)

sed [-e] 1,<N>d
# or
tail -n +<N+1>

Examples

Remove first line from the top:
sed 1d

tail -n +2

Remove first 3 lines:
sed 1,3d

tail -n +4
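
Since tail expects the number of the first line to keep, a tiny wrapper can do the N+1 arithmetic. This is only a sketch; the function name skip_first is an assumption.

```shell
# Drop the first N lines of stdin; N is the first argument.
skip_first() {
    tail -n "+$(( $1 + 1 ))"
}

printf 'header\nrow1\nrow2\n' | skip_first 1
```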


2.3.   Remove last N line(s)

head -n -<N>

Examples

Remove last line:
head -n -1
# or
sed \$d

$ as an address means the last line.


Remove last 5 lines:
head -n -5

Remove lines from 5 to the last one:
sed 5,\$d
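
A sketch of the same idea as a reusable helper. It assumes GNU head, which accepts a negative line count; the name drop_last is hypothetical.

```shell
# Drop the last N lines of stdin; N is the first argument.
# Relies on GNU head accepting a negative line count.
drop_last() {
    head -n "-$1"
}

seq 1 5 | drop_last 2
```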


3.   Advanced addressing


3.1.   Substitute all matches

sed s/./_/g

g means "global"

3.2.   Substitute N-th line only

sed [-e] '<N> s/$/upd/'

3.2.1.   Substitute every K-th line, starting from N (including N-th)

sed [-e] '<N>~<K> s/$/upd/'

This is a GNU extension, so it will not work with the sed shipped with macOS [➚].

Examples

Append "upd" to first line only:
sed "1 s/$/upd/"

Append "upd" to 5th line only:
sed "5 s/$/upd/"

Append "upd" to last line only:
sed "$ s/$/upd/"

Append "upd" to every 2nd line from 5th, i.e. 5, 7, 9..:
sed '5~2 s/$/upd/'
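
Where the first~step address is unavailable, awk can select the same lines. The following is a sketch of one possible equivalent, not part of the original cheatsheet; the function name every_kth and its two parameters are assumptions.

```shell
# GNU sed: append "upd" to lines 5, 7, 9, ...
seq 1 9 | sed '5~2 s/$/upd/'

# Portable alternative (awk): the same selection, with first=5 and step=2
# passed as parameters.
every_kth() {
    awk -v first="$1" -v step="$2" \
        'NR >= first && (NR - first) % step == 0 { $0 = $0 "upd" } 1'
}
seq 1 9 | every_kth 5 2
```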


3.3.   Substitute M-th match separately for each line

sed s/./A/<M>

The POSIX standard does not specify what should happen when you mix the g and number modifiers [➚1].
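
For GNU sed specifically, the documented behaviour is to skip the matches before the M-th and replace the M-th and all later ones. A one-line sketch, assuming GNU sed:

```shell
# GNU sed: with both a number and g, matches before the 3rd are skipped
# and the 3rd and all later ones are replaced.
echo 'aaaaa' | sed 's/a/X/3g'
```

Other implementations may treat this combination differently.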

Examples

Replace the 3rd character of each line with "A":
sed s/./A/3

Can be combined with a line selector, e.g. the next command replaces the 2nd occurrence of 'r' with 'A' on every 4th line, starting from the first one (lines 1, 5, 9, ...):
sed 1~4s/r/A/2

Replace the last match on each line (dirty hack, educational use only!):
rev | sed 's/./A/' | rev

I'm pretty sure there is a cleaner way to do it, but at the moment I don't know exactly how.
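
One candidate for a cleaner approach, offered as a sketch rather than the author's method, relies on the greediness of .* so that the character following it is the last occurrence on the line. The helper name replace_last_r is hypothetical, and the example targets a specific character ('r') rather than any character.

```shell
# Replace only the last "r" on each line: the greedy .* consumes as much as
# possible, so the "r" that follows it is the final one.
replace_last_r() {
    sed 's/\(.*\)r/\1A/'
}

echo 'error report' | replace_last_r
```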


3.4.   Substitute lines from N to N2

sed [-e] '<N>,<N2> s/./A/'

3.4.1.   Substitute lines from N to (N+K)

sed [-e] '<N>,+<K> s/./A/'

Examples

Prepend lines 5, 6 and 7 with "---":
sed 5,7s/^/---/

Prepend lines 5-12 with "---":
sed 5,+7s/^/---/
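
A quick way to check which lines a range covers is to run it on numbered input. A sketch assuming GNU sed, where the +K form is documented as an extension:

```shell
# Lines 3..5 (line 3 plus the next 2) get the prefix.
seq 1 6 | sed '3,+2 s/^/---/'
```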


3.5.   Conditional substitution

Apply expr2 to lines that match expr1:
sed "<expr1> <expr2>"

Example

Replace all occurrences of "black" with "white" if the line starts with "color":
sed "/^color/ s/black/white/g"
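
The condition can also be inverted with ! after the address. A small sketch with made-up input, assuming GNU sed (which allows a space after !):

```shell
# Substitute only on lines that start with "color".
printf 'color: black\nshape: black box\n' | sed '/^color/ s/black/white/g'

# Inverse condition: substitute on lines that do NOT start with "color".
printf 'color: black\nshape: black box\n' | sed '/^color/! s/black/white/g'
```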


4.   Selective Output


4.1.   Print affected lines, hide unchanged

sed --quiet s/A/B/p

4.2.   Additionally print affected lines into a file

sed "s/A/B/w /dev/stdout"
# or
sed "s/A/B/w /tmp/temp"
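
A sketch of the second form using a temporary file, so that the full stream and the changed lines can be inspected separately. The variable name changed is illustrative.

```shell
# Write changed lines to a temporary file while the full stream goes to stdout.
changed=$(mktemp)
printf 'A1\nC2\nA3\n' | sed "s/A/B/w $changed"

echo '--- changed lines only:'
cat "$changed"
rm -f "$changed"
```

The file name after w extends to the end of the line, so it has to be the last thing in the expression.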

4.3.   Print line number before each line

sed '=; s/./A/'
# or
sed -e = -e s/./A/
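
Note that = prints the number on its own line. If the number should share a line with the text, a second sed can join each pair; this is a sketch, and the helper name number_lines is an assumption.

```shell
# "=" prints the number on a separate line; a second sed joins each pair
# (number, line) with a single space.
number_lines() {
    sed = | sed 'N; s/\n/ /'
}

printf 'foo\nbar\n' | number_lines
```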

5.   Flow control


5.1.   Jump to L if the last substitution resulted in a replacement

sed ':<L> <command1>; t<L>;'

5.1.1.   Jump to L if the last substitution DID NOT result in a replacement

sed ':<L> <command1>; T<L>;'
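
A typical use of t is a loop that repeats a substitution until it stops matching. The sketch below inserts thousands separators; the helper name group_digits is hypothetical, and each command is passed with its own -e to avoid relying on how labels are terminated. The T command is a GNU extension, while t is standard.

```shell
# Insert thousands separators: repeat the substitution until it stops matching.
group_digits() {
    sed -E -e ':L' -e 's/([0-9])([0-9]{3})($|,)/\1,\2\3/' -e 'tL'
}

echo '1234567' | group_digits
```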

5.2.   Make sed exit with code E

sed q<E>

Example

Search for the character 'b' on every 5th line, starting from line 5, until it is found or EOF is reached; on the first match, replace it with 'w' and stop immediately with exit code 3:
sed '5~5s/b/w/; tQ; T; :Q q3;'
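
A simpler sketch of the same exit-code mechanism, assuming GNU sed (the exit code argument to q is a GNU extension). The variable rc is illustrative.

```shell
# Stop at the first line containing "b" and report exit code 3 (GNU sed).
rc=0
printf 'a\nb\nc\n' | sed '/b/q3' || rc=$?
echo "exit code: $rc"
```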


6.   Misc


6.1.   Remove line separators \n

tr -d '\n'
# or
sed -z 's/\n//g'

6.2.   Replace line separators \n with NUL-bytes


tr '\n' '\0'

sed -z 's/\n/\x00/g'


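s/\n/\0/g does not work because in the replacement part GNU sed treats \0 as the whole match (the same as &), so every newline is replaced with itself; use \x00 instead. Similarly, \e is not recognized and \x1b should be used.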
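
The usual reason for NUL-separated output is to pass items containing whitespace to a consumer such as xargs -0. A minimal sketch with made-up input:

```shell
# NUL-separated items survive embedded spaces when passed to xargs -0.
printf 'first item\nsecond item\n' | tr '\n' '\0' | xargs -0 -n1 echo 'arg:'
```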

6.3.   Remove SGRs (ANSI escape sequences for text formatting)

sed -Ee 's/\x1b\[[0-9:;]*m//g'

Example

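Remove SGRs as in the previous code block, but in two steps, printing each line with its number at every stage: the original line, the line with only the escape characters stripped (so the ANSI sequence internals are visible), and the fully cleaned line: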
sed -nEe '=;p;s/\x1b(\[[0-9;]*)m/\1]/g' -e '=;p;s/\[[0-9;]*\]//g' -e '=;p'
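
The basic command wrapped as a filter and applied to colored input, as a sketch assuming GNU sed (for \x1b in the regexp). The function name strip_sgr is hypothetical.

```shell
# Strip SGR sequences so colored output can be stored or compared as plain text.
strip_sgr() {
    sed -E 's/\x1b\[[0-9:;]*m//g'
}

printf '\033[1;31mred\033[0m and \033[32mgreen\033[0m\n' | strip_sgr
```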


6.4.   Fast search of text files / filter binary files

find . -type f -exec grep -Iq . {} \; -print

The -I option to grep tells him to ignore binary files, and the "." option along with -q will make him match text files [...] [➚2].
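
To see the filter in action, the sketch below builds a throwaway directory with one text file and one binary file; the file names are illustrative.

```shell
# Build a throwaway directory with one text and one binary file.
dir=$(mktemp -d)
printf 'hello\n' > "$dir/notes.txt"
printf '\000\001\002' > "$dir/blob.bin"

# Only non-empty files that grep considers text are printed.
find "$dir" -type f -exec grep -Iq . {} \; -print

rm -rf "$dir"
```

Only the path of notes.txt should be listed; empty files are skipped as well, because "." has nothing to match in them.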

6.5.   [ -n ... ] and [ -z ... ] aspects

Sometimes it's necessary to use the [ ... ] form of the condition check command, e.g. when bash is unavailable and there is only plain sh (a common situation when working with Docker).
There is (at least) one subtle aspect regarding -n and -z modes:
$ VAR=
$ if [ -n $VAR ] ; then echo true/$?; else echo false/$? ; fi
true/0
$ if [ -z $VAR ] ; then echo true/$?; else echo false/$? ; fi
true/0


Each cell shows result/exit code.

Command                -n $VAR   -n "$VAR"   -z $VAR   -z "$VAR"
sh -c '[ ... ]'        true/0    false/1     true/0    true/0
bash -c '[[ ... ]]'    false/1   false/1     true/0    true/0


The result for unquoted -n $VAR can be explained as follows:

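The [ $@ ] form is roughly equivalent to test $@, and the behaviour of test depends on the number of arguments. In this particular case VAR is defined but empty, so the unquoted form expands to nothing and the command becomes [ -n ]: with a single argument, test only checks that the argument itself ("-n") is a non-empty string, which is true. The quoted form expands to "", which the shell passes as a separate (empty) argument, as it should, so -n is applied to it and the check fails.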
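
The difference in argument count can be made visible with a small helper. This is a sketch; count_args is a hypothetical function that just prints how many arguments it received.

```shell
VAR=

# Print the number of arguments received.
count_args() { echo "$#"; }

count_args -n $VAR      # unquoted: the empty expansion disappears
count_args -n "$VAR"    # quoted: the empty string is kept as an argument
```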

6.6.   Exit-code-dependent conditions


After a negated command, $? no longer holds the command's own exit code: ! inverts the exit status of the pipeline it precedes, so $? becomes 0 for any failing command. However, there is one unclear aspect (where is the original code kept between the ! invocations?):
fn() { return 3; }
! fn ; echo $?
0
! ! fn ; echo $?
3
UPD: the original exit code is stored in $PIPESTATUS. Apparently ! is handled by the same pipeline mechanism.
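
A sketch of reading the original code after a negation. PIPESTATUS is bash-specific, so the snippet is run through bash -c explicitly; fn is the same sample function as above.

```shell
bash -c '
fn() { return 3; }
! fn
# The negated status is 0; the status before negation is in PIPESTATUS.
echo "negated: $?, original: ${PIPESTATUS[0]}"
'
```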


Author: Alexandr Shavykin

Contact: 0.delameter@gmail.com

Date: 24-Jul-24 08:45:11 PDT