data
1 hour
What we said people would come away being able to do:
- extract text that matches patterns from files
- find and replace text using patterns
- rearrange columns in files
brief recap of regular expressions
wildcards
- . (any single character)
- ? (zero or one of the preceding item; in shell globs, a single character)
extending matches
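- * zero or more of the preceding item
- + one or more of the preceding item
in basic regular expressions (the default for grep and sed), + and ? need to be escaped with a \ to act as quantifiers (a GNU extension); with grep -E or sed -E they work unescaped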
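The recap above can be tried out with a small sketch like the following. The sample file and its contents are made up for illustration, and GNU grep is assumed for the escaped form.

```shell
# Hypothetical sample file, created only for this demonstration
demo=$(mktemp)
printf 'color\ncolour\ncolouur\ncat\ncut\nct\n' > "$demo"

# . matches exactly one character, so "ct" is not matched
any_char=$(grep 'c.t' "$demo")

# -E switches to extended regular expressions, where ? + * need no backslash
zero_or_one=$(grep -E 'colou?r' "$demo")    # u is optional
one_or_more=$(grep -E 'colou+r' "$demo")    # at least one u
zero_or_more=$(grep -E 'colou*r' "$demo")   # any number of u, including none

# Without -E, GNU grep needs the backslash for + to act as a quantifier
escaped=$(grep 'colou\+r' "$demo")

printf '%s\n' "$any_char" "$zero_or_one" "$one_or_more" "$zero_or_more"
rm -f "$demo"
```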
difference between ' and "
- ' (single quotes): bash passes everything inside as is, with no substitution
- " (double quotes): bash substitutes variables and commands inside these
this makes a difference if you want to use the contents of a bash variable as a pattern
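A short sketch of the quoting difference, assuming a made-up variable called name and a made-up list of names:

```shell
# Hypothetical variable and sample file, for illustration only
name='Kelvin'
names=$(mktemp)
printf 'Calvin\nKelvin\nMelvin\n' > "$names"

# Single quotes: the text $name is passed on unchanged
literal=$(printf '%s\n' '$name')

# Double quotes: bash replaces $name with its contents first
expanded=$(printf '%s\n' "$name")

# So a variable used as a grep pattern needs double quotes
matches=$(grep "$name" "$names")

printf '%s\n' "$literal" "$expanded" "$matches"
rm -f "$names"
```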
basics of grep
grep 'pattern' file
make the pattern case insensitive
grep -i 'pattern' file
invert the search
grep -v 'pattern' file
count how many lines match
grep -c 'pattern' file
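The options so far can be compared side by side on one small file. This is only a sketch; the file contents are invented for the example.

```shell
# Hypothetical sample file
fruit=$(mktemp)
printf 'Apple\napple pie\nbanana\nAPPLE\n' > "$fruit"

plain=$(grep 'apple' "$fruit")          # case sensitive: only "apple pie"
ignore_case=$(grep -i 'apple' "$fruit") # also matches Apple and APPLE
inverted=$(grep -v 'apple' "$fruit")    # every line that does NOT match
count=$(grep -c 'apple' "$fruit")       # number of matching lines
count_i=$(grep -ic 'apple' "$fruit")    # options can be combined

printf '%s\n' "$plain" "$ignore_case" "$inverted" "$count" "$count_i"
rm -f "$fruit"
```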
view context of results
# show the line number of the results
grep -n 'pattern' file
# show one extra line AFTER results (-A)
grep -A 1 'pattern' file
# show one extra line BEFORE results (-B)
grep -B 1 'pattern' file
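A minimal sketch of the context options on an invented five-line file. Note that the context flags are the uppercase -A and -B; lowercase -a and -b are different grep options.

```shell
# Hypothetical sample file with one word per line
lines=$(mktemp)
printf 'one\ntwo\nthree\nfour\nfive\n' > "$lines"

numbered=$(grep -n 'three' "$lines")   # prefix each match with its line number
after=$(grep -A 1 'three' "$lines")    # the match plus one line after it
before=$(grep -B 1 'three' "$lines")   # the match plus one line before it

printf '%s\n' "$numbered" "$after" "$before"
rm -f "$lines"
```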
find matches that form entire words only
grep -w 'pattern' file
find patterns that are stored in a file
grep -f pattern_file file
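A sketch of whole-word matching and of reading patterns from a file. Both files here are made up for the example.

```shell
# Hypothetical sample file and pattern file
words=$(mktemp)
patterns=$(mktemp)
printf 'cat\ncatalog\nthe cat sat\nconcat\ndog\n' > "$words"
printf 'dog\ncatalog\n' > "$patterns"

anywhere=$(grep 'cat' "$words")          # matches inside longer words too
whole_word=$(grep -w 'cat' "$words")     # only "cat" as a complete word
from_file=$(grep -f "$patterns" "$words") # one pattern per line of the pattern file

printf '%s\n' "$anywhere" "$whole_word" "$from_file"
rm -f "$words" "$patterns"
```

Results come back in the order of the searched file, not the order of the pattern file.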
Challenges
Pride and Prejudice:
- P and P: find number of lines that mention ???
Names:
- find the results for 1999
- find all entries for your favourite name
- find all entries for Calvin or Kelvin from the 1980s
sed (stream editor)
- read a line
- execute the command on it
- display the resulting line
view specific lines in a file
sed -n '5,10p' file
delete a specific line, eg the 10th line
sed -e '10d' file
or delete a range of lines
sed -e '5,10d' file
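The line-based commands can be checked on a short invented file. This sketch uses smaller line numbers than the commands above so the file stays tiny.

```shell
# Hypothetical sample file with one word per line
lines=$(mktemp)
printf 'one\ntwo\nthree\nfour\nfive\n' > "$lines"

# -n turns off automatic printing, p prints only the requested range
shown=$(sed -n '2,3p' "$lines")

# d deletes a line (or range) and prints everything else
one_deleted=$(sed -e '4d' "$lines")
range_deleted=$(sed -e '2,4d' "$lines")

printf '%s\n' "$shown" "$one_deleted" "$range_deleted"
rm -f "$lines"
```

None of these commands change the file itself; sed writes the result to standard output.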
find and replace basic syntax
sed -e 's/find_pattern/replacement/g' file
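A small sketch of what the trailing g does, using an invented sentence as input:

```shell
# Hypothetical input text
line='the cat saw another cat'

# Without g, only the first match on each line is replaced
first_only=$(printf '%s\n' "$line" | sed -e 's/cat/dog/')

# With g, every match on the line is replaced
every=$(printf '%s\n' "$line" | sed -e 's/cat/dog/g')

printf '%s\n' "$first_only" "$every"
```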
using back references
- groups are started with ( and ended with )
- enables you to reference the bits that match each pattern and substitute them back in as part of the replacement
sed -e 's/\(group1\)/\1/g' file
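Back references become more useful with two groups. The sketch below swaps two words on each line; the input names are invented for the example.

```shell
# Hypothetical input: surname followed by first name
people=$(mktemp)
printf 'Smith John\nDoe Jane\n' > "$people"

# \(...\) captures each word; \1 and \2 put them back in swapped order
swapped=$(sed -e 's/\([A-Za-z]*\) \([A-Za-z]*\)/\2 \1/' "$people")

printf '%s\n' "$swapped"
rm -f "$people"
```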
convert from upper case to lower case (gnu sed)
sed -e 's/\(.*\)/\L\1/' file
from lower to upper
sed -e 's/\(.*\)/\U\1/' input.txt > output.txt
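A quick check of the case conversions on an invented line of text. This sketch assumes GNU sed; \L and \U are GNU extensions and may not work in other versions of sed.

```shell
# Hypothetical input text
text='Hello World'

# \L lowercases everything that follows it in the replacement (GNU sed only)
lower=$(printf '%s\n' "$text" | sed -e 's/\(.*\)/\L\1/')

# \U uppercases everything that follows it in the replacement (GNU sed only)
upper=$(printf '%s\n' "$text" | sed -e 's/\(.*\)/\U\1/')

printf '%s\n' "$lower" "$upper"
```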
Challenges
Names
- change file separator from tab to comma
- remove the first line
- change all dates from 1960 - 1969 to be '1960s'
basic syntax for awk
awk '{print}' < file
can refer to specific columns using $
eg $1 for the first column, $2 for second etc
$0 refers to the original line
example to print first 2 columns
awk '{print $1, $2}' < file
we can also use conditionals:
example to print an entry from the first column if it is above 10
awk '{if($1 > 10){print $1}}' < file
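A sketch of column selection and conditionals on a small invented table. awk splits each line on whitespace by default.

```shell
# Hypothetical table: a number, a name, and a letter on each line
table=$(mktemp)
printf '5 alice x\n12 bob y\n30 carol z\n' > "$table"

# Print the first two columns; the comma inserts a space between them
two_cols=$(awk '{print $1, $2}' < "$table")

# Print the first column only when it is above 10
above_ten=$(awk '{if($1 > 10){print $1}}' < "$table")

# The same test written as an awk pattern in front of the action
above_ten_short=$(awk '$1 > 10 {print $1}' < "$table")

printf '%s\n' "$two_cols" "$above_ten" "$above_ten_short"
rm -f "$table"
```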
other automatic variables that awk uses include:
NR: the row (record) number of the current line
NF: the number of fields on the current line
example of how to print out a specific line
awk '{if(NR == 3){print}}' < file
or we can find out how many fields we have per line:
awk '{print NF}' < file
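Both variables can be seen at work on an invented file whose lines have different numbers of fields:

```shell
# Hypothetical sample file with 3, 2 and 4 fields per line
ragged=$(mktemp)
printf 'a b c\nd e\nf g h i\n' > "$ragged"

# NR counts lines as they are read, so this prints only the third line
third_line=$(awk '{if(NR == 3){print}}' < "$ragged")

# NF is recalculated for every line
field_counts=$(awk '{print NF}' < "$ragged")

printf '%s\n' "$third_line" "$field_counts"
rm -f "$ragged"
```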
Challenges:
Names:
- Make the column first
- Print all the names that occurred more than 100 times in a year