@jcausey-astate
Created August 29, 2018 12:39

Data Analysis Tools You (may) Already Have

This example uses the text of Grimms' Fairy Tales, available from Project Gutenberg and saved locally as Grimms_Fairy_Tales.txt.

grep - Search in text!

The grep tool is installed by default on most Linux and UNIX-like systems.

Documentation is available at: https://en.wikibooks.org/wiki/Grep and https://www.gnu.org/software/grep/manual/grep.html

Let's say we want to count the number of times the word death occurs in the document. A simple first command to try might be:

grep "death" Grimms_Fairy_Tales.txt
princess given to him again; and after the king’s death he was heir to
by sheer force, my death would have been certain,--I should have been
death, we should keep together like good companions, and lest a new
dog, so that the wheels crushed him to death. ‘There,’ cried the
her off too easily: she shall die a much more cruel death; I will eat
wife, and should be king after his death; but whoever tried and did not
succeed, after three days and nights, should be put to death.
danger of death. If the tailor conquered and killed these two giants, he
witch was miserably burnt to death.
and that your marriage will soon take place, but it is with death that
condemned to death for their wicked deeds.
death. At last down he went into her stomach. ‘It is rather dark,’ said
was frightened to death, and went to bed and took all the keys with
sentenced to death, and was to be rolled into the water, in a barrel
long deserved death; tonight when she is asleep I will come and cut her
sorrowfully awaiting the break of day, when he should be led to death.
her father’s death; but his two brothers married the other two sisters.
down on her head, and she was crushed to death.
that he could not escape death. He sat for a while very sorrowfully,
condemned him to death. When he was led forth to die, he begged a last
wedding was celebrated, and after the king’s death, Dummling inherited
be put to death. The prince knew nothing of what was going on, till one
lay sick unto death, and desired to see him once again before his end.
dangerously ill, and near his death. He said to him: ‘Dear son, I wished
on pain of death, and the queen herself was to take the key into her
as a savage bear until I was freed by his death. Now he has got his
a century after their deaths. But they were best (and universally) known

wc - Word Count (it also counts lines and characters)

Nice start, but this is a list of matching lines (each with its surrounding context), not a count. We could count the lines instead; the wc tool with its -l ("lines") option can do this:

grep "death" Grimms_Fairy_Tales.txt | wc -l
      27

Great! We found 27 lines that contain "death". But a line count is not always an occurrence count: if a line contained the word "death" more than once, our count would be too low. We really need to isolate the matches, without context.
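The undercount is easy to reproduce on a tiny input. The sketch below uses a hypothetical three-line sample (not taken from the book) in which one line contains "death" twice. It also uses grep's -c option, which is not used elsewhere in this walkthrough; it prints the number of matching lines directly, so it has the same limitation as piping into wc -l.

```shell
# Hypothetical three-line sample; the second line contains "death" twice.
sample='the king feared death
death came, and death went
the queen lived on'

# Count matching lines by piping grep into wc -l, as in the command above.
lines_wc=$(printf '%s\n' "$sample" | grep "death" | wc -l)

# grep -c reports the same thing directly: the number of matching lines.
lines_c=$(printf '%s\n' "$sample" | grep -c "death")

# Both report 2 (wc may pad its number with spaces),
# even though "death" occurs 3 times in the sample.
echo "grep | wc -l: $lines_wc"
echo "grep -c:      $lines_c"
```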

Bash Note: The previous command assumed that you know how to use the pipe operator (|) in Bash to send the output of one command straight into the next. This is a basic building block in the Bash (and general Linux/UNIX) ecosystem. If you aren't familiar with this technique yet, search for and work through several examples online and you will get there in no time.

To isolate the matches, grep has an option -o, meaning "only matches":

grep -o "death" Grimms_Fairy_Tales.txt
death
death
death
death
death
death
death
death
death
death
death
death
death
death
death
death
death
death
death
death
death
death
death
death
death
death
death

So we can combine this with wc -l to get the count:

grep -o "death" Grimms_Fairy_Tales.txt | wc -l
      27

Well, the number is the same. But let's try with something more likely to appear several times on one line. Here's the count of the lines containing the word "the":

grep "the" Grimms_Fairy_Tales.txt | wc -l
    5728

Now here is the count of the matches to the word "the":

grep -o "the" Grimms_Fairy_Tales.txt | wc -l
    9659

We would have missed nearly 4000 instances of "the"!
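One caveat about these numbers: a plain pattern matches text, not words, so "the" is also found inside longer words, while a capitalized "The" is skipped. The sketch below uses a hypothetical one-line sample to show the difference. It relies on the -w (whole word) option, which GNU and BSD grep both provide, and on -i (ignore case).

```shell
# Hypothetical sample line: it contains "The" and "the" as words,
# plus "the" hidden inside "others" and "there".
sample='The others went there with the rest'

# Plain pattern: matches inside "others" and "there", and the lowercase word "the".
substr=$(printf '%s\n' "$sample" | grep -o "the" | wc -l)

# -w restricts matches to whole words, leaving only the lowercase "the".
whole=$(printf '%s\n' "$sample" | grep -o -w "the" | wc -l)

# Adding -i ignores case, so "The" is counted as well.
whole_i=$(printf '%s\n' "$sample" | grep -o -w -i "the" | wc -l)

echo "substring matches:           $substr"    # 3
echo "whole-word matches:          $whole"     # 1
echo "whole-word, ignoring case:   $whole_i"   # 2
```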

More Grimm (grep with Regular Expressions)

Ok. Let's add some more words to the count. Let's say we want to count occurrences of any of these 15 "grim" words:

darkness    dying       ruin
demise      fatal       slain
destruction grave       slay
death       kill        slew
dead        murder      tomb    

grep can use regular expressions to match more complicated patterns than just exact words or phrases.

Regular expression syntax can be pretty complicated, so we will stick to a few simple things here:

Symbol  Meaning
.       Any single character.
*       Zero or more of the item to the left.
?       Zero or one of the item to the left.
|       "or" (used to separate alternative matches)
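Each symbol can be tried on a small input before using it on the whole book. The sketch below runs four patterns against a hypothetical six-word list. It uses -E (extended syntax) and, to keep the results exact, the -x option, which requires the whole line to match; -x is an addition for illustration and is not used elsewhere in this walkthrough.

```shell
# Hypothetical word list, one word per line.
words='slay
slew
slain
death
deaths
dead'

# "|" separates alternatives: slay or slew.
alt=$(printf '%s\n' "$words" | grep -E 'sl(ay|ew)')

# "?" makes the item to its left optional: death or deaths.
opt=$(printf '%s\n' "$words" | grep -x -E 'deaths?')

# "." stands for any single character: "sl" plus exactly two more characters.
dot=$(printf '%s\n' "$words" | grep -x -E 'sl..')

# "*" repeats the item to its left zero or more times: "sl" plus anything.
star=$(printf '%s\n' "$words" | grep -x -E 'sl.*')

echo "$alt"    # slay, slew
echo "$opt"    # death, deaths
echo "$dot"    # slay, slew
echo "$star"   # slay, slew, slain
```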

So, we can use the following pattern with the -E flag in grep:

(darkness|demise|destruction|death|dead|dying|fatal|grave|kill|murder|ruin|slain|slay|slew|tomb)

Note: -E means extended regular expression syntax, and it prevents us from having to use backslashes to escape the | and parentheses.
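The effect of -E can be checked on a two-line input. In this sketch (the input is made up for illustration), the same pattern is tried with and without -E; grep -c is used only to print the number of matching lines.

```shell
# Without -E, "|" is an ordinary character, so grep looks for the
# literal text "slay|slew" and finds nothing. grep exits with a
# non-zero status when nothing matches, hence the "|| true".
basic=$(printf 'slay\nslew\n' | grep -c 'slay|slew' || true)

# With -E, "|" means "or", so both lines match.
extended=$(printf 'slay\nslew\n' | grep -c -E 'slay|slew')

echo "without -E: $basic"      # 0
echo "with -E:    $extended"   # 2
```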

head - Show first 10 (or N) lines.

First, let's try to get a sense of whether the output "looks" correct. To get the first 10 lines of output from a command that creates a lot of output, use the head command in Bash:

(NOTE: I'm also splitting the long command over multiple lines by putting a backslash at the end of each line that continues.)

grep -i -E "(darkness|demise|destruction|death|dead|dying|fatal|grave|kill|murder|ruin|slain|slay|slew|tomb)"\
    Grimms_Fairy_Tales.txt\
    | head
came, and said, ‘Pray kill me, and cut off my head and my feet.’ But the
to him, as he got upon the bank, ‘Your brothers have set watch to kill
princess given to him again; and after the king’s death he was heir to
met him, and besought him with tears in his eyes to kill him, and cut
give me only a dry cow! If I kill her, what will she be good for? I hate
countryman began to look grave, and shook his head. ‘Hark ye!’ said he,
and how his master meant to kill him in the morning. ‘Make yourself
by sheer force, my death would have been certain,--I should have been
death, we should keep together like good companions, and lest a new
and nearly dead on the bank. Then the queen took pity on the little

Looks good! Now let's see the count. Add the -o option, keep the -i (ignore case) option from the previous command, and pipe the result into wc -l:

grep -o -i -E "(darkness|demise|destruction|death|dead|dying|fatal|grave|kill|murder|ruin|slain|slay|slew|tomb)"\
    Grimms_Fairy_Tales.txt\
    | wc -l
     183

So it looks like those "grim" words are mentioned 183 times in the text!

What's the Frequency, Kenneth?

What if we want to know how many times each of those grim words appears? Maybe we can find the most popular grim words with the Grimms?

To start, we need to record the occurrences of each word. The grep command above was generating them, but then we passed them to wc and ended up with just an overall count. Instead, let's direct the output into a file that we can examine further. We'll call the file "grim_15_words.txt". In Bash, the > operator redirects a command's output to a file (or other stream):

grep -o -i -E "(darkness|demise|destruction|death|dead|dying|fatal|grave|kill|murder|ruin|slain|slay|slew|tomb)"\
    Grimms_Fairy_Tales.txt\
    > grim_15_words.txt

Now we can take a look at the first 10 lines of that file using head:

head grim_15_words.txt
kill
kill
death
kill
kill
grave
kill
death
death
dead

Sounds like a gothic rock song from the 90's...
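Redirection and head are both worth trying on a throwaway file first. The sketch below works in a scratch directory with a hypothetical words.txt (neither is part of the original example). It shows that > replaces a file's contents, that >> (not used above) appends instead, and that head -n N prints the first N lines rather than the default 10.

```shell
# Work in a scratch directory so no existing files are touched.
workdir=$(mktemp -d)

# ">" creates the file, or replaces its contents if it already exists.
printf 'kill\ndeath\n' > "$workdir/words.txt"
printf 'grave\nkill\ndead\n' > "$workdir/words.txt"

# ">>" appends to the end of the file instead of replacing it.
printf 'death\n' >> "$workdir/words.txt"
line_count=$(cat "$workdir/words.txt" | wc -l)

# head -n 2 prints only the first two lines.
first_two=$(head -n 2 "$workdir/words.txt")

echo "lines in file: $line_count"   # 4: the first write was replaced
echo "$first_two"                   # grave, kill

# Clean up the scratch directory.
rm -r "$workdir"
```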

sort - Sorts lines in text.

One thing to notice here is that the words are not in any particular order.

If you were asked to count the frequencies of each, you would probably agree that it would be easier to do if they were sorted first. Why not make it easier on the computer as well? The sort command will be happy to sort them; here are the first 10 lines of the file in sorted order:

sort grim_15_words.txt | head
Murder
darkness
darkness
darkness
darkness
dead
dead
dead
dead
dead
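The capitalized "Murder" landing at the top is a consequence of the collation order in effect, which depends on the locale; on another system the same command may place it next to "murder". A small sketch on hypothetical input, forcing the C locale (plain byte order) so the result is predictable, and showing sort's -f option, which ignores case when comparing:

```shell
# In the C locale, sort compares raw byte values, so uppercase
# letters come before all lowercase letters.
c_order=$(printf 'murder\nMurder\ndead\n' | LC_ALL=C sort)

# -f folds case for the comparison, so "dead" sorts ahead of
# both spellings of "murder".
folded=$(printf 'murder\nMurder\ndead\n' | LC_ALL=C sort -f)

echo "$c_order"   # Murder, dead, murder
echo "$folded"    # dead first, then the two spellings of murder
```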

uniq - Find unique tokens (and optionally count them).

With the words in sorted order, another standard command-line tool can analyze the frequencies for us. This one is called uniq (short for "unique"). By default, it collapses each run of adjacent identical lines into a single line, which is why sorting first matters:

sort grim_15_words.txt | uniq
Murder
darkness
dead
death
dying
grave
kill
murder
ruin
slay
slew
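A short sketch of why the sort step comes first, using hypothetical three-line input: uniq compares each line only with the line immediately before it, so duplicates that are not next to each other survive.

```shell
# Unsorted input: the two "kill" lines are not adjacent, so uniq keeps both.
unsorted=$(printf 'kill\ndead\nkill\n' | uniq)

# Sorted input: the duplicates sit next to each other and are collapsed.
sorted=$(printf 'kill\ndead\nkill\n' | sort | uniq)

echo "$unsorted"   # kill, dead, kill
echo "$sorted"     # dead, kill
```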

But with the -c (count) option, it also counts the number of occurrences:

sort grim_15_words.txt | uniq -c
   1 Murder
   4 darkness
  59 dead
  27 death
   1 dying
  10 grave
  66 kill
  12 murder
   1 ruin
   1 slay
   1 slew

And of course we can sort that output, and by using the -r option in sort (for reverse), we can see the most popular words first:

sort grim_15_words.txt | uniq -c | sort -r
  66 kill
  59 dead
  27 death
  12 murder
  10 grave
   4 darkness
   1 slew
   1 slay
   1 ruin
   1 dying
   1 Murder

The Grimms really like to kill characters!
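A note on the sort -r step: it compares lines as text, and it gives the right order here because uniq -c right-aligns its counts with padding. If the counts were not aligned, a text comparison would misorder them. The sketch below uses made-up, unpadded counts to show the difference and adds sort's -n (numeric) option, which compares the leading numbers by value.

```shell
# Hypothetical counts in "count word" layout, without aligned padding.
counts='9 grave
66 kill
100 heart'

# Reverse text sort: "9" comes first because it compares character by character.
text_order=$(printf '%s\n' "$counts" | LC_ALL=C sort -r)

# Reverse numeric sort: the largest count comes first.
numeric_order=$(printf '%s\n' "$counts" | sort -rn)

echo "$text_order"      # 9 grave, 66 kill, 100 heart
echo "$numeric_order"   # 100 heart, 66 kill, 9 grave
```

Using sort -rn in place of sort -r in the pipelines here should give the same ranking while being safer for large counts.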

tr - Translate (or find-and-replace).

Maybe you don't like the fact that one "Murder" didn't get lumped into the "murder" bucket. ("Murder Bucket" must be the name of that goth rock band...)

Let's re-create the output file so that all the words are lowercase, then re-run the analysis.

There is more than one way to do this, but the tr (translate) command is one obvious choice. The syntax is a bit odd, but a quick Google search will turn up examples showing that you can use the character classes '[:upper:]' and '[:lower:]', so we can translate uppercase characters to lowercase with:

tr '[:upper:]' '[:lower:]'
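On its own, tr reads from standard input, so it is normally fed through a pipe. A one-line sketch with made-up input:

```shell
# tr replaces each uppercase letter with its lowercase counterpart;
# all other characters pass through unchanged.
lowered=$(echo "Murder Bucket" | tr '[:upper:]' '[:lower:]')

echo "$lowered"   # murder bucket
```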

This time, all the commands are shown in one code block:

# Search for the words using case-insensitive search, then translate any uppercase letters to lowercase:
grep -o -i -E "(darkness|demise|destruction|death|dead|dying|fatal|grave|kill|murder|ruin|slain|slay|slew|tomb)"\
    Grimms_Fairy_Tales.txt\
    | tr '[:upper:]' '[:lower:]'\
    > grim_15_words_lowercase.txt

# Now count the frequencies and display in decreasing order:
sort grim_15_words_lowercase.txt | uniq -c | sort -r
  66 kill
  59 dead
  27 death
  13 murder
  10 grave
   4 darkness
   1 slew
   1 slay
   1 ruin
   1 dying

Much better! Now let's see about some more cheerful words, from the list:

birth       growth     soul
born        heart      spirit
energy      life       vigor
essence     lifeblood  vitality
excitement  love       zest

Here's the search:

# Like before, search for words ignoring case:
grep -o -i -E "(birth|born|energy|essence|excitement|growth|heart|life|lifeblood|love|soul|spirit|vigor|vitality|zest)"\
    Grimms_Fairy_Tales.txt\
    | tr '[:upper:]' '[:lower:]'\
    > cheery_15_words_lowercase.txt

# And count the frequencies, display in decreasing order:
sort cheery_15_words_lowercase.txt | uniq -c | sort -r
 106 heart
  62 life
  60 love
  10 spirit
  10 born
   6 birth
   4 soul

Well, maybe the Grimms aren't so grim after all!
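Since both word lists went through the same steps, the whole pipeline can be wrapped in a small shell function and run without an intermediate file. This is only a sketch: the name count_words and the sample text are hypothetical, and the final sort uses -rn (reverse numeric) rather than the -r used above.

```shell
# Hypothetical helper. Usage: count_words PATTERN FILE
# Prints each matched word (lowercased) with its count, most frequent first.
count_words() {
    grep -o -i -E "$1" "$2" \
        | tr '[:upper:]' '[:lower:]' \
        | sort \
        | uniq -c \
        | sort -rn
}

# Try it on a small scratch file with made-up text.
workdir=$(mktemp -d)
printf 'Kill the dead.\nThe dead kill; death waits to kill.\n' > "$workdir/sample.txt"

result=$(count_words "(death|dead|kill)" "$workdir/sample.txt")
echo "$result"   # kill 3 times, dead 2 times, death once

# Clean up the scratch directory.
rm -r "$workdir"
```

With the book in place, the grim analysis would then be a single call such as count_words "(darkness|demise|...)" Grimms_Fairy_Tales.txt, with the full word list substituted for the abbreviated pattern.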
