This is a manual to help linguists use the Shell programming language to do corpus analysis.

The Power of Shell

-- Using shell programming for corpus analysis

S. Li

University of Birmingham

January, 2012

1. My story

I still remember the first supervision meeting with my supervisor.

He asked me which programming language I used.

"Sorry, but none," I answered.

Then he said, do not worry, what you need to know is Shell.

Later, I worried about my coding skills, and asked him whether I needed to learn any programming. He answered that Shell is very helpful for corpus analysis, and that if you learn sed or awk as well, that is even better.

At the beginning of last academic year, we received three lectures on using the shell for corpus analysis in the Research Methods of CL module, which covered the most fundamental (but very helpful) parts of shell.

I kept doubting it until I began to use shell by myself!

Now, I do not doubt it any more because of the magical power of shell.

2. Why do we use shell?

Shell is a very simple but robust language, and it has many 'dialects', for example, bash (the Bourne Again Shell), zsh (Z Shell) and csh (C Shell). The most common one is bash. You can find bash on a Mac or on most Linux distributions; if you are a Windows user, you can use Cygwin or another alternative to get a shell.

Efficiency

What we need is not only speed, but also accuracy.

  • Speed

    • It is much faster than GUI tools, such as the UAM Corpus Tool or AntConc.

  • Accuracy

    • If you use GUI tools, the accuracy depends on the designer's understanding.

      • Compare, for example, the wc command with the word-count function of other GUI tools
  • Robustness

    • Shell can handle various materials, ranging from a few individual lines to ten-million-word files (maybe even larger data, but I have only tried ten-million-word data so far).

      • E.g.: I used shell to deal with my one-million-tweet corpus (about fourteen million words), and its performance was very encouraging.
  • Simplicity

    • A shell program is usually only one line or a few lines, much simpler than other languages and much more efficient than GUI analysis.

3. Some conventions of Shell

As a programming language, Shell also has some conventions. If you know them before you dive into the Shell world, they will help you understand the code better.

  • #: This symbol means COMMENT; anything after it on the line will be ignored by Shell. However, comments are very useful for interpreting the code, so please add comments to explain your code wherever necessary.

  • $: As a special parameter, it expands to the process ID of the shell. (See 3.4.2 Special Parameters of the Bash Reference Manual.)

  • |: This is called a pipeline. In Shell, one line can only contain one command (it is like English, where one sentence contains only one main verb); however, you can use a pipeline to combine different commands in one line (a pipeline is similar to a conjunction in an English clause). See the short example below.
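A minimal sketch that puts these three conventions together (corpus.txt is just a stand-in file name):

$ grep -i "the" corpus.txt | wc -l	# the pipeline sends grep's matching lines to wc, which counts them

$ echo $$	# the special parameter $ (written $$) expands to the process ID of the current shell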

4. Converting

Question 4.1: How can we convert a PDF to other readable format?

We often need to deal with PDF files, which is a big problem!

Please download a thesis from [eTheses @ Bham](http://etheses.bham.ac.uk/464/), and put it in your Desktop directory.

Solution:

1. $ pdftotext FILENAME 

	# The default output is a txt file.

2. $ pdftotext -htmlmeta FILENAME 

	# This will export a simple html file (without index or png files)

3. $ pdftohtml FILENAME 

	# The default output includes an indexed html file, a normal html file and some png files.

Answer 4.1

	$ pdftotext Desktop/Cheung09PhD.pdf
  • pdftotext is a default tool on most systems;

  • however, pdftohtml must be installed separately.

On Ubuntu, you can use

	$ sudo apt-get install poppler-utils	# the poppler-utils package provides pdftotext and pdftohtml

On Mac OS X, you can use MacPorts (it must be installed first, as must the two alternatives below):

	$ sudo port install poppler

Or, Homebrew:

	$ brew install poppler

Or, Gentoo Prefix (highly recommended, but not very easy to install):

	$ emerge poppler
  • The principle of these tools is that they convert the file format directly, rather than doing any OCR.

  • They are super fast!!!
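If you have a whole directory of PDFs to convert, a simple for loop does the job. A minimal sketch, assuming the PDFs sit in the current directory:

$ for f in *.pdf; do pdftotext "$f"; done	# convert every PDF here into a .txt file of the same name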

Question 4.2: How can I change the encoding of a file?

Sometimes, if you input a non-English encoded file, such as a big5 file (a traditional Chinese encoding), Terminal might not recognise it correctly. Hence, you need to convert its encoding to a universal encoding, e.g. UTF-8.

Solution:

iconv -f ENCODING -t ENCODING INPUTFILE

If you would like to know how many encodings iconv can convert, you can use:

iconv -l
  • From the result, you can see that there are very many encodings on your computer, and iconv can convert between any of them.
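For instance, to convert a Big5 file to UTF-8 (a sketch; input.txt and output.txt are stand-in file names):

$ iconv -f BIG5 -t UTF-8 input.txt > output.txt	# iconv writes to standard output, so redirect it into a new file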

5. Read a large file

Question 5.1: How can we deal with a huge plain text file or a tabular file?

Often, we have some 10M+ txt files as corpus data, some 10M+ csv/tsv files, or even larger ones. It will be extremely slow to open or read them in a normal text editor (if you are a vim or emacs user, you will probably be fine; forget about MS Office, orz!).

Using some simple shell commands will be fairly helpful.

Solutions: with these commands you can read a file from the beginning.

$ head FILENAME

$ head -NUMofLINE FILENAME 

# Use any number to replace NUMofLINE

Question 5.2: How can we read a large file from the end?

Solutions:

$ tail FILENAME

$ tail -NUMofLINE FILENAME

Question 5.3: How can we read a large file?

Solutions:

$ less FILENAME

$ more FILENAME
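These reading commands also combine well with the pipeline from section 3. For example, to look at lines 101 to 110 of a large file (a sketch, with FILENAME as a placeholder):

$ head -110 FILENAME | tail -10	# take the first 110 lines, then keep only the last 10 of them, i.e. lines 101-110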

6. Word count and other simple statistics

Question 6.1: How can we count the words, lines, or characters of one large file?

MS Word has a word count function, and so does some other software. We can also perform this with a simple command in Shell.

Solution:

$ wc FILENAME

$ wc -option FILENAME

wc simply means Word Count. By default it prints the line, word and byte counts, and it has several options:

1. -l for lines

2. -w for words

3. -m for characters

4. -c for bytes

A line means a new line (hitting RETURN); in regex, it is \n (octal \012).
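wc also works nicely at the end of a pipeline, for example to count the words of a PDF without keeping an intermediate text file (a sketch; thesis.pdf is a stand-in file name):

$ pdftotext thesis.pdf - | wc -w	# with '-' pdftotext writes to standard output, which wc then counts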

6.2. A nicer solution

In addition, if you are familiar with awk, then using a simple awk script can be much nicer. (See The AWK Programming Language P.14)

$ awk '{ nc = nc + length($0) + 1	# characters: add the length of the line, plus 1 for the newline

	nw = nw + NF			# words: add the number of fields on the line

	}

END { print NR, "lines,", nw, "words,", nc, "characters" }' FILENAME 

# NR is the number of input lines (records) awk has read.
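If you reuse this often, you can put the awk program (the part between the single quotes) into a file, say countwords.awk (a hypothetical name), and run it with the -f option:

$ awk -f countwords.awk FILENAME	# -f tells awk to read the program from a file instead of the command line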

6.3. A more advanced solution

We can also use a unigram list to calculate the file size. The idea is to tokenise the original file into a word list, so that each line contains only one word (with punctuation marks removed). Then you can simply use wc to count the lines, which is equal to the word count of the original file. I will not go into details about this, but just provide the script here.

tr ' ' '\012' < FILENAME | 

# convert every space to a NEW LINE (octal \012)

wc -l 

# count the lines, i.e. the words

The complete script is:

tr ' ' '\012' < FILENAME | wc -l
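Note that splitting only on spaces leaves punctuation attached to the words. A slightly more careful variant, in the spirit of Church's Unix™ for Poets (item 4 in the extended reading), splits on every non-letter character instead; this is a sketch, not part of the original script:

$ tr -sc 'A-Za-z' '\012' < FILENAME | wc -l	# squeeze every run of non-letters into a single newline, then count the words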

7. Looking at specific patterns

Question 7.1: If we want to do some manual analysis by looking at some specific patterns, what can we do?

In corpus analysis, manual analysis is a must. Sometimes, it is difficult to do this by a GUI tool. Either the data file is too large to process, or the pattern we want to look at is very flexible.

Solution:

$ grep -OPTION PATTERN INPUTFILE

$ grep -OPTION REGEX INPUTFILE

# You can use regular expression to improve the accuracy and robustness.

Generally, grep outputs the whole line containing the pattern you searched for; the -o option (only matching), however, outputs only the exact pattern matched. This makes frequency counting with grep much more reliable.

Question 7.2 How can we look at a pattern regardless of the case?

Solution:

$ grep -i PATTERN INPUTFILE 

The option -i means ignore the case, so grep treats lowercase and uppercase as the same pattern.

Question 7.3 Sometimes, grep is not powerful enough; what can we do?

One example is looking at word variants, where plain grep will not help. E.g.: if we want to find "I am", "I'm" and "Im" at once, what can we do?

Solution:

$ grep -e PATTERN INPUTFILE

or

$ egrep -OPTION PATTERN INPUTFILE

It is strongly recommended to use extended regular expressions! Personally, I prefer egrep to grep.

$ egrep -i "\bi( am|m|'m)\b" INPUTFILE 

#\b means word boundary. Regular expressions are very greedy or ambitious: they will match any possible pattern in the file. If the file contains a word like "William" or "Miami", those would be included in the result too. Thus, we must use word boundaries to limit the ambitiousness of grep.

Notice: the grep or egrep output is based on line occurrence. If you simply combine grep or egrep with wc to count pattern occurrences, the result will be wrong whenever one line contains more than one matched pattern (the line only counts once).

Question 7.4 What can we do when one line contains more than one matched pattern?

Solution:

$ grep -o PATTERN FILENAME 

This outputs only the matched pattern; in other words, if one line contains more than one match, every match is output on its own line. Then, combined with wc, the result is accurate.
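Putting the two commands together in one pipeline (PATTERN and FILENAME are placeholders, as above):

$ grep -o PATTERN FILENAME | wc -l	# one output line per match, so the line count equals the number of matches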

Very important notice: for Mac users, the pre-installed grep is quite old (version 2.5), so there is a serious conflict between the -i and -o options. If you combine them, you will get a wrong result. Please update your grep through one of the package managers above immediately (the latest version is 2.9)!

Question 7.5 Is there a convenient way to deal with counting in grep?

Sometimes, using a pipeline is tedious, because you may forget to add it.

Solution:

$ grep -c PATTERN FILENAME 

With the -c option, you can count the matching lines easily. (Personally, I still prefer piping into wc, because I often use grep | less to check the pattern and then grep | wc to count; a pipeline is very easy to change, since the command being swapped sits at the end of the line.)
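In practice, that check-then-count workflow might look like this (a sketch reusing the pattern from above):

$ egrep -i "\bi( am|m|'m)\b" INPUTFILE | less	# first, browse the matching lines to check the pattern

$ egrep -io "\bi( am|m|'m)\b" INPUTFILE | wc -l	# then, count the individual matches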

Notice: You can always combine different options together, but make sure they do not conflict. Please read the man page carefully.

E.g.:

$ egrep -ic "\bi( am|m|'m)\b" INPUTFILE 

#This counts the lines containing "I am", "I'm" or "Im" in the file; remember that -c counts matching lines, so for the total number of matches use -o with wc -l as above.

$ grep -v PATTERN FILENAME

#Use this to look at the lines that do not match; -v means invert the match.

You can combine the different options above.

Always be aware of the ambitiousness or greed of regular expression!!!

NB: Please update grep to the newest version; old versions have a serious bug when the -i and -o options are combined.

8. Regular expression

This is used for fuzzy matching.

If you are familiar with the CQP Syntax or the Simple Query Syntax used on BNCweb, they are quite similar to regular expressions.

I will not go into details about this here, because that would take ages to discuss. You may refer to some cheatsheets.

I will introduce this in another document later.

9. Some tricks

There are many useful keyboard shortcuts in Terminal.

tab: auto-complete file names and commands

ctrl+c: abort a command

ctrl+a: go to the beginning of the current line

ctrl+e: go to the end of the current line

ctrl+u: erase the whole line	

ctrl+l: clean the screen

q: quit the current pager (e.g. less or man)

$ man COMMAND: look at the manual page of a command

10. Final points

Any programming language is just like a foreign language (precisely, programming languages are artificial languages): if you can master any foreign language, then you can master any programming language.

All you need to do is keep practising.

Keep it simple, stupid! (the KISS philosophy)

11. Extended reading

  1. Use $ man COMMAND to refer to the manual in the shell.

  2. egrep for linguists by Nikolaj Lindberg, STTS Södermalms talteknologiservice. (Highly recommended!)

  3. grep for linguists by Stuart Robinson

  4. Unix™ for Poets by Kenneth Ward Church, AT&T Bell Laboratories. (The ultimate manual, which I am still learning from.)

  5. Ngrams by Kenneth Ward Church, AT&T Bell Laboratories.

  6. The Awk Programming Language by Alfred V. Aho, Brian W. Kernighan, and Peter J. Weinberger. (An old one, but rather handy and comprehensive!)

  7. Why you should learn just a little Awk - A Tutorial by Example by Greg Grothaus, Google.

  8. Unix Shell Text Processing Tutorial (grep, cat, awk, sort, uniq) by Xah Lee

  9. Sculpting text with regex, grep, sed, awk, emacs and vim by Matt Might, University of Utah

  10. USEFUL ONE-LINE SCRIPTS FOR SED by Eric Pement

  11. Sed - An Introduction and Tutorial by Bruce Barnett

  12. Awk also by Bruce Barnett

  13. sed . . . the stream editor also by Eric Pement

  14. 16.4. Text Processing Commands in Advanced Bash-Scripting Guide by Mendel Cooper

  15. Bash Reference Manual by GNU (the ultimate manual)
