Prelude: Why even bother?
Personally…
- Text is everywhere, better learn to process it
- The masters already solved it better
- Stable tools, well-known warts
- Personal/team productivity, tool-building autonomy
- To learn about
- good and successful design
- bad and successful design
- computer history
- For the fun of it
Intro: Some thought experiments
Which ones are functional, which aren’t?
- Bash
- Ruby
- Python
- Swift
- Kotlin
- Clojure
- Haskell
- FSharp
- Java
- Erlang
- APL
- Dart
- Forth / Factor
- Hack
- Smalltalk
- Prolog
- C
- C++
What makes Functional Programming “Functional”?
Is it Mathematical?
- Functions?
- Higher-ordered functions?
- Typed?
Is it “Declarative”?
- FP is a kind of declarative programming. What you say is what you get.
- map/filter/reduce? (automatic in Unix-land)
- recursion?
Is it Data-oriented?
- Data in data out?
- “Referentially transparent”?
- Message passing?
Is it Stateless?
- No shared state?
- No mutable state?
- No side effects?
Is it memory-managed?
- Garbage collection?
- Immutability?
Is it a System?
- A human discipline? (ref. accounting)
- Automated rules and guidelines?
- Managed Environment?
What makes Unix “Unix”? (Pike)
Something like the combination of:
- high-level programming language
- hierarchical file system
- uniform, unformatted files (text)
- separable shell
- distinct tools
- pipes
- regular expressions
- portability
- security
Intro: Motivating example
Doug McIlroy’s program, as seen in Classic Shell Scripting and More Shell, Less Egg
# Problem statement (word frequency):
# - Read a file of text
# - Determine the n most frequently-used words
# - Print out a sorted list of all the words, along with their frequencies
# Douglas McIlroy's solution from 1986
# 1. Transliterate complement (-c) of words into newlines, squeezing out (-s) duplicates
tr -cs A-Za-z '\n' |
# 2. Transliterate uppercase to lowercase
tr A-Z a-z |
# 3. Sort to bring identical words together
sort |
# 4. Replace each run of duplicate words with a single representative, and include a count
uniq -c |
# 5. Sort reverse (-r), numeric (-n)
sort -rn |
# 6. Pass through stream editor; quit after printing the given number of lines ${1} (or default to 10 lines)
sed ${1:-10}q
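A hedged usage sketch (the script name and input file here are illustrative, not from the original):
# Save the pipeline above as, say, wordfreq.sh, then:
#   ./wordfreq.sh 20 < moby-dick.txt   # top 20 words with their frequencies
#   ./wordfreq.sh < moby-dick.txt      # ${1} unset, so sed defaults to 10 lines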
Intro: Demotivating example
VARIANT 1: with basic grep, cut, tr, and simple regexps/patterns.
Assumes a regular positional structure in stty's output.
# 1. Generate all available info about stty
stty -a |
# 2. Filter the line containing row/column information
grep -E "rows|columns" |
# 3. Extract 2nd and 3rd column
cut -d ';' -f 2,3 |
# 4. Remove annoying semicolon
tr -d ';' |
# 5. Optionally, trim annoying leading space
sed -E 's/^([[:space:]]+)//g'
VARIANT 1.3333: Refactor of variant 1
After a light reading of the manpages of the tools used: cut's output control lets us eliminate the 'tr' stage.
stty -a |
grep -E "rows|columns" |
cut -d ';' --output-delimiter='' -f 2,3 |
# 'tr' eliminated by output control in cut
# tr -d ';' |
sed -E 's/^([[:space:]]+)//g'
VARIANT 1.6666: Refactor of 1.3333…
After a deeper reading of grep's manpage (and of regexes): better use of grep's flags and of the regex lets us drop 'cut'.
stty -a |
grep -o -E "(rows|columns)[[:space:]]+[[:digit:]]+" |
# cut -d ';' --output-delimiter='' -f 2,3 | # eliminated
# Bring back tr, to collapse newline-separated output into one line
tr '\n' ' ' |
# trim annoying trailing spaces
sed -E 's/([[:space:]]+)$//g'
VARIANT 2: after messing around with sed and realising it might be enough on its own!
stty -a |
sed -E -n 's/.*(rows[[:space:]]+[[:digit:]]+).*(columns[[:space:]]+[[:digit:]]+).*/\1 \2/p'
VARIANT 3: after discovering awk, and half a day of learning it
stty -a |
awk 'BEGIN { FS="; " } /rows|columns/ { print $2,$3 }'
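For completeness, a hedged sketch of consuming variant 3 from a script (assumes bash; the variable names are illustrative):
read -r _ rows _ columns < <(stty -a | awk 'BEGIN { FS="; " } /rows|columns/ { print $2,$3 }')
echo "terminal: ${rows} rows x ${columns} columns"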
Funda: Software Tools Principles
(Classic Shell Scripting)
- Most importantly, Do one thing well (see the sketch after this list)
- Process Lines of text, not binary
- Use regular expressions
- Default to standard I/O
- Don’t be chatty
- Generate the same output format accepted as input
- Let someone else do the hard part
- Detour to build specialised tools
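A minimal sketch of a few of these principles at once (the function name is hypothetical): do one thing, process lines of text, default to standard I/O, and don't be chatty.
function count_nonblank_lines {
    # one job: count the non-blank lines arriving on stdin; print only the number
    grep -c -v '^[[:space:]]*$'
}
# usage: man bash | count_nonblank_lines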
Funda: Re-motivating perspective
(Unix Koans)
Koan: Master Foo and the Shell Tools
https://catb.org/esr/writings/unix-koans/shell-tools.html
A Unix novice came to Master Foo and said: “I am confused. Is it not the Unix way that every program should concentrate on one thing and do it well?”
Master Foo nodded.
The novice continued: “Isn’t it also the Unix way that the wheel should not be reinvented?”
Master Foo nodded again.
“Why, then, are there several tools with similar capabilities in text processing: sed, awk, and Perl? With which one can I best practice the Unix way?”
Master Foo asked the novice: “If you have a text file, what tool would you use to produce a copy with a few words in it replaced by strings of your own choosing?”
The novice frowned and said: “Perl’s regexps would be excessive for so simple a task. I do not know awk, and I have been writing sed scripts in the last few weeks. As I have some experience with sed, at the moment I would prefer it. But if the job only needed to be done once rather than repeatedly, a text editor would suffice.”
Master Foo nodded and replied: “When you are hungry, eat; when you are thirsty, drink; when you are tired, sleep.”
Upon hearing this, the novice was enlightened.
Demo: Master Foo and the Shell Tools
#
# And thus spake master foo
#
#
# "When you are hungry, eat."
#
function extract_row_column_fields_v1 {
grep -E "rows|columns" |
cut -d ';' -f 2,3 |
tr -d ';' |
sed -E 's/^([[:space:]]+)//g'
}
#
# "When you are thirsty, drink."
#
function extract_row_column_fields_v2 {
sed -E -n 's/.*(rows[[:space:]]+[[:digit:]]+).*(columns[[:space:]]+[[:digit:]]+).*/\1 \2/p'
}
#
# "When you are tired, sleep."
#
function extract_row_column_fields_v3 {
awk 'BEGIN { FS="; " } /rows|columns/ { print $2,$3 }'
}
# stty -a | extract_row_column_fields_v2
# The variants seen so far are referentially transparent
# And we can use whichever approach we understand best _as of today_.
Funda: Motivating example++
Douglas McIlroy’s program scales four orders of magnitude (at least).
So will ours, even though we are not Doug.
I didn’t try 5 orders of magnitude…
If 10,000x isn’t satisfactory, then nothing will be. :)
function Doug_McIllroys_program {
# Problem statement (word frequency):
# - Read a file of text
# - Determine the n most frequently-used words
# - Print out a sorted list of all the words, along with their frequencies
# Douglas McIlroy's solution from 1986
# 1. Transliterate complement (-c) of words into newlines, squeezing out (-s) duplicates
tr -cs A-Za-z '\n' |
# 2. Transliterate uppercase to lowercase
tr A-Z a-z |
# 3. Sort to bring identical words together
sort |
# 4. Replace each run of duplicate words with a single representative, and include a count
uniq -c |
# 5. Sort reverse (-r), numeric (-n)
sort -rn |
# 6. Pass through stream editor; quit after printing the given number of lines ${1} (or default to 10 lines)
sed ${1:-10}q
}
function setup_test_data {
local base_dir=${1:-"/home/adi/src/bitbucket/adityaathalye/drstrangepipes/data"}
local onex_file="${base_dir}/bashman_1x.txt"
local oneKx_file="${base_dir}/bashman_1000x.txt"
local tenKx_file="${base_dir}/bashman_10000x.txt"
# re-generate 1x bashman
cat /dev/null > ${onex_file} &&
man bash >> ${onex_file}
# re-generate 1,000x bashman
cat /dev/null > ${oneKx_file} &&
for _ in $(seq 1000)
do
cat ${onex_file} >> ${oneKx_file}
done
# re-generate 10,000x bashman
cat /dev/null > ${tenKx_file} &&
for _ in $(seq 10)
do cat ${oneKx_file} >> ${tenKx_file}
done
}
function run_Dougs_program_with_timing {
local fname=${1}
ls -sh ${fname}
wc -l ${fname}
time Doug_McIllroys_program < ${fname} > /dev/null &
}
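A hedged sketch of exercising the timing harness (the base directory is illustrative; adjust it to wherever setup_test_data wrote the files):
# setup_test_data "${HOME}/data"
# run_Dougs_program_with_timing "${HOME}/data/bashman_1x.txt"
# run_Dougs_program_with_timing "${HOME}/data/bashman_1000x.txt"
# run_Dougs_program_with_timing "${HOME}/data/bashman_10000x.txt"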
Funda: Motivating example+++
We can turn Doug’s program into a Swiss Army toolkit
function flatten_to_word_list {
tr -cs A-Za-z '\n'
}
function tokenise_lowercase {
# Transliterate uppercase to lowercase
tr A-Z a-z
}
function frequencies {
# Produce frequency distribution of input
sort | uniq -c | sort -rn
}
function take_n {
sed ${1:-10}q
}
# Mix-and-match more stuff
function sort_dictionary {
sort -b -d -k2
}
function sort_rhyme {
rev | sort -b -d | rev
}
# eliminate stop-words
function drop_stopwords {
local stopwords=${1:-"the,is,to,a,of,if,and,in,or,be,by,not,with,for,when,it"}
shift
local grep_pattern=$(tr , '\|' <<<"${stopwords}")
grep -v -E ${grep_pattern}
}
# n-grams
function bigram {
local tempfile="/tmp/bigram$$"
cat - > ${tempfile}
paste ${tempfile} <(tail -n +2 ${tempfile})
rm ${tempfile}
}
function trigram {
local tempfile="/tmp/trigram$$"
cat - > ${tempfile}
paste ${tempfile} <(tail -n +2 ${tempfile}) <(tail -n +3 ${tempfile})
rm ${tempfile}
}
# git s'more reuse
function git_committers {
local git_dir=${1:?$(echo "ERROR Please provide path to git-managed directory.")}
cat <( cd ${git_dir};
git log master --pretty=format:%an ; ) |
frequencies
}
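A hedged mix-and-match sketch composing the toolkit above (the particular pipelines and stopword list are illustrative):
# Top 20 words in the bash manpage, minus a few stopwords
man bash | flatten_to_word_list | tokenise_lowercase | drop_stopwords "the,is,to,of,and,a" | frequencies | take_n 20
# Most frequent bigrams
man bash | flatten_to_word_list | tokenise_lowercase | bigram | frequencies | take_n 10
# Dictionary-sorted word frequencies
man bash | flatten_to_word_list | tokenise_lowercase | frequencies | sort_dictionary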
Experience: How do I Shell?
Avoid bad situations
- Fail fast and early
- Avoid manual state management
- Avoid portability, unless a design requirement
Avoid adding code
- Rely on system-provided guarantees & defaults
- Design / redesign my data better instead
- Use tool flags & options judiciously
Functional coding discipline
- Don’t control-flow, pipeline all the things!
- Compose functions
- Compose Unix tools
- Compose pipelines
- Compose programs
- Emit structured, line-oriented data
- amenable to structured text-processing
- design fn-as-cmdline-util (standard I/O/E!)
- emulate HoFs (but carefully - Interpreter rules!); see the sketch after this list
- “Functional core, Imperative shell” architecture
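A hedged sketch of the “emulate HoFs” point: a map-like helper that applies a named function (or command) to each input line, keeping the output line-oriented. The names are hypothetical.
function map_lines {
    # emulate a higher-order "map": apply ${fn} to every line arriving on stdin
    local fn=${1:?"map_lines: need a function or command name"}
    local line
    while IFS= read -r line; do
        "${fn}" "${line}"
    done
}
# usage sketch: printf '%s\n' rows columns | map_lines echo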
Use functions judiciously to
- Enforce invariants
- Control order of effects (obviously)
- Safely set/reset globals (sketch after this list)
- Improve maintainability & understandability
- Improve testability
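A hedged sketch of two of these uses (names are hypothetical): enforce an invariant up front, and confine the state change (the cd) to a subshell so the caller’s working directory is untouched.
function repo_committer_counts {
    # invariant: we must be handed a directory that looks like a git repo
    local repo_dir=${1:?"repo_committer_counts: need a path to a git repository"}
    [[ -e "${repo_dir}/.git" ]] || { echo "not a git repo: ${repo_dir}" >&2; return 1; }
    # the cd happens in a subshell, so no global state leaks out
    ( cd "${repo_dir}" && git log --pretty=format:'%an' ) | sort | uniq -c | sort -rn
}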
Namespace by naming convention
- To avoid clobbering imports/source
- for readline and grep-friendliness
- “namespaced” fns: utilname_funcname
- “private” fns: __utilname_funcname (sketch below)
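A hedged sketch of the convention, with “wordfreq” as an imagined utility name:
function wordfreq_top_n {                # "namespaced" fn: utilname_funcname
    __wordfreq_require_number "${1:-10}" &&
        sort | uniq -c | sort -rn | sed "${1:-10}q"
}
function __wordfreq_require_number {     # "private" fn: __utilname_funcname
    [[ "${1}" =~ ^[0-9]+$ ]] || { echo "wordfreq: expected a number, got '${1}'" >&2; return 1; }
}
# usage sketch: man bash | tr -cs A-Za-z '\n' | wordfreq_top_n 5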
Don’t use shell if it’s a bad idea!
References, Tips, Tricks, Tutorials
- Code: My Terminal-based Noughts and Crosses game in bash
- Code: The shell utilities in NASA’s Common Metadata Repository
- Code: Awesome bash: A curated list of delightful Bash scripts and resources
- Code: Script downloads from Mt. Xia website, Dana French
- Code: Shell and Unix/Linux training from Mt. Xia, Dana French
- Books: Classic Shell Scripting, Arnold Robbins, Nelson Beebe, O’Reilly Media
- Books: Data Science at the Command Line, Jeroen Janssens, O’Reilly Media
- Books: Bioinformatics Data Skills, Vince Buffalo, O’Reilly Media
- Books: The Art of Unix Programming, Eric S. Raymond
- Books: The UNIX-HATERS Handbook, MIT
- Tutorial: Advanced tips for the bash command line, Jeff Terrell
- Video: Concise GNU Bash: An Introduction to Advanced Usage, James Panacciulli @ LinuxFest Northwest 2017
- Video: Advanced Bash Scripting, Dana French
- Video: Bash scripting for Bioinformatics, Phil Williams
- Video: Where GREP Came From (Brian Kernighan) - Computerphile
- Video: Unix Pipeline (Brian Kernighan) - Computerphile
- Video: UNIX Special: Profs Kernighan & Brailsford - Computerphile