Dr. StrangePipes or: How I learned to stop worrying && "function" in Shell

Prelude: Why even bother?

Personally…

  • Text is everywhere, better learn to process it
  • The masters already solved it better
  • Stable tools, well-known warts
  • Personal/team productivity, tool-building autonomy
  • To learn about
    • good and successful design
    • bad and successful design
    • computer history
  • For the fun of it

Intro: Some thought experiments

Which ones are functional, which aren’t?

  • Bash
  • Ruby
  • Python
  • Swift
  • Kotlin
  • Clojure
  • Haskell
  • FSharp
  • Java
  • Erlang
  • APL
  • Dart
  • Forth / Factor
  • Hack
  • Smalltalk
  • Prolog
  • C
  • C++

What makes Functional Programming “Functional”?

Is it Mathematical?

  • Functions?
  • Higher-ordered functions?
  • Typed?

Is it “Declarative”?

  • FP is a kind of declarative programming. What you say is what you get.
  • map/filter/reduce? (automatic in Unix-land)
  • recursion?

Is it Data-oriented?

  • Data in data out?
  • “Referentially transparent”?
  • Message passing?

Is it Stateless?

  • No shared state?
  • No mutable state?
  • No side effects?

Is it memory-managed?

  • Garbage collection?
  • Immutability?

Is it a System?

  • A human discipline? (ref. accounting)
  • Automated rules and guidelines?
  • Managed Environment?

What makes Unix “Unix”? (Pike)

Something like the combination of:

  • high-level programming language
  • hierarchical file system
  • uniform, unformatted files (text)
  • separable shell
  • distinct tools
  • pipes
  • regular expressions
  • portability
  • security

Intro: Motivating example

Doug McIlroy’s program, as seen in Classic Shell Scripting and More Shell, Less Egg

# Problem statement (word frequency):
# - Read a file of text
# - Determine the n most frequently-used words
# - Print out a sorted list of all the words, along with their frequencies

# Douglas McIlroy's solution from 1986
# 1. Transliterate complement (-c) of words into newlines, squeezing out (-s) duplicates
tr -cs A-Za-z '\n' |
    # 2. Transliterate uppercase to lowercase
    tr A-Z a-z |
    # 3. Sort to bring identical words together
    sort |
    # 4. Replace each run of duplicate words with a single representative, and include a count
    uniq -c |
    # 5. Sort reverse (-r), numeric (-n)
    sort -rn |
    # 6. Pass through stream editor; quit after printing the given number of lines ${1} (or default to 10 lines)
    sed ${1:-10}q
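
If this pipeline is saved as an executable script (say, wordfreq.sh, a name of our choosing), then ${1} is that script's first argument:

# man bash | ./wordfreq.sh      # top 10 words in the bash manual
# man bash | ./wordfreq.sh 25   # top 25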

Intro: Demotivating example

VARIANT 1: with basic grep, cut, tr and simple regexps/patterns

Assumes regular positional structure of stty output

# 1. Generate all available info about stty
stty -a |
    # 2. Filter the line containing row/column information
    grep -E "rows|columns" |
    # 3. Extract 2nd and 3rd column
    cut -d ';' -f 2,3 |
    # 4. Remove annoying semicolon
    tr -d ';' |
    # 5. Optionally, trim annoying leading space
    sed -E 's/^([[:space:]]+)//g'
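
For reference, on a typical GNU/Linux terminal stty -a begins with a line like the one below, so this variant (and the later ones) should print something like "rows 40 columns 80"; the exact numbers depend on your terminal:

speed 38400 baud; rows 40; columns 80; line = 0;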

VARIANT 1.3333: Refactor variant 1

After light reading of the manpages of the tools used: using the output control of ‘cut’, we can eliminate the ‘tr’ stage.

stty -a |
    grep -E "rows|columns" |
    cut -d ';' --output-delimiter='' -f 2,3 |
    # 'tr' eliminated by output control in cut
    # tr -d ';' |
    sed -E 's/^([[:space:]]+)//g'

VARIANT 1.6666: Refactor 1.3333…

After deeper reading of grep’s manpage & about regexes: better use of grep’s flags and of regexes allows us to drop ‘cut’.

stty -a |
    grep -o -E "(rows|columns)[[:space:]]+[[:digit:]]+" |
    # cut -d ';' --output-delimiter='' -f 2,3 | # eliminated
    # Bring back tr, to collapse newline-separated output into one line
    tr '\n' ' ' |
    # trim annoying trailing spaces
    sed -E 's/([[:space:]]+)$//g'

VARIANT 2: after messing around with sed, and realising it might be enough alone!

stty -a |
      sed -E -n 's/.*(rows[[:space:]]+[[:digit:]]+).*(columns[[:space:]]+[[:digit:]]+).*/\1 \2/p'

VARIANT 3: after discovering awk, and half a day of learning it

stty -a |
      awk 'BEGIN { FS="; " } /rows|columns/ { print $2,$3 }'

Funda: Software Tools Principles

(Classic Shell Scripting)

  • Most importantly, Do one thing well (see the sketch after this list)
  • Process Lines of text, not binary
  • Use regular expressions
  • Default to standard I/O
  • Don’t be chatty
  • Generate the same output format accepted as input
  • Let someone else do the hard part
  • Detour to build specialised tools
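
A minimal sketch of a filter that follows these principles (the name squeeze_blank_lines and its job are ours, not from the book): it does one thing, defaults to standard I/O, emits the same line-oriented format it accepts, stays quiet, and lets cat do the hard part.

function squeeze_blank_lines {
    # Do one thing well: collapse runs of blank lines into single blank lines
    cat -s
}

# squeeze_blank_lines < messy.txt > tidy.txt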

Funda: Re-motivating perspective

(Unix Koans)

Koan: Master Foo and the Shell Tools

https://catb.org/esr/writings/unix-koans/shell-tools.html

A Unix novice came to Master Foo and said: “I am confused. Is it not the Unix way that every program should concentrate on one thing and do it well?”

Master Foo nodded.

The novice continued: “Isn’t it also the Unix way that the wheel should not be reinvented?”

Master Foo nodded again.

“Why, then, are there several tools with similar capabilities in text processing: sed, awk, and Perl? With which one can I best practice the Unix way?”

Master Foo asked the novice: “If you have a text file, what tool would you use to produce a copy with a few words in it replaced by strings of your own choosing?”

The novice frowned and said: “Perl’s regexps would be excessive for so simple a task. I do not know awk, and I have been writing sed scripts in the last few weeks. As I have some experience with sed, at the moment I would prefer it. But if the job only needed to be done once rather than repeatedly, a text editor would suffice.”

Master Foo nodded and replied: “When you are hungry, eat; when you are thirsty, drink; when you are tired, sleep.”

Upon hearing this, the novice was enlightened.

Demo: Master Foo and the Shell Tools

#
# And thus spake master foo
#

#
# "When you are hungry, eat."
#
function extract_row_column_fields_v1 {
    grep -E "rows|columns" |
        cut -d ';' -f 2,3 |
        tr -d ';' |
        sed -E 's/^([[:space:]]+)//g'
}

#
# "When you are thirsty, drink."
#
function extract_row_column_fields_v2 {
    sed -E -n 's/.*(rows[[:space:]]+[[:digit:]]+).*(columns[[:space:]]+[[:digit:]]+).*/\1 \2/p'
}

#
# "When you are tired, sleep."
#
function extract_row_column_fields_v3 {
    awk 'BEGIN { FS="; " } /rows|columns/ { print $2,$3 }'
}

# stty -a | extract_row_column_fields_v2

# The variants seen so far are referentially transparent,
# and we can use whichever approach we understand _as of today_.

Funda: Motivating example++

Douglas McIlroy’s program scales four orders of magnitude (at least).

So will ours, even though we are not Doug.

I didn’t try 5 orders of magnitude…

If 10,000x isn’t satisfactory, then nothing will be. :)

function Doug_McIllroys_program {
    # Problem statement (word frequency):
    # - Read a file of text
    # - Determine the n most frequently-used words
    # - Print out a sorted list of all the words, along with their frequencies

    # Douglas McIlroy's solution from 1986
    # 1. Transliterate complement (-c) of words into newlines, squeezing out (-s) duplicates
    tr -cs A-Za-z '\n' |
        # 2. Transliterate uppercase to lowercase
        tr A-Z a-z |
        # 3. Sort to bring identical words together
        sort |
        # 4. Replace each run of duplicate words with a single representative, and include a count
        uniq -c |
        # 5. Sort reverse (-r), numeric (-n)
        sort -rn |
        # 6. Pass through stream editor; quit after printing the given number of lines ${1} (or default to 10 lines)
        sed ${1:-10}q
}

function setup_test_data {
    local base_dir=${1:-"/home/adi/src/bitbucket/adityaathalye/drstrangepipes/data"}
    local onex_file="${base_dir}/bashman_1x.txt"
    local oneKx_file="${base_dir}/bashman_1000x.txt"
    local tenKx_file="${base_dir}/bashman_10000x.txt"

    # re-generate 1x bashman
    cat /dev/null > ${onex_file} &&
        man bash >> ${onex_file}

    # re-generate 1,000x bashman
    cat /dev/null > ${oneKx_file} &&
        for _ in $(seq 1000)
        do
            cat ${onex_file} >> ${oneKx_file}
        done

    # re-generate 10,000x bashman
    cat /dev/null > ${tenKx_file} &&
        for _ in $(seq 10)
        do cat ${oneKx_file} >> ${tenKx_file}
        done
}

function run_Dougs_program_with_timing {
    local fname="${1}"
    ls -sh "${fname}"
    wc -l "${fname}"
    # backgrounded, so the timing report arrives when the run finishes
    time Doug_McIllroys_program < "${fname}" > /dev/null &
}
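
For example (the data directory is whatever setup_test_data was pointed at; the path below is illustrative):

# setup_test_data "${HOME}/drstrangepipes/data"
# run_Dougs_program_with_timing "${HOME}/drstrangepipes/data/bashman_1x.txt"
# run_Dougs_program_with_timing "${HOME}/drstrangepipes/data/bashman_10000x.txt"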

Funda: Motivating example+++

We can turn Doug’s program into a swiss army toolkit

function flatten_to_word_list {
    tr -cs A-Za-z '\n'
}

function tokenise_lowercase {
    # Transliterate uppercase to lowercase
    tr A-Z a-z
}

function frequencies {
    # Produce frequency distribution of input
    sort | uniq -c | sort -rn
}

function take_n {
    sed ${1:-10}q
}

# Mix-and-match more stuff

function sort_dictionary {
    sort -b -d -k2
}

function sort_rhyme {
    rev | sort -b -d | rev
}

# eliminate stop-words
function drop_stopwords {
    local stopwords=${1:-"the,is,to,a,of,if,and,in,or,be,by,not,with,for,when,it"}
    # turn the comma-separated list into an ERE alternation, e.g. "the|is|to|..."
    local grep_pattern=$(tr ',' '|' <<<"${stopwords}")
    # -w: match whole words only, so "the" does not also drop "there"
    grep -v -w -E "${grep_pattern}"
}

# n-grams
function bigram {
    local tempfile="/tmp/bigram$$"
    cat - > "${tempfile}"
    # pair each word with the next one (tail -n +2 skips the first line)
    paste "${tempfile}" <(tail -n +2 "${tempfile}")
    rm "${tempfile}"
}

function trigram {
    local tempfile="/tmp/trigram$$"
    cat - > "${tempfile}"
    # pair each word with the next two
    paste "${tempfile}" <(tail -n +2 "${tempfile}") <(tail -n +3 "${tempfile}")
    rm "${tempfile}"
}

# git s'more reuse
function git_committers {
    local git_dir=${1:?"ERROR Please provide path to git-managed directory."}
    # the subshell confines the cd, so the caller's working directory is untouched
    ( cd "${git_dir}" && git log master --pretty=format:%an ) |
        frequencies
}
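
Mixed and matched, the pieces compose back into Doug's program and into new ones. A sketch (these pipelines simply chain the functions defined above):

# Doug's program, reassembled from its parts:
# man bash | flatten_to_word_list | tokenise_lowercase | frequencies | take_n 20

# Variations by swapping or inserting stages:
# man bash | flatten_to_word_list | tokenise_lowercase | drop_stopwords | frequencies | take_n 20
# man bash | flatten_to_word_list | tokenise_lowercase | bigram | frequencies | take_n 20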

Experience: How do I Shell?

Avoid bad situations

  • Fail fast and early (see the sketch after this list)
  • Avoid manual state management
  • Avoid portability, unless a design requirement
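
One common way to fail fast and early is to switch on bash's stricter options at the top of a script (a small sketch using standard bash options; adjust to taste):

# exit on errors (-e), on use of unset variables (-u),
# and on a failure anywhere in a pipeline (-o pipefail)
set -euo pipefail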

Avoid adding code

  • Rely on system-provided guarantees & defaults
  • Design / redesign my data better instead
  • Use tool flags & options judiciously

Functional coding discipline

  • Don’t control-flow, Pipeline all the things!!!!
    • Compose functions
    • Compose Unix tools
    • Compose pipelines
    • Compose programs
  • Emit structured, line-oriented data
    • amenable to structured text-processing
  • design fn-as-cmdline-util (standard I/O/E!)
  • emulate HoFs (but carefully - Interpreter rules!); sketch below
  • “Functional core, Imperative shell” architecture
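
A minimal sketch of emulating a higher-order function (the names map_lines and shout are ours): the "function" passed in is just a name the interpreter looks up at call time, which is exactly why care is needed.

function map_lines {
    # apply the given function/command to each line of standard input
    local fn="${1:?Usage: map_lines FUNCTION_OR_COMMAND}"
    local line
    while IFS= read -r line; do
        "${fn}" "${line}"
    done
}

function shout {
    printf '%s!\n' "${1^^}"   # uppercase the argument (bash 4+)
}

# printf 'hello\nworld\n' | map_lines shout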

Use functions judiciously to

  • Enforce invariants
  • Control order of effects (obviously)
  • Safely set/reset globals (see the sketch after this list)
  • Improve maintainability & understandability
  • Improve testability
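
Two tiny sketches of these uses (the names are hypothetical): one enforces an invariant up front, the other "sets and resets" a global, the working directory, by confining the change to a subshell.

function require_nonempty_file {
    # enforce an invariant: the file must exist and contain data
    local f="${1:?Usage: require_nonempty_file FILE}"
    [[ -s "${f}" ]] || { echo "ERROR: ${f} is missing or empty" >&2; return 1; }
}

function in_directory {
    # the subshell ( ... ) confines the cd, so the caller's PWD is untouched
    local dir="${1:?Usage: in_directory DIR COMMAND [ARGS...]}"; shift
    ( cd "${dir}" && "$@" )
}

# e.g. (illustrative paths): require_nonempty_file ./data/bashman_1x.txt && in_directory ./data ls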

Namespace by naming convention

  • To avoid clobbering imports/source
  • for readline and grep-friendliness
  • “namespaced” fns: utilname_funcname (sketch below)
  • “private” fns: __utilname_funcname
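
A tiny sketch of the convention, reusing the word-frequency toolkit from earlier (the wordfreq_* names are hypothetical):

# "public", namespaced: utilname_funcname
function wordfreq_top_n {
    __wordfreq_tokenise | frequencies | take_n "${1:-10}"
}

# "private" helper: __utilname_funcname
function __wordfreq_tokenise {
    flatten_to_word_list | tokenise_lowercase
}

# man bash | wordfreq_top_n 5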

Don’t use shell if it’s a bad idea!

References, Tips, Tricks, Tutorials
