@adityaathalye
Created November 20, 2019 14:26
Dr. StrangePipes or: How I learned to stop worrying && "function" in Shell

Prelude: Why even bother?

Personally…

  • Text is everywhere, better learn to process it
  • The masters already solved it better
  • Stable tools, well-known warts
  • Personal/team productivity, tool-building autonomy
  • To learn about
    • good and successful design
    • bad and successful design
    • computer history
  • For the fun of it

Intro: Some thought experiments

Which ones are functional, which aren’t?

  • Bash
  • Ruby
  • Python
  • Swift
  • Kotlin
  • Clojure
  • Haskell
  • FSharp
  • Java
  • Erlang
  • APL
  • Dart
  • Forth / Factor
  • Hack
  • SmallTalk
  • Prolog
  • C
  • C++

What makes Functional Programming “Functional”?

Is it Mathematical?

  • Functions?
  • Higher-ordered functions?
  • Typed?

Is it “Declarative”?

  • FP is a kind of declarative programming. What you say is what you get.
  • map/filter/reduce? (automatic in Unix-land; see the pipeline sketch after this list)
  • recursion?
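
For instance, a minimal sketch (my illustration, not from the talk) of how an ordinary pipeline already gives you filter, map and reduce:

# filter: keep only lines mentioning "error" (case-insensitive)
# map:    extract the first whitespace-separated field of each surviving line
# reduce: count how many lines made it through
grep -i 'error' |
    awk '{ print $1 }' |
    wc -l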

Is it Data-oriented?

  • Data in data out?
  • “Referentially transparent”?
  • Message passing?

Is it Stateless?

  • No shared state?
  • No mutable state?
  • No side effects?

Is it memory-managed?

  • Garbage collection?
  • Immutability?

Is it a System?

  • A human discipline? (ref. accounting)
  • Automated rules and guidelines?
  • Managed Environment?

What makes Unix “Unix”? (Pike)

Something like the combination of:

  • high-level programming language
  • hierarchical file system
  • uniform, unformatted files (text)
  • separable shell
  • distinct tools
  • pipes
  • regular expressions
  • portability
  • security

Intro: Motivating example

Doug McIlroy’s program, as seen in Classic Shell Scripting and More Shell, Less Egg

# Problem statement (word frequency):
# - Read a file of text
# - Determine the n most frequently-used words
# - Print out a sorted list of all the words, along with their frequencies

# Douglas McIlroy's solution from 1986
# 1. Transliterate complement (-c) of words into newlines, squeezing out (-s) duplicates
tr -cs A-Za-z '\n' |
    # 2. Transliterate uppercase to lowercase
    tr A-Z a-z |
    # 3. Sort to bring identical words together
    sort |
    # 4. Replace each run of duplicate words with a single representative, and include a count
    uniq -c |
    # 5. Sort reverse (-r), numeric (-n)
    sort -rn |
    # 6. Pass through stream editor; quit after printing the given number of lines ${1} (or default to 10 lines)
    sed ${1:-10}q

Intro: Demotivating example

VARIANT 1: with basic grep, cut, tr and simple regexps/patterns

Assumes regular positional structure of stty output.

# 1. Generate all available info about stty
stty -a |
    # 2. Filter the line containing row/column information
    grep -E "rows|columns" |
    # 3. Extract 2nd and 3rd column
    cut -d ';' -f 2,3 |
    # 4. Remove annoying semicolon
    tr -d ';' |
    # 5. Optionally, trim annoying leading space
    sed -E 's/^([[:space:]]+)//g'

VARIANT 1.3333: Refactor variant 1

After light reading of the manpages of the tools used: using the output control of ‘cut’, we can eliminate the ‘tr’ stage.

stty -a |
    grep -E "rows|columns" |
    cut -d ';' --output-delimiter='' -f 2,3 |
    # 'tr' eliminated by output control in cut
    # tr -d ';' |
    sed -E 's/^([[:space:]]+)//g'

VARIANT 1.6666: Refactor 1.3333…

After deeper reading of grep’s manpage & about regexes: better use of grep’s flags and of regexes allows us to drop ‘cut’.

stty -a |
    grep -o -E "(rows|columns)[[:space:]]+[[:digit:]]+" |
    # cut -d ';' --output-delimiter='' -f 2,3 | # eliminated
    # Bring back tr, to collapse newline-separated output into one line
    tr '\n' ' ' |
    # trim annoying trailing spaces
    sed -E 's/([[:space:]]+)$//g'

VARIANT 2: after messing around with sed and realising it might be enough alone!

stty -a |
      sed -E -n 's/.*(rows[[:space:]]+[[:digit:]]+).*(columns[[:space:]]+[[:digit:]]+).*/\1 \2/p'

VARIANT 3: after discovering awk and half a day of learning it

stty -a |
      awk 'BEGIN { FS="; " } /rows|columns/ { print $2,$3 }'

Funda: Software Tools Principles

(Classic Shell Scripting)

  • Most importantly, Do one thing well
  • Process Lines of text, not binary
  • Use regular expressions
  • Default to standard I/O (see the filter sketch after this list)
  • Don’t be chatty
  • Generate the same output format accepted as input
  • Let someone else do the hard part
  • Detour to build specialised tools
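
A minimal sketch of a tool written to these principles (my illustration; the function name is hypothetical): it reads lines from standard input, writes lines to standard output, says nothing else, and emits the same line-oriented format it accepts, so it composes with other filters.

# Keep only non-blank, non-comment lines; same format in, same format out
function strip_blanks_and_comments {
    grep -v -E '^[[:space:]]*(#|$)'
}

# e.g. strip_blanks_and_comments < ~/.bashrc | wc -l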

Funda: Re-motivating perspective

(Unix Koans)

Koan: Master Foo and the Shell Tools

https://catb.org/esr/writings/unix-koans/shell-tools.html

A Unix novice came to Master Foo and said: “I am confused. Is it not the Unix way that every program should concentrate on one thing and do it well?”

Master Foo nodded.

The novice continued: “Isn’t it also the Unix way that the wheel should not be reinvented?”

Master Foo nodded again.

“Why, then, are there several tools with similar capabilities in text processing: sed, awk, and Perl? With which one can I best practice the Unix way?”

Master Foo asked the novice: “If you have a text file, what tool would you use to produce a copy with a few words in it replaced by strings of your own choosing?”

The novice frowned and said: “Perl’s regexps would be excessive for so simple a task. I do not know awk, and I have been writing sed scripts in the last few weeks. As I have some experience with sed, at the moment I would prefer it. But if the job only needed to be done once rather than repeatedly, a text editor would suffice.”

Master Foo nodded and replied: “When you are hungry, eat; when you are thirsty, drink; when you are tired, sleep.”

Upon hearing this, the novice was enlightened.

Demo: Master Foo and the Shell Tools

#
# And thus spake master foo
#

#
# "When you are hungry, eat."
#
function extract_row_column_fields_v1 {
    grep -E "rows|columns" |
        cut -d ';' -f 2,3 |
        tr -d ';' |
        sed -E 's/^([[:space:]]+)//g'
}

#
# "When you are thirsty, drink."
#
function extract_row_column_fields_v2 {
    sed -E -n 's/.*(rows[[:space:]]+[[:digit:]]+).*(columns[[:space:]]+[[:digit:]]+).*/\1 \2/p'
}

#
# "When you are tired, sleep."
#
function extract_row_column_fields_v3 {
    awk 'BEGIN { FS="; " } /rows|columns/ { print $2,$3 }'
}

# stty -a | extract_row_column_fields_v2

# The variants seen so far are referentially transparent
# And we can use whichever approach we understand _as of today_.
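
# A quick check of that claim (a sketch; actual numbers depend on your terminal):
# the same input through any variant should produce the same output, so these
# diffs should print nothing.
#
# diff <(stty -a | extract_row_column_fields_v1) <(stty -a | extract_row_column_fields_v2)
# diff <(stty -a | extract_row_column_fields_v2) <(stty -a | extract_row_column_fields_v3)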

Funda: Motivating example++

Douglas McIlroy’s program scales four orders of magnitude (at least).

So will ours, even though we are not Doug.

I didn’t try 5 orders of magnitude…

If 10,000x isn’t satisfactory, then nothing will be. :)

function Doug_McIllroys_program {
    # Problem statement (word frequency):
    # - Read a file of text
    # - Determine the n most frequently-used words
    # - Print out a sorted list of all the words, along with their frequencies

    # Douglas McIlroy's solution from 1986
    # 1. Transliterate complement (-c) of words into newlines, squeezing out (-s) duplicates
    tr -cs A-Za-z '\n' |
        # 2. Transliterate uppercase to lowercase
        tr A-Z a-z |
        # 3. Sort to bring identical words together
        sort |
        # 4. Replace each run of duplicate words with a single representative, and include a count
        uniq -c |
        # 5. Sort reverse (-r), numeric (-n)
        sort -rn |
        # 6. Pass through stream editor; quit after printing the given number of lines ${1} (or default to 10 lines)
        sed ${1:-10}q
}

function setup_test_data {
    local base_dir=${1:-"/home/adi/src/bitbucket/adityaathalye/drstrangepipes/data"}
    local onex_file="${base_dir}/bashman_1x.txt"
    local oneKx_file="${base_dir}/bashman_1000x.txt"
    local tenKx_file="${base_dir}/bashman_10000x.txt"

    # re-generate 1x bashman
    cat /dev/null > ${onex_file} &&
        man bash >> ${onex_file}

    # re-generate 1,000x bashman
    cat /dev/null > ${oneKx_file} &&
        for _ in $(seq 1000)
        do
            cat ${onex_file} >> ${oneKx_file}
        done

    # re-generate 10,000x bashman
    cat /dev/null > ${tenKx_file} &&
        for _ in $(seq 10)
        do cat ${oneKx_file} >> ${tenKx_file}
        done
}

function run_Dougs_program_with_timing {
    local fname=${1}
    ls -sh ${fname}
    wc -l ${fname}
    time Doug_McIllroys_program < ${fname} > /dev/null &
}
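
# Example usage (a sketch; the data directory is an assumption: point
# setup_test_data at any writable directory you like):
#
# mkdir -p /tmp/drstrangepipes/data
# setup_test_data "/tmp/drstrangepipes/data"
# run_Dougs_program_with_timing "/tmp/drstrangepipes/data/bashman_1x.txt"
# run_Dougs_program_with_timing "/tmp/drstrangepipes/data/bashman_1000x.txt"
# run_Dougs_program_with_timing "/tmp/drstrangepipes/data/bashman_10000x.txt"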

Funda: Motivating example+++

We can turn Doug’s program into a Swiss Army knife of a toolkit (example compositions follow the function definitions).

function flatten_to_word_list {
    # Transliterate complement (-c) of words into newlines, squeezing out (-s) duplicates
    tr -cs A-Za-z '\n'
}

function tokenise_lowercase {
    # Transliterate uppercase to lowercase
    tr A-Z a-z
}

function frequencies {
    # Produce frequency distribution of input
    sort | uniq -c | sort -rn
}

function take_n {
    # Print the first N lines (default 10), then quit
    sed ${1:-10}q
}

# Mix-and-match more stuff

function sort_dictionary {
    # Sort by the word field (2nd field), dictionary order, ignoring leading blanks
    sort -b -d -k2
}

function sort_rhyme {
    # Sort words by their endings: reverse each line, sort, reverse back
    rev | sort -b -d | rev
}

# eliminate stop-words
function drop_stopwords {
    local stopwords=${1:-"the,is,to,a,of,if,and,in,or,be,by,not,with,for,when,it"}
    shift
    local grep_pattern=$(tr , '\|' <<<"${stopwords}")
    grep -v -E ${grep_pattern}
}

# n-grams
function bigram {
    # Pair each word with the next, by pasting the stream against itself shifted by one line
    local tempfile="/tmp/bigram$$"
    cat - > ${tempfile}
    paste ${tempfile} <(tail -n +2 ${tempfile})
    rm ${tempfile}
}

function trigram {
    # Same idea, shifted by one and by two lines
    local tempfile="/tmp/trigram$$"
    cat - > ${tempfile}
    paste ${tempfile} <(tail -n +2 ${tempfile}) <(tail -n +3 ${tempfile})
    rm ${tempfile}
}

# git s'more reuse
function git_committers {
    local git_dir=${1:?$(echo "ERROR Please provide path to git-managed directory.")}
    cat <( cd ${git_dir};
           git log master --pretty=format:%an ; ) |
        frequencies
}
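
# Example compositions (sketches; the inputs and counts are illustrative):
#
# Top 20 non-stopword words in the bash manpage:
# man bash | flatten_to_word_list | tokenise_lowercase | drop_stopwords | frequencies | take_n 20
#
# Top 10 bigrams in the bash manpage:
# man bash | flatten_to_word_list | tokenise_lowercase | bigram | frequencies | take_n
#
# A rough rhyming index of the bash manpage's vocabulary:
# man bash | flatten_to_word_list | tokenise_lowercase | sort -u | sort_rhyme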

Experience: How do I Shell?

Avoid bad situations

  • Fail fast and early (see the sketch after this list)
  • Avoid manual state management
  • Avoid portability, unless a design requirement
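
One common way to fail fast and early at the top of a script (a sketch; the talk does not prescribe these exact settings):

set -e          # exit as soon as a command fails unhandled
set -u          # treat expansion of unset variables as an error
set -o pipefail # a pipeline fails if any stage of it fails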

Avoid adding code

  • Rely on system-provided guarantees & defaults
  • Design / redesign my data better instead
  • Use tool flags & options judiciously

Functional coding discipline

  • Don’t control-flow, Pipeline all the things!!!!
    • Compose functions
    • Compose Unix tools
    • Compose pipelines
    • Compose programs
  • Emit structured, line-oriented data
    • amenable to structured text-processing
  • Design fn-as-cmdline-util (standard I/O/E!)
  • Emulate HoFs (but carefully - Interpreter rules! See the sketch after this list)
  • “Functional core, Imperative shell” architecture
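
A minimal sketch (my illustration) of emulating a higher-order function: map_lines applies whatever command name it is given to each line of standard input. The Interpreter caveat: that name is expanded and looked up by the shell at call time, so only pass trusted, known commands.

function map_lines {
    local fn=${1:?"ERROR Please provide a command to apply to each line."}
    local line
    while IFS= read -r line
    do "${fn}" "${line}"
    done
}

# e.g. printf '%s\n' foo quux | map_lines __print_length
# where __print_length could be: function __print_length { echo "${#1}"; }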

Use functions judiciously to

  • Enforce invariants
  • Control order of effects (obviously)
  • Safely set/reset globals (see the sketch after this list)
  • Improve maintainability & understandability
  • Improve testability
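
For example, a sketch (my illustration) of safely setting and resetting a global: run the body in a subshell, so the change to IFS vanishes when the function returns.

function __split_on_commas {
    # The parentheses create a subshell; IFS reverts automatically on return
    ( IFS=','
      read -r -a fields <<< "${1}"
      printf '%s\n' "${fields[@]}" )
}

# e.g. __split_on_commas "rows,columns" prints each field on its own line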

Namespace by naming convention

  • To avoid clobbering imports/source
  • For readline and grep-friendliness
  • “namespaced” fns: utilname_funcname (see the sketch after this list)
  • “private” fns: __utilname_funcname
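
A sketch of the convention with a hypothetical utility name "wordfreq", reusing the frequencies and take_n functions from earlier:

# "namespaced" fn: grep-friendly, and readline completes it from wordfreq_<TAB>
function wordfreq_top_words {
    frequencies | take_n "${1:-10}"
}

# "private" helper, marked by the leading double underscore
function __wordfreq_log {
    echo "[wordfreq] $*" >&2
}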

Don’t use shell if it’s a bad idea!

References, Tips, Tricks, Tutorials
