sts10/why-whittle.markdown

Whittle vs print_first vs print_rand: An experiment

Context: This is a demonstration of several features of a word list manipulation tool called Tidy.
Given this word list, which is sorted by Google Ngram word frequency with the most common word listed first, plus a few prefix words thrown in to make this hypothetical work well:
common
challenged
electrodes
chromium
ascribe
tripped
lamont
apostrophe
impound
ontologies
utopia
idling
utopianism
impounded
stippled
apostrophes
unpolluted
marquand
sitar
filo
dendrimers
pleomorphism
bursary

let's do some tests.
Let's say we want to make a 12-word, prefix-free list from this list (that is, a list on which no word is a prefix of another word), and that we want to prefer commonly used words, as defined in this case by their order in the original list (above).
Using print-rand

Tidy's print-rand option simply makes the list in memory, then randomly selects 12 words to print to the new list.
tidy -O -P --print-rand 12 -o print_rand_list.txt list_a.txt
produces:
dendrimers
apostrophes
impounded
ascribe
sitar
tripped
lamont
common
electrodes
ontologies
stippled
bursary
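
In rough terms, that command can be pictured as the Python sketch below. It is an illustration rather than Tidy's actual code: `remove_prefix_words` and `print_rand` are hypothetical helper names, and the sketch assumes the command's other flags amount to dropping prefix words while keeping the input order. Because the selection is random, each run will usually print a different 12 words.

```python
import random

# The demo word list from above, most common word first.
WORDS = """common challenged electrodes chromium ascribe tripped lamont
apostrophe impound ontologies utopia idling utopianism impounded stippled
apostrophes unpolluted marquand sitar filo dendrimers pleomorphism
bursary""".split()


def remove_prefix_words(words):
    """Drop every word that is a prefix of some other word in the list."""
    return [w for w in words if not any(o != w and o.startswith(w) for o in words)]


def print_rand(words, n):
    """Hypothetical stand-in: build the prefix-free list, then pick n words at random."""
    return random.sample(remove_prefix_words(words), n)


sample = print_rand(WORDS, 12)
print("\n".join(sample))
```

Note that the random pick pays no attention to where a word sat in the original list, so rare words are as likely to be chosen as common ones.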

Annotating these words with their position in the original list as a stand-in "score" for how common the words on the outputted list are, we get:
21 dendrimers
16 apostrophes
14 impounded
5 ascribe
19 sitar
6 tripped
7 lamont
1 common
3 electrodes
10 ontologies
15 stippled
23 bursary

Summing these, we get a total "score" of 140. Reminder that we want a low score, as that means we got more of the commonly used words in the final list.
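
The scoring is just a sum of list positions, which a few lines of Python can reproduce. The helper name `score` is my own for this sketch; the word list and the print-rand output are copied from above.

```python
# The demo word list from above, most common word first.
WORDS = """common challenged electrodes chromium ascribe tripped lamont
apostrophe impound ontologies utopia idling utopianism impounded stippled
apostrophes unpolluted marquand sitar filo dendrimers pleomorphism
bursary""".split()


def score(selection, ranked=WORDS):
    """Sum the 1-based positions of the selected words in the ranked list."""
    rank = {word: position for position, word in enumerate(ranked, start=1)}
    return sum(rank[word] for word in selection)


print_rand_output = """dendrimers apostrophes impounded ascribe sitar tripped
lamont common electrodes ontologies stippled bursary""".split()

print(score(print_rand_output))  # 140
```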
Using print-first

Tidy's print-first function takes the first N words from the generated list, after it's processed the inputted list. For this reason, as you might expect, it does better in our scoring system.
tidy -O -P --print-first 12 -o print_first_list.txt list_a.txt
we get:
common
challenged
electrodes
chromium
ascribe
tripped
lamont
ontologies
idling
utopianism
impounded
stippled

With scores:
1 common
2 challenged
3 electrodes
4 chromium
5 ascribe
6 tripped
7 lamont
10 ontologies
12 idling
13 utopianism
14 impounded
15 stippled

For a total score of 92.
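
A minimal sketch of the same idea in Python, assuming the processing step here is only the removal of prefix words with the input order kept. `remove_prefix_words` and `print_first` are hypothetical names for illustration, not Tidy's internals.

```python
# The demo word list from above, most common word first.
WORDS = """common challenged electrodes chromium ascribe tripped lamont
apostrophe impound ontologies utopia idling utopianism impounded stippled
apostrophes unpolluted marquand sitar filo dendrimers pleomorphism
bursary""".split()


def remove_prefix_words(words):
    """Drop every word that is a prefix of some other word in the list."""
    return [w for w in words if not any(o != w and o.startswith(w) for o in words)]


def print_first(words, n):
    """Hypothetical stand-in: process the whole list first, then keep the first n words."""
    return remove_prefix_words(words)[:n]


first_12 = print_first(WORDS, 12)
rank = {word: position for position, word in enumerate(WORDS, start=1)}
print(first_12)
print(sum(rank[word] for word in first_12))  # 92
```

Since the prefix words are removed while the whole 23-word list is in view, "apostrophe" (position 8) is dropped because "apostrophes" (position 16) is present, even though "apostrophes" never makes it into the final 12.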
Using whittle-to

Whittle-to is one of Tidy's newer features. It repeatedly guesses at how many words to take from the top of the inputted list until the resulting list is exactly the specified length (12 words, in this case).
tidy -O -P --whittle-to 12 -o whittled_list.txt list_a.txt
common
challenged
electrodes
chromium
ascribe
tripped
lamont
apostrophe
ontologies
idling
utopianism
impounded

With scores:
1 common
2 challenged
3 electrodes
4 chromium
5 ascribe
6 tripped
7 lamont
8 apostrophe
10 ontologies
12 idling
13 utopianism
14 impounded

For a total score of 85.
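
The guessing loop can be sketched in Python as follows. This is a guess at the shape of the procedure, not Tidy's implementation: `whittle_to` and `first_guess` are hypothetical names, and the sketch assumes the search starts by taking the whole list and then adjusts one word at a time. That starting point is an assumption I picked because it reproduces the output above; a different first guess can settle on a different list.

```python
# The demo word list from above, most common word first.
WORDS = """common challenged electrodes chromium ascribe tripped lamont
apostrophe impound ontologies utopia idling utopianism impounded stippled
apostrophes unpolluted marquand sitar filo dendrimers pleomorphism
bursary""".split()


def remove_prefix_words(words):
    """Drop every word that is a prefix of some other word in the list."""
    return [w for w in words if not any(o != w and o.startswith(w) for o in words)]


def whittle_to(words, target, first_guess=None):
    """Hypothetical whittle: adjust how many top words to take until the
    prefix-free result has exactly `target` words."""
    # Assumption: the first attempt takes the whole list unless told otherwise.
    take = len(words) if first_guess is None else first_guess
    while 0 < take <= len(words):
        result = remove_prefix_words(words[:take])
        if len(result) == target:
            return result
        # Too many words left: take fewer from the top. Too few: take more.
        take += -1 if len(result) > target else 1
    raise ValueError("could not whittle the list to the requested length")


whittled = whittle_to(WORDS, 12)
rank = {word: position for position, word in enumerate(WORDS, start=1)}
print(whittled)
print(sum(rank[word] for word in whittled))  # 85
```

In this run the loop settles on taking the first 14 words. "apostrophes" (position 16) is never read, so "apostrophe" survives, which is where the improvement over print-first comes from.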
Thus, of the three options outlined, whittling is the best choice for preserving desired words when producing a prefix-word-free list: it scores 85, against 92 for print-first and 140 for print-rand.
Epilogue: Suffix Words

I'll just note here that, given the way this demo list happens to have been constructed, if we remove all suffix words instead of all prefix words as we whittle down to 12 words, we get a "perfect" score of 78. (Removing all suffix words is, I argue, another procedure for making a list uniquely decodable.)
tidy -O -S --whittle-to 12 list_a.txt
common
challenged
electrodes
chromium
ascribe
tripped
lamont
apostrophe
impound
ontologies
utopia
idling

With scores:
1 common
2 challenged
3 electrodes
4 chromium
5 ascribe
6 tripped
7 lamont
8 apostrophe
9 impound
10 ontologies
11 utopia
12 idling

78! Perfect!
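
Here is the suffix variant as a Python sketch, under the same caveats as before: the helper names are hypothetical and the one-word-at-a-time search starting from the whole list is an assumption, not Tidy's actual code. The only real change from prefix removal is `endswith` in place of `startswith`.

```python
# The demo word list from above, most common word first.
WORDS = """common challenged electrodes chromium ascribe tripped lamont
apostrophe impound ontologies utopia idling utopianism impounded stippled
apostrophes unpolluted marquand sitar filo dendrimers pleomorphism
bursary""".split()


def remove_suffix_words(words):
    """Drop every word that is a suffix of some other word in the list."""
    return [w for w in words if not any(o != w and o.endswith(w) for o in words)]


def whittle_to(words, target, clean):
    """Hypothetical whittle: adjust how many top words to take until the
    cleaned result has exactly `target` words."""
    take = len(words)  # assumption: the first attempt takes the whole list
    while 0 < take <= len(words):
        result = clean(words[:take])
        if len(result) == target:
            return result
        # Too many words left: take fewer from the top. Too few: take more.
        take += -1 if len(result) > target else 1
    raise ValueError("could not whittle the list to the requested length")


suffix_free = whittle_to(WORDS, 12, remove_suffix_words)
rank = {word: position for position, word in enumerate(WORDS, start=1)}
print(suffix_free)
print(sum(rank[word] for word in suffix_free))  # 78
```

No word on this demo list is a suffix of another, so nothing gets removed and the result is simply the first 12 words, whose positions sum to 1 + 2 + ... + 12 = 78.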
Note that a third method, one I've named Schlinkert pruning, also creates this same 78-scoring list.