Last active
July 26, 2017 02:53
-
-
Save ChrisPenner/eb6a4efa0d57f39dc5f3c7bb2d31c2d7 to your computer and use it in GitHub Desktop.
Radix Sort and Trie Trees with Representable Functors
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Looking at my recent posts it's clear I've been on a bit of a Representable | |
kick lately; turns out there's a lot of cool things you can do with it! We'll | |
be adding 'sorting' to that list of things today. Representable Functors bring | |
with them an intrinsic notion of sorting; not in the traditional 'ordered' | |
sense, but rather a sense of 'structural' sorting. Since every 'slot' in a | |
Representable Functor `r` can be uniquely identified by some `Rep r` we can | |
talk about sorting items into some named slot in `r`. If we like we can also | |
define `Ord (Rep r)` to get a total ordering over the slots, but it's not | |
required. | |
I'll preface this post by saying I'm more interested in exploring the | |
structural 'form' of representable sorting than the performance of the | |
functions we'll define, in fact the performance of some of them as they're | |
written here is going to be quite poor as I'm sacrificing speed to gain | |
simplicity for the sake of pedagogy. The intent is to observe the system from a | |
high-level to see some interesting new patterns and shapes we gain from using | |
Representable to do our sorting. Most of the structures we build could quite | |
easily be optimized to perform reasonably if one was so inclined. | |
Building up Sorting over Representable | |
I'll step through my thought process on this one: | |
We've got a Representable Functor `r`; If we have a `Rep r` for some `a` we | |
know which slot to put it into in an `r a`. We can get a `Rep r` for every `a` | |
by using a function `a -> Rep r`. Now we want to embed the `a` into an `r a` | |
using the `Rep r`, the tool we have for this is `tabulate`, in order to know | |
which index is which and put it into the right slot we'll need to require | |
`Eq (Rep r)`. We know which slot our one element goes to, but we need something | |
to put into all the other slots. If `a` were a Monoid we could use `mempty` for | |
the other slots; then if we map that function over every element in an input | |
list we could build something like this: | |
```haskell | |
(Representable r, Monoid a, Eq (Rep r)) => (a -> Rep r) -> [a] -> [r a] | |
``` | |
We want a single `r a` as a result rather than a list, so we need to collapse | |
`[r a]`. We could use `mconcat` if `r a` was a Monoid! We can actually write a | |
monoid instance for any representable if the inner `a` type also has a Monoid | |
instance, later we'll define a custom newtype wrapper with that instance! This | |
gives: | |
```haskell | |
(Representable r, Monoid a, Eq (Rep r)) => (a -> Rep r) -> [a] -> r a | |
``` | |
We can generalize the list to any foldable by just calling | |
`Data.Foldable.toList` on it, and we get: | |
```haskell | |
(Representable r, Monoid a, Foldable f, Eq (Rep r)) => (a -> Rep r) -> f a -> r a | |
``` | |
Nifty! But this requires that every `a` type we want to work is also a Monoid, | |
that's going to seriously limit the usecases for this. We can increase the | |
utility by allowing the caller to specifying a way to build a Monoid from an | |
`a`: | |
```haskell | |
(Representable r, Monoid m, Foldable f, Eq (Rep r)) => (a -> Rep r) -> (a -> m) -> f a -> r m | |
``` | |
And that's our final fully generalized type signature! | |
We're going to need a bunch of imports for this, prepare yourself: | |
> {-# language DeriveFunctor #-} | |
> {-# language TypeFamilies #-} | |
> {-# language MultiParamTypeClasses #-} | |
> {-# language FlexibleInstances #-} | |
> {-# language ScopedTypeVariables #-} | |
> {-# language FlexibleContexts #-} | |
> | |
> {-# OPTIONS_GHC -fno-warn-orphans #-} | |
> | |
> module RepSort where | |
> | |
> import Data.Distributive (Distributive(..)) | |
> import Data.Functor.Rep (Representable(..), Co(..), distributeRep) | |
> import Data.Monoid (Sum(..)) | |
> import qualified Data.Stream.Infinite as S (Stream, iterate) | |
> import Control.Comonad.Cofree (Cofree) | |
> import qualified Data.Sequence as Seq (Seq, fromList) | |
So here's my implementation for `repSort`: | |
> -- Firstly, the signature we came up with: | |
> repSort :: (Representable r, Monoid m, Foldable f, Eq (Rep r)) => (a -> Rep r) -> (a -> m) -> f a -> r m | |
> repSort indOf toM = unMRep . foldMap (MRep . tabulate . desc) | |
> where | |
> -- desc takes an 'a' from the foldable and returns a descriptor function which can be passed to 'tabulate', | |
> -- The descriptor just returns mempty unless we're on the slot where the 'a's result is supposed to end up. | |
> desc a i | |
> | i == indOf a = toM a | |
> | otherwise = mempty | |
> | |
> | |
> -- Here's our newtype with a Monoid over Representables | |
> newtype MRep r a = MRep {unMRep ::r a} | |
> deriving (Show, Eq, Functor) | |
> | |
> instance (Monoid a, Representable r) => Monoid (MRep r a) where | |
> -- The empty Representable is filled with mempty | |
> mempty = MRep $ tabulate (const mempty) | |
> -- We can just tabulate a new representable where the value is the `mappend` of | |
> -- the other two representables. BTW (index a `mappend` index b) depends on | |
> -- the monoid instance for functions, so go check that out if you haven't seen it! | |
> (MRep a) `mappend` (MRep b) = MRep . tabulate $ index a `mappend` index b | |
Using `repSort` | |
Great! Let's see some examples so we can get a handle on what this does! First I'll | |
set up a super simple but useful Representable for Pair: | |
> data Pair a = Pair a a | |
> deriving (Show, Eq, Functor) | |
> | |
> -- This instance is required, but we can just lean on our Representable instance | |
> -- `Data.Functor.Rep` provides all sorts of these helpers. | |
> instance Distributive Pair where | |
> distribute = distributeRep | |
> | |
> instance Representable Pair where | |
> -- Bool is a great index for this! | |
> type Rep Pair = Bool | |
> index (Pair a _) True = a | |
> index (Pair _ b) False = b | |
> | |
> tabulate desc = Pair (desc True) (desc False) | |
So since Pair is indexed by a Bool the `a -> Rep Pair` is actually just a | |
predicate `a -> Bool`! Let's try sorting out some odd and even integers! | |
Remember that `repSort` needs a function from `a -> Rep r`, in this case | |
`Rep r ~ Bool` (`~` means 'is equal to' when we're talking about types), so we | |
can use `odd` to split the odd and even integers up! Next it needs a function | |
which transforms an `a` into a monoid! The simplest one of these is `(:[])` | |
which just puts the element into a list! Let's see what we get! | |
> sortedInts :: Pair [Int] | |
> sortedInts = repSort odd (:[]) [1..10] | |
```haskell | |
λ> sortedInts | |
Pair [1,3,5,7,9] [2,4,6,8,10] | |
``` | |
We used lists in the last example, but remember that the function is | |
generalized over that parameter so we can acutally choose any Monoid we like! | |
Let's say we wanted the sums of all odd and even ints respectively between 1 | |
and 10: | |
> oddEvenSums :: Pair (Sum Int) | |
> oddEvenSums = repSort odd Sum [1..10] | |
```haskell | |
λ> oddEvenSums | |
Pair (Sum {getSum = 25}) (Sum {getSum = 30}) | |
``` | |
Choosing our own monoid and index function gives us a lot of flexibility and power! | |
This pattern generalizes to any Representable you can think of, and most | |
Representables which have an interesting `Rep r` will have some cool features! | |
Indexing by Integers using Stream | |
Let's try another Functor and see what happens, here we'll go with an infinite | |
`Stream` from | |
[Data.Stream.Infinite](http://hackage.haskell.org/package/streams-3.3/docs/Data-Stream-Infinite.html) | |
in the [streams package](http://hackage.haskell.org/package/streams), whose | |
representation is `Int`. | |
With a representation type of `Int` we could do all sorts of things! Note here | |
how the Functor (`Stream`) is infinite, but the representable is actually | |
bounded by the size of Int. This is fine as long as we don't try fold the | |
result or get every value out of it in some way, we'll primarily be using the | |
`index` function from `Representable` so it should work out okay. | |
Though the streams are infinite Haskell's inherent laziness helps us out, not only | |
can we represent infinite things without a problem, Haskell | |
won't actually calculate the values stored in any slots where we don't look, | |
and since the whole thing is a data structure any computations that do occur | |
are automatically memoized! This also means that you don't pay the cost for any | |
value transformation or monoidal append unless you actually look inside the bucket. | |
Only the initial `a -> Rep r` must be computed for each element. | |
Let's sort some stuff! See if you can figure this one out: | |
> byLength :: S.Stream [String] | |
> byLength = repSort length (:[]) ["javascript", "purescript", "haskell", "python"] | |
```haskell | |
λ> index byLength 10 | |
["javascript","purescript"] | |
λ> index byLength 7 | |
["haskell"] | |
λ> index byLength 3 | |
[] | |
``` | |
We didn't have to change our implementation of repSort at all! `index` knows how | |
to find values in a `Stream` from an `Int` and all of that complexity is taken care | |
of for us in the instance of `Representable`. | |
The fact that `Int` is the `Rep` for Stream provides us with a few quick wins, | |
any Enumerable type can be injected into `Int` via `fromEnum` from the Prelude: | |
`fromEnum: (Enum e) => e -> Int`. This means we can turn any Enumerable type | |
into an index into `Stream` without much trouble and we gain a whole new set of | |
possibilities: | |
> byFirstChar :: S.Stream [String] | |
> -- We get the Int value from the first char of a string and use that as the index! | |
> byFirstChar = repSort (fromEnum . head) (:[]) ["cats", "antelope", "crabs", "aardvarks"] | |
```haskell | |
λ> index byFirstChar . fromEnum $ 'c' | |
["cats","crabs"] | |
λ> index byFirstChar . fromEnum $ 'a' | |
["antelope","aardvarks"] | |
λ> index byFirstChar . fromEnum $ 'z' | |
[] | |
``` | |
So that's all pretty cool, but working with a single infinite stream gets | |
unwieldy quickly when we start dealing with indexes in the thousands! `Stream` | |
is effectively a linked list, so we need to traverse the whole thing every | |
time! Straying a bit from sorting into the idea of data storage let's say we | |
wanted to store values in a structure where they're keyed by a `String`. The | |
first step would be to find a Representable where the index could be a `String`. | |
Hrmm, our `Stream` representation can index by `Char`, which is close; what if | |
we nested further representables and used a 'path' to the value as the index? | |
Something like `Stream (Stream (Stream ...))`, We have an infinite | |
tree of trees at this point, but we'll never reach the end to get a value! | |
Whenever I think of tagging a tree-like structure with values I go straight to | |
Cofree! | |
Diving into Tries using Cofree | |
One way you can think of Cofree is as a Tree where every branch and node has a | |
value `a` and the branching structure is determined by some Functor! For | |
example, a simple Rose tree is isomorphic to `Cofree [] a`, a binary tree is | |
isomorphic to `Cofree Pair a`, etc. | |
We want a tree where the branching structure is indexed by a `String`, so let's | |
give `Cofree Stream a` a try! Effectively this creates an infinite number of | |
branches at every level of the tree, but in practice we'll only actually index | |
paths which are represented by some string, the rest of the structure will | |
be filled with boring old `mempty`s. | |
As it turns out, we made a good choice! `Cofree r a` is Representable whenever | |
`r` is Representable! But what's the type which indexes into it? It relies on | |
the underlying Representable Index, but since the tree could have multiple | |
layers it needs a sequence of those indexes! That's why the index type for | |
`Cofree` of a Representable is `Seq (Rep r)`. Under the hood the `Rep` for our | |
structure is going to be specialized to `Seq Int`, but we can easily write | |
`mkInd :: String -> Seq Int` to index by Strings! | |
> mkInd :: String -> Seq.Seq Int | |
> mkInd = Seq.fromList . fmap fromEnum | |
Great! Now we can 'sort' values by strings and index into a tree structure | |
performantly! If this whole sort of structure is looking a bit familiar you've | |
probably seen it by the name of a ["Trie | |
Tree"](https://en.wikipedia.org/wiki/Trie), a datastructure often used in | |
[Radix sorts](https://en.wikipedia.org/wiki/Radix_sort) and search problems. | |
Its advantage is that it gives `O(m)` lookups where `m` is the length of the | |
key string, in our case it will be slightly slower than the traditional method | |
since we don't have `O(1)` random memory access, and have to move through each | |
child tree to the appropriate place before diving deeper, but you could fix | |
that pretty easily by using a proper `Vector` as the underlying representation | |
rather than `Stream`. I'll leave that one as an exercise for the reader. | |
Building Maps from Tries | |
That's a whole lot of explaining without any practical examples, so I bet you're | |
itching to try out our new Trie-based Sorter/Map! With a few helpers we can build | |
something quickly! | |
> -- I'm going to specialize the signature to `Cofree Stream` to make it a bit more | |
> -- readable, but you could re-generalize this idea if you wanted to. | |
> trieSort :: (Monoid m, Foldable f) => (a -> String) -> (a -> m) -> f a -> Cofree S.Stream m | |
> trieSort getInd = repSort (mkInd . getInd) | |
> | |
> -- Build a map out of some key and a Monoidal value! | |
> trieMap :: Monoid m => [(String, m)] -> Cofree S.Stream m | |
> trieMap = trieSort fst snd | |
> | |
> get :: Cofree S.Stream a -> String -> a | |
> get r ind = index r (mkInd ind) | |
> | |
> bankAccounts :: Cofree S.Stream (Sum Int) | |
> bankAccounts = trieMap [("Bob", Sum 37), ("Sally", 5)] | |
```haskell | |
λ> get bankAccounts "Bob" | |
Sum {getSum = 37} | |
λ> get bankAccounts "Sally" | |
Sum {getSum = 5} | |
-- Empty keys are mempty | |
λ> get bankAccounts "Edward" | |
Sum {getSum = 0} | |
``` | |
> withdrawals :: Cofree S.Stream (Sum Int) | |
> withdrawals = trieMap [("Bob", Sum (-10))] | |
```haskell | |
-- And of course we can still make use of our MRep helper to combine maps! | |
λ> get (unMRep $ MRep bankAccounts `mappend` MRep withdrawals ) "Bob" | |
Sum {getSum = 27} | |
``` | |
There are TONS of other possibilities here, we can swap out the underlying | |
Representable to get new behaviour and performance, we can use | |
`Data.Functor.Compose` to nest multiple `Representable`s to build new and | |
interesting structures! We can sort, store, lookup and search using `repSort`! | |
Let me know which interesting ideas you come up with! Either leave a comment | |
here or find me on Twitter [\@chrislpenner](https://twitter.com/chrislpenner). |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment