nikomatsakis/hashtable-prob.md

## hashtable-prob.md

      
    Raw
  

              hashtable-prob.md
            
          
    The "Hashtable Problem"

There is a danger with typeclasses only on functions because they
permit a lack of coherence.  This danger is particularly acute with
collections, where the definition of a particular interface---such as
hashable or ord---may be used in structuring the data structure. This
renders the data structure corrupt when used with a different
implementation.
To see the problem, consider a hashtable defined something like this:
mod ht {
    iface hash {
        fn hash() -> uint;
        fn eq(other: self) -> bool;
    }
    
    type t<K,V> = { buckets: [bucket<K,V>] }
    
    fn create<K:hash,V>() -> @t<K,V> { 
        ...
    }
    
    fn put<K:hash,V>(ht: @t<K,V>, k: K, v: V) { 
        ...
    }
    
    fn get<K:hash,V>(ht: @t<K,V>, k: K) -> V {
        ... // fails if not present
    }
}

This looks reasonable.  Now we create two modules, Module1 and
Module2, which both share a reference to the same hashtable, but
which have different implementations of the hash interface in scope:
mod Module1 {
    impl hash for uint {
        fn hash() -> uint { ret self; }
        fn eq(other: int) -> bool { ret self == other; }
    }
    ...
    fn foo() -> str {
        let h = ht::create();
        ht::put(ht, 3u, "hi"); // 3u.hash() == 3u here
        ret Module2::bar(h);
    }
}
 
mod Module2 {
    impl hash for uint {
        fn hash() -> uint { ret self / 2; }
        fn eq(other: int) -> bool { ret self == other; }
    }
    ...
    fn bar(h: @ht::t<uint, str>) -> str {
        ret ht_get(ht, 3u); // fails because 3u.hash() == 1u here
    }
}

Here, Module2::bar() will fail to find any value for 3u because it
is using an incompatible hash implementation.  The annoying problem
here is that both modules are correct in and of themselves, they are
just using mutually inconsistent definitions of hash for uint.
A solution of sorts

We cannot completely rule out the possibility of writing code like
this.  I believe it is inherent in any solution that involves both
scoped implementations (i.e., not global, as in Haskell) and open
types (i.e., methods can be declared even after the type has been
defined, unlike Java or C#).
However, we can make it possible to write code that detects such
mismatches.  The idea has two parts.  First, we allow type variables
in types to have interfaces associated with them:
type HT<K : hash,V> = { ... }

Next, we say that the value for a type variable is not just a type,
like uint, but rather a tuple type : impl* of a type and zero or
more implementations, one for each interface bound.  Therefore, the
full type of the hashtable h in our example would not be
HT<uint,str> but rather HT<uint : Module1::hash, str>.  However,
the user does not (usually) have to write the implementations by hand;
they are supplied by the compiler based on what is in scope for the
given type.
Now, armed with this new idea, the fully specified form of the
function bar() from Module2 would be:
fn bar(h: @ht::t<uint: Module2::hash, str>) { ... }

As a result, the compiler will specify a type error at the call
Module2::bar(h) in Module1::foo(), because the type of the
parameter h in bar() does not match the type of the value being
provided (the implementation modules do not match).
Default implementations

When the value of a type variable is written explicitly, it can be
written either ty:impl1,impl2 (explicit implementations for each
interface) or just ty.  If only a type is specified, the compiler
will look for imported implementations and insert the default value.
Note that when the value for a type argument is a type variable, as
with the variable K here:
fn put<K:hash,V>(ht: @t<K,V>, k: K, v: V);

Defaults do not come into play here.  Remember that the "value" of a
type variable is now the tuple ty:impl*.  So the type @t<K,V> is
actually fully specified. (One caveat: we allow an implicit upcast
from ty:impl1, impl2 to ty:impl1 or ty:impl2 or just ty. In
other words, you are allowed to use K in a location where fewer
interfaces, or just a plain type, is required).
What if I don't care which hash function is being used?

Now, you might argue that it is still weird that the imports in
Module2::bar() are affecting what hash function is used for a
hashtable that is being passed in as a parameter, for which the hash
function has already (presumably) been specified when the hashtable
was created.  There are pluses and minuses to this.
The plus is that this full specification would allow for a very high
degree of optimization if we did monomorphizing: there would be no
dynamic calls at all involved in using the hashtable.  Also, so long
as you use a consistent set of implementations across most of your
methods, there is no annotation overhead since the detail
implementations that the compiler supplies will all match.
However, it would still be nice to be able to write a function that
operates over any hashtable that uses uint as its key type,
regardless of what hash function is in use.  This is basically
polymorphism.  Thankfully, we have an answer to that: interfaces.  So,
we define an interface where the K type variant is not necessarily
hashtable (after all, this interface could also be fulfilled by a
binary tree or some other kind of map):
iface map<K,V> {
    fn put(k: K, v: V);
    fn get(k: K) -> V;
}

And now we have to implement the interface for the type ht::t:
impl<K:hash,V> map<K,V> for ht::t<K,V> {
    fn put(k: K, v: V) { ret ht::put(self, k, v); }
    fn get(k: K) { ret ht::get(self, k); }
}

This interface hides the precise hash implementation, encapsulating it
so to speak.  Now the Module2::bar() function can be rewritten:
fn bar(m: map<uint,str>) -> str {
    ret m.get(3);
}

And the foo function can be written:
fn foo() -> str {
    let h = ht::create();
    ht::put(ht, 3u, "hi");
    ret Module2::bar(h as map);
}

There are some niggling implementation details to be worked out, as
the compiler will (in a dynamic setting) have to stash the hash
implementation pointer somewhere so it is available for use in the
put() and get() functions.  In a monomorphic setting, customized
variants of put() and get() would be generated in any case.