@espringe
Last active December 17, 2015 23:39
Call for a better way to share and use code

We need a way, way better way to publish and use simple libraries, functions, classes, etc.

Or for people who prefer a convoluted example:

I wanted to add a precondition to my function that every element is unique. If there's a duplicate, a helpful assert should crash the program (in dev mode, at least).

A quick, hackish solution might be to use Scala's standard library:

assert(data.distinct.size == data.size, "Data contains a dupe")

While it works, it's not particularly efficient (data contains millions of elements) and, worst of all, it gives a rather useless error: it doesn't say which element is the duplicate.

Writing something a bit more specialized is honestly not too bad:

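  // Returns the first element that appears a second time, or None if all are unique.
  def hasDuplicate[T](in: Iterable[T]): Option[T] = {
    // Every element seen so far
    val c = collection.mutable.Set[T]()

    for (i <- in) {
      if (c.contains(i))
        return Some(i) // seen before: this is the duplicate
      else
        c += i
    }

    None
  }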
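
Wired into the precondition, it might look something like the sketch below. The `process` function is hypothetical (a stand-in for the real function), and the helper is repeated in a compact, equivalent form only so the snippet compiles on its own.

```scala
object DupeCheck {
  // Compact equivalent of the helper above, repeated so this sketch is self-contained.
  def hasDuplicate[T](in: Iterable[T]): Option[T] = {
    val seen = collection.mutable.Set[T]()
    // Set.add returns false when the element was already present.
    in.find(i => !seen.add(i))
  }

  // `process` is a hypothetical function standing in for the real one.
  def process(data: Seq[String]): Int = {
    val dupe = hasDuplicate(data)
    // The message is passed by name, so dupe.get only runs when the assert fails.
    assert(dupe.isEmpty, s"Data contains a dupe: ${dupe.get}")
    data.size
  }
}
```

Now the error names the offending element instead of just saying that one exists.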

At this point, it does the job. Not particularly well, mind you: it uses hundreds of megabytes of memory. There are probably a few superficial optimizations I could make (use a concrete class instead of an interface), but they're not going to change much. It's still going to use O(n) memory and cause millions of allocations. But at this point I can't justify working on it any longer; it's code for a stupid assert. And if it becomes too much of a problem, I'll just remove the assert rather than try to optimize it.

Perhaps the best way to solve this is to use an appropriately sized Bloom filter to narrow the initial data down to a tiny set of potential duplicates (with offsets to where they could be), then search for those duplicates (perhaps even using a binary Aho–Corasick search). It's an interesting problem, but writing such code would not be trivial: it would take benchmarks and a lot of tests. With almost no chance of reuse, it's simply not feasible for me to write.
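
To make the idea concrete, here is a rough, untuned sketch of the Bloom filter part only. It is an assumption about how such a thing could look, not a benchmarked implementation: the names (`BloomDupe`, `findDuplicate`), the default sizes, and the hash mixing are all made up for illustration, and the second pass is a plain exact check over the candidates rather than the offset tracking or Aho–Corasick search mentioned above.

```scala
import java.util.BitSet

object BloomDupe {
  // numBits and numHashes are illustrative guesses, not tuned values.
  def findDuplicate[T](data: Iterable[T],
                       numBits: Int = 1 << 24,
                       numHashes: Int = 4): Option[T] = {
    val bits = new BitSet(numBits)
    val candidates = collection.mutable.Set[T]()

    // Pass 1: anything the filter claims to have seen becomes a candidate.
    // A Bloom filter has no false negatives, so every real duplicate ends up
    // here, along with a few false positives.
    for (x <- data) {
      val h1 = x.##
      // Ad hoc second hash for double hashing (an assumption, not a vetted mixer).
      val h2 = Integer.rotateLeft(h1 * 0x9E3779B9, 15) | 1
      var maybeSeen = true
      var i = 0
      while (i < numHashes) {
        val idx = Math.floorMod(h1 + i * h2, numBits)
        if (!bits.get(idx)) { maybeSeen = false; bits.set(idx) }
        i += 1
      }
      if (maybeSeen) candidates += x
    }

    // Pass 2: exact check, but only for the (hopefully tiny) candidate set.
    val confirmed = collection.mutable.Set[T]()
    data.find(x => candidates.contains(x) && !confirmed.add(x))
  }
}
```

With a sensibly sized filter, the exact sets only ever hold the candidates, so memory is dominated by the fixed-size bit array. The cost is a second pass over the data, and whether that is actually a win would need exactly the benchmarks and tests mentioned above.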

I'm sure other people, smarter people, have run into the same problem and done the work. But given how hugely onerous it currently is to create libraries (and, to a lesser extent, to consume them), there's almost no chance I'll be able to easily reuse their code and work.

What we need is for importing code to be trivial, something like:

import github.espringe.has-dupe#f3244

It needs to be safe and reliable (i.e. it can't just disappear) and, importantly, it needs to be super trivial for people to expose their work like this. Having to do much more than push it to some source hosting is too onerous.

C'mon guys, we can do better. It's 2013 and the preferred way of sharing snippets of code is copy-pasting from the first Stack Overflow result on Google...
