Skip to content

Instantly share code, notes, and snippets.

@djspiewak
Last active November 28, 2023 15:03
Show Gist options
  • Save djspiewak/2ae2570c8856037a7738 to your computer and use it in GitHub Desktop.
Save djspiewak/2ae2570c8856037a7738 to your computer and use it in GitHub Desktop.
Scala Collections Proposal

Collections Redesign Proposal

I'm going to start off by motivating what I'm doing here. And I want to be clear that I'm not "dissing" the existing collections implementation or anything as unproductively negative as that. It was a really good experiment, it was a huge step forward given what we knew back in 2.8, but now it's time to learn from that experiment and do better. This proposal uses what I believe are the lessons we can learn about what worked, what didn't work, and what is and isn't important about collections in Scala.

This is going to start out sounding really negative and pervasively dismissive, but bear with me! There's a point to all my ranting. I want to be really clear about my motivations for the proposal being the way that it is.

Problems

Generic Interfaces

Not. Good. At least, not in the way that they're used by the existing Scala collections library, or the Java collections library for that matter. Not good at all. Here's why:

def foo(is: Seq[Int]): MyThingy = {
  // ...
}

What can we do with is here? Well, a better question is what can we sanely do, because the Scala collections library allows us to do a lot of things that we really shouldn't. For example, we can run is :+ 42 and get a result of type Seq[Int]. Does that even make sense? What if someone chooses to pass in a Seq where prepend is efficient by append isn't, like List? What sort of collection do we get as a result? Does it share any structure with the input? I have no idea.

Here's an even better example. Let's fill out the implementation of this function a bit:

def foo(is: Seq[Int]): MyThingy = {
  val t = new MyThingy
  for (idx <- 0 until is.length) {
    t.accumulate(is(idx))
  }
  t
}

If I showed you this code and asked you what the big-O complexity of it was, could you tell me the answer? No, you cannot, because it depends on the runtime type of is! Collections and complexity are inherently tied together. When you ask for a value of type Seq, what you're saying is that you want a "bunch of stuff", but you don't care about the efficiency of accessing or manipulating that stuff.

Look at it this way: do you ever create a value of type Seq? I mean, the Seq companion object does offer a constructor method. It's totally possible. And yet, we never ever do it. So why do we constantly ask for these values?

Choosing a collection is a very deliberate and intentional thing. We want to select our collection to match the complexity bounds of our problem. Encouraging (and even enabling!) users to work in terms of generic types that throw away this information is wrong, and very very much an anti-pattern.

There are some useful abstraction lines that we can draw, but something as blunt as Seq simply isn't going to cut it, and it encourages absolutely awful behavior.

And of course, the best part about all of this is that the above type signature isn't even right. If we really wanted to do this correctly, we would have to do something like the following:

def foo[CC[a] <: Seq[a]](is: CC[Int])(implicit cbf: CanBuildFrom[CC, Int, CC]): MyThing = ...

You get the picture.

Set vs Set

Have you ever gotten a type error that says something about "expected Set[A], got scala.collection.immutable.Set[A]"? Yeah. That should never ever happen. The fact that this even exists is terrifying and confusing, and they are honestly the type errors that I dread the most when working with Scala (monad transformer stacks are easier to diagnose!).

These errors, of course, come from the parallel and inherited hierarchy between mutable and immutable collections. Frankly, I don't find this hierarchy to be even slightly useful. If I have a mutable collection, I don't want to pass it around masquerading behind a generic immutable interface. It's just not a helpful feature, and it causes a lot of headache. The mutable collections are necessary, and some of them (e.g. ListBuffer and TrieMap) are phenomenally useful, but in no way do they need to be part of any shared pantheon.

Implementation Inheritence

Half of the performance issues in the current collections library stem from specific collection A inheriting an implementation that is shared with B and C, but which was written against Iterable and thus is not optimized for any of the above! The map function is an amazingly apt example of this. If you run map against Vector, it will create a new vector, incrementally, building from left to right and growing the trie through successive array copies, inserting the data with a head pointer each time. This is tremendously less efficient (in constant factors), than what could be done with a custom map implementation on Vector, which is to say allocate the entire trie at once, in place, and then mutate the data in a nice tight loop amenable to cache line optimization.

map is a trivial example, but you get the idea. The current collections library pervasively inherits implementations, mostly because it is tremendously convenient to do so, and all of those implementations fall into the category of "functional, but not great". As a heavy user of the collections library, this becomes a big problem, since I have to think very carefully about what methods are inherited and what methods are overridden and where the performance pitfalls might be. The IList type in Scalaz very notably replicates the functionality of List, but because it is using specialized functions that were implemented specifically for IList, its performance numbers are generally much better than those of List. And this is the most basic collection imaginable! (other than perhaps Option)

Implementation inheritence was originally done because of the large number of duplicated implementations between collections in 2.7, and the number of "gaps" in functionality, where a method you expected to exist everywhere only existed on certain collections, or returned the wrong type. This is a problem that is resolvable in two ways that are not implementation inheritence. First, if your implementations are really copy/paste copies of each other, then why do you have two separate collections? What is the difference between the two at that point? Second, what utilities do we really need and where do they make sense? Implementation inheritence gives us everything everywhere, and with the same type signatures, but that's not necessarily what we want.

CanBuildFrom

You knew it was coming... I have never, ever called map on BitSet passing a function of type Int => String. Not even once. The majority of the schenanigans in CanBuildFrom exist to support the following three features:

  • Bizarre maps on things
    • BitSet
    • String
    • SortedSet
  • TypeTag capturing on Array
  • Mapping on Range

Honestly, I just don't need those features, and goals like making String and Array "collectiony" can be done in better ways, so long as we're not trying to inherit implementations. As for Range, that's a problem of specialization (our lazy collection type only works on numbers).

Custom Collections

Have you ever tried to implement a new collection that sits in the collections hierarchy? I have. It's awful. It is unbelievably difficult to get right. And as a consequence, NO ONE DOES IT! Pretty much everyone just sits outside of the collections hierarchy and has conversion methods to and fro, because that's a radically easier thing to do. And so, part of the major selling point of CanBuildFrom, all of the Like traits and the inheritence, simply goes away! These aren't useful features because no one can use them, because they're far to complicated and arcane.

It's also worth noting that custom collection types have been hard-broken in the past, even in a way that cannot be cross-built around, with the conversion from Traversable to GenTraversable in 2.9. So the compatibility precedent isn't particularly good here, and seeing as I was basically the only one who complained about that particular breakage at the time, I feel confident in saying that very few (if any) would notice if we broke it again.

Goals

Here's what we should strive for:

  • Transparent efficiency, declared in the types (use List if you mean List; Vector if you mean Vector; etc)
  • Implementation reuse of things that are trivially and efficiently defined in terms of other top-level functions
    • cbf() ++= xs is not a "top-level" function
    • zip defined in terms of zipWith is a good example
    • sum defined in terms of fold is another good example
    • It's very easy to go too far here. Only things that really make sense!
  • Abstraction along higher-level boundaries, offering less and more tailored power
    • If you just want to map over a collection, require Functor
    • If you just want to reduce a collection (e.g. sum, etc), require Foldable
  • Convenient and seamless syntax
  • Convenient and seamless errors
  • Types that you can put in front of a beginner without hand-waving about "use-case"
  • Useful extensibility
    • Defining your own collection that can be abstracted over in the same ways as built-ins
    • Getting shared implementations "for free" where it makes sense (e.g. zipWith)
  • Surface compatibility, with deprecation
    • I hate this restriction, but I think it's necessary
    • I will give the caveat though that, any code that uses CanBuildFrom explicitly, any code which uses GenTraversable or any of the crazy inheritence stuff, is to be considered "out of scope"
    • Highly bizarre and never-used "misfeatures" (such as mapping a BitSet to String) are out of scope

The compatibility goal is really important, because every piece of Scala code in existence touches the collections library. However, I have no qualms about hurling vast quantities of deprecation warnings at people who are doing bad things that I want to get rid of. And I also have no qualms about breaking code that exploits internal machinery and implementation details of the current collections API, which is exactly what CanBuildFrom is.

Implementation

Without further ado, have some code. Obviously this is a rough-ish sketch, but I tried to show enough to convince myself (and others) that it would be possible to do everything. The collections library is huge.

sealed trait List[+A] {
def map[B](f: A => B): List[B] = this match {
case head :: tail => f(head) :: (tail map f)
case Nil => Nil
}
}
object List {
implicit object instances extends Applicative[List] { /* ... */ }
}
final case class ::[+A](head: A, tail: List[A]) extends List[A]
case object Nil extends List[Nothing]
// btw, we should have a specialized SingletonVector and EmptyVector instance, because tiny vectors are VERY common and VERY slow
final class Vector[+A] private (trie: Array[AnyRef], val length: Int) {
// note that we can even optimistically respecialize here!
def map[B](f: A => B): Vector[B] = {
val trie2 = allocateTrie(length)
var i = 0
while (i < length) {
// mutate in place!
trie2(i) = trie(i)
i += 1
}
new Vector[B](trie2, length)
}
}
object Vector {
implicit object instances extends Applicative[List] { /* ... */ }
}
// I have ALWAYS wanted these
object +: {
def unapply[A](vec: Vector[A]): Option[(A, Vector[A])] = ???
}
object :+ {
def unapply[A](vec: Vector[A]): Option[(Vector[A], A)] = ???
}
// type Map[A, B] = HashMap[A, B]
final class HashMap[A, B] private (trie: Array[AnyRef], val size: Int) {
def mapValues[C](f: B => C): HashMap[A, C] = ???
// could tuple it for compatibility if we wanted
def map[C](f: (A, B) => C): HashMap[A, C] = ???
}
object HashMap {
// conspicuously absent here: there is no Functor[HashMap]! I'm not going to get all religious on this one, but it DOES get weird to implement
// there would, of course, be a Foldable[HashMap], where fold = foldLeft = foldRight.
}
// typeclasses
// Mappable/CanMap might be less scary of a name, but it doesn't matter
trait Functor[F[_]] {
def create[A](a: A): F[A]
def map[A](fa: F[A])(f: A => B): F[B]
}
object Functor {
def apply[F[_]](implicit F: Functor[F]) = F
// oh yeah, I went there!
implicit object ArrayFunctor extends Functor[Array] {
// "but how?!", you ask
// "ah, through the magic of respecialization!", I answer
//
// it is possible to efficiently inspect the runtime types involved here (notably, the functions)
// and reconstruct the type tags we need to perform this operation. not only that, but the process
// of doing this inspection allows us to safely cast our way into efficient, unboxed access in tight loops
//
// note that we have more information with which to do this sort of thing than we ever used to,
// due to the magic of specialization on Function1 and the explicit nature of TypedTag. In other words,
// we're not recreating the 2.7 arrays madness. we're simply getting sane arrays that finally interop
// in the way you would expect, without imposing the assumptions and inefficiencies of a collections
// framework on top of them
def map[A, B](arr: Array[A])(f: A => B): Array[B] = ???
}
}
implicit class FunctorSyntax[F[_], A](self: F[A])(implicit F: Functor[F]) {
def map[B](f: A => B): F[B] = Functor[F].map(self)(f)
}
// views
// TODO insert jsuereth's proposal, because it's awesome and works perfectly here
// note that in his proposal, his View would have a Foldable instance
// if you want to force a view, you just use one of the reductions on Foldable (which could include stuff like toList/toVector, if we want)
// ranges are just views, btw. this completely sidesteps the problem of mapping a Range with a function that returns String
// the problem it introduces is that mapping ranges, in general, would be lazy. I don't see a way around this, unfortunately.
// could catch it relatively easily with a compiler warning, but there's no question that it would be a breaking semantic change.
//
// foreach would continue to work as expected (since foreach would be a forcing reduction to Unit)
// Zippable?
trait Applicative[F[_]] extends Functor[F] {
// I think this is a more accessible definition of applicative than apply2
// TODO actually this is not a very useful definition. revision coming soon...
def zipWith[A, B, C](fa: F[A], fb: F[B])(f: (A, B) => F[C]): F[C]
def zip[A, B](fa: F[A], fb: F[B]): F[(A, B)] = zipWith(fa, fb) { (_, _) }
}
// Traversable?
trait Foldable[F[_]] {
// folds in an unspecified order, for efficiency
def fold[A, B](fa: F[A])(init: B)(f: (B, A) => B): B
def foldLeft[A, B](fa: F[A])(init: B)(f: (B, A) => B): B
def foldRight[A, B](fa: F[A])(init: B)(f: (A, B) => B): B
def length[A](fa: F[A]): Int = ???
def sum[A](f: F[A])(implicit N: Numeric[A]): A = ???
// more reductions here!
}
trait Indexable[F[_]] extends Foldable[F] {
def get[A](fa: F[A])(idx: Int): Option[A]
// we will use `apply` for the syntax enrichment, for obvious reasons
def unsafeGet(fa: F[A])(idx: Int): A = get(fa).get // TODO more efficient impl
}
// compatibility shims
trait Seq[+A] {
@deprecated
def map[B](f: A => B): Seq[B]
@deprecated
def :+[B >: A](b: B): Seq[B]
// etc
}
object Seq {
implicit def lift[F[_]: Foldable: Functor, A](fa: F[A]): Seq[A] = new Seq[A] {
// hey, if you want to do something like this, the efficiency is on your own head!
def :+[B >: A](b: B): Seq[B] = lift(fa.foldLeft(Vector.empty[B]) { _ :+ _ } :+ b)
// though we will take some small stab at efficiency here... (also there are dirty tricks we can play to make repeated +:/:+ on Seq faster)
def +:[B >: A](b: B): Seq[B] = lift(b :: fa.foldRight(List.empty[B]) { _ :: _ })
// etc
}
}
// we can define other shims analogously
// arrays
implicit class RichArray[A: TypeTag](val arr: Array[A]) extends AnyVal {
// note: we can get REALLY cute here and do some extremely fun optimistic respecialization!
// in other words, we can actually make this function absurdly more efficient than current impls,
// even without compiler help!
def map[B: TypeTag](f: A => B): Array[B] = ...
}
// we can define RichString in a similar way if we want to

Let's start at the top...

Base Collection Types

List is a List. It's not ListLike. It's not Iterable. It's just a List. Now, for the first time, that "implement a List" example that you show in Scala training material can be "implement THE List"! That seems nice. Due to the insane niceness of ListBuffer, we could make tail a var and I wouldn't quibble too much.

Vector is a Vector. It is unconnected to List in any direct sense. Note that both List and Vector have their own map implementations. These implementations aren't overriding anything; nobody inherits map! You have to implement your own map if you want it, and in doing so, you can make it much more efficient. These implementations are significantly faster than the map implementations on these collection types in the existing library, and they even open up the possibility for further optimizations currently impossible in the library. For example, it's possible to fully specialize the tip Array(s) inside of Vector and then optimistically respecialize (without compiler help or type signature changes) the map function (and similar) to treat primitives specially, operating unboxed in while loops without loss of generality. How exciting would it be to have a functional, immutable Vector that beats java.util.ArrayList for primitive operations, but remains a general purpose collection? Pretty darn exciting.

I added extractors for Vector, because I've always wanted them and they make a heck of a lot more sense (and can be implemented much more efficiently) if they only work on Vector. Call it a wishlist.

Map[A, B] = HashMap[A, B]. One interesting thing here is that the map function on HashMap is different from map on List or Vector. For compatibility reasons, we could tuple this function without a problem, but the point is that we don't have to. Everyone has their own map, and functions like zip aren't even defined on HashMap (which is good, because I have no idea what that even means in the current library).

Implicits and Reuse

Here's where we start getting into more interesting waters. While declaring a function that expects a Seq is not a useful thing to do, as a general rule, declaring a function that expects something for which there is a map operation is. This is where the typeclasses come into play. I named them Functor, Foldable, Indexable, and there would be a Monad, but I really don't care about the names. Call it HasMap or Mappable or something if it seems more user-friendly, and that's not a facetious suggestion! Usability is important.

This comes round to the goal of having abstraction defined along boundaries where it makes sense. Seq is not a sensible abstraction boundary, but map, zip, foldLeft, sum, and more are. We can also pick and choose more precisely which elements we need (do we need zip, or just map? do we really need fold?), as opposed to being stuck with the "all or nothing" that is inheritence.

This also gives us implementation reuse along reasonable lines. The zip implementation is trivially defined in terms of zipWith (which is how I defined Applicative, which you could call Zippable if you wanted). This sort of implementation reuse makes sense. Unless your collection happens to have some magically strange treatment of tuples, there would never really be a reason to have an implementation of zip that is distinct from the implementation of zipWith. So that's the litmus test. If the use-cases which falsify the sharing are really outlandish (but still achievable), then it's ok to share.

Critically, because we're using type classes and syntax enrichments (which I exemplify in my code), it is very easy to override these implementations and get the most-specific types, because the structure of the typeclass will force you to do so. We don't need CanBuildFrom or even a weaker variant of it to support sharing of implementation. We can rely on more traditional typeclasses.

Views

@jsuereth's views proposal (implementing everything on top of foldLeft) is fantastic, though I would argue it would be better implemented on top of fold instead of foldLeft, to allow for more efficient traversals in some cases. His proposal also fits beautifully into my broader collections framework. In fact, a lot of his machinery would simply disappear by implementing Foldable[View]. Reductions are forcing, obviously, and their implementations are neatly reused since they're all defined in terms of fold!

On another sidebar, I consider toList, toVector and such to be reductions. You're just folding your Foldable (or View, as the case may be) into a specific collection type. I use toList and toVector quite a bit in practical code in order to specifically side-step CanBuildFrom witchcraft on Seq and IndexedSeq.

Ranges

In my proposal, Range no longer exists, and instead we simply use View! Obviously it's possible to define foreach on Foldable (as indeed we should), and it would be an eager implementation (effectively, reducing to Unit). This completely sidesteps the whole "mapping a Range to String" problem, because Range itself is nothing special. There's no problem with a View[String] any more than there is a problem with View[Int].

Here's the cloud to the silver lining though, and those who remember the whole withFilter debate are probably already thinking about it: map on View is lazy:

// in my proposal, this prints 0, 1, 2, ..., 9
for (i <- 0 until 10) {
  println(i)
}

// in my proposal, this... does nothing
for (i <- 0 until 10) yield {
  println(i)
}

Back in the day, this was a huge gotcha for Scala beginners, and I have very little reason to believe it wouldn't still be a huge gotcha. I honestly don't know what to do about this. Defining type Range = View[Int] is just incredibly appealing, and it makes everything much much simpler. One possible idea would be to have a special compiler warning that checks for "discarded" values of type View, which are likely to be errors of the sort we see above. Honestly, I'd like a compiler warning that checks for discarded values in general, but that's just me.

This would break backwards compatibility. There's no two ways around it. This is a common use-case, and my proposal breaks it (possibly with a warning, but still...) and recreates a situation which was widely considered a "beginner trap" back in the day. I'm still thinking about this problem, and I'd like to find a solution. For now though... #frownyface

Shims

Here we get to the really insane part of the proposal: compatibility shims. See, the problem is that, as much as I hate Seq and I honestly and truly believe it is never the right thing to use, it does exist and has existed for a long time. People have written vast swaths of code which use it, and we can't just arbitrarily decide to break it.

My implementation of Seq in this proposal is available as an implicit conversion from any type which has the appopriate typeclasses defined (in this case, Foldable and Functor, though we may want something else). Within Seq, we can implement all of the existing signatures in terms of these typeclasses. The implementations will not be efficient, but that doesn't matter. Anyone who is using Seq to do real things doesn't know what their efficiency properties are now (I certainly don't), so I feel pretty justified in making this a slow path. Everything in Seq would be @deprecated.

We can play the same trick with IndexedSeq, Iterable and similar. In fact, with a bit of inheritence, we can reuse implementations between these shims (here, for a change, reuse by inheritence doesn't hurt us), so I don't expect too much code duplication here.

Array, String and other

Ah, Array! Worked about 80% of the time in 2.7. Works 100% of the time in 2.8, but no one understands why, and it seems to have a very bipolar identity. This proposal has a third idea for how we can make it work.

Optimistic respecialization is something that people have been toying with a lot recently. Basically, the trick takes advantage of two things. First, the value of classOf[Array[Int]] is static and can be acquired at class load time without penalty. Second, the value of classOf[Int => Int] (and all similar functions) is also static, unique and can be acquired at class load time without penalty. Being able to acquire and use these values allows us to write generic type signatures which preserve parametricity (such as the map function on RichArray, which takes a TypeTag in my proposal but doesn't actually need it) and which recover the underlying type information. (remember, anything that is not a primitive array is just Array[AnyRef])

This is really exciting, because it means that, for most of what we want to do inside of RichArray, we can implement generic code that operates directly on arrays in tight loops without boxing. This suddenly makes the entirety of our collections operations usable on Array, which is a type that is generally only used by people who really want efficiency.

There may be some weird corner cases currently possible with the "seq-like" String and Array variants in present-day Scala that we cannot make possible in this proposal. I don't see anything important though. Feedback welcome. The tl;dr though is that syntax enrichment (via implicit classes) is a far better way to make String and Array behave in nice ways than is magical implicit coercion with CanBuildFrom constraints, which is what we have today.

@djspiewak
Copy link
Author

I don't agree that lazy staged behavior should be the default. I do agree that it makes sense in a very large number of cases, far more than is generally believed right now. But when I use a function on a collection, I want to know what its performance characteristics are and I want to know where that time/memory is being spent. Furthermore, staged operations of that sort inhibit compositionality, since you can no longer have a map which is of type List[A] => List[B]. Instead, you get List[A] => Staged[List, B], with a paired function Staged[F, _] ~> F, which is far less useful IMO.

Furthermore, you miss out on efficiencies which could otherwise be achieved. For example, drop on List is very efficient to implement in a direct fashion. If you stage the operation, you get the flexibility of fusing it with map, filter and other such operations to achieve potentially higher efficiency. However, if you stage that operation and it isn't fused with anything else (i.e. you really just wanted drop and nothing else), then the resulting efficiency of the drop . toList is enormously lower than simply evaluating drop directly.

So there are tradeoffs, and while staging and fusing is quite a bit more common than one would naively expect, it may or may not be the most common use-case. And even if it is, it's certainly not what people expect, and the tradeoffs it carries are severe and unintuitive.

@fanf
Copy link

fanf commented Oct 16, 2015

@djspiewak sorry for the possibly dumb question, but for the missing on some efficiency part, is it an inherent limitation of the viewducer model? There is no way to make the toList be a noop?
And next question: how Paul achieve it in psp? Not with viewducer, I suppose, and if so, does his method is uncompatible with your proposition (again, sorry for what my be silly question/evident answer).

@m50d
Copy link

m50d commented Oct 16, 2015

@tpolecat sure, if you mutate arrays. Equally the instance for List is invalid if you call System.identityHashCode. The question is then which subset of Scala we define the functor laws in relation to.

I'd argue:

  1. Through accidents of history - partly Java's varargs but mostly the lack of immutable collections in the Java standard library - many JVM libraries accept or return Array even when there is no intent to mutate the array.
  2. Many users find that the most effective approach when using such libraries is to assume that arrays are not mutated and/or that they're used linearly (rather than e.g. defensively copy). It is unfortunate that these assumptions are unchecked by the type system and they undoubtedly increase the defect rate for such code; nevertheless it remains so.
  3. In code which is already making that assumption (i.e. will already be incorrect if that assumption is violated), the existence of a Functor[Array] instance is a strict improvement. I'm not against drawing a distinction (feature flag, different imports, or the like) between Functor instances which are valid only under the assumption that Arrays are not mutated and Functor instances which are valid under weaker assumptions, but I do want the expressiveness advantages of Functor[Array] to at least be available to code which explicitly opts into it.
  4. Use cases in which it is desirable both to call existing JVM libraries and to use abstractions like Functor are Scala's most important constituency, because they form the language's USP - use cases for a Scala-like language which don't involve calling existing JVM libraries are well served by Haskell, and use cases which don't involve immutability and/or Functor-like abstractions have a range of language options available on the JVM.

@mancvso
Copy link

mancvso commented Oct 31, 2015

I just wanted to note that, in a latter stage, the amount of implicits and traits should be reduced to avoid delivering even grater compile times.

Artima's article
http://www.artima.com/articles/compile_time.html

@yawaramin
Copy link

@mancvso note that implicit resolution times are about to drop significantly thanks to Miles Sabin's work on inductive resolution https://twitter.com/milessabin/status/779298833981435904

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment