I'm going to start off by motivating what I'm doing here. And I want to be clear that I'm not "dissing" the existing collections implementation or anything as unproductively negative as that. It was a really good experiment, it was a huge step forward given what we knew back in 2.8, but now it's time to learn from that experiment and do better. This proposal uses what I believe are the lessons we can learn about what worked, what didn't work, and what is and isn't important about collections in Scala.
This is going to start out sounding really negative and pervasively dismissive, but bear with me! There's a point to all my ranting. I want to be really clear about my motivations for the proposal being the way that it is.
Not. Good. At least, not in the way that they're used by the existing Scala collections library, or the Java collections library for that matter. Not good at all. Here's why:
def foo(is: Seq[Int]): MyThingy = {
// ...
}
What can we do with is
here? Well, a better question is what can we sanely do, because the Scala collections library allows us to do a lot of things that we really shouldn't. For example, we can run is :+ 42
and get a result of type Seq[Int]
. Does that even make sense? What if someone chooses to pass in a Seq
where prepend is efficient by append isn't, like List
? What sort of collection do we get as a result? Does it share any structure with the input? I have no idea.
Here's an even better example. Let's fill out the implementation of this function a bit:
def foo(is: Seq[Int]): MyThingy = {
val t = new MyThingy
for (idx <- 0 until is.length) {
t.accumulate(is(idx))
}
t
}
If I showed you this code and asked you what the big-O complexity of it was, could you tell me the answer? No, you cannot, because it depends on the runtime type of is
! Collections and complexity are inherently tied together. When you ask for a value of type Seq
, what you're saying is that you want a "bunch of stuff", but you don't care about the efficiency of accessing or manipulating that stuff.
Look at it this way: do you ever create a value of type Seq
? I mean, the Seq
companion object does offer a constructor method. It's totally possible. And yet, we never ever do it. So why do we constantly ask for these values?
Choosing a collection is a very deliberate and intentional thing. We want to select our collection to match the complexity bounds of our problem. Encouraging (and even enabling!) users to work in terms of generic types that throw away this information is wrong, and very very much an anti-pattern.
There are some useful abstraction lines that we can draw, but something as blunt as Seq
simply isn't going to cut it, and it encourages absolutely awful behavior.
And of course, the best part about all of this is that the above type signature isn't even right. If we really wanted to do this correctly, we would have to do something like the following:
def foo[CC[a] <: Seq[a]](is: CC[Int])(implicit cbf: CanBuildFrom[CC, Int, CC]): MyThing = ...
You get the picture.
Have you ever gotten a type error that says something about "expected Set[A]
, got scala.collection.immutable.Set[A]
"? Yeah. That should never ever happen. The fact that this even exists is terrifying and confusing, and they are honestly the type errors that I dread the most when working with Scala (monad transformer stacks are easier to diagnose!).
These errors, of course, come from the parallel and inherited hierarchy between mutable and immutable collections. Frankly, I don't find this hierarchy to be even slightly useful. If I have a mutable collection, I don't want to pass it around masquerading behind a generic immutable interface. It's just not a helpful feature, and it causes a lot of headache. The mutable collections are necessary, and some of them (e.g. ListBuffer
and TrieMap
) are phenomenally useful, but in no way do they need to be part of any shared pantheon.
Half of the performance issues in the current collections library stem from specific collection A
inheriting an implementation that is shared with B
and C
, but which was written against Iterable
and thus is not optimized for any of the above! The map
function is an amazingly apt example of this. If you run map
against Vector
, it will create a new vector, incrementally, building from left to right and growing the trie through successive array copies, inserting the data with a head pointer each time. This is tremendously less efficient (in constant factors), than what could be done with a custom map
implementation on Vector
, which is to say allocate the entire trie at once, in place, and then mutate the data in a nice tight loop amenable to cache line optimization.
map
is a trivial example, but you get the idea. The current collections library pervasively inherits implementations, mostly because it is tremendously convenient to do so, and all of those implementations fall into the category of "functional, but not great". As a heavy user of the collections library, this becomes a big problem, since I have to think very carefully about what methods are inherited and what methods are overridden and where the performance pitfalls might be. The IList
type in Scalaz very notably replicates the functionality of List
, but because it is using specialized functions that were implemented specifically for IList
, its performance numbers are generally much better than those of List
. And this is the most basic collection imaginable! (other than perhaps Option
)
Implementation inheritence was originally done because of the large number of duplicated implementations between collections in 2.7, and the number of "gaps" in functionality, where a method you expected to exist everywhere only existed on certain collections, or returned the wrong type. This is a problem that is resolvable in two ways that are not implementation inheritence. First, if your implementations are really copy/paste copies of each other, then why do you have two separate collections? What is the difference between the two at that point? Second, what utilities do we really need and where do they make sense? Implementation inheritence gives us everything everywhere, and with the same type signatures, but that's not necessarily what we want.
You knew it was coming... I have never, ever called map
on BitSet
passing a function of type Int => String
. Not even once. The majority of the schenanigans in CanBuildFrom
exist to support the following three features:
- Bizarre
map
s on thingsBitSet
String
SortedSet
TypeTag
capturing onArray
- Mapping on
Range
Honestly, I just don't need those features, and goals like making String
and Array
"collectiony" can be done in better ways, so long as we're not trying to inherit implementations. As for Range
, that's a problem of specialization (our lazy collection type only works on numbers).
Have you ever tried to implement a new collection that sits in the collections hierarchy? I have. It's awful. It is unbelievably difficult to get right. And as a consequence, NO ONE DOES IT! Pretty much everyone just sits outside of the collections hierarchy and has conversion methods to and fro, because that's a radically easier thing to do. And so, part of the major selling point of CanBuildFrom
, all of the Like
traits and the inheritence, simply goes away! These aren't useful features because no one can use them, because they're far to complicated and arcane.
It's also worth noting that custom collection types have been hard-broken in the past, even in a way that cannot be cross-built around, with the conversion from Traversable
to GenTraversable
in 2.9. So the compatibility precedent isn't particularly good here, and seeing as I was basically the only one who complained about that particular breakage at the time, I feel confident in saying that very few (if any) would notice if we broke it again.
Here's what we should strive for:
- Transparent efficiency, declared in the types (use
List
if you meanList
;Vector
if you meanVector
; etc) - Implementation reuse of things that are trivially and efficiently defined in terms of other top-level functions
cbf() ++= xs
is not a "top-level" functionzip
defined in terms ofzipWith
is a good examplesum
defined in terms offold
is another good example- It's very easy to go too far here. Only things that really make sense!
- Abstraction along higher-level boundaries, offering less and more tailored power
- If you just want to
map
over a collection, requireFunctor
- If you just want to reduce a collection (e.g.
sum
, etc), requireFoldable
- If you just want to
- Convenient and seamless syntax
- Convenient and seamless errors
- Types that you can put in front of a beginner without hand-waving about "use-case"
- Useful extensibility
- Defining your own collection that can be abstracted over in the same ways as built-ins
- Getting shared implementations "for free" where it makes sense (e.g.
zipWith
)
- Surface compatibility, with deprecation
- I hate this restriction, but I think it's necessary
- I will give the caveat though that, any code that uses
CanBuildFrom
explicitly, any code which usesGenTraversable
or any of the crazy inheritence stuff, is to be considered "out of scope" - Highly bizarre and never-used "misfeatures" (such as mapping a
BitSet
toString
) are out of scope
The compatibility goal is really important, because every piece of Scala code in existence touches the collections library. However, I have no qualms about hurling vast quantities of deprecation warnings at people who are doing bad things that I want to get rid of. And I also have no qualms about breaking code that exploits internal machinery and implementation details of the current collections API, which is exactly what CanBuildFrom
is.
Without further ado, have some code. Obviously this is a rough-ish sketch, but I tried to show enough to convince myself (and others) that it would be possible to do everything. The collections library is huge.
@m50d @djspiewak the problem with
Functor[Array]
is that it violates the identity law. If you take an array referencea
and replace it witha.map(identity)
or vice-versa, the functor law says that this must be a no-op. But doing this can change the behavior of your program because what you get back frommap
is a new reference, and reference equality is important for mutable structures (as opposed to immutable structures, where [ostensible] parametricity renders it unobservable). In general this is why scalaz has no instances for mutable types.