public
Created

Faster char split

  • Download Gist
gistfile1.scala
Scala
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
// Since this link got passed around some, I'm going to paste here the
// versions made by sirthias. If you want to see the old ones,
// look up the history.
 
// Well, actually, these are non-pimped versions. See the comments to
// a link where they are provided as pimped methods on String.
 
import scala.annotation.tailrec
 
def fastSplit(s: String, delimiter: Char): List[String] = {
@tailrec
def split(end: Int, elements: List[String]): List[String] = {
val ix = s.lastIndexOf(delimiter, end - 1)
if (ix < 0)
s.substring(0, end) :: elements
else
split(ix, s.substring(ix + 1, end) :: elements)
}
split(s.length, Nil)
}
 
// And this one produces a Stream, so you don't need to split any more segments than you need.
 
def lazySplit(s: String, delimiter: Char): Stream[String] = { // based on an implemented by Jed Wesley-Smith
def split(start: Int): Stream[String] = {
val ix = s.indexOf(delimiter, start)
if (ix < 0)
Stream.cons(s.substring(start), Stream.Empty)
else
Stream.cons(s.substring(start, ix), split(ix + 1))
}
split(0)
}
 
// This produces an Iterator, which is faster than Stream
 
def splitIterator(s: String, delimiter: Char): Iterator[String] = new Iterator[String] {
private var hasNextFlag = true
var start = 0
def hasNext = hasNextFlag
def next =
if (hasNext) {
val ix = s.indexOf(delimiter, start)
val result = if (ix < 0) {
hasNextFlag = false
s.substring(start)
} else s.substring(start, ix)
start = ix + 1
result
} else throw new java.util.NoSuchElementException("next on empty iterator")
}

Daniel, why not a Stream?

def split(s: String, c: Char): Stream[String] = {
def stream(i: Int): Stream[String] = {
val p = s indexOf (c, i)
if (p < 0) Stream.cons(s.substring(i), Stream.Empty)
else Stream.cons(s.substring(i, p), { println(p); stream(p + 1) })
}
stream(0)
}

You know you can fork the gist and then produce your version?

Great idea, actually! Once I replaced a split in Perl with a regex match, and reduced an execution time from 30 minutes to 7 seconds, precisely because I did not need to split the whole thing, just the first 100 or so fields.

didn't realise about the forking actually, nice!

yeah, this one can have nicer memory usage particularly if you don't iterate across all elements

You can make the eager version of this even cleaner and more efficient by searching for the delimiter from the back of the string rather than the front.
I just committed a polished version of these two implementations (with added tests) to spray:

https://github.com/spray/spray/commit/77b962a4ca8c71692c5b917bbcc6353a90c79ce2

Please sign in to comment on this gist.

Something went wrong with that request. Please try again.