Comparing execution time of PSeq and ParStream while preprocessing a 1.4 GB text file

Using ParStream

  • Real: 00:00:32.700, CPU: 00:02:01.165, GC gen0: 285, gen1: 80, gen2: 5
  • Real: 00:00:31.421, CPU: 00:01:59.621, GC gen0: 234, gen1: 65, gen2: 4
  • Real: 00:00:32.865, CPU: 00:02:02.335, GC gen0: 250, gen1: 70, gen2: 5
  • Average 32328.7ms

Using PSeq

  • Real: 00:00:34.108, CPU: 00:02:06.470, GC gen0: 203, gen1: 58, gen2: 5
  • Real: 00:00:33.386, CPU: 00:02:06.392, GC gen0: 224, gen1: 63, gen2: 5
  • Real: 00:00:33.720, CPU: 00:02:07.764, GC gen0: 218, gen1: 63, gen2: 5
  • Average 33738.0ms

Improvement: ~4% (ParStream over PSeq, by average wall-clock time)
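
For reference, the averages and the improvement figure can be reproduced from the "Real" times listed above. This is only a small arithmetic sketch; the variable names are made up for illustration and are not part of the benchmark script.

```fsharp
// Wall-clock ("Real") times from the runs above, in milliseconds.
let parStreamMs = [ 32700.0; 31421.0; 32865.0 ]
let pseqMs      = [ 34108.0; 33386.0; 33720.0 ]

let parStreamAvg = List.average parStreamMs
let pseqAvg      = List.average pseqMs

// Relative reduction in wall-clock time when switching from PSeq to ParStream.
let improvementPct = (pseqAvg - parStreamAvg) / pseqAvg * 100.0

printfn "ParStream avg: %.1f ms, PSeq avg: %.1f ms, improvement: %.1f%%"
    parStreamAvg pseqAvg improvementPct
// prints: ParStream avg: 32328.7 ms, PSeq avg: 33738.0 ms, improvement: 4.2%
```

Note that CPU time is roughly four times the wall-clock time in both cases, so both variants keep several cores busy; the difference is in how quickly the work finishes.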

#time "on"

let readAllHashes nSkip nSentences =
    sentencesFile
    |> File.ReadLines
    |> ParStream.ofSeq
    |> ParStream.skip nSkip
    |> ParStream.take nSentences
    |> ParStream.map(fun line ->
        // keep only the text after the first space on the line
        let separatorIndex = line.IndexOf(' ')
        let sentenceItself = line.Substring(separatorIndex + 1)
        sentence2hashes sentenceItself)

let shinglingBeginingAndEnd nSkip nSentences =
    readAllHashes nSkip nSentences
    |> ParStream.toArray
    // the array index serves as the sentence id
    |> Array.mapi(fun id hashes -> spair id hashes)
    // emit one (shingle, sentence id) pair per shingle
    |> Array.collect(fun pair ->
        let id = spair_fst pair
        let hashes = spair_snd pair
        shingling nWordInShingle hashes
        |> Array.map(fun shingle -> spair shingle id))

let allGroups nSkip nSentences =
    shinglingBeginingAndEnd nSkip nSentences
    // group by sub sets of hashes
    |> ParStream.ofArray
    |> ParStream.groupBy(fun pair -> spair_fst pair)
    // filter out groups with only single element inside
    |> ParStream.filter(fun (_, g) -> g |> Seq.length > 1)
    // keep only the sentence ids of each group
    |> ParStream.map(fun (_, g) -> g |> Seq.map(fun pair -> spair_snd pair) |> Array.ofSeq)
    |> ParStream.toArray

allGroups nSkip nSentences
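
The script above depends on helpers that are not shown here (`sentencesFile`, `sentence2hashes`, `shingling`, `spair`, `nWordInShingle`). To make the shape of the pipeline easier to follow, here is a minimal sequential sketch on three in-memory lines. It uses only FSharp.Core (`Array.groupBy` needs F# 4.0 or later) in place of ParStream, plain tuples in place of `spair`, and the helper bodies are assumptions made up for illustration, not the real implementations.

```fsharp
// Hypothetical stand-ins; the real definitions are not part of the gist.
let nWordInShingle = 3

// Assumed: one hash per word.
let sentence2hashes (sentence: string) : int[] =
    sentence.Split([| ' ' |], System.StringSplitOptions.RemoveEmptyEntries)
    |> Array.map hash

// Assumed: every contiguous window of n hashes forms one shingle.
let shingling (n: int) (hashes: int[]) : int list[] =
    if hashes.Length < n then [||]
    else hashes |> Array.windowed n |> Array.map List.ofArray

// Same line layout as the script expects: a prefix, a space, then the sentence.
let lines =
    [| "0 the quick brown fox jumps"
       "1 a quick brown fox sleeps"
       "2 something entirely different here" |]

let groups =
    lines
    |> Array.map (fun line -> sentence2hashes (line.Substring(line.IndexOf(' ') + 1)))
    |> Array.mapi (fun id hashes -> id, hashes)
    |> Array.collect (fun (id, hashes) ->
        shingling nWordInShingle hashes |> Array.map (fun shingle -> shingle, id))
    |> Array.groupBy fst                           // group by shingle
    |> Array.filter (fun (_, g) -> g.Length > 1)   // drop shingles seen only once
    |> Array.map (fun (_, g) -> g |> Array.map snd)

printfn "%A" groups
```

With this input, only the shingle "quick brown fox" occurs in more than one sentence, so the result should be a single group containing sentence ids 0 and 1.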

palladin commented Dec 2, 2014

I made some changes to see if we can improve the performance:
https://gist.github.com/palladin/bc278fc010e4d244ef7a
You will need the latest source from the master branch:
https://github.com/nessos/Streams/tree/master

akimboyko (gist owner and author) commented Dec 6, 2014

Hi Nick,
Thanks for your reply! I tried your sample and found the following:

  • Stream.mapi is a nice addition to the API
  • ParStream.mapi is definitely needed in the API too!
  • In my case, reading the file using File.ReadLines |> ParStream.ofSeq works faster than sequential reading using Stream on files > 1 GB (maybe due to the SSD or RAID-10 on the server)
  • Not sure about the inline keyword, because I'm not calling these functions many times or within loops
  • Stream.groupBy returns Stream<'Key * Seq<'T>>; maybe changing it to Stream<'Key * Stream<'T>> would be better?
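
To illustrate why an indexed map matters for this workload: the sentence id has to reflect the order of lines in the file, so it must be attached before (or as part of) the parallel stage. A minimal sketch of that idea, using FSharp.Core's `Array.Parallel.map` rather than the Streams library, with a made-up `processLine` standing in for the real per-line work:

```fsharp
// Hypothetical stand-in for the per-line work (sentence2hashes in the gist).
let processLine (line: string) = line.Length

let lines = [ "first line"; "second"; "third one" ] |> Seq.ofList

let results =
    lines
    |> Seq.mapi (fun id line -> id, line)   // ids assigned sequentially, in input order
    |> Array.ofSeq
    |> Array.Parallel.map (fun (id, line) -> id, processLine line)

printfn "%A" results
```

Because the index is assigned before the parallel step, each result keeps the id of the line it came from regardless of which thread processed it.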