Skip to content

Instantly share code, notes, and snippets.

@Vindaar
Created November 8, 2019 20:54
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save Vindaar/0fe9340d40e3b70c006e8b5c3c380f87 to your computer and use it in GitHub Desktop.
Save Vindaar/0fe9340d40e3b70c006e8b5c3c380f87 to your computer and use it in GitHub Desktop.

fast code + fast code = slow code

Apparently.

So after finishing with the major refactoring of the scale filling in ggplotnim I re-ran the chapter2 examples file.

Before the refactor that file ran in:

basti at void in ~/CastData/ExternCode/ggplotnim/bookPlots ツ nim c chapter2.nim
basti at void in ~/CastData/ExternCode/ggplotnim/bookPlots ツ time ./chapter2
./chapter2  7.07s user 0.13s system 99% cpu 7.229 total
basti at void in ~/CastData/ExternCode/ggplotnim/bookPlots ツ nim c -d:danger chapter2.nim
basti at void in ~/CastData/ExternCode/ggplotnim/bookPlots ツ time ./chapter2
./chapter2  1.51s user 0.13s system 99% cpu 1.645 total

which I assumed were (unfortunately) the current run times, due to a combination of parsing the ~53,000 line long diamonds.csv file, converting it to a data frame and then handling the drawing of all those points.

(Aside: For instance for simplicity each graphical primitive is drawn individually in cairo creating and destroying a drawing context each, instead of accumulating all objects and handling all in one go (for that one would have to revert all transformations, which I wanted to avoid in the beginning). Also I know that the data frame implementation is quite slow at the moment, since everything is essentially an “object type” in the language of numpy and pandas (each individual element in the DF is stored as a Value object, which is similar to a JsonNode).)

Anyways, I was slightly disappointed to find that after the refactor I got:

basti at void in ~/CastData/ExternCode/ggplotnim/bookPlots ツ nim c chapter2.nim
basti at void in ~/CastData/ExternCode/ggplotnim/bookPlots ツ time ./chapter2
./chapter2  29.27s user 0.17s system 99% cpu 29.568 total
basti at void in ~/CastData/ExternCode/ggplotnim/bookPlots ツ nim c -d:danger chapter2.nim
basti at void in ~/CastData/ExternCode/ggplotnim/bookPlots ツ time ./chapter2 
./chapter2  2.36s user 0.08s system 98% cpu 2.466 total

I knew that the code would most likely be a little slower than before, because I changed the data handling without thinking about performance on large datasets (instead of accessing the DF directly, I wrapped the data access in a closure for a more uniform handling for different kinds of geoms, which included another level of nesting Value -> ScaleValue…).

However, the code being slower by more than a factor of 4 made me investigate what the heck is going on. Since I knew that by far the slowest part was

  • the reading of the diamonds csv file
  • the creation of its DF
  • the drawing of its plot

with the other plots probably making up only a small portion of the time, I just commented out the diamonds related code and got:

#let diamonds = toDf(readCsv("../data/diamonds.csv"))
let economics = toDf(readCsv("../data/economics.csv"))

# works but ylabels are to close to the plot so that they are
# on top of the tick values
# also not as fast as it could be :/
#ggplot(diamonds, aes("carat", "price")) +
#  geom_point() +
#  ggsave("figs/2.3_5.pdf")
basti at void in ~/CastData/ExternCode/ggplotnim/bookPlots ツ nim c chapter2.nim
basti at void in ~/CastData/ExternCode/ggplotnim/bookPlots ツ time ./chapter2   
./chapter2  0.80s user 0.01s system 99% cpu 0.808 total

which kind of proved that to me. So then I only commented out the diamonds plot creation, but keeping the CSV parsing + DF conversion still in and got:

let diamonds = toDf(readCsv("../data/diamonds.csv"))
let economics = toDf(readCsv("../data/economics.csv"))

# works but ylabels are to close to the plot so that they are
# on top of the tick values
# also not as fast as it could be :/
#ggplot(diamonds, aes("carat", "price")) +
#  geom_point() +
#  ggsave("figs/2.3_5.pdf")
basti at void in ~/CastData/ExternCode/ggplotnim/bookPlots ツ time ./chapter2
./chapter2  15.16s user 0.03s system 99% cpu 15.213 total

so half the time accounted for I guess? Let’s also remove the DF conversion:

let diamondsCsv = readCsv("../data/diamonds.csv")
#let diamonds = toDf(readCsv("../data/diamonds.csv"))
let economics = toDf(readCsv("../data/economics.csv"))

# works but ylabels are to close to the plot so that they are
# on top of the tick values
# also not as fast as it could be :/
#ggplot(diamonds, aes("carat", "price")) +
#  geom_point() +
#  ggsave("figs/2.3_5.pdf")
basti at void in ~/CastData/ExternCode/ggplotnim/bookPlots ツ time ./chapter2
./chapter2  14.63s user 0.02s system 99% cpu 14.741 total

aha! So the parsing takes almost 14 s! Let’s verify that. Create a /tmp/testread.nim with:

import ggplotnim

let dcsv = readCsv("/home/basti/CastData/ExternCode/ggplotnim/data/diamonds.csv")
basti at void in /tmp λ nim c testread.nim
basti at void in /tmp λ time ./testread 
./testread  0.98s user 0.02s system 99% cpu 1.002 total

Huh?

Well, at this point I went back and forth, inserting random timers into my code and being extremely confused. For some reason it seemed like the ggplot calls above the diamonds parsing caused the parsing to be slower?!?!?! Guess I’m running a quantum computer. :P

I decided to figure out what the limit to this is. Eventually I ended up at the following test file weirdFastSlow.nim:

import ggplotnim, times

proc fastRead() =
  let csvD = readCsv("data/diamonds.csv")
  let df = toDf(csvD)
  echo df

proc fastPlot() =
  let mpg = toDf(readCsv("data/mpg.csv"))
  ggplot(mpg, aes(x = "displ", y = "hwy")) +
    geom_point() +
    ggsave("figs/2.3_1.pdf")
  ggplot(mpg, aes("displ", "hwy")) +
    geom_point() +
    ggsave("figs/2.3_2.pdf")
  ggplot(mpg, aes("cty", "hwy")) +
    geom_point() +
    ggsave("figs/2.3_4.pdf")

proc fastTogether() =
  fastPlot()
  fastRead()

proc slowTogether() =
  block:
    let mpg = toDf(readCsv("data/mpg.csv"))
    ggplot(mpg, aes(x = "displ", y = "hwy")) +
      geom_point() +
      ggsave("figs/2.3_1.pdf")
    ggplot(mpg, aes("displ", "hwy")) +
      geom_point() +
      ggsave("figs/2.3_2.pdf")
    ggplot(mpg, aes("cty", "hwy")) +
      geom_point() +
      ggsave("figs/2.3_4.pdf")
  block:
    let csvD = readCsv("data/diamonds.csv")
    let df = toDf(csvD)
    echo df

when isMainModule:
  let t0 = epochTime()
  fastPlot()
  let t1 = epochTime()
  echo "Plot first: ", (t1 - t0)
  fastRead()
  let t2 = epochTime()
  echo "Read first: ", (t2 - t1)
  fastTogether()
  let t3 = epochTime()
  echo "Fast together ", (t3 - t2)
  slowTogether()
  let t4 = epochTime()
  echo "Together ", (t4 - t3)
  • 1 proc which only parses, converts to DF and echoes the DF
  • 1 proc, which only parses the much smaller mpg dataset and creates 3 plots
  • 1 proc, which calls the above two internally
  • and finally a proc, which contains the code of the first two procs in its body, each in a block.

Running the code then results in the following funny result (removed the DF echo output):

basti at void in ~/CastData/ExternCode/ggplotnim ツ nim c bookPlots/weirdFastSlow.nim
basti at void in ~/CastData/ExternCode/ggplotnim ツ time bookPlots/weirdFastSlow
Plot first: 0.2592968940734863                                                  
Dataframe with 10 columns and 53940 rows:                                       
...
Read first: 1.393932580947876
Dataframe with 10 columns and 53940 rows:
...
Fast together 1.590697765350342
Dataframe with 10 columns and 53940 rows:
...
Together 15.17449021339417
bookPlots/weirdFastSlow  18.33s user 0.04s system 99% cpu 18.442 total          

So, a proc which essentially does the same as two individual procs is almost 10x slower. Right, that makes sense. :P

At this point it was clear that I had to figure out what the problem is. I don’t want such a time bomb in my code.

Things that came to my mind to try:

  • look at asm of the slow and fast version, maybe something is noticable -> yeah right, me as someone who probably only spent about 5 hours in my life in total looking at asm…
  • use perf to investigate what’s going on
  • use valgrind together with callgrind

I first dabbled with perf, but got a little frustrated (I managed to get it working, more on that later), so I went the asm route next.

Men who stare at… Guy who stares at asm

Fortunately mratsim opened an issue about asm generation from Nim code recently here (otherwise I’d have to hunt for the best compiler flags :P): nim-lang/Nim#12578

So I created a slow.nim and a fast.nim with the following code:

# slow.nim
import ggplotnim

proc slowTogether() =

  let mpg = toDf(readCsv("data/mpg.csv"))
  ggplot(mpg, aes(x = "displ", y = "hwy")) +
    geom_point() +
    ggsave("figs/2.3_1.pdf")
  ggplot(mpg, aes("displ", "hwy")) +
    geom_point() +
    ggsave("figs/2.3_2.pdf")
  ggplot(mpg, aes("cty", "hwy")) +
    geom_point() +
    ggsave("figs/2.3_4.pdf")
  let csvD = readCsv("data/diamonds.csv")
  let df = toDf(csvD)
  echo df

when isMainModule:
  slowTogether()

and:

# fast.nim
import ggplotnim

proc fastRead() =
  let csvD = readCsv("data/diamonds.csv")
  let df = toDf(csvD)
  echo df

proc fastPlot() =
  let mpg = toDf(readCsv("data/mpg.csv"))
  ggplot(mpg, aes(x = "displ", y = "hwy")) +
    geom_point() +
    ggsave("figs/2.3_1.pdf")
  ggplot(mpg, aes("displ", "hwy")) +
    geom_point() +
    ggsave("figs/2.3_2.pdf")
  ggplot(mpg, aes("cty", "hwy")) +
    geom_point() +
    ggsave("figs/2.3_4.pdf")

when isMainModule:
  fastPlot()
  fastRead()

For reference these are the run times of the above two:

basti at void in ~/CastData/ExternCode/ggplotnim ツ nim c bookPlots/slow && time bookPlots/slow
... 
bookPlots/slow  14.80s user 0.03s system 99% cpu 14.921 total
basti at void in ~/CastData/ExternCode/ggplotnim ツ nim c bookPlots/fast && time bookPlots/fast
...
bookPlots/fast  1.67s user 0.03s system 99% cpu 1.703 total

so 1.67 s vs. 14.80 s.

To look at the assembly of these we compile with:

basti at void in ~/CastData/ExternCode/ggplotnim ツ nim c --passC:"-fno-asynchronous-unwind-tables -fno-exceptions -fno-rtti -masm=intel -fverbose-asm -S" --debugger:native --noLinking:on bookPlots/slow.nim
basti at void in ~/CastData/ExternCode/ggplotnim ツ nim c --passC:"-fno-asynchronous-unwind-tables -fno-exceptions -fno-rtti -masm=intel -fverbose-asm -S" --debugger:native --noLinking:on bookPlots/fast.nim

This produces the asm output in the Nim cache directory, typically here: ~/.cache/nim/. The files we’re interested in are:

They look something like this:

.LC0:
	.string	"slowTogether"
	.section	.rodata.str1.8,"aMS",@progbits,1
	.align 8
.LC1:
	.string	"/home/basti/CastData/ExternCode/ggplotnim/bookPlots/slow.nim"
	.align 8
.LC2:
	.string	"/home/basti/CastData/ExternCode/ggplotnim/src/ggplotnim.nim"
	.align 8
.LC6:
	.string	"/home/basti/src/nim/nim_git_repo/lib/system.nim"
	.text
	.globl	slowTogether__bCXXx182645e8PWKobq26w
	.hidden	slowTogether__bCXXx182645e8PWKobq26w
	.type	slowTogether__bCXXx182645e8PWKobq26w, @function
slowTogether__bCXXx182645e8PWKobq26w:
.LFB4:
	.file 2 "/home/basti/CastData/ExternCode/ggplotnim/bookPlots/slow.nim"
	.loc 2 3 75 is_stmt 1 view -0
	.cfi_startproc
	push	r15	#
	.cfi_def_cfa_offset 16
	.cfi_offset 15, -16
	push	r14	#
	.cfi_def_cfa_offset 24
	.cfi_offset 14, -24
	push	r13	#
	.cfi_def_cfa_offset 32
	.cfi_offset 13, -32
	push	r12	#
	.cfi_def_cfa_offset 40
	.cfi_offset 12, -40
	push	rbp	#
	.cfi_def_cfa_offset 48
	.cfi_offset 6, -48
	push	rbx	#
	.cfi_def_cfa_offset 56
	.cfi_offset 3, -56
	sub	rsp, 753192	#,
	.cfi_def_cfa_offset 753248
# /home/basti/CastData/ExternCode/ggplotnim/bookPlots/slow.nim:3: proc slowTogether() =
	.loc 2 3 75 is_stmt 0 view .LVU9
	mov	rax, QWORD PTR fs:40	# tmp276, MEM[(<address-space-1> long unsigned int *)40B]
	mov	QWORD PTR 753176[rsp], rax	# D.2982, tmp276
	xor	eax, eax	# tmp276
	.loc 2 3 77 is_stmt 1 view .LVU10
	.loc 2 3 126 view .LVU11
	.loc 2 3 179 view .LVU12
	.loc 2 3 229 view .LVU13
	.loc 2 3 275 view .LVU14
	.loc 2 3 325 view .LVU15
	.loc 2 3 374 view .LVU16
	.loc 2 3 418 view .LVU17
	.loc 2 3 464 view .LVU18
	.loc 2 3 508 view .LVU19
	.loc 2 3 558 view .LVU20
	.loc 2 3 605 view .LVU21
	.loc 2 3 656 view .LVU22
	.loc 2 3 706 view .LVU23
	.loc 2 3 751 view .LVU24
	.loc 2 3 798 view .LVU25
	.loc 2 3 843 view .LVU26
	.loc 2 3 894 view .LVU27
	.loc 2 3 941 view .LVU28
	.loc 2 3 992 view .LVU29
	.loc 2 3 1042 view .LVU30
	.loc 2 3 1087 view .LVU31
	.loc 2 3 1134 view .LVU32
	.loc 2 3 1179 view .LVU33
	.loc 2 3 1233 view .LVU34
	.loc 2 3 1281 view .LVU35
	.loc 2 3 1319 view .LVU36
	.loc 2 3 1319 view .LVU37
	lea	rax, .LC0[rip]	# tmp278,
	mov	QWORD PTR 232[rsp], rax	# FR_.procname, tmp278
	.loc 2 3 1319 view .LVU38
	lea	rax, .LC1[rip]	# tmp279,
	mov	QWORD PTR 248[rsp], rax	# FR_.filename, tmp279
	.loc 2 3 1319 view .LVU39
	mov	QWORD PTR 240[rsp], 0	# FR_.line,
	.loc 2 3 1319 view .LVU40
	mov	WORD PTR 256[rsp], 0	# FR_.len,
	.loc 2 3 1319 view .LVU41

and

.LBE336:
	.loc 6 1000000 80 is_stmt 1 view .LVU594
	.loc 2 21 2 view .LVU595
	mov	QWORD PTR 16[rsp], 21	# FR_.line,
	.loc 2 21 2 view .LVU596
	lea	rbx, .LC1[rip]	# tmp95,
	mov	QWORD PTR 24[rsp], rbx	# FR_.filename, tmp95
	.loc 2 21 76 view .LVU597
	.loc 2 21 2 view .LVU598
	call	fastPlot__GdV69c9bJIeDVwXhxavk0MvA	#
.LVL151:
	.loc 2 22 2 view .LVU599
	mov	QWORD PTR 16[rsp], 22	# FR_.line,
	.loc 2 22 2 view .LVU600
	mov	QWORD PTR 24[rsp], rbx	# FR_.filename, tmp95
	.loc 2 22 76 view .LVU601
	.loc 2 22 2 view .LVU602
	call	fastRead__GdV69c9bJIeDVwXhxavk0MvA_2	#
# and somewhere else:
.LC6:
	.string	"fastRead"
	.section	.rodata.str1.8
	.align 8
.LC7:
	.string	"/home/basti/src/nim/nim_git_repo/lib/system.nim"
	.text
	.globl	fastRead__GdV69c9bJIeDVwXhxavk0MvA_2
	.hidden	fastRead__GdV69c9bJIeDVwXhxavk0MvA_2
	.type	fastRead__GdV69c9bJIeDVwXhxavk0MvA_2, @function
fastRead__GdV69c9bJIeDVwXhxavk0MvA_2:
.LFB5:
	.loc 2 3 75 is_stmt 1 view -0
	.cfi_startproc
	push	r12	#
	.cfi_def_cfa_offset 16
	.cfi_offset 12, -16
	push	rbp	#
	.cfi_def_cfa_offset 24
	.cfi_offset 6, -24
	push	rbx	#
	.cfi_def_cfa_offset 32
	.cfi_offset 3, -32
	sub	rsp, 176	#,
	.cfi_def_cfa_offset 208
# /home/basti/CastData/ExternCode/ggplotnim/bookPlots/fast.nim:3: proc fastRead() =
	.loc 2 3 75 is_stmt 0 view .LVU419
	mov	rax, QWORD PTR fs:40	# tmp108, MEM[(<address-space-1> long unsigned int *)40B]
	mov	QWORD PTR 168[rsp], rax	# D.3031, tmp108
	xor	eax, eax	# tmp108
	.loc 2 3 77 is_stmt 1 view .LVU420
	.loc 2 3 131 view .LVU421
	.loc 2 3 179 view .LVU422
	.loc 2 3 216 view .LVU423
	.loc 2 3 216 view .LVU424
	lea	rax, .LC6[rip]	# tmp110,
	mov	QWORD PTR 40[rsp], rax	# FR_.procname, tmp110
	.loc 2 3 216 view .LVU425
	lea	rax, .LC1[rip]	# tmp111,
	mov	QWORD PTR 56[rsp], rax	# FR_.filename, tmp111
	.loc 2 3 216 view .LVU426
	mov	QWORD PTR 48[rsp], 0	# FR_.line,
	.loc 2 3 216 view .LVU427
	mov	WORD PTR 64[rsp], 0	# FR_.len,
	.loc 2 3 216 view .LVU428
.LVL100:
.LBB312:
.LBI312:
	.loc 3 463 8 view .LVU429
.LBB313:
	.loc 3 463 46 view .LVU430
	.loc 3 464 2 view .LVU431
	.loc 3 464 3 view .LVU432
# and 
.LC0:
	.string	"fastPlot"
	.section	.rodata.str1.8,"aMS",@progbits,1
	.align 8
.LC1:
	.string	"/home/basti/CastData/ExternCode/ggplotnim/bookPlots/fast.nim"
	.align 8
.LC2:
	.string	"/home/basti/CastData/ExternCode/ggplotnim/src/ggplotnim.nim"
	.text
	.globl	fastPlot__GdV69c9bJIeDVwXhxavk0MvA
	.hidden	fastPlot__GdV69c9bJIeDVwXhxavk0MvA
	.type	fastPlot__GdV69c9bJIeDVwXhxavk0MvA, @function
fastPlot__GdV69c9bJIeDVwXhxavk0MvA:
.LFB4:
	.file 2 "/home/basti/CastData/ExternCode/ggplotnim/bookPlots/fast.nim"
	.loc 2 8 73 is_stmt 1 view -0
	.cfi_startproc
	push	r15	#
	.cfi_def_cfa_offset 16
	.cfi_offset 15, -16
	push	r14	#
	.cfi_def_cfa_offset 24
	.cfi_offset 14, -24
	push	r13	#
	.cfi_def_cfa_offset 32
	.cfi_offset 13, -32
	push	r12	#
	.cfi_def_cfa_offset 40
	.cfi_offset 12, -40
	push	rbp	#
	.cfi_def_cfa_offset 48
	.cfi_offset 6, -48
	push	rbx	#
	.cfi_def_cfa_offset 56
	.cfi_offset 3, -56
	sub	rsp, 753064	#,
	.cfi_def_cfa_offset 753120
# /home/basti/CastData/ExternCode/ggplotnim/bookPlots/fast.nim:8: proc fastPlot() =
	.loc 2 8 73 is_stmt 0 view .LVU9
	mov	rax, QWORD PTR fs:40	# tmp264, MEM[(<address-space-1> long unsigned int *)40B]
	mov	QWORD PTR 753048[rsp], rax	# D.2990, tmp264
	xor	eax, eax	# tmp264
	.loc 2 8 75 is_stmt 1 view .LVU10
	.loc 2 8 124 view .LVU11
	.loc 2 8 177 view .LVU12
	.loc 2 8 227 view .LVU13
	.loc 2 8 273 view .LVU14
	.loc 2 8 323 view .LVU15
	.loc 2 8 372 view .LVU16
	.loc 2 8 416 view .LVU17
	.loc 2 8 462 view .LVU18
	.loc 2 8 506 view .LVU19
	.loc 2 8 556 view .LVU20
	.loc 2 8 603 view .LVU21
	.loc 2 8 654 view .LVU22
	.loc 2 8 704 view .LVU23
	.loc 2 8 749 view .LVU24
	.loc 2 8 796 view .LVU25
	.loc 2 8 841 view .LVU26
	.loc 2 8 892 view .LVU27
	.loc 2 8 939 view .LVU28
	.loc 2 8 990 view .LVU29
	.loc 2 8 1040 view .LVU30
	.loc 2 8 1085 view .LVU31
	.loc 2 8 1132 view .LVU32
	.loc 2 8 1177 view .LVU33
	.loc 2 8 1177 view .LVU34
	lea	rax, .LC0[rip]	# tmp266,
	mov	QWORD PTR 200[rsp], rax	# FR_.procname, tmp266
	.loc 2 8 1177 view .LVU35
	lea	rax, .LC1[rip]	# tmp267,
	mov	QWORD PTR 216[rsp], rax	# FR_.filename, tmp267
	.loc 2 8 1177 view .LVU36
	mov	QWORD PTR 208[rsp], 0	# FR_.line,
	.loc 2 8 1177 view .LVU37
	mov	WORD PTR 224[rsp], 0	# FR_.len,
	.loc 2 8 1177 view .LVU38

The thing is I don’t even know from where to where I should compare these two files. I just assume that the effective “function” ends where the nim lines about framePtr appear. I guess that’s where the stack frame is switched to leave the function? I have no idea. :)

Well, since I have no idea how to best look at assembly, I did what every disciple of emacs would surely do in this case: grab ediff and compare the two files!

And well. As you can imagine that didn’t turn out that well. Either the differences were just different numbers on some labels or the differences were too large for me to make any sense of..

So I abandoned this ship. jump

perf for the rescue (I hope?)

So going back to perf. According to the ultimate wisdom of the interwebs, I’m supposed to compile my programs to look at them with perf along the following lines:

nim c --passC:"-Og -ggdb -g3 -fno-omit-frame-pointer" chapter2.nim

From here we can either tell perf to give us statistics about running the binary:

basti at void in ~/CastData/ExternCode/ggplotnim ツ perf stat -ddd bookPlots/slow
Dataframe with 10 columns and 53940 rows:
        Idx      carat        cut      color    clarity      depth      table      price          x          y          z
          0       0.23      Ideal          E        SI2       61.5         55        326       3.95       3.98       2.43
          1       0.21    Premium          E        SI1       59.8         61        326       3.89       3.84       2.31
          2       0.23       Good          E        VS1       56.9         65        327       4.05       4.07       2.31
          3       0.29    Premium          I        VS2       62.4         58        334        4.2       4.23       2.63
          4       0.31       Good          J        SI2       63.3         58        335       4.34       4.35       2.75
          5       0.24  Very Good          J       VVS2       62.8         57        336       3.94       3.96       2.48
          6       0.24  Very Good          I       VVS1       62.3         57        336       3.95       3.98       2.47
          7       0.26  Very Good          H        SI1       61.9         55        337       4.07       4.11       2.53
          8       0.22       Fair          E        VS2       65.1         61        337       3.87       3.78       2.49
          9       0.23  Very Good          H        VS1       59.4         61        338          4       4.05       2.39
         10        0.3       Good          J        SI1         64         55        339       4.25       4.28       2.73
         11       0.23      Ideal          J        VS1       62.8         56        340       3.93        3.9       2.46
         12       0.22    Premium          F        SI1       60.4         61        342       3.88       3.84       2.33
         13       0.31      Ideal          J        SI2       62.2         54        344       4.35       4.37       2.71
         14        0.2    Premium          E        SI2       60.2         62        345       3.79       3.75       2.27
         15       0.32    Premium          E         I1       60.9         58        345       4.38       4.42       2.68
         16        0.3      Ideal          I        SI2         62         54        348       4.31       4.34       2.68
         17        0.3       Good          J        SI1       63.4         54        351       4.23       4.29        2.7
         18        0.3       Good          J        SI1       63.8         56        351       4.23       4.26       2.71
         19        0.3  Very Good          J        SI1       62.7         59        351       4.21       4.27       2.66

 Performance counter stats for 'bookPlots/slow':

          5,339.39 msec task-clock:u              #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
            19,184      page-faults:u             #    0.004 M/sec                  
    21,020,453,174      cycles:u                  #    3.937 GHz                      (38.42%)
    57,148,967,453      instructions:u            #    2.72  insn per cycle           (46.12%)
     7,900,793,796      branches:u                # 1479.719 M/sec                    (46.12%)
         2,314,388      branch-misses:u           #    0.03% of all branches          (46.14%)
    11,142,331,629      L1-dcache-loads:u         # 2086.819 M/sec                    (46.16%)
        42,863,931      L1-dcache-load-misses:u   #    0.38% of all L1-dcache hits    (46.17%)
         3,491,544      LLC-loads:u               #    0.654 M/sec                    (30.79%)
         1,201,566      LLC-load-misses:u         #   34.41% of all LL-cache hits     (30.79%)
   <not supported>      L1-icache-loads:u                                           
         3,962,707      L1-icache-load-misses:u                                       (30.79%)
    11,110,873,725      dTLB-loads:u              # 2080.927 M/sec                    (30.79%)
           148,175      dTLB-load-misses:u        #    0.00% of all dTLB cache hits   (30.77%)
           106,566      iTLB-loads:u              #    0.020 M/sec                    (30.75%)
             5,602      iTLB-load-misses:u        #    5.26% of all iTLB cache hits   (30.73%)
   <not supported>      L1-dcache-prefetches:u                                      
   <not supported>      L1-dcache-prefetch-misses:u                                   

       5.339677584 seconds time elapsed

       5.283961000 seconds user
       0.023719000 seconds sys

which is semi helpful. I guess reading 0.03 % of all branches missed at least means the troulbes are not due to branch misses.

(the -ddd is the verbosity level I believe, the more d perf gets, the happier it is to work for us it seems… cough).

Or we can tell perf to generate a proper report for us, which in principle is just:

perf record <binary>

Interesting arguments are:

-o <outname>
--call-graph dwarf --

the former outputs to the specified file (I highlight it, because.. see below) and the latter changes the format. It generates a call stack, which can be easier traversed. However with the latter option I don’t know how to access the annotation?

So first I went:

perf record -o slow.data bookPlots/slow

which worked just fine. Let’s look at that file I thought:

basti at void in ~/CastData/ExternCode/ggplotnim ツ perf report slow.data

… opens a mostly empty ncurses window … I press enter and am greeted with:

perf: Segmentation fault
-------- backtrace --------
perf(+0x3a6f94)[0x565168d2cf94]
/usr/lib/libc.so.6(+0x3b9af)[0x7fd36c3b79af]
perf(+0x3a449e)[0x565168d2a49e]
perf(perf_evlist__tui_browse_hists+0x378)[0x565168d2b438]
perf(cmd_report+0x2e95)[0x565168bbe165]
perf(+0x2a0db2)[0x565168c26db2]
perf(main+0x66b)[0x565168ba2c4b]
/usr/lib/libc.so.6(__libc_start_main+0xea)[0x7fd36c3a2dea]
perf(_start+0x29)[0x565168ba2ea9]

what da hell? In 2 minutes I managed to segfault perf. Good job Vindaar!

Googling at first didn’t help. My man page of perf is also fun:

PERF(1)                           perf Manual                          PERF(1)

NAME
       perf - Performance analysis tools for Linux

SYNOPSIS
       perf [--version] [--help] [OPTIONS] COMMAND [ARGS]

OPTIONS
       --debug
           Setup debug variable (see list below) in value range (0, 10). Use
           like: --debug verbose # sets verbose = 1 --debug verbose=2 # sets
           verbose = 2

               List of debug variables allowed to set:
                 verbose          - general debug messages
                 ordered-events   - ordered events object debug messages
                 data-convert     - data convert command debug messages
                 stderr           - write debug output (option -v) to stderr
                                    in browser mode

       --buildid-dir
           Setup buildid cache directory. It has higher priority than
           buildid.dir config file option.

       -v, --version
           Display perf version.

       -h, --help
           Run perf help command.

DESCRIPTION
       Performance counters for Linux are a new kernel-based subsystem that
       provide a framework for all things performance analysis. It covers
       hardware level (CPU/PMU, Performance Monitoring Unit) features and
       software features (software counters, tracepoints) as well.

SEE ALSO
       perf-stat(1), perf-top(1), perf-record(1), perf-report(1), perf-list(1)

perf 5.3.1_1                      10/02/2019                           PERF(1)
/tmp/man.XXXXsRnSeE (END)

(yeah, by now I realize that I should have gone with

perf help <xyz>

I googled the segfault and didn’t find anything helpful (guess what, the name perf does not help your googling even a bit..). At some point I realized that the syntax to supply the data file is probably wrong, and yep. It is:

perf report -i <your-darn-file>

I mean seriously. Am I the first person to make that mistake, were the other people either too lazy to report the not very helpful segfault or were the developers simply to lazy to fix it?

Well, after I got over that initial bump, perf was actually really useful!

I generated the reports for both files:

basti at void in ~/CastData/ExternCode/ggplotnim ツ perf record -o slow_dwarf.data --call-graph dwarf -- bookPlots/slow
basti at void in ~/CastData/ExternCode/ggplotnim ツ perf record -o fast_dwarf.data --call-graph dwarf -- bookPlots/fast

And compared the two side by side (slow on the LHS, fast on the RHS):

Samples: 21K of event 'cycles:u', Event count (approx.): 21099477017                                                   │Samples: 2K of event 'cycles:u', Event count (approx.): 2503069571
  Children      Self  Command  Shared Object            Symbol                                                         │  Children      Self  Command  Shared Object            Symbol
+   97.40%     0.00%  slow     [unknown]                [.] 0xffffffffffffffff                                        ◆│+   83.35%     0.00%  fast     fast                     [.] _start                                                   ◆
+   96.23%     0.00%  slow     slow                     [.] slowTogether__bCXXx182645e8PWKobq26w                      ▒│+   83.35%     0.00%  fast     libc-2.30.so             [.] __libc_start_main                                        ▒
+   92.81%     0.09%  slow     slow                     [.] readCsv__IFciN2r3VwHQdzf7Sonedw                           ▒│+   83.35%     0.00%  fast     fast                     [.] main                                                     ▒
+   89.42%     0.18%  slow     slow                     [.] collectCT__XHio9cMpnLoH7GyCj1Z9besg                       ▒│+   83.35%     0.00%  fast     fast                     [.] NimMain                                                  ▒
+   89.24%     0.02%  slow     slow                     [.] collectCTBody__XHio9cMpnLoH7GyCj1Z9besg_2                 ▒│+   83.16%     0.00%  fast     fast                     [.] NimMainInner                                             ▒
+   86.36%     0.00%  slow     slow                     [.] copyStringRC1                                             ▒│+   83.16%     0.00%  fast     fast                     [.] NimMainModule                                            ▒
+   86.35%     0.06%  slow     slow                     [.] newObjRC1                                                 ▒│+   83.16%     0.00%  fast     fast                     [.] fastRead__GdV69c9bJIeDVwXhxavk0MvA_2                     ▒
+   85.95%    12.54%  slow     slow                     [.] markStackAndRegisters__U6T7JWtDLrWhtmhXSoy9a6g            ▒│+   60.05%     0.95%  fast     fast                     [.] readCsv__IFciN2r3VwHQdzf7Sonedw                          ▒
+   85.73%     0.00%  slow     slow                     [.] gcMark__x5SbLN3uVBCsEa67N20nPwsystem (inlined)            ▒│+   30.17%     1.20%  fast     fast                     [.] collectCT__XHio9cMpnLoH7GyCj1Z9besg                      ▒
+   74.47%    37.26%  slow     slow                     [.] interiorAllocatedPtr__NuzKjA4SX9afyji9cHHIuKpQ            ▒│+   29.02%     0.04%  fast     fast                     [.] collectCTBody__XHio9cMpnLoH7GyCj1Z9besg_2                ▒
+   55.23%     0.00%  slow     slow                     [.] isAccessible__EZspgz843pJ9adnD0ZpJO6wsystem (inlined)     ▒│+   26.79%     0.37%  fast     fast                     [.] toDf__OdkyX08ij8cszGSKXYpErA                             ▒
+   38.27%    21.87%  slow     slow                     [.] contains__9b5xR7VBZVwQDvk5Nr9bDKdQ                        ▒│+   24.83%     1.73%  fast     fast                     [.] rawNewObj__ehkAaLROrd0Hc9aLROWt1nQ                       ▒
+   17.61%    17.61%  slow     slow                     [.] intSetGet__O3FRrWKKUdi8uRTGxiPdIg                         ▒│+   20.21%     0.26%  fast     fast                     [.] genericAssign                                            ▒
+   11.80%     0.00%  slow     slow                     [.] popFrame (inlined)                                        ▒│+   20.03%     1.60%  fast     fast                     [.] genericAssignAux__U5DxFPRpHCCZDKWQzM9adaw                ▒
+   10.21%     0.00%  slow     slow                     [.] usrToCell__QFQqcLB3lgOdwipkv9a60xwsystem (inlined)        ▒│+   19.52%     2.59%  fast     fast                     [.] genericAssignAux__69cRBG8C28ydeYlu49a9aOJCQ              ▒
+    7.73%     0.00%  slow     slow                     [.] nimFrame (inlined)                                        ▒│+   15.73%     1.50%  fast     fast                     [.] copyString                                               ▒
+    7.66%     0.00%  slow     slow                     [.] popFrame (inlined)                                        ▒│+   14.98%     1.45%  fast     fast                     [.] nsuStrip                                                 ▒
+    7.53%     0.00%  slow     slow                     [.] nimFrame (inlined)                                        ▒│+   14.54%     0.07%  fast     fast                     [.] toPersistentVector__8SFGsOWADGPetLxZAIChPA               ▒
+    6.84%     0.00%  slow     slow                     [.] pageIndex__cMO7rAWQ0KzYXPVgB0ZR3gsystem (inlined)         ▒│+   14.29%     0.26%  fast     fast                     [.] X5BX5D___77tSPwrHe1Kl49cYFFBeLkQ                         ▒
+    5.53%     0.00%  slow     slow                     [.] popFrame (inlined)                                        ▒│+   14.12%     0.22%  fast     fast                     [.] rawNewStringNoInit                                       ▒
+    5.08%     0.00%  slow     slow                     [.] nimFrame (inlined)                                        ▒│+   14.11%     0.00%  fast     [unknown]                [.] 0xffffffffffffffff                                       ▒
+    3.95%     0.26%  slow     slow                     [.] rawNewObj__ehkAaLROrd0Hc9aLROWt1nQ                        ▒│+   13.92%     0.00%  fast     fast                     [.] collectCycles__XHio9cMpnLoH7GyCj1Z9besg_3                ▒
+    3.37%     0.03%  slow     slow                     [.] toDf__OdkyX08ij8cszGSKXYpErA                              ▒│+   13.90%     0.07%  fast     fast                     [.] newObjNoInit                                             ▒
+    2.80%     0.22%  slow     slow                     [.] nsuStrip                                                  ▒│+   13.69%     2.93%  fast     fast                     [.] substr__2yh9cer0ymNRHlOOg8P7IuA                          ▒
+    2.59%     0.38%  slow     slow                     [.] substr__2yh9cer0ymNRHlOOg8P7IuA                           ▒│+   13.57%     0.23%  fast     fast                     [.] copyStringRC1                                            ▒
+    2.55%     0.14%  slow     slow                     [.] newObj                                                    ▒│+   13.38%     8.21%  fast     fast                     [.] unsureAsgnRef                                            ▒
+    2.41%     0.05%  slow     slow                     [.] genericAssign                                             ▒│+   13.34%     0.45%  fast     fast                     [.] newObj                                                   ▒
+    2.40%     0.00%  slow     slow                     [.] nimFrame (inlined)                                        ▒│+   13.30%     0.53%  fast     fast                     [.] readRow__b09aWQWdQLmOgKmRBW0eKTQ                         ▒
+    2.39%     0.24%  slow     slow                     [.] genericAssignAux__U5DxFPRpHCCZDKWQzM9adaw                 ▒│+   13.03%     0.91%  fast     fast                     [.] newObjRC1                                                ▒
+    2.31%     0.28%  slow     slow                     [.] genericAssignAux__69cRBG8C28ydeYlu49a9aOJCQ               ▒│+   12.24%     1.48%  fast     fast                     [.] parseField__6mfYg4gHtbgPqGUu2wyijQ                       ▒
+    2.25%     0.06%  slow     slow                     [.] mnewString                                                ▒│+   12.10%     0.23%  fast     fast                     [.] rawNewString                                             ▒
+    2.20%     0.04%  slow     slow                     [.] rawNewString                                              ▒│+   12.06%     0.11%  fast     fast                     [.] mnewString                                               ▒
+    2.00%     0.00%  slow     slow                     [.] toPersistentVector__8SFGsOWADGPetLxZAIChPA                ▒│+   11.50%     0.31%  fast     fast                     [.] markS__SOJE9bROCOc8iabVsKM64Sg                           ▒
+    1.89%     0.16%  slow     slow                     [.] copyString                                                ▒│+   10.13%     1.40%  fast     fast                     [.] markStackAndRegisters__U6T7JWtDLrWhtmhXSoy9a6g           ▒
+    1.76%     0.02%  slow     slow                     [.] X5BX5D___77tSPwrHe1Kl49cYFFBeLkQ                          ▒│+   10.13%     0.00%  fast     fast                     [.] gcMark__x5SbLN3uVBCsEa67N20nPwsystem (inlined)           ▒
+    1.71%     0.01%  slow     slow                     [.] rawNewStringNoInit                                        ▒│+   10.02%     6.34%  fast     fast                     [.] forAllChildren__XCvXrotwhq0gugZtuZTNPQ                   ▒

where we can immedetialy see that something is fishy. The right (fast) looks about what I’d expect. We have main, NimMain, etc. and eventually at ~83% fastRead (which I’d expect to be the slowest, see the rough checks at the beginning) and with about 60% readCsv. This is logical, because if readCsv is running, then main has to be above that, since it calls it in the first place!

On the LHS however, the first entry is

+   97.40%     0.00%  slow     [unknown]                [.] 0xffffffffffffffff

what the hell is that supposed to mean?

Also slowTogether and specifically readCsv make up 96% and 92% respectively (so note that this is a higher percentage in a more than 5x longer run time!). So the readCsv in the LHS case is doing some crazy shit.

Let’s expand the call stack of the readCsv line on the LHS (press e to expand it completely or enter level by level):

-   92.81%     0.09%  slow     slow                     [.] readCsv__IFciN2r3VwHQdzf7Sonedw          
   - 92.72% readCsv__IFciN2r3VwHQdzf7Sonedw                                                          
      - 86.32% copyStringRC1                                                                         
         - 86.31% newObjRC1                                                                          
            - 86.10% collectCT__XHio9cMpnLoH7GyCj1Z9besg                                             
               - 86.03% collectCTBody__XHio9cMpnLoH7GyCj1Z9besg_2                                    
                  - 84.13% markStackAndRegisters__U6T7JWtDLrWhtmhXSoy9a6g                            
                     - gcMark__x5SbLN3uVBCsEa67N20nPwsystem (inlined)                                
                        - 70.88% interiorAllocatedPtr__NuzKjA4SX9afyji9cHHIuKpQ                      
                           - 54.09% isAccessible__EZspgz843pJ9adnD0ZpJO6wsystem (inlined)            
                              - 35.36% contains__9b5xR7VBZVwQDvk5Nr9bDKdQ                            
                                 - 16.08% intSetGet__O3FRrWKKUdi8uRTGxiPdIg                          
                                      5.41% popFrame (inlined)                                       
                                      4.99% nimFrame (inlined)                                       
                                   7.44% popFrame (inlined)                                          
                                   2.25% nimFrame (inlined)                                          
                              - 6.70% pageIndex__cMO7rAWQ0KzYXPVgB0ZR3gsystem (inlined)              
                                   2.27% nimFrame (inlined)                                          
                                   1.01% popFrame (inlined)                                          
                                6.27% popFrame (inlined)                                             
                                3.29% nimFrame (inlined)                                             
                                1.06% intSetGet__O3FRrWKKUdi8uRTGxiPdIg                              
                             4.30% popFrame (inlined)                                                
                             1.99% nimFrame (inlined)                                                
                        - 9.99% usrToCell__QFQqcLB3lgOdwipkv9a60xwsystem (inlined)                   
                             7.29% nimFrame (inlined)                                                
                             0.61% popFrame (inlined)                                                
                          0.97% contains__9b5xR7VBZVwQDvk5Nr9bDKdQ                                   
                    0.96% interiorAllocatedPtr__NuzKjA4SX9afyji9cHHIuKpQ                             
      - 2.79% nsuStrip                                                                               
         - 2.58% substr__2yh9cer0ymNRHlOOg8P7IuA                                                     
            - 2.21% mnewString                                                                       
               - 2.14% rawNewString                                                                  
                  - newObj                                                                           
                     - 1.94% rawNewObj__ehkAaLROrd0Hc9aLROWt1nQ                                      
                        - 1.45% collectCT__XHio9cMpnLoH7GyCj1Z9besg                                  
                           - 1.40% collectCTBody__XHio9cMpnLoH7GyCj1Z9besg_2                         
                              - 1.07% markStackAndRegisters__U6T7JWtDLrWhtmhXSoy9a6g                 
                                 - gcMark__x5SbLN3uVBCsEa67N20nPwsystem (inlined)                    
                                    - 0.90% interiorAllocatedPtr__NuzKjA4SX9afyji9cHHIuKpQ           
                                         0.68% isAccessible__EZspgz843pJ9adnD0ZpJO6wsystem (inlined) 
      - 1.48% readRow__b09aWQWdQLmOgKmRBW0eKTQ                                                       
         - 1.35% parseField__6mfYg4gHtbgPqGUu2wyijQ                                                  
              1.08% unsureAsgnRef                                                                    
      - 0.83% X5BX5D___l0219aS8r1rZqdRemYqAVGQ                                                       
         - 0.74% rawGet__gS7Fz9bCHsSKnHns9brIjsogtables (inlined)                                    
              0.54% hash__6PCYkKlCNhq9cnRLnqWKkwQ                                                    
      - 0.60% rowEntry__iSnSJzyyzosqzYvzg3RnzQ                                                       
           0.55% find__ZFKPMFh1RWpGOSS69c9a5CHAsystem (inlined)                                    

So we see that 86% is copyStringRC1, newObjRC and then collectCT. I guess the last is related to garbage collection. So for some reason in this case the readCsv proc is copying strings and collecting them for a living?

Before we do anything with this knowledge, let’s compare this to the expansion of readCsv on the RHS:

-   60.05%     0.95%  fast     fast                     [.] readCsv__IFciN2r3VwHQdzf7Sonedw         
   - 59.10% readCsv__IFciN2r3VwHQdzf7Sonedw                                                         
      - 14.95% nsuStrip                                                                             
         - 13.54% substr__2yh9cer0ymNRHlOOg8P7IuA                                                   
            - 10.75% mnewString                                                                     
               - rawNewString                                                                       
                  - newObj                                                                          
                     - 9.09% rawNewObj__ehkAaLROrd0Hc9aLROWt1nQ                                     
                        - 4.82% collectCT__XHio9cMpnLoH7GyCj1Z9besg                                 
                           - 4.28% collectCTBody__XHio9cMpnLoH7GyCj1Z9besg_2                        
                              - 2.53% markStackAndRegisters__U6T7JWtDLrWhtmhXSoy9a6g                
                                 - gcMark__x5SbLN3uVBCsEa67N20nPwsystem (inlined)                   
                                    - 2.09% interiorAllocatedPtr__NuzKjA4SX9afyji9cHHIuKpQ          
                                       - 1.39% isAccessible__EZspgz843pJ9adnD0ZpJO6wsystem (inlined)
                                            0.85% contains__9b5xR7VBZVwQDvk5Nr9bDKdQ                
                              - 1.45% collectZCT__EN6T32AMm3va9bsrdxtF0cg                           
                                   0.65% rawDealloc__K7uQ6aTKvW6OnOV8EMoNNQ                         
                          2.56% rawAlloc__mE4QEVyMvGRVliDWDngZCQ                                    
                          1.11% addNewObjToZCT__xkq8ZbJr9bIUKHN9cK4tT5NAsystem (inlined)            
                     - 1.14% zeroMem__FsvhfMqbswETmyMWtamDewsystem (inlined)                        
                          nimZeroMem (inlined)                                                      
                        - nimSetMem__zxfKBYntu9cBapkhrCOk1fgmemory (inlined)                        
                             0.91% __memset_avx2_unaligned_erms                                     
      - 13.54% copyStringRC1                                                                        
         - 13.00% newObjRC1                                                                         
            - 10.95% collectCT__XHio9cMpnLoH7GyCj1Z9besg                                            
               - 10.81% collectCTBody__XHio9cMpnLoH7GyCj1Z9besg_2                                   
                  - 6.47% collectCycles__XHio9cMpnLoH7GyCj1Z9besg_3                                 
                     - 5.33% markS__SOJE9bROCOc8iabVsKM64Sg                                         
                        - 4.18% forAllChildren__XCvXrotwhq0gugZtuZTNPQ                              
                           - 0.83% Marker_tySequence__sM4lkSb7zS6F7OVMvW9cffQ                       
                              - nimGCvisit                                                          
                                doOperation__sl6eqhLncFedgwzv6TlMVw                                 
                          1.03% containsOrIncl__qhy8GaXaPs9bLqr6V8CV9cFg                            
                     - 1.10% sweep__XHio9cMpnLoH7GyCj1Z9besg_5                                      
                          0.88% contains__ClLkUQKF8KrRxQPdAJDd5w                                    
                  - 2.82% markStackAndRegisters__U6T7JWtDLrWhtmhXSoy9a6g                            
                     - gcMark__x5SbLN3uVBCsEa67N20nPwsystem (inlined)                               
                        - 2.38% interiorAllocatedPtr__NuzKjA4SX9afyji9cHHIuKpQ                      
                           - 1.76% isAccessible__EZspgz843pJ9adnD0ZpJO6wsystem (inlined)            
                                0.98% contains__9b5xR7VBZVwQDvk5Nr9bDKdQ                            
                  - 1.48% collectZCT__EN6T32AMm3va9bsrdxtF0cg                                       
                       0.68% rawDealloc__K7uQ6aTKvW6OnOV8EMoNNQ                                     
              1.10% rawAlloc__mE4QEVyMvGRVliDWDngZCQ                                                
      - 13.26% readRow__b09aWQWdQLmOgKmRBW0eKTQ                                                     
         - 12.24% parseField__6mfYg4gHtbgPqGUu2wyijQ                                                
            - 9.59% unsureAsgnRef                                                                   
               - 3.54% isOnStack__plOlFsQAAvcYd3nF5LfWzw                                            
                    1.03% getActiveStack__muO7WGG730PTJWpoUnluHg                                    
                    0.76% popFrame (inlined)                                                        
               - 1.74% usrToCell__QFQqcLB3lgOdwipkv9a60xwsystem (inlined)                           
                    1.33% nimFrame (inlined)                                                        
                 1.37% decRef__AT1eRuflKWyTTBdLjEDZbg_2system (inlined)                             
                 0.87% incRef__AT1eRuflKWyTTBdLjEDZbgsystem (inlined)                               
                 0.72% popFrame (inlined)                                                           

I don’t know about you, but those percentages seem to be a lot more sane to me. It’s stripping, copying and reading rows at about an equal amount (the stuff below I left out to not extend this even more, but those are hashing related and rowEntry.

So assuming that the GC is actually our issue, let’s compile the code without any gc and see what happens:

basti at void in ~/CastData/ExternCode/ggplotnim ツ time bookPlots/slow 
Dataframe with 10 columns and 53940 rows:                               
...                                                                     
bookPlots/slow  0.91s user 0.06s system 99% cpu 0.972 total             
basti at void in ~/CastData/ExternCode/ggplotnim ツ time bookPlots/fast
Dataframe with 10 columns and 53940 rows:                              
...
bookPlots/fast  0.94s user 0.05s system 99% cpu 0.995 total            

(remember the nim c bookPlots/fast call from above took 1.67 s! I guess it’s expected that the code would be faster without the GC in the fast case, but it’s great to see that also the slow code is just as fast!

Does this also hold for different GCs?

  • boehm:
    basti at void in ~/CastData/ExternCode/ggplotnim ツ nim c --gc:boehm bookPlots/slow && time bookPlots/slow
    Dataframe with 10 columns and 53940 rows:
    ...
    bookPlots/slow  1.24s user 0.04s system 141% cpu 0.905 total
    basti at void in ~/CastData/ExternCode/ggplotnim ツ nim c --gc:boehm bookPlots/fast && time bookPlots/fast
    Dataframe with 10 columns and 53940 rows:
    ...
    bookPlots/fast  1.29s user 0.06s system 147% cpu 0.914 total
        
  • markAndSweep:
basti at void in ~/CastData/ExternCode/ggplotnim ツ nim c --gc:markAndSweep bookPlots/slow && time bookPlots/slow
Dataframe with 10 columns and 53940 rows:
...
bookPlots/slow  1.42s user 0.03s system 99% cpu 1.458 total
basti at void in ~/CastData/ExternCode/ggplotnim ツ nim c --gc:markAndSweep --verbosity:0 --hints:off --warnings:off bookPlots/fast && time bookPlots/fast
Dataframe with 10 columns and 53940 rows:
...
bookPlots/fast  1.33s user 0.03s system 99% cpu 1.370 total
  • and as a bonus destructors:
basti at void in ~/CastData/ExternCode/ggplotnim ツ nim c --gc:destructors bookPlots/slow && time bookPlots/slow
fatal.nim(39)            sysFatal
Error: unhandled exception: liftdestructors.nim(332, 14) `t.assignment != nil`  [AssertionError]

I guess destructors wins the race!

For now that’s all I figured out, since I also had to write all this stuff…

More?

Oh, btw. This page is quite good according to some people: http://brendangregg.com/perf.html

Since I don’t trust http sites to be around next year, here are all one liners from that site:

  1. One-Liners

Some useful one-liners I’ve gathered or written. Terminology I’m using, from lowest to highest overhead:

statistics/count: increment an integer counter on events sample: collect details (eg, instruction pointer or stack) from a subset of events (once every …) trace: collect details from every event Listing Events

# Listing all currently known events:
perf list

# Listing sched tracepoints:
perf list 'sched:*'

Counting Events

# CPU counter statistics for the specified command:
perf stat command

# Detailed CPU counter statistics (includes extras) for the specified command:
perf stat -d command

# CPU counter statistics for the specified PID, until Ctrl-C:
perf stat -p PID

# CPU counter statistics for the entire system, for 5 seconds:
perf stat -a sleep 5

# Various basic CPU statistics, system wide, for 10 seconds:
perf stat -e cycles,instructions,cache-references,cache-misses,bus-cycles -a sleep 10

# Various CPU level 1 data cache statistics for the specified command:
perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores command

# Various CPU data TLB statistics for the specified command:
perf stat -e dTLB-loads,dTLB-load-misses,dTLB-prefetch-misses command

# Various CPU last level cache statistics for the specified command:
perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-prefetches command

# Using raw PMC counters, eg, counting unhalted core cycles:
perf stat -e r003c -a sleep 5 

# PMCs: counting cycles and frontend stalls via raw specification:
perf stat -e cycles -e cpu/event=0x0e,umask=0x01,inv,cmask=0x01/ -a sleep 5

# Count syscalls per-second system-wide:
perf stat -e raw_syscalls:sys_enter -I 1000 -a

# Count system calls by type for the specified PID, until Ctrl-C:
perf stat -e 'syscalls:sys_enter_*' -p PID

# Count system calls by type for the entire system, for 5 seconds:
perf stat -e 'syscalls:sys_enter_*' -a sleep 5

# Count scheduler events for the specified PID, until Ctrl-C:
perf stat -e 'sched:*' -p PID

# Count scheduler events for the specified PID, for 10 seconds:
perf stat -e 'sched:*' -p PID sleep 10

# Count ext4 events for the entire system, for 10 seconds:
perf stat -e 'ext4:*' -a sleep 10

# Count block device I/O events for the entire system, for 10 seconds:
perf stat -e 'block:*' -a sleep 10

# Count all vmscan events, printing a report every second:
perf stat -e 'vmscan:*' -a -I 1000

Profiling

# Sample on-CPU functions for the specified command, at 99 Hertz:
perf record -F 99 command

# Sample on-CPU functions for the specified PID, at 99 Hertz, until Ctrl-C:
perf record -F 99 -p PID

# Sample on-CPU functions for the specified PID, at 99 Hertz, for 10 seconds:
perf record -F 99 -p PID sleep 10

# Sample CPU stack traces (via frame pointers) for the specified PID, at 99 Hertz, for 10 seconds:
perf record -F 99 -p PID -g -- sleep 10

# Sample CPU stack traces for the PID, using dwarf (dbg info) to unwind stacks, at 99 Hertz, for 10 seconds:
perf record -F 99 -p PID --call-graph dwarf sleep 10

# Sample CPU stack traces for the entire system, at 99 Hertz, for 10 seconds (< Linux 4.11):
perf record -F 99 -ag -- sleep 10

# Sample CPU stack traces for the entire system, at 99 Hertz, for 10 seconds (>= Linux 4.11):
perf record -F 99 -g -- sleep 10

# If the previous command didn't work, try forcing perf to use the cpu-clock event:
perf record -F 99 -e cpu-clock -ag -- sleep 10

# Sample CPU stack traces for a container identified by its /sys/fs/cgroup/perf_event cgroup:
perf record -F 99 -e cpu-clock --cgroup=docker/1d567f4393190204...etc... -a -- sleep 10

# Sample CPU stack traces for the entire system, with dwarf stacks, at 99 Hertz, for 10 seconds:
perf record -F 99 -a --call-graph dwarf sleep 10

# Sample CPU stack traces for the entire system, using last branch record for stacks, ... (>= Linux 4.?):
perf record -F 99 -a --call-graph lbr sleep 10

# Sample CPU stack traces, once every 10,000 Level 1 data cache misses, for 5 seconds:
perf record -e L1-dcache-load-misses -c 10000 -ag -- sleep 5

# Sample CPU stack traces, once every 100 last level cache misses, for 5 seconds:
perf record -e LLC-load-misses -c 100 -ag -- sleep 5 

# Sample on-CPU kernel instructions, for 5 seconds:
perf record -e cycles:k -a -- sleep 5 

# Sample on-CPU user instructions, for 5 seconds:
perf record -e cycles:u -a -- sleep 5 

# Sample on-CPU user instructions precisely (using PEBS), for 5 seconds:
perf record -e cycles:up -a -- sleep 5 

# Perform branch tracing (needs HW support), for 1 second:
perf record -b -a sleep 1

# Sample CPUs at 49 Hertz, and show top addresses and symbols, live (no perf.data file):
perf top -F 49

# Sample CPUs at 49 Hertz, and show top process names and segments, live:
perf top -F 49 -ns comm,dso

Static Tracing

# Trace new processes, until Ctrl-C:
perf record -e sched:sched_process_exec -a

# Sample (take a subset of) context-switches, until Ctrl-C:
perf record -e context-switches -a

# Trace all context-switches, until Ctrl-C:
perf record -e context-switches -c 1 -a

# Include raw settings used (see: man perf_event_open):
perf record -vv -e context-switches -a

# Trace all context-switches via sched tracepoint, until Ctrl-C:
perf record -e sched:sched_switch -a

# Sample context-switches with stack traces, until Ctrl-C:
perf record -e context-switches -ag

# Sample context-switches with stack traces, for 10 seconds:
perf record -e context-switches -ag -- sleep 10

# Sample CS, stack traces, and with timestamps (< Linux 3.17, -T now default):
perf record -e context-switches -ag -T

# Sample CPU migrations, for 10 seconds:
perf record -e migrations -a -- sleep 10

# Trace all connect()s with stack traces (outbound connections), until Ctrl-C:
perf record -e syscalls:sys_enter_connect -ag

# Trace all accepts()s with stack traces (inbound connections), until Ctrl-C:
perf record -e syscalls:sys_enter_accept* -ag

# Trace all block device (disk I/O) requests with stack traces, until Ctrl-C:
perf record -e block:block_rq_insert -ag

# Sample at most 100 block device requests per second, until Ctrl-C:
perf record -F 100 -e block:block_rq_insert -a

# Trace all block device issues and completions (has timestamps), until Ctrl-C:
perf record -e block:block_rq_issue -e block:block_rq_complete -a

# Trace all block completions, of size at least 100 Kbytes, until Ctrl-C:
perf record -e block:block_rq_complete --filter 'nr_sector > 200'

# Trace all block completions, synchronous writes only, until Ctrl-C:
perf record -e block:block_rq_complete --filter 'rwbs == "WS"'

# Trace all block completions, all types of writes, until Ctrl-C:
perf record -e block:block_rq_complete --filter 'rwbs ~ "*W*"'

# Sample minor faults (RSS growth) with stack traces, until Ctrl-C:
perf record -e minor-faults -ag

# Trace all minor faults with stack traces, until Ctrl-C:
perf record -e minor-faults -c 1 -ag

# Sample page faults with stack traces, until Ctrl-C:
perf record -e page-faults -ag

# Trace all ext4 calls, and write to a non-ext4 location, until Ctrl-C:
perf record -e 'ext4:*' -o /tmp/perf.data -a 

# Trace kswapd wakeup events, until Ctrl-C:
perf record -e vmscan:mm_vmscan_wakeup_kswapd -ag

# Add Node.js USDT probes (Linux 4.10+):
perf buildid-cache --add `which node`

# Trace the node http__server__request USDT event (Linux 4.10+):
perf record -e sdt_node:http__server__request -a

Dynamic Tracing

# Add a tracepoint for the kernel tcp_sendmsg() function entry ("--add" is optional):
perf probe --add tcp_sendmsg

# Remove the tcp_sendmsg() tracepoint (or use "--del"):
perf probe -d tcp_sendmsg

# Add a tracepoint for the kernel tcp_sendmsg() function return:
perf probe 'tcp_sendmsg%return'

# Show available variables for the kernel tcp_sendmsg() function (needs debuginfo):
perf probe -V tcp_sendmsg

# Show available variables for the kernel tcp_sendmsg() function, plus external vars (needs debuginfo):
perf probe -V tcp_sendmsg --externs

# Show available line probes for tcp_sendmsg() (needs debuginfo):
perf probe -L tcp_sendmsg

# Show available variables for tcp_sendmsg() at line number 81 (needs debuginfo):
perf probe -V tcp_sendmsg:81

# Add a tracepoint for tcp_sendmsg(), with three entry argument registers (platform specific):
perf probe 'tcp_sendmsg %ax %dx %cx'

# Add a tracepoint for tcp_sendmsg(), with an alias ("bytes") for the %cx register (platform specific):
perf probe 'tcp_sendmsg bytes=%cx'

# Trace previously created probe when the bytes (alias) variable is greater than 100:
perf record -e probe:tcp_sendmsg --filter 'bytes > 100'

# Add a tracepoint for tcp_sendmsg() return, and capture the return value:
perf probe 'tcp_sendmsg%return $retval'

# Add a tracepoint for tcp_sendmsg(), and "size" entry argument (reliable, but needs debuginfo):
perf probe 'tcp_sendmsg size'

# Add a tracepoint for tcp_sendmsg(), with size and socket state (needs debuginfo):
perf probe 'tcp_sendmsg size sk->__sk_common.skc_state'

# Tell me how on Earth you would do this, but don't actually do it (needs debuginfo):
perf probe -nv 'tcp_sendmsg size sk->__sk_common.skc_state'

# Trace previous probe when size is non-zero, and state is not TCP_ESTABLISHED(1) (needs debuginfo):
perf record -e probe:tcp_sendmsg --filter 'size > 0 && skc_state != 1' -a

# Add a tracepoint for tcp_sendmsg() line 81 with local variable seglen (needs debuginfo):
perf probe 'tcp_sendmsg:81 seglen'

# Add a tracepoint for do_sys_open() with the filename as a string (needs debuginfo):
perf probe 'do_sys_open filename:string'

# Add a tracepoint for myfunc() return, and include the retval as a string:
perf probe 'myfunc%return +0($retval):string'

# Add a tracepoint for the user-level malloc() function from libc:
perf probe -x /lib64/libc.so.6 malloc

# Add a tracepoint for this user-level static probe (USDT, aka SDT event):
perf probe -x /usr/lib64/libpthread-2.24.so %sdt_libpthread:mutex_entry

# List currently available dynamic probes:
perf probe -l

Mixed

# Trace system calls by process, showing a summary refreshing every 2 seconds:
perf top -e raw_syscalls:sys_enter -ns comm

# Trace sent network packets by on-CPU process, rolling output (no clear):
stdbuf -oL perf top -e net:net_dev_xmit -ns comm | strings

# Sample stacks at 99 Hertz, and, context switches:
perf record -F99 -e cpu-clock -e cs -a -g 

# Sample stacks to 2 levels deep, and, context switch stacks to 5 levels (needs 4.8):
perf record -F99 -e cpu-clock/max-stack=2/ -e cs/max-stack=5/ -a -g 

Special

# Record cacheline events (Linux 4.10+):
perf c2c record -a -- sleep 10

# Report cacheline events from previous recording (Linux 4.10+):
perf c2c report

Reporting

# Show perf.data in an ncurses browser (TUI) if possible:
perf report

# Show perf.data with a column for sample count:
perf report -n

# Show perf.data as a text report, with data coalesced and percentages:
perf report --stdio

# Report, with stacks in folded format: one line per stack (needs 4.4):
perf report --stdio -n -g folded

# List all events from perf.data:
perf script

# List all perf.data events, with data header (newer kernels; was previously default):
perf script --header

# List all perf.data events, with customized fields (< Linux 4.1):
perf script -f time,event,trace

# List all perf.data events, with customized fields (>= Linux 4.1):
perf script -F time,event,trace

# List all perf.data events, with my recommended fields (needs record -a; newer kernels):
perf script --header -F comm,pid,tid,cpu,time,event,ip,sym,dso 

# List all perf.data events, with my recommended fields (needs record -a; older kernels):
perf script -f comm,pid,tid,cpu,time,event,ip,sym,dso

# Dump raw contents from perf.data as hex (for debugging):
perf script -D

# Disassemble and annotate instructions with percentages (needs some debuginfo):
perf annotate --stdio

Using valgrind to see truth

That didn’t stop me from also taking a look at valgrind, or more specifically callgrind to see what it has to say. Well, good thing I started to run valgrind already in the afternoon, so it’s finished by now!… Well I’m kidding of course, instead of lightning fast 30 s it valgrind only took a quick 9 minutes to run the weirdFastSlow.nim from above. :P (Note: I’m aware that one cannot compare valgrind and perf, I’m joking here of course. :) :) :) )

So I called valgrind with the following options:

valgrind --tool=callgrind bookPlots/weirdFastSlow

which gave us:

==7359== Callgrind, a call-graph generating cache profiler                                                                                                                                                                                    
==7359== Copyright (C) 2002-2017, and GNU GPL'd, by Josef Weidendorfer et al.                                                                                                                                                                 
==7359== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info                                                                                                                                                                   
==7359== Command: bookPlots/weirdFastSlow                                                                                                                                                                                                     
==7359==                                                                                                                                                                                                                                      
==7359== For interactive control, run 'callgrind_control -h'.                                                                                                                                                                                 
Plot first: 9.817448616027832                                                                                                                                                                                                                 
Dataframe with 10 columns and 53940 rows:                                                                                                                                                                                                     
        Idx      carat        cut      color  ...
...
Read first: 53.51081871986389                              
Dataframe with 10 columns and 53940 rows:  
...                                                                                                                                                                                                        
Fast together 58.7412965297699        
Together 535.3335099220276
==7359== 
==7359== Events    : Ir
==7359== Collected : 157614448556
==7359== 
==7359== I   refs:      157,614,448,556

So at least we also see the factor of ~10 with valgrind.

Now taking a look at the callgrind report with kcachegrind:

kcachegrind callgrind.out.7359

we are greeted with the following image:

./kcachegrind_weirdSlowFast.png

Here I already highlighted the readCsv line on the left. On the right we can already see the interesting thing (which we already know sort of): The slowTogether proc makes up ~81 % of the cycles of readCsv whereas fastRead only makes up 10 %! Sure, there is a small difference (and that’s the reason fastPlot shows up too): in the slowTogether proc there are two readCsv calls, one for mpg (~230 lines of CSV) and diamonds (~53000 lines of csv!), whereas in fastRead only the diamonds file is read.

But still, that’s definitely pretty insane.

And with this I’m stopping for the day, since I need to eat and sleep!

@Vindaar
Copy link
Author

Vindaar commented Nov 8, 2019

And the missing kcachegrind image.

kcachegrind_weirdSlowFast

@mratsim
Copy link

mratsim commented Nov 8, 2019

Do you perhaps have a full stack? status-im/nimbus-eth2#370

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment