@shenwei356
Created August 4, 2020 12:10
  8            .          .           TEXT ·__mm_add_epi32(SB),0,$0 
  9        640ms      640ms               VMOVDQU x+0(FP), Y0 
 10        5.62s      5.62s               VMOVDQU y+32(FP), Y1 
 11        4.81s      4.81s               VPADDD  Y1, Y0, Y0 
 12        1.16s      1.16s               VMOVDQU Y0, q+64(FP) 
 13        1.30s      1.30s               VZEROUPPER 
 14            .          .               RET 
@shenwei356

Why is retrieving the second parameter (L10) so much slower than retrieving the first one (L9)?

@clausecker

The time a profile attributes to a specific instruction is often not a reliable indication of how long that instruction actually takes, because modern processors execute instructions out of order. What pprof measures is instead the time the CPU is stuck on one instruction without being able to progress to the next one because all the resources it needs are occupied. As soon as an appropriate execution unit is free, the CPU can proceed to the next instruction.

As I said earlier, if the whole loop you call this function in can be written in assembly, all these data moves can be eliminated and your code is likely going to be a lot faster. Writing an assembly function to wrap a single instruction like this is pretty pointless.
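To illustrate the shape I mean, here is a minimal sketch in plain Go, not the actual assembly. The name `addInt32` and its signature are made up for this example. The point is only that the whole loop sits behind a single call, so the loads and stores through the stack frame seen in the listing above are not repeated for every vector.

```go
package main

import "fmt"

// addInt32 is a hypothetical helper (not from the gist): it writes
// x[i] + y[i] into dst[i] for every index that all three slices share.
// The loop lives inside the function, so the call overhead is paid once
// per slice rather than once per 8-lane vector.
func addInt32(dst, x, y []int32) {
	n := len(dst)
	if len(x) < n {
		n = len(x)
	}
	if len(y) < n {
		n = len(y)
	}
	// Reslice to a common length; this can help the compiler
	// drop bounds checks inside the loop.
	dst, x, y = dst[:n], x[:n], y[:n]
	for i := range dst {
		dst[i] = x[i] + y[i]
	}
}

func main() {
	x := []int32{1, 2, 3, 4, 5, 6, 7, 8, 9}
	y := []int32{10, 20, 30, 40, 50, 60, 70, 80, 90}
	dst := make([]int32, len(x))
	addInt32(dst, x, y)
	fmt.Println(dst) // prints [11 22 33 44 55 66 77 88 99]
}
```

An assembly version could keep a signature like this and replace the loop body with VMOVDQU/VPADDD over eight int32 values per iteration, reading and writing the slices directly instead of passing each vector as an argument.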

@shenwei356

I see. I'll post another thread.

Thanks again for your sincere advice. I'll try to learn assembly, which is so useful for improving performance.
