Notice that when running it on a Linux VM on OSX I get about a 4x speedup relative to running it natively on OSX.
Also notice from the profile that the Linux version spends about 90% of its time in the myexp call and only about 3% in the line @inbounds d_i2_j += ((points[k, i] - data[k, j])^2). (I know these aren't actual times, only counts of how often the profiler sampled that line, but they give an approximation of the time spent.)
The OSX version spends about 57% of its time (so about 5.3 seconds on average) in myexp and about 33% of its time (about 3 seconds on average) in the line @inbounds d_i2_j += ((points[k, i] - data[k, j])^2).
I can't explain why, on the same hardware, the OSX version is so much slower here.
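
One way to check whether the profiler's split between the two hot spots is real would be to time them separately on each OS. The following is only a minimal sketch, not my actual benchmark: the array sizes are made up, the helper names sqdist, distance_only and distance_and_exp are hypothetical, and Base's exp stands in for myexp.

```julia
# Assumptions for illustration only: array sizes are arbitrary, and Base.exp
# is used as a stand-in for the myexp function discussed above.

# Squared Euclidean distance between column i of points and column j of data.
function sqdist(points::Matrix{Float64}, data::Matrix{Float64}, i::Int, j::Int)
    d_i2_j = 0.0
    for k in 1:size(points, 1)
        @inbounds d_i2_j += (points[k, i] - data[k, j])^2
    end
    return d_i2_j
end

# Sum of squared distances over all column pairs (distance loop only).
function distance_only(points, data)
    s = 0.0
    for j in 1:size(data, 2), i in 1:size(points, 2)
        s += sqdist(points, data, i, j)
    end
    return s
end

# Same loop, with an exponential applied to each squared distance.
function distance_and_exp(points, data)
    s = 0.0
    for j in 1:size(data, 2), i in 1:size(points, 2)
        s += exp(-sqdist(points, data, i, j))
    end
    return s
end

points = rand(3, 200)
data = rand(3, 2000)

# Warm up so that compilation time is not included in the measurements.
distance_only(points, data)
distance_and_exp(points, data)

t_dist = @elapsed distance_only(points, data)
t_both = @elapsed distance_and_exp(points, data)
println("distance loop only:  ", t_dist, " s")
println("distance loop + exp: ", t_both, " s")
```

If the gap between the two timings differs a lot between the Linux VM and native OSX, that would point at the exponential rather than the array indexing, and vice versa.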