Skip to content

Instantly share code, notes, and snippets.

@davidavdav
Last active August 29, 2015 14:08
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save davidavdav/a2332f6620c3bb259d51 to your computer and use it in GitHub Desktop.
Save davidavdav/a2332f6620c3bb259d51 to your computer and use it in GitHub Desktop.
internal or external sumsq() in inner loop ~ factor ~ 10 execution time
## weirdness in innerloop optimization
## similar to sumabs2()
function mydot{T}(x::Array{T})
s = zero(T)
for xx in x
s += xx*xx
end
s
end
function f!(x::Matrix, y::Matrix, n, usemydot=false)
nx, d = size(x)
r = similar(x, size(x,1), n)
yi = similar(x, d)
for k=1:n
a = 1.0
b = 1.0
rand!(y) # flush caches, y is some function of x
y += x # this takes sone time but is not the focus here
for i=1:nx
for j=1:d
yi[j] = y[i,j]
end
## This should be the critical lines
if usemydot
s = mydot(yi)
else
s = sumabs2(yi)
end
r[i,k] = a + b*s
end
end
r
end
## typical usage
x = randn(10000, 60)
y = randn(10000, 60)
n = 100
function prof(usemydot=false)
Profile.clear()
@profile f!(x, y, n, usemydot)
Profile.print(format=:flat)
end
@davidavdav
Copy link
Author

## use mydot(), a sum-squares
@time prof(true)
## 1.5 s

## use Base.sumabs2() or a BLAS dot() for that matter
@time prof(false)
## 16. sec

@ivarne
Copy link

ivarne commented Nov 8, 2014

You likely have a bug on line 22. += isn't a in-place operator, but a short syntax for y = y+x. That way you only modify y once.

@ivarne
Copy link

ivarne commented Nov 8, 2014

By the way, I don't get a significant difference between the two tests on master nor release-0.3.

@waTeim
Copy link

waTeim commented Nov 8, 2014

Is there any way to not do this? You're killing the CPU cache by having the inner loop on the 2nd index because of column-major order.

for i=1:nx
    for j=1:d
        yi[j] = y[i,j]
    end

@davidavdav
Copy link
Author

@ivarne, this is not about line 22. These lines are just there as a placeholder for some more complicated calculations that would distract from the innerloop that is following. Basically y = somefunction(x) right there. I have included it because it might influence the cache.

OK, so if you don't see a difference in performance, then there might be something going on with my set-up. I'll try on the linux servers rather than my laptop. I always compile everything from source. I've tested this for release-0.3 and master myself, with the same results.

As a result of a discussion on julia-users, I replaced NumericExtensions.sumsq() by Base.sumabs2, but my timing results are the same.

@waTeim. I realize the row-vector yi is breaking column major order, the structure of y results from some larger in-place BLAS trmm! operation, it might be possible to flip the dimensions there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment