saschatimme/codegen.md

## codegen.md

      
    Raw
  

              codegen.md
            
          
    I demonstrate what I mean with this using HomotopyContinuation.jl.
using HomotopyContinuation
@var x
f = System([4*x^4+3*x^3+2*x^2+x-5])
Given a symbolic system of polynomials f, in HC.jl there are two ways
to optimize them for numerical evaluation.
First the expression is optimized with similar techniques as you do (common subexpression elimination and then a simple multivariate horner-schema).
Then the optimized expression is preprared for repeated numerical evaluation. This can be done using an InterpretedSystem which is very similar to what you do (record all operations in arrays and "play" them for evaluation). The other is CompiledSystem where
an actual straight line program is compiled:
julia> F = CompiledSystem(f)
Compiled: System of length 1
 1 variables: x

 -5 + x*(1 + x*(2 + x*(3 + 4*x)))

# This code is automatically generated for the evaluation of `F` 
julia> HomotopyContinuation.ModelKit._evaluate!_impl(typeof(F))
quote
    begin
       @boundscheck checkbounds(u, 1:1)
       @boundscheck checkbounds(x, 1:1)
       @boundscheck p === nothing || checkbounds(p, 1:0)
    end
    @inbounds begin
            _ι1 = 4 * x[1]
            _ι2 = 3 + _ι1
            _ι3 = x[1] * _ι2
            _ι4 = 2 + _ι3
            _ι5 = x[1] * _ι4
            _ι6 = 1 + _ι5
            _ι7 = x[1] * _ι6
            _ι8 = -5 + _ι7
            u[1] = _ι8
    end
    u
end

# We can see that this is turned into not many assembly instructions
julia> @code_native debuginfo=:none evaluate!([0.0], F, [2.0])
	.section	__TEXT,__text,regular,pure_instructions
	subq	$40, %rsp
	movq	%rdi, %rax
	movabsq	$5250033792, %rcx       ## imm = 0x138ED2880
	vmovaps	(%rcx), %xmm0
	vmovups	%xmm0, 24(%rsp)
	cmpq	$0, 24(%rdi)
	jle	L148
	vmovups	%xmm0, 8(%rsp)
	cmpq	$0, 24(%rdx)
	jle	L170
	movq	(%rdx), %rcx
	vmovsd	(%rcx), %xmm0           ## xmm0 = mem[0],zero
	movabsq	$5250033752, %rcx       ## imm = 0x138ED2858
	vmulsd	(%rcx), %xmm0, %xmm1
	movabsq	$5250033760, %rcx       ## imm = 0x138ED2860
	vaddsd	(%rcx), %xmm1, %xmm1
	vmulsd	%xmm1, %xmm0, %xmm1
	movabsq	$5250033768, %rcx       ## imm = 0x138ED2868
	vaddsd	(%rcx), %xmm1, %xmm1
	vmulsd	%xmm1, %xmm0, %xmm1
	movabsq	$5250033776, %rcx       ## imm = 0x138ED2870
	vaddsd	(%rcx), %xmm1, %xmm1
	vmulsd	%xmm1, %xmm0, %xmm0
	movabsq	$5250033784, %rcx       ## imm = 0x138ED2878
	vaddsd	(%rcx), %xmm0, %xmm0
	movq	(%rax), %rcx
	vmovsd	%xmm0, (%rcx)
	addq	$40, %rsp
	retq
L148:
	movabsq	$throw_boundserror, %rcx
	leaq	24(%rsp), %rsi
	movq	%rax, %rdi
	callq	*%rcx
	ud2
L170:
	movabsq	$throw_boundserror, %rax
	leaq	8(%rsp), %rsi
	movq	%rdx, %rdi
	callq	*%rax
	ud2
Comparing the performance of InterpretedSystem and CompiledSystem we see that the latter is around 10x faster.
julia> u = zeros(1)
x1-element Array{Float64,1}:
 0.0

julia> x = [rand()]
1-element Array{Float64,1}:
 0.3889664871743841

julia> using BenchmarkTools

julia> @btime evaluate!($u, $F, $x)
  3.750 ns (0 allocations: 0 bytes)
1-element Array{Float64,1}:
 -4.04033706522339

julia> @btime evaluate!($u, $G, $x)
  35.033 ns (0 allocations: 0 bytes)
1-element Array{Float64,1}:
 -4.04033706522339
However, if the expressions are large than the compilation step for CompiledSystem can become prohibitively slow.