\section{Intro}\label{intro}
I have been a happy Python user for more than 20 years. I use it daily
as a data scientist with the classical scientific stack and a little
PyTorch when I get fancy.
Recently I decided to give \href{https://julialang.org/}{Julia} a
try. I must admit I was skeptical. Is it worth it? I think so.
Let me say that I was surprised by Julia's design, but that is the
subject of another post. This post is about what everyone thinks is
Julia's differentiating advantage:
\href{https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/julia-gcc.html}{speed}.
\textbf{The main conclusion is that it is possible, in a practical way, to
obtain near-C speed}. Julia will not make your code run fast
automagically, but it is a power tool that in trained hands can achieve
good results.
I'm still a Julia newbie, so I wouldn't be surprised if something I did
can be done better or something I wrote is not correct. Please let me
know if that is the case.
Just jump to the \protect\hyperlink{results}{results} if you don't feel
like reading such a lengthy post.
\section{Recommendation algorithms}\label{recommendation-algorithms}
I wanted to tackle a problem that is familiar to me, has the right
amount of complexity and has practical applications, so I decided
to implement a recommender algorithm. Recommender systems are a
fascinating problem where, I feel, not much progress has been made recently.
New deep learning approaches have not shown the same qualitative
improvements that have been achieved in other areas, and I think there
are great opportunities in the field yet to be found. Anyway, this post
is about Julia the language and not about recommender systems.
I wanted to compare speed against a well optimized and mature
\href{https://cython.org/}{Cython} implementation,
\href{https://github.com/lyst/lightfm}{LightFM}. LightFM implements good
old matrix factorization with some additional functionality to
incorporate user and item features and also implements some loss
functions suitable for implicit problems like
\href{https://making.lyst.com/lightfm/docs/examples/warp_loss.html}{WARP}
which has given me good results in the past. WARP is also interesting
because it has been designed with sparseness in mind and doesn't lend
itself to efficient implementations using automatic differentiation
frameworks like \href{https://pytorch.org/}{PyTorch} or, in the case of
Julia, \href{https://github.com/FluxML/Zygote.jl}{Zygote}. At least I
have not been able to use them for this problem: it would require
generating sparse gradient updates and preventing memory allocations. This
also means the GPU offers little advantage over a CPU, if any. There is
also \href{https://github.com/benfred/implicit}{Implicit}, which runs
very fast, leveraging the GPU if you wish, but in my use cases WARP +
LightFM has given better results.
My implementation is far from feature parity with LightFM. In particular,
I have not added user and item features, and I have only implemented the WARP
loss, without regularization. Nevertheless, I'm quite satisfied with the
results, and everything amounts to less than 300 lines of code.
\section{Matrix factorization}\label{matrix-factorization}
In case you don't know, or don't remember, the idea is to assign to each
user a vector, which we will call an embedding, of some size
\texttt{n\_components}, which is a hyperparameter of our model. The
higher it is, the higher our model's capacity. We do the same with items,
and we also consider an item bias, a single scalar, for each item (more
popular items will have a greater bias). The affinity, preference or score
of a user for an item is just the scalar product of the user embedding
with the item embedding, plus the item bias. So if the ith user has
embedding \(\mathbf{u}_i\) and the jth item has embedding \(\mathbf{v}_j\),
then the score \(s(i, j)\) that the user assigns to the item is simply:
\[
s(i, j) = \mathbf{u}_i \cdot \mathbf{v}_j + b_j
\]
If we want to recommend the best 5 items for a given user then we
compute the score of that user against all item embeddings, which can be
done efficiently using matrix-vector multiplication, and then sort the
items by score and keep the 5 with the highest score. If you have not
found this explanation satisfactory the author of the Implicit library
has a nice blog post giving
\href{https://www.benfrederickson.com/matrix-factorization/}{more
details} about implicit matrix factorization.
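As a quick sketch (my own illustration, not part of the post's code), that
top-5 recommendation step could look like this, with embeddings stored
column-wise as in the struct defined next:
\begin{verbatim}
# user_embeddings is n_components × n_users, item_embeddings is n_components × n_items
# and item_bias is a vector with one entry per item (the fields of the struct below).
user   = 42                                                        # some user index
scores = item_embeddings' * user_embeddings[:, user] .+ item_bias  # one score per item
top5   = partialsortperm(scores, 1:5, rev=true)                    # indices of the 5 best items
\end{verbatim}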
Translating the scoring math above into code, we start with a struct that
holds the embeddings and the item biases:
\begin{verbatim}
struct Factorization{T <: AbstractFloat}
    user_embeddings::Matrix{T}
    item_embeddings::Matrix{T}
    item_bias::Vector{T}

    function Factorization{T}(n_users::Int,
                              n_items::Int,
                              n_components::Int) where {T<:AbstractFloat}
        return new{T}(
            zeros(T, n_components, n_users),
            zeros(T, n_components, n_items),
            zeros(T, n_items))
    end
end
\end{verbatim}
If you are new to Julia, the above just declares a struct with 3 fields.
It has a type parameter \texttt{T} which can be any floating point type
(\texttt{Float16}, \texttt{Float32}, \texttt{Float64}). The type
parameter appears inside the fields and constrains them to be arrays
with elements of that same type (\texttt{Matrix\{T\}} is just a type alias for
\texttt{Array\{T,\ 2\}} and \texttt{Vector\{T\}} for
\texttt{Array\{T,\ 1\}}). This is a very important point: it means that
any indexing operation will return elements of that type and the compiler
will be able to generate very efficient code, so this extra syntax noise
is worth the effort, not just for documentation purposes but for
performance. I would love to say that it also helps catch errors, but I
have been unable to configure my editor setup (Emacs + julia-lsp) for
that. It seems that static type checking is limited, but I will have to
have a look at
\href{https://github.com/julia-vscode/StaticLint.jl}{this} anyway.
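A quick REPL check of those aliases (not in the original post, but easy to
verify):
\begin{verbatim}
julia> Matrix{Float32} === Array{Float32, 2}
true
julia> Vector{Float32} === Array{Float32, 1}
true
\end{verbatim}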
\href{https://docs.julialang.org/en/v1/manual/types/}{Julia's type
system} is not complicated at all and is a necessary part of the
language. You need it for
\href{https://docs.julialang.org/en/v1/manual/methods/}{method
dispatch}, an alternative to OOP which is quite natural and easy to pick
up if you are coming, for example, from Python. Anyway, a lot of Python
code is just checking argument types at the beginning of functions; for
example, have a look at
\href{https://github.com/scikit-learn/scikit-learn/blob/0fb307bf3/sklearn/linear_model/_base.py\#L389}{a
method inside scikit-learn}, selected at random.
One of the problems with optional type annotations is the need to find
\href{https://dune.fandom.com/wiki/The_Golden_Path}{The Golden Path}
that leads you to performant code: there are lots of ways to do
something but only one that makes it go fast. This also happens in
Julia, but I think the story is much better than in, say, Cython. For
example, the following structure declarations are all valid Julia code
but suffer from different problems.
This first variant is perfectly valid and doesn't have type annotations:
\begin{verbatim}
struct Factorization
    user_embeddings
    item_embeddings
    item_bias
end
\end{verbatim}
Some type annotations:
\begin{verbatim}
struct Factorization
    user_embeddings::AbstractArray
    item_embeddings::AbstractArray
    item_bias::AbstractArray
end
\end{verbatim}
This is also valid:
\begin{verbatim}
struct Factorization
    user_embeddings::Array
    item_embeddings::Array
    item_bias::Array
end
\end{verbatim}
And this:
\begin{verbatim}
struct Factorization
    user_embeddings::Matrix
    item_embeddings::Matrix
    item_bias::Vector
end
\end{verbatim}
In order to add more typing you might be tempted to do the following
\begin{verbatim}
struct Factorization
    user_embeddings::Matrix{AbstractFloat}
    item_embeddings::Matrix{AbstractFloat}
    item_bias::Vector{AbstractFloat}
end
\end{verbatim}
The above has two problems, AFAIK. First, the arrays can be of different
types; for example \texttt{user\_embeddings} could be \texttt{Float32}
and \texttt{item\_embeddings} could be \texttt{Float64}, which would
incur float conversions when performing arithmetic. The other, more
important, problem is that you could actually mix \texttt{Float32} and
\texttt{Float64} inside the same array. Actually, even if you don't mix
them, Julia will be forced to consider that possibility, and since both
types have different sizes it will store references inside the array
instead of values, which adds even more performance penalties (and
storage). These explanations, and much more, are better developed in
Julia's documentation in the section about
\href{https://docs.julialang.org/en/v1/manual/performance-tips/}{performance
tips}. It's not complicated but requires some learning.
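To make the second problem concrete, here is a quick REPL check (my
addition): an array with element type \texttt{AbstractFloat} happily stores
mixed concrete types, and such an abstract element type cannot be stored
inline:
\begin{verbatim}
julia> v = AbstractFloat[0.5f0, 0.5];    # a Float32 and a Float64 in the same array
julia> typeof.(v)
2-element Vector{DataType}:
 Float32
 Float64
julia> isbitstype(Float64), isbitstype(AbstractFloat)
(true, false)
\end{verbatim}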
Let's continue and implement some little helper functions, using the
single-expression function definition syntax:
\begin{verbatim}
n_users(model::Factorization) = size(model.user_embeddings, 2)
n_items(model::Factorization) = size(model.item_embeddings, 2)
n_components(model::Factorization) = size(model.user_embeddings, 1)
\end{verbatim}
And the dot product to compute the user/item score:
\begin{verbatim}
function (model::Factorization)(user::Int, item::Int)
    s = model.item_bias[item]
    for i=1:n_components(model)
        s += model.item_embeddings[i, item] * model.user_embeddings[i, user]
    end
    return s
end
\end{verbatim}
The above code snippet defines the equivalent of Python's
\texttt{\_\_call\_\_} special method and allows us to write, for example,
\texttt{model(42,\ 22)} to compute the score between user 42 and item
22.
Some comments about performance again: first, notice that the above code
doesn't have parametric types anywhere (we just write
\texttt{model::Factorization} instead of, say,
\texttt{model::Factorization\{T\}}). This is perfectly correct and
normal and does not actually incur performance penalties. Second, we
could have written the dot product in an alternative form which deviates
from our Golden Path:
\begin{verbatim}
(model::Factorization)(user::Int, item::Int) =
    model.item_bias[item] +
    model.item_embeddings[:, item]' * model.user_embeddings[:, user]
\end{verbatim}
Actually I would have written the above if working from the REPL, since
it's more succinct and closer to the mathematical definition. But this
seemingly innocent expression has several problems. First, in Julia,
array slicing creates copies of the data by default, which I think is
great because you have the alternative of creating views. Simply use:
\begin{verbatim}
(model::Factorization)(user::Int, item::Int) =
    model.item_bias[item] +
    view(model.item_embeddings, :, item)' *
    view(model.user_embeddings, :, user)
\end{verbatim}
This avoids copying the data but still allocates memory for structs of
type \texttt{SubArray}. You can check it with the very useful macro
\texttt{@allocated}, which has proven essential for me when debugging
performance problems:
\begin{verbatim}
julia> @allocated model(10, 10)
432
\end{verbatim}
The above shows that a single call to the function allocates 432 bytes,
a very significant amount of memory for a function that will be called
hundreds of millions or even billions of times.
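If you want a more systematic measurement,
\href{https://github.com/JuliaCI/BenchmarkTools.jl}{BenchmarkTools.jl} (my
suggestion, it is not used in this post) reports timings and allocations
while avoiding global-scope noise thanks to \texttt{\$}-interpolation:
\begin{verbatim}
julia> using BenchmarkTools          # assumes the package is installed
julia> @btime $model(10, 10);        # the loop-based method: expect zero allocations here
\end{verbatim}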
With NumPy you need to learn which operations create views and which
create copies. It's not complicated, but you have to reason about whether
a view can be created based on how arrays are represented internally
(pointer to data + strides). For example, in Python, indexing with a
boolean array will create a copy but slicing will create a view. I very
rarely make mistakes of this kind, but I must admit I have sometimes
introduced a bug because of this: adding code that modified a view
in-place, which in turn modified the original array, because I forgot I
was working with a view.
As you can see, there are several different ways of doing the same thing
in Julia. At least I think there are always two: one form suitable for
exploration, where you write vectorized functions in the spirit of Matlab
or NumPy, and another where you write loops and annotate types,
similar to C. The good thing is that the transition between the two is
completely smooth.
\section{Training data}\label{training-data}
I have decided to represent user/item interactions in a similar way to
LightFM. Whenever a user has interacted (clicked, bought, viewed,
whatever depending on the application) with an item we store the value
of the interaction (a 0-5 rating, +1/-1, 0/1) in the cell of a matrix
where the row is the item and the column is the user.
When our data is implicit, the presence of an entry means that the user
somehow liked the item. The absence of an entry can mean that she didn't
like the item or, most probably, didn't know about it. We therefore build
a very big sparse matrix with maybe millions of columns and
rows. The matrix is called \texttt{interactions} in the code and
is represented with a
\href{https://docs.julialang.org/en/v1/stdlib/SparseArrays/}{SparseMatrixCSC}.
We will be using the
\href{http://ocelma.net/MusicRecommendationDataset/lastfm-360K.html}{Last.fm
dataset} to have something concrete and to perform benchmarks. The
following function processes the TSV file, converting user hashes and
artist ids to numeric indices suitable to represent rows and columns. It
returns an array with one row for each interaction and 3 columns: user,
item and number of times the artist has been played. We don't really
care how many times the artist has been played, just that it has been
played, but we store the count nevertheless:
\begin{verbatim}
function load_lastfm(path)
    function index!(dict, name)
        i = get(dict, name, 0)
        if i == 0
            dict[name] = i = length(dict) + 1
        end
        return i
    end
    data = Array{Int}(undef, countlines(path), 3)
    users = Dict{String, Int}()
    items = Dict{String, Int}()
    for (i, line) in enumerate(eachline(path))
        cells = split(line, '\t')
        data[i, 1] = index!(users, cells[1])
        data[i, 2] = index!(items, cells[2])
        data[i, 3] = parse(Int, cells[4])
    end
    return data
end
\end{verbatim}
We can convert the above to the sparse matrix representation easily:
\begin{verbatim}
julia> lastfm = load_lastfm("path/to/lastfm-360k.tsv");  # path to the downloaded TSV
julia> using SparseArrays
julia> lastfm = sparse(view(lastfm, :, 2),
                       view(lastfm, :, 1),
                       view(lastfm, :, 3));
\end{verbatim}
And it's convenient to serialize to disk for easy and fast reading back:
\begin{verbatim}
julia> using Serialization
julia> serialize("lastfm.jlser", lastfm)
\end{verbatim}
Now if we restart Julia we can simply read back this data:
\begin{verbatim}
julia> using Serialization
julia> lastfm = deserialize("lastfm.jlser")
\end{verbatim}
The dataset has 360K users and 160K different artists, and only 0.03\% of
the entries of the interactions matrix are non-zero (17M interactions).
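These figures are easy to double-check from the sparse matrix itself:
\begin{verbatim}
julia> using SparseArrays
julia> size(lastfm)                      # (n_items, n_users)
julia> nnz(lastfm)                       # number of stored interactions, about 17M
julia> nnz(lastfm) / length(lastfm)      # fraction of non-zero entries, about 0.0003
\end{verbatim}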
\section{WARP loss}\label{warp-loss}
Learning to rank items, as in recommender systems but also in other
domains like search, has the difficulty that the metric we
want to optimize directly (actually not completely true for recommender
systems, but that would require talking about online/offline metrics and
digressing a lot) is either non-differentiable or constant. This means that
the gradient is either non-existent or zero. Let's say for example we
have 5 items whose correct order is:
\[
A B C D E
\]
Our model has computed the following scores:
\[
\begin{gathered}
s_A = 4.1 \\
s_B = 5.3 \\
s_C = 2.5 \\
s_D = 1.4 \\
s_E = 3.0
\end{gathered}
\]
This corresponds to the ordering: \[
B A E C D
\]
As you can see, changing the score of \(A\) from 4.1 to 4.2 doesn't
change the ordering (gradient zero). If however it changes from 4.1 to
5.4, then the ranking changes and whatever loss we have defined between
the two orderings, like for example the count of correctly ordered
items, will experience a sudden jump (gradient undefined).
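A tiny Julia sketch (my own illustration) makes this concrete, using the
closely related count of correctly ordered pairs as the metric:
\begin{verbatim}
# correct order is A B C D E, i.e. item i should score higher than item j whenever i < j
correct_pairs(s) = count(s[i] > s[j] for i in 1:length(s) for j in i+1:length(s))
s = [4.1, 5.3, 2.5, 1.4, 3.0]   # s_A, s_B, s_C, s_D, s_E
correct_pairs(s)                # 7 out of 10 pairs are in the right order
s[1] = 4.2; correct_pairs(s)    # still 7: the metric is flat, gradient zero
s[1] = 5.4; correct_pairs(s)    # jumps to 8: the (A, B) pair is now correct
\end{verbatim}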
One common approach is to use what are called ``pairwise losses''. The
main idea is that you compare pairs of items and, if they are out of
order, they add to the loss some function that depends on their
scores. In the above example the following pairs are out of order:
\[
(B, A), \qquad (E, C), \qquad (E, D)
\]
So we could build the following differentiable loss function, for
example:
\[
L = \sigma (s_B - s_A) + \sigma (s_E - s_C) + \sigma (s_E - s_D)
\]
where \(\sigma\) is a differentiable function like the sigmoid or just
the identity. Obviously a perfect pairwise loss is equivalent to a
perfect ordering, but it's not obvious that a perfect ordering can be
achieved by repeated application of pairwise losses, and actually I
think it has not been proved. \textbf{If anyone reading this has any
information on the subject I would love to know about it}. There have
been some recent publications on directly differentiating sorting and
ranking in order to plug them into TensorFlow, PyTorch or similar
automatic differentiation libraries, like the paper
\href{https://arxiv.org/pdf/2002.08871.pdf}{Fast Differentiable Sorting
and Ranking}, which achieves optimal asymptotic complexity. I did have a
look at the paper, although the math is quite dense and I need to study
more combinatorial optimization, and I have not tested its
practicality. Anyway, back to WARP.
WARP is one such pairwise loss, and it implements a clever trick to
avoid computing lots of user/item scores. Take into account that it is
quite possible to have a million items in a recommender system, and
you can easily have millions or even hundreds of millions of users. The idea
is to pick some item, compute its score, and then start picking random
items until you find one that has a higher score but shouldn't, what I
will call an inversion. Then you add the pair to the loss, scaled by how
difficult it was to find the inversion (the number of random items you
needed to test). As soon as you find the pair you just perform the
gradient step that modifies the user embedding and the two item
embeddings that compose the inverted pair. In order to improve the model
we require that the difference in score between the two items has some
margin.
In the context of recommender systems with implicit data we usually just
discriminate between positive and negative items and don't care about the
ranking among positives or among negatives. For example, if we again
have 5 items and \(A\) and \(B\) are positive while the rest are negative,
then these are a few of the equivalent perfect orderings of the items:
\[
\begin{gathered}
ABCDE \\
BACDE \\
ABDCE \\
BADCE
\end{gathered}
\]
The code to find inversions:
\begin{verbatim}
function is_positive(interactions::SparseMatrixCSC{S, Int},
                     user::Int,
                     item::Int) where {S}
    for i=interactions.colptr[user]:interactions.colptr[user+1] - 1
        if item == interactions.rowval[i]
            return true
        end
    end
    return false
end

function find_inversion(model::Factorization,
                        user::Int,
                        item::Int,
                        interactions::SparseMatrixCSC{S, Int},
                        max_sample::Int) where {S}
    n = n_items(model)
    s = model(user, item)
    for i=1:max_sample
        j = abs(rand(typeof(item)) % n) + 1
        t = model(user, j)
        if t > (s - 1) && !is_positive(interactions, user, j)
            return (i, j)
        end
    end
    return (max_sample, 0)
end
\end{verbatim}
As you can see, the function \texttt{find\_inversion} returns the
number of steps (random samples) it took to find an inversion and
the item that provokes the inversion. Notice that the functions have an
unconstrained type parameter \texttt{S}. This is because type parameters
in Julia are invariant, which means that, for example,
\texttt{SparseMatrixCSC\{Float32,\ Int\}} is not a subtype of
\texttt{SparseMatrixCSC\{Number,\ Int\}}:
\begin{verbatim}
julia> Float32 <: Number
true
julia> SparseMatrixCSC{Float32, Int} <: SparseMatrixCSC{Number, Int}
false
\end{verbatim}
Had we specified \texttt{SparseMatrixCSC\{Number,\ Int\}} in the
function declaration we most likely would have gotten a
\texttt{MethodError:\ no\ method\ matching...} kind of error. The type
parameter \texttt{S} is unconstrained because I don't use the values
inside the matrix at all; it is just there to allow type matching. It
may seem limited at first look, but I personally prefer this extra
verbosity to thinking about type covariance/contravariance. Also,
notice the \texttt{typeof} call: it generates a random number of the
correct type, avoiding either a type conversion or a method match error.
Once we find an inversion we just perform a step on the following loss,
which is the WARP loss formula:
\begin{verbatim}
log((n_items(model) - 1) / n)*(model(user, item_n) - model(user, item_p) + 1)
\end{verbatim}
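For reference, writing \(C = \log\frac{N - 1}{n}\) (with \(N\) the number of
items and \(n\) the number of samples it took to find the inversion), the
per-component gradients of this loss with respect to the user embedding
\(\mathbf{u}\) and the positive and negative item embeddings
\(\mathbf{v}_p, \mathbf{v}_n\) are (my derivation; the code below splits off
the factor \(C\) into \texttt{warp\_constant}):
\[
\frac{\partial L}{\partial u_k} = C\,(v_{n,k} - v_{p,k}), \qquad
\frac{\partial L}{\partial v_{p,k}} = -C\,u_k, \qquad
\frac{\partial L}{\partial v_{n,k}} = +C\,u_k
\]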
I would have loved to apply automatic differentiation to the problem,
with more complex models in mind for the future, but as I said I have
been unable to, so let's write the gradient functions directly:
\begin{verbatim}
warp_constant(model::Factorization{T}, n) where {T<:AbstractFloat} =
    convert(T, log((n_items(model) - 1) / n))

function warp_embeddings_grad(model::Factorization,
                              user::Int,
                              item_p::Int,
                              item_n::Int,
                              i::Int)
    return (model.item_embeddings[i, item_n] - model.item_embeddings[i, item_p],
            -model.user_embeddings[i, user],
            +model.user_embeddings[i, user])
end
\end{verbatim}
We will reuse the above functions to write the different optimization
algorithms.
\section{Optimization algorithms}\label{optimization-algorithms}
We are going to implement three optimization algorithms: good old
SGD, Adadelta and Adagrad. I recommend this well-known blog post about
\href{https://ruder.io/optimizing-gradient-descent/}{gradient descent
algorithms}, which I have used as a reference.
Let's start first with SGD:
\begin{verbatim}
abstract type Optimizer{T, N} end

struct SGD{T <: AbstractFloat, N} <: Optimizer{T, N}
    θ::Array{T, N}
    ϵ::T

    SGD(θ::Array{T, N}, ϵ::AbstractFloat) where {T <: AbstractFloat, N} =
        new{T, N}(θ, convert(T, ϵ))
end

function step!(optimizer::SGD{T, N},
               g::T,
               i::Vararg{Int, N}) where {T <: AbstractFloat, N}
    optimizer.θ[i...] -= optimizer.ϵ * g
end
\end{verbatim}
As you can see, the optimizer wraps the parameters to be updated,
\(\theta\), and also adds whatever hyperparameters it needs, in this case
just the learning rate \(\epsilon\). A more conventional signature for
the method \texttt{step!} would eliminate the indices \texttt{i} and
pass the gradients \texttt{g} as an array of the correct shape instead
of as a scalar. However, our training has almost all gradient elements
equal to zero on each update, and passing full arrays would be wasteful.
Passing sparse matrices would be efficient only if we accumulated lots
of these updates, but we want to update the parameters as soon as
possible.
There is little more to say about this code, except maybe the use of
\texttt{convert}. The reason is that I want the constructor
to accept any \texttt{AbstractFloat} and not only the exact same type as
the model; otherwise, if my model were \texttt{Float32}, a call like
\texttt{SGD(model.user\_embeddings,\ 1e-2)} would not match any method
since the literal \texttt{1e-2} is a \texttt{Float64}. It's sometimes a
little annoying to have to think about these kinds of problems, using
functions like \texttt{typeof}, \texttt{convert}, \texttt{promote},
\texttt{zero} or \texttt{oneunit}, but I don't see any alternative for
building fast code.
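A quick usage sketch of this constructor (my own example; the numbers are
just for illustration):
\begin{verbatim}
julia> θ = zeros(Float32, 4, 3);
julia> opt = SGD(θ, 0.5);        # the Float64 literal 0.5 is converted to Float32
julia> step!(opt, 1.5f0, 2, 1);  # apply gradient 1.5 to θ[2, 1]
julia> θ[2, 1]
-0.75f0
\end{verbatim}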
Let's now implement Adadelta and Adagrad. I implemented them because
they are the ones LightFM uses, and also to showcase something that is
not very important but usually catches the eye: the use of Unicode
characters for math symbols inside Julia code. I found that in this case
it was a welcome addition because it allowed me to translate the
algorithm description more easily. Here it is:
\begin{verbatim}
struct Adadelta{T <: AbstractFloat, N} <: Optimizer{T, N}
    θ::Array{T, N}
    Eg²::Array{T, N}
    EΔ²::Array{T, N}
    ϵ::T
    η::T

    function Adadelta(θ::Array{T, N},
                      ϵ::AbstractFloat,
                      η::AbstractFloat) where {T <: AbstractFloat, N}
        return new{T, N}(θ,
                         zero(θ),
                         zero(θ),
                         convert(T, ϵ),
                         convert(T, η))
    end
end

function step!(optimizer::Adadelta{T, N},
               g::T,
               i::Vararg{Int, N}) where {T<:AbstractFloat, N}
    EΔ²i = optimizer.EΔ²[i...]
    Eg²i = optimizer.Eg²[i...]
    ϵ = optimizer.ϵ
    η = optimizer.η
    Eg²i = η*Eg²i + (1 - η)*g^2
    Δi = -sqrt((EΔ²i + ϵ)/(Eg²i + ϵ))*g
    EΔ²i = η*EΔ²i + (1 - η)*Δi^2
    optimizer.θ[i...] += Δi
    optimizer.EΔ²[i...] = EΔ²i
    optimizer.Eg²[i...] = Eg²i
end

struct Adagrad{T <: AbstractFloat, N} <: Optimizer{T, N}
    θ::Array{T, N}
    G::Array{T, N}
    ϵ::T
    η::T

    function Adagrad(θ::Array{T, N},
                     ϵ::AbstractFloat,
                     η::AbstractFloat) where {T <: AbstractFloat, N}
        return new{T, N}(θ,
                         zero(θ),
                         convert(T, ϵ),
                         convert(T, η))
    end
end

function step!(optimizer::Adagrad{T, N},
               g::T,
               i::Vararg{Int, N}) where {T<:AbstractFloat, N}
    Gi = optimizer.G[i...]
    ϵ = optimizer.ϵ
    η = optimizer.η
    Gi += g^2
    Δi = -η/sqrt(Gi + ϵ)*g
    optimizer.θ[i...] += Δi
    optimizer.G[i...] = Gi
end
\end{verbatim}
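For reference, written out as equations, the two \texttt{step!} methods
above implement the standard updates, with \(\eta\) playing the role of the
decay rate for Adadelta and of the learning rate for Adagrad (this is just
a restatement of the code, following the notation of the blog post linked
above). For Adadelta:
\[
E[g^2]_t = \eta\, E[g^2]_{t-1} + (1 - \eta)\, g_t^2, \qquad
\Delta\theta_t = -\sqrt{\frac{E[\Delta\theta^2]_{t-1} + \epsilon}{E[g^2]_t + \epsilon}}\; g_t, \qquad
E[\Delta\theta^2]_t = \eta\, E[\Delta\theta^2]_{t-1} + (1 - \eta)\, \Delta\theta_t^2
\]
and for Adagrad:
\[
G_t = G_{t-1} + g_t^2, \qquad
\Delta\theta_t = -\frac{\eta}{\sqrt{G_t + \epsilon}}\, g_t
\]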
\section{Macros}\label{macros}
The above optimization algorithms are generic versions which can be
reused for other problems. We are now going to create an abstract type
\texttt{FactorizationOptimizer} and concrete subtypes that wrap
optimizers for the two embeddings and the biases, with our particular
problem in mind. A first try looks like this:
\begin{verbatim}
abstract type FactorizationOptimizer{T <: AbstractFloat} end

struct FactorizationSGD{T} <: FactorizationOptimizer{T}
    user_embeddings::SGD{T, 2}
    item_embeddings::SGD{T, 2}
    item_bias::SGD{T, 1}

    function FactorizationSGD(model::Factorization{T},
                              ϵ::AbstractFloat) where {T}
        return new{T}(SGD(model.user_embeddings, ϵ),
                      SGD(model.item_embeddings, ϵ),
                      SGD(model.item_bias, ϵ))
    end
end

struct FactorizationAdadelta{T <: AbstractFloat} <: FactorizationOptimizer{T}
    user_embeddings::Adadelta{T, 2}
    item_embeddings::Adadelta{T, 2}
    item_bias::Adadelta{T, 1}

    function FactorizationAdadelta(model::Factorization{T},
                                   ϵ::AbstractFloat,
                                   η::AbstractFloat) where {T}
        return new{T}(Adadelta(model.user_embeddings, ϵ, η),
                      Adadelta(model.item_embeddings, ϵ, η),
                      Adadelta(model.item_bias, ϵ, η))
    end
end

struct FactorizationAdagrad{T <: AbstractFloat} <: FactorizationOptimizer{T}
    user_embeddings::Adagrad{T, 2}
    item_embeddings::Adagrad{T, 2}
    item_bias::Adagrad{T, 1}

    function FactorizationAdagrad(model::Factorization{T},
                                  ϵ::AbstractFloat,
                                  η::AbstractFloat) where {T}
        return new{T}(Adagrad(model.user_embeddings, ϵ, η),
                      Adagrad(model.item_embeddings, ϵ, η),
                      Adagrad(model.item_bias, ϵ, η))
    end
end
\end{verbatim}
I'm sure you will have noticed how similar the 3 structs are; they
follow a common template. If we were going to use just these 3 algorithms
I would leave them as they are, but let's suppose we are going to
implement more. This is a good use case for macros. We want to create
the different structures with a single call like:
\begin{verbatim}
@optimizertype(SGD, ϵ)
@optimizertype(Adadelta, ϵ, η)
@optimizertype(Adagrad, ϵ, η)
\end{verbatim}
Here is the corresponding macro:
\begin{verbatim}
macro optimizertype(OptimizerType, params...)
    structname = Symbol("Factorization", OptimizerType)
    return quote
        struct $structname{T} <: FactorizationOptimizer{T}
            user_embeddings::$OptimizerType{T, 2}
            item_embeddings::$OptimizerType{T, 2}
            item_bias::$OptimizerType{T, 1}
            function $structname(model::Factorization{T},
                                 $(params...)) where {T}
                return new{T}(
                    $OptimizerType(model.user_embeddings, $(params...)),
                    $OptimizerType(model.item_embeddings, $(params...)),
                    $OptimizerType(model.item_bias, $(params...)))
            end
        end
    end
end
\end{verbatim}
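If you want to inspect the code that such a macro generates,
\texttt{@macroexpand} is handy (a general Julia tip, not something used in
this post):
\begin{verbatim}
julia> @macroexpand @optimizertype(SGD, ϵ)
\end{verbatim}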
We have avoided some copy-pasting at the cost of some readability.
If we had 10 different algorithms it would be worth it; with 3 I
wouldn't use a macro. The crossover point is somewhere in between,
although I don't know exactly where.
\section{Training algorithm}\label{training-algorithm}
The optimization algorithm proceeds as follows: we select a random
user/item interaction from the training set; by definition this means
that the selected item is a positive item. We then find an inversion for
that user/item pair, which gives a negative item and the number of random
samples it took to find it. We then perform a gradient
step using some optimization algorithm. We repeat until we have visited
all interactions inside the training dataset.
Selecting a random user/item interaction is just picking a random index
inside the interactions. Obtaining the positive item is direct from the
\texttt{SparseMatrixCSC}, but in order to obtain the user without
searching we must build an auxiliary array mapping interaction index to
user. To avoid recomputing this array at each training epoch I have
defined an \texttt{Interactions} struct that is just a wrapper for the
sparse matrix and builds the auxiliary array when constructed:
\begin{verbatim}
struct Interactions{S}
    sparse::SparseMatrixCSC{S, Int}
    users::Vector{Int}

    function Interactions(interactions::SparseMatrixCSC{S, Int}) where {S}
        n_interactions = length(interactions.nzval)
        users = Array{Int}(undef, n_interactions)
        for i=1:length(interactions.colptr)-1
            for j=interactions.colptr[i]:(interactions.colptr[i+1] - 1)
                users[j] = i
            end
        end
        return new{S}(interactions, users)
    end
end
\end{verbatim}
And now the training function can be defined, except for the
\texttt{warpstep!} function, which we will implement next:
\begin{verbatim}
using Random  # randperm lives in the Random standard library

function epoch!(model::Factorization,
                optimizer::FactorizationOptimizer,
                interactions::Interactions,
                max_sample::Int)
    max_sample = min(n_items(model) - 1, max_sample)
    n_interactions = length(interactions.sparse.nzval)
    @Threads.threads for i in randperm(n_interactions)
        user = interactions.users[i]
        item_p = interactions.sparse.rowval[i]
        n, item_n = find_inversion(
            model, user, item_p, interactions.sparse, max_sample)
        if item_n > 0
            warpstep!(model, optimizer, user, item_p, item_n, n)
        end
    end
end
\end{verbatim}
Notice the \texttt{@Threads.threads}: it automagically parallelizes the
\texttt{for} loop. We are using here the same algorithm as LightFM,
\href{https://papers.nips.cc/paper/2011/file/218a0aefd1d1a4be65601cc6ddc1520e-Paper.pdf}{Hogwild},
which basically means not caring about data races at all. If you think it
will corrupt data you are right, but we simply don't care. First, it will
happen sparingly for our problem, since the probability of an item
appearing simultaneously in two different threads is low, and for users
it is even lower. Second, even if it happens, who cares? We are already
using a noisy optimization procedure and we actually like the noise. You
need to start the Julia interpreter correctly configured; in my case I
added the line \texttt{export\ JULIA\_NUM\_THREADS=8} to my
\texttt{\textasciitilde{}/.profile}. If, like me, you are coming from
Python, the fact that this worked so well on the first try will make you
cry with joy.
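You can check that the setting took effect from the REPL (a quick check,
not part of the original post):
\begin{verbatim}
julia> Threads.nthreads()   # should print 8 with the setting above
\end{verbatim}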
The function \texttt{warpstep!} is in charge of computing the gradients
and passing them to the optimizers for the embeddings and item biases:
\begin{verbatim}
function warpstep!(model::Factorization{T},
                   optimizer::FactorizationOptimizer{T},
                   user::Int,
                   item_p::Int,
                   item_n::Int,
                   n::Int) where {T}
    C = warp_constant(model, n)
    for i=1:n_components(model)
        gu, gp, gn = warp_embeddings_grad(model, user, item_p, item_n, i)
        step!(optimizer.user_embeddings, C*gu, i, user)
        step!(optimizer.item_embeddings, C*gp, i, item_p)
        step!(optimizer.item_embeddings, C*gn, i, item_n)
    end
    step!(optimizer.item_bias, convert(T, -1), item_p)
    step!(optimizer.item_bias, convert(T, +1), item_n)
end
\end{verbatim}
Finally, for completeness, this function initializes the embeddings and
the biases with sensible values:
\begin{verbatim}
function init!(interactions::Interactions, model::Factorization)
    interactions = interactions.sparse
    fill!(model.item_bias, 0)
    for i=1:length(interactions.nzval)
        model.item_bias[interactions.rowval[i]] += 1
    end
    model.item_bias ./= sum(model.item_bias)
    scale = 1.0 / n_components(model)
    rand!(model.user_embeddings)
    rand!(model.item_embeddings)
    model.user_embeddings .-= 0.5
    model.item_embeddings .-= 0.5
    model.user_embeddings .*= scale
    model.item_embeddings .*= scale
end
\end{verbatim}
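Putting the pieces together, a training driver could look roughly like this
(a sketch: the embedding size, the hyperparameters and the
\texttt{max\_sample} value are arbitrary choices of mine, not the settings
used for the benchmarks below):
\begin{verbatim}
using Random, Serialization, SparseArrays

lastfm = deserialize("lastfm.jlser")                  # the matrix serialized earlier
interactions = Interactions(lastfm)
model = Factorization{Float32}(size(lastfm, 2),       # n_users (columns)
                               size(lastfm, 1),       # n_items (rows)
                               32)                    # n_components, arbitrary
init!(interactions, model)
optimizer = FactorizationAdagrad(model, 1e-6, 5e-2)   # ϵ and η picked for illustration
for epoch in 1:5
    @time epoch!(model, optimizer, interactions, 100) # max_sample = 100, arbitrary
end
\end{verbatim}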
\section{Metrics}\label{metrics}
Before measuring speed we must take an important point into account: if
everything goes well, we expect each epoch of training to take more time
than the previous one. The reason is in the WARP loss: as the parameters
get better fitted, it becomes more difficult to find an inversion.
We are therefore going to measure the precision@k metric as we train.
LightFM has its own function to measure it, which by the way is quite
slow and which I had to reimplement at my job. Here is our version in
Julia:
\begin{verbatim}
function select_k(scores::Matrix{<:AbstractFloat}, k::Int)
    n = size(scores, 2)
    y = Array{Int}(undef, k, n)
    ix = Array{Int}(undef, size(scores, 1))
    for i=1:n
        partialsortperm!(ix, view(scores, :, i), k, rev=true)
        y[:, i] = view(ix, 1:k)
    end
    return y
end

function precision_at_k(model::Factorization,
                        interactions::SparseMatrixCSC,
                        users::Vector{Int},
                        k::Int)
    scores = model.item_bias .+
             model.item_embeddings' * model.user_embeddings[:, users]
    recommendations = select_k(scores, k)
    pk = zeros(Float32, length(users))
    for i=1:size(recommendations, 2)
        for j=1:size(recommendations, 1)
            user = users[i]
            item = recommendations[j, i]
            if is_positive(interactions, user, item)
                pk[i] += 1.0f0
            end
        end
    end
    return pk / k
end
\end{verbatim}
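A usage sketch (the size of the user sample is an arbitrary choice of
mine):
\begin{verbatim}
julia> using Statistics
julia> users = rand(1:size(lastfm, 2), 10_000);       # random subset of users
julia> mean(precision_at_k(model, lastfm, users, 5))  # average precision@5
\end{verbatim}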
We should get comparable values of precision@k between LightFM and our
little test code if we are to compare speed. Also, we hope the precision
increases. We test on a subset of users who are part of our
training data, so don't draw conclusions about how good the model is.
\hypertarget{results}{\section{Results}\label{results}}
And so, here are the benchmarks!
Python (LightFM):
\begin{verbatim}
Epoch time: 3.34min precision@5: 0.502
Epoch time: 4.83min precision@5: 0.575
Epoch time: 5.46min precision@5: 0.604
Epoch time: 5.80min precision@5: 0.622
Epoch time: 5.81min precision@5: 0.628
\end{verbatim}
Julia:
\begin{verbatim}
Epoch time: 2.01min precision@5: 0.518
Epoch time: 3.07min precision@5: 0.574
Epoch time: 3.26min precision@5: 0.599
Epoch time: 3.50min precision@5: 0.628
Epoch time: 3.61min precision@5: 0.638
\end{verbatim}
We can get even a little more juice out of Julia with additional
options:
\begin{verbatim}
julia --math-mode=fast --check-bounds=no --threads=auto test.jl
\end{verbatim}
To get:
\begin{verbatim}
Epoch time: 1.60min precision@5: 0.515
Epoch time: 2.37min precision@5: 0.572
Epoch time: 2.84min precision@5: 0.604
Epoch time: 2.87min precision@5: 0.616
Epoch time: 2.97min precision@5: 0.634
\end{verbatim}
Precisions are more or less similar. Sometimes our implementation gets
slightly better scores and sometimes LightFM does, because we evaluate on
different random sets of users. Timings are more or less the same across
runs because it's always the same training data.
Even though we have not used the additional features of LightFM, maybe
supporting them constrains its implementation, so it's premature to say
our implementation is faster, although it looks that way since there is
a generous margin. I think it's safe to say that Julia is at least as
fast as Cython.
You can find all the code to replicate the benchmarks at this
\href{https://github.com/plafl/julia-cython-comp}{repo}.
If these results encourage you to give Julia a try, that's great, but I
would advise not to focus on performance at first. In my case I started
learning by doing exercises at
\href{https://exercism.io/tracks/julia}{Exercism}. Have fun!