Introducing Blayze

At Tradeshift, our Machine Learning Team helps product teams incorporate machine learning in their products. As such, we're exposed to a wide range of products and applications.

Sometimes we need a carefully tuned deep learning classifier with millions of parameters, trained for weeks on GPUs on complicated, dense inputs. For those times, tools like TensorFlow and PyTorch are great.

Other times we need a simple, flexible classifier that can be trained in seconds on a single core and supports sparse and missing features. A kind of Swiss Army knife we can add anywhere in the codebase. The first thing we try, because it's just so easy. The embarrassingly simple baseline we struggle to beat later on.

One of the simplest classifiers is arguably naive Bayes. We couldn't find an implementation that matched our needs, so we created Blayze.
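
For reference, naive Bayes rests on a single assumption: features are conditionally independent given the class, so the posterior factorizes into one term per feature. In the standard formulation:

```latex
% Naive Bayes: posterior over class c given observed features x_1 .. x_n.
% Conditional independence lets each feature contribute one factor.
P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```

This factorization is what buys most of the properties below: each feature type only needs its own likelihood P(x_i | c), absent features simply drop their factors, and with no features at all only the prior P(c) remains.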

Design considerations

  • JVM native. Since most of Tradeshift runs on the JVM, we wanted a JVM library for ease of integration. We wrote it in Kotlin for the improved type safety and immutable collections, but it's just as easy to use from any JVM language: Java, Scala, etc. It's released on Maven Central.

  • Multiple feature types. The naive Bayes framework inherently supports integrating features of different types. Blayze currently supports text, categorical and Gaussian features (see the usage sketch after this list). We're open to pull requests :]

  • Fast and efficient online learning. For the text and categorical features, training reduces to incrementing counters. For the Gaussian features we can estimate the mean and variance in a streaming fashion, storing just three floats (a sketch of this streaming estimate follows the list).

  • Sparse features. The number of features can quickly grow. With Blayze, the runtime of classifying and training depends only on the features present in each example.

  • Robust to missing and new features. If a feature is missing from the inputs, it simply will not influence the classification. This way the model degrades gracefully, ultimately falling back on the (learned) prior. If a new feature appears in the inputs, it's added to the model without any fuss.

  • Efficient serialization. We use protobuf for serialization, ensuring that saving and loading models is fast and efficient.
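
To make the feature types and sparse inputs concrete, here is a minimal usage sketch of training and prediction. The class names, parameters and import paths (Model, Update, Inputs, batchAdd, predict) are assumptions about the API shape, not authoritative; check the repository for the exact signatures.

```kotlin
// Hypothetical usage sketch -- names and imports are assumed, not
// authoritative; consult the Blayze repository for the real API.
import com.tradeshift.blayze.Model
import com.tradeshift.blayze.dto.Inputs
import com.tradeshift.blayze.dto.Update

fun main() {
    // Train online from a single example mixing all three feature types.
    val model = Model().batchAdd(listOf(
        Update(
            Inputs(
                text = mapOf("subject" to "attention: is your credit card expiring?"),
                categorical = mapOf("sender" to "alerts@example.com"),
                gaussian = mapOf("n_words" to 31.0)
            ),
            "spam"
        )
    ))

    // Predict from a sparse input: only the supplied features contribute,
    // and a completely empty Inputs() falls back on the learned prior.
    val predictions: Map<String, Double> =
        model.predict(Inputs(text = mapOf("subject" to "your credit card")))
    println(predictions)
}
```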
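
The streaming mean and variance estimate mentioned above can be done with Welford's online algorithm, which needs only a count, the running mean, and a running sum of squared deviations. A minimal sketch of one standard way to do it, not necessarily Blayze's exact implementation:

```kotlin
// Welford's online algorithm: running mean and variance of a stream
// using constant memory -- a count, the mean, and m2, the running sum
// of squared deviations from the mean.
class StreamingGaussian {
    var count = 0L
        private set
    var mean = 0.0
        private set
    private var m2 = 0.0

    fun add(x: Double) {
        count += 1
        val delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean) // note: uses the freshly updated mean
    }

    // Unbiased sample variance; 0.0 until we've seen at least two points.
    val variance: Double
        get() = if (count > 1) m2 / (count - 1) else 0.0
}

fun main() {
    val g = StreamingGaussian()
    listOf(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0).forEach(g::add)
    println("mean=${g.mean} variance=${g.variance}") // mean=5.0 variance≈4.57
}
```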

We hope that you will find Blayze as useful as we have.
