@jcpsantiago
Last active November 18, 2021 18:04
Proof of concept for bulgogi/feature processing system
(ns bulgogi.main
  "
  A bulgogi prototype. Not seasoned yet. ⚠️
  An exploration of what a simple feature engineering system could look like in Clojure.
  Features are defined as pure functions and take an input-data map as input:
  {:input-data {:email \"hackermann@unprotected.com\"
                :current-amount 3455
                :previous-amount 2344}}
  "
  (:require
   [clojure.edn :as edn]
   [clojure.string :as s]))
;; ---- utilities -----
(defn boolean->int
  "Cast a boolean to a 1/0 integer indicator; anything other than true (including nil) maps to 0"
  [b]
  (if (true? b) 1 0))
;; ---- features -----
;; Features are variables used in machine-learning models. They're like fn args, if a model were just
;; a function: `(model feature1 feature2 ...)`.
;; Because of referential transparency, here they are just pure functions.
;; In this prototype, they all follow the same contract: take an input-data map,
;; extract what's needed, and return some value.
;; The following features are representative of the types I've used in production.
(defn email-name
  "Lower-cased name of an email address (the bit before the @)"
  [{email :email}]
  (-> email
      s/lower-case
      (s/replace-first #"@.*" "")))
(defn n-digits-in-email-name
  "Number of digits in the email name"
  [input-data]
  ;; we get dependency management for free because everything is just functions
  (->> (email-name input-data)
       (re-seq #"\d")
       count))
(defn n-chars-in-email-name
  "Number of characters in the email name, i.e. the length of the email name"
  [input-data]
  (-> (email-name input-data)
      count))
(defn diff-eur-previous-order
  "Difference in euros between the current order and the previous one."
  [{current-amount :current-amount previous-amount :previous-amount}]
  (- current-amount previous-amount))
(defn risky-item?
  "1/0 indicator of whether an item's brand is risky"
  [{brand :brand}]
  (->> brand
       s/lower-case
       (re-seq #"baz corp")
       some?
       boolean->int))
(defn contains-risky-item
  "Indicator 1/0 depending on whether a risky item is present in the cart"
  [{items :items}]
  (->> items
       (map risky-item?)
       (some #(= 1 %))
       boolean->int))
;; ---- main functions: part of the actual infrastructure, not features ----
(defn preprocessed
  "
  Takes a request map with keys :input-data and :features.
  The first key contains an input-data map with the actual data needed to calculate features;
  the second contains a vector with the names of the requested features.
  Looks up the features (aka functions) in the namespace and applies them to the input-data
  in parallel.
  Returns a map of feature-keys and feature-values.
  "
  [req]
  (let [{:keys [input-data features]} req
        fns (->> features
                 (map #(-> % symbol resolve)))
        fn-ks (map keyword features)]
    (->> (pmap #(% input-data) fns)
         (zipmap fn-ks))))
(defn response
  "Bundles the calculated features into a consumable response"
  [input-map preprocessed-map]
  (->> preprocessed-map
       (assoc {:request input-map} :preprocessed)))
(comment
  ;; try it in the REPL
  (def req
    "Example request. :input-data should be as flat as possible"
    {:input-data {:current-amount 700
                  :previous-amount 400
                  :email "squadron42@starfleet.ufp"
                  :items [{:brand "Foo Industries" :value 1234}
                          {:brand "Baz Corp" :value 35345}]}
     :features ["n-digits-in-email-name"
                "diff-eur-previous-order"]}))

(let [req (edn/read *in*)
      res (->> req
               preprocessed
               (response req))]
  (future
    (println "Saving response to file...")
    (spit "bulgogi_response.edn" res :append true))
  res)
jcpsantiago commented Sep 5, 2021

Bulgogi is a small-scale prototype for a just-in-time feature calculation system for real-time machine-learning models.

Run with

echo '{:input-data {:email "foo@gmail.com" :schufa-id 324565436754321} :features ["long-schufa-id?" "n-digits-in-email-name"]}' | bb -f bulgogi.clj

or straight in the REPL :)
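For the REPL route, here is a condensed, self-contained sketch of the call (only the two requested feature fns are included, and `preprocessed` is trimmed to its essentials):

```clojure
(require '[clojure.string :as s])

(defn email-name [{email :email}]
  (-> email s/lower-case (s/replace-first #"@.*" "")))

(defn n-digits-in-email-name [input-data]
  (count (re-seq #"\d" (email-name input-data))))

(defn diff-eur-previous-order [{:keys [current-amount previous-amount]}]
  (- current-amount previous-amount))

;; resolve each feature name to its var, then apply the fns in parallel
(defn preprocessed [{:keys [input-data features]}]
  (zipmap (map keyword features)
          (pmap #((-> % symbol resolve) input-data) features)))

(preprocessed {:input-data {:current-amount 700
                            :previous-amount 400
                            :email "squadron42@starfleet.ufp"}
               :features ["n-digits-in-email-name"
                          "diff-eur-previous-order"]})
;; => {:n-digits-in-email-name 2, :diff-eur-previous-order 300}
```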

@jcpsantiago

Add metadata to each function

  • tags, inputs needed, etc., to make it searchable and enable "calculate whatever is possible given this data"

@jcpsantiago

Given a request with input-data and a list of features, Bulgogi looks for a function with the same name as each feature in the namespace and applies it to the input-data map. Multiple features (i.e. multiple functions) are applied in parallel with pmap. Dependencies between features are implicit in the function calls, so there's no need to build a DAG or similar. Documentation for each feature function, in the form of a docstring, ensures locality. Metadata, e.g. {:tags ["email" "derived"] :dependencies [:email]}, enables later use-cases such as "search by tag" or "given this input-data, calculate all possible features" (useful for backfilling).
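A hedged sketch of what that metadata could look like, using defn's attr-map slot (the `:tags`/`:dependencies` keys and the `features-tagged` helper are illustrative, not part of the gist):

```clojure
(require '[clojure.string :as s])

;; metadata goes in the attr-map between the docstring and the arg vector
(defn email-name
  "Lower-cased name of an email address"
  {:tags ["email"] :dependencies [:email]}
  [{email :email}]
  (-> email s/lower-case (s/replace-first #"@.*" "")))

(defn n-digits-in-email-name
  "Number of digits in the email name"
  {:tags ["email" "derived"] :dependencies [:email]}
  [input-data]
  (count (re-seq #"\d" (email-name input-data))))

;; "search by tag": every public var in this namespace tagged with `tag`
(defn features-tagged [tag]
  (->> (ns-publics *ns*)
       vals
       (filter #(some #{tag} (:tags (meta %))))
       (map #(-> % meta :name str))
       sort))

(features-tagged "email")
;; => ("email-name" "n-digits-in-email-name")
```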

behrica commented Sep 9, 2021

A lot of times data transformations are easier expressed and understood when they run one after the other rather than in parallel, giving the notion of a data transformation pipeline.
How can your system express this?


jcpsantiago commented Sep 10, 2021

thanks for taking a look @behrica! I agree with your statement, but I'm exploring ideas to actually avoid pipelines altogether, because those tend to be less reusable across projects/teams. What I'm aiming for is a library (in the sense of a repository, not a software library) of features as functions that one can call when needed. Since the data would all be stored for the future and backfilled when new features are created, there's no need to have a training transformation pipeline either.

This would be part of a larger "feature store" system (see https://www.tecton.ai/blog/what-is-a-feature-store/ for an intro to what I mean, if you're not familiar).

I think pipelines still make sense for a lot of applications, but in my case I need to combine pre-computed features (because they are expensive to compute at prediction-time) with data I only receive at prediction-time, e.g. the timestamp of an order, or the amount. Currently this is an R recipes pipeline, but if I want to reuse that code I can't: I'd have to copy-paste it, or my Python colleagues would need to rewrite the calculation in Python. Hope it makes sense :) Since what I'm describing here are just functions, nothing stops you from expressing a pipeline using another library and then calling that instead of a single function.

So you could have multiple pipelines and your model just calls the one it needs in prod.
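To make that concrete: since a pipeline is just function composition, the composed fn honours the same contract as any single feature. A minimal sketch (the steps here are hypothetical, not from the gist):

```clojure
;; two hypothetical transformation steps, each taking and returning a map
(defn parse-amount [input-data] (update input-data :amount bigdec))
(defn add-fee      [input-data] (update input-data :amount + 2.5M))

;; the "pipeline" is comp; the model calls it like any other feature fn
(def amount-with-fee
  (comp :amount add-fee parse-amount))

(amount-with-fee {:amount "10.00"})
;; => 12.50M
```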

jcpsantiago commented Oct 5, 2021

@daslu asked when would side-effects happen (i.e. saving all results to a database). I think there are two options, of which I prefer the latter:

  • Creating a macro, e.g. defn-feature, which wraps defn and stores its return value in some global atom, which is then flushed when responding back to the client. The clear issue is that we would need to manage state, which always adds complexity.
  • After all requested features are calculated, put the input data and calculated features on a queue, asynchronously calculate all other features (excluding the ones we already have results for), and then write to a database. This has the added advantage of zero state, but it can grow large if no heuristics regarding "what to calculate" are employed, e.g. via metadata tags.

In this Gist example, the affected feature would be email-name in case we only care about n-digits-in-email-name, because it's only a dependency and not a "final feature". Either option above would result in email-name also being stored to persistent storage.
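A hedged sketch of that second option (the `all-features` registry and the `persist!` step are stand-ins, not part of the gist):

```clojure
(require '[clojure.string :as s])

;; stand-ins for the real feature fns in the gist
(def all-features
  {"email-name"             (fn [{email :email}]
                              (first (s/split email #"@")))
   "n-digits-in-email-name" (fn [{email :email}]
                              (count (re-seq #"\d" email)))})

(defn backfill-and-persist!
  "Asynchronously compute every feature not already in preprocessed-map,
  then hand the full map to persist! (e.g. a database write)."
  [input-data preprocessed-map persist!]
  (let [remaining (apply dissoc all-features
                         (map name (keys preprocessed-map)))]
    (future
      (persist! (merge preprocessed-map
                       (into {} (for [[k f] remaining]
                                  [(keyword k) (f input-data)])))))))

;; usage: the client already got n-digits-in-email-name;
;; email-name is only a dependency, so it's backfilled asynchronously
@(backfill-and-persist! {:email "squadron42@starfleet.ufp"}
                        {:n-digits-in-email-name 2}
                        prn)
;; prints the merged feature map
```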
