@jcpsantiago
Last active Oct 10, 2021
Proof of concept for bulgogi/feature processing system
(ns bulgogi.main
  "A bulgogi prototype. Not seasoned yet.
  A main namespace would require sub-namespaces, each with its own use-case or theme.
  Still needs some error-handling in case no function is available."
  (:require
   [clojure.edn :as edn]
   [clojure.pprint :refer [pprint]]
   [clojure.string :as s]))
;; ---- email features namespace ----

(defn email-name
  "Lower-cased name of an email address (the bit before @)"
  [{email :email}]
  (-> email
      s/lower-case
      (s/replace-first #"@.*" "")))

(defn n-digits-in-email-name
  "Number of digits in the email name"
  [input-data]
  ;; we get dependency management for free because everything is just functions
  (->> (email-name input-data)
       (re-seq #"\d")
       count))
;; ---- credit agency features namespace ----

(defn long-schufa-id?
  "Indicator whether the Schufa ID is 'long' or not. Integer 0/1 for FALSE/TRUE"
  [{schufa-id :schufa-id}]
  (let [id-str (str schufa-id)]
    ;; 'long' is defined by Schufa themselves
    (if (> (count id-str) 10)
      1
      0)))
;; ---- from real meat to bulgogi, would be part of the actual core ----

(defn preprocessed
  [req]
  (let [{:keys [input-data features]} req
        ;; resolve looks up each feature name as a fn in the current namespace
        fns (->> features
                 (map #(-> % symbol resolve)))
        fn-ks (map keyword features)]
    (->> (pmap #(% input-data) fns)
         (zipmap fn-ks))))

(defn response
  "Bundles the bulgogi into a consumable response"
  [input-map preprocessed-map]
  (->> preprocessed-map
       (assoc {:request input-map} :preprocessed)))
(comment
  (def req
    "Example request. :input-data should be as flat as possible"
    {:input-data {:schufa-id 54893453457654345
                  :email "partiboi69@unprotected.com"}
     :features ["n-digits-in-email-name" "long-schufa-id?"]}))

(let [req (edn/read *in*)]
  (->> req
       preprocessed
       (response req)
       pprint))
@jcpsantiago commented Sep 5, 2021

Bulgogi is a small-scale prototype for a just-in-time feature calculation system for real-time machine-learning models.

Run with

echo '{:input-data {:email "foo@gmail.com" :schufa-id 324565436754321} :features ["long-schufa-id?" "n-digits-in-email-name"]}' | bb -f bulgogi.clj
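
With the functions above, that command should print something like:

{:request {:input-data {:email "foo@gmail.com", :schufa-id 324565436754321},
           :features ["long-schufa-id?" "n-digits-in-email-name"]},
 :preprocessed {:long-schufa-id? 1, :n-digits-in-email-name 0}}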
@jcpsantiago commented Sep 6, 2021

Add metadata to each function

  • tags, inputs needed, etc. to make it searchable and enable "calculate whatever is possible given this data" (sketched below)
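
A minimal sketch of what this could look like, assuming the metadata lives in defn's attribute-map; computable-features is a hypothetical helper, not part of the gist:

;; feature fns carry searchable metadata in the attr-map
(defn n-digits-in-email-name
  "Number of digits in the email name"
  {:tags ["email" "derived"] :dependencies [:email]}
  [input-data]
  (->> (email-name input-data) (re-seq #"\d") count))

;; "calculate whatever is possible given this data": keep only the fns
;; whose declared :dependencies are all present in the input-data keys
(defn computable-features
  [feature-ns input-data]
  (->> (vals (ns-publics feature-ns))
       (filter #(when-let [deps (:dependencies (meta %))]
                  (every? (set (keys input-data)) deps)))))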
@jcpsantiago commented Sep 8, 2021

Given a request with input-data and a list of features, Bulgogi looks for a function with the same name as each feature in any namespace and applies it to the input-data map. Multiple features (i.e. multiple functions) are applied in parallel with pmap. Dependencies between features are implicit in the function calls, so there's no need to build a DAG or similar. A docstring on each feature function keeps documentation local, and metadata e.g. {:tags ["email" "derived"] :dependencies [:email]} enables later use-cases such as "search by tag" or "given this input-data, calculate all possible features" (useful for backfilling).
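
The lookup itself is just resolve on the feature name; with the gist's functions loaded:

;; a feature name resolves to a var, which is directly callable
((resolve (symbol "email-name")) {:email "Foo.Bar@gmail.com"})
;; => "foo.bar"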

@behrica commented Sep 9, 2021

A lot of times data transformations are easier expressed and understood when they run one after the other rather than in parallel, giving the notion of a data transformation pipeline.
How can your system express this?

@jcpsantiago commented Sep 10, 2021

thanks for taking a look @behrica! I agree with your statement, but I'm exploring ideas to actually avoid pipelines altogether, because those tend to be less reusable across projects/teams. What I'm aiming for is a library (in the sense of a repository, not a software library) of features as functions that one can call when needed. Since the data would all be stored for the future and backfilled when new features are created, there's no need to have a training transformation pipeline either.

This would be part of a larger "feature store" system (see https://www.tecton.ai/blog/what-is-a-feature-store/ for an intro to what I mean, if you're not familiar).

I think pipelines still make sense for a lot of applications, but in my case I need to combine pre-computed features (because they are expensive to compute at prediction-time) with data I only receive at prediction-time, e.g. the timestamp of an order, or the amount. Currently this is an R recipes pipeline, but I can't reuse that code: I'd have to copy-paste it, or my Python colleagues would need to rewrite the calculation in Python. Hope it makes sense :) Since what I'm describing here are just functions, nothing stops you from expressing a pipeline using another library and then calling that instead of a single function.

So you could have multiple pipelines and your model just calls the one it needs in prod.
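
For example (cleaned-email-name is hypothetical, not in the gist), a three-step sequential transformation is still just one feature function, so it plugs into preprocessed like any other:

;; a small pipeline expressed as an ordinary feature fn
(defn cleaned-email-name
  "email-name, then digits stripped, then dots stripped"
  [input-data]
  (-> (email-name input-data)
      (s/replace #"\d" "")
      (s/replace "." "")))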

@jcpsantiago commented Oct 5, 2021

@daslu asked when would side-effects happen (i.e. saving all results to a database). I think there are two options, of which I prefer the latter:

  • Creating a macro, e.g. defn-feature, which wraps defn and stores its return value in some global atom that is flushed when responding back to the client. The clear issue is that we would need to manage state, which always adds complexity.
  • After all requested features are calculated, put the input data and calculated features on a queue, asynchronously calculate all other features (excluding the ones we already have results for), and then write to a database (sketched below). This has the added advantage of zero state, but the work can grow large if no heuristics about what to calculate are employed, e.g. via metadata tags.

In this Gist example, the affected feature would be email-name in case we only care about n-digits-in-email-name, because it's only a dependency and not a "final feature". Either option above would result in email-name also being written to persistent storage.
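
A minimal sketch of the second option, assuming a persist! stand-in for the database write and a hard-coded registry of all feature names; neither exists in the gist:

(def all-features ["n-digits-in-email-name" "long-schufa-id?" "email-name"])

(defn persist! [row] (println "writing" row)) ;; stand-in for a DB write

(defn respond-and-persist
  "Answer with the requested features, then asynchronously compute the
  remaining ones and write the full row to storage. No shared state."
  [req]
  (let [answered  (preprocessed req)
        remaining (remove (set (:features req)) all-features)]
    (future
      (persist! (merge (:input-data req)
                       answered
                       (preprocessed (assoc req :features remaining)))))
    (response req answered)))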
