@kordless
Last active October 6, 2022 04:10
Set Up a FeatureBase Binary Index in 5 Minutes

FeatureBase in 5 Minutes

FeatureBase is a B-tree database built on Roaring Bitmaps, which makes it suitable for running analytics on massive data sets in real time. It is also a good fit for machine learning applications that need a fast feature store during training or inference. FeatureBase itself can be thought of as a kind of machine learning model, one trained over time on large amounts of other models' inference data.

This guide is designed to get you started using FeatureBase on a Mac or Linux/UNIX based system. It covers downloading and starting the FeatureBase server as well as ingesting and querying a small amount of data.

You may also want to reference the documentation for FeatureBase after you finish running through this guide.

Install

Start by heading over to the downloads on the GitHub repo and selecting the build for your architecture. The ARM builds are for newer Macs or devices like the Raspberry Pi. The AMD builds are for Intel and AMD (x86-64) machines.

The rest of this guide will assume you are on a newer, Apple-CPU Mac.
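If you are not sure which build matches your machine, `uname` can tell you. The sketch below maps its output to a build suffix. Only `darwin-arm64` appears in this guide; the other suffixes are assumptions based on the same naming pattern, so check them against the actual file names on the downloads page.

```shell
# Map an OS name and CPU name (as printed by `uname -s` and `uname -m`)
# to a build suffix. Only "darwin-arm64" is confirmed by this guide;
# the other combinations are assumed to follow the same pattern.
fb_build_suffix() {
  os="$1"
  arch="$2"
  case "$os" in
    Darwin) os_part="darwin" ;;
    Linux)  os_part="linux" ;;
    *) echo "unsupported OS: $os" >&2; return 1 ;;
  esac
  case "$arch" in
    arm64|aarch64) arch_part="arm64" ;;
    x86_64|amd64)  arch_part="amd64" ;;
    *) echo "unsupported CPU: $arch" >&2; return 1 ;;
  esac
  echo "${os_part}-${arch_part}"
}

# Print the suffix for the machine this runs on.
fb_build_suffix "$(uname -s)" "$(uname -m)" || echo "no matching build known"
```

On an Apple-CPU Mac this prints `darwin-arm64`, which matches the archive used in the rest of this guide.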

Download the file to any directory you like, then open a terminal and change to that directory:

kord@bob Downloads % mkdir featurebase; mv featurebase-community-v1.0.0-darwin-arm64.tar.gz featurebase; cd featurebase
kord@bob featurebase % ls -la
total 182976
drwx------@  4 kord  staff       128 Sep 23 14:37 .
drwx------+ 27 kord  staff       864 Sep 23 14:36 ..
-rw-r--r--@  1 kord  staff      6148 Sep 19 13:08 .DS_Store
-rw-r--r--@  1 kord  staff  93672692 Sep 19 13:07 featurebase-community-v1.0.0-darwin-arm64.tar.gz

Now uncompress the file:

kord@bob featurebase % tar xvfz featurebase-community-v1.0.0-darwin-arm64.tar.gz 
x featurebase-v3.20.0-darwin-arm64/
x featurebase-v3.20.0-darwin-arm64/featurebase.conf
x featurebase-v3.20.0-darwin-arm64/featurebase.redhat.service
x featurebase-v3.20.0-darwin-arm64/featurebase
x featurebase-v3.20.0-darwin-arm64/roaring-migrate
x featurebase-v3.20.0-darwin-arm64/NOTICE
x featurebase-v3.20.0-darwin-arm64/featurebase.debian.service
x idk-v3.20.0-darwin-arm64/
x idk-v3.20.0-darwin-arm64/molecula-consumer-kinesis
x idk-v3.20.0-darwin-arm64/molecula-consumer-kafka-static
x idk-v3.20.0-darwin-arm64/molecula-consumer-sql
x idk-v3.20.0-darwin-arm64/molecula-consumer-csv
x idk-v3.20.0-darwin-arm64/molecula-consumer-github

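Note that the extracted directories carry the server version (v3.20.0) rather than the community release number in the archive name. Let's rename them to something shorter, so you can copy and paste as we go: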

kord@bob featurebase % mv featurebase-v3.20.0-darwin-arm64 fb
kord@bob featurebase % mv idk-* idk

The idk directory contains a few ingestion tools. We'll get to these in a minute.
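Before moving on, it can help to confirm that the renames left the two binaries this guide uses where the later commands expect them. This is a minimal sketch; `check_layout` is a hypothetical helper name, and the paths are the ones from the listing above.

```shell
# Check that the binaries used later in this guide exist and are executable.
# $1 = the featurebase directory created earlier.
check_layout() {
  root="$1"
  missing=0
  for f in fb/featurebase idk/molecula-consumer-csv; do
    if [ ! -x "$root/$f" ]; then
      echo "missing or not executable: $root/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage, from inside the featurebase directory:
#   check_layout . && echo "layout looks good"
```

If either path is reported missing, the most likely cause is that one of the `mv` commands was run from a different directory.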

Start the Server

Before we start the server, we have to deal with Apple's security checks on downloaded binaries. Open Finder, navigate to the fb directory, and right-click (or ctrl-click) the featurebase binary. This brings up a menu that includes an Open With option. Pick the terminal application you want to "open" the binary with to get the next dialog (don't worry, this won't actually start the server):

[Image: allow]

Now that you've tried to open it, macOS shows a warning that lets you allow the binary to run, without having to dig around in Settings:

[Image: open]
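If you prefer to stay in the terminal, a commonly used alternative to the Finder steps is to remove the quarantine attribute that macOS attaches to downloaded files. This is not part of the original walkthrough, and it skips the Gatekeeper prompt entirely, so only use it on binaries you trust. The helper name `clear_quarantine` is made up for this sketch; `xattr -d com.apple.quarantine` is the standard macOS command it wraps.

```shell
# Remove the macOS quarantine attribute from a downloaded file.
# On systems without the xattr command this only checks that the file exists.
clear_quarantine() {
  file="$1"
  if [ ! -f "$file" ]; then
    echo "no such file: $file" >&2
    return 1
  fi
  if command -v xattr >/dev/null 2>&1; then
    # Errors are ignored here: the attribute may simply not be set.
    xattr -d com.apple.quarantine "$file" 2>/dev/null || true
  fi
  return 0
}

# Usage, from the featurebase directory created earlier:
#   clear_quarantine fb/featurebase
#   clear_quarantine idk/molecula-consumer-csv
```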

With that bit of funny business done, open another terminal and switch to the fb directory. We'll use the new terminal to start FeatureBase:

kord@bob ~ % cd ~/Downloads/featurebase/fb
kord@bob fb % ./featurebase server
2022-09-23T17:41:46.850455Z INFO:  Molecula Pilosa v3.20.0 (Aug 25 2022 7:08PM, 6562b60) go1.19
2022-09-23T17:41:46.856631Z INFO:  rbf config = &cfg.Config{MaxSize:4294967296, MaxWALSize:4294967296, MinWALCheckpointSize:1048576, MaxWALCheckpointSize:2147483648, FsyncEnabled:true, FsyncWALEnabled:true, DoAllocZero:false, CursorCacheSize:0, Logger:logger.Logger(nil), MaxDelete:65536}
2022-09-23T17:41:46.856662Z INFO:  cwd: /Users/kord/Downloads/featurebase-community-v1.0.0-darwin-arm64 (1)/featurebase-v3.20.0-darwin-arm64
2022-09-23T17:41:46.856671Z INFO:  cmd line: ./featurebase server
2022-09-23T17:41:46.915337Z INFO:  enabled Web UI at :10101
2022-09-23T17:41:46.915411Z INFO:  open server. PID 11787
2022-09-23T17:41:47.515650Z INFO:  holder translation sync monitor initializing
2022-09-23T17:41:47.515799Z INFO:  holder translation sync beginning
2022-09-23T17:41:47.515913Z INFO:  open holder path: /Users/kord/.pilosa
2022-09-23T17:41:47.586626Z INFO:  open holder: complete
2022-09-23T17:41:47.586781Z INFO:  diagnostics disabled
2022-09-23T17:41:47.586855Z INFO:  listening as http://localhost:10101
2022-09-23T17:41:47.586968Z INFO:  enabled grpc listening on 127.0.0.1:20101
2022-09-23T17:41:48.587094Z INFO:  start initial cluster state sync
2022-09-23T17:41:48.587162Z INFO:  completed initial cluster state sync in 74.083µs
<snip>
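The log above shows the server listening at http://localhost:10101. If you want a script to wait until the server is ready rather than watching the log, a small polling loop is enough. This is a sketch that assumes `curl` is installed; `wait_for_fb` is a hypothetical helper name, and it simply polls the same URL the guide opens in the browser.

```shell
# Poll a URL until it answers or the attempts run out.
# $1 = URL to poll, $2 = maximum number of attempts (default 10).
wait_for_fb() {
  url="$1"
  attempts="${2:-10}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f: treat HTTP errors as failures, -s: quiet, -o /dev/null: drop the body
    if curl -fs -o /dev/null --max-time 2 "$url"; then
      return 0
    fi
    # Pause between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep 1
    fi
    i=$((i + 1))
  done
  return 1
}

# Usage (assumes the server from this guide is running locally):
#   wait_for_fb http://localhost:10101/ && echo "FeatureBase is up"
```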

Pilosa is the older name of the FeatureBase server, which was created by Molecula (the company has since been renamed FeatureBase as well). Pilosa is also an order of xenarthran placental mammals native to the Americas. It includes the anteaters and the sloths, among them the extinct ground sloths. The name comes from the Latin word for "hairy".

Rumor has it that the original authors of Pilosa (Molecula) thought it would be humorous to name the fastest database in the world after the order containing the slowest mammal, the sloth. Surprisingly, anteaters (and their xenarthran cousins, the armadillos) can move very quickly. If you ever run across an armadillo, I suggest (as a Texan) you steer clear of it!

Access the UI

Now that the server is running, let's jump over to a browser and open the UI at the following URL:

http://localhost:10101/

FeatureBase listens on port 10101. This is probably another inside joke, but I'm unsure why it's funny. Nevertheless, here's what you should see:

[Image: ui]

Ingest Data

We're now ready to ingest data. Switch back to the first terminal window, clone this gist into the featurebase directory, and rename the cloned directory to gist:

kord@bob featurebase % git clone https://gist.github.com/kordless/d3aaeedbe0ac68d284c077ddd74c2ae1/
kord@bob featurebase % mv d3aaeedbe0ac68d284c077ddd74c2ae1 gist

We'll use the molecula-consumer-csv tool (located in idk) to ingest the sample data. Because it is also a downloaded binary, you'll have to repeat the Open With trick from above before it will run:

kord@bob featurebase % idk/molecula-consumer-csv --batch-size=10000 --auto-generate --index=allyourbase --files=gist/sample.csv
Molecula Consumer v3.20.0, build time 2022-08-26T00:11:50+0000
2022-09-23T20:07:25.430833Z INFO:  Serving Prometheus metrics with namespace "ingester_csv" at localhost:9093/metrics
2022-09-23T20:07:25.434554Z INFO:  start ingester 0
2022-09-23T20:07:25.434862Z INFO:  processFile: gist/sample.csv
2022-09-23T20:07:25.435059Z INFO:  new schema: []idk.Field{idk.StringField{NameVal:"asset_tag", DestNameVal:"asset_tag", Mutex:false, Quantum:"", TTL:"", CacheConfig:(*idk.CacheConfig)(nil)}, idk.RecordTimeField{NameVal:"fan_time", DestNameVal:"fan_time", Layout:"2006-01-02", Epoch:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Unit:""}, idk.StringField{NameVal:"fan_val", DestNameVal:"fan_val", Mutex:false, Quantum:"YMD", TTL:"", CacheConfig:(*idk.CacheConfig)(nil)}}
2022-09-23T20:07:25.436104Z INFO:  Listening for /debug/pprof/ and /debug/fgprof on 'localhost:6062'
2022-09-23T20:07:25.478702Z INFO:  translating batch of 8 took: 41.931708ms
2022-09-23T20:07:25.478805Z INFO:  making fragments for batch of 8 took 110.25µs
2022-09-23T20:07:25.481605Z INFO:  importing fragments took 2.799375ms
2022-09-23T20:07:25.481918Z INFO:  1 records processed 0-> (9)
2022-09-23T20:07:25.481925Z INFO:  metrics: import=46.120541ms
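The `new schema` line in the log shows how the consumer interpreted the header row of sample.csv: each column name is followed by a double underscore, a field type, and optional arguments. The sketch below splits the header from the sample file into those parts so you can compare them with the log. The reading of the arguments (for example, that `2006-01-02` is the time layout and `YMD` the time quantum) is inferred from the log output above, not from the consumer's documentation, and `describe_header` is a hypothetical helper name.

```shell
# Split a comma-separated header into one "name type [args]" line per column.
# Each column is expected to look like name__Type or name__Type_args.
describe_header() {
  echo "$1" | tr ',' '\n' | while IFS= read -r col; do
    name="${col%%__*}"    # everything before the first "__"
    spec="${col#*__}"     # everything after the first "__"
    type="${spec%%_*}"    # the field type, up to the first "_"
    case "$spec" in
      *_*) args="${spec#*_}" ;;  # whatever follows the type
      *)   args="" ;;
    esac
    echo "$name $type${args:+ $args}"
  done
}

describe_header 'asset_tag__String,fan_time__RecordTime_2006-01-02,fan_val__String_F_YMD'
# prints:
#   asset_tag String
#   fan_time RecordTime 2006-01-02
#   fan_val String F_YMD
```

These three lines line up with the `StringField`, `RecordTimeField`, and `StringField` entries in the log, in the same order.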

Querying

Now that our sample data is loaded, let's take a look at it in the UI. We'll write a simple SQL query to view the "rows":

select * from allyourbase;

[Image: sql]
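The same data can also be queried from the terminal. The sketch below assumes the server exposes the Pilosa-style HTTP endpoint `/index/<index>/query`, which accepts a PQL query as the POST body. Treat both the endpoint and the example PQL as assumptions to verify against the FeatureBase documentation for your version. The helper names `fb_query_url` and `fb_pql` are made up for this example.

```shell
# Build the query endpoint URL for an index.
# $1 = server base URL, $2 = index name.
fb_query_url() {
  base="${1%/}"   # strip one trailing slash, if present
  index="$2"
  echo "$base/index/$index/query"
}

# Send a PQL query as the POST body and print the raw response.
# $1 = server base URL, $2 = index name, $3 = PQL query (assumed syntax).
fb_pql() {
  curl -s --data "$3" "$(fb_query_url "$1" "$2")"
}

# Usage (assumes the server and the allyourbase index from this guide):
#   fb_pql http://localhost:10101 allyourbase 'Count(Row(asset_tag="ABCD"))'
```

Keeping the URL construction in its own function makes it easy to point the same query at a different host or index.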

Wrapping Up

That's it for this minimalist guide. If you have any feedback, be sure to join FeatureBase's Discord server.

sample.csv

asset_tag__String,fan_time__RecordTime_2006-01-02,fan_val__String_F_YMD
ABCD,2019-01-02,70%
ABCD,2019-01-03,20%
BEDF,2019-01-02,70%
BEDF,2019-01-05,90%
ABCD,2019-01-30,40%
BEDF,2019-01-08,10%
BEDF,2019-01-08,20%
ABCD,2019-01-04,30%