Skip to content

Instantly share code, notes, and snippets.

@par6n
Created December 13, 2019 13:26
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save par6n/d134be2dd06959a33db256cf4b544c8f to your computer and use it in GitHub Desktop.
Save par6n/d134be2dd06959a33db256cf4b544c8f to your computer and use it in GitHub Desktop.

Chabok job offer exercise

This is an example which uses Node.js as the runtime environment and ClickHouse, which is a column-oriented database management system.

Installation

  1. Create a ClickHouse cluster.
  2. Setup the connection by setting following environment variables: DB_HOST, DB_USER, DB_PASS (you can study the fallback values here.)
  3. Install the script dependencies:
$ npm install
$ yarn # or use yarn, which is a lot better.
  1. Run seed.js file which creates a table named Chabox.Event and populates it with 10 millions records (you can alter the records counts to create by changing ROWS_COUNT constant.)
$ node seed

Performance

The populating operation took me 1999899.612ms (~33 minutes) on a virtual private server with 1 GB of RAM and uses one core of an Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz. Data writing rate was ~5002 rows/seconds.

Funnel query example

I've used weighted randoms in my seeder script, so the outcome of my funnel query would be more predictable.

I created a basic funnel using the following query:

--- First query
SELECT
  COUNT(e.ID) As AllEvents,
  COUNT(multiIf(e.Name = 'logged_in', e.ID, NULL)) AS LoggedIn,
  COUNT(multiIf(e.Name = 'added_to_cart', e.ID, NULL)) AS AddedToCart,
  COUNT(multiIf(e.Name = 'visited_checkout', e.ID, NULL)) AS VisitedCheckout,
  COUNT(multiIf(e.Name = 'purchased', e.ID, NULL)) AS Purchased
FROM Chabox.Event As e
WHERE e.Date BETWEEN '2019-12-01 00:00:00' AND '2019-12-15 23:59:59';

Which produced the following performance stats:

1 rows in set. Elapsed: 2.387 sec. Processed 10.00 million rows, 209.39 MB (4.19 million rows/s., 87.72 MB/s.) 

Executing a simple OPTIMIZE query would increase the performance slightly.

OPTIMIZE TABLE Chabox.Event;

Running the first query after optimization produces this:

1 rows in set. Elapsed: 2.173 sec. Processed 10.00 million rows, 209.39 MB (4.60 million rows/s., 96.34 MB/s.)

Deleting the WHERE clause doesn't impact the query performance that much:

1 rows in set. Elapsed: 2.250 sec. Processed 10.00 million rows, 418.63 MB (4.44 million rows/s., 186.03 MB/s.) 
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment