@leventov
Last active April 4, 2020 14:38
| ClickHouse | Druid or Pinot |
| --- | --- |
| The organization has expertise in C++ | The organization has expertise in Java |
| Small cluster | Large cluster |
| A few tables | Many tables |
| Single data set | Multiple unrelated data sets (multitenancy) |
| Tables and data sets reside in the cluster permanently | Tables and data sets periodically emerge and retire from the cluster |
| Table sizes (and the query intensity against them) are stable over time | Tables grow and shrink significantly over time |
| Homogeneity of queries (their type, size, distribution by time of day, etc.) | Heterogeneity of queries |
| There is a dimension in the data by which it could be partitioned, and almost no queries touch data across partitions (i.e. shared-nothing partitioning) | There is no such dimension; queries often touch data across the whole cluster. Edit 2019: Pinot now supports partitioning by dimension. |
| Cloud is not used; the cluster is deployed on specific physical servers | Cluster is deployed in the cloud |
| No existing Hadoop or Spark clusters | Hadoop or Spark clusters already exist and could be used |
@kishoreg

kishoreg commented Sep 23, 2019

> There is no such dimension, queries often touch data across the whole cluster.

Pinot has support for partitioning and sorting on a single dimension key.

@leventov
Author

@kishoreg thanks, updated. However, partitioning by key makes partition-based sampling problematic, because a sample drawn from a subset of partitions may be heavily biased. Efficient sampling may be even more important than the benefits that key-based partitioning provides.
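To illustrate the bias concern, here is a minimal sketch in plain Python. It does not use Pinot, Druid, or ClickHouse. The data, the `partition_rows` helper, and the round-robin layout (standing in for any key-independent partitioning) are all assumptions made for illustration. It estimates a global mean by reading a single partition and reports the worst-case error under each layout.

```python
from statistics import mean


def partition_rows(rows, num_partitions, by_key):
    """Split (key, value) rows into partitions.

    by_key=True  -> partition index is derived from the key (key-based).
    by_key=False -> partition index is derived from row position (round-robin).
    """
    partitions = [[] for _ in range(num_partitions)]
    for position, (key, value) in enumerate(rows):
        index = key % num_partitions if by_key else position % num_partitions
        partitions[index].append(value)
    return partitions


def worst_single_partition_error(partitions, true_mean):
    """Largest error when the global mean is estimated from one partition."""
    return max(abs(mean(p) - true_mean) for p in partitions)


# Hypothetical data: 4 keys, 5 rows each; the value depends strongly on the key.
rows = [(key, 10 * key + offset) for key in range(4) for offset in range(5)]
true_mean = mean(value for _, value in rows)

key_partitions = partition_rows(rows, 4, by_key=True)
round_robin_partitions = partition_rows(rows, 4, by_key=False)

key_error = worst_single_partition_error(key_partitions, true_mean)
round_robin_error = worst_single_partition_error(round_robin_partitions, true_mean)

print("worst error, key-based partitions:", key_error)
print("worst error, round-robin partitions:", round_robin_error)
```

With key-based partitions, each partition holds rows for a single key, so a one-partition sample reflects only that key. With the key-independent layout, each partition holds a mix of keys and the estimate stays much closer to the true mean.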
