| ClickHouse | Druid or Pinot |
| --- | --- |
| The organization has expertise in C++ | The organization has expertise in Java |
| Small cluster | Large cluster |
| A few tables | Many tables |
| Single data set | Multiple unrelated data sets (multitenancy) |
| Tables and data sets reside in the cluster permanently | Tables and data sets periodically emerge and retire from the cluster |
| Table sizes (and the query intensity against them) are stable over time | Tables significantly grow and shrink over time |
| Homogeneity of queries (their type, size, distribution by time of day, etc.) | Heterogeneity |
| There is a dimension in the data by which it could be partitioned, and almost no queries touch data across the partitions (i.e. shared-nothing partitioning) | There is no such dimension; queries often touch data across the whole cluster. Edit 2019: Pinot now supports partitioning by dimension. |
| Cloud is not used; the cluster is deployed on specific physical servers | Cluster is deployed in the cloud |
| No existing clusters of Hadoop or Spark | Clusters of either Hadoop or Spark already exist and could be used |
@kishoreg commented Sep 23, 2019

> There is no such dimension, queries often touch data across the whole cluster.

Pinot has support for partitioning and sorting on a single dimension key.

@leventov (owner, author) commented Sep 23, 2019

@kishoreg thanks, updated. However, partitioning by key makes partition-based sampling problematic, because a sample drawn from only some of the partitions may be very biased. And efficient sampling may be even more important than the benefits that key-based partitioning provides.
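
To make the bias concern concrete, here is a minimal Python sketch on synthetic data. It is not how Pinot, Druid, or ClickHouse partition or sample data. The data set, the modulo-based key partitioning, and every function name below are assumptions made up for illustration. It estimates a global mean from a single partition and compares the worst-case error under key-based and random row placement.

```python
import random
from statistics import mean


def make_rows(n_keys=8, rows_per_key=1000, seed=0):
    """Hypothetical data set: the metric is strongly correlated with the key."""
    rng = random.Random(seed)
    return [
        (key, 10.0 * key + rng.gauss(0, 1))
        for key in range(n_keys)
        for _ in range(rows_per_key)
    ]


def partition_by_key(rows, n_partitions):
    """Assumed key-based scheme: all rows of one key land in one partition."""
    parts = [[] for _ in range(n_partitions)]
    for key, value in rows:
        parts[key % n_partitions].append(value)
    return parts


def partition_randomly(rows, n_partitions, seed=1):
    """Baseline: every row goes to a uniformly random partition."""
    rng = random.Random(seed)
    parts = [[] for _ in range(n_partitions)]
    for _, value in rows:
        parts[rng.randrange(n_partitions)].append(value)
    return parts


def worst_partition_error(parts, true_mean):
    """Largest absolute error when one partition alone estimates the mean."""
    return max(abs(mean(p) - true_mean) for p in parts)


rows = make_rows()
true_mean = mean(value for _, value in rows)

key_err = worst_partition_error(partition_by_key(rows, 4), true_mean)
rand_err = worst_partition_error(partition_randomly(rows, 4), true_mean)

print(f"worst single-partition error, key-based: {key_err:.2f}")
print(f"worst single-partition error, random:    {rand_err:.2f}")
```

In this setup each key-based partition holds only two of the eight keys, so reading one partition misses most of the metric's range, while a randomly filled partition behaves like a uniform sample of the whole table. How large the effect is in practice depends on how strongly the queried metric correlates with the partitioning key.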