@valyala
Created November 5, 2018 06:12

When size matters - benchmarking VictoriaMetrics vs Timescale and InfluxDB

Recently Timescale published the Time Series Benchmark Suite (TSBS) - a framework for benchmarking time series databases. See TSBS on GitHub.

TSBS can:

  • generate the configured number of production-like timeseries (a sample invocation is sketched below);
  • measure insert performance for the generated timeseries;
  • measure select performance for various production-like queries.
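
For example, generating a three-day devops dataset for InfluxDB looks roughly like this. This is a sketch based on the current TSBS README; the flag names may differ in the 2018 version used for this benchmark:

```
tsbs_generate_data --use-case="devops" --seed=123 --scale=4000 \
    --timestamp-start="2018-01-01T00:00:00Z" \
    --timestamp-end="2018-01-04T00:00:00Z" \
    --log-interval="10s" --format="influx" \
    | gzip > /tmp/influx-data.gz
```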

The original TSBS supports the following systems:

  • Timescale
  • InfluxDB
  • MongoDB
  • Cassandra

Adding VictoriaMetrics to TSBS

We liked TSBS, so we quickly hacked support for the Prometheus remote write API into TSBS and started using it. The initial results weren't exciting: VictoriaMetrics was slow on some queries and required a lot of memory during benchmark runs. The root cause was the remote read API. TSBS had been configured to query Prometheus, which in turn queried VictoriaMetrics via the remote read API. This didn't scale well, since VictoriaMetrics had to prepare and return huge amounts of data to Prometheus on heavy queries such as double-groupby. The solution was to build a PromQL engine directly into VictoriaMetrics, so all the heavy lifting on complex queries could be implemented and optimized inside the engine. The end result is the Extended PromQL engine with full PromQL support plus additional useful features such as WITH expressions.
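
For instance, WITH expressions let a query define a common part once and reuse it. A minimal sketch - the metric names and label filters here are illustrative, not taken from the benchmark:

```
WITH (
    commonFilters = {job="node_exporter", instance="host123:9100"}
)
node_memory_MemFree_bytes{commonFilters}
  / node_memory_MemTotal_bytes{commonFilters}
```

The named filter set is defined once and substituted into both selectors, instead of being repeated verbatim.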

Benchmark preparation

Which competitors to put against VictoriaMetrics?

Of the systems supported by TSBS, that left two competitors: Timescale and InfluxDB.

The following TSBS queries couldn't be translated to PromQL, so they were dropped from the benchmark:

  • lastpoint - PromQL cannot return the last point for each time series;
  • groupby-orderby-limit - PromQL doesn't support order by and limit clauses.

The high-cpu queries were modified to return max(usage_user) for each host, since PromQL doesn't support SELECT *. The cpu-max-all queries were dropped, since they weren't present in the benchmark results published by Timescale.
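
A sketch of what the modified high-cpu query looks like in PromQL - the metric and label names are assumptions, not the exact TSBS-generated ones:

```
# Max CPU usage per host, restricted to readings above the 90% threshold.
max(cpu_usage_user > 90) by (hostname)
```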

The benchmark was run on Google Compute Engine on two n1-standard-8 instances (8 virtual CPUs, 30GB RAM and a 200GB HDD each) - one instance for the client (TSBS), and one for the server. Timescale version - 0.12.1, InfluxDB version - 1.6.4.

Benchmark results

Insert performance for a billion datapoints belonging to 40K distinct timeseries:

  • VictoriaMetrics - 1.7M datapoints per second, RAM usage - 0.8GB, data size on HDD - 387MB.
  • InfluxDB - 1.1M datapoints per second, RAM usage - 1.7GB, data size on HDD - 573MB.
  • Timescale - 890K datapoints per second, RAM usage - 0.4GB, data size on HDD - 29GB.

Nothing remarkable here, except that the Timescale data occupies a whopping 29GB on HDD. That's 50x more than InfluxDB and 75x more than VictoriaMetrics. Later we'll see when this size matters.

Select performance:

  • VictoriaMetrics beats InfluxDB and Timescale on all the queries, by a margin of up to 20x. It especially excels at heavy queries, which scan many millions of datapoints across thousands of distinct timeseries.
  • InfluxDB takes second place. It beats Timescale on light queries but loses to it by up to 3.5x on heavy queries.
  • Timescale takes third place. Moreover, it was multiple orders of magnitude slower on all the queries when the required data wasn't in the page cache, while VictoriaMetrics and InfluxDB were only marginally slower in those cases.

See full benchmark results.

Analysis

Why did Timescale perform so poorly on select queries? The answer lies in the huge data size (29GB) and an on-disk data layout that isn't suited for storage with low IOPS. Google Cloud HDDs are limited in IOPS per GB and in throughput per GB: a 200GB disk is capped at 150 read operations per second and 24MB/s read/write throughput. Simple arithmetic shows that loading 29GB into the page cache at 24MB/s takes about 20 minutes. VictoriaMetrics would load the same billion datapoints (just 387MB on disk) from the same HDD in about 16 seconds.
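
The back-of-the-envelope arithmetic, using the on-disk sizes from the insert benchmark:

```
Timescale:       29 GB = 29 * 1024 MB; 29696 MB / 24 MB/s ≈ 1240 s ≈ 20 minutes
VictoriaMetrics: 387 MB / 24 MB/s ≈ 16 s
```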

Timescale hit the read throughput limit only a few times during the select queries; the rest of the time it was capped by the 150 read operations per second. This points to a data layout that is suboptimal for low-IOPS storage such as HDDs.

Any workarounds for Timescale? The easiest one is to use more expensive storage with high bandwidth and high IOPS, such as high-end SSDs.
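
On Google Compute Engine that means provisioning the data disk as pd-ssd instead of pd-standard, for example (the disk name and zone below are placeholders):

```
# SSD persistent disks provide far higher IOPS and throughput per GB than HDDs.
gcloud compute disks create tsdb-data --size=200GB --type=pd-ssd --zone=us-east1-b
```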

Conclusion

Sometimes size matters :) It may be more expensive than you expect.

TSBS is a great benchmarking tool. It helped us minimize CPU and RAM usage for VictoriaMetrics on production workloads. We plan to run benchmarks and publish results for higher cardinality (millions of unique timeseries) and a higher number of datapoints (trillions). Stay tuned.

In the meantime, read how we created VictoriaMetrics - the best remote storage for Prometheus.
