- Currently, all log information is in the `logging` table, which as of now holds around ~2,13,92,493 (~21.4 million) records.
- Total data size is ~36 GB (retrieved from the `information_schema` tables).
- A record can be anywhere between 1 KB and 10 KB depending on the amount of data stored (roughly 1.8 KB on average).
- With the current indexes on the table, logs for a single user can be fetched quickly: the query does a `ref` lookup against a `const` value on the `logging_member_id_foreign` index.
```
mysql> EXPLAIN SELECT * FROM `logging` WHERE `member_id` = 45373 ORDER BY `id` DESC LIMIT 100 OFFSET 0;
+----+-------------+---------+------------+------+---------------------------+---------------------------+---------+-------+------+----------+-------------+
| id | select_type | table   | partitions | type | possible_keys             | key                       | key_len | ref   | rows | filtered | Extra       |
+----+-------------+---------+------------+------+---------------------------+---------------------------+---------+-------+------+----------+-------------+
|  1 | SIMPLE      | logging | NULL       | ref  | logging_member_id_foreign | logging_member_id_foreign | 5       | const |    1 |   100.00 | Using where |
+----+-------------+---------+------------+------+---------------------------+---------------------------+---------+-------+------+----------+-------------+
1 row in set, 1 warning (0.05 sec)
```
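As a quick sanity check on the numbers above (a back-of-the-envelope calculation, not a measurement from the live system), ~36 GB spread over ~21.4 million rows averages out to roughly 1.8 KB per record, which is consistent with the stated 1 KB to 10 KB range:

```sql
-- Back-of-the-envelope: average record size from the totals quoted above.
SELECT 36 * POW(1024, 3) / 21392493 / 1024 AS avg_record_kb;
-- works out to roughly 1.8 KB per record
```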
I have evaluated which of the available datastores might fit the need; here's an overview:
| Name | Description | PHP/Laravel Package | Recommended |
|---|---|---|---|
| ClickHouse | ClickHouse® is an open-source, high-performance columnar OLAP database management system for real-time analytics using SQL. | https://github.com/smi2/phpClickHouse, https://github.com/bavix/laravel-clickhouse | Yes |
| Elasticsearch | Elasticsearch, Kibana, Beats, and Logstash (also known as the ELK Stack). Reliably and securely take data from any source, in any format, then search, analyze, and visualize it in real time. | https://github.com/elastic/elasticsearch-php, https://github.com/sleimanx2/plastic | Yes |
| PrestoDB | Presto is an open-source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes. | No well-maintained PHP client | No, lacks good integration |
| InfluxDB | At its heart, InfluxDB is a database purpose-built to handle the epic volumes and countless sources of time-stamped data produced by sensors, applications and infrastructure. | https://github.com/influxdata/influxdb-client-php | No, unrelated to our need |
| Apache HBase | Apache HBase™ is good when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables (billions of rows × millions of columns) atop clusters of commodity hardware. | No well-maintained PHP client, requires the Thrift protocol | No, lacks good integration |
| Tarantool | Tarantool keeps all the data in random-access memory (RAM). Solutions of this class are fast but often demonstrate a lack of data integrity; Tarantool solves these problems. | https://github.com/tarantool-php/client | No, unrelated to our need |
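To make the ClickHouse option more concrete, here is a rough sketch of what the table could look like there. The column names and types are illustrative assumptions, not the real schema:

```sql
-- Hypothetical ClickHouse table for the same logs: columnar MergeTree storage,
-- ordered by (member_id, id) so per-user lookups stay cheap.
-- Column names/types are assumptions, not the actual `logging` schema.
CREATE TABLE logging
(
    id         UInt64,
    member_id  UInt32,
    message    String,
    created_at DateTime
)
ENGINE = MergeTree
PARTITION BY toYear(created_at)
ORDER BY (member_id, id);
```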
Possible approaches:

- Keep using MySQL in a separate, dedicated database and, if needed in the future, introduce sharding based on some logic (e.g. by year).
- Introduce Elasticsearch, which has a good official PHP client and popular Laravel packages. It also ships with bundled tools for data visualization, which might come in handy.
- Introduce ClickHouse, which has a community-developed PHP client and a less widely used Laravel package. It features SQL-like syntax, so it could also serve financial data analytics and dashboards.
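For the first option, one way year-based sharding could be realized is MySQL's native RANGE partitioning. This is a sketch only: it assumes the table has a `created_at` column, MySQL requires the partitioning column to be part of every unique key (so the primary key would have to be widened), and partitioned InnoDB tables cannot carry foreign keys, so `logging_member_id_foreign` would need to become a plain index first:

```sql
-- Sketch: widen the PK so it includes the partitioning column
-- (MySQL requires every unique key to contain it).
ALTER TABLE logging DROP PRIMARY KEY, ADD PRIMARY KEY (id, created_at);

-- Then partition by year; old years can later be archived or dropped cheaply.
ALTER TABLE logging PARTITION BY RANGE (YEAR(created_at)) (
    PARTITION p2020 VALUES LESS THAN (2021),
    PARTITION p2021 VALUES LESS THAN (2022),
    PARTITION p2022 VALUES LESS THAN (2023),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);
```

Dropping an entire year then becomes `ALTER TABLE logging DROP PARTITION p2020;`, which is far cheaper than a bulk `DELETE`.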