vlasky/calculate_median_in_mysql_8_using_window_function_row_number.md

## calculate_median_in_mysql_8_using_window_function_row_number.md

      
    Raw
  

              calculate_median_in_mysql_8_using_window_function_row_number.md
            
          
    During a collaborative session with ChatGPT, I came up with a modern MySQL query to calculate the median value of a set of data. It leverages the window function ROW_NUMBER() introduced in MySQL 8.0. For this reason, it won't run on MySQL 5.7 or older versions.
The median is an extremely useful metric, especially when analyzing data sets representing response times, latencies or error rates. Unlike the average, it is resistant to being skewed by outliers.
To get the median of a data set, we need to sort the data and take the value of the middle row (if the number of rows is odd), or the average of the two middle numbers (if the number of rows is even).
All the median calculation MySQL queries I found online seemed crude, relying on user-defined variables, string concatenation via GROUP_CONCAT() and/or multiple subqueries. I find this approach much cleaner.
As of June 2023, MariaDB is currently the only fork of MySQL that has native MEDIAN() and PERCENTILE_CONT() functions. A feature request to add these functions was submitted to the MySQL bug tracker back in November 2018, but it has not yet been actioned.
First, we'll create a table and populate it with some data. For this example, let's imagine we have a vehicle logbook with distances covered per trip and want to determine the median distance travelled.
CREATE TABLE trips (
  id int(11) NOT NULL AUTO_INCREMENT,
  start_time timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  distance_km double NOT NULL,
  PRIMARY KEY (id)
);

INSERT INTO trips (start_time,distance_km) values ('2023-06-01 08:52:23',200.3);
INSERT INTO trips (start_time,distance_km) values ('2023-06-02 08:49:10',2.2);
INSERT INTO trips (start_time,distance_km) values ('2023-06-03 09:23:23',23.2);
INSERT INTO trips (start_time,distance_km) values ('2023-06-04 10:18:23',20.0);
INSERT INTO trips (start_time,distance_km) values ('2023-06-05 07:22:23',7.0);
INSERT INTO trips (start_time,distance_km) values ('2023-06-06 08:08:23',312.23);
INSERT INTO trips (start_time,distance_km) values ('2023-06-07 09:43:23',1.2);
INSERT INTO trips (start_time,distance_km) values ('2023-06-08 09:47:23',220.56);
INSERT INTO trips (start_time,distance_km) values ('2023-06-09 09:55:23',50.2);

And here is the query that returns the median:
SELECT
  AVG(distance_km) AS MedianTripDistanceKm
  FROM (
    SELECT
      trips.distance_km,
      ROW_NUMBER() OVER (ORDER BY trips.distance_km) AS row_num,
      COUNT(*) OVER () AS total_count
    FROM
      trips
  ) AS dd
WHERE
  dd.row_num IN (FLOOR((dd.total_count+1)/2), FLOOR((dd.total_count+2)/2));

It will return the following result:
+----------------------+
| MedianTripDistanceKm |
+----------------------+
|                 23.2 |
+----------------------+
1 row in set (0.00 sec)

To prove that this is correct, we will examine the result of the inner query:
   SELECT
      trips.distance_km,
      ROW_NUMBER() OVER (ORDER BY trips.distance_km) AS row_num,
      COUNT(*) OVER () AS total_count
    FROM
      trips

Which returns
+-------------+---------+-------------+
| distance_km | row_num | total_count |
+-------------+---------+-------------+
|         1.2 |       1 |           9 |
|         2.2 |       2 |           9 |
|           7 |       3 |           9 |
|          20 |       4 |           9 |
|        23.2 |       5 |           9 |
|        50.2 |       6 |           9 |
|       200.3 |       7 |           9 |
|      220.56 |       8 |           9 |
|      312.23 |       9 |           9 |
+-------------+---------+-------------+
9 rows in set (0.00 sec)

As we can see, row 5 is the middle row, which contains the median value of 23.2.
Now let's see if it produces the correct result with an even number of values. We will add an extra data point:
INSERT INTO trips (start_time,distance_km) values ('2023-06-10 11:02:03',21.0);

If we run the original query again we get a new median value of 22.1:
+----------------------+
| MedianTripDistanceKm |
+----------------------+
|                 22.1 |
+----------------------+
1 row in set (0.00 sec)

Let's verify it by examining the output of the inner query:
+-------------+---------+-------------+
| distance_km | row_num | total_count |
+-------------+---------+-------------+
|         1.2 |       1 |          10 |
|         2.2 |       2 |          10 |
|           7 |       3 |          10 |
|          20 |       4 |          10 |
|          21 |       5 |          10 |
|        23.2 |       6 |          10 |
|        50.2 |       7 |          10 |
|       200.3 |       8 |          10 |
|      220.56 |       9 |          10 |
|      312.23 |      10 |          10 |
+-------------+---------+-------------+
10 rows in set (0.00 sec)

The median is now given by the average of rows 5 and 6, which is (21+23.2)/2. This equals 22.1, which matches the previous result.
Hope you find this useful.