Data is generally stored at the lowest level of granularity needed by any of a database's users.
Sometimes for instance, one needs to take a look at each record of a table. But that does not
mean you have to look at it at that granularity. The group by
construct and the aggregate functions allow you to examine data at a different granularity.
If you look at a table like employee
, then you will see that it has a number of columns like emp_id
, dept_id
, title
etc. Thus, each employee has some value for these fields. Sometimes, it may so happen that you want to form the groups of employees based on these criteria like title
. Doing this may help us answer questions like how many different titles of employees are there? It is almost like asking each employee her/his title and putting her/him in a bucket labeled with that title.
This will enable us to first of all, find the buckets with distinct labels. Each bucket is a group that is unique with respect to the criterion chosen (this means they may differ with respect to otherwise). In a small table with 18 employees each one of which have a department we see the following:
mysql> select emp_id, lname, dept_id from employee;
+--------+-----------+---------+
| emp_id | lname | dept_id |
+--------+-----------+---------+
| 1 | Smith | 3 |
| 2 | Barker | 3 |
| 3 | Tyler | 3 |
| 4 | Hawthorne | 1 |
| 5 | Gooding | 2 |
| 6 | Fleming | 1 |
| 7 | Tucker | 1 |
| 8 | Parker | 1 |
| 9 | Grossman | 1 |
| 10 | Roberts | 1 |
| 11 | Ziegler | 1 |
| 12 | Jameson | 1 |
| 13 | Blake | 1 |
| 14 | Mason | 1 |
| 15 | Portman | 1 |
| 16 | Markham | 1 |
| 17 | Fowler | 1 |
| 18 | Tulman | 1 |
+--------+-----------+---------+
18 rows in set (0.00 sec
If we make the dept_id as a bucket or group, then each of these employees has to fall in one of the three buckets, where the dept_id is either 1, 2, or 3. There is quite some insight that we may get about these groups, if we were actually able to create these buckets. This is exactly what the group by
construct is designed for. See for example, the output of the following query:
mysql> select dept_id from employee group by dept_id;
+---------+
| dept_id |
+---------+
| 1 |
| 2 |
| 3 |
+---------+
3 rows in set (0.00 sec)
This output is to be expected as we have three buckets as far as dept_id's are concerned and each employee is in exactly one of these buckets.
Note that each member of the group is still a full-fledged record from the table. Thus, each member of a group has all the characteristics (e.g. title, emp_id, lname, fname etc.) that are of interest.
Once the groups are made, what is of interest? Perhaps doing some operations on the group as a whole! The most straightforward of those is the size of each group. For now, it does not matter how we count the members of each group -- we just see how many things are in a bucket. The way to do that in SQL is the famous count(*). Thus, to count the number of members in each bucket (i.e. the size of each group), we do:
mysql> select dept_id, count(*) from employee group by dept_id;
+---------+----------+
| dept_id | count(*) |
+---------+----------+
| 1 | 14 |
| 2 | 1 |
| 3 | 3 |
+---------+----------+
3 rows in set (0.00 sec)
or, more descriptively:
mysql> select dept_id, count(dept_id) 'size of group with this dept_id' from employee group by dept_id;
+---------+---------------------------------+
| dept_id | size of group with this dept_id |
+---------+---------------------------------+
| 1 | 14 |
| 2 | 1 |
| 3 | 3 |
+---------+---------------------------------+
3 rows in set (0.00 sec)
The criterion for grouping used in the above example is the dept_id of each employee record. This criterion can be expressed as: Group the employees by their departments.
This is what aggregate functions are for: doing some operation on all the records in a group. Thus, if a particular characteristic of every record in the group is numeric, we could sum the records on that characteristic, find the max or min of all the records for that characteristic and so on.
And the database server does that operation for all the groups formed by the column/criterion specified by the group by
clause. In the above example, the count
aggregate function is used to count the number of records in each group of employees in the same department.
The number of aggregate functions provided by a database varies, but the following functions are provided by a majority of the databases:
- Avg(): Finds the average value of a particular column (numeric) of all the records in a group.
- Count(): Counts the number of records in a group, based on a column (asterisk means any column).
- Max(): Returns the maximum (numeric) value of a column for all the records in a group.
- Min(): Returns the minimum (numeric) value of a column for all the records in a group.
- Sum(): Returns the sum of a (numeric) value of a column for all the records in a group.
With the count()
function, there is an additional option called DISTINCT
. Remember that each record is a full-fledged record with all the characteristics of the given table (or relation). This means, when we are counting the members of a group, we can specify whether counting is affected by whether or not the value of a particular column for each member in the group is distinct.
Thus, count(role)
will count all the records in a group that have some role, whereas count(DISTINCT role)
will count only those records in a group that have unique value for role. So, for example, in a group formed based on departments (i.e. group by dept_id
), if we have 10 members, 8 of which are TELLERs
and 2 are HEAD TELLERs
, then count(role)
would return 10, whereas count(DISTINCT role)
would return 2 since there are only two distinct roles in the entire group.
We now explore how counting is affected by NULL values. Simply speaking, count(*)
counts the number of records in a group, whereas count(column_name)
counts the elements that have some non NULL value for column_name
. The NULL values are also ignored by other aggregate functions.
When the groups are formed, certain filtering conditions are to be specified using the having
clause, rather than the where
clause:
mysql> select count(*) c from account group by product_cd having c >= 10;
+----+
| c |
+----+
| 10 |
+----+
1 row in set (0.00 sec)
An implicit group is formed in the absence of the group by
clause. The aggregate functions can be applied to the group thus formed. But it is not always possible to create groups without the explicit group by
clause. If you see an error like invalid use of aggregate function
or mixing of group columns with no group columns is illegal if there no GROUP BY clause
then you are running into this problem of database being unable to create explicit groups to run the aggregate functions on. Sometimes, without the essential group by
, the database may actually return incorrect values without throwing any error!
We have already seen that choosing a column name in the group by
clause groups all the records by the value of that column for each record. Nothing prevents us from grouping records by their values of more than one column. For instance, it is possible to group students in a school by their grade and favorite sport. In our example, we may want to group the accounts by their type and the branch where they were opened:
mysql> select product_cd, open_branch_id, count(*) from account group by product_cd, open_branch_id;
+------------+----------------+----------+
| product_cd | open_branch_id | count(*) |
+------------+----------------+----------+
| BUS | 2 | 1 |
| BUS | 4 | 1 |
| CD | 1 | 2 |
| CD | 2 | 2 |
| CHK | 1 | 3 |
| CHK | 2 | 2 |
| CHK | 3 | 1 |
| CHK | 4 | 4 |
| MM | 1 | 2 |
| MM | 3 | 1 |
| SAV | 1 | 1 |
| SAV | 2 | 2 |
| SAV | 4 | 1 |
| SBL | 3 | 1 |
+------------+----------------+----------+
14 rows in set (0.00 sec)
Thus, the above example shows the multicolumn grouping. The idea is like
this: We look at the available number of products (say m
) and available
number of branches where an account could be opened (say n
). From the
rule of product, we know the
number of groups thus formed is m ⨉ n
. So, we form these groups, go through
our records one by one and place each one of them in one of the groups. Thus,
for an account if the product type is BUS (business) and the branch is Woburn
(open_branch_id = 2), then we place it in that group.
In our present example, there are 6 different types of products (select count(*) from account group by product_cd;
) and 4 different branches (select count(*) from branch
), which means we have 6 ⨉ 4 groups formed that way. Over these
24 groups, we can apply:
- Several aggregate functions, and
- A few options
We have seen aggregate functions like avg, max, min, sum, count etc. but the options are something new. There are two options that are of interest:
The dictionary meaning of rollup is accumulation. A few actual runs of this option will clarify what this option does.
The dictionary meaning of rollup is accumulation. A few actual runs of this option will clarify what this option does.
Consider the following output:
mysql> select count(*), product_cd, open_branch_id, sum(avail_balance) from account group by product
_cd, open_branch_id with rollup;
+----------+------------+----------------+--------------------+
| count(*) | product_cd | open_branch_id | sum(avail_balance) |
+----------+------------+----------------+--------------------+
| 1 | BUS | 2 | 9345.55 |
| 1 | BUS | 4 | 0.00 |
| 2 | BUS | NULL | 9345.55 |
| 2 | CD | 1 | 11500.00 |
| 2 | CD | 2 | 8000.00 |
| 4 | CD | NULL | 19500.00 |
| 3 | CHK | 1 | 782.16 |
| 2 | CHK | 2 | 3315.77 |
| 1 | CHK | 3 | 1057.75 |
| 4 | CHK | 4 | 67852.33 |
| 10 | CHK | NULL | 73008.01 |
| 2 | MM | 1 | 14832.64 |
| 1 | MM | 3 | 2212.50 |
| 3 | MM | NULL | 17045.14 |
| 1 | SAV | 1 | 767.77 |
| 2 | SAV | 2 | 700.00 |
| 1 | SAV | 4 | 387.99 |
| 4 | SAV | NULL | 1855.76 |
| 1 | SBL | 3 | 50000.00 |
| 1 | SBL | NULL | 50000.00 |
| 24 | NULL | NULL | 170754.46 |
+----------+------------+----------------+--------------------+
21 rows in set (0.01 sec)
Results would be have clearer if the NULL in the above output were replaced with ANY.
Here is an output from another (perhaps more complicated) run:
mysql> select count(*), product_cd, open_branch_id, year(open_date), sum(avail_balance) from accou
group by product_cd, open_branch_id, year(open_date) with rollup;
+----------+------------+----------------+-----------------+--------------------+
| count(*) | product_cd | open_branch_id | year(open_date) | sum(avail_balance) |
+----------+------------+----------------+-----------------+--------------------+
| 1 | BUS | 2 | 2004 | 9345.55 |
| 1 | BUS | 2 | NULL | 9345.55 |
| 1 | BUS | 4 | 2002 | 0.00 |
| 1 | BUS | 4 | NULL | 0.00 |
| 2 | BUS | NULL | NULL | 9345.55 |
| 2 | CD | 1 | 2004 | 11500.00 |
| 2 | CD | 1 | NULL | 11500.00 |
| 2 | CD | 2 | 2004 | 8000.00 |
| 2 | CD | 2 | NULL | 8000.00 |
| 4 | CD | NULL | NULL | 19500.00 |
| 1 | CHK | 1 | 2002 | 122.37 |
| 2 | CHK | 1 | 2003 | 659.79 |
| 3 | CHK | 1 | NULL | 782.16 |
| 1 | CHK | 2 | 2000 | 1057.75 |
| 1 | CHK | 2 | 2001 | 2258.02 |
| 2 | CHK | 2 | NULL | 3315.77 |
| 1 | CHK | 3 | 2002 | 1057.75 |
| 1 | CHK | 3 | NULL | 1057.75 |
| 1 | CHK | 4 | 2001 | 3487.19 |
| 1 | CHK | 4 | 2002 | 23575.12 |
| 1 | CHK | 4 | 2003 | 38552.05 |
| 1 | CHK | 4 | 2004 | 2237.97 |
| 4 | CHK | 4 | NULL | 67852.33 |
| 10 | CHK | NULL | NULL | 73008.01 |
| 2 | MM | 1 | 2004 | 14832.64 |
| 2 | MM | 1 | NULL | 14832.64 |
| 1 | MM | 3 | 2002 | 2212.50 |
| 1 | MM | 3 | NULL | 2212.50 |
| 3 | MM | NULL | NULL | 17045.14 |
| 1 | SAV | 1 | 2000 | 767.77 |
| 1 | SAV | 1 | NULL | 767.77 |
| 1 | SAV | 2 | 2000 | 500.00 |
| 1 | SAV | 2 | 2001 | 200.00 |
| 2 | SAV | 2 | NULL | 700.00 |
| 1 | SAV | 4 | 2001 | 387.99 |
| 1 | SAV | 4 | NULL | 387.99 |
| 4 | SAV | NULL | NULL | 1855.76 |
| 1 | SBL | 3 | 2004 | 50000.00 |
| 1 | SBL | 3 | NULL | 50000.00 |
| 1 | SBL | NULL | NULL | 50000.00 |
| 24 | NULL | NULL | NULL | 170754.46 |
+----------+------------+----------------+-----------------+--------------------+
41 rows in set (0.00 sec)
With cube
Construct a query to that counts the number of rows in the account
table.
mysql> select count(*) from account;
+----------+
| count(*) |
+----------+
| 24 |
+----------+
1 row in set (0.00 sec)
Modify the query in 8-1 to count the number of accounts held by each customer. Show the cu ID and number of accounts for each cu.
mysql> select cust_id, count(*) c from account group by cust_id;
+---------+---+
| cust_id | c |
+---------+---+
| 1 | 3 |
| 2 | 2 |
| 3 | 2 |
| 4 | 3 |
| 5 | 1 |
| 6 | 2 |
| 7 | 1 |
| 8 | 2 |
| 9 | 3 |
| 10 | 2 |
| 11 | 1 |
| 12 | 1 |
| 13 | 1 |
+---------+---+
13 rows in set (0.00 sec)
Modify the query in 8-2 to include only those customers who have at least two accounts.
mysql> select cust_id, count(*) c from account group by cust_id having c >=2 ;
+---------+---+
| cust_id | c |
+---------+---+
| 1 | 3 |
| 2 | 2 |
| 3 | 2 |
| 4 | 3 |
| 6 | 2 |
| 8 | 2 |
| 9 | 3 |
| 10 | 2 |
+---------+---+
8 rows in set (0.01 sec)
Find the total available balance by product and branch where there is more than one account per product and branch. Order the result by total balance (highest to lowest).
mysql> select sum(avail_balance) s, product_cd p, open_branch_id b from account group by product_cd,
open_branch_id having count(cust_id) > 1 order by s desc;
+----------+-----+------+
| s | p | b |
+----------+-----+------+
| 67852.33 | CHK | 4 |
| 14832.64 | MM | 1 |
| 11500.00 | CD | 1 |
| 8000.00 | CD | 2 |
| 3315.77 | CHK | 2 |
| 782.16 | CHK | 1 |
| 700.00 | SAV | 2 |
+----------+-----+------+
7 rows in set (0.00 sec)