Wednesday, August 24, 2005

What does GROUP BY CUBE do?

This is one of those "new" features that has actually been around for a long time, but I had never used it. Then recently I was asked to tune a long-running report, and it occured to me that it might just be what I needed.

The report consisted of 4 summary queries, each of which presented the same data but summarised by a different "dimension". Imagine it was based on the EMP table, then the report would show:

Employee Summary by Job
-----------------------

Job # Emps Total Sal
---------------------------------------- ---------- ----------
CLERK 4 4150
ANALYST 2 6000
MANAGER 3 8275
SALESMAN 4 5600
PRESIDENT 1 5000
---------- ----------
14 29025


Employee Summary by Manager
---------------------------

Mgr # Emps Total Sal
---------------------------------------- ---------- ----------
7566 2 6000
7698 5 6550
7782 1 1300
7788 1 1100
7839 3 8275
7902 1 800
---------- ----------
14 29025

Employee Summary by Department
------------------------------

Dept # Emps Total Sal
---------------------------------------- ---------- ----------
10 3 8750
20 5 10875
30 6 9400
---------- ----------
14 29025


That is only 3 queries, but you get the idea.

Each query took 10 minutes to run (reasonable given the complexity of the query and the amount of data), so the whole report took 40 minutes.

My goal was to reduce the time to more like 10 minutes - perhaps by pre-computing the results grouped by all 4 dimensions into a temporary table and then reporting on that. However, it occured to me that GROUP BY CUBE might be what I needed, so I tried it out. Here is a simple query:

SQL> select job, mgr, deptno, count(*) numemps, sum(sal) tots
2 from emp
3 group by cube(job, mgr, deptno);

JOB MGR DEPTNO NUMEMPS TOTSAL
--------- ---------- ---------- ---------- ----------
1 5000
14 29025
10 1 5000
10 3 8750
20 5 10875
30 6 9400
7566 2 6000
7566 20 2 6000
7698 5 6550
7698 30 5 6550
7782 1 1300
7782 10 1 1300
7788 1 1100
7788 20 1 1100
7839 3 8275
7839 10 1 2450
7839 20 1 2975
7839 30 1 2850
7902 1 800
7902 20 1 800
CLERK 4 4150
CLERK 10 1 1300
CLERK 20 2 1900
CLERK 30 1 950
CLERK 7698 1 950
CLERK 7698 30 1 950
CLERK 7782 1 1300
CLERK 7782 10 1 1300
CLERK 7788 1 1100
CLERK 7788 20 1 1100
CLERK 7902 1 800
CLERK 7902 20 1 800
ANALYST 2 6000
ANALYST 20 2 6000
ANALYST 7566 2 6000
ANALYST 7566 20 2 6000
MANAGER 3 8275
MANAGER 10 1 2450
MANAGER 20 1 2975
MANAGER 30 1 2850
MANAGER 7839 3 8275
MANAGER 7839 10 1 2450
MANAGER 7839 20 1 2975
MANAGER 7839 30 1 2850
SALESMAN 4 5600
SALESMAN 30 4 5600
SALESMAN 7698 4 5600
SALESMAN 7698 30 4 5600
PRESIDENT 1 5000
PRESIDENT 10 1 5000
PRESIDENT 10 1 5000

52 rows selected.


Buried in amongst all those rows is all the data I need. There is a GROUPING function that can be used to see which group a row belongs to:

SQL> select job, mgr, deptno, count(*) numemps, sum(sal) totsal
2 , grouping(job) gj, grouping(mgr) gm, grouping(deptno) gd
3 from emp
4 group by cube(job, mgr, deptno);

JOB MGR DEPTNO NUMEMPS TOTSAL GJ GM GD
--------- ---------- ---------- ---------- ---------- ---------- ---------- ----------
1 5000 1 0 1
14 29025 1 1 1
10 1 5000 1 0 0
10 3 8750 1 1 0
20 5 10875 1 1 0
30 6 9400 1 1 0
7566 2 6000 1 0 1
7566 20 2 6000 1 0 0
7698 5 6550 1 0 1
7698 30 5 6550 1 0 0
7782 1 1300 1 0 1
7782 10 1 1300 1 0 0
7788 1 1100 1 0 1
7788 20 1 1100 1 0 0
7839 3 8275 1 0 1
7839 10 1 2450 1 0 0
7839 20 1 2975 1 0 0
7839 30 1 2850 1 0 0
7902 1 800 1 0 1
7902 20 1 800 1 0 0
CLERK 4 4150 0 1 1
CLERK 10 1 1300 0 1 0
CLERK 20 2 1900 0 1 0
CLERK 30 1 950 0 1 0
CLERK 7698 1 950 0 0 1
CLERK 7698 30 1 950 0 0 0
CLERK 7782 1 1300 0 0 1
CLERK 7782 10 1 1300 0 0 0
CLERK 7788 1 1100 0 0 1
CLERK 7788 20 1 1100 0 0 0
CLERK 7902 1 800 0 0 1
CLERK 7902 20 1 800 0 0 0
ANALYST 2 6000 0 1 1
ANALYST 20 2 6000 0 1 0
ANALYST 7566 2 6000 0 0 1
ANALYST 7566 20 2 6000 0 0 0
MANAGER 3 8275 0 1 1
MANAGER 10 1 2450 0 1 0
MANAGER 20 1 2975 0 1 0
MANAGER 30 1 2850 0 1 0
MANAGER 7839 3 8275 0 0 1
MANAGER 7839 10 1 2450 0 0 0
MANAGER 7839 20 1 2975 0 0 0
MANAGER 7839 30 1 2850 0 0 0
SALESMAN 4 5600 0 1 1
SALESMAN 30 4 5600 0 1 0
SALESMAN 7698 4 5600 0 0 1
SALESMAN 7698 30 4 5600 0 0 0
PRESIDENT 1 5000 0 0 1
PRESIDENT 1 5000 0 1 1
PRESIDENT 10 1 5000 0 0 0
PRESIDENT 10 1 5000 0 1 0


Based on that, I can filter out the rows I'm not interested in and format the results like this:

SQL> select case when gj=0 and gm=1 and gd=1 then 'Job'
2 when gj=1 and gm=0 and gd=1 then 'Manager'
3 when gj=1 and gm=1 and gd=0 then 'Dept'
4 end as keytype
5 , case when gj=0 and gm=1 and gd=1 then job
6 when gj=1 and gm=0 and gd=1 then to_char(mgr)
7 when gj=1 and gm=1 and gd=0 then to_char(deptno)
8 end as keyval
9 , numemps, totsal
10 from
11 (
12 select job, mgr, deptno, count(*) numemps, sum(sal) totsal
13 , grouping(job) gj, grouping(mgr) gm, grouping(deptno) gd
14 from emp
15 group by cube(job, mgr, deptno)
16 )
17 where (gj=0 and gm=1 and gd=1)
18 or (gj=1 and gm=0 and gd=1)
19 or (gj=1 and gm=1 and gd=0);

KEYTYPE KEYVAL NUMEMPS TOTSAL
------- ---------------------------------------- ---------- ----------
Manager 7566 2 6000
7698 5 6550
7782 1 1300
7788 1 1100
7839 3 8275
7902 1 800
1 5000
******* ---------- ----------
sum 14 29025
Job ANALYST 2 6000
CLERK 4 4150
MANAGER 3 8275
PRESIDENT 1 5000
SALESMAN 4 5600
******* ---------- ----------
sum 14 29025
Dept 10 3 8750
20 5 10875
30 6 9400
******* ---------- ----------
sum 14 29025


... which gives me the breakdown I want in a single query, and provides the performance boost I was hoping for!

4 comments:

rudy said...

nice!

the combined sql is ugly, but hey, the performance improvement is more important, isn't it

but what happens if the application adds another job type or dept? have you created a maintenance nightmare?

Tony Andrews said...

> but what happens if the application adds another job type or dept? have you created a maintenance nightmare?

No, nothing in the SQL depends on the particular jobs or depts in the data. The only need for maintenance would be if a completely new "dimension" was required, such as "grade" or "project".

Tony Andrews said...

Actually, that final SQL can be simplified a little to:

SQL> select case when gj=0 then 'Job'
2 when gm=0 then 'Manager'
3 when gd=0 then 'Dept'
4 end as keytype
5 , case when gj=0 then job
6 when gm=0 then to_char(mgr)
7 when gd=0 then to_char(deptno)
8 end as keyval
9 , numemps, totsal
10 from
11 (
12 select job, mgr, deptno, count(*) numemps, sum(sal) totsal
13 , grouping(job) gj, grouping(mgr) gm, grouping(deptno) gd
14 from emp
15 group by cube(job, mgr, deptno)
16 )
17 where gj+gm+gd = 2

Rob van Wijk said...

Instead of aggregating and later on filtering out some of the aggregates, you can use grouping sets to exactly specify which groups you want. Even more readable, in my opinion.

SQL> select case
2 when grouping(job)=0 then 'Job'
3 when grouping(mgr)=0 then 'Manager'
4 when grouping(deptno)=0 then 'Dept'
5 end as keytype
6 , case
7 when grouping(job)=0 then job
8 when grouping(mgr)=0 then to_char(mgr)
9 when grouping(deptno)=0 then to_char(deptno)
10 end as keyval
11 , count(*) numemps
12 , sum(sal) totsal
13 from emp
14 group by grouping sets (job,mgr,deptno)