Wednesday, August 24, 2005

What does GROUP BY CUBE do?

This is one of those "new" features that has actually been around for a long time, but I had never used it. Then recently I was asked to tune a long-running report, and it occurred to me that it might be just what I needed.

The report consisted of 4 summary queries, each of which presented the same data but summarised by a different "dimension". Imagine it was based on the EMP table; the report would then show:

Employee Summary by Job
-----------------------

Job                                          # Emps  Total Sal
---------------------------------------- ---------- ----------
CLERK                                             4       4150
ANALYST                                           2       6000
MANAGER                                           3       8275
SALESMAN                                          4       5600
PRESIDENT                                         1       5000
                                         ---------- ----------
                                                 14      29025


Employee Summary by Manager
---------------------------

Mgr                                          # Emps  Total Sal
---------------------------------------- ---------- ----------
7566                                              2       6000
7698                                              5       6550
7782                                              1       1300
7788                                              1       1100
7839                                              3       8275
7902                                              1        800
                                                  1       5000
                                         ---------- ----------
                                                 14      29025

Employee Summary by Department
------------------------------

Dept                                         # Emps  Total Sal
---------------------------------------- ---------- ----------
10                                                3       8750
20                                                5      10875
30                                                6       9400
                                         ---------- ----------
                                                 14      29025


That is only 3 queries, but you get the idea.

Each query took 10 minutes to run (reasonable given the complexity of the query and the amount of data), so the whole report took 40 minutes.

My goal was to reduce the time to more like 10 minutes - perhaps by pre-computing the results grouped by all 4 dimensions into a temporary table and then reporting on that. However, it occurred to me that GROUP BY CUBE might be what I needed, so I tried it out. Here is a simple query:

SQL> select job, mgr, deptno, count(*) numemps, sum(sal) totsal
  2  from emp
  3  group by cube(job, mgr, deptno);

JOB              MGR     DEPTNO    NUMEMPS     TOTSAL
--------- ---------- ---------- ---------- ----------
                                         1       5000
                                        14      29025
                             10          1       5000
                             10          3       8750
                             20          5      10875
                             30          6       9400
                7566                     2       6000
                7566         20          2       6000
                7698                     5       6550
                7698         30          5       6550
                7782                     1       1300
                7782         10          1       1300
                7788                     1       1100
                7788         20          1       1100
                7839                     3       8275
                7839         10          1       2450
                7839         20          1       2975
                7839         30          1       2850
                7902                     1        800
                7902         20          1        800
CLERK                                    4       4150
CLERK                        10          1       1300
CLERK                        20          2       1900
CLERK                        30          1        950
CLERK           7698                     1        950
CLERK           7698         30          1        950
CLERK           7782                     1       1300
CLERK           7782         10          1       1300
CLERK           7788                     1       1100
CLERK           7788         20          1       1100
CLERK           7902                     1        800
CLERK           7902         20          1        800
ANALYST                                  2       6000
ANALYST                      20          2       6000
ANALYST         7566                     2       6000
ANALYST         7566         20          2       6000
MANAGER                                  3       8275
MANAGER                      10          1       2450
MANAGER                      20          1       2975
MANAGER                      30          1       2850
MANAGER         7839                     3       8275
MANAGER         7839         10          1       2450
MANAGER         7839         20          1       2975
MANAGER         7839         30          1       2850
SALESMAN                                 4       5600
SALESMAN                     30          4       5600
SALESMAN        7698                     4       5600
SALESMAN        7698         30          4       5600
PRESIDENT                                1       5000
PRESIDENT                                1       5000
PRESIDENT                    10          1       5000
PRESIDENT                    10          1       5000

52 rows selected.


Buried in amongst all those rows is all the data I need. There is a GROUPING function that can be used to see which group a row belongs to:

SQL> select job, mgr, deptno, count(*) numemps, sum(sal) totsal
  2       , grouping(job) gj, grouping(mgr) gm, grouping(deptno) gd
  3  from emp
  4  group by cube(job, mgr, deptno);

JOB              MGR     DEPTNO    NUMEMPS     TOTSAL         GJ         GM         GD
--------- ---------- ---------- ---------- ---------- ---------- ---------- ----------
                                         1       5000          1          0          1
                                        14      29025          1          1          1
                             10          1       5000          1          0          0
                             10          3       8750          1          1          0
                             20          5      10875          1          1          0
                             30          6       9400          1          1          0
                7566                     2       6000          1          0          1
                7566         20          2       6000          1          0          0
                7698                     5       6550          1          0          1
                7698         30          5       6550          1          0          0
                7782                     1       1300          1          0          1
                7782         10          1       1300          1          0          0
                7788                     1       1100          1          0          1
                7788         20          1       1100          1          0          0
                7839                     3       8275          1          0          1
                7839         10          1       2450          1          0          0
                7839         20          1       2975          1          0          0
                7839         30          1       2850          1          0          0
                7902                     1        800          1          0          1
                7902         20          1        800          1          0          0
CLERK                                    4       4150          0          1          1
CLERK                        10          1       1300          0          1          0
CLERK                        20          2       1900          0          1          0
CLERK                        30          1        950          0          1          0
CLERK           7698                     1        950          0          0          1
CLERK           7698         30          1        950          0          0          0
CLERK           7782                     1       1300          0          0          1
CLERK           7782         10          1       1300          0          0          0
CLERK           7788                     1       1100          0          0          1
CLERK           7788         20          1       1100          0          0          0
CLERK           7902                     1        800          0          0          1
CLERK           7902         20          1        800          0          0          0
ANALYST                                  2       6000          0          1          1
ANALYST                      20          2       6000          0          1          0
ANALYST         7566                     2       6000          0          0          1
ANALYST         7566         20          2       6000          0          0          0
MANAGER                                  3       8275          0          1          1
MANAGER                      10          1       2450          0          1          0
MANAGER                      20          1       2975          0          1          0
MANAGER                      30          1       2850          0          1          0
MANAGER         7839                     3       8275          0          0          1
MANAGER         7839         10          1       2450          0          0          0
MANAGER         7839         20          1       2975          0          0          0
MANAGER         7839         30          1       2850          0          0          0
SALESMAN                                 4       5600          0          1          1
SALESMAN                     30          4       5600          0          1          0
SALESMAN        7698                     4       5600          0          0          1
SALESMAN        7698         30          4       5600          0          0          0
PRESIDENT                                1       5000          0          0          1
PRESIDENT                                1       5000          0          1          1
PRESIDENT                    10          1       5000          0          0          0
PRESIDENT                    10          1       5000          0          1          0


Based on that, I can filter out the rows I'm not interested in and format the results like this:

SQL> select case when gj=0 and gm=1 and gd=1 then 'Job'
  2              when gj=1 and gm=0 and gd=1 then 'Manager'
  3              when gj=1 and gm=1 and gd=0 then 'Dept'
  4         end as keytype
  5       , case when gj=0 and gm=1 and gd=1 then job
  6              when gj=1 and gm=0 and gd=1 then to_char(mgr)
  7              when gj=1 and gm=1 and gd=0 then to_char(deptno)
  8         end as keyval
  9       , numemps, totsal
 10  from
 11  (
 12    select job, mgr, deptno, count(*) numemps, sum(sal) totsal
 13         , grouping(job) gj, grouping(mgr) gm, grouping(deptno) gd
 14    from emp
 15    group by cube(job, mgr, deptno)
 16  )
 17  where (gj=0 and gm=1 and gd=1)
 18     or (gj=1 and gm=0 and gd=1)
 19     or (gj=1 and gm=1 and gd=0);

KEYTYPE KEYVAL                                        NUMEMPS     TOTSAL
------- ---------------------------------------- ---------- ----------
Manager 7566                                              2       6000
        7698                                              5       6550
        7782                                              1       1300
        7788                                              1       1100
        7839                                              3       8275
        7902                                              1        800
                                                          1       5000
*******                                           ---------- ----------
sum                                                      14      29025
Job     ANALYST                                           2       6000
        CLERK                                             4       4150
        MANAGER                                           3       8275
        PRESIDENT                                         1       5000
        SALESMAN                                          4       5600
*******                                           ---------- ----------
sum                                                      14      29025
Dept    10                                                3       8750
        20                                                5      10875
        30                                                6       9400
*******                                           ---------- ----------
sum                                                      14      29025


... which gives me the breakdown I want in a single query, and provides the performance boost I was hoping for!
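
A footnote on that last listing: the "*******" and "sum" subtotal lines come from SQL*Plus BREAK and COMPUTE settings rather than from the query itself. Settings along these lines (a sketch, not necessarily the exact ones behind the output above) produce that layout:

SQL> break on keytype
SQL> compute sum of numemps totsal on keytype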

Wednesday, April 20, 2005

Business Logic APIs

Database application development generally has two major components:

1) The user interface
2) The database

These have to talk to each other somehow, typically through an "Application Programming Interface" or API. What should such an API look like logically?

Let's take as an example a screen that allows the user to transfer funds from one account to another. It looks something like this:

Source Account: ..........
Target Account: ..........
Amount: ..........

[OK] [Cancel]

The database has the following tables:

CREATE TABLE accounts
( account_id INTEGER PRIMARY KEY
, account_name VARCHAR2(30)
, current_balance NUMBER CHECK (current_balance >= 0)
);

CREATE TABLE transactions
( transaction_id INTEGER PRIMARY KEY
, transaction_date DATE NOT NULL
, source_account_id INTEGER NOT NULL REFERENCES accounts
, target_account_id INTEGER NOT NULL REFERENCES accounts
, amount NUMBER CHECK (amount >= 0)
);

What needs to be done:

Once the user has entered the two account numbers and the amount, we must:

  1. Update the source account and decrease its balance by the amount (subject to the check constraint that says the balance cannot become negative)

  2. Update the target account and increase its balance by the amount

  3. Create a transactions record for the transfer

How should we build the code?

The wrong approach

This is the approach that I so often see used:

1) Create a "table API" for table ACCOUNTS with a procedure:
PROCEDURE update_account_balance
( p_account_id IN INTEGER
, p_amount IN NUMBER
);

2) Create a "table API" for table TRANSACTIONS with a procedure:
PROCEDURE create_transaction
( p_source_account_id IN INTEGER
, p_target_account_id IN INTEGER
, p_amount IN NUMBER
);

3) In the application, invoke the following code when OK is pressed:
  • accounts_api.update_account_balance (:source_account_id, :amount * -1);
  • accounts_api.update_account_balance (:target_account_id, :amount);
  • transactions_api.create_transaction (:source_account_id, :target_account_id, :amount);

It works, but it is poor design. The client application is made responsible for carrying out the transaction properly: for example, if there are insufficient funds in the source account then the transfer must be aborted.
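
To make that concrete, the code behind the OK button has to grow into something like this (sketched here as a PL/SQL block standing in for whatever the client-side language actually is):

BEGIN
  SAVEPOINT before_transfer;
  -- Each step is a separate API call, so the client owns the transaction logic
  accounts_api.update_account_balance (:source_account_id, :amount * -1);
  accounts_api.update_account_balance (:target_account_id, :amount);
  transactions_api.create_transaction (:source_account_id, :target_account_id, :amount);
EXCEPTION
  WHEN OTHERS THEN
    -- e.g. the CHECK constraint rejecting a negative source balance:
    -- undo any steps that did succeed, then report the failure to the user
    ROLLBACK TO before_transfer;
    RAISE;
END;

Every screen that performs a transfer has to repeat that error handling, and has to get it right.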

What is wrong with that?

  • Suppose there are other variants of the transfer process, implemented in different client screens. The business logic must be duplicated between the various screens. This leads to increased maintenance, increased testing, increased chance of something going wrong.

  • Suppose this application is a Windows client/server app, but now we want to add an alternative web-based interface. Yet again, we will be duplicating the business logic, with the attendant costs and risks.

The right approach

What is required is a Business Logic API, in which each complete transaction is represented by a single procedure like this:
PROCEDURE transfer_funds
( p_source_account_id IN INTEGER
, p_target_account_id IN INTEGER
, p_amount IN NUMBER
);

The body of that procedure performs all 3 steps of the process, and handles any errors. It either succeeds in all 3 steps, or leaves the database unchanged.
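
For illustration, here is a minimal sketch of what the body might look like. Two assumptions not shown elsewhere in this post: the procedure lives in the accounts_api package body, and a sequence called transactions_seq exists to generate transaction IDs. The SAVEPOINT/ROLLBACK pattern is one way to get the all-or-nothing behaviour without disturbing the caller's wider transaction:

PROCEDURE transfer_funds
( p_source_account_id IN INTEGER
, p_target_account_id IN INTEGER
, p_amount            IN NUMBER
)
IS
BEGIN
  SAVEPOINT before_transfer;

  -- 1. Decrease the source balance; the CHECK constraint on
  --    current_balance rejects the update if funds are insufficient
  UPDATE accounts
  SET    current_balance = current_balance - p_amount
  WHERE  account_id = p_source_account_id;

  -- 2. Increase the target balance
  UPDATE accounts
  SET    current_balance = current_balance + p_amount
  WHERE  account_id = p_target_account_id;

  -- 3. Record the transfer (transactions_seq is the assumed sequence)
  INSERT INTO transactions
    ( transaction_id, transaction_date
    , source_account_id, target_account_id, amount )
  VALUES
    ( transactions_seq.NEXTVAL, SYSDATE
    , p_source_account_id, p_target_account_id, p_amount );

  -- A production version would also confirm that each UPDATE touched
  -- exactly one row, i.e. that both accounts actually exist.
EXCEPTION
  WHEN OTHERS THEN
    -- Undo whichever steps succeeded, then re-raise so the caller
    -- knows the transfer failed; the database is left unchanged.
    ROLLBACK TO before_transfer;
    RAISE;
END transfer_funds;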

In the application, the code behind the OK button is:

accounts_api.transfer_funds (:source_account_id, :target_account_id, :amount);

It is not possible for the application to corrupt the database, because it doesn’t contain any business logic – it is just an interface to the business logic that is in the API.

This has many advantages:
  • The business logic resides in exactly one place (the API), so it only needs to be maintained once and tested once

  • The application is simpler – less maintenance, less testing

  • The application is lower risk

  • Adding an alternative interface is easier and low-risk, because we will only be creating a user interface, not new (duplicated) business logic

Conclusion

Have I built a straw man here? Isn’t the above obvious? You would think so, but as I already said, I actually see code developed the wrong way all the time. I’m not making it up! In my experience it is usually because the developer (who may be fairly inexperienced in software design principles) is left to his or her own devices to build the application and the API at the same time, without any direction from a senior designer. Since that developer is building all the code, and only one client module, why would he or she care whether the logic is in the client application or in the API?

It is only later, when changes are made to the rules, or a new interface is wanted, that the whole sorry mess becomes apparent.