Thursday, October 21, 2004

OTLT and EAV: the two big design mistakes all beginners make

Many people (myself included) start of as programmers, and only later start to work with databases. After a while, the developer notices two things about databases:

  1. Tables have a fixed set of columns; adding a column to a table involves changing any application code that accesses it.
  2. There are lots of “lookup” tables that typically have a code and a description.

Being a natural problem solver, and knowing that code re-use is a “good thing”, the developer thinks he can do better. Quite often, there is no one more experienced around to advise against, and so the developer implements his ideas; this is so common that both ideas have a name: the One True Lookup Table (OTLT) and the Entity-Attribute-Value (EAV) model.

One True Lookup Table (OTLT)

The idea: instead of having lots of “lookup” tables like these:

create table order_status (status_code varchar2(10), status_desc varchar2(40) );

create table country (country_code varchar2(3), country_name varchar2(30) );

create table priority (priority_no number(1), priority_desc varchar2(40) );

… why not just have ONE lookup table like this?:

create table lookup (lookup_type varchar2(10), lookup_code varchar2(20), lookup_desc varchar2(100) );

Great! Now we only need one “maintain lookup” screen instead of 3 (or 33 or whatever).

The trouble is, the developer doesn’t consider the disadvantages of this. Most importantly, we can no longer use foreign key constraints to protect the integrity of the data – unless one of the following conditions is true:

  • lookup_code is unique within table lookup
  • every child table uses 2 columns referencing (lookup_type,lookup_code) for its foreign keys

In most cases, neither of the above applies, and responsibility for data integrity resides solely in the application code. Show me such a database, and I’ll show you some data where the code does not match anything in the lookup table.

Entity-Attribute-Value (EAV)

The idea: instead of “hard-coding” columns into the table like this:

create table emp (empno integer primary key, ename varchar2(20), sal number, job varchar2(10));

insert into emp (empno, ename, sal, job) values (1234,’ANDREWS’,1000,’CLERK’);

… why not have total flexibility like this:

create table emp (empno integer primary key );

create table emp_value (empno references emp, code varchar2(20), value varchar2(100));

insert into emp (empno) values (1234);

insert into emp_values (‘NAME’,’ANDREWS’);

insert into emp_values (‘SAL’,’1000’);

insert into emp_values (‘JOB’,’CLERK’);

Great! Now we are free to invent new “attributes” at any time, without having to alter the table or the application!

However, consider a simple query: “show the names of all employees who are clerks and earn less than 2000”.

With the standard emp table:

select ename from emp where job=’CLERK’ and sal < 2000;

With the EAV tables:

select ev1.name

from emp_values ev1, emp_values ev2 emp_values ev3

where ev1.code = ‘NAME’

and ev1.empno = ev2.empno

and ev2.code = ‘JOB’

and ev2.value = ‘CLERK’

and ev1.empno = ev3.empno

and ev3.code = ‘SAL’

and TO_NUMBER(ev3.value) < 2000;

Not only is that much harder to follow, it is likely to be much slower to process too. And this is about the most simple of queries!

Conclusion

OTLT and EAV are “generic” approaches that are seductive to programmers, but are actually a bad idea for most databases. Resist the temptation!


Friday, October 15, 2004

"Pivot" Queries

One of the frequently asked questions on SQL forums is how to present data “horizontally” rather than “vertically”, known as a “pivot query”. For example, instead of this:

DEPTNO      JOB         HEADCOUNT
----------  ---------  ----------
10          CLERK               1
10          MANAGER             1
20          ANALYST             2
20          CLERK               2
20          MANAGER             1
30          CLERK               1
30          MANAGER             1
30          SALESMAN            4

… people would like to know how to produce this:

DEPTNO         ANALYST       CLERK     MANAGER    SALESMAN
----------  ----------  ----------  ----------  ----------
10                               1           1
20                   2           2           1
30                               1           1           4


In Oracle, the normal way is to write a query using DECODE like this:

select deptno
, count(DECODE(job,'ANALYST', 1)) as "ANALYST"
, count(DECODE(job,'CLERK', 1)) as "CLERK"
, count(DECODE(job,'MANAGER', 1)) as "MANAGER"
, count(DECODE(job,'SALESMAN', 1)) as "SALESMAN"
from emp
group by deptno
order by deptno;

The DECODE ensures that the jobs go in the right columns.
Now this type of query is quite easy to write (when you know how), but is also quite laborious; also, if the list of jobs changes then the query needs to be changed too.
To make producing such queries a “piece of cake” I have built a package called pivot that can be used to generate the SQL like this:
begin
pivot.print_pivot_query
( 'deptno'
, 'job'
, 'emp'
, '1'
, agg_types => 'count'
);
end;
/
select deptno
, count(DECODE(job,'ANALYST', 1)) as "ANALYST"
, count(DECODE(job,'CLERK', 1)) as "CLERK"
, count(DECODE(job,'MANAGER', 1)) as "MANAGER"
, count(DECODE(job,'SALESMAN', 1)) as "SALESMAN"
from emp
group by deptno
order by deptno
PL/SQL procedure successfully completed.


The package code follows. Note that it requires the “parse” package that I posted earlier.

CREATE OR REPLACE PACKAGE pivot IS
  TYPE ref_cursor IS REF CURSOR;
  PROCEDURE print_pivot_query
  ( group_cols   IN VARCHAR2              -- Comma-separated list of column(s), including table alias if applicable,
                                          -- to be used to group the records
  , pivot_col    IN VARCHAR2              -- The name of the column to be pivoted, including table alias if applicable
  , tables       IN VARCHAR2              -- Comma-separated list of table(s), with table alias if applicable
  , value_cols   IN VARCHAR2              -- Comma-separated list of column(s), including table alias if applicable
                                          -- for which aggregates are to be shown at each intersection
  , pivot_values IN VARCHAR2 DEFAULT NULL -- Comma-separated list of values of pivot column to be used;
                                          -- if omitted, all values are used (determined dynamically)
  , agg_types    IN VARCHAR2 DEFAULT NULL -- Comma-separated list of aggregate types, corresponding to value_cols;
                                          -- if omitted, SUM is the default
  , where_clause IN VARCHAR2 DEFAULT NULL -- where clause of query
  , order_by     IN VARCHAR2 DEFAULT NULL -- order by clause; if omitted, output is ordered by group_cols
  );
  FUNCTION pivot_query
  ( group_cols IN VARCHAR2
  , pivot_col IN VARCHAR2
  , tables IN VARCHAR2
  , value_cols IN VARCHAR2
  , pivot_values IN VARCHAR2 DEFAULT NULL
  , agg_types IN VARCHAR2 DEFAULT NULL
  , where_clause IN VARCHAR2 DEFAULT NULL
  , order_by IN VARCHAR2 DEFAULT NULL
  ) RETURN VARCHAR2;
  FUNCTION pivot_cursor
  ( group_cols IN VARCHAR2
  , pivot_col IN VARCHAR2
  , tables IN VARCHAR2
  , value_cols IN VARCHAR2
  , pivot_values IN VARCHAR2 DEFAULT NULL
  , agg_types IN VARCHAR2 DEFAULT NULL
  , where_clause IN VARCHAR2 DEFAULT NULL
  , order_by IN VARCHAR2 DEFAULT NULL
  ) RETURN ref_cursor;
END;
/
CREATE OR REPLACE PACKAGE BODY pivot IS
  g_mode varchar2(10);
  g_sql varchar2(32767);
  PROCEDURE pr
  ( p_text in varchar2
  )
  IS
    v_text VARCHAR2(32767) := p_text;
  BEGIN
    if g_mode = 'PRINT' then
      WHILE LENGTH(v_text) > 255 LOOP
        DBMS_OUTPUT.PUT_LINE( SUBSTR(v_text,1,255) );
        v_text := SUBSTR(v_text,256);
      END LOOP;
      DBMS_OUTPUT.PUT_LINE( v_text );
    else
      g_sql := g_sql || ' ' || p_text;
    end if;
  END pr;
  /*
  || Generates the SQL statement for a pivot query based on parameters
  || Example:
  || create_pivot_query
  || ( group_cols => 'd.dname'
  || , pivot_col => 'e.job'
  || , tables => 'emp e, dept d'
  || , value_cols => 'e.sal,e.age'
  || , agg_types => 'min,max'
  || , where_clause => 'e.deptno = d.deptno'
  || );
  || Generates a query like:
  || select d.dname
  || , min(DECODE(e.job,'ANALYST', e.sal, 0 )) as "min_ANALYST_esal"
  || , max(DECODE(e.job,'ANALYST', e.age, 0 )) as "max_ANALYST_eage"
  || , min(DECODE(e.job,'CLERK', e.sal, 0 )) as "min_CLERK_esal"
  || , max(DECODE(e.job,'CLERK', e.age, 0 )) as "max_CLERK_eage"
  || , min(DECODE(e.job,'MANAGER', e.sal, 0 )) as "min_MANAGER_esal"
  || , max(DECODE(e.job,'MANAGER', e.age, 0 )) as "max_MANAGER_eage"
  || , min(DECODE(e.job,'PRESIDENT', e.sal, 0 )) as "min_PRESIDENT_esal"
  || , max(DECODE(e.job,'PRESIDENT', e.age, 0 )) as "max_PRESIDENT_eage"
  || , min(DECODE(e.job,'SALESMAN', e.sal, 0 )) as "min_SALESMAN_esal"
  || , max(DECODE(e.job,'SALESMAN', e.age, 0 )) as "max_SALESMAN_eage"
  || from emp e, dept d
  || where e.deptno = d.deptno
  || group by d.dname
  || order by d.dname
  ||
  || i.e. the parameters are used like this:
  || select
  || , (DECODE(,, , 0 ))
  || from
  || where
  || group by
  || order by
  ||
  */
  PROCEDURE define_pivot_query
  ( group_cols IN VARCHAR2
  , pivot_col IN VARCHAR2
  , tables IN VARCHAR2
  , value_cols IN VARCHAR2
  , pivot_values IN VARCHAR2 DEFAULT NULL
  , agg_types IN VARCHAR2 DEFAULT NULL
  , where_clause IN VARCHAR2 DEFAULT NULL
  , order_by IN VARCHAR2 DEFAULT NULL
  )
  IS
    type ref_cursor is ref cursor;
    rc ref_cursor;
    pv_tab parse.varchar2_table;
    val_tab parse.varchar2_table;
    agg_tab parse.varchar2_table;
    num_pvs integer := 0;
    num_vals integer := 0;
    num_aggs integer := 0;
    alias varchar2(100);
  BEGIN
    g_sql := NULL;
    -- Determine pivot values: use list if given, otherwise construct query to find them
    if pivot_values is not null then
      parse.delimstring_to_table( pivot_values, pv_tab, num_pvs );
    else
      open rc for 'select distinct ' || pivot_col || ' from ' || tables || ' where ' || nvl(where_clause,'1=1');
      loop
        num_pvs := num_pvs+1;
        fetch rc into pv_tab(num_pvs);
        exit when rc%notfound;
      end loop;
      close rc;
      num_pvs := num_pvs-1;
    end if;
    parse.delimstring_to_table( value_cols, val_tab, num_vals );
    -- Determine aggregate functions (default is SUM)
    if agg_types is not null then
      parse.delimstring_to_table( agg_types, agg_tab, num_aggs );
    end if;
    if num_aggs <> num_vals then
      for i in num_aggs+1..num_vals loop
        agg_tab(i) := 'sum';
      end loop;
    end if;
    pr('select '||group_cols);
    for pv in 1..num_pvs loop
      pv_tab(pv) := trim(pv_tab(pv));
      for val in 1..num_vals loop
        val_tab(val) := trim(val_tab(val));
        if num_vals = 1 then
          alias := substr(pv_tab(pv),1,30);
        else
          alias := substr(agg_tab(val) || '_' || TRANSLATE(pv_tab(pv),'x. -()<>','x') || '_' || TRANSLATE(val_tab(val),'x. -()<>','x'),1,30);
        end if;
        pr( ', '||agg_tab(val)||'(DECODE(' || pivot_col || ',''' || pv_tab(pv) || ''', ' || val_tab(val) || '))'
            || ' as "' || alias || '"' );
      end loop;
    end loop;
    pr('from ' || tables );
    if where_clause is not null then
      pr('where ' || where_clause);
    end if;
    pr('group by ' || group_cols);
    pr('order by ' || NVL(order_by,group_cols));
  END define_pivot_query;
  PROCEDURE print_pivot_query
  ( group_cols IN VARCHAR2
  , pivot_col IN VARCHAR2
  , tables IN VARCHAR2
  , value_cols IN VARCHAR2
  , pivot_values IN VARCHAR2 DEFAULT NULL
  , agg_types IN VARCHAR2 DEFAULT NULL
  , where_clause IN VARCHAR2 DEFAULT NULL
  , order_by IN VARCHAR2 DEFAULT NULL
  )
  IS
  BEGIN
    g_mode := 'PRINT';
    define_pivot_query
    ( group_cols
    , pivot_col
    , tables
    , value_cols
    , pivot_values
    , agg_types
    , where_clause
    , order_by
    );
  END;
  FUNCTION pivot_query
  ( group_cols IN VARCHAR2
  , pivot_col IN VARCHAR2
  , tables IN VARCHAR2
  , value_cols IN VARCHAR2
  , pivot_values IN VARCHAR2 DEFAULT NULL
  , agg_types IN VARCHAR2 DEFAULT NULL
  , where_clause IN VARCHAR2 DEFAULT NULL
  , order_by IN VARCHAR2 DEFAULT NULL
  ) RETURN VARCHAR2
  IS
  BEGIN
    g_mode := 'TEXT';
    define_pivot_query
    ( group_cols
    , pivot_col
    , tables
    , value_cols
    , pivot_values
    , agg_types
    , where_clause
    , order_by
    );
    RETURN g_sql;
  END;
  FUNCTION pivot_cursor
  ( group_cols IN VARCHAR2
  , pivot_col IN VARCHAR2
  , tables IN VARCHAR2
  , value_cols IN VARCHAR2
  , pivot_values IN VARCHAR2 DEFAULT NULL
  , agg_types IN VARCHAR2 DEFAULT NULL
  , where_clause IN VARCHAR2 DEFAULT NULL
  , order_by IN VARCHAR2 DEFAULT NULL
  ) RETURN ref_cursor
  IS
    rc ref_cursor;
  BEGIN
    g_mode := 'CURSOR';
    define_pivot_query
    ( group_cols
    , pivot_col
    , tables
    , value_cols
    , pivot_values
    , agg_types
    , where_clause
    , order_by
    );
    OPEN rc FOR g_sql;
    RETURN rc;
  END;
END;

/

Parsing delimited fields in a character string

Often we have a need to parse a character string to get data from fields within it. Of course, SQL Loader handles this nicely, but sometimes we may be getting the data via a different route such as from a table or via UTL_FILE.

The following package facilitates this:

CREATE OR REPLACE PACKAGE parse AS
  /*
  || Package of utility procedures for parsing delimited or fixed position strings into tables
  || of individual values, and vice versa.
  */
  TYPE varchar2_table IS TABLE OF VARCHAR2(32767) INDEX BY BINARY_INTEGER;
  PROCEDURE delimstring_to_table
    ( p_delimstring IN VARCHAR2
    , p_table OUT varchar2_table
    , p_nfields OUT INTEGER
    , p_delim IN VARCHAR2 DEFAULT ','
    );
  PROCEDURE table_to_delimstring
    ( p_table IN varchar2_table
    , p_delimstring OUT VARCHAR2
    , p_delim IN VARCHAR2 DEFAULT ','
    );
END parse;
/
CREATE OR REPLACE PACKAGE BODY parse AS
  PROCEDURE delimstring_to_table
    ( p_delimstring IN VARCHAR2
    , p_table OUT varchar2_table
    , p_nfields OUT INTEGER
    , p_delim IN VARCHAR2 DEFAULT ','
    )
  IS
    v_string VARCHAR2(32767) := p_delimstring;
    v_nfields PLS_INTEGER := 1;
    v_table varchar2_table;
    v_delimpos PLS_INTEGER := INSTR(p_delimstring, p_delim);
    v_delimlen PLS_INTEGER := LENGTH(p_delim);
  BEGIN
    WHILE v_delimpos > 0
    LOOP
      v_table(v_nfields) := SUBSTR(v_string,1,v_delimpos-1);
      v_string := SUBSTR(v_string,v_delimpos+v_delimlen);
      v_nfields := v_nfields+1;
      v_delimpos := INSTR(v_string, p_delim);
    END LOOP;
    v_table(v_nfields) := v_string;
    p_table := v_table;
    p_nfields := v_nfields;
  END delimstring_to_table;
  PROCEDURE table_to_delimstring
    ( p_table IN varchar2_table
    , p_delimstring OUT VARCHAR2
    , p_delim IN VARCHAR2 DEFAULT ','
    )
  IS
    v_nfields PLS_INTEGER := p_table.COUNT;
    v_string VARCHAR2(32767);
  BEGIN
    FOR i IN 1..v_nfields
    LOOP
      v_string := v_string || p_table(i);
      IF i != v_nfields THEN
        v_string := v_string || p_delim;
      END IF;
    END LOOP;
    p_delimstring := v_string;
  END table_to_delimstring;
END parse;
/

This is how you might use it with a standard comma-delimited string:

SQL> declare
  2    v_tab parse.varchar2_table;
  3    v_nfields integer;
  4    v_string varchar2(1000) := '1000,Smith,John,13-May-1970';
  5  begin
  6    parse.delimstring_to_table (v_string, v_tab, v_nfields);
  7    for i in 1..v_nfields loop
  8      dbms_output.put_line('Field('||i||') = '||v_tab(i));
  9    end loop;
 10  end;
11 /
Field(1) = 1000
Field(2) = Smith
Field(3) = John
Field(4) = 13-May-1970

PL/SQL procedure successfully completed.

Enforcing complex constraints in Oracle

Oracle supports various kinds of declarative integrity constraints:
  • Primary Key: Uniquely identifies a row in the table
  • Unique: Other columns that must be unique
  • Foreign Key: Column value must match value in another table
  • Check: Simple single-table, single-row data rules.
Examples of possible check constraints are:
“start_date <= end_date”

“check_ind in (‘Y’,’N’)”
“amount between 0 and 99999.99”

However, many more complex business rules cannot be implemented via the above constraints. For example:

  • Employee Project Assignment start date and end date must fall between Project start date and end date
  • An employee may not have 2 Employee Project Assignments that overlap date-wise
  • A Project cannot have more than 10 Employees Assignments
  • An Employee cannot book time to a Task on a Project to which he is not assigned
  • The manager of a Department must belong to the Department.
These are usually enforced (if at all) procedurally, by one of the following methods:
  • Code in the application screens
  • Code in stored procedures (APIs) called from the application screens
  • Database triggers
These all have disadvantages compared to the declarative approach of constraints:
  • Code in application screens can be bypassed, and also the same rule must often be implemented in many places (e.g. in both the Project screen and the Assignment screen).
  • Code in stored procedures must also often be implemented in many places (e.g. in both the Project package and the Assignment package). Also, to prevent bypassing of the rules, all updates must be done via the package, which limits the functionality available to the user (cannot write ad hoc updates)
  • Code in triggers must also often be implemented in many places (e.g. in both the Project triggers and the Assignment triggers). Also, triggers can often become complex due to work-arounds to avoid mutating tables issues etc.
  • Any procedural solution must explicitly lock records to prevent corruption in a multi-user environment – e.g. if a user is amending an Employee Assignment, then no other user may be allowed to amend Employee Assignments for the same employee, or to amend the Project dates.
This paper shows how such complex constraints can be implemented declaratively, such that the rules are defined once and then applied to all updates from whatever source.

This began with an on-line discussion I had with Tom Kyte about enforcing complex constraints here: http://asktom.oracle.com/pls/ask/f?p=4950:8:::::F4950_P8_DISPLAYID:21389386132607

Objective

Ideally, given a complex rule such as one of the examples above, we would like to be able to create a complex check constraint (or “assertion” in ANSI SQL terms) such as:

CONSTRAINT c ON TABLE project p
CHECK (NOT EXISTS (SELECT null FROM emp_proj_assign ep
WHERE ep.projno = p.projno
AND (ep.start_date <> p.end_date)));

Such a constraint (if it could be implemented) would raise an error in any of the following situations:
  • User attempts to insert an emp_proj_assign record with dates outside the Project dates
  • User attempts to update an emp_proj_assign record with dates outside the Project dates
  • User attempts to update a Project record with dates that do not encompass all the associated emp_proj_assign dates
However, we can’t achieve that via constraints alone.

Solution: use Materialized Views

Materialized Views (a.k.a. Snapshots) are actually tables that are maintained by Oracle, whose contents correspond to the result of a query – i.e. like a view, but “materialized” because the result is actually stored in a table.

These provide the mechanism we need to implement the complex constraint as follows:
  • Create a materialized view to select data that violates the desired constraint (e.g. assignments where the dates are outside the associated project dates). The MV must be defined with REFRESH COMPLETE ON COMMIT so that it is updated before the end of the transaction.
  • Create a check constraint on the materialized view that always evaluates to FALSE – e.g. CHECK (1=0)
That’s it. Whenever the underlying tables are updated, the materialized view is refreshed. If the update violates the rule, then a row will be inserted into the materialized view; but the check constraint on the MV disallows any inserts into it, and so the transaction fails.

Issues
  • Oracle 8i cannot support REFRESH ON COMMIT on materialized views of the complexity required for some rules. 9i can handle some but not all (does not allow self-joins in MVs).. 10G can handle self-joins, but does not seem to allow subqueries. So this approach cannot be used for all rules.
  • Efficiency: needs to be benchmarked. Is a FAST refresh preferable to a COMPLETE refresh? With COMPLETE we have a query that looks at all rows of the MV’s base tables at the end of every transaction that affects those tables.
  • Cannot create a REFRESH ON COMMIT materialized view with a HAVING clause. For such cases (see example 3 below), the materialized view cannot include the constraint violation (e.g. HAVING COUNT(*) > 10), and so the check constraint must do it (e.g. CHECK (cnt <= 10)). Note that in this case the materialized view will consume space in the database.
Worked Examples

Based on the following tables:

create table project
( projno int primary key
, start_date date not null
, end_date date not null
);
create table emp_proj_assign
( empno int not null
, projno int not null
, start_date date not null
, end_date date not null
, primary key (empno, start_date)
);

1) Rule: An employee cannot have overlapping project assignments.

This is implemented as follows:

create materialized view emp_proj_mv1
refresh complete on commit as
select 1 dummy
from emp_proj_assign ep1, emp_proj_assign ep2
where ep1.empno = ep2.empno
and ep1.start_date <= ep2.end_date
and ep1.end_date >= ep2.start_date;


alter table emp_proj_mv1
add constraint emp_proj_mv1_chk
check (1=0) deferrable;

2) An employee's project assignment dates must fall between the project start
and end dates

create materialized view emp_proj_mv2
refresh complete on commit as
select 1 dummy
from emp_proj_assign ep, project p
where ep.projno = p.projno
and (ep.start_date <> p.end_date);
alter table emp_proj_mv2
add constraint emp_proj_mv2_chk
check (1=0) deferrable;

3) A project may not have more than 10 assignments (in total, ever!):

create materialized view emp_proj_mv3
build immediate
refresh complete on commit as
select projno, count(*) cnt
from emp_proj_assign
group by projno;
alter table emp_proj_mv3
add constraint emp_proj_mv3_chk
check (cnt <= 10)
deferrable;

4) A project cannot have more than 10 employees assigned at the same time.

(I have not yet worked this one out!)