Thursday, October 21, 2004

OTLT and EAV: the two big design mistakes all beginners make

Many people (myself included) start off as programmers, and only later start to work with databases. After a while, the developer notices two things about databases:

  1. Tables have a fixed set of columns; adding a column to a table involves changing any application code that accesses it.
  2. There are lots of “lookup” tables that typically have a code and a description.

Being a natural problem solver, and knowing that code re-use is a “good thing”, the developer thinks he can do better. Quite often there is no one more experienced around to advise against it, and so the developer implements his ideas. This is so common that both ideas have names: the One True Lookup Table (OTLT) and the Entity-Attribute-Value (EAV) model.

One True Lookup Table (OTLT)

The idea: instead of having lots of “lookup” tables like these:

create table order_status (status_code varchar2(10), status_desc varchar2(40) );

create table country (country_code varchar2(3), country_name varchar2(30) );

create table priority (priority_no number(1), priority_desc varchar2(40) );

… why not just have ONE lookup table like this?:

create table lookup (lookup_type varchar2(10), lookup_code varchar2(20), lookup_desc varchar2(100) );
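For illustration (hypothetical sample data), the rows for all three of the original tables now sit side by side in the one table:

insert into lookup values ('STATUS', 'OPEN', 'Order has been placed');
insert into lookup values ('COUNTRY', 'AUS', 'Australia');
insert into lookup values ('PRIORITY', '1', 'Top priority');

(Note that the priority number now has to be stored as a string.)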

Great! Now we only need one “maintain lookup” screen instead of 3 (or 33 or whatever).

The trouble is, the developer doesn’t consider the disadvantages of this. Most importantly, we can no longer use foreign key constraints to protect the integrity of the data – unless one of the following conditions is true:

  • lookup_code is unique within table lookup
  • every child table uses 2 columns referencing (lookup_type, lookup_code) for its foreign keys (see the sketch below)

In most cases, neither of the above applies, and responsibility for data integrity resides solely in the application code. Show me such a database, and I’ll show you some data where the code does not match anything in the lookup table.
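For illustration, the second of those options would look something like this – a sketch, assuming we give the lookup table a composite primary key, and using a hypothetical orders child table:

alter table lookup add constraint lookup_pk primary key (lookup_type, lookup_code);

create table orders (order_id integer primary key,
  status_type varchar2(10) not null, status_code varchar2(20),
  constraint orders_status_fk foreign key (status_type, status_code)
    references lookup (lookup_type, lookup_code) );

Every child table now carries an extra “type” column whose only job is to point the foreign key at the right slice of the lookup table.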

Entity-Attribute-Value (EAV)

The idea: instead of “hard-coding” columns into the table like this:

create table emp (empno integer primary key, ename varchar2(20), sal number, job varchar2(10));

insert into emp (empno, ename, sal, job) values (1234, 'ANDREWS', 1000, 'CLERK');

… why not have total flexibility like this:

create table emp (empno integer primary key );

create table emp_value (empno integer references emp, code varchar2(20), value varchar2(100));

insert into emp (empno) values (1234);

insert into emp_value (empno, code, value) values (1234, 'NAME', 'ANDREWS');

insert into emp_value (empno, code, value) values (1234, 'SAL', '1000');

insert into emp_value (empno, code, value) values (1234, 'JOB', 'CLERK');

Great! Now we are free to invent new “attributes” at any time, without having to alter the table or the application!

However, consider a simple query: “show the names of all employees who are clerks and earn less than 2000”.

With the standard emp table:

select ename from emp where job = 'CLERK' and sal < 2000;

With the EAV tables:

select ev1.value as ename
from emp_value ev1, emp_value ev2, emp_value ev3
where ev1.code = 'NAME'
and ev1.empno = ev2.empno
and ev2.code = 'JOB'
and ev2.value = 'CLERK'
and ev1.empno = ev3.empno
and ev3.code = 'SAL'
and to_number(ev3.value) < 2000;

Not only is that much harder to follow, it is likely to be much slower to process too. And this is just about the simplest query imaginable!

Conclusion

OTLT and EAV are “generic” approaches that are seductive to programmers, but are actually a bad idea for most databases. Resist the temptation!


22 comments:

lewisc said...

Tony,

I agree about EAV. I've seen it happen and have had to argue it down a few times.

The OTLT can work though. If you add an alternate unique key to the lookup table, you can enforce uniqueness on lookup_type+lookup_code and on alt_key. Use the alt_key in your foreign key constraints.

I'm not saying everyone should rush out and implement that but it does work.
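A minimal sketch of what I mean, assuming Oracle syntax and a hypothetical orders table:

create table lookup (alt_key integer primary key,
  lookup_type varchar2(10), lookup_code varchar2(20), lookup_desc varchar2(100),
  constraint lookup_uk unique (lookup_type, lookup_code) );

create table orders (order_id integer primary key,
  status_id integer references lookup (alt_key) );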

Lewis
http://blogs.ittoolbox.com/oracle/guide/

Nelson Bliek said...

I agree with Tony – these are not the way to create tables. I am less of a programmer and more of a database person, mostly self-taught, and I have made these errors in the past. Even if you can make the OTLT work, it will create problems as the application evolves.

The concept here is that of normalization. My experience is that you violate first normal form at your own risk. There are reasons to do so, particularly for performance, but there will be an increased need for careful coding and for auditing functions to ensure database consistency.

Anonymous said...

In response to Lewis:

Yes, the alt_key approach to OTLT does allow DRI; however, it is essentially meaningless. The FK constraint to alt_key defines no true constraint other than "constrain this column to be *any value* in my OTLT". To constrain a column to a *particular type* of value stored in the OTLT – i.e. postcode, part type, whatever – which is what the FK constraint *should* be enforcing, you will need to rely on application code. May as well not have the FK at all in that case...
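To illustrate, using Lewis's hypothetical alt_key tables above, this insert sails straight through the foreign key even though 'AUS' is a country, not an order status:

insert into orders (order_id, status_id)
select 1, alt_key from lookup
where lookup_type = 'COUNTRY' and lookup_code = 'AUS';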

Cheers,

Paul

Toon Koppelaars said...

Clean data models above all else – totally agree. The OTLT idea does not fit in there.
And if you want to avoid having to build 33 code-table maintenance screens... just build a union-all view on top of the 33 code tables, add some instead-of triggers, and off you go with just one maintenance screen.
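A sketch of the idea, showing just two of the code tables from the post and only the insert trigger:

create or replace view all_lookups as
select 'ORDER_STATUS' as lookup_type, status_code as code, status_desc as description from order_status
union all
select 'COUNTRY', country_code, country_name from country;

create or replace trigger all_lookups_ins
instead of insert on all_lookups
for each row
begin
  if :new.lookup_type = 'ORDER_STATUS' then
    insert into order_status values (:new.code, :new.description);
  elsif :new.lookup_type = 'COUNTRY' then
    insert into country values (:new.code, :new.description);
  end if;
end;
/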

Scott Swank said...

The OTLT has more problems than merely that of hiding type -- though this alone ought to be sufficient to dismiss the idea outright. We will set that enormous problem aside for the moment. Let's say that our code is perfect and we never make mistakes with our OTLT. Marvelous.

Now consider performance. How many states are there in the U.S.? Hmm, there are 520,000 rows in the OTLT, so I reckon nested loops are in order. How many customer types are there? Well, there are 520,000 rows in the OTLT, so nested loops it is again.

Well, the clever coder decides, this can be fixed with histograms on the type column. OK, now we know that there are only 8 order types in our OTLT. But by bad luck they reside in 8 different blocks, so now we have to read 8K or 16K to get each row. And yes, buffered, logical reads still count against us.

So now the cleverer DBA decides that we need either an index-organized table or a partitioned table. Partition elimination is faster than b-tree reads -- 520,000 rows gives us an index depth of 3, perhaps 4, and we aren't too fond of that many extra LIOs per OTLT read.

So now we have a partitioned table with histograms. We can even drop the histograms, since each partition is implemented as a separate physical segment. But what if we want to add audit columns (update_user & update_date)? Now we have to add 2 columns to each of the 520,000 rows, or to none. And if we need a hierarchical relationship -- yup, that's another column. We've lost all design flexibility. Every decision is an all-or-nothing decision.
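For reference, the end state might be declared something like this (Oracle list partitioning, hypothetical partition names):

create table lookup (lookup_type varchar2(10), lookup_code varchar2(20), lookup_desc varchar2(100))
partition by list (lookup_type)
( partition p_state values ('STATE'),
  partition p_cust_type values ('CUST_TYPE'),
  partition p_the_rest values (default) );

All that machinery, just to win back the physical separation that individual lookup tables give you for free.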

annerose said...

These comments have been invaluable to me as is this whole site. I thank you for your comment.

Rajesh said...

I would generally agree with the recommendations for smaller projects. However, for very large enterprise-wide projects with thousands of tables, individual lookup tables pose a huge maintenance issue. Think Oracle Applications (E-Business Suite). It uses the OTLT quite successfully and, as far as I know, there have been no major cases of orphaned entries, because the user enters values either through a UI or via an application-provided script that validates not only the lookup code against the OTLT but also a host of other business rules which cannot be expressed in the database schema.

On the EAV side, again there are a few tables in Oracle Applications that have to follow this approach, primarily for end-user customizations or for integration purposes, e.g. RDF-style integration. But in general I would agree that, outside of packaged enterprise applications, EAV has limited value.

John Flack said...

I'm ashamed to say that I did a presentation on the OTLT, and have several of these structures in my applications. The main reason to do it is to have just one application to maintain all the lookups. I think that Toon Koppelaars has the right solution – no OTL Table; instead have an OTL View with INSTEAD OF triggers to update the "real" lookup tables, then write your maintenance module against the view.

Horse and Rider said...

I agree on the smaller project modules; even if you set out on a large one, building through smaller ones makes bugs and errors easier to find and remove in any application.

Mitchel said...

I started programming and the first job I did was databases for a small company, but I moved up and now I program the software for Argos DVD players. I quite enjoy it, but I never thought I would be doing this.

Marc Kurtz said...

Can someone please discuss a *good* solution to this issue? All we've talked about so far is how these two solutions aren't worthwhile.

What about storing the data using the EAV approach, but creating a lookup table that flattens all the fields for quick searching?

Tony Andrews said...

Marc, what is the issue that requires a "good" EAV solution? My original contention was that generally there is no such issue, it is a figment of the developer's imagination.

granadaCoder said...

OTLT

I created a response here:
http://sholliday.spaces.live.com/blog/cns!A68482B9628A842A!569.entry

One solution is to use a CHECK constraint.

My solution is SQL Server specific, but the idea is there.

But I have a common lookup table and a way to make sure only correct lookup values can be applied to an entity column
(i.e. you cannot put in a Country for an OrderStatus or similar).
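The same idea translated into Oracle syntax might look something like this (assuming the lookup table has a unique key on lookup_type, lookup_code):

create table orders (order_id integer primary key,
  status_type varchar2(10) default 'ORDER_STATUS' not null
    check (status_type = 'ORDER_STATUS'),
  status_code varchar2(20),
  constraint orders_status_fk foreign key (status_type, status_code)
    references lookup (lookup_type, lookup_code) );

The check constraint pins the type column, so the composite foreign key can only ever match lookup rows of the right type.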


Uwe Küchler said...

The trouble is that these generic approaches must appear in some books or chapters on DB design – I've seen data models like this implemented too often to believe it is all down to programmers' temptation.

Even a very well-known (and expensive) IT consulting company implemented EAV in one of my projects - now that it is referenced in thousands of lines of code, we cannot get rid of it, although scalability of queries against that part of the data model is a big issue. Just imagine the last example, but with outer joins instead, as you don't know if every attribute in your query is available for every entity you want to examine... *sigh*.

Mostly this happens when the specifications for a project aren't clear enough. After all, when you have specific requirements, who needs an unspecific, generic and inherently slow data model?

BTW, does anyone here have any experience of migrating away from EAV in large-scale applications?
Best regards, Uwe.

artsrc said...

I think both of these patterns attack important problems. EAV would save my company tens of thousands of dollars in one application alone.

Since our databases are not relational, there is a performance cost to EAV. As with most performance comments, whether this is significant depends on the details of the system in question.

artsrc said...

Scott,

If you have a lookup with 500,000 rows, *give it another table*. That is not an in-scope requirement for a lookup table.

If you have a lookup which needs different columns (audit, say), *give it its own table*. On the other hand, if all lookups need auditing, the OTLT wins even more.

You have not lost any flexibility, just do what makes sense.

KBR said...

Hi,

Your comments re lookup tables are spot on – why would you do this? In fact, why would a software developer want to do this? Don't they use separate enums, which serve a similar purpose?

As for EAV, I have to disagree with the OP. Those business models – and there are many – which necessarily need to store an unknown set of columnar data have only three approaches in the relational database environment:

- Keep adding columns to a table every time you import data that has a column you haven't accounted for - not very practical, in fact unworkable.

- Implement a very wide table at the outset. This is usually done with a large number of NVARCHAR(MAX) columns; all values are effectively stored as strings. This table is then supported by a number of other tables which keep track of what data is in which column, possibly on a per-row basis. In this case the table is very sparsely populated and indexing is impractical at best. And what if we outgrow our column count (I was sure 500 columns would be enough!)?

- Store the data in an EAV table. Again we need to support the EAV table with a number of transformation tables, and again indexing on the values is impossible; however, we can index on the EA part of the table, and we have the advantage that our table data is densely populated. Querying is certainly more cumbersome, although a two-stage process of extracting the data required and pivoting it to tabular form for querying, whilst not ideal, is readily achievable.

In short, if you know what your data looks like and it rarely changes, DO NOT use EAV; if you haven't a clue what your data looks like, then EAV is a sensible option.
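For illustration, the pivot step might look like this against the emp_value table from the post:

select empno,
  max(case when code = 'NAME' then value end) as ename,
  max(case when code = 'JOB' then value end) as job,
  max(case when code = 'SAL' then to_number(value) end) as sal
from emp_value
group by empno;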

Nanomid said...

For crying out loud, use spatial for EAV.

ebikko said...

I agree with KBR.
Two domains that need to use EAV would be

1) Medical clinical applications (there are potentially more symptoms of diseases than can be supported by the maximum number of columns per table in any current database implementation; not every patient has the same symptoms, leading to very sparsely populated tables; and no one can predict what new symptoms might turn up in the future).

2) Enterprise Document Management Systems (EDMS), where document types and their attributes are not known when you ship the product, as each customer will define these for their own organization.

dmt10 said...

Oh man. I just escaped from a job with an EAV-based CMS. I didn't know there was a name for it. It was horrendously slow. Ha! And it wasn't just an EAV: the EAV documents were joined together in a tree, with child documents...

The simplest possible page actually had 4 levels of nesting, so to get the data to display the simplest possible page (not counting the stuff which provided the page layout etc.) you were looking at 4 three-way joins, each joined through a linking table.

Since the SQL couldn't do recursive joins, I think it had to hit the database 4 times.

Basically, what you are doing with EAV is implementing a schema-less/key-value database in SQL.

If you need to change your data frequently, you should just use a schema-less database like CouchDB or MongoDB.

Rodger said...

I wrote about this in a series on Database Design Mistakes To Avoid. I called it MUCK.

http://rodgersnotes.wordpress.com/2010/09/21/muck-massively-unified-code-key-generic-three-table-data-model/

To me, a good db design should eliminate code, not cause it:

http://rodgersnotes.wordpress.com/2010/09/14/database-design-mistakes-to-avoid/

djs said...

As a developer, not a database guy, I'm all for the OTLT model, but I only ever implement it with a unique alt_key as suggested by lewisc.

Let's say I didn't use OTLT and I created a table called Gender with columns gender_id, gender_code and gender_description. I don't see how this stops somebody inserting into the table like so:

insert into gender values (1, 'animal', 'cat');

With an OTLT table I can build just one GUI maintenance screen to handle the lot, as opposed to 10, 100, 1000... separate GUI maintenance screens, one for each category of lookup. No thanks – I'd rather take the performance hit and implement any data constraints some other way, if necessary.

Also, I think 500,000 rows in an OTLT table would be a very rare case.