Many people (myself included) start of as programmers, and only later start to work with databases. After a while, the developer notices two things about databases:
- Tables have a fixed set of columns; adding a column to a table involves changing any application code that accesses it.
- There are lots of “lookup” tables that typically have a code and a description.
Being a natural problem solver, and knowing that code re-use is a “good thing”, the developer thinks he can do better. Quite often, there is no one more experienced around to advise against, and so the developer implements his ideas; this is so common that both ideas have a name: the One True Lookup Table (OTLT) and the Entity-Attribute-Value (EAV) model.
One True Lookup Table (OTLT)
The idea: instead of having lots of “lookup” tables like these:
create table order_status (status_code varchar2(10), status_desc varchar2(40) );
create table country (country_code varchar2(3), country_name varchar2(30) );
create table priority (priority_no number(1), priority_desc varchar2(40) );
…
… why not just have ONE lookup table like this?:
create table lookup (lookup_type varchar2(10), lookup_code varchar2(20), lookup_desc varchar2(100) );
Great! Now we only need one “maintain lookup” screen instead of 3 (or 33 or whatever).
The trouble is, the developer doesn’t consider the disadvantages of this. Most importantly, we can no longer use foreign key constraints to protect the integrity of the data – unless one of the following conditions is true:
- lookup_code is unique within table lookup
- every child table uses 2 columns referencing (lookup_type,lookup_code) for its foreign keys
In most cases, neither of the above applies, and responsibility for data integrity resides solely in the application code. Show me such a database, and I’ll show you some data where the code does not match anything in the lookup table.
Entity-Attribute-Value (EAV)
The idea: instead of “hard-coding” columns into the table like this:
create table emp (empno integer primary key, ename varchar2(20), sal number, job varchar2(10));
insert into emp (empno, ename, sal, job) values (1234,’ANDREWS’,1000,’CLERK’);
… why not have total flexibility like this:
create table emp (empno integer primary key );
create table emp_value (empno references emp, code varchar2(20), value varchar2(100));
insert into emp (empno) values (1234);
insert into emp_values (‘NAME’,’ANDREWS’);
insert into emp_values (‘SAL’,’1000’);
insert into emp_values (‘JOB’,’CLERK’);
Great! Now we are free to invent new “attributes” at any time, without having to alter the table or the application!
However, consider a simple query: “show the names of all employees who are clerks and earn less than 2000”.
With the standard emp table:
select ename from emp where job=’CLERK’ and sal < 2000;
With the EAV tables:
select ev1.name
from emp_values ev1, emp_values ev2 emp_values ev3
where ev1.code = ‘NAME’
and ev1.empno = ev2.empno
and ev2.code = ‘JOB’
and ev2.value = ‘CLERK’
and ev1.empno = ev3.empno
and ev3.code = ‘SAL’
and TO_NUMBER(ev3.value) < 2000;
Not only is that much harder to follow, it is likely to be much slower to process too. And this is about the most simple of queries!
Conclusion
OTLT and EAV are “generic” approaches that are seductive to programmers, but are actually a bad idea for most databases. Resist the temptation!
 
 
23 comments:
Tony,
I agree with the EAV. I've seen it happen and have had to argue it down a few time.
The OTLT can work though. If you add an alternate unqiue key to the look up table, you can enforce uniqueness on lookup_type+lookup_code and on alt_key. Use the alt_key in your foreign key constraints.
I'm not saying everyone should rush out and implement that but it does work.
Lewis
http://blogs.ittoolbox.com/oracle/guide/
I agree with Tony- these are not the way to create tables. I am less of a programmer and more of a database person, mostly self-taught, and have made these errors in the past. Even if you can make the OTLT work it will create problems as the application evolves.
The concept here is that of normalization. My experience is you violate the first normal form at your own risk. There are reasons to do so, particularly performance issues, but there will be increased need for careful coding and auditing functions to ensure database consistency.
In response to Lewis:
Yes the alt_key approach to OTLT does allow DRI, however it is essentially meaningless. The FK constraint to alt_key does not define any true constraint other than "constrain this element to be *any value* in my OTLT". To constrain a field to a *partiicular type* of value stored in the OTLT - i.e. postcode, part type, whatever - which is what the FK constraint *should* be enforcing, you will need to rely on application code. May as well not have the FK at all in that case....
Cheers,
Paul
Clean datamodels above all other, totally agree. The OTLT idea does not fit in there.
And if you want to avoid having to build 33 Code-table maintain screens... Just build a Union-All view on top of the 33 code tables, add some instead-of triggers, and off you go with just one maintain-screen.
These comments have been invaluable to me as is this whole site. I thank you for your comment.
I would generally agree with the recommendations for smaller projects. However, for very large enterprise wide projects with thousands of tables, individual lookup tables pose a huge maintenance issue. Think Oracle Applications (E Business Suite). It uses the OTLT quite successfully and as far as I know, there have been no major cases of orphaned entries because the user either enters the value using a UI or an application provided script that ensures not only the lookup ID to the OTLT but also a host of other business rules, which cannot be coded using database schema.
On the EAV, again there are a few tables in Oracle Applications that have to follow this approach primarily for end user customizations or for integration purposes e.g. RDF style integration. but in general, I would agree that outside of packaged enterprise applications, EAV has limited value.
I'm ashamed to say that I did a presentation on the OTLT, and have several of these structures in my applications. The main reason to do this is to make one application to maintain the OTLT. I think that Toon Koppelaars has the right solution - no OTL Table, instead have OTL View with INSTEAD OF triggers to update the "real" lookup tables. Then write your maintenance module against the view.
I agree on the smaller project modules, even if you set out on a lrage one by building through smaller ones bugs and errors are easier to find and remove in any application.
I started programming and the first job I did was databases for a small company but I moved up and now I programme the software for argos dvd players and I quite enjoy it but I never thought I would be doinmg this.
Can someone please discuss a *good* solution to this issue? All we've talked about so far is how these two solutions aren't worthwhile.
What about storing the data using the EAV approach, but creating a lookup table that flattens all the fields for quick searching?
Marc, what is the issue that requires a "good" EAV solution? My original contention was that generally there is no such issue, it is a figment of the developer's imagination.
OTLT
I created a response here:
http://sholliday.spaces.live.com/blog/cns!A68482B9628A842A!569.entry
One solution is to use a CHECK constraint.
My solution is Sql Server specific, but the idea is there.
But I have a common lookup table and a way to make sure only correct lookup values can be applied to an Entity-Column.
(Aka, you cannot put in a Country for a OrderStatus or similar).
..
Trouble is, that these generic approaches must be in some books or chapters on DB design. I've just seen data models like this implemented too often to believe it was just due to a programmer's temptation.
Even a very well-known (and expensive) IT consulting company implemented EAV in one of my projects - now that it is referenced in thousands of lines of code, we cannot get rid of it, although scalability of queries against that part of the data model is a big issue. Just imagine the last example, but with outer joins instead, as you don't know if every attribute in your query is available for every entity you want to examine... *sigh*.
Mostly this happens, when the specifications for a project aren't clear enough, because: When you have specific requirements, who needs an inspecific, generic and inherently slow data model?
BTW, does anyone here have any experiences in migrating away from EAV in large-scale applications?
Best regards, Uwe.
I think both of these two patterns attack important problems. EAV would save my company tens of thousands of dollars in one application alone.
Since our databases are not relational, there is a performance cost to EAV. As with most performance comments whether this is significant depends on the details of system in question.
Scott,
If you have an with lookup with 500,000 rows *give it another table*. This is not an in scope requirement for a lookup table.
If you have a lookup which needs different columns (audit) *give it its own table*. On the other hand if all lookups need audit the OTLT wins even more.
You have not lost any flexibility, just do what makes sense.
Hi,
Your comments re Look-up tables are spot-on, why would you do this - in fact why would a software developer want to do this(?), don't they use separate enums which serve a similar purpose?
As for EAV, I have to disagree with the OP. Those business models, and there are many, which necessarily need to store an unknown set of columnar data, have only three approaches in the relation database environment:
- Keep adding columns to a table every time you import data that has a column you haven't accounted for - not very practical, in fact unworkable.
- Implement a very wide table at the outset. This is usually done with a large number of NVARCHAR(MAX) columns, all values are effectively stored as strings. This table is then supported by a number of other tables which keep track of what data is in what column, possibly on a per row basis. In this case the table is very sparsely populated and indexing is impractical at best. What if we break our column width (I was sure 500 columns would be enough!)
- Store data in an EAV table. Again we need to support the EAV table with a number of transformation tables, again indexing (on the values) is impossible; however, we can index on the EA part of the table and we have the advantage that we our table data is densely populated. Querying is more cumbersome certainly, although utilising a two-stage process of extracting the data required and pivoting to tabular data for querying, whilst not ideal, is readily achievable.
In short, If you know what you data looks like, and it rarely changes, DO NOT use EAV; if you haven't a clue what your data looks like then EAV is a sensible option.
For crying out loud, use spatial for EAV.
I agree with KBR.
Two domains that need to use EAV would be
1) Medical Clinical applications (there are potentially more symptoms to diseases that can be supported by the max limit of columns per table in any current database implementation. And not every patient has these same symptoms - leading to very sparsely populated tables. And no one can predict what other new symptoms might turn up in the future).
2) Enterprise Document Management Systems (EDMS) where document types and their attributes are not known when you ship the product as each customers will define these for their own respective organization.
oh man. I just escaped from a job with a EAV based CMS. I didn't know there was a name for it. It was horrendiously slow. Ha! and it wasn't just and EAV, the EAV documents where joined together in a tree, with child documents...
The simplest possible page actually had 4 levels of nesting,
so to get the data to display the simplest possible page (not counting the stuff which provided the page layout etc) you where looking at a 4 three way joins, each joined to a linking table.
Since SQL can't do recursive joins, I think it had to hit the database 4 times.
Basicially, what you are doing with a EAV is implementing a no-schema/key-value database in SQL.
If you need to frequently change your data, you should just use a no schema database like couchdb, or mongodb.
I wrote about this in series on Database Design Mistakes To Avoid. I called it MUCK.
http://rodgersnotes.wordpress.com/2010/09/21/muck-massively-unified-code-key-generic-three-table-data-model/
To me, a good db design should eliminate code, not cause it:
http://rodgersnotes.wordpress.com/2010/09/14/database-design-mistakes-to-avoid/
As a developer, not a database guy, I'm all for the OTLT model, but I only ever implement it with a unique alt_key as suggested by lewisc.
Lets say I didn't use OTLT and I created a table called Gender with columns gender_id, gender_code and gender_description. I don't see how this stops somebody inserting into the table like so:
insert into gender(1, 'animal', 'cat')
With an OTLT table I can build just one GUI maintenance screen to handle the lot, as opposed to 10, 100, 1000... separate GUI maintenance screens for each category of lookup? No thanks, I'd rather take the performance hit and implement any data constraints some other way, if necessary.
Also, I think 500,000 rows in an OTLT table would be a very rare case.
While this may have been true 10 years ago. Things have changed drastically. If you only have a couple of "lookup" tables then you can use separate tables, but if its something as simple as countries or states, then its probably better to hard code the array into your application than to use precious network activity for something that rarely changes. Its as silly as having a "gender" table with only 4 entries.
I use a few tables that are semi-OTLT and semi-EAV, but I use them quite a bit differently that what was discussed here.
First table. Like above is a codes table (semi-OTLT). This codes table has ~2,000,000 codes.
create table codes (type varchar(10), codeset varchar(10), code varchar(10), description_display varchar(100), description_full varchar(200), jsondata text, primary key(type,codeset,code));
i.e.
insert into codes('race', 'std', 'W', 'Wht', 'White', '{}');
insert into codes('race', 'cdc', '2106-3', 'White', 'White', '{}');
insert into codes('marital', 'std', 'S', 'Single', 'Single', '{}');
I do this for 3 reasons, #1, over 1000 different code types, and even if we had 1000 tables to look up, the data structure doesn't work with foreign keys anyway. #2 is mostly static, not able to be updated by general users of the system, usually updated through patches. #3 99% of the time, this table isn't used for actual lookup. I use a codes searching table, this way I can control how codes are searched. For example, I add the word 'Caucasian', so that the user can still find 'White', if by another term. Although these particular tables are small, I have some code sets with thousands of codes, so a select * from specific_codes_table, would not be feasible anyway to show a list. In which case, the user HAS to search for them.
Its also, insanely fast, AND allows me update while people are using the system with extremely minimal impact (single code at a time, rather than an entire code set being down). I do sacrifice space and writing speed, for read speed. This table + search tables take up ~1G of disk space. Searches that return under 5000 results are returned in the the 10s of milliseconds. And for a complete rebuild of all the code sets takes hours, but some codes never change, some change once a year, some once a week, and a couple, nightly. The flexibility, user searching speed, and ability to update with users using the application, is a valid trade off to disk space and write speed, in my case.
Now for the semi-EAV table.
create table person(id int(8), section varchar(10), jsondata text, primary key(id, section));
create table person_search(id int(8), section varchar(10), name varchar(30), value varchar(100), primary key (id, section, name), index(section,name,value));
The person_search table is my semi-eav. I use this table for adding "search" information for the system to use, which makes it very dynamic and very fast. i.e.
Person id:1, section:details, jsondata:{"name":{"last":"Smith","first":"Bob"},"address":{"street":["123 Fake st"],"city":"Bedrock","postal":"12345"}}
Search id:1, section:details, name:bob_externalid, value:12345
select id from person_search where section='details' and name='bob_externalid' and value='12345'; or
select jsondata from person_search inner join person using(id) where section='details' and name='bob_externalid' and value='12345';
allows me to very quickly find the person who has an external id of 12345 in the system. With a B+Tree database, this is roughly 0(1). It also provides usability across all sql-based db platforms. Indexing of arbitrary JSON is AFAIK is only available in Oracle, PostgreSQL, and NoSQL systems. The only exception is that search data can be duplicated, even if you don't want it to. For example there could be 2 bob_externalid's on separate people even if you don't want it. But, this can be nice as well, I also have a ticket tracking system which will a search variable of owner/userid, which provides quick access to open tickets that are "owned" by that user (name="owner", value="(userid)")
Personally, I don't like Foreign Keys, #1, should deleting of a state (even accidentally), delete all users/people who have that state on them? For me, no, very bad #2, should it delete the code or update the description on the person? Again no, because that code/description reflects the actions the user performed at the time. i.e. If say, you delete OR update the state TX for a description 'No Mans Land', should update all users to display that as their state? In some cases, but a lot of cases I would say no, because if the user selected "Texas" and then goes back into the system and see's 'No Mans Land'. They are going to say "I never chose that!?". One of the ways I get around this, is that I store the codeset, code, and description, which only gets updated when the user actually updates the record.
{ "code": "TX", "codeset": "std", "description": "Texas" }
IMO, when a code is unused or changed, should only affect new stuff being inputted in the system, and should not affect old data because that changes the meaning of the past data. This is still integrity, and changing the code to reflect the codes data set (via Foreign Keys and Joins) is more what I consider Normalization. Just because the code doesn't exist or the description of the code has change, doesn't mean the integrity of the data has failed. Note: the only exception to this rule should be anything that is required by a Law that you must follow.
Post a Comment