Thinking Relationally

Next Meeting: Sat, TBD
Meeting Directions

Be a Member
Join SCOUG

Navigation:

20 Most Recent Documents
Search Archives
Index by date, title, author, category.

Features:

Mr. Know-It-All
Ink
Download!

Supporting Warpstock Little Rock 2025

Supporting Warpstock Phoenix 2023

Supporting Warpstock Orlando 2022

SCOUG:

Home

Email Lists

SIGs (Internet, General Interest, Programming, Network, more..)

Online Chats

Pictures from Sept. 1999

The views expressed in articles on this site are those of their authors.

warptech
SCOUG was there!

SCOUG, Warp Expo West, and Warpfest are trademarks of the Southern California OS/2 User Group. OS/2, Workplace Shell, and IBM are registered trademarks of International Business Machines Corporation. All other trademarks remain the property of their respective owners.

The Southern California OS/2 User Group
USA

Thinking Relationally
Randell Flint, Sundial Systems and SCOUG Member

Abstract:

So you own a relational database and now want to do something with it. The first step is to come up with a problem you need to solve, like keeping a home inventory for example. But where do you go from there?

This session will give you an introduction to thinking about your problem "relationally" -- how to best capture and express your data as not just any database, but as a relational one. We'll cover both the theoretical and practical aspects (and the trade-offs between them) so that you can take better advantage of whatever relational database product you're using.

So you own a relational database product (like DB2, or Approach, or Sundial's own DBExpert) and now want to do something with it.

Of course, the first step is to come up with a problem you need to solve... but where do you go from there?

Typically, when you want to use a database to solve a problem, the first step of the process to think about the "things" that you want to keep track of. That's because, when you really get down to it, a database is nothing more than a collection of details about a bunch of individual things...

For this presentation, we're going to concentrate on using a database to keep track of a "home inventory" (such as you might want to have for insurance purposes) but the principles involved are essentially the same no matter what you want to use a database for.

So, what "things" do we want to keep track of?

In this particular example, most of the things are pretty concrete "real world" objects like TVs, VCRs, computers, furniture, china, jewelry, tools, etc. But "things" in a database aren't necessarily concrete -- they can be more conceptual objects like insurance policies or real estate ownership records or the like.

Regardless of the particular database, however, the key design idea is to figure out what information to keep track of about these things and how to organize that information so it can be used effectively.

But how do you go about doing that (especially if you haven't done it very much before)? Well there really is no one best way... and certainly there is no step-by-step recipe that works for every database in every situation. And the formal methods that have been defined by several experts in the field are just plain overkill when it comes to the needs of most database users.

Thus, our goal in this presentation is to simply give you a series of "guidelines" that will help you think about things "relationally" and that, when applied with care and understanding, will give you quality relational database designs.

So let's get started...

Guideline: "It's the types of things that are important, not the things themselves"

Somewhat surprisingly (to database newcomers), the actual information (the details about your particular VCR for instance) are some of the last things you worry about when coming up with a database.

What's far more important is figuring out the "kinds" or "types" of things you are going to keep information about -- a process sometimes called "abstraction". What you need to do is figure out the structure of the database -- that's the hard part. (Later, after you use that structure to define your database using some particular database management product, you then "populate" the database with the actual information -- that's the easy part, though it can be quite tedious if you have lots and lots of data.)

For many database design problems, this first step of the organization process is very much the same as you might go through if you just wanted to keep track of the same information on paper -- it's simply coming up with what "lists" of information you would want to have.

For our home inventory, if we were doing it on paper, we might just end up with one long list of everything we have. (Or we might use several different lists like "things in the garage", "things in the house", etc., but we'll return to that issue later.)

It's quite common, in fact, for you to only end up with one such "type" (or perhaps a few) as a result of this first stage of the design process.

Guideline: "The world is flat"

Just like when you try to put a home inventory "on paper" and that paper model of the world imposes certain constraints on what you can do, different database systems have rules about how you can "structure" your information.

When it comes to relational databases (which are, by far, the most common form of database in use today), the most important structural constraint is that the database consists of nothing but tables. There may be logical (not physical) "connections" between tables (as we will see later) but it's a flat, two dimensional world of nothing but rows and columns of information.

Thus, there is no concept in a relational database of hierarchical relationships... "trees" of information (such as the tree view of the directories on your hard drive) can't be represented directly.

Further, there is no concept in a relational database of "containment"... tables cannot contain other tables and all values in the table must be "simple" (whatever that means... we'll return to that later).

So, basically, all a table is... is a list of information about a bunch of similar things... where each row represents one such "thing" you want to keep track of.

(As far as the "structure" of the database is concerned, you really don't care very much about the individual rows of data. Each row represents, for instance, one item in our home inventory -- a VCR, a TV, an antique dining table, or whatever -- but those details are only important later after you've created and database and start use it by filling in the data.)

Thus, a table typically represents a "type" of thing and each row represents the specific information about one such particular thing.

And, thus, as part of your initial design process, you are going to end up with a table for each of the "types" of thing you need to keep track of for your particular application.

Guideline: "The world is unordered"

Unlike some other forms of databases, relational databases make no guarantees about the order of entries in a table.

Tables are unlike many lists we use in the real world in that the entries (rows) aren't in any particular order. In fact, the database system reserves the right to change the order at any time -- even from one time you look at a table to the next.

This may sound like potential chaos, but it isn't. Relational databases do allow you to specify that you what to see (or use) the rows in a particular order... but this only happens if you ask and it doesn't say anything about the order in which the rows of the table are actually stored away someplace on disk.

Thus, in our home inventory, we don't need to worry about whether we want our item list to be in alphabetical order... or in the order the items were acquired... or in some other order. In general, we can ask to see the data in any order at any time whenever the need for order arises.

But it also means that we can't depend on having one particular row after another (for instance)... or do things like continuing information from one row into the next. Each row must be able to "stand alone" as an independent entity.

Guideline: "The differences are in the details"

If each row of each table represents a "thing" you want to keep track of, what do the columns of the table represent? Well, if all the rows in the table represent different things, then something makes these different things "different". And, it's the columns that record those detailed differences.

For instance, things listed in our home inventory might have a description, a price, a color, a date they were acquired.

These are called attributes (or characteristics or properties) and are represented by the columns in the table; each row contains the values of those attributes for a particular item.

In contrast to the rows of a table, the columns are part of the structure of the database and are something you need to worry about "up front".

For one thing, relational databases typically require that all the values in a column must have the same "data type". In other words, they must all be numbers, or all be dates, or all be strings of text, or whatever. The specific types you can choose from will vary from one database product to another but virtually always include such basic types and perhaps many others.

Further, as a consequence of the "flat world" requirement, the values in a column must be "simple". They can't be lists, they can't be other tables -- they must be simple singular values consistent with the data type of the column. Later, we'll see what you need to do if you have a characteristic that isn't "simple" and how you need to change the structure of your database to accommodate that situation.

(Now don't confuse "simple" with "short." The data-type could be a text string of significant size -- or perhaps virtually unlimited length -- or it could be a bitmap or other binary data. The particular data-types that are available depend on the database product being used but the key concept is that the values have no "internal structure" as far as the database is concerned -- they are "simple" values.)

Also, in keeping with the "unordered" nature of tables, you can't make any assumptions about "which column comes first". While a database system will usually always display the columns of a table in the same order left to right, there is no guarantee of this and you can never reference a column by its position -- you can only refer to the characteristic of an item by that characteristic's name, not by a "column number".

Guideline: "Beware of the unknown"

Ideally, the things represented in a table all share the same characteristics... but just because there is a "slot" in the table for each property of each thing, that doesn't mean you need to keep track of that piece of information.

Most relational databases support the idea of a NULL value. A null value for a particular row and column means simply that the particular characteristic "doesn't apply" (or is "unknown") for the particular item represented by that row.

So, in a home inventory, for example, you might have a color column as noted above. But some items, like an oil painting, might not have an identifiable "color"... or, like a camera, might have a color but the color isn't anything you care about... or, perhaps like that old trunk stored in the attic, might have a color but you don't know (or don't remember) what the color is.

The important thing to know about NULL values is that they really mean that you don't know (or don't care) and that the database system doesn't know either. If it's a numeric field, a NULL value is NOT the same as having a zero (or some other number) in that field -- for instance, if you add NULL and 3 together, you don't get 3, you get NULL.

Or, if you ask the database system "what do I have that is red", the result will not include that trunk in the attic. And, importantly, if you ask it "what do I have that is NOT red", the result will still not include that trunk in the attic -- that's because the database cannot tell if the trunk is red or is not red. (Most database systems also have a way for you to ask something like "what do I have that does not a have a color" or "what do I have with NULL for the color" -- and then that trunk in the attic, with its NULL color value, would be included.)

The bottom line here is that NULL values are very useful but can cause unexpected results if you don't plan to use them wisely.

Guideline: "It's a database, not an encyclopedia"

Once you start thinking about the characteristics of the things in your database -- in other words, the columns for your tables -- it can be hard to know when and how to stop the process of coming up with more details.

And where you stop can greatly affect the design of your database.

Another way of saying this is that you should only try to keep track of the characteristics that are important and/or relevant to your particular application.

In other words, how you keep track of all the items in your house for a household inventory (for insurance or whatever) is probably different than how you might keep track of all the items you have stored in your garage (or a rented warehouse) for some other purpose (such as for you to sell on ebay).

Thus, in your household inventory database, it's probably not relevant to keep information about who in your family bought the item or what size box it was delivered in or what the weight of the item is.

Sometimes, however, it's not entirely clear if you should include a piece of information or not. For instance, should your home inventory include which credit card was used to purchase the item? Superficially, this would seem to be irrelevant. But if you have some credit cards that offer automatic extended warranties or insurance against stolen items, then the information is probably relevant to include.

Fortunately, you don't have to be 100% correct about such things "up front" and you generally are not "locked in" by the decisions you make in your initial structuring of the database.

Most relational databases let you add additional columns to a table... so adding more characteristics later isn't a big thing. (Other than adding all the data!) And just as a point of reference, it's usually the case that when you do add a column, the value of that column for all your existing rows is set to NULL (meaning "unknown" in this case).

But, as we will see later, some additions could change the way you need to structure your database. So it's best to at least consider all the possible characteristics you can think of "up front" to see if they will impact your design (even if you don't choose to include them).

Guideline: "Group like things together..."

Another way to think of a table is that it represents a "group" of "similar" things... with each row containing information about one of the things in the group.

They are "similar" in the sense that they have the same characteristics (as represented by the different columns).

As with most "groups" in the real world, the things in a table typically share some common characteristics but not necessarily all. We've seen that NULL values can be used to fill in the gaps when a characteristic doesn't apply.

So, the question is, what makes things "similar" enough that they should be in the same table... and the answer is that it's up to you.

But, for most purposes, it's better to keep things similar than it is to make them different. So, when you have a choice, assume things are similar... and enhance your table definition to accommodate this if you need to.

So, for instance, when it comes to your home inventory, you might originally come up with a list like:

things in the house
things in the garage
things in the yard

You could use a different table for each of these, but you're better off thinking of them as:

things we have

and adding a new characteristic called "location" and using that to keep track of whether they are in the house, the garage, or the yard.

One other advantage of this is that it's much easier to expand your database if you need to add a "new" category of things like:

things in our vacation house

since you don't need to create a new table. Instead, you just start using a new location value ("vacation house" in addition to "home", "garage", "yard") in the appropriate rows of the table. Similarly, if you decide to go in to more detail at some point in the future, you could update your general "home" entries with more detailed location values like "den", "living room", "guest bedroom", etc.

Guideline: "... and don't group unlike things together"

The analog of grouping like together is to keep unlike things apart.

There is, for instance, nothing to prevent you from putting all your data in one table... even if some of the items share nothing in common. For instance, if your database not only includes home inventory information, but also ball game schedules, you could have a "home inventory and ball game table" with a column that you use to distinguish between the two types of rows... but that's not a good idea.

If the purposes for which you will use the data are radically different or there are so few characteristics in common between the different uses (so that you will end up with lots and lots of NULL values in the table because they don't apply), then it's best to separate the "unlike" things into two (or more) groups and thus two or more tables.

"Grouping" is the first and foremost reason why you want to divide your database into multiple tables.

Another way to help check to see if you are following this guideline is to look at how you might name a table. If you cannot come up with a relatively simple (yet descriptive) name, then it may be that you are trying (intentionally or unintentionally) to group unlike things together in that table.

Guideline: "Be able to distinguish this one from that one"

If each row of a table represents a different item, then you generally need to be able to decide which row describes the characteristics of which item.

Another way to put this is that something in each row should distinguish that row from all other rows in the same table. And, similarly, one (or more) characteristics in each row should be usable to "identify" that particular row.

(Most, but not all, relational database systems don't let you have two entirely identical rows in the same table.)

As we will see later, its useful (though not required) to be able to identify a particular row by something "simple" (like a number) rather than by a text string (like a name) or by some complex combination of values in several columns.

If you are coming from a file-oriented background (or from the approaches used in some non-relational databases), you might think of using the "record number" or "row number" to do this.

That won't work in a relational database, however. Remember that there is no guarantee that rows will even be kept in any particular order and thus that there really is no concept of a "row number" (except perhaps on a transitory basis as you are looking at something).

If your items don't have such a simple, singular characteristic to identify them, it often pays to create an artificial characteristic of your items just for this purpose -- an inventory number or asset number in our home inventory database for instance. (We'll see an example of where this can be useful later.)

Guideline: "Don't duplicate information"

Having information in one characteristic that directly depends on information in another characteristic is a bad thing that can lead to all sorts of problems. Thus, it's very important to do some "dependency analysis" to make sure you aren't duplicating information.

For instance, one thing you might want to keep track of in your home inventory is where you purchased the item... perhaps so you can know how much "stuff" you are buying from particular stores.

If all you want to keep track of is the store name (for instance), then this makes sense as a simple characteristic of the individual rows.

But what if you want to keep track of more information -- like the store's customer service phone number and/or their address?

Each of these pieces of information is dependent on the other -- they are always the same for any given store.

You could make all these individual characteristics and repeat this information for each item that you bought from that store... but if you buy a lot of things from that same store, that's a lot of information to enter over and over. And if the store's phone number changes (for instance), you've got a lot of places to update it and you might make a mistake and forget some.

What makes more sense is to record this information once for each store -- a process sometimes called "factoring" common information. This would result in you having a list of stores in addition to your list of inventory items -- i.e., two tables rather than one.

In your inventory table you would only identify the store... and then use the other table to record the characteristics (name, address, phone number, etc.) of each store you buy things at.

If you buy multiple things at the same store, then the "store identifier" will appear in multiple rows of the inventory table... but there will be only one row for that store (and it's characteristics) in the store table. Thus the store characteristics will not be duplicated -- there will be less data to enter, less chance for error, and less data to take up space in your database.

This is also one of the reasons why being able to identify individual rows in a table (as discussed earlier) can be quite important. How you identify an individual store (in the store table) is how you reference that store in the inventory table.

(The identifier will, of course, be duplicated -- that can't be avoided. It is what ties the items in one table to the items in the other table. This is, however, an argument for using a "simple" identifier -- such as a number -- rather than something like the store name -- as the identifier. Most database management systems have special tools and techniques for helping you manage such identifiers.)

Guideline: "Don't store what you can compute"

Computers are very good at computing things... they do it much more reliably than people do. In fact they do it much more reliably than people even enter data into computers.

In terms of database design, that means you should let the computer compute things whenever possible.

Thus, you may have things that can be computed based on other things in your database, for example:

the price of a quantity of items as the single price times the quantity
or
the total value of a collection as the sum of all the individual values

If you store the original data and those computed results in your database, sooner or later you're going to change one, forget to change the other, and end up with inconsistent (and thus incorrect) data!

So it's generally best to store only the original data and then compute the resulting value whenever you need it (to print on a report, to display on the screen, or to use in some other computation).

(Essentially, this guideline is just another form of the one about not storing duplicated information -- information you can "compute" is a "computed duplicate" of other information you have stored in the database.)

Guideline: "Does this have more than one of that"

This is, absolutely, the 100% most important question you need to ask yourself about each table you might need to create and is a direct consequence of the "flat world" constraint of relational databases.

In essence, you need to analyze each characteristic of a table and decide if you might have more than one value of that characteristic for a given row.

For some things, it will be absolutely clear that you have only one such value... for instance, the serial number of an item in the home inventory. (Having no such value is OK as well -- you would have a NULL value for the serial number for instance.)

For others, it will be very clear that you have more than one value... while for still others it will be questionable at best. That last category is the hardest to deal with -- and we'll see why later...

Our home inventory doesn't have a clear cut example of something where we would always have multiple values of a characteristic (at least as far as the properties most people would want to keep track of), but let's create one (somewhat artificially) so we can see what the problem is.

Let's say that you want to keep track of the date when the item was serviced or repaired... not just the most recent such date, but the dates that the item has ever been serviced or repaired.

You can't do this with just one such characteristic (since you can't list more than one value in a "slot") unless you have multiple rows in the table for the same item -- each with a different date value.

In other words, say you have a VCR that's never been serviced. It's represented by one row in the table. Now you get it serviced and fill in the service date. But as is often the case with VCRs, something goes wrong and you get it serviced again. Where do you put the date? So far, you have no choice but to duplicate that VCR row in a new row and fill in the new date. And, if you get it serviced again, then there's still another row to add...

There are a whole bunch of reasons why you don't want to do this. The most obvious is that it violates the "don't duplicate information" guideline -- things like the serial number, the store name, etc., will all be duplicated in each one of the rows about that particular VCR.

It also violates the guiding idea that a row in the table corresponds to an item in the (real) world that you want to keep track of -- now you have two or more rows corresponding to that item. That also means that you have probably also violated the guideline about uniquely being able to identify each row -- certainly something like the serial number, or even an artificial "asset number", which would have worked great before isn't going to work now because you have that number in more than one row!

So, how do you solve the problem. (We'll ignore the obvious possibility of just throwing out the VCR and getting a new one.)

It you know you may have a large number of possible values for a characteristic, you really have only one reasonable way of solving the problem. You need to create another table and "link" each row of "this" table to zero or more rows of "that" table.

For our somewhat contrived home inventory example, this would mean creating another table of "service dates". And while you might want to expand the details later, the new table might initially have only two columns: one to contain the actual date and another to contain the "identifier" (inventory item number or whatever) of the corresponding item in the original home inventory table.

We haven't taken the time in this presentation to describe much about how you use multiple tables like this, but let's just say it's fairly easy with most database products to "join" the tables together and thus find out which service dates go with which inventory items (and have these listed on the screen or printed in a report or whatever). The common "identifier" fields (between the two tables) are used to match things up.

(Now if you are paying close attention, you are saying, hey, wait a minute, the service record table does not have a nice, clean, simple value that can be used to identify each record... Well, that's OK -- since the whole purpose of this table is to represent multiple values from records in the other table, the items in this table don't really need to be uniquely identified. They are, however, still unique when you take the combination of the inventory identifier and the service date together.)

As noted above, using multiple tables like this is the only practical solution when you have (or might have) a large number of multiple values for a particular attribute of a particular record. But, when the number of possible multiple values is small, say two or three, you need to weigh the trade-offs

Some designers would tell you that you should always still separate this information out into another table. But, to be frank, having multiple tables always increases the overhead in processing your database and also can make it more complicated to do some things that are relatively easy when only one table is involved.

So, for instance, if you knew you only wanted to record the last two service dates for an item (rather than the entire service history), you may well be better off just having two or three characteristics (say "most recent service date" and "earlier service date" for instance) to record that information in the original table.

When the issue of multiple values is questionable, then, likewise, you need to consider the trade-offs. The key thing you need to worry about "up front" in your design is whether you think you will ever need to separate the information out into that other table.

If you have a requirement that you keep track of multiple values (like service dates in this case) and you must keep track of them all, then it's probably best to make the decision to use another table. The reason for this is that, unlike just adding another column to a table, adding a new table and connecting up all the corresponding rows is not an easy thing to do once you have populated the database with information.

On the other hand, if you know you won't ever need more than a couple of the values for a possibly multi-valued attribute, it's reasonable to at least consider just having multiple attributes in the original table. This will make your database less complicated... but you must realize that you have traded off long term flexibility in the process.

And the complications associated with just that type of problem bring us to our last, very important guideline...

Guideline: "Guidelines are meant to be broken"

All guidelines, no matter what their source, are just that -- they are "guidelines" not rules.

It's 100% OK to "break" them... but they shouldn't be ignored. The key thing is to know when you are breaking them and take the time to make a conscious decision that that is what you want to do.