Data Normalization

Thanks to Brad Tuttle, Ph.D. at University of South Carolina for this material

Relationships are inherent in data. If a system is to produce meaningful outputs, it must record not only the data but also the relevant relationships. A traditional data file system maintains relationships by physically recording related data fields adjacent to each other. Programmers code application programs to recognize these relationships. With a Database Management System (DBMS), related data are not physically adjacent. Therefore, a DBMS must be told what the relationships are. Either way, using a tradition file system or using a DBMS, someone must determine what relationships exist in the data. Typically, the accountant is in the best position to perform this task.

MSC 525: When you are finished reading this document, do the assignment at the bottom of the page.

Click here to go directly to: Assignment for MSC 525

Four possible relationships:

Relationships dictate what can and what can't be done on the system.

Computers need a way to "unlock" the data for the user. For instance, if you wanted to find the balance in a customer's account, the computer would ask you for the Customer's Account Number. The computer would use the customer's account number to locate the customer balance data. The customer account number, in this case, is call a Primary Key.

The relationship between the primary key and the data is always anything-to-one. That is, one-to-one or many-to-one but never one-to-many or many-to-many. Otherwise, the computer could not locate the exact piece of data you are looking for because it would point to "many" data items rather than one.

Relationships are defined to the computer in something called a schema. The word schema means "map," "outline," or "plan." In other words, the schema shows the plan for the data and their relationships.

When defining the schema, it is necessary to normalize the data or you will not be able to retrieve data from the computer after you put it in. Accountants generally find that to be frustrating. Normalization is a structured process for organizing data into tables in such a way that it preserves the relationships among the data. There are several levels of normalization, the first three of which are most commonly used and which prevent most problems from occuring.

First Normal Form

The objective of first normal form is to ensure that every cell in the table means only one thing and that it contains only one value. This follows a basic concept of the relational approach that a table has a fixed number of columns, but that the number of rows is limited only by the physical storage of the computer system. Consider the following data in "unnormal" form:

PNUM

TYPE

MATERIAL

DIAM,WGT

CNUM

CNAME

743 NUT ALUMINUM 10,5 E200 ERGO
763 WASHER STEEL 5,5 M300,A100,Z400 MELLO,ATLAS,ZEUS

This data is in unnormal form because it violates the rules of first normal form. Look at the column titled, "DIAM,WGT." This column contains two meanings: diameter and weight. Because they are in the same column, it is impossible for the computer to locate a part with a diameter of 10 because the computer considers the table cell to be a single value (i.e., 10,5 not just 5). Also, look at the CNAME column. Even though the column has just one meaning (company name of the supplier for the part), one cell has three values. The following table, corrects these problems and is in first normal form:

PNUM

TYPE

MATERIAL

DIAM

WGT

CNUM

CNAME

743 NUT ALUMINUM 10 5 E200 ERGO
763 WASHER STEEL 5 5 M300 MELLO
763 WASHER STEEL 5 5 A100 ATLAS
763 WASHER STEEL 5 5 Z400 ZEUS

In the above table, every column means only one thing and every cell only has one value--that's first normal form.


Second Normal Form

The objective of second normal form is to eliminate redundant data. Significant problems associated with inconsistent data arise when data is repeated inside a computer. For instance, if PNUM 763 were to change its material from steel to aluminum, three records require updating rather than just one. If not all records are updated to reflect the new material, inconsistencies would result.

Second normal form is accomplished by eliminating partial dependencies. Because primary keys can be composed of multiple fields, it is sometimes possible for data to depend on only a part of the primary key for its access. This condition violates second normal form. Data fields that depend on only a portion of the primary key must be put in separate tables.

Below is the same table from above but with some added records and fields. Note that this table is still in first normal form. Also, notice the PNUM 420 does not have a supplier associated with it. This does not violate normal form because there is no requirement that every record contain data in every field. Rather, normal form requires that every field have no more than one value per record.

PNUM

TYPE

MATERIAL

DIAM

WGT

CNUM

CNAME

TERMS
CODE

TERMS
DESC

743 NUT ALUMINUM 10 5 E200 ERGO 1 NET 30
763 WASHER STEEL 5 5 M300 MELLO 2 1% NET 30
763 WASHER STEEL 5 5 A100 ATLAS 1 NET 30
763 WASHER STEEL 5 5 Z400 ZEUS 3 C.O.D.
419 BOLT STEEL 35 20 M300 MELLO 2 1% NET 30
420 BOLT ALUMINUM 10 10        
416 NUT STEEL 5 10 M300 MELLO 2 1% NET 30

To put a table in second normal form requires (1) that you identify the primary key, and (2) that you determine the relationships between the parts of the key and the data.

Step 1--Identifying the primary key

Remember that the relationship between the primary key and the data must be one-to-one or many-to-one. Let's choose a primary key for the above table by considering PNUM as the primary key. The relationship between PNUM and TYPE is many-to-one, therefore, PNUM would work for the TYPE field. However, the relationship between PNUM and CNAME is many-to-many. This is obvious in the data as PNUM 763 is associated with MELLO, ATLAS, and ZEUS. Hence, PNUM cannot be relied upon to uniquely identify an individual record in the table. Therefore, PNUM cannot be the primary key for all the data in the row.

In fact, there is no one field that has a one-to-one or many-to-one relationship with every other field in the table. This suggests that we must combine two or more fields together to form a primary key. (Important--this produces 1 primary key NOT 2.)

The appropriate primary key for the above table is PNUM+CNUM. Notice that combining these two fields into a single number produces a value that uniquely identifies each row in the table. For instance, only one row has PNUM+CNUM = "763A100."

Step 2--Determine the relationship between the parts of the key and the data

When data depend on less than the full primary key for their access, a partial dependency exists. Partial dependencies result in redundant data. For example, TYPE, MATERIAL, DIAM, and WGT could be accessed by knowing only PNUM. This can be seen from the relationship between PNUM and each data field which is many-to-one. Also notice that TYPE, MATERIAL, DIAM, and WGT are the same (redundant) for all three rows of PNUM=763.

Data fields that depend on only a portion of the primary key for their access must be put in separate tables. Another way of thinking about this is that, when a single field of a multiple-field primary key can act as a primary key by itself to individual data fields, those data fields must be separated into their own tables. This produces two tables, the new table we might label PARTS:

PARTS

PNUM

TYPE

MATERIAL

DIAM

WGT

743 NUT ALUMINUM 10 5
763 WASHER STEEL 5 5
419 BOLT STEEL 35 20
420 BOLT ALUMINUM 10 10
416 NUT STEEL 5 10

And the original table with the data removed:

Original Table

PNUM

CNUM

CNAME

TERMS
CODE

TERMS
DESC

743 E200 ERGO 1 NET 30
763 M300 MELLO 2 1% NET 30
763 A100 ATLAS 1 NET 30
763 Z400 ZEUS 3 C.O.D.
419 M300 MELLO 2 1% NET 30
420        
416 M300 MELLO 2 1% NET 30

Please notice that PNUM is included in the both the original table and in the new PARTS table! Unless PNUM is in both tables, there is no way to cross reference the data in the two tables. PNUM becomes the primary key for the new PARTS table. In this case, the primary key consists of a single field. This table is in second normal form because it has no partial dependencies. (In fact, you cannot have a partial dependency if the primary key consists of only a single field!)

Also notice that some redundant data has been eliminated. For instance, STEEL appeared 5 times in the old table, but now appears only 3 times in the new tables. WASHER appeared 3 times in the old table but now only appears once in the new tables.

Notice, however, that partial dependencies still exist in the original table between the CNUM and CNAME, TERMS-CODE, and TERMS-DESC. This requires us to separate the CNAME, TERMS-CODE, and TERMS-DESC to their own table which we can call SUPPLIERS:

SUPPLIERS

CNUM

CNAME

TERMS
CODE

TERMS
DESC

E200 ERGO 1 NET 30
M300 MELLO 2 1% NET 30
A100 ATLAS 1 NET 30
Z400 ZEUS 3 C.O.D.

And the original table with the data removed:

Original Table

PNUM

CNUM

743 E200
763 M300
763 A100
763 Z400
419 M300
420  
416 M300

CNUM is the primary key of the new SUPPLIERS table. It may seem strange, but there is no data left in the original table, just the primary key. This is not uncommon to have a table without data. It serves as a link between data in other tables (i.e., the PARTS table and the SUPPLIERS table).

All three tables are now in second normal form. All three tables have their own primary key. All three tables have only one primary key. All data depend entirely on the primary key in that the relationship between the primary key and the data is one-to-one or many-to-one in all tables.

Third Normal Form

Once data are in second normal form, further data redundancies not involving the primary key can sometimes be eliminated. Once this is done, the data will be in "third normal form." For example, the relationship between the TERMS-CODE field and the TERMS-DESC field is one-to-one. This means that whenever TERMS-CODE is 1, the TERMS-DESC is always NET 30. Whenever a one-to-one or one-to-many relationship is found among the data fields, transitive dependencies exist and the data must be removed to a separate table.

TERMS

TERMS
CODE

TERMS
DESC

1 NET 30
2 1% NET 30
3 C.O.D.

All structured tables have a primary key. Thus, our new table requires a primary key. There are two options, TERMS-CODE or TERMS-DESC. A faithful rule is to never choose a name or description as a primary key. Never use a field that is updated by transactions as a primary key such as the CUSTOMER-BALANCE field. You should have some intuitive feel for pure data as opposed to fields that make good keys. In our new TERMS table, TERMS-CODE makes a good primary key, TERMS-DESC does not. One way to see this is to think about what a primary key does--unlock the data out of the computer. No one would use the description to find the code. Rather our experience is that we use codes to look up descriptions.

We must leave the primary key field (TERMS-CODE) in the SUPPLIERS table so that we can cross reference the suppliers with their terms. We eliminate the data field (TERMS-DESC) from the SUPPLIERS table to get rid of the redundant data:

REVISED SUPPLIERS TABLE

CNUM

CNAME

TERMS
CODE

E200 ERGO 1
M300 MELLO 2
A100 ATLAS 1
Z400 ZEUS 3

Both the SUPPLIERS table and the TERMS table are now in third normal form. Looking at the original table, it is also in third normal form. This is because there is no data (so it cannot have redundant data). Also, the PARTS table is in third normal form as there are no data fields in which a one-to-one or many-to-one relationship exists. All tables are now in third normal form!

Notice that redundant data was eliminated from the SUPPLIERS table by creating the TERMS table. In the earlier version of the SUPPLIERS table, NET 30 appeared 2 times but only appears once in the new tables.

Computers do not know if data is normal or what normal form they are in. In fact, computers allow you to define data any way you want, including in unnormal form. The problem is--unless you define the data to the computer in normal form, you will obtain incorrect answers to your data requests. In other words, this is something that humans (i.e., accountants) must do before implementing a new system.

Below are reproduced the final tables in third normal form. Fields comprising the primary key are in ITALIC.

Original Table

PNUM

CNUM

743 E200
763 M300
763 A100
763 Z400
419 M300
420  
416 M300



PARTS

PNUM

TYPE

MATERIAL

DIAM

WGT

743 NUT ALUMINUM 10 5
763 WASHER STEEL 5 5
419 BOLT STEEL 35 20
420 BOLT ALUMINUM 10 10
416 NUT STEEL 5 10



SUPPLIERS

CNUM

CNAME

TERMS
CODE

E200 ERGO 1
M300 MELLO 2
A100 ATLAS 1
Z400 ZEUS 3



TERMS

TERMS
CODE

TERMS
DESC

1 NET 30
2 1% NET 30
3 C.O.D.

Third normal form requires all four tables, including the original.


Assignment: Look at this table. Using the materials and examples in the "Data Normalization" document, put this data table in 3rd normal form. Post your results to the Forum.

 

Easy way to describe a table... You do not have to draw a box around it. Just type the table name and then list the attributes for that table below the table name.

Table Name: ORDER

ORDER Attributes: