Welcome to my set of tutorials on MySQL. While
there are loads of tutorials on MySQL (usually in combination with a
scripting language) on the Net aimed at building applications etc.
Part 7 - Normalisation
A look at database theory that aims to be accessible for the novice developer.
Introduction
In the previous workshops in the MySQL series, I've hinted at elements of
database theory (such as normalisation and foreign keys) without fully explaining
these concepts and why people use them when building databases. This workshop
aims to fill this gap, but from previous experience in trying to teach this
material, I know that a lot of people when learning about normalisation have
problems relating the theory to practice. This workshop aims to bridge this
gap by providing numerous examples that illustrate both why we need to normalise
at each step and also what effect is has on the data. It is not intended to
be an indepth theoretical discussion of database theory, however that said
we cannot normalise by studying the data alone and thus we will need to touch
on a few other database concepts to fully understand what is going on.
What is normalisation?
Normalisation is a process whereby the tables in a database are optimised
to remove the potential for redundancy. Two main problems may arise if this
is not done:
- Repeated data makes a database bigger.
- Multiple instances of the same values make maintaining the data more difficult
and can create anomalies.
I'm not going to develop the idea of specific anomalies much here, but to
briefly illustrate an update anomaly imagine the problem of updating a customer
address. If there are multiple instances of that customer in a database, any
query designed to update the address will have to update them all. That may
seem easy - as we've seen in previous workshops, SQL is quite capable of updating
information based on field values. But if one of those field values (say an
instance of customer name) has been input incorrectly, then the query would
fail to update the address in that row.
Normalisation aims to remove this redundancy by applying rules in a series
of stages, splitting tables and creating relationships between unique identifiers
(keys), to ensure that the database table structure is efficient, but data
can still be accurately manipulated. It can also be used in conjunction with
other database modeling techniques, such as Entity-Relationship diagrams, but for simple databases normalisation can sometimes be enough.
Before we begin looking at the stages of normalisation it is worth having
a brief look at some of the relational database concepts that inform the process.
Relational Databases
Relational databases, such as MySQL, are so called as they rely on a series
of relationships to connect data stored in different tables. For example in Part 5 we looked at joining tables based
on a common ID number. This ID number was stored as the uniquely valued field
(Primary Key) in one table (for example in the artists table), but was also
present in another table (the cds table) acting as a reference (Foreign Key)
that allowed data to be retrieved from both tables when the values matched.
For ease of understanding we used unique numbers, but a key can have any value
provided it conforms to a few rules.
Terminology
When were we were looking at the physical database in the other workshops
we used certain terms such as TABLE, COLUMN, ROW and FIELD. Discussion of
database theory uses different terms:
Relation/ Entity - These are the same as a table.
Attributes - These are similar to columns in that they describe
the type of data stored.
Domain - This is values within the same attribute (a collection
of fields that exist in the same column).
Tuple - This is a record similar to a row.
As I'm trying to tie the theoretical discussion to the practical database
experience that the first Virtual Workshops provide, I will use both sets
of terms in this workshops for clarity and to reinforce their meanings.
Different Keys
The Primary Key is frequently used to identify a specific
row (although other keys may exist) and all the other values are dependent
on that value to be meaningfully identified. A primary key is usually one
attribute (column) in a relation (table), but can be a combination of attributes
(Composite Key). If we consider the following attributes
some cds may have.
| Catalogue No. |
Artist |
Title |
The obvious Primary Key for a CDs table would be the Catalogue No. as each
cd has already been given a unique identifier. But consider if we knew for
a FACT that each artist could only release one CD per year (perhaps something
a cold war communist five year plan would envisage ;-).
| CDs |
| Catalogue No. |
Artist |
Year |
Title |
By adding the Year attribute we could also use the combination of Artist
and Year to identify the row, as together they form a Composite Key.
Thus we have two potential or Candidate Keys. We would probably
still choose the Catalogue No. as it is simpler to use one column. Consider
a Radio station wishing to record how many times a day they played a CD. The
combination of Catalogue No. and the date would be sufficient to identify
that record in a new Play History table.
| Play History |
| Catalogue No. |
Date |
Times Played |
In this situation the Catalogue No. number is a Primary Key in the CDs table
but only part of the Composite Key in the Play History table as the date is
required to form the whole key. An alternative could have been to create a
new unique single-attribute Primary Key for the table. As we still need to
know the catalogue number in order to work out how many times a DJ has played
Britney, Christina or Pink (BBC Radio One is just rubbish isn't it? ;-). The
catalogue number would become the Foreign Key in the Play
History table, i.e. a Foreign Key is a attribute (column) in a table that
refers to a Primary Key in another table.
| Play History |
| HistoryID |
Catalogue No. |
Date |
Times Played |
During the Normalisation process we will look at the rules that determine
how tables (and thus Keys) are formed.
NULLs
NULLs are used in databases to signify that no value exists, but are not
really values themselves. Instead they are just an indictor that a particular
field has no value. Some argue this is better that leaving a blank value or
having a default value. I personally see little value in a blank field, but
default values (such as inserting the current date if no date is offered)
can be useful. The interesting point about NULLs is that no Primary Key can
contain a NULL and thus it's useful when comparing Candidate Keys to see if
one could potentially contain a NULL, thus disqualifying that key.
Functional Dependency
This is another important concept and describes the relationship that columns
have to the Primary Key. If the value of a column can be determined by knowing
the value of the Primary Key and no other, then that column
is said to be functionally dependent. Or more simply, if we know the value
of Primary Key we can find out the value of any other dependent column. This
dependency is often written as follows (where A is the Primary Key and B is
the dependent column):
A → B
So for the CDs table above we could express the dependency of the title column
as either:
Catalogue No → Title or
(Artist, Year) → Title
This is important as we will see later as having a column that is NOT wholly
dependent on the Primary Key causes redundancy. Having introduced some of
the terminology of Normalisation we can begin to look at the stages.
Describing relations (tables)
When describing tables there is a way of expressing their structure that
we will follow here. This includes the name of the relation, the attributes
within the relation and which attributes are key. The format of this is:
RELATION (attribute one, attribute two, attribute three etc)
NOTE: The key attribute is underlined to signify its status and the relation
is always UPPERCASE.
For example consider how we might express the CDs table above:
CDS (Catalogue No, Artist, Year, Title)
Stages of Normalisation
There are several stages of normalisation that a database structure can be
subject to, each with rules that constrain the database further and each creating
what is called a Normal Form. These are, in order:
- First Normal Form (1NF)
- Second Normal Form (2NF)
- Third Normal Form (3NF)
- Boyce Codd Normal Form (BCNF)
- Fourth Normal Form (4NF)
- Fifth Normal Form (5FN)
As the database goes through each stage of normalisation it is said to be
in the normal form of that stage, i.e. my database is 2NF and I need it to
be 3NF. We are not going to take a detailed look at all of these Normal Forms
as BCNF, 4NF and 5NF are probably overkill for small to medium databases,
with the first 3 normal forms usually being sufficient. There is also one
other stage - that of the Un-normalised Normal Form (UNF) which is the starting
point for Normalisation. To begin our examination we must first create this
Un-normalised data.
Our Scenario
In order to look at database normalisation we are going to use a fictional
CD library (quaint in the post-napster internet age I know). Monitoring who
has which CD is done by logging the CDs in and out of a loans logbook that
contains the following information (with sample data):
| Date Borrowed |
Due Back |
Borrower |
Artist |
Title |
| 1st Dec |
8th Dec |
David Findlay |
Britney Spears
Pink
Christina Aguilera |
Britney
Cant Take Me Home
Stripped |
| 8th Dec |
15th Dec |
Iain Brown |
Darius
Christina Aguilera
Will Young |
Dive In
Stripped
From Now On |
Loans Log
Information is also stored about each borrower in a separate register:
| Borrower |
Address |
Telephone No. |
E-mail |
| David Findlay |
Davies Lodge
23 Driven St
Edinburgh
EH5 APE |
0131 555 5579 |
david@keithjbrown.co.uk |
| Iain Brown |
54 Home St,
Edinburgh,
EH23 8TF |
0131 555 5580 |
iain@keithjbrown.co.uk |
Borrower Log
Finally there is a ledger that stores information about the CDs:
| Artist |
Title |
Year |
Label |
No. of Tracks |
Copy |
| Britney Spears |
Britney |
2001 |
Jive |
14 |
1 |
| Pink |
Cant Take Me Home |
2002 |
LaFace |
15 |
3 |
| Christina Aguilera |
Stripped |
2002 |
RCA |
20 |
1 |
| Darius |
Dive In |
2002 |
Mercury |
13 |
2 |
| Will Young |
From Now On |
2002 |
RCA |
13 |
2 |
CDs Log
We can see that by keeping the information in different places there has
been some common sense organisation to avoid repeatedly entering the same
data. This is quite common, but doesn't really help up as we need to use the
normalisation rules to determine our structure. Thus to begin with we'll merge
the fields (attributes) into a one-table structure along with the same sample
data.
We have also created a column called 'Transaction No' which we use as a primary
key to uniquely identify each lending transaction. We do this as there is
no other simple method of identifying the row, as the obvious Composite Key
is pretty complex (date borrowed, name, artist, title ), and thus assigning
a unique 'Transaction No' is sensible at this time.
Note: As I mentioned above, a modeling tool like Entity
Relationship diagramming is good to use before normalisation. This would be
essential on larger database projects as it would be pretty impossible to
create one big flat database to begin normalising. It is also worth noting
that although we have included sample data, we are normalising the database
structure and not the data itself. It can be easier however to spot potential
problems when using example data.
So to begin the normalisation process, we use the Un-normalised data and
apply our first rules.
First Normal Form (1NF)
For a relation (table) to conform be 1NF, it must obey the following rules:
To ensure that all attributes (columns) are atomic - what
this means is that any single field should only have a value for ONE thing.
For example in a 'cars' database a field in a car may have a value 'Blue Ford
Focus'. This field is not atomic as it contains more than one piece of information
(colour, manufacturer and model). To demostrate the problem with this, try
to imagine constructing a query to select all Ford cars. This would be difficult
as the value 'Ford' is trapped in the middle of the field. We must therefore
split this column into three (colour, manufacturer, model).
There must be no repeating groups. This is perhaps most
easily understood as there cannot be repeating columns (that which contain
the same type of information). For example a cars field may contain a list
of Ford cars (Focus, Mondeo, Puma). As we know we must also make this field
atomic, but splitting the column into three (cars1, cars2, cars3) to hold
these values isn't an option this time as that would create three domains
that contain similar data. In practice this would make querying the databases
difficult as we would have to create a statement to match potential values
in 3 columns (or four, or five or a hundred if there was a list of a hundred).
Splitting the field so that each value was in a separate row would require
the other values in the rows to be repeated. The solution is to change the
structure of the database by extracting the repeating groups into a new relation.
1NF in practice
Looking at the Un-normalised data we can see
that there are non-atomic attributes in the following columns:
- Borrower
- Address
- The elements that make up the CD information (Artist, Title, Year, Label,
No_of_Tracks and Copy)
All of these attributes will have to be normalised so the domain contains
only atomic values. To begin, let's look at the Borrower attribute. Thus we should split this attribute in two. The following excerpt from the
table illustrates how this would be implemented:
| TransactionNo. |
Forename |
Surname |
Address |
TeleNo |
| 1 |
David |
Findlay |
Davies Lodge
23 Driven St
Edinburgh
EH5 APE |
0131 555 5579 |
| 2 |
Iain |
Brown |
54 Home St
Edinburgh
EH23 8TF
|
0131 555 5580 |
Similarly the address should be broken down into different attributes. Anyone
that has filled out an application form will be familiar with this structure.
- Address Line One
- Address Line Two (for the purposes of this workshop we'll stop at two
address lines - in reality you may need more)
- City
- Postcode
Once more we can see the effect that this has on an excerpt of the table:
| TransactionNo. |
Forename |
Surname |
Address_One |
Address_Two |
City |
Postcode |
| 1 |
David |
Findlay |
Davies Lodge |
23 Driven St |
Edinburgh |
EH5 APE |
| 2 |
Iain |
Brown |
54 Home St |
|
Edinburgh |
EH23 8TF |
The Repeating Group
If we have a look at the remaining non-atomic attributes we can see that
they are similar informational attributes about a CD. In order to make these
attributes atomic we cannot repeat the trick of creating additional columns
as each repeated column would hold values of a similar type. If we take the
artist field for example:
| TransactionNo. |
(Other Attributes) |
Artist_One |
Artist_Two |
Artist_Three |
| 1 |
... |
Britney Spears |
Pink |
Christina Aguilera |
| 2 |
... |
Darius |
Christina Aguilera |
Will Young |
If we split each original field into 3 we can see the value 'Christina Aguilera'
appears twice, once in Artist_Two and once in Artist_Three. If we were searching
to see who had borrowed CDs by Christina we'd have to create a query to search
ALL the artist fields. Also if someone borrowed a 4th CD as part of a transaction
we would have to create a fourth Artist column and so on. The other reason
that repeating groups are not allowed to be split into columns is that we
MUST know the number of columns required in order to create the database.
This is a trap that we could fall into by looking at only the data. We have
not actually changed the structure of the database and thus the problem of
repeating groups still exists. This can be seen by the fact that the Primary
Key is no longer unique. The correct solution is to remove the repeating groups
(Artist, Title, Year, Label, No_of_Tracks and Copy) into a new 'Borrowed
CDs' relation .
| TransactionNo |
Artist |
Title |
Year |
Label |
No_of_Tracks |
Copy |
| 1 |
Britney Spears |
Britney |
2001 |
Jive |
14 |
1 |
| 1 |
Pink |
Cant Take Me Home |
2002 |
LaFace |
15 |
3 |
| 1 |
Christina Aguilera |
Stripped |
2002 |
RCA |
20 |
1 |
| 2 |
Darius |
Dive In |
2002 |
Mercury |
13 |
2 |
| 2 |
Will Young |
From Now On |
2002 |
RCA |
13 |
2 |
| 2 |
Christina Aguilera |
Stripped |
2002 |
RCA |
20 |
1 |
Because we need to be able to join this data back with that left in the original
table we also need to include the TransactionNo in this Borrowed CDs
relation which will act (along with a qualifying value) as Key for this relation.
The qualifying value, allowing us to identify each CD in the library, combined
with the transaction number will be able to tell us the specific CD borrowed
per transaction. Thus we next need to define the qualifying value which will
be the least number of attributes required to uniquely identify the CD. This
could be (Artist, Title) if we were sure that this was a unique indentifer
for a CD, but we already can see that there are different copies of the same
CD. We could try adding Copy to the Key (Artist, Title, Copy), but in practice
having a single unique value for each CD is preferable. Thus we will also
create a cdID attribute in this relation and use that to represent the uniqueness
of a CD.
LOANS (TransactionNo, Forename, Surname, AddressOne, AddressTwo, City,
Postcode, TeleNo, E-mail, Date_Borrowed, Due_Back)
BORROWED CDS (TransactionNo,cdID, Artist, Title, Year, Label,
No_of_Tracks, Copy)
Second Normal Form (2NF)
The second normal form has one rule that must be followed:
All non-key attributes must be dependant ON THE WHOLE KEY and not
just one attribute of the key. This obviously only applies to Composite
Keys and means that attributes in the relation that only contain information
(and have no role in the structure of the database) must be functionally dependent
on all parts of the Key. Another way of thinking about this is that if there
are some attributes that are not dependent on all parts of the Key this means
that they are in the wrong relation.
For example: Imagine we have a cars_bought table where each row (record)
contains information about the car that a customer bought, which has a Primary
Key of (CustomerID, Model). Information about the car (such as engine size,
year etc) could be derived from just one part of the Key (Model) and thus
isn't dependent on the whole key. Whereas if we wanted to find out the colour
of the car the customer bought then we could search for the customer and this
would give us the colour - assuming he has only bought one car. BUT what if
the customer has ordered two cars, of different models and different colours.
In order to find the colour of a specific car we would need to know the CustomerID
AND the model number. Thus colour is dependent on the WHOLE Primary Key and
not just a part of it. What this tells us is that colour is in the correct
relation as it is specific to the customer and the model, whereas information
about the car, which is dependent upon the Model alone, should be in a different
table.
2NF in practice
When removing the partial dependencies from our database we can only look
at tables that have a Composite Key and not a sole attribute as the Primary
Key. As the loans Relation has a Primary Key of 'TransactionNo' this
table doesn't require to be changed in this Normal Form. The second relation
(Borrowed CDs) DOES have a Composite Key so we must examine the attributes
of this table to see if they are all dependent on the whole Key.
We can see that ALL the CD information attributes are dependent on the cdID
attribute rather than the whole Key. This is because they aren't really about
the CDs borrowed and more about just the CDs bought and thus belong on their
own relation - CDs Bought.
| cdID |
Artist |
Title |
Year |
Label |
No_of_Tracks |
Copy |
| 1 |
Britney Spears |
Britney |
2001 |
Jive |
14 |
1 |
| 2 |
Pink |
Cant Take Me Home |
2002 |
LaFace |
15 |
3 |
| 3 |
Christina Aguilera |
Stripped |
2002 |
RCA |
20 |
1 |
| 4 |
Darius |
Dive In |
2002 |
Mercury |
13 |
2 |
| 5 |
Will Young |
From Now On |
2002 |
RCA |
13 |
2 |
CDs Bought
And that means that we can remove all the rest of CD attributes from the
Borrowed CDs table.
| TransactionNo |
cdID |
| 1 |
1 |
| 1 |
2 |
| 1 |
3 |
| 2 |
4 |
| 2 |
5 |
| 2 |
3 |
Borrowed CDs
LOANS(TransactionNo, Forename, Surname, AddressOne, AddressTwo, City,
Postcode, TeleNo, E-mail, Date_Borrowed, Due_Back)
BORROWED CDS(TransactionNo,cdID, Artist, Title, Year, Label,
No_of_Tracks, Copy)
CDS BOUGHT(cdID, Artist, Title, Year, Label, No_of_Tracks, Copy)
Third Normal Form (3NF)
To achieve Third Normal Form no attribute must be dependent on a
non-key attribute. This means that every informational attribute
must be DIRECTLY dependent on the Primary Key and not on another column. If
we again look at an imagined cars database, a customer table contains information
such as address, city, postcode and also a column called shipping cost. The
value of shipping cost changes in relation to where the car should be delivered,
and thus is not directly dependent on the customer, but the city. Thus we
would need to create another separate relation to hold the information about
cities and shipping costs.
3NF in practice
As is good practice we should examine every table in turn to see if the non-key
attributes are indeed directly dependent on the Primary Key.
If we look at the loans relation we can see that the columns AddressOne,
AddressTwo, City, Postcode, and E-mail are not dependent on the TransactionNo,
but rather a Composite Key of Forename and Surname. Thus they should be removed
from this table into a new relation Borrowers. As there is a possibility
that there could be more that one David Findlay or Iain Brown, this Composite
Key isn't very effective so once more we will replace it with a unique numeric
reference (BorrowerID). This will become the new Primary Key of Borrowers
and a foreign Key in the loans relation.
The Borrowed CDs relation only has two attributes, which together make up
the Primary Key and thus is not required to be normalised. The final table
CDs Bought can however be nornailised further. Year, Label and No
of Tracks are dependent on a Composite Key of Artist + Title. Copy is dependent
on the cdID i.e. that which relates a particular physical CD that has been
purchased, rather than album details such as artist or label. Thus we can
create another relation CD Releases with a releaseID that identifies
an album of which there may be millions of copies.
| releaseID |
Artist |
Title |
Year |
Label |
No_of_Tracks |
| 1 |
Britney Spears |
Britney |
2001 |
Jive |
14 |
| 2 |
Pink |
Cant Take Me Home |
2002 |
LaFace |
15 |
| 3 |
Christina Aguilera |
Stripped |
2002 |
RCA |
20 |
| 4 |
Darius |
Dive In |
2002 |
Mercury |
13 |
| 5 |
Will Young |
From Now On |
2002 |
RCA |
13 |
CD Releases Relation
We of course have to replace these dependent attributes in the CDs Bought
relation with the releaseID.
| cdID |
releaseID |
Copy |
| 1 |
1 |
1 |
| 2 |
2 |
3 |
| 3 |
3 |
1 |
| 4 |
4 |
2 |
| 5 |
5 |
2 |
CDs Bought Relation
To further illustrate the difference between the CDs and the releases consider
that we have bought two new CDs that will be available to borrow. One is new
to our database (Robbie Williams - Escapology) whereas the Christina Aguilera
- Stripped release has been so popular we decided to buy another copy.
Adding the Robbie album would mean adding data to two relations, CDs
Bought and CD Releases, like so.
| releaseID |
Artist |
Title |
Year |
Label |
No_of_Tracks |
| 6 |
Robbie Williams |
Escapology |
2002 |
Chrysalis
|
14 |
CD Releases Relation
| cdID |
releaseID |
Copy |
| 6 |
6 |
12/12/2002 |
CDs Bought Relation
Whereas we would only need to add information about the Christina Aguilera
album to the CDs Bought relation, as its release information has
already been entered into the CD Releases relation.
| cdID |
releaseID |
Copy |
| 7 |
3 |
12/12/2002 |
CDs Bought Relation
LOANS(TransactionNo, BorrowerID, Date_Borrowed, Due_Back)
BORROWER(BrrowerID, Forename, Surname, AddressOne, AddressTwo, City,
Postcode, TeleNo, E-mail)
BORROWED CDS(TransactionNo,cdID, Artist, Title, Year, Label,
No_of_Tracks, Copy)
CDS BOUGHT (cdID, releaseID, Copy)
CD RELEASES (releaseID, Artist, Title, Year, Label, No_of_Tracks)
Other Normal Forms
As mentioned earlier there are more Normal Forms that
we could use to normalise our database further. We are not going to do this,
because in practice the anomalies that they are designed to remove are rare.
We will however have a brief look at BCNF so that there is at least some understanding
of that norm should you come across it. If you are confused or struggling
with the above, it would perhaps be worth skipping past this section until
you are more confident about your understanding of the first three Normal
Forms.
Boyce-Codd Normal Form (BCNF)
This often referred to as a strong 3NF and states that each determinant
must be a Candidate Key. In a relation there may be more than just
the Primary Key from which we can derive the values of other attributes, i.e.
from another attribute or combination of attributes. These are known as Candidate
Keys and in 3NF it is possible to have only some of the attributes functionally
dependent on them. In BCNF all attributes must depend on all Candidate Keys.
Variations
While the rules of Normalisation stay the same the practical implementation
of them differs. The method I've used above is the one that I think is easiest
to understand for the beginner, but there are a couple of others worth mentioning
in case you see them discussed elsewhere. It doesn't really matter which you
use as the derived data structure is the same.
Composite Key in the First Relation
In the above example we had 1NF of :
LOANS (TransactionNo, Forename, Surname, AddressOne, AddressTwo, City,
Postcode, TeleNo, E-mail, Date_Borrowed, Due_Back)
BORROWED CDS(TransactionNo,cdID, Artist, Title, Year, Label,
No_of_Tracks, Copy)
We are reminded that there was a Composite Key in the second relation (Borrowed
CDs). We could have just as easily created the Composite Key in the first
relation, where we remove (Artist, Title, Year, Label, No_of_Tracks, Copy)
to a new relation called CDs Bought, create the Primary Key cdID for that
relation and create a new attribute in the Loans relation that is a Foreign
Key and the qualifying value for the TransactionNo. This would have duplicated
the data in the non-repeating groups, which doesn't matter as we're normalising
the database structure not the actual data, and that repetition is removed
by the later Normal Forms. This would leave a 1NF of:
LOANS (cdID,TransactionNo, Forename, Surname, AddressOne,
AddressTwo, City, Postcode, TeleNo, E-mail, Date_Borrowed, Due_Back)
CDS BOUGHT(cdID, Artist, Title, Year, Label, No_of_Tracks, Copy)
This means that in to achieve 2NF we no longer look at the second relation,
but instead the first relation to determine if the attributes are functionally
dependent on the whole Composite Key. All the attributes are functionally
dependent on the TransactionNo. not the whole Key, so we remove them to a
new relation called Transaction Info. 2NF is thus expressed:
LOANS (cdID,TransactionNo)
TRANSACTION INFO(TransactionNo, Forename, Surname, AddressOne, AddressTwo,
City, Postcode, TeleNo, E-mail, Date_Borrowed, Due_Back)
CDS BOUGHT(cdID, Artist, Title, Year, Label, No_of_Tracks, Copy)
We can see that this is the same data structure as the original method at
2NF, but we've just chosen different names due to the order in which a relation
was first established.
1NF with only one relation
This is another method that is quite popular. If we recall the rules for
1NF they are:
- To ensure that all attributes (columns) contain only atomic values
- There must be no repeating groups
This says nothing about not repeating data or having to create new relations
to resolve the problems. The problem we had with this is that there was no longer a Primary Key. We
can however make a Composite Key by using the same cdID as we did above to
represent the uniqueness, except this time in just one relation. So 1NF would
be:
LOANS(cdID , TransactionNo, Forename, Surname, AddressOne,
AddressTwo, City, Postcode, TeleNo, E-mail, Date_Borrowed, Due_Back, Artist,
Title, Year, Label, No_of_Tracks, Copy)
The consequence of this is that to achieve 2NF would be a longer stage. Initially
we can see (similar to previous examples) that the attributes (Artist, Title,
Year, Label, No_of_Tracks, Copy) are dependant on only the cdID part of the
key and thus can be removed to a new relation:
LOANS (cdID , TransactionNo, Forename, Surname, AddressOne,
AddressTwo, City, Postcode, TeleNo, E-mail, Date_Borrowed, Due_Back)
CDS BOUGHT (cdID, Artist, Title, Year, Label, No_of_Tracks, Copy)
This creates a structure identical to that of the 'Composite Key in the First
Relation' example in 1NF, where the loan information attributes are only dependent
on the Transaction No, and thus we have to split the relation again:
LOANS (cdID , TransactionNo)
TRANSACTION INFO (TransactionNo, Forename, Surname, AddressOne, AddressTwo,
City, Postcode, TeleNo, E-mail, Date_Borrowed, Due_Back)
CDS BOUGHT (cdID, Artist, Title, Year, Label, No_of_Tracks, Copy)
The database structure is now 2NF.
Without A Joining Table and cdID
This produces the only truly normalised database schema in this workshop
- and I'll explain why in a moment. But first what is a 'Joining Table'? This
is a table that is an entirely Composite Key of attributes which are Foreign
Keys that contain no informational data and thus its function is to join the
Primary Keys of other tables together.
For example: in each of the database structures we have produced so far there
has been a relation with two attributes (cdID , TransactionNo)
so that if we look up a Transaction No we can retrieve the cds by the corresponding
cdIDs. In strict normalisation this table should not exist in this form, it
has come into existence because we created an artificial key (cdID) to represent
the Composite Key (Artist, Title, Copy). This is something that is done quite
frequently in practice (why I've done so above), but is theoretically wrong.
To explain let's normalise the structure again.
Because we cannot replace the Composite Key elements with cdID in 1NF, we
are left with a structure like so:
LOANS (TransactionNo, Forename, Surname, AddressOne, AddressTwo, City,
Postcode, TeleNo, E-mail, Date_Borrowed, Due_Back)
BORROWED CDS (TransactionNo,Artist,Title,Copy,
Year, Label, No_of_Tracks)
As we can see, the Composite Key for the Borrowed CDs relation is now a lot
more complicated. When we achieve 2NF we would need to create a structure
like so:
LOANS( TransactionNo, Forename, Surname, AddressOne, AddressTwo, City,
Postcode, TeleNo, E-mail, Date_Borrowed, Due_Back)
BORROWED CDS (TransactionNo,Artist,Title,Copy)
CDS BOUGHT (Artist,Title,Copy, Year, Label, No_of_Tracks)
This has meant that any join we perform using SQL becomes much more complicated,
as we are having to match more fields. Consider the following sample MySQL
Inner Join to retrieve all the information about a bought cd.
With cdID:
SELECT *
FROM cds_bought, cds_released
WHERE (cds_bought.cdID = cds_released.cdID);
Without cdID:
SELECT *
FROM cds_bought, cds_released
WHERE (cds_bought.Artist = cds_released.Artist)
AND (cds_bought.Title = cds_released.Title)
AND (cds_bought.Copy = cds_released.Copy);
As we observe, joining tables on three fields is more complicated than on
one field.
NOTE: The names of tables and columns are slightly different
as the physical implementation of any database is best done without spaces
in the names, whereas in the theoretical design stage spaces aid readability.
Thus to summarise, we have replaced the 'true' Composite Key with an artificial
Primary Key, in order to make things easier in the physical implementation,
BUT THERE IS NO RULE IN NORMALISATION THAT SAYS YOU SHOULD.
Conclusion
This has hopefully been a pretty straightforward guide to a complicated (for
the novice anyway) subject. I first got into this whole web design game from
a media standpoint rather than a technical computer standpoint. When I started
using databases on the web, I had to become familiar with concepts that are
second nature to those from a formal technical background, which was tricky
at first due to the target audience of the literature. Thus I hope that this
VW can be of some use to those coming to normalisation from other backgrounds
as well as those in IT seeking a little clarification.
|