SELECT DISTINCT verses additional table. Database design

I have a table called orders which will be very large. Two of many columns on that table are called phone and address. There will have duplicate values for these two columns. While the orders table will be huge, the amount of results it returns back from queries will never be more than 5. That's just the nature of our software design.

On our form is an address input text field with an associated datalist. This field uses ajax to query the database for address values that match the phone number entered by the user, and populates the datalist with the values.

The table would normally return duplicate values, which we do not want. So in terms of database performance, the amount of transactions, amount of data being stored, etc etc.... is it better to:

A) Use a SELECT DISTINCT on the large table?
B) Create another table called addresses. After the initial record is added to the orders table, the phone number and address values also get written to the addresses table, but only if the value for BOTH fields does not already exist.

So the question is in terms of what is the best database practice for best performance to handle a situation like this? SELECT DISTINCT from the orders table, or SELECT with no DISTINCT clause needed on the addresses table? Is the duplication of data and writes worth it?
LVL 9
BobHavertyComhAsked:
Who is Participating?

[Product update] Infrastructure Analysis Tool is now available with Business Accounts.Learn More

x
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

Chris StanyonWebDevCommented:
A fundamental aspect of database design is called Normalisation. This is a process of designing your database to reduce data redundancy.

In your case, you seem to have a lot of data redundancy. A good design here would be to have an address table storing the phone number and address, along with an iD (the primary key). Your orders tables then has a foreign key (addressID) which links to the address table.

This will give you a one-to-many relationship - one address can have many orders.

Obviously, you'll need to consider your business rules as to whether this is the right 'model' for you.
0
BobHavertyComhAuthor Commented:
Hi Chris, thanks for your answer. A phone number and address cannot exist unless it was added as part of an order. So what is the difference between what you are suggesting and just putting the phone number and address values on the orders table itself? Why a second table? The only possible value I saw with option B was that at least the query doesn't have to check duplicate values, because an address and phone number that already exist on the Addresses table does not get written to the Addresses table, whereas the values MUST be duplicated on the orders table, because they are part of a unique event, and are therefore useful. So it seems like a contest between possible slower queries using SELECT DISTINCT on one table, or increasing write transactions and redundant info by creating an additional table that will have a smaller amount of records. But in the model you are suggesting, since every address has to have an ID that corresponds to the orders table, the Address table will have as many rows as the orders table.
0
Scott PletcherSenior DBACommented:
B) is better in your case.

I expect the same customers to order repeatedly.  A separate addresses table means that the ajax query does not have to read through thousands (or even millions?) of duplicate addresses to provide a DISTINCT list.  It also saves you space overall in the orders table by not storing the same address thousands, or millions, of times.
0
Chris StanyonWebDevCommented:
OK. Like I said, the way you normalise your database will depend largely on your business model.

If a single address/telephone number combination can be saved with several orders, then you will get data duplication. By normalising the data, you store the telephone number and address once, and then link that record to as many order records as you need. This is fundamental to database normalisation. Take a quick look at this data example:

// Your model
Order ID | Order Date | Address | Telephone
1 | 10 Jan 2015 | 123 High Street | 01234 321654
2 | 11 Jan 2015 | 123 High Street | 01234 321654
3 | 12 Jan 2015 | 123 High Street | 01234 321654
4 | 13 Jan 2015 | 123 High Street | 01234 321654
5 | 14 Jan 2015 | 567 Main Street | 555-12123
6 | 15 Jan 2015 | 567 Main Street | 555-12123

You can see here that you have a lot of duplicated data

Now look at a normalised model:

//Addresses
Address ID | Address | Telephone
1 | 123 High Street | 01234 321654
2 | 567 Main Street | 555-12123

//Orders
Order ID | Order Date | Address ID
1 | 10 Jan 2015 | 1
2 | 11 Jan 2015 | 1
3 | 12 Jan 2015 | 1
4 | 13 Jan 2015 | 1
5 | 14 Jan 2015 | 2
6 | 15 Jan 2015 | 2

No data duplication!

In your code, you would then need to enforce your business rules. When a user enters an address, run a SELECT query against the address table to see if the address already exists. If it does, get the Address ID to insert into the Order Table. If the address doesn't exist, then create a new address record, get the new record's Address ID and use that to insert into the Orders table.

When you want to retrieve the data, you will just need to use a simple JOIN query.
0

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
BobHavertyComhAuthor Commented:
Thanks guys. I think Chris's solution was more specific, but both were correct, as long as duplication of information is avoided, and I think it is. My initial idea for B involved a limited amount of duplication, and these solutions really have none, just links. It's just an extra step of checking if it's already there or not and grabbing it if it is, and adding it if it is not.
0
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
Query Syntax

From novice to tech pro — start learning today.