[Last Call] Learn about multicloud storage options and how to improve your company's cloud strategy. Register Now

x
?
Solved

Advice on select statement

Posted on 2004-10-27
21
Medium Priority
?
335 Views
Last Modified: 2012-08-13
I have following table [ZipCodes]. Last column with values of 1 and 2 is [ZipCodeID Int Identity(1,1)] which is [Unique Clustered Index]
                                                                                                           
07009       S      Cedar Grove      D      New Jersey      NJ      973      40.854334999999999      -74.231713999999997    1
07009       S      Overbrook         N      New Jersey      NJ      973      40.854334999999999      -74.231713999999997    2
......
......
......
(70,500 Records Total)

then I have table [Jobs], when somebody places a new job they fill in the form with [Address1, Address2, City, State, ZipCode], but table [Jobs] doesn't really have all this columns, it only has [Address1, Address2] and  [ZipCodeID] which is [Nonclustered Index] Foreign Key referenced to [dbo.ZipCodes(ZipCodeID)], in order to find jobs I use following select statement:

SELECT j.JobID     --j.JobID is Unique Clustered Index
FROM
      CONTAINSTABLE(Jobs, JobDetails, 'Keyword') as j
      JOIN ZipCodes z ON j.ZipCodeID=z.ZipCodeID
      WHERE
            j.Datein between GETDATE() and '10/17/2004'
            and z.ZipCode='07009' and z.State='NJ' and z.City='Cedar Grove'


now [j.Datein] is Unique Clustered Index and [z.ZipCode, z.State and z.City] are Nonclustered Indexes, so what I'm trying to do in in first place is that, I want to save some space because my hosting company allows me to have 250MB space for database and my empty database right not has reached up to 25MB and I would like to ask you 2 questions regarding my select statement

1) when I join tables like this [JOIN ZipCodes z ON j.ZipCodeID=z.ZipCodeID] and then match [z.ZipCode='07009' and z.State='NJ' and z.City='Cedar Grove'] is this gonna scan whole [ZipCodess] table for matching values in [z.ZipCode, z.State, z.City] or is it gonna go straight to the referenced record?

2) and after all is it worth saving space? Otherwize I have to store [City State and ZipCode] in five seperate tables, this way I have to store only [ZipCodeID] in every table which is Foreign Key to [ZipCodes]

sorry if all this sounds confusing, but that's the problem I have right now...  :)
0
Comment
Question by:davidlars99
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 10
  • 8
  • 3
21 Comments
 
LVL 8

Accepted Solution

by:
sigmacon earned 1200 total points
ID: 12430517
1) If you have indexes on z.ZipCode, z.State and z.City, SQL Server is going to attempt an index seek ('going directly there' - which is good) - not a table scan. If you have a separate index for each, and you are specifying all, then the query plan may deteriorate, because it will have to join the results from each seek - if you are as specific as in your example above, that's ok because it will only have to join VERY few rows. It's always a good idea to match indexes to span the same columns that are going to show up in the where clause. That most often guarantees good results. So, if space allows it, you would want one index covering all the columns in the WHERE clause. I am assuming that ZipCodeID is the clustered index, so you are good there. Of course, space constraints require you to be picky about the indexes, but even if you specify two out of three columns covered by and index, the resulting index scan may be cheap if the WHERE clause had high specificity - much cheaper than a table scan anyway.

Before you change anything, test performance and look at query plans the way your indicies are in right now. If performance is not satisfactory, then reorganize indexes.

2) In your model, City and State are repeated many times for each ZIP Code or one ZIP Code may be repeated for more than one town. Your table stores an additional piece of data: The association between a Zip Code, a Town, and a State - and it is storing this association in a denormalized form. In the long term, when the number of Jobs approaches the number of rows in the ZipCodes table and your selection of ZipCodeIDs is wide, then you MAY save some space using this model. If you really need to safe space, you may want to normalize further, but your data retrieval operation may suffer:

a) store zip codes as int instead of char(5) -> save three(?) bytes -- lets not do that unless you need to deal with 9 digit strings, then you save 8 bytes * 70000 = 560,000 bytes ~ 0.5 MB still not worth it.
b) store all cities in their own table and give them a number - adding a new city every time a user enters one you don't know, then creating a mapping table between zip codes and cities ... mmh - if we ignore the increased maintenance, this might save a few MB, but you can test that quickly locally if you decompose the table as described.
c) storing states seperately ... why not, this is more a matter of taste - I've seen too many flame wars about this to make a statement. If you never care about the long state name, you might not have to do this. If your app's supposed to be international, then you want to use ISO 3166-1 and -2 codes properly put in respective tables.

If you want to maintain the association, you may not save much space with the decomposition because of the mapping tables between ZIP Codes, States and Cities. You also will not be able to use an index covering multiple columns and thus most of your queries will involve joins. Doing it the usual way, storing ZIP, City and State in some address record will cost less space until you have 70,000 jobs. And although I am a big fan of total normalization, in this case I would say: Leave everything the way it is. Your only problem is going to be when people try to enter ZIP, City and State combinations that you don't have in your table. If you are going to allow all those (including misspellings) then you are not going to save space. This is not a data base issue, this is an application issue where a too-rigit data model might bite you later.

I hope some other experts can share their opinion on this.
0
 
LVL 32

Assisted Solution

by:Brendt Hess
Brendt Hess earned 800 total points
ID: 12436601
You also should look at how the data matches the real world.

If you have a zipcode, you can automatically get the state.  You need not join on the state, since any given zip code is wholly within one state.
Many people claim that there is only once City per zip code.  However, your sample data removes that as a possibility.  A zipcode can be within two or more city jurisdictions.

To speed things up, you should seek only on Zipcode + City, and have an index on those two fields.  Drop the state from the lookup.

Also, you may want a table for looking up the state separately.  A table structured as:

    State char(2) not null
    StateName Varchar(20) Not Null
    MinimumZipCode char(5) not null
    MaximumZipCode char(5) not null

With an index (MinimumZipCode, MaximumZipCode), and an index (State) [note - possible clustered key]

To find the state for a given zip code,

SELECT State FROM StateZipRange WHERE MinimumZipCode <= '07009' AND MaximumZipCode >='07009'

(32 bytes max) * (51 states + DC) is less than 2K data space total.  However, keeping the State and State Name in the individual Zip code records is (conservatively) an average of 7 characters for the state name + 2 for the abbreviation + a bit for overhead.  10 bytes * ... 100000 records?  200000 records?  is a lot more data space taken up.
0
 
LVL 8

Assisted Solution

by:sigmacon
sigmacon earned 1200 total points
ID: 12436855
I agree with bhess1 suggestions above, but I would put the clustered key for the other lookup table on MinimumZipCode and MaximumZipCode, because clustered indicies are perfect for range lookups. Of course all this will make your database even bigger.

Considering todays storage costs, it seems anachronistic of your host to limit DB size to such a small amount. Maybe you can find a better deal somewhere?
0
What does it mean to be "Always On"?

Is your cloud always on? With an Always On cloud you won't have to worry about downtime for maintenance or software application code updates, ensuring that your bottom line isn't affected.

 
LVL 13

Author Comment

by:davidlars99
ID: 12437513
I removed some records from ZipCodes database and I have 29564 rows (total before was 70500) of unique ZipCodes, I removed only those which are not used by US Postal Service and guess what, that's all I will ever need to use, so I have clean database now and I think my select statement's goal looks brighter. I can put Unique Clustered index on ZipCode itself...  :)

BTW, another question: what is the diference between how I create my index


>> ZipCode Char(5) Not null Primary Key Clustered
AND
>> Create Unique Clustered Index on dbo.ZipCodes(ZipCode)


especialy in my case? what if I do it this way, I mean not specifying the Primary key:


>> ZipCode Char(5) Not null
and then create index, it still going to be unique anyway, or is it all about [Primary Key]..?
>> Create Unique Clustered Index on dbo.ZipCodes(ZipCode)




0
 
LVL 13

Author Comment

by:davidlars99
ID: 12437589
have you ever worked with the database like that? I mean US zipcode database, it is very strange that two same zipcodes have two diferent Cities THAT IS NOT RIGHT, businesswize of course   :)
0
 
LVL 32

Assisted Solution

by:Brendt Hess
Brendt Hess earned 800 total points
ID: 12437688
The clustered index on Zip Code should not be UNIQUE -- as noted in your data, there can be more than one city per zip code.

davidlars99 -- that is a very common misconception that is not true.  Many zip codes contain more than one named jurisdiction, especially in rural areas where a zip might take in 500 square miles containing a half dozen small (30 to 300 person) communities.  In fact, there is a Zip in Portland, OR right now that is partially within the city of Gresham as well.  That's how I know.
0
 
LVL 8

Assisted Solution

by:sigmacon
sigmacon earned 1200 total points
ID: 12437716
Of course, it's perfectly valid for two separate towns to share the same ZIP code. I think there are over 100,000 cities and towns incorporated in the US. If each of them would have it's own zip code, than big office buildings in downtown LA wouldn't be able to get the two ZIP codes they need ;-) ( I am exaggerating here! )
0
 
LVL 13

Author Comment

by:davidlars99
ID: 12437761
0
 
LVL 8

Assisted Solution

by:sigmacon
sigmacon earned 1200 total points
ID: 12437831
With the exaggeration I meant the TWO! I know of many buildings that have an entire ZIP code of their own. I don't know of any with more than one, but it's plausible, I guess.
0
 
LVL 13

Author Comment

by:davidlars99
ID: 12437864
I can assure you that all zipcodes are unique across the country and now in my database as well  :)
0
 
LVL 13

Author Comment

by:davidlars99
ID: 12437879
I guess US postal service cares about their database as much as we do  :)
0
 
LVL 8

Assisted Solution

by:sigmacon
sigmacon earned 1200 total points
ID: 12437902
From the product description:

>> With over 70,000 records – just under 43,000 ZIP Codes - you’ll be glad you have ...

If you have 43,000 ZIP codes, but 70,000 records, then mathematically speaking you are going to have some ZIP Codes in there more than once ... I take it your kidding?
0
 
LVL 13

Author Comment

by:davidlars99
ID: 12438012
S-  Standard ZIP Code
P - PO Box Only
U - Used for Organizations such as large businesses or buildings
M - Military, often used for military vessels, but mostly for military bases
 

D - Default City Name - recommended by the USPS
A - Alternate City Name, also used by the USPS, although not the primary name of that ZIP Code.
N - Not-recommended City Name, identifies the name of the ZIP Code - often used by locals as a place name. Not used, nor accepted, by the USPS.
 


these are all type of zipcodes and cityes that US has and I guess to setisfy my business needs S and U are more then enough. As for cities D is doing just fine, all that makes 29,564 of unique total rows in my database
0
 
LVL 8

Assisted Solution

by:sigmacon
sigmacon earned 1200 total points
ID: 12438053
ok, now it makes sense how you reduced the number of rows, but what if you get a client that has a different ZIP/City combination then the those you kept? What if someone from Gresham with that ZIP code that bhess1 mentioned is a client - and you only have that ZIP code for Portland?
0
 
LVL 13

Author Comment

by:davidlars99
ID: 12438100
>> if someone from Gresham with that ZIP code that bhess1 mentioned is a cli...
in this case of mistake I will display all possible cities coresponding that particular zipcode or whatever they enter, it is only matter of little SQL select statement...  :)


and my select statement should do well without join

declare @ZipCode Char(5)
select @ZipCode=ZipCode from ZipCodes where City='Cedar Grove' and State='NJ'

SELECT JobID
FROM
     CONTAINSTABLE(Jobs, JobDetails, 'Keyword')
     WHERE
          Datein between GETDATE() and '10/17/2004'  
          and ZipCode=@ZipCode


in this case I will make Index on (City, State) in my ZipCodes table
0
 
LVL 8

Assisted Solution

by:sigmacon
sigmacon earned 1200 total points
ID: 12438155
Well, I guess your question is answered then. Good luck.
0
 
LVL 13

Author Comment

by:davidlars99
ID: 12438165
>> in this case of mistake I will display all possible cities coresponding that particular zipcode or whatever they enter, it is only matter of little SQL select statement...  :)


I mean, that search will not be performed, instead there will be links like:


Please choose one or more specific area:
==========================

Cedar Grove, NJ
Cedar Grove, CA
Cedar Grove, WA
...
...
...

and then they are more then welcome to search :)
there are 9 cities most with the same name in US so I don't think it's gonna hurt performance and I don't either that 100% of the visitors are going to make this mistakes and even they do that should not produse any noticableoverhead
0
 
LVL 13

Author Comment

by:davidlars99
ID: 12439276
0
 
LVL 32

Expert Comment

by:Brendt Hess
ID: 12439396
Personal opinion: The only substantive difference is that Primary Key automatically disallows NULL values in the underlying field(s), and that there can be only one Primary Key.  If the underlying data field(s) are defined as Not Null, then there is no substantive difference.
0
 
LVL 13

Author Comment

by:davidlars99
ID: 12439591
as long as there is no diference performancewize I don't care...  :)

thank you very much guys
0
 
LVL 8

Expert Comment

by:sigmacon
ID: 12439658
You will not be able to create a foreign key unless it points to a primary key. This is one of those dualities that gets easily missed: SQL Server treats primary keys both as constraints from a conceptual model point a view and as indexes from a physical model point of view. A primary key (if nothing else is specified) is by default a unique clustered index, and a unique constraint. But if you don't call it a primary key, you will not be able to create a foreign key to it.
0

Featured Post

Tech or Treat!

Submit an article about your scariest tech experience—and the solution—and you’ll be automatically entered to win one of 4 fantastic tech gadgets.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

I have a large data set and a SSIS package. How can I load this file in multi threading?
Recently we ran in to an issue while running some SQL jobs where we were trying to process the cubes.  We got an error saying failure stating 'NT SERVICE\SQLSERVERAGENT does not have access to Analysis Services. So this is a way to automate that wit…
This video shows how to set up a shell script to accept a positional parameter when called, pass that to a SQL script, accept the output from the statement back and then manipulate it in the Shell.
Viewers will learn how to use the SELECT statement in SQL and will be exposed to the many uses the SELECT statement has.

650 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question