Solved

Advice on select statement

Posted on 2004-10-27
323 Views
Last Modified: 2012-08-13
I have following table [ZipCodes]. Last column with values of 1 and 2 is [ZipCodeID Int Identity(1,1)] which is [Unique Clustered Index]
                                                                                                           
07009       S      Cedar Grove      D      New Jersey      NJ      973      40.854334999999999      -74.231713999999997    1
07009       S      Overbrook         N      New Jersey      NJ      973      40.854334999999999      -74.231713999999997    2
......
......
......
(70,500 Records Total)

then I have table [Jobs], when somebody places a new job they fill in the form with [Address1, Address2, City, State, ZipCode], but table [Jobs] doesn't really have all this columns, it only has [Address1, Address2] and  [ZipCodeID] which is [Nonclustered Index] Foreign Key referenced to [dbo.ZipCodes(ZipCodeID)], in order to find jobs I use following select statement:

SELECT j.JobID     --j.JobID is Unique Clustered Index
FROM
      CONTAINSTABLE(Jobs, JobDetails, 'Keyword') as j
      JOIN ZipCodes z ON j.ZipCodeID=z.ZipCodeID
      WHERE
            j.Datein between GETDATE() and '10/17/2004'
            and z.ZipCode='07009' and z.State='NJ' and z.City='Cedar Grove'


now [j.Datein] is Unique Clustered Index and [z.ZipCode, z.State and z.City] are Nonclustered Indexes, so what I'm trying to do in in first place is that, I want to save some space because my hosting company allows me to have 250MB space for database and my empty database right not has reached up to 25MB and I would like to ask you 2 questions regarding my select statement

1) when I join tables like this [JOIN ZipCodes z ON j.ZipCodeID=z.ZipCodeID] and then match [z.ZipCode='07009' and z.State='NJ' and z.City='Cedar Grove'] is this gonna scan whole [ZipCodess] table for matching values in [z.ZipCode, z.State, z.City] or is it gonna go straight to the referenced record?

2) and after all is it worth saving space? Otherwize I have to store [City State and ZipCode] in five seperate tables, this way I have to store only [ZipCodeID] in every table which is Foreign Key to [ZipCodes]

sorry if all this sounds confusing, but that's the problem I have right now...  :)
0
Question by:davidlars99
    21 Comments
     
    LVL 8

    Accepted Solution

    by:
    1) If you have indexes on z.ZipCode, z.State and z.City, SQL Server is going to attempt an index seek ('going directly there' - which is good) - not a table scan. If you have a separate index for each, and you are specifying all, then the query plan may deteriorate, because it will have to join the results from each seek - if you are as specific as in your example above, that's ok because it will only have to join VERY few rows. It's always a good idea to match indexes to span the same columns that are going to show up in the where clause. That most often guarantees good results. So, if space allows it, you would want one index covering all the columns in the WHERE clause. I am assuming that ZipCodeID is the clustered index, so you are good there. Of course, space constraints require you to be picky about the indexes, but even if you specify two out of three columns covered by and index, the resulting index scan may be cheap if the WHERE clause had high specificity - much cheaper than a table scan anyway.

    Before you change anything, test performance and look at query plans the way your indicies are in right now. If performance is not satisfactory, then reorganize indexes.

    2) In your model, City and State are repeated many times for each ZIP Code or one ZIP Code may be repeated for more than one town. Your table stores an additional piece of data: The association between a Zip Code, a Town, and a State - and it is storing this association in a denormalized form. In the long term, when the number of Jobs approaches the number of rows in the ZipCodes table and your selection of ZipCodeIDs is wide, then you MAY save some space using this model. If you really need to safe space, you may want to normalize further, but your data retrieval operation may suffer:

    a) store zip codes as int instead of char(5) -> save three(?) bytes -- lets not do that unless you need to deal with 9 digit strings, then you save 8 bytes * 70000 = 560,000 bytes ~ 0.5 MB still not worth it.
    b) store all cities in their own table and give them a number - adding a new city every time a user enters one you don't know, then creating a mapping table between zip codes and cities ... mmh - if we ignore the increased maintenance, this might save a few MB, but you can test that quickly locally if you decompose the table as described.
    c) storing states seperately ... why not, this is more a matter of taste - I've seen too many flame wars about this to make a statement. If you never care about the long state name, you might not have to do this. If your app's supposed to be international, then you want to use ISO 3166-1 and -2 codes properly put in respective tables.

    If you want to maintain the association, you may not save much space with the decomposition because of the mapping tables between ZIP Codes, States and Cities. You also will not be able to use an index covering multiple columns and thus most of your queries will involve joins. Doing it the usual way, storing ZIP, City and State in some address record will cost less space until you have 70,000 jobs. And although I am a big fan of total normalization, in this case I would say: Leave everything the way it is. Your only problem is going to be when people try to enter ZIP, City and State combinations that you don't have in your table. If you are going to allow all those (including misspellings) then you are not going to save space. This is not a data base issue, this is an application issue where a too-rigit data model might bite you later.

    I hope some other experts can share their opinion on this.
    0
     
    LVL 32

    Assisted Solution

    by:bhess1
    You also should look at how the data matches the real world.

    If you have a zipcode, you can automatically get the state.  You need not join on the state, since any given zip code is wholly within one state.
    Many people claim that there is only once City per zip code.  However, your sample data removes that as a possibility.  A zipcode can be within two or more city jurisdictions.

    To speed things up, you should seek only on Zipcode + City, and have an index on those two fields.  Drop the state from the lookup.

    Also, you may want a table for looking up the state separately.  A table structured as:

        State char(2) not null
        StateName Varchar(20) Not Null
        MinimumZipCode char(5) not null
        MaximumZipCode char(5) not null

    With an index (MinimumZipCode, MaximumZipCode), and an index (State) [note - possible clustered key]

    To find the state for a given zip code,

    SELECT State FROM StateZipRange WHERE MinimumZipCode <= '07009' AND MaximumZipCode >='07009'

    (32 bytes max) * (51 states + DC) is less than 2K data space total.  However, keeping the State and State Name in the individual Zip code records is (conservatively) an average of 7 characters for the state name + 2 for the abbreviation + a bit for overhead.  10 bytes * ... 100000 records?  200000 records?  is a lot more data space taken up.
    0
     
    LVL 8

    Assisted Solution

    by:sigmacon
    I agree with bhess1 suggestions above, but I would put the clustered key for the other lookup table on MinimumZipCode and MaximumZipCode, because clustered indicies are perfect for range lookups. Of course all this will make your database even bigger.

    Considering todays storage costs, it seems anachronistic of your host to limit DB size to such a small amount. Maybe you can find a better deal somewhere?
    0
     
    LVL 13

    Author Comment

    by:davidlars99
    I removed some records from ZipCodes database and I have 29564 rows (total before was 70500) of unique ZipCodes, I removed only those which are not used by US Postal Service and guess what, that's all I will ever need to use, so I have clean database now and I think my select statement's goal looks brighter. I can put Unique Clustered index on ZipCode itself...  :)

    BTW, another question: what is the diference between how I create my index


    >> ZipCode Char(5) Not null Primary Key Clustered
    AND
    >> Create Unique Clustered Index on dbo.ZipCodes(ZipCode)


    especialy in my case? what if I do it this way, I mean not specifying the Primary key:


    >> ZipCode Char(5) Not null
    and then create index, it still going to be unique anyway, or is it all about [Primary Key]..?
    >> Create Unique Clustered Index on dbo.ZipCodes(ZipCode)




    0
     
    LVL 13

    Author Comment

    by:davidlars99
    have you ever worked with the database like that? I mean US zipcode database, it is very strange that two same zipcodes have two diferent Cities THAT IS NOT RIGHT, businesswize of course   :)
    0
     
    LVL 32

    Assisted Solution

    by:bhess1
    The clustered index on Zip Code should not be UNIQUE -- as noted in your data, there can be more than one city per zip code.

    davidlars99 -- that is a very common misconception that is not true.  Many zip codes contain more than one named jurisdiction, especially in rural areas where a zip might take in 500 square miles containing a half dozen small (30 to 300 person) communities.  In fact, there is a Zip in Portland, OR right now that is partially within the city of Gresham as well.  That's how I know.
    0
     
    LVL 8

    Assisted Solution

    by:sigmacon
    Of course, it's perfectly valid for two separate towns to share the same ZIP code. I think there are over 100,000 cities and towns incorporated in the US. If each of them would have it's own zip code, than big office buildings in downtown LA wouldn't be able to get the two ZIP codes they need ;-) ( I am exaggerating here! )
    0
     
    LVL 13

    Author Comment

    by:davidlars99
    0
     
    LVL 8

    Assisted Solution

    by:sigmacon
    With the exaggeration I meant the TWO! I know of many buildings that have an entire ZIP code of their own. I don't know of any with more than one, but it's plausible, I guess.
    0
     
    LVL 13

    Author Comment

    by:davidlars99
    I can assure you that all zipcodes are unique across the country and now in my database as well  :)
    0
     
    LVL 13

    Author Comment

    by:davidlars99
    I guess US postal service cares about their database as much as we do  :)
    0
     
    LVL 8

    Assisted Solution

    by:sigmacon
    From the product description:

    >> With over 70,000 records – just under 43,000 ZIP Codes - you’ll be glad you have ...

    If you have 43,000 ZIP codes, but 70,000 records, then mathematically speaking you are going to have some ZIP Codes in there more than once ... I take it your kidding?
    0
     
    LVL 13

    Author Comment

    by:davidlars99
    S-  Standard ZIP Code
    P - PO Box Only
    U - Used for Organizations such as large businesses or buildings
    M - Military, often used for military vessels, but mostly for military bases
     

    D - Default City Name - recommended by the USPS
    A - Alternate City Name, also used by the USPS, although not the primary name of that ZIP Code.
    N - Not-recommended City Name, identifies the name of the ZIP Code - often used by locals as a place name. Not used, nor accepted, by the USPS.
     


    these are all type of zipcodes and cityes that US has and I guess to setisfy my business needs S and U are more then enough. As for cities D is doing just fine, all that makes 29,564 of unique total rows in my database
    0
     
    LVL 8

    Assisted Solution

    by:sigmacon
    ok, now it makes sense how you reduced the number of rows, but what if you get a client that has a different ZIP/City combination then the those you kept? What if someone from Gresham with that ZIP code that bhess1 mentioned is a client - and you only have that ZIP code for Portland?
    0
     
    LVL 13

    Author Comment

    by:davidlars99
    >> if someone from Gresham with that ZIP code that bhess1 mentioned is a cli...
    in this case of mistake I will display all possible cities coresponding that particular zipcode or whatever they enter, it is only matter of little SQL select statement...  :)


    and my select statement should do well without join

    declare @ZipCode Char(5)
    select @ZipCode=ZipCode from ZipCodes where City='Cedar Grove' and State='NJ'

    SELECT JobID
    FROM
         CONTAINSTABLE(Jobs, JobDetails, 'Keyword')
         WHERE
              Datein between GETDATE() and '10/17/2004'  
              and ZipCode=@ZipCode


    in this case I will make Index on (City, State) in my ZipCodes table
    0
     
    LVL 8

    Assisted Solution

    by:sigmacon
    Well, I guess your question is answered then. Good luck.
    0
     
    LVL 13

    Author Comment

    by:davidlars99
    >> in this case of mistake I will display all possible cities coresponding that particular zipcode or whatever they enter, it is only matter of little SQL select statement...  :)


    I mean, that search will not be performed, instead there will be links like:


    Please choose one or more specific area:
    ==========================

    Cedar Grove, NJ
    Cedar Grove, CA
    Cedar Grove, WA
    ...
    ...
    ...

    and then they are more then welcome to search :)
    there are 9 cities most with the same name in US so I don't think it's gonna hurt performance and I don't either that 100% of the visitors are going to make this mistakes and even they do that should not produse any noticableoverhead
    0
     
    LVL 13

    Author Comment

    by:davidlars99
    0
     
    LVL 32

    Expert Comment

    by:bhess1
    Personal opinion: The only substantive difference is that Primary Key automatically disallows NULL values in the underlying field(s), and that there can be only one Primary Key.  If the underlying data field(s) are defined as Not Null, then there is no substantive difference.
    0
     
    LVL 13

    Author Comment

    by:davidlars99
    as long as there is no diference performancewize I don't care...  :)

    thank you very much guys
    0
     
    LVL 8

    Expert Comment

    by:sigmacon
    You will not be able to create a foreign key unless it points to a primary key. This is one of those dualities that gets easily missed: SQL Server treats primary keys both as constraints from a conceptual model point a view and as indexes from a physical model point of view. A primary key (if nothing else is specified) is by default a unique clustered index, and a unique constraint. But if you don't call it a primary key, you will not be able to create a foreign key to it.
    0

    Write Comment

    Please enter a first name

    Please enter a last name

    We will never share this with anyone.

    Featured Post

    How your wiki can always stay up-to-date

    Quip doubles as a “living” wiki and a project management tool that evolves with your organization. As you finish projects in Quip, the work remains, easily accessible to all team members, new and old.
    - Increase transparency
    - Onboard new hires faster
    - Access from mobile/offline

    Load balancing is the method of dividing the total amount of work performed by one computer between two or more computers. Its aim is to get more work done in the same amount of time, ensuring that all the users get served faster.
    International Data Corporation (IDC) prognosticates that before the current the year gets over disbursing on IT framework products to be sent in cloud environs will be $37.1B.
    This video shows how to set up a shell script to accept a positional parameter when called, pass that to a SQL script, accept the output from the statement back and then manipulate it in the Shell.
    Using examples as well as descriptions, and references to Books Online, show the different Recovery Models available in SQL Server and explain, as well as show how full, differential and transaction log backups are performed

    857 members asked questions and received personalized solutions in the past 7 days.

    Join the community of 500,000 technology professionals and ask your questions.

    Join & Ask a Question

    Need Help in Real-Time?

    Connect with top rated Experts

    18 Experts available now in Live!

    Get 1:1 Help Now