gkhotline asked:
Access 2007, complex query

I have a table with 3 columns: state, zip and rate. Each record in this table is unique. Every state has several zip codes, and each zip code has a rate assigned to it. The table has about 40,000 records and I need to get it down to a much smaller number by combining records into zip ranges that share the same rate. Example:
State    Zip       Rate
AL       02103   .07
AL       02550   .07
AL       05982   .07

I need to make it look like this:
State    Zip_from     Zip_to      Rate
AL         02103        05982       .07

That would have been easy: I could do a group-by-state-and-rate query and add 2 columns, Min(zip) and Max(zip). BUT there are occasions where, in the same state, another record has a zip code that falls into that range but with a different rate. To add to the example above, there could be a record like this:

State    Zip       Rate
AL       03276   .08

It falls within this range:

State    Zip_from     Zip_to      Rate
AL         02103        05982       .07

but it has a different rate. So these ranges have to be evaluated somehow to see if there are other records with different rates whose zip codes fall within the range. If so, the range has to be stopped and then restarted. Following the example above, we would have 2 ranges instead of 1:

this range:
State    Zip_from     Zip_to      Rate
AL         02103        05982       .07
will become 2 ranges:
State    Zip_from     Zip_to      Rate
AL         02103        02550       .07
AL         05982        05982       .07

That way this record:
State    Zip       Rate
AL       03276   .08
will be in its own range.

I hope I explained it well enough.
Thanks
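For reference, the simple grouping the question describes (before worrying about break points) might look something like the sketch below; the table name tbl_Zip is borrowed from the solutions further down:

     SELECT z.state, Min(z.zip) AS zip_from, Max(z.zip) AS zip_to, z.rate
     FROM tbl_Zip AS z
     GROUP BY z.state, z.rate

As the question explains, this falls apart as soon as a zip with a different rate lands inside one of the resulting ranges.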
Dale Fye:

Why do you feel the need to make this transition?  

I would advise against it, for just the reason you are proposing. What happens when the rate changes for a particular zip in the middle of a range? You then have to split that one record into 3. 40,000 records is a small table for a well-designed relational database.

Leave it the way it is. It is far easier to use in this format, and all you give up is a slight decrease in database size.
gkhotline (Asker):

Well, if the rate changes, the zip would just be assigned to an already existing range or to a new range of its own, so it could create one more record at the most. 40,000 records is only for now; the total may increase significantly, and we need to make the operation as quick as possible with the fewest records possible. I have been given this task and it has to be done. I have been trying different things but can't figure it out.
Thanks
gkhotline,

I have to second fyed's recommendation to leave your table as is. 40k records is not a lot in the grand scheme of things, and your current design is exactly what I would have recommended a priori.

Think about it: today you might have a valid range of 02103 - 05982, but if tomorrow that range changes you might have to do a heck of a lot of work to catch up.

Patrick
As a follow-up: I need some sort of code that would go through every record and, for every state where we have the same rate, look at the zip codes one by one and see if any of them run in order (e.g. 02110, 02111, 02112). If that is the case, put them into a range. So a range would only be created if there are consecutive zip codes for the same state and rate; the rest would just be individual ranges of 1 record (see the sketch after the next comment).
Basically, I will be importing about 70,000 records every month into this tool I am building and will then run all these queries to get the count down to a much smaller number. Right now it is down to 40,000 and I need to make it even smaller; that is my requirement. Then I will export this file to be used in a different application. The structure it needs to have is:

State   zip_from   zip_to    rate

Any suggestions?
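One possible way to express the consecutive-zip idea above is the classic "gaps and islands" pattern. This is only a sketch, not one of the posted solutions; it assumes zip is stored as 5-digit numeric text (so CLng(zip) works) and borrows the tbl_Zip name from the solutions below:

     SELECT state, Min(zip) AS zip_from, Max(zip) AS zip_to, rate
     FROM (
           SELECT t.state, t.rate, t.zip,
                  CLng(t.zip) - (SELECT Count(*)
                                 FROM tbl_Zip AS t2
                                 WHERE t2.state = t.state
                                   AND t2.rate = t.rate
                                   AND t2.zip <= t.zip) AS grp
           FROM tbl_Zip AS t
          ) AS islands
     GROUP BY state, rate, grp
     ORDER BY 1, 2

Zips that run consecutively within the same state and rate end up with the same grp value, so Min/Max collapses each run into one range, and everything else falls out as a range of one record.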
So what you are basically saying is that you have no choice, and management will not listen to common sense: the current structure, when indexed properly, will produce excellent results without the added headache of this unnecessary step.
That does happen from time to time. Sometimes, after the warned-about problems occur, they will change their minds.
ASKER CERTIFIED SOLUTION by Dale Fye:
Sample database attached.
ZipRanges.mdb
SOLUTION (posted by mark_wills):
Oops, forgot to add the database back, and hope fyed doesn't mind too much that I borrowed it :D
ZipRanges.mdb
Mark,

I prefer SQL solutions over code, but I just cannot get my brain to work this early. I'll have to take a look at this later. The VBA took me about 5 minutes; it would probably have taken me an hour to come up with the SQL.
G'Day fyed,

For me, it is the other way round :) And I didn't mean anything by that comment "it can be done as a straight query" in reference to your code, just that it could also be done as a query. Hope I didn't offend. And yes, it was a bit cheeky to grab your database - ta muchly for posting it though :)
No offense taken.  The point is to help the OP, and learn a bit at the same time.

I'm pretty good at nested queries, but start to get confused with 3rd order nesting.
You guys are good! It will take me some time to understand, step by step, what you did... How do you want me to split the points, given that fyed gave his solution first? I just want to be fair...
However you like - it is your question :)  

And please just ask if there is anything I can help explain about what the heck that query is really doing :) :)
Thank you Mark,
Would it be too much to ask you to walk me through, step by step, what you are doing? I could possibly create a new question for this so that you don't waste your time for nothing.
No worries, and no need for another question... Explanation is an integral part of the solution :)
So, the first step is understanding what the real problem is. Believe it or not, you actually had the solution, you just didn't recognise it as a query...

>> That would have been easy: I could do a group-by-state-and-rate query and add 2 columns, Min(zip) and Max(zip),

followed by :

>> BUT there are occasions where, in the same state, another record...

So the basis is: we can use Min and Max except when there is a break point, and that break point is an intervening zip code with a different rate. And you provided that information; you knew my query before I did *laughing*.

Step 1 was then to start constructing the "raw" elements, gathering the necessary information so we can run a query.

What it meant in this case was starting with a fairly "raw" query to see if we can get those breakpoints.

      SELECT top 100 percent z.state,z.zip,z.rate,  '???????'  as next_rate_break
      FROM tbl_Zip as Z
      ORDER BY z.state, z.zip, z.rate

The challenge then becomes: how do we find that breakpoint? We might be able to use a join, but in reality we are looking for just the very next instance of a zip code with a different rate from the "current" row.

So we introduce an in-line query :

     (select top 1 n.zip from tbl_zip n where n.state = z.state and n.zip > z.zip and n.rate <> z.rate order by n.state, n.zip) as next_rate_break

Now, it doesn't run on its own, because it needs the "current" row as a reference. We also don't have to be hugely concerned where it lives in the overall scheme of things (just yet), so long as we know that for "this" current series, it will stop at "next_rate_break" if there is one.
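If you want to see the inline query work in isolation, you can hard-code a sample "current" row; using the AL example data from the question:

     select top 1 n.zip
     from tbl_zip n
     where n.state = 'AL' and n.zip > '02103' and n.rate <> 0.07
     order by n.state, n.zip

Against that sample data it returns 03276, the first different-rate zip after 02103.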

Let's now incorporate that back into our query, replacing the question marks:

      SELECT top 100 percent z.state,z.zip,z.rate,(select top 1 n.zip from tbl_zip n where n.state = z.state and n.zip > z.zip and n.rate <> z.rate order by n.state, n.zip) as next_rate_break
      FROM tbl_Zip as Z
      ORDER BY z.state, z.zip, z.rate

And running that, we see that the detail rows are in fact picking up an instance of "next break point" when there really is one. But the rows look a bit of a mess.
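For illustration, against just the four sample AL rows from the question, the intermediate rows would be:

     state   zip     rate   next_rate_break
     AL      02103   .07    03276
     AL      02550   .07    03276
     AL      03276   .08    05982
     AL      05982   .07    (null)

Notice that the two .07 rows before the break share next_rate_break = 03276, while 05982 gets null; that difference is what will split the .07 range in two.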

So now we need to filter that information down, which we do by wrapping the above inside another query, i.e. it becomes a subquery:

select state, min(zip) as start_zip, max(zip) as end_zip, rate
from (
      SELECT top 100 percent z.state,z.zip,z.rate,(select top 1 n.zip from tbl_zip n where n.state = z.state and n.zip > z.zip and n.rate <> z.rate order by n.state, n.zip) as next_rate_break
      FROM tbl_Zip as Z
      ORDER BY z.state, z.zip, z.rate
     ) as SQ
Group by state, rate, next_rate_break
order by 1,2,3,4


Notice how we wrap up our subquery in brackets and give it an alias (a name) of "SQ". This is also where Access can be a bit tiresome: it reckons it knows better and sometimes tries to replace some of our brackets with square brackets the next time you edit the query. Just be careful of that ;)

So we now have an outer query selecting from our inner query (or subquery) "SQ", and we now need to find the appropriate pieces of that raw information to report on.

Back to your original observations: we can use Min and Max except where there is a breakpoint. Guess what, we now know where the breakpoints are, and can use that vital piece of information to help the MIN and MAX aggregate functions.

For our outer query, we already know we want to group things by state and rate; as observed, those groups might be broken up by the breakpoints, so the breakpoint must also be included in our grouping...

   select state, min(zip) as start_zip, max(zip) as end_zip, rate
   from (...) as SQ
   Group by state, rate, next_rate_break
   order by 1,2,3,4

Which gives us the group by clause "Group by state, rate, next_rate_break", and all we really have to do now is get the state and rate, and then the min(zip) and max(zip) within that breakpoint.

Oh, and to finish it off we also have to sort it into some sequence; I tend to use the "column position" notation rather than column names (seen in the order by).

So it really becomes a compound select, built up from the facts / information we can gather:
1) get the "raw" rows (our subquery)
2) add in those breakpoints that will upset the sequence, i.e. a different rate within that state (an inline query inside our subquery)
3) now select and present that information (the outer query)

Does that help explain it a bit more ?
Wow,
I didn't expect such a detailed answer! Thank you very much! I will go through it a little bit later, but it looks like if I can't understand this, then all is lost for me.
Again, thank you! I will post my comments later...
No problems....

I do apologise for all the typos in there - and I didn't explain why I was selecting top 100 percent.

That part is an old habit. In a lot of databases (e.g. MS SQL Server), if you use an ORDER BY within a subquery or view, you need a selection qualifier such as TOP for that ORDER BY to be accepted. So, when using an order by as part of a subquery, I normally throw in the top 100 percent.

In reality, I am not so sure that either the order by or the top 100 percent is actually required in the subquery, apart from wanting to make sure that we have all the postcodes in the correct sequence for the state (regardless of when the postcodes were added to the table).

In those types of situations, if you are not sure what the sequence will really be, it is always best to take control of it yourself to avoid any unexpected results. In the above, however, the inline query will always get the "next-break-point" regardless of sequence, because it uses its own order by. So, if we really needed to, we could pick up some performance by testing whether the "top 100 percent" and "order by" are needed within the "SQ" subquery at all (they are definitely needed in the inline query).
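The trimmed variant to test would simply drop those two clauses from "SQ" while keeping the order by in the inline query; a sketch:

     select state, min(zip) as start_zip, max(zip) as end_zip, rate
     from (
           SELECT z.state, z.zip, z.rate, (select top 1 n.zip from tbl_zip n where n.state = z.state and n.zip > z.zip and n.rate <> z.rate order by n.state, n.zip) as next_rate_break
           FROM tbl_Zip as Z
          ) as SQ
     Group by state, rate, next_rate_break
     order by 1,2,3,4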

Also, if you have any "where" qualifiers, you would first look to add them to the "SQ" subquery part, to minimise the amount of "raw" data; or you can add them to the end as well, if it is performing OK given the volume of data you have.
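For example, to restrict the whole query to a single state (AL here, purely for illustration), the filter goes inside "SQ":

     select state, min(zip) as start_zip, max(zip) as end_zip, rate
     from (
           SELECT top 100 percent z.state, z.zip, z.rate, (select top 1 n.zip from tbl_zip n where n.state = z.state and n.zip > z.zip and n.rate <> z.rate order by n.state, n.zip) as next_rate_break
           FROM tbl_Zip as Z
           WHERE z.state = 'AL'
           ORDER BY z.state, z.zip, z.rate
          ) as SQ
     Group by state, rate, next_rate_break
     order by 1,2,3,4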

See? Always room for more explanation, so if it doesn't make sense, simply ask away :)
I go with mark_wills's comment ID:33750782. It is easier for me to check. I did not check the other comments, which may give you the same accepted result.

If you still need more clarification on this comment, I may be able to help.
Thank you for your solutions and explanations. Very, very helpful!