gkhotline
asked on
Access 2007, complex query
I have a table that consists of 3 columns: state, rate and zip. Each record in this table is unique. Every state has several zip codes and each one of those has a rate assigned to it. The table has about 40,000 records and I need to get it to a much smaller number. I need to combine some of the records and use zip ranges that have the same rate. Example:
State Zip Rate
AL 02103 .07
AL 02550 .07
AL 05982 .07
I need to make it look like this:
State Zip_from Zip_to Rate
AL 02103 05982 .07
That would have been easy: I could do a query grouping by state and rate and add two columns, Min(zip) and Max(zip). BUT there are occasions where, in the same state, there is another record with a zip code that falls into the same range but with a different rate. To add to the example above, there could be a record like this:
State Zip Rate
AL 03276 .08
It falls within this range:
State Zip_from Zip_to Rate
AL 02103 05982 .07
but it has a different rate. So, these ranges have to be somehow evaluated to see if there are other records with different zip codes and rates that would fall within the range. If so, the range would have to be stopped and then restarted. So, following the example above we would have 2 ranges instead of 1:
this range:
State Zip_from Zip_to Rate
AL 02103 05982 .07
will become 2 ranges:
State Zip_from Zip_to Rate
AL 02103 02550 .07
AL 05982 05982 .07
That way this record:
State Zip Rate
AL 03276 .08
will be in its own range.
I hope I explained it well enough.
Thanks
ASKER
Well, if the rate changes it would just be assigned into an already existing range or to a new range of its own, so it could only create one more record at the maximum. 40,000 records is only for now. The total number may increase significantly, and we need to make the operation as quick as possible with the fewest records possible. I have been given this task and it has to be done. I have been trying different things, but can't figure it out.
Thanks
gkhotline,
I have to second fyed's recommendation to leave your table as is. 40k records is not a lot in the grand scheme of things, and your current design is exactly what I would have recommended a priori.
Think about it: today you might have a valid range of 02103 - 05982, but if tomorrow that range changes you might have to do a heck of a lot of work to catch up.
Patrick
ASKER
As a follow-up: I need some sort of code that would go through every record and, for every state where we have the same rate, look at the zip codes one by one and see if any of them go in order (e.g. 02110, 02111, 02112). If that is the case, put them into a range. So a range would only be created if there are consecutive zip codes for the same state and rate. The rest of them would just be individual ranges of one record each.
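A minimal sketch of that record-by-record idea (in Python rather than VBA, purely for illustration; the function name and tuple layout are assumptions, not from this thread):

```python
def build_ranges(records):
    """records: list of (state, zip, rate) tuples, zips as 5-char strings.
    Collapse runs of consecutive zips with the same state and rate into
    (state, zip_from, zip_to, rate) ranges; everything else becomes a
    one-zip range."""
    ranges = []
    for state, zip_code, rate in sorted(records):
        last = ranges[-1] if ranges else None
        if (last is not None and last[0] == state and last[3] == rate
                and int(zip_code) == int(last[2]) + 1):
            # extend the open range: consecutive zip, same state and rate
            ranges[-1] = (state, last[1], zip_code, rate)
        else:
            # start a new one-zip range
            ranges.append((state, zip_code, zip_code, rate))
    return ranges

print(build_ranges([
    ("AL", "02110", 0.07), ("AL", "02111", 0.07),
    ("AL", "02112", 0.07), ("AL", "02550", 0.07),
]))
# → [('AL', '02110', '02112', 0.07), ('AL', '02550', '02550', 0.07)]
```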
ASKER
Basically, I would be importing about 70,000 records every month into this tool that I am building and then will run all these queries to try to get it down to a much smaller number. Right now it is down to 40,000 and I need to make it even smaller - that is my requirement. Then I would export this file to be used in a different application. And the structure that it needs to have is:
State zip_from zip_to rate
Any suggestions?
So what you are basically saying is that you have no choice, and management will not listen to common sense that says the current structure, when indexed properly will produce excellent results, without the added headache of this unnecessary step.
That does happen, from time to time. Sometimes, after the warned-about problems occur, they will change their minds.
ASKER CERTIFIED SOLUTION
sample database attached.
ZipRanges.mdb
SOLUTION
Oops, forgot to add the database back, and hope fyed doesn't mind too much that I borrowed it :D
ZipRanges.mdb
Mark,
I prefer SQL solutions over code, but I just cannot get my brain to work this early. I'll have to take a look at this later. The VBA took me about 5 minutes; it would probably have taken me an hour to come up with the SQL.
G'Day fyed,
For me, it is the other way round :) And I didn't mean anything by that comment "it can be done as a straight query" in reference to your code, just that it could also be done as a query. Hope I didn't offend. And yes, it was a bit cheeky to grab your database - ta muchly for posting it though :)
No offense taken. The point is to help the OP, and learn a bit at the same time.
I'm pretty good at nested queries, but start to get confused with 3rd order nesting.
ASKER
You guys are good! It will take me some time to understand step by step what you did... How do you want me to split the points? Given the fact fyed gave his solution first... Just want to be fair...
However you like - it is your question :)
And please just ask if there is anything I can help with explaining what the heck that query is really doing :) :)
ASKER
Thank you Mark,
Would it be too much to ask you to tell me step by step what you are doing? I could possibly create a new question for this so that you don't waste your time for nothing.
No worries, and no need for another question... Explanation is an integral part of the solution :)
So, the first step is understanding what the real problem is. Believe it or not, you actually had the solution, you just didn't recognise it as a query...
>> That would've been easy as I could do group by state and group by zip query and then add 2 columns Min(zip) and Max(zip),
followed by :
>> BUT there are occasions when we could find in the same state there is another record...
So, the basis is we can use min and max except when there is a break point. That break point is an intercepting zip code with a different rate. And you provided that information, you knew my query before I did *laughing*.
Step 1 was to then start constructing the "raw" elements, gathering the necessary information so we can run a query.
What it meant in this case was starting with a fairly "raw" query to see if we can get those breakpoints.
SELECT top 100 percent z.state,z.zip,z.rate, '???????' as next_rate_break
FROM tbl_Zip as Z
ORDER BY z.state, z.zip, z.rate
The challenge then becomes: how do we find that breakpoint? We might be able to join, but in reality we are really looking for just the very next instance of a zip code with a different rate from the "current" row.
So we introduce an in-line query :
(select top 1 n.zip from tbl_zip n where n.state = z.state and n.zip > z.zip and n.rate <> z.rate order by n.state, n.zip) as next_rate_break
Now, it doesn't run on its own because it needs the "current" row as a reference. We also don't have to be hugely concerned about where it lives in the overall scheme of things (just yet), so long as we know that for "this" current series, it will stop at "next_rate_break" if there is one.
Let's now incorporate that back into our query, replacing the question marks:
SELECT top 100 percent z.state,z.zip,z.rate,(select top 1 n.zip from tbl_zip n where n.state = z.state and n.zip > z.zip and n.rate <> z.rate order by n.state, n.zip) as next_rate_break
FROM tbl_Zip as Z
ORDER BY z.state, z.zip, z.rate
And running that we see that the detail rows are in fact picking up an instance of "next break point" when there really is one. But the rows look a bit of a mess.
So now we need to filter out that information, so we wrap the above inside another query, i.e. it becomes a subquery:
select state, min(zip) as start_zip, max(zip) as end_zip, rate
from (
SELECT top 100 percent z.state,z.zip,z.rate,(select top 1 n.zip from tbl_zip n where n.state = z.state and n.zip > z.zip and n.rate <> z.rate order by n.state, n.zip) as next_rate_break
FROM tbl_Zip as Z
ORDER BY z.state, z.zip, z.rate
) as SQ
Group by state, rate, next_rate_break
order by 1,2,3,4
Notice how we wrap up our subquery in brackets and give it an alias (a name) of "SQ". This is also where Access can be a bit tiresome, because it reckons it knows better and sometimes tries to replace some of our brackets with square brackets next time you edit it. Just be careful of that ;)
So, we now have an outer query selecting from our inner or subquery "SQ" and we need to (now) find the appropriate pieces from that raw information to report on.
Back to your original observations. We can use min and max except where there is a breakpoint. Guess what, we now know that there is a breakpoint, and can use that vital piece of information to help the MIN and MAX aggregate functions.
For our outer query, we already know we want to group things by State and rate, and as observed, that might be broken up because of those breakpoints, and so, that also must be included into our grouping...
select state, min(zip) as start_zip, max(zip) as end_zip, rate
from (...) as SQ
Group by state, rate, next_rate_break
order by 1,2,3,4
Which gives us the group by clause " Group by state, rate, next_rate_break ", and all we really have to do now is get the state and rate, and then the min(zip) and max(zip) within that breakpoint.
Oh, and to finish it off we also have to sort it in some sequence, and I tend to use the "column position" type notation rather than column names (seen in the order by).
So, it really becomes a compound select based on facts / information as we can gather...
1) get the "raw" rows (our subquery)
2) add in those breakpoints that will upset the sequence (i.e. a different rate within that state) - an inline query inside our subquery
3) Now select and present that information (the outer query)
Does that help explain it a bit more ?
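For anyone who wants to see the whole thing run end to end, here is a small sketch of the same breakpoint query against the sample data from this thread, translated to SQLite via Python's sqlite3 module (Access's TOP 1 becomes LIMIT 1, and the "top 100 percent" / subquery ORDER BY are dropped since SQLite needs neither). Table and column names follow the thread:

```python
import sqlite3

# In-memory table matching the thread's tbl_Zip and its sample rows.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tbl_zip (state TEXT, zip TEXT, rate REAL)")
con.executemany("INSERT INTO tbl_zip VALUES (?, ?, ?)", [
    ("AL", "02103", 0.07),
    ("AL", "02550", 0.07),
    ("AL", "03276", 0.08),  # the intercepting rate that splits the range
    ("AL", "05982", 0.07),
])

# Outer query does MIN/MAX per (state, rate, next_rate_break) group;
# the correlated inline query finds the next zip with a different rate.
rows = con.execute("""
    SELECT state, MIN(zip) AS zip_from, MAX(zip) AS zip_to, rate
    FROM (
        SELECT z.state, z.zip, z.rate,
               (SELECT n.zip FROM tbl_zip AS n
                WHERE n.state = z.state AND n.zip > z.zip AND n.rate <> z.rate
                ORDER BY n.zip LIMIT 1) AS next_rate_break
        FROM tbl_zip AS z
    ) AS SQ
    GROUP BY state, rate, next_rate_break
    ORDER BY state, zip_from
""").fetchall()

for row in rows:
    print(row)
# → ('AL', '02103', '02550', 0.07)
#   ('AL', '03276', '03276', 0.08)
#   ('AL', '05982', '05982', 0.07)
```

Notice the .08 record lands in its own one-zip range and splits the .07 run into two, exactly as described in the question.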
ASKER
Wow,
I didn't expect such a detailed answer! Thank you very much! I will go through it a little bit later, but it looks like if I can't understand this then all is lost for me.
Again, thank you! I will post my comments later...
No problems....
I do apologise for all the typos in there - and I didn't explain why I was selecting top 100 percent.
That part is an old habit. In SQL if you use an ORDER BY then you normally need a selection qualifier for that ORDER BY to work within a sub query or view in a lot of different databases (e.g. MS SQL Server). So, when / if using an order by as part of a subquery, then I normally throw in the top 100 percent.
In reality I am not so sure that either the order by or the top 100 percent is actually required in the subquery, apart from wanting to make sure that we have all the postcodes in the correct sequence for the state (regardless of when the postcodes were added to the table).
In those types of situations, if you are not sure what the sequence will really be, it is always best to take control of it yourself to avoid any unexpected results. In the above, however, the inline query will always get the "next-break-point" regardless of sequence, because it too uses an order by. So, if really needed, you could get some performance improvement by testing whether the "top 100 percent" and "order by" are needed within the "SQ" subquery (they are definitely needed in the inline query).
Also, if you have any "where" qualifiers then you would first look to add them to the "SQ" subquery part to minimise the amount of "raw" data, or, you can add it to the end as well if it is performing OK given the volume of data you have.
See? There is always room for more explanation, so if it doesn't make sense, simply ask away :)
I go with mark_wills's comment ID:33750782. It is easier for me to check. I did not check the other comments, which may give you the same accepted result.
If you still need more clarification to this comment, I may help.
ASKER
Thank you for your solutions and explanations. Very, very helpful!
I would advise against it for just the reason you are proposing. What happens when the rate changes for a particular zip that is in the middle of a range? You now have to split that one record into three. 40,000 records is a small table for a well-designed relational database.
Leave it the way it is; it is far easier to use in this format, despite the slight decrease in database size the ranges would give.