D B
asked on
SQL Query to Find Earliest of 'Sets' of Data
I have the following data:
CustomerID, IssueDate, ClassID, ItemID, SUM(Quantity)
There can be multiple ItemID items issued to a Customer on a date. I want to identify the earliest (MIN()) Date where the items, and SUM(Quantity) issued on that date match subsequently issued items and Quantity.
Thus, with the given data:
CustID IssueDate ClassID ItemID Quantity
------ -------- ------- ------ --------
123 20130612 20 182 100
123 20130612 20 145 50
123 20130612 20 111 62
123 20130712 20 182 100
123 20130712 20 145 50
123 20130712 20 111 62
123 20130812 20 182 100
123 20130812 20 145 50
123 20130812 20 111 62
123 20130912 20 182 60
123 20130912 20 132 18
123 20130912 20 111 62
You will note that for 6/12/2013, 7/12/2013 and 8/12/2013 the exact same order was placed. Then on 9/12/2013 it changed. What I want to do is return the earliest list of data associated with a customer (e.g. 6/12/2013 and 9/12/2013). If ANY of the values change from order to order, that becomes the start of a new set, and the items on that date are returned. Thus, my query will return:
CustID IssueDate ClassID ItemID Quantity
------ -------- ------- ------ --------
123 20130612 20 182 100
123 20130612 20 145 50
123 20130612 20 111 62
123 20130912 20 182 60
123 20130912 20 132 18
123 20130912 20 111 62
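(For reproducing the examples that follow: a minimal T-SQL setup for this sample data. The table name yourtable, the CustomerID spelling, and the char(8) date column are assumptions chosen to match the queries later in the thread, which mix CustID/CustomerID and TABLENAME/yourtable.)

CREATE TABLE yourtable (
    CustomerID int,
    IssueDate  char(8),   -- yyyymmdd text, matching the CONVERT/LEFT usage below
    ClassID    int,
    ItemID     int,
    Quantity   int
);

INSERT INTO yourtable (CustomerID, IssueDate, ClassID, ItemID, Quantity)
VALUES (123, '20130612', 20, 182, 100),
       (123, '20130612', 20, 145,  50),
       (123, '20130612', 20, 111,  62),
       (123, '20130712', 20, 182, 100),
       (123, '20130712', 20, 145,  50),
       (123, '20130712', 20, 111,  62),
       (123, '20130812', 20, 182, 100),
       (123, '20130812', 20, 145,  50),
       (123, '20130812', 20, 111,  62),
       (123, '20130912', 20, 182,  60),
       (123, '20130912', 20, 132,  18),
       (123, '20130912', 20, 111,  62);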
This article explains exactly how to approach this kind of question:
https://www.experts-exchange.com/Database/Miscellaneous/A_3203-DISTINCT-vs-GROUP-BY-and-why-does-it-not-work-for-my-query.html
You can try something like this:

SELECT CustID, IssueDate, ClassID, ItemID, SUM(Quantity) AS TotalQuantity
FROM TABLENAME
GROUP BY CustID, IssueDate, ClassID, ItemID
ORDER BY CONVERT(DATETIME, LEFT(IssueDate, 8)) DESC
My first attempt wouldn't pick up the row
123 20130912 20 111 62
The problem is determining a change in the order as a whole. To do that, we'll create an arbitrary new value to serve as a checksum: the sum of the product of the ClassID, ItemID and Quantity across the order. These numbers don't really mean anything in themselves; they only serve as indicators of an order group.
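A minimal sketch of just that checksum step, against the sample data above: the three identical orders all produce the same per-date value, and the 9/12/2013 order produces a different one.

SELECT customerid, issuedate,
       SUM(classid * itemid * quantity) AS order_checksum   -- arbitrary indicator, not a real total
FROM yourtable
GROUP BY customerid, issuedate
ORDER BY issuedate;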
Note, the checksums aren't guaranteed. You could possibly miss some changes if the checksums of two different orders happen to calculate out to the same value but it should be somewhat rare.
-- Outermost: keep only the first date on which each distinct order
-- "signature" (item, quantity, per-date checksum) appears.
SELECT customerid, issuedate, classid, itemid, quantity
FROM (SELECT customerid, issuedate, classid, itemid, quantity, checksum,
             ROW_NUMBER() OVER (
                 PARTITION BY customerid, classid, itemid, quantity, checksum
                 ORDER BY issuedate) AS rn
      FROM (SELECT customerid, issuedate, classid, itemid,
                   -- total quantity per line item on that date
                   SUM(quantity) OVER (
                       PARTITION BY customerid, issuedate, classid, itemid) AS quantity,
                   -- arbitrary checksum over the whole day's order
                   SUM(classid * itemid * quantity) OVER (
                       PARTITION BY customerid, issuedate) AS checksum
            FROM yourtable) AS x) AS y
WHERE rn = 1
ORDER BY customerid, issuedate, classid, itemid DESC, quantity
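Against the sample data above, this should return only the 6/12/2013 and 9/12/2013 rows, including the unchanged item 111 line on 9/12/2013, because the per-date checksum is part of the ROW_NUMBER partition.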
ASKER
sdstuber: There are a finite number of values for ClassID and ItemID.
In fact, it is VERY likely that a majority of customers in a Class will have the same ItemID and Quantity, so I don't think your solution is workable.
ASKER
Guy,
I read your article and I am not certain there is anything in it that pertains to what I am looking for. You define how to get the last record based on different criteria; I may want the first, middle and last, just the first, or the first and last. The 'KEY' is the CustomerID. I ALWAYS want the first set of records (i.e. all records associated with the earliest date, which would be the first three in my example above). Then, whenever any of the other associated data changes, I want the set of records associated with the date of change (the last set of three records in my example). That set becomes the new 'base' set of data, and any time in the future anything changes in that set, I would then capture that set of data.
A 'change' that would identify the start of a new set of data could be a change in quantity or class, an item added, an item removed, etc.
>>> In fact, it is VERY likely that a majority of customers in a Class will have the same ItemID and Quantity, so I don't think your solution is workable.
It's OK if individual items and quantities stay the same; in fact, that's what makes it work.
The problem would be if you had
Class 10  Item 10  Qty 20   (10 * 10 * 20 = 2000)
Class 10  Item 12  Qty  5   (10 * 12 *  5 =  600)
for a checksum of 2600, and then that changed to
Class 10  Item 25  Qty 10   (10 * 25 * 10 = 2500)
Class 10  Item  2  Qty  5   (10 *  2 *  5 =  100)
for a checksum of 2600 again: different orders, same checksum, so the change would be missed.
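If collisions like that are a concern, one hedged alternative (not what was posted above) is to let SQL Server mix the columns non-linearly before folding them into a per-date value:

-- Sketch: BINARY_CHECKSUM mixes the three columns non-linearly, so two
-- different orders are much less likely to collide than with a plain
-- product-sum; CHECKSUM_AGG then folds the per-row values into one
-- per-date value. Still not guaranteed collision-free.
SELECT customerid, issuedate,
       CHECKSUM_AGG(BINARY_CHECKSUM(classid, itemid, quantity)) AS order_checksum
FROM yourtable
GROUP BY customerid, issuedate;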
I understand your doubt.
And indeed it's not easy, as the 'sets' that need to match span several rows.
Hence, I would suggest grouping by the first fields, doing a LISTAGG of the last two fields, and then applying my article ...
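LISTAGG is Oracle syntax; a hedged T-SQL translation of the same idea might be STRING_AGG (SQL Server 2017+; older versions need the STUFF/FOR XML idiom sketched further down):

-- Sketch: build one comparable string per customer/date; two dates with
-- the same string carry exactly the same set of items and quantities.
SELECT customerid, issuedate,
       STRING_AGG(CONCAT(classid, ':', itemid, ':', quantity), ';')
           WITHIN GROUP (ORDER BY classid, itemid) AS order_set
FROM yourtable
GROUP BY customerid, issuedate;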
ASKER
I was originally thinking I could use window functions to do this, but it doesn't seem to be coming together.
ASKER CERTIFIED SOLUTION
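(The accepted solution is not reproduced here. Judging from the later references to STUFF/FOR XML, it presumably built an exact per-date signature string instead of an arithmetic checksum; a hedged reconstruction along those lines, against the sample table above:)

-- Sketch only, not the accepted answer verbatim: an exact signature
-- string per customer/date, so no checksum collisions are possible.
WITH sigs AS (
    SELECT customerid, issuedate,
           STUFF((SELECT ';' + CAST(i.classid AS varchar(10)) + ':'
                             + CAST(i.itemid AS varchar(10)) + ':'
                             + CAST(i.quantity AS varchar(10))
                  FROM yourtable AS i
                  WHERE i.customerid = o.customerid
                    AND i.issuedate  = o.issuedate
                  ORDER BY i.classid, i.itemid
                  FOR XML PATH('')), 1, 1, '') AS order_signature
    FROM yourtable AS o
    GROUP BY customerid, issuedate
),
firsts AS (
    -- like the checksum version: keep the first date each distinct set appears
    SELECT customerid, issuedate,
           ROW_NUMBER() OVER (PARTITION BY customerid, order_signature
                              ORDER BY issuedate) AS rn
    FROM sigs
)
SELECT t.customerid, t.issuedate, t.classid, t.itemid, t.quantity
FROM yourtable AS t
JOIN firsts AS f
  ON f.customerid = t.customerid
 AND f.issuedate  = t.issuedate
WHERE f.rn = 1
ORDER BY t.customerid, t.issuedate, t.classid, t.itemid;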
ASKER
sdstuber: Works like a charm with a small number of customers (10-20). When I ran it against the production database (250K rows), I killed the query after two and a half hours.
I then tried to 'pre-filter' (not very accurate, but for timing purposes it was close enough) and restricted the query to a list of about 300 customer IDs (WHERE CustomerID IN (...)), and it ran in a little over 5 minutes. However, the pre-filtering is not accurate enough for my needs, and a 2-1/2-hour+ run time is not acceptable.
I may just have to accept your answer (it does work, just too much data for my needs) and try to figure out another way to do it. YUCK! I had such hopes.
You could run it in batches of distinct customers and kick off several of them at a time.
Customers 1-100 in one query
Customers 101-200 in another
etc
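A hedged sketch of that batching idea; NTILE and the bucket count of 10 are arbitrary choices:

-- Sketch: split distinct customers into 10 roughly equal buckets; feed
-- each bucket's ID list into the main query's WHERE CustomerID IN (...)
-- and run the buckets in parallel sessions.
WITH buckets AS (
    SELECT customerid,
           NTILE(10) OVER (ORDER BY customerid) AS bucket
    FROM (SELECT DISTINCT customerid FROM yourtable) AS c
)
SELECT customerid
FROM buckets
WHERE bucket = 1;   -- repeat for buckets 2..10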
I wonder...
You might be able to do a simpler query for the big changes and then the STUFF/FOR XML approach for the more subtle ones.
By big changes I mean dates where it's obvious "something" changed, even if you don't dig into the details of exactly which line items did; for example, via SUM(qty), COUNT(DISTINCT item), MIN(item) or MAX(item).
By subtle changes, I mean all orders that aren't identified as obvious by the previous criteria. That would entail a NOT IN or NOT EXISTS type subquery using the previous results.
I know it would be possible; I'm just not sure it would actually be more efficient to triple-dip into the table like that.
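A hedged sketch of what that cheap "obvious change" pre-filter could look like:

-- Sketch: a coarse per-date fingerprint. If any of these values differ
-- from the customer's previous date, something definitely changed; only
-- dates with identical fingerprints need the full (expensive) comparison.
SELECT customerid, issuedate,
       SUM(quantity)          AS total_qty,
       COUNT(DISTINCT itemid) AS item_count,
       MIN(itemid)            AS min_item,
       MAX(itemid)            AS max_item
FROM yourtable
GROUP BY customerid, issuedate;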
ASKER
The actual structure of the database is rather more complex than I made the question appear (why muddle the point?). There are, in fact, six tables involved in a join structure to pull together the five columns mentioned above.
Indexes are in place, although not the most efficient for this particular task. This is basically a data conversion effort that will be run 4-5 times in the next few months into a production environment, so having a clustered index definition change is not going to happen. The 250K rows mentioned above are just from 'stage 1' of the conversion effort. I expect that number to jump well over 1.5 million by the final conversion, although the 'set' of data I am interested in will still be roughly the same.
To some degree I can create additional indexes on the fly for the conversion effort, then drop them afterwards, but I believe I've done about the best I can to optimize from that standpoint.
I may look at executing the FOR XML stand-alone into a temp table and working off it, creating sufficient indexes, and then do the ranking off the temp table.
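A hedged sketch of that plan, reusing the STRING_AGG signature from earlier (#order_sigs and the index name are illustrative):

-- Sketch: materialize one signature row per customer/date into a temp
-- table, index it, then do the cheap ranking pass off the temp table.
SELECT customerid, issuedate,
       STRING_AGG(CONCAT(classid, ':', itemid, ':', quantity), ';')
           WITHIN GROUP (ORDER BY classid, itemid) AS order_signature
INTO #order_sigs
FROM yourtable
GROUP BY customerid, issuedate;

CREATE CLUSTERED INDEX ix_order_sigs
    ON #order_sigs (customerid, issuedate);

SELECT customerid, issuedate            -- join back to yourtable on these
FROM (SELECT customerid, issuedate,
             ROW_NUMBER() OVER (PARTITION BY customerid, order_signature
                                ORDER BY issuedate) AS rn
      FROM #order_sigs) AS r
WHERE rn = 1;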
ASKER
We're on the right track if I can optimize it.
-- Collapse each (customer, date, class, item) to its total quantity,
-- then take the earliest date for each distinct line item + quantity.
SELECT CustomerID, MIN(IssueDate) AS IssueDate, ClassID, ItemID, Quantity
FROM (SELECT CustomerID, IssueDate, ClassID, ItemID, SUM(Quantity) AS Quantity
      FROM yourtable
      GROUP BY CustomerID, IssueDate, ClassID, ItemID) AS x
GROUP BY CustomerID, ClassID, ItemID, Quantity
ORDER BY CustomerID, IssueDate, ClassID, ItemID, Quantity