D B (United States of America) asked:

SQL Query to Find Earliest of 'Sets' of Data

I have the following data:
CustomerID, IssueDate, ClassID, ItemID, SUM(Quantity)

Multiple ItemIDs can be issued to a Customer on a single date. I want to identify the earliest (MIN()) IssueDate whose set of items and summed quantities matches the sets issued on subsequent dates.
Thus, with the given data:
CustID    IssueDate    ClassID    ItemID   Quantity
------    --------     -------    ------   --------
123       20130612          20       182        100
123       20130612          20       145         50
123       20130612          20       111         62
123       20130712          20       182        100
123       20130712          20       145         50
123       20130712          20       111         62
123       20130812          20       182        100
123       20130812          20       145         50
123       20130812          20       111         62
123       20130912          20       182         60
123       20130912          20       132         18
123       20130912          20       111         62


You will note that for 6/12/2013, 7/12/2013 and 8/12/2013 the exact same order was placed. Then on 9/12/2013 it changed.

What I want to do is return the earliest list of data associated with a customer for each distinct set (here, 6/12/2013 and 9/12/2013). If ANY of the values change from one order to the next, that order becomes the start of a new set and the items on that date are returned. Thus, my query should return:
CustID    IssueDate    ClassID    ItemID   Quantity
------    --------     -------    ------   --------
123       20130612          20       182        100
123       20130612          20       145         50
123       20130612          20       111         62
123       20130912          20       182         60
123       20130912          20       132         18
123       20130912          20       111         62

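A minimal setup sketch for reproducing the sample data above; the table name (yourtable, matching the answers below), the column types, and the int yyyymmdd encoding of IssueDate are assumptions, not the actual schema:

-- Repro sketch only; names, types and the yyyymmdd date encoding are
-- assumptions inferred from the sample data.
CREATE TABLE yourtable (
    CustomerID INT,
    IssueDate  INT,      -- yyyymmdd, e.g. 20130612 = 6/12/2013
    ClassID    INT,
    ItemID     INT,
    Quantity   INT
);

INSERT INTO yourtable (CustomerID, IssueDate, ClassID, ItemID, Quantity)
VALUES (123, 20130612, 20, 182, 100), (123, 20130612, 20, 145, 50), (123, 20130612, 20, 111, 62),
       (123, 20130712, 20, 182, 100), (123, 20130712, 20, 145, 50), (123, 20130712, 20, 111, 62),
       (123, 20130812, 20, 182, 100), (123, 20130812, 20, 145, 50), (123, 20130812, 20, 111, 62),
       (123, 20130912, 20, 182, 60),  (123, 20130912, 20, 132, 18), (123, 20130912, 20, 111, 62);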

Sean Stuber:

Something like this?

SELECT CustomerID, MIN(IssueDate) AS IssueDate, ClassID, ItemID, Quantity
  FROM (SELECT CustomerID, IssueDate, ClassID, ItemID, SUM(Quantity) AS Quantity
          FROM yourtable
         GROUP BY CustomerID, IssueDate, ClassID, ItemID) AS x
 GROUP BY CustomerID, ClassID, ItemID, Quantity
 ORDER BY CustomerID, IssueDate, ClassID, ItemID, Quantity
You can try something like this:

SELECT CustID, IssueDate, ClassID, ItemID, SUM(Quantity) AS TotalQuantity
  FROM TABLENAME
 GROUP BY CustID, IssueDate, ClassID, ItemID  -- grouping by Quantity as well would defeat the SUM
 ORDER BY CONVERT(DATETIME, LEFT(IssueDate, 8)) DESC


My first attempt wouldn't pick up the row

123       20130912          20       111         62

The problem is detecting a change in the order as a whole. To do that, we'll create an arbitrary new value to act as a checksum: the sum of the products of classid, itemid and quantity across the order. These numbers don't mean anything in themselves; they only serve as indicators of an order group.

  SELECT customerid,
         issuedate,
         classid,
         itemid,
         quantity
    FROM (SELECT customerid,
                 issuedate,
                 classid,
                 itemid,
                 quantity,
                 checksum,
                 ROW_NUMBER()
                     OVER(
                         PARTITION BY customerid,
                                      classid,
                                      itemid,
                                      quantity,
                                      checksum
                         ORDER BY issuedate
                     )
                     AS rn
            FROM (SELECT customerid,
                         issuedate,
                         classid,
                         itemid,
                         SUM(quantity)
                             OVER(
                                 PARTITION BY customerid,
                                              issuedate,
                                              classid,
                                              itemid
                             )
                             AS quantity,
                         SUM(classid * itemid * quantity)
                             OVER (PARTITION BY customerid, issuedate)
                             AS checksum
                    FROM yourtable) as x) as y
   WHERE rn = 1
ORDER BY customerid,
         issuedate,
         classid,
         itemid DESC,
         quantity



Note, the checksums aren't guaranteed. You could miss some changes if the checksums of two different orders happen to calculate out to the same value, but that should be fairly rare.
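If collisions are a worry, one possible hardening, assuming this is SQL Server (the thread uses CONVERT and ROW_NUMBER), is to aggregate per-row CHECKSUM values with CHECKSUM_AGG instead of summing products. It mixes all three columns rather than multiplying them, though collisions are still not impossible:

-- Sketch only: CHECKSUM_AGG of per-row CHECKSUM values as the order-level
-- fingerprint. Both are SQL Server built-ins; collisions become rarer than
-- with a plain sum of products, but remain possible.
SELECT customerid,
       issuedate,
       CHECKSUM_AGG(CHECKSUM(classid, itemid, quantity)) AS order_checksum
  FROM yourtable
 GROUP BY customerid, issuedate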
D B (Asker):

sdstuber: There are a finite number of values for ClassID and ItemID.
In fact, it is VERY likely that a majority of customers in a Class will have the same ItemID and Quantity, so I don't think your solution is workable.
D B (Asker):

Guy,

I read your article and I am not certain there is anything in it that pertains to what I am looking for. You describe how to get the last record based on different criteria; I may want the first, middle and last, just the first, or the first and last. The 'KEY' is the CustomerID. I ALWAYS want the first set of records (i.e., all records associated with the earliest date, which would be the first three in my example above). Then, whenever any of the other associated data changes, I want the set of records associated with the date of change (the last set of three records in my example). That set becomes the new 'base' set of data, and any time anything in that set changes in the future, I would capture that set of data as well.

A 'change' that would identify the start of a new set of data could be a change in quantity or class, an item added, an item removed, etc.
>>> In fact, it is VERY likely that a majority of customers in a Class will have the same ItemID and Quantity, so I don't think your solution is workable.

It's OK if individual items and quantities stay the same; in fact, that's what makes it work.

The problem would be if you had
Class 10 Item 10 Qty 20
Class 10 Item 12 Qty 5

For a checksum of 2600 (10*10*20 + 10*12*5 = 2000 + 600)


Then that changed to

Class 10 Item 25 Qty 10
Class 10 Item 2 Qty 5

For a checksum of 2600 again (10*25*10 + 10*2*5 = 2500 + 100)
I understand your doubt, and indeed it's not easy, as the "sets" that must match span several rows.

Hence, I would suggest grouping by the first fields, doing a LISTAGG of the last two fields, and then applying my article ...
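LISTAGG is Oracle syntax; on SQL Server of this vintage the usual analogue is STUFF(... FOR XML PATH('')). A sketch of that idea, with names assumed from the thread: build one signature string per customer and issue date, so identical orders produce identical strings and any change in the string marks a new set.

-- Sketch (SQL Server analogue of LISTAGG): one signature string per
-- (customer, issue date); a change in signature marks a new "set".
SELECT customerid,
       issuedate,
       STUFF((SELECT ';' + CAST(classid  AS varchar(10)) + ':'
                         + CAST(itemid   AS varchar(10)) + ':'
                         + CAST(quantity AS varchar(10))
                FROM yourtable AS t2
               WHERE t2.customerid = t1.customerid
                 AND t2.issuedate  = t1.issuedate
               ORDER BY classid, itemid
                 FOR XML PATH('')), 1, 1, '') AS order_signature
  FROM yourtable AS t1
 GROUP BY customerid, issuedate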
D B (Asker):

I was originally thinking I could use window functions to do this, but it doesn't seem to be coming together.
ASKER CERTIFIED SOLUTION (Sean Stuber)

(This solution is only available to Experts Exchange members.)
D B (Asker):

sdstuber: Works like a charm with a small number of customers (10-20). When I ran it against the production database (250K rows), I killed the query after two and a half hours.
I then tried to 'pre-filter' (not very accurate, but close enough for timing purposes) and restricted the query to a list of about 300 customer IDs (WHERE CustomerID IN (...)); it ran in a little over 5 minutes. However, the pre-filtering is not accurate enough for my needs, and a 2-1/2 hour+ run time is not acceptable.

I may just have to accept your answer (it does work, just too much data for my needs) and try to figure out another way to do it. YUCK! I had such hopes.
You could run it in batches of distinct customers and kick off several of them at a time.

Customers 1-100 in one query
Customers 101-200 in another
etc
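
A sketch of that batching idea, assuming NTILE is available (SQL Server 2005+); the bucket count and names are illustrative only:

-- Sketch: deal distinct customers into 10 buckets so several copies of the
-- main query can run concurrently, one bucket each.
SELECT customerid,
       NTILE(10) OVER (ORDER BY customerid) AS bucket_no
  INTO #customer_buckets
  FROM (SELECT DISTINCT customerid FROM yourtable) AS c;

-- Each worker then restricts the main query to its own bucket:
-- ... WHERE customerid IN (SELECT customerid
--                            FROM #customer_buckets
--                           WHERE bucket_no = 1)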
I wonder...

You might be able to do a simpler query for the big changes and then the STUFF/FOR XML approach for the more subtle ones.

By big changes I mean cases where it's obvious "something" changed, even without digging into exactly which line items did. For example: sum(qty), count(distinct item), min(item) or max(item).

By subtle changes, I mean all orders not flagged as obvious by the previous criteria. That would entail a NOT IN or NOT EXISTS type subquery using the previous results.

I know it would be possible; I'm not sure it would actually be more efficient to triple-dip into the table like that.
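
A sketch of what that cheap first pass might look like, assuming SQL Server 2012+ for LAG. Any date where these coarse aggregates differ from the previous date's definitely starts a new set; only the remaining dates would need the STUFF/FOR XML treatment:

-- Sketch: flag "obvious" changes by comparing cheap per-date aggregates
-- with the previous issue date for the same customer.
WITH per_date AS (
    SELECT customerid,
           issuedate,
           SUM(quantity)          AS total_qty,
           COUNT(DISTINCT itemid) AS item_count,
           MIN(itemid)            AS min_item,
           MAX(itemid)            AS max_item
      FROM yourtable
     GROUP BY customerid, issuedate
)
SELECT customerid, issuedate
  FROM (SELECT customerid, issuedate, total_qty, item_count, min_item, max_item,
               LAG(total_qty)  OVER (PARTITION BY customerid ORDER BY issuedate) AS prev_qty,
               LAG(item_count) OVER (PARTITION BY customerid ORDER BY issuedate) AS prev_count,
               LAG(min_item)   OVER (PARTITION BY customerid ORDER BY issuedate) AS prev_min,
               LAG(max_item)   OVER (PARTITION BY customerid ORDER BY issuedate) AS prev_max
          FROM per_date) AS d
 WHERE prev_qty IS NULL                 -- first order for the customer
    OR total_qty  <> prev_qty
    OR item_count <> prev_count
    OR min_item   <> prev_min
    OR max_item   <> prev_max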
D B (Asker):

The actual structure of the database is rather more complex than I made it in the question (why muddle the point?). There are, in fact, six tables involved in a join structure to pull together the five columns mentioned above.

Indexes are in place, although not the most efficient ones for this particular task. This is basically a data conversion effort that will be run 4-5 times in the next few months into a production environment, so changing a clustered index definition is not going to happen. The 250K rows mentioned above are just from 'stage 1' of the conversion effort; I expect that number to jump well over 1.5 million by the final conversion, although the 'set' of data I am interested in will stay roughly the same size.

To some degree I can create additional indexes on the fly for the conversion effort, then drop them afterwards, but I believe I've done about the best I can to optimize from that standpoint.
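
A sketch of such a throwaway conversion-time index; the index name and column choice are illustrative assumptions, not the real schema:

-- Hypothetical helper index for the conversion run, dropped afterwards.
CREATE NONCLUSTERED INDEX ix_conv_customer_date
    ON yourtable (CustomerID, IssueDate)
    INCLUDE (ClassID, ItemID, Quantity);

-- ... run the extraction queries ...

DROP INDEX ix_conv_customer_date ON yourtable;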

I may look at executing the FOR XML stand-alone into a temp table, working off that, creating sufficient indexes, and then doing the ranking off the temp table.
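
A sketch of that staging approach, reusing the FOR XML signature idea from the earlier sketch and assuming SQL Server 2012+ for LAG; names are assumed:

-- Sketch: materialize one signature row per (customer, date) once, index the
-- temp table, then do the comparison work there instead of the six-table join.
SELECT customerid,
       issuedate,
       STUFF((SELECT ';' + CAST(classid  AS varchar(10)) + ':'
                         + CAST(itemid   AS varchar(10)) + ':'
                         + CAST(quantity AS varchar(10))
                FROM yourtable AS t2
               WHERE t2.customerid = t1.customerid
                 AND t2.issuedate  = t1.issuedate
               ORDER BY classid, itemid
                 FOR XML PATH('')), 1, 1, '') AS order_signature
  INTO #order_signatures
  FROM yourtable AS t1
 GROUP BY customerid, issuedate;

CREATE CLUSTERED INDEX ix_sig ON #order_signatures (customerid, issuedate);

-- The dates that start a new "set" are those whose signature differs from
-- the previous date's signature for the same customer.
SELECT customerid, issuedate
  FROM (SELECT customerid, issuedate, order_signature,
               LAG(order_signature) OVER (PARTITION BY customerid
                                              ORDER BY issuedate) AS prev_sig
          FROM #order_signatures) AS d
 WHERE prev_sig IS NULL
    OR order_signature <> prev_sig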
D B (Asker):

We're on the right track if I can optimize it.