Solved

SQL Query to Find Earliest of 'Sets' of Data

Posted on 2014-01-20
238 Views
Last Modified: 2014-01-22
I have the following data:
CustomerID, IssueDate, ClassID, ItemID, SUM(Quantity)

There can be multiple ItemID rows issued to a customer on a given date. I want to identify the earliest (MIN()) IssueDate where the items and SUM(Quantity) issued on that date match the items and quantities issued on subsequent dates.
Thus, with the given data:
CustID    IssueDate    ClassID    ItemID   Quantity
------    --------     -------    ------   --------
123       20130612          20       182        100
123       20130612          20       145         50
123       20130612          20       111         62
123       20130712          20       182        100
123       20130712          20       145         50
123       20130712          20       111         62
123       20130812          20       182        100
123       20130812          20       145         50
123       20130812          20       111         62
123       20130912          20       182         60
123       20130912          20       132         18
123       20130912          20       111         62


You will note that for 6/12/2013, 7/12/2013 and 8/12/2013 the exact same order was placed. Then on 9/12/2013 it changed.

What I want to do is return the earliest date's list of data for each distinct set (e.g. 6/12/2013 and 9/12/2013) associated with a customer. If ANY of the values change from order to order, that date becomes the start of a new set and the items on that date are returned. Thus, my query should return:
CustID    IssueDate    ClassID    ItemID   Quantity
------    --------     -------    ------   --------
123       20130612          20       182        100
123       20130612          20       145         50
123       20130612          20       111         62
123       20130912          20       182         60
123       20130912          20       132         18
123       20130912          20       111         62
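For anyone who wants to experiment, here is a minimal setup sketch reproducing the sample data above; the table name and column types are assumptions, not taken from the original question:

```sql
-- Hypothetical table matching the sample data (names/types assumed)
CREATE TABLE IssuedItems (
    CustID    INT,
    IssueDate INT,   -- YYYYMMDD, as shown in the sample
    ClassID   INT,
    ItemID    INT,
    Quantity  INT
);

-- Multi-row VALUES requires SQL Server 2008 or later
INSERT INTO IssuedItems (CustID, IssueDate, ClassID, ItemID, Quantity) VALUES
(123, 20130612, 20, 182, 100), (123, 20130612, 20, 145, 50), (123, 20130612, 20, 111, 62),
(123, 20130712, 20, 182, 100), (123, 20130712, 20, 145, 50), (123, 20130712, 20, 111, 62),
(123, 20130812, 20, 182, 100), (123, 20130812, 20, 145, 50), (123, 20130812, 20, 111, 62),
(123, 20130912, 20, 182,  60), (123, 20130912, 20, 132, 18), (123, 20130912, 20, 111, 62);
```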


Question by:dbbishop
15 Comments
 
LVL 73

Expert Comment

by:sdstuber
ID: 39795142
Something like this?

select CustomerID, MIN(IssueDate) as IssueDate, ClassID, ItemID, Quantity
from (select CustomerID, IssueDate, ClassID, ItemID, SUM(Quantity) as Quantity
      from yourtable
      group by CustomerID, IssueDate, ClassID, ItemID) as x
group by CustomerID, ClassID, ItemID, Quantity
order by CustomerID, IssueDate, ClassID, ItemID, Quantity
 
LVL 142

Expert Comment

by:Guy Hengel [angelIII / a3]
ID: 39795164
this article explains how to approach this kind of question:
http://www.experts-exchange.com/Database/Miscellaneous/A_3203-DISTINCT-vs-GROUP-BY-and-why-does-it-not-work-for-my-query.html
 
LVL 11

Expert Comment

by:Guru Ji
ID: 39795166
You can try something like this:

Select CustID, IssueDate, ClassID, ItemID, Sum(Quantity) as TotalQuantity
from TABLENAME
group by CustID, IssueDate, ClassID, ItemID
Order by Convert(DATETIME, LEFT(IssueDate, 8)) desc


 
LVL 73

Expert Comment

by:sdstuber
ID: 39795212
My first attempt wouldn't pick up this row:

123       20130912          20       111         62

The problem is determining a change in the order as a whole. To do that, we'll create an arbitrary new value as a checksum: the sum of the product of ClassID, ItemID and Quantity for the order. These numbers don't mean anything in themselves; they only serve as indicators of an order group.

  SELECT customerid,
         issuedate,
         classid,
         itemid,
         quantity
    FROM (SELECT customerid,
                 issuedate,
                 classid,
                 itemid,
                 quantity,
                 checksum,
                 ROW_NUMBER()
                     OVER(
                         PARTITION BY customerid,
                                      classid,
                                      itemid,
                                      quantity,
                                      checksum
                         ORDER BY issuedate
                     )
                     AS rn
            FROM (SELECT customerid,
                         issuedate,
                         classid,
                         itemid,
                         SUM(quantity)
                             OVER(
                                 PARTITION BY customerid,
                                              issuedate,
                                              classid,
                                              itemid
                             )
                             AS quantity,
                         SUM(classid * itemid * quantity) OVER (PARTITION BY customerid, issuedate)
                             checksum
                    FROM yourtable) as x) as y
   WHERE rn = 1
ORDER BY customerid,
         issuedate,
         classid,
         itemid DESC,
         quantity



Note: the checksum isn't guaranteed to be unique. You could miss a change if the checksums of two different orders happen to compute to the same value, but that should be rare.
 
LVL 15

Author Comment

by:dbbishop
ID: 39795346
sdstuber: There are a finite number of values for ClassID and ItemID.
In fact, it is VERY likely that a majority of customers in a Class will have the same ItemID and Quantity, so I don't think your solution is workable.
 
LVL 15

Author Comment

by:dbbishop
ID: 39795365
Guy,

I read your article and I am not certain there is anything in it that pertains to what I am looking for. You define how to get the last record based on different criteria; I may want the first, middle and last, just the first, or the first and last. The 'KEY' is the CustomerID. I ALWAYS want the first set of records (i.e. all records associated with the earliest date, which would be the first three in my example above). Then, whenever any of the other associated data changes, I want the set of records associated with the date of change (the last set of three records in my example). That set becomes the new 'base' set of data, and any time in the future anything changes in that set, I would then capture that set of data.

A 'change' that would identify the start of a new set of data could be a change in quantity or class, an item added, an item removed, etc.
 
LVL 73

Expert Comment

by:sdstuber
ID: 39795411
>>> In fact, it is VERY likely that a majority of customers in a Class will have the same ItemID and Quantity, so I don't think your solution is workable.

It's OK if individual items and quantities stay the same; in fact, that's what makes it work.

The problem would be if you had
Class 10 Item 10 Qty 20
Class 10 Item 12 Qty 5

For a checksum of 2600


Then that changed to

Class 10 Item 25 Qty 10
Class 10 Item 2 Qty 5

For a checksum of 2600

 
LVL 142

Expert Comment

by:Guy Hengel [angelIII / a3]
ID: 39795465
I understand your doubt, and indeed it's not easy, as the "sets" that must be identical span several rows.

Hence, I would suggest grouping by the first fields, doing a LISTAGG of the last two fields, and then applying my article ...
 
LVL 15

Author Comment

by:dbbishop
ID: 39795964
I was originally thinking I could use window functions to do this, but it doesn't seem to be coming together.
 
LVL 73

Accepted Solution

by:
sdstuber earned 500 total points
ID: 39797139
You do use window functions, but the problem is creating a groupable/partitionable value that is an aggregate of the individual values within an order. That's what the checksum is for.

Unfortunately, SQL Server doesn't have a LISTAGG function like Oracle, but you can fake it with STUFF and a FOR XML PATH query.

try this...


  SELECT customerid,
         issuedate,
         classid,
         itemid,
         quantity
    FROM (SELECT customerid,
                 issuedate,
                 classid,
                 itemid,
                 quantity,
                 RANK() OVER(PARTITION BY customerid, orderlist ORDER BY issuedate) rn
            FROM (SELECT customerid,
                         issuedate,
                         classid,
                         itemid,
                         quantity,
                         stuff(
           (  SELECT ', ' + cast(classid as varchar(5)) +'|' + cast(itemid as varchar(5)) + '|' + cast(quantity as varchar(5))
                FROM yourtable t1
               WHERE t1.customerid = t2.customerid AND t1.issuedate = t2.issuedate
            ORDER BY classid, itemid, quantity
            FOR XML PATH('')
           ),
           1,
           2,
           ''
       )
           AS orderlist
                    FROM yourtable as t2) as x) as y
   WHERE rn = 1
ORDER BY customerid,
         issuedate,
         classid,
         itemid DESC,
         quantity


 
LVL 15

Author Comment

by:dbbishop
ID: 39798210
sdstuber: Works like a charm with a small number of customers (10-20). When I ran it against the production database (250K rows), I killed the query after two and a half hours.
I then tried to 'pre-filter' (not very accurate, but close enough for timing purposes) and restricted the query to a list of about 300 customer IDs (WHERE CustomerID IN (...)), and it ran in a little over 5 minutes. However, the pre-filtering is not accurate enough for my needs, and a 2-1/2 hour+ run time is not acceptable.

I may just have to accept your answer (it does work, just too much data for my needs) and try to figure out another way to do it. YUCK! I had such hopes.
 
LVL 73

Expert Comment

by:sdstuber
ID: 39798270
You could run it in batches of distinct customers and kick off several of them at a time.

Customers 1-100 in one query
Customers 101-200 in another
etc
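As a sketch of that batching idea (assuming the same hypothetical yourtable name used earlier in the thread, and SQL Server 2005+ for NTILE):

```sql
-- Split the distinct customers into 10 roughly equal buckets; run the
-- main query once per bucket value (1..10), possibly in parallel sessions
;WITH buckets AS (
    SELECT CustomerID,
           NTILE(10) OVER (ORDER BY CustomerID) AS bucket
      FROM (SELECT DISTINCT CustomerID FROM yourtable) AS c
)
SELECT CustomerID
  FROM buckets
 WHERE bucket = 1;  -- feed this list into the main query's CustomerID filter
```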
 
LVL 73

Expert Comment

by:sdstuber
ID: 39798342
I wonder...

You might be able to do a simpler query for the big changes, and then the STUFF/XML approach for the more subtle ones.

By simple changes I mean where it's obvious "something" changed, even if you don't dig into the details of exactly which line items did.  For example:  sum(qty)  or  count(distinct item) or min(item) or max(item)

By subtle changes, I mean all orders that aren't identified as obvious by previous criteria.
So that would entail a not in or not exists type subquery using previous results.

I know it would be possible; I'm just not sure it would actually be more efficient to triple-dip into the table like that.
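A rough sketch of that "obvious change" phase, assuming SQL Server 2012+ for LAG (the server version isn't stated in the thread) and the same hypothetical yourtable:

```sql
-- Phase 1: flag dates whose coarse per-date aggregates differ from the
-- previous date's for the same customer. Subtle changes (same totals,
-- different line items) would still need the STUFF / FOR XML pass.
WITH per_date AS (
    SELECT CustomerID, IssueDate,
           SUM(Quantity)          AS total_qty,
           COUNT(DISTINCT ItemID) AS item_count
      FROM yourtable
     GROUP BY CustomerID, IssueDate
)
SELECT CustomerID, IssueDate
  FROM (SELECT CustomerID, IssueDate, total_qty, item_count,
               LAG(total_qty)  OVER (PARTITION BY CustomerID ORDER BY IssueDate) AS prev_qty,
               LAG(item_count) OVER (PARTITION BY CustomerID ORDER BY IssueDate) AS prev_cnt
          FROM per_date) AS d
 WHERE prev_qty IS NULL             -- first order for the customer
    OR total_qty  <> prev_qty
    OR item_count <> prev_cnt;
```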
 
LVL 15

Author Comment

by:dbbishop
ID: 39799996
The actual structure of the database is rather more complex than I made it in the question (why muddle the point?). There are, in fact, six tables involved in a join structure to pull together the five columns mentioned above.

Indexes are in place, although not the most efficient for this particular task. This is basically a data conversion effort that will be run 4-5 times in the next few months into a production environment, so having a clustered index definition change is not going to happen. The 250K rows mentioned above are just from 'stage 1' of the conversion effort. I expect that number to jump well over 1.5 million by the final conversion, although the 'set' of data I am interested in will still be roughly the same.

To some degree I can create additional indexes on the fly for the conversion effort, then drop them afterwards, but I believe I've done about the best I can to optimize from that standpoint.

I may look at executing the FOR XML stand-alone into a temp table and working off it, creating sufficient indexes, and then do the ranking off the temp table.
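That staging idea might look something like this sketch (same assumed names as the accepted answer; the VARCHAR(400) cast is an assumption so the signature fits in an index key, and this is untested at production scale):

```sql
-- Build the per-date order signature once and persist it in a temp table
SELECT CustomerID, IssueDate,
       CAST(STUFF((SELECT ', ' + CAST(ClassID  AS VARCHAR(5)) + '|'
                              + CAST(ItemID   AS VARCHAR(5)) + '|'
                              + CAST(Quantity AS VARCHAR(5))
                     FROM yourtable t1
                    WHERE t1.CustomerID = t2.CustomerID
                      AND t1.IssueDate  = t2.IssueDate
                 ORDER BY ClassID, ItemID, Quantity
                      FOR XML PATH('')), 1, 2, '') AS VARCHAR(400)) AS orderlist
  INTO #signatures
  FROM yourtable t2
 GROUP BY CustomerID, IssueDate;

-- Index supports the partition/order of the subsequent ranking
CREATE CLUSTERED INDEX ix_sig ON #signatures (CustomerID, orderlist, IssueDate);

-- Then: RANK() OVER (PARTITION BY CustomerID, orderlist ORDER BY IssueDate)
-- against #signatures, joining the rn = 1 dates back to the detail rows
```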
 
LVL 15

Author Closing Comment

by:dbbishop
ID: 39799999
We're on the right track if I can optimize it.
