Solved

SQL Query to Find Earliest of 'Sets' of Data

Posted on 2014-01-20
15
244 Views
Last Modified: 2014-01-22
I have the following data:
CustomerID, IssueDate, ClassID, ItemID, SUM(Quantity)

There can be multiple ItemID items issued to a Customer on a date. I want to identify the earliest (MIN()) Date where the items, and SUM(Quantity) issued on that date match subsequently issued items and Quantity.
Thus, with the given data:
CustID    IssueDate    ClassID    ItemID   Quantity
------    --------     -------    ------   --------
123       20130612          20       182        100
123       20130612          20       145         50
123       20130612          20       111         62
123       20130712          20       182        100
123       20130712          20       145         50
123       20130712          20       111         62
123       20130812          20       182        100
123       20130812          20       145         50
123       20130812          20       111         62
123       20130912          20       182         60
123       20130912          20       132         18
123       20130912          20       111         62

Open in new window

You will note that for 6/12/2013, 7/12/2013 and 8/12/2013 the exact same order was placed. Then on 9/12/2013 it changed.

What I want to do is return the earliest (e.g. 6/12/2013 and 9/12/2013) list of data associated with a customer. If ANY of the values change from order to order, then that becomes the start of a new set and the items on that date are returned. Thus, the results from my query will return:
CustID    IssueDate    ClassID    ItemID   Quantity
------    --------     -------    ------   --------
123       20130612          20       182        100
123       20130612          20       145         50
123       20130612          20       111         62
123       20130912          20       182         60
123       20130912          20       132         18
123       20130912          20       111         62

Open in new window

0
Comment
Question by:dbbishop
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 6
  • 6
  • 2
  • +1
15 Comments
 
LVL 74

Expert Comment

by:sdstuber
ID: 39795142
Something like this?

select CustomerID, MIN(IssueDate) as IssueDate, ClassID, ItemID, Quantity from
(select CustomerID, IssueDate, ClassID, ItemID, SUM(Quantity) as Quantity from yourtable
group by CustomerID, IssueDate, ClassID, ItemID) as x
group by CustomerID, ClassID, ItemID, Quantity
order by CustomerID, IssueDate, ClassID, ItemID, Quantity
0
 
LVL 143

Expert Comment

by:Guy Hengel [angelIII / a3]
ID: 39795164
this article will explain how exactly you have to approach this kind of questions:
http://www.experts-exchange.com/Database/Miscellaneous/A_3203-DISTINCT-vs-GROUP-BY-and-why-does-it-not-work-for-my-query.html
0
 
LVL 11

Expert Comment

by:Guru Ji
ID: 39795166
you can try something like this

Select CustID, IssueDate, ClassID, ItemID, Sum(Quantity) as SUM
from TABLENAME
group by CustID, IssueDate, ClassID, ItemID, Quantity

Order by Convert(DATETIME, LEFT(IssueDate, 8)) desc

Open in new window

0
What does it mean to be "Always On"?

Is your cloud always on? With an Always On cloud you won't have to worry about downtime for maintenance or software application code updates, ensuring that your bottom line isn't affected.

 
LVL 74

Expert Comment

by:sdstuber
ID: 39795212
My first attempt wouldn't pick up the   row

123       20130912          20       111         62

The problem is determining a change in the order as a whole.  In order to do that we'll create an arbitrary new  as a checksum.  We'll sum the product of the classid, itemid and the quantity for the order.  These numbers don't really mean anything except to serve as indicators of an order group.

  SELECT customerid,
         issuedate,
         classid,
         itemid,
         quantity
    FROM (SELECT customerid,
                 issuedate,
                 classid,
                 itemid,
                 quantity,
                 checksum,
                 ROW_NUMBER()
                     OVER(
                         PARTITION BY customerid,
                                      classid,
                                      itemid,
                                      quantity,
                                      checksum
                         ORDER BY issuedate
                     )
                     AS rn
            FROM (SELECT customerid,
                         issuedate,
                         classid,
                         itemid,
                         SUM(quantity)
                             OVER(
                                 PARTITION BY customerid,
                                              issuedate,
                                              classid,
                                              itemid
                             )
                             AS quantity,
                         SUM(classid * itemid * quantity) OVER (PARTITION BY customerid, issuedate)
                             checksum
                    FROM yourtable) as x) as y
   WHERE rn = 1
ORDER BY customerid,
         issuedate,
         classid,
         itemid DESC,
         quantity

Open in new window


Note, the checksums aren't guaranteed.  You could possibly miss some changes if the checksums of two different orders happen to calculate out to the same value but it should be somewhat rare.
0
 
LVL 15

Author Comment

by:dbbishop
ID: 39795346
sdstuber: There are a finite number of values for ClassID and ItemID.
In fact, it is VERY likely that a majority of customers in a Class will have the same ItemID and Quantity, so I don't think your solution is workable.
0
 
LVL 15

Author Comment

by:dbbishop
ID: 39795365
Guy,

I read your article and I am not certain there is anything in it that pertains to what I am looking for. You define how to get the last based on different criteria. I may want the first, middle and last, just the first or the first and last. The 'KEY' is the CustomerID. I ALWAYS want the first set of records (I.e. all records associated with the earliest date, which would be the first three in my example above). Then, whenever any of the other associated data changes, I want the set of records associated with the date of change (the last set of three records in my example). That set becomes the new 'base' set of data, and anytime in the future anything changes in that set, I would then capture that set of data.

A 'change' that would identify the start of a new set of data could be a change in quantity or class, an item added, an item removed, etc.
0
 
LVL 74

Expert Comment

by:sdstuber
ID: 39795411
>>> In fact, it is VERY likely that a majority of customers in a Class will have the same ItemID and Quantity, so I don't think your solution is workable.

It's ok if individual items and quantities stay the same,  in fact, that's what makes it work.

The problem would be if you had
Class 10 Item 10 Qty 20
Class 10 Item 12 Qty 5

For a checksum of 2600


Then that changed to

Class 10 Item 25 Qty 10
Class 10 Item 2 Qty 5

For a checksum of 2600
0
 
LVL 143

Expert Comment

by:Guy Hengel [angelIII / a3]
ID: 39795465
I understand your doubt.
and indeed it's not easy, as your "set" to be identical are over several rows.

hence, I would suggest to group by the first fields, and do a listagg of the 2 last fiels, and then apply my article ...
0
 
LVL 15

Author Comment

by:dbbishop
ID: 39795964
I was originally thinking I could use window functions to do this, but doesn't seem to be coming together.
0
 
LVL 74

Accepted Solution

by:
sdstuber earned 500 total points
ID: 39797139
you do use window functions, but the problem is creating a groupable/partionable value that is an aggregate of the individual values within an order.  That's what the check sum is for.

Unfortuntately, sqlserver doesn't have a listagg function like Oracle; but you can fake it with stuff and xmlpath query

try this...


SELECT customerid,
         issuedate,
         classid,
         itemid,
         quantity
    FROM (SELECT customerid,
                 issuedate,
                 classid,
                 itemid,
                 quantity,
                 RANK() OVER(PARTITION BY customerid, orderlist ORDER BY issuedate) rn
            FROM (SELECT customerid,
                         issuedate,
                         classid,
                         itemid,
                         quantity,
                         stuff(
           (  SELECT ', ' + cast(classid as varchar(5)) +'|' + cast(itemid as varchar(5)) + '|' + cast(quantity as varchar(5))
                FROM yourtable t1
               WHERE t1.customerid = t2.customerid AND t1.issuedate = t2.issuedate
            ORDER BY classid, itemid, quantity
            FOR XML PATH('')
           ),
           1,
           2,
           ''
       )
           AS orderlist
                    FROM yourtable as t2) as x) as y
   WHERE rn = 1
ORDER BY customerid,
         issuedate,
         classid,
         itemid DESC,
         quantity

Open in new window

0
 
LVL 15

Author Comment

by:dbbishop
ID: 39798210
sdstuber: Works like a charm, with a small number of customers (10-20). When I ran it against the production database (250K rows), I killed the query after 2 and a half hours.
I then tried to 'pre-filter' (not very accurate, but for timing, it was close enough) and restricted the query to a list of about 300 customer ID (where CustomerID IN (...)), and it ran in a little over 5 minutes. However, the pre-filtering is not accurate enough for my needs and a 2-1/2 hour+ run time is not acceptable.

I may just have to accept your answer (it does work, just too much data for my needs) and try to figure out another way to do it. YUCK! I had such hopes.
0
 
LVL 74

Expert Comment

by:sdstuber
ID: 39798270
You could run it in batches of distinct customers and kick off several of them at a time.

Customers 1-100 in one query
Customers 101-200 in another
etc
0
 
LVL 74

Expert Comment

by:sdstuber
ID: 39798342
I wonder...

You might be able to do a simpler query for big changes and then the stuff/xml for more subtle.

By simple changes I mean where it's obvious "something" changed, even if you don't dig into the details of exactly which line items did.  For example:  sum(qty)  or  count(distinct item) or min(item) or max(item)

By subtle changes, I mean all orders that aren't identified as obvious by previous criteria.
So that would entail a not in or not exists type subquery using previous results.

I know it would be possible, I'm not sure it would actually be more efficient to triple-dip into the table like that
0
 
LVL 15

Author Comment

by:dbbishop
ID: 39799996
The actual structure of the database is rather more complex than I made the question (why muddle the point?) There are in fact, six tables involved in a join structure to pull together the five columns mentioned above.

Indexes are in place, although not the most efficient for this particular task. This is basically a data conversion effort that will be run 4-5 times in the next few months into a production environment, so having a clustered index definition change is not going to happen. The 250K rows mentioned above are just from 'stage 1' of the conversion effort. I expect that number to jump well over 1.5 million by the final conversion, although the 'set' of data I am interested in will still be roughly the same.

To some degree I can create additional indexes on the fly for the conversion effort, then drop them afterwards, but I believe I've done about the best I can to optimize from that standpoint.

I may look at executing the FOR XML stand-alone into a temp table and working off it, creating sufficient indexes, and then do the ranking off the temp table.
0
 
LVL 15

Author Closing Comment

by:dbbishop
ID: 39799999
We're on the right track if I can optimize it.
0

Featured Post

Free Backup Tool for VMware and Hyper-V

Restore full virtual machine or individual guest files from 19 common file systems directly from the backup file. Schedule VM backups with PowerShell scripts. Set desired time, lean back and let the script to notify you via email upon completion.  

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

If you have heard of RFC822 date formats, they can be quite a challenge in SQL Server. RFC822 is an Internet standard format for email message headers, including all dates within those headers. The RFC822 protocols are available in detail at:   ht…
This article explains how to reset the password of the sa account on a Microsoft SQL Server.  The steps in this article work in SQL 2005, 2008, 2008 R2, 2012, 2014 and 2016.
Michael from AdRem Software explains how to view the most utilized and worst performing nodes in your network, by accessing the Top Charts view in NetCrunch network monitor (https://www.adremsoft.com/). Top Charts is a view in which you can set seve…
This is my first video review of Microsoft Bookings, I will be doing a part two with a bit more information, but wanted to get this out to you folks.

622 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question