
Solved

# SQL Query to Find Earliest of 'Sets' of Data

Posted on 2014-01-20
Medium Priority
245 Views
I have the following data:
CustomerID, IssueDate, ClassID, ItemID, SUM(Quantity)

There can be multiple ItemID items issued to a Customer on a date. I want to identify the earliest (MIN()) Date where the items, and SUM(Quantity) issued on that date match subsequently issued items and Quantity.
Thus, with the given data:
``````
CustID    IssueDate    ClassID    ItemID   Quantity
------    --------     -------    ------   --------
123       20130612          20       182        100
123       20130612          20       145         50
123       20130612          20       111         62
123       20130712          20       182        100
123       20130712          20       145         50
123       20130712          20       111         62
123       20130812          20       182        100
123       20130812          20       145         50
123       20130812          20       111         62
123       20130912          20       182         60
123       20130912          20       132         18
123       20130912          20       111         62
``````
You will note that for 6/12/2013, 7/12/2013 and 8/12/2013 the exact same order was placed. Then on 9/12/2013 it changed.

What I want to do is return the earliest (e.g. 6/12/2013 and 9/12/2013) list of data associated with a customer. If ANY of the values change from order to order, then that becomes the start of a new set and the items on that date are returned. Thus, the results from my query will return:
``````
CustID    IssueDate    ClassID    ItemID   Quantity
------    --------     -------    ------   --------
123       20130612          20       182        100
123       20130612          20       145         50
123       20130612          20       111         62
123       20130912          20       182         60
123       20130912          20       132         18
123       20130912          20       111         62
``````
0
Question by:dbbishop

LVL 74

Expert Comment

ID: 39795142
Something like this?

``````
select CustomerID, MIN(IssueDate) as IssueDate, ClassID, ItemID, Quantity
from (select CustomerID, IssueDate, ClassID, ItemID, SUM(Quantity) as Quantity
      from yourtable
      group by CustomerID, IssueDate, ClassID, ItemID) as x
group by CustomerID, ClassID, ItemID, Quantity
order by CustomerID, IssueDate, ClassID, ItemID, Quantity
``````
0

LVL 143

Expert Comment

ID: 39795164
This article explains exactly how to approach this kind of question:
http://www.experts-exchange.com/Database/Miscellaneous/A_3203-DISTINCT-vs-GROUP-BY-and-why-does-it-not-work-for-my-query.html
0

LVL 11

Expert Comment

ID: 39795166
You can try something like this (note that Quantity must be left out of the GROUP BY, otherwise the SUM has nothing to aggregate):

``````
Select CustID, IssueDate, ClassID, ItemID, Sum(Quantity) as TotalQuantity
from TABLENAME
group by CustID, IssueDate, ClassID, ItemID

Order by Convert(DATETIME, LEFT(IssueDate, 8)) desc
``````
0

LVL 74

Expert Comment

ID: 39795212
My first attempt wouldn't pick up the row

123       20130912          20       111         62

The problem is determining a change in the order as a whole. To do that, we'll create an arbitrary new column as a checksum: the sum of the products of classid, itemid and quantity across the order. These numbers don't really mean anything except to serve as indicators of an order group.

``````
SELECT customerid,
       issuedate,
       classid,
       itemid,
       quantity
  FROM (SELECT customerid,
               issuedate,
               classid,
               itemid,
               quantity,
               checksum,
               ROW_NUMBER()
                 OVER (PARTITION BY customerid, classid, itemid, quantity, checksum
                       ORDER BY issuedate) AS rn
          FROM (SELECT customerid,
                       issuedate,
                       classid,
                       itemid,
                       SUM(quantity)
                         OVER (PARTITION BY customerid, issuedate, classid, itemid) AS quantity,
                       SUM(classid * itemid * quantity)
                         OVER (PARTITION BY customerid, issuedate) AS checksum
                  FROM yourtable) AS x) AS y
 WHERE rn = 1
 ORDER BY customerid,
          issuedate,
          classid,
          itemid DESC,
          quantity
``````

Note, the checksums aren't guaranteed to be unique. You could miss some changes if the checksums of two different orders happen to calculate out to the same value, but that should be fairly rare.
0
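The logic of the checksum query above can be sketched in plain Python against the sample data from the question (pure illustration of the technique, not the production SQL):

```python
from collections import defaultdict

# Sample rows from the question: (CustID, IssueDate, ClassID, ItemID, Quantity)
rows = [
    (123, 20130612, 20, 182, 100), (123, 20130612, 20, 145, 50), (123, 20130612, 20, 111, 62),
    (123, 20130712, 20, 182, 100), (123, 20130712, 20, 145, 50), (123, 20130712, 20, 111, 62),
    (123, 20130812, 20, 182, 100), (123, 20130812, 20, 145, 50), (123, 20130812, 20, 111, 62),
    (123, 20130912, 20, 182, 60),  (123, 20130912, 20, 132, 18), (123, 20130912, 20, 111, 62),
]

# Checksum per (customer, date): sum of classid * itemid * quantity over the order
checksums = defaultdict(int)
for cust, date, cls, item, qty in rows:
    checksums[(cust, date)] += cls * item * qty

# Earliest date per (customer, checksum) -- the "rn = 1" step of the query
earliest = {}
for (cust, date), cs in checksums.items():
    if (cust, cs) not in earliest or date < earliest[(cust, cs)]:
        earliest[(cust, cs)] = date

# Keep only the rows belonging to those earliest dates
keep_dates = {(cust, date) for (cust, cs), date in earliest.items()}
result = [r for r in rows if (r[0], r[1]) in keep_dates]
```

Run against the sample data, this keeps only the 20130612 and 20130912 rows, matching the expected output in the question.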

LVL 15

Author Comment

ID: 39795346
sdstuber: There are a finite number of values for ClassID and ItemID.
In fact, it is VERY likely that a majority of customers in a Class will have the same ItemID and Quantity, so I don't think your solution is workable.
0

LVL 15

Author Comment

ID: 39795365
Guy,

I read your article and I am not certain there is anything in it that pertains to what I am looking for. You define how to get the last record based on different criteria. I may want the first, middle and last; just the first; or the first and last.

The 'key' is the CustomerID. I ALWAYS want the first set of records (i.e., all records associated with the earliest date, which would be the first three in my example above). Then, whenever any of the other associated data changes, I want the set of records associated with the date of change (the last set of three records in my example). That set becomes the new 'base' set of data, and any time in the future anything changes in that set, I would then capture that set of data.

A 'change' that would identify the start of a new set of data could be a change in quantity or class, an item added, an item removed, etc.
0

LVL 74

Expert Comment

ID: 39795411
>>> In fact, it is VERY likely that a majority of customers in a Class will have the same ItemID and Quantity, so I don't think your solution is workable.

It's OK if individual items and quantities stay the same; in fact, that's what makes it work.

The problem would be if you had
Class 10 Item 10 Qty 20
Class 10 Item 12 Qty 5

For a checksum of 2600

Then that changed to

Class 10 Item 25 Qty 10
Class 10 Item 2 Qty 5

For a checksum of 2600
0
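The two hypothetical orders above really do collide; a quick arithmetic check:

```python
# (class, item, qty) triples for the two different orders above
order_a = [(10, 10, 20), (10, 12, 5)]
order_b = [(10, 25, 10), (10, 2, 5)]

def checksum(order):
    # same formula as the query: sum of class * item * qty
    return sum(c * i * q for c, i, q in order)

# both checksums come out to 2600 even though the orders differ:
# 10*10*20 + 10*12*5 = 2000 + 600 = 2600
# 10*25*10 + 10*2*5  = 2500 + 100 = 2600
```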

LVL 143

Expert Comment

ID: 39795465
Indeed, it's not easy, as the 'sets' that have to be identical span several rows.

Hence, I would suggest grouping by the first fields, doing a LISTAGG of the last two fields, and then applying my article ...
0
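The idea behind the LISTAGG suggestion is to collapse each date's line items into a single string so a whole order can be compared in one column. A Python sketch of the kind of signature such an aggregation would produce (illustrative only, using an abbreviated sample):

```python
from collections import defaultdict

# Abbreviated sample: two identical orders, then a changed one
rows = [
    (123, 20130612, 20, 182, 100), (123, 20130612, 20, 145, 50),
    (123, 20130712, 20, 182, 100), (123, 20130712, 20, 145, 50),
    (123, 20130912, 20, 182, 60),  (123, 20130912, 20, 132, 18),
]

orders = defaultdict(list)
for cust, date, cls, item, qty in rows:
    orders[(cust, date)].append((cls, item, qty))

def signature(items):
    # deterministic order, then join -- what a LISTAGG/STRING_AGG would build
    return ",".join(f"{c}|{i}|{q}" for c, i, q in sorted(items))

sigs = {key: signature(items) for key, items in orders.items()}
```

Identical orders produce identical signature strings, so a plain MIN(IssueDate) GROUP BY signature then finds each set's first date.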

LVL 15

Author Comment

ID: 39795964
I was originally thinking I could use window functions to do this, but it doesn't seem to be coming together.
0

LVL 74

Accepted Solution

sdstuber earned 2000 total points
ID: 39797139
You do use window functions, but the problem is creating a groupable/partitionable value that is an aggregate of the individual values within an order. That's what the checksum is for.

Unfortunately, SQL Server doesn't have a LISTAGG function like Oracle, but you can fake it with STUFF and a FOR XML PATH query.

try this...

``````
SELECT customerid,
       issuedate,
       classid,
       itemid,
       quantity
  FROM (SELECT customerid,
               issuedate,
               classid,
               itemid,
               quantity,
               RANK() OVER (PARTITION BY customerid, orderlist
                            ORDER BY issuedate) AS rn
          FROM (SELECT customerid,
                       issuedate,
                       classid,
                       itemid,
                       quantity,
                       STUFF((SELECT ', ' + CAST(classid AS varchar(5)) + '|'
                                          + CAST(itemid AS varchar(5)) + '|'
                                          + CAST(quantity AS varchar(5))
                                FROM yourtable t1
                               WHERE t1.customerid = t2.customerid
                                 AND t1.issuedate = t2.issuedate
                               ORDER BY classid, itemid, quantity
                                 FOR XML PATH('')),
                             1, 2, '') AS orderlist
                  FROM yourtable AS t2) AS x) AS y
 WHERE rn = 1
 ORDER BY customerid,
          issuedate,
          classid,
          itemid DESC,
          quantity
``````
0
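For anyone wanting to test the overall approach outside SQL Server: the same "first date per order signature" logic can be reproduced end to end with SQLite, where GROUP_CONCAT stands in for the STUFF/FOR XML trick (a sketch of the shape of the query, not the accepted answer itself; table and column names assumed from the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE yourtable (custid INT, issuedate INT, classid INT, itemid INT, quantity INT);
INSERT INTO yourtable VALUES
 (123,20130612,20,182,100),(123,20130612,20,145,50),(123,20130612,20,111,62),
 (123,20130712,20,182,100),(123,20130712,20,145,50),(123,20130712,20,111,62),
 (123,20130812,20,182,100),(123,20130812,20,145,50),(123,20130812,20,111,62),
 (123,20130912,20,182,60),(123,20130912,20,132,18),(123,20130912,20,111,62);
""")

result = conn.execute("""
WITH sigs AS (              -- one signature string per customer/date
  SELECT custid, issuedate,
         GROUP_CONCAT(classid || '|' || itemid || '|' || quantity) AS orderlist
    FROM (SELECT * FROM yourtable ORDER BY classid, itemid, quantity)
   GROUP BY custid, issuedate
), firsts AS (              -- earliest date per distinct signature
  SELECT custid, orderlist, MIN(issuedate) AS issuedate
    FROM sigs GROUP BY custid, orderlist
)
SELECT t.custid, t.issuedate, t.classid, t.itemid, t.quantity
  FROM yourtable t
  JOIN firsts f ON f.custid = t.custid AND f.issuedate = t.issuedate
 ORDER BY t.issuedate, t.itemid DESC
""").fetchall()
```

Caveat: SQLite does not guarantee GROUP_CONCAT's aggregation order, so a production version would need a deterministic concatenation; this is only a demonstration.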

LVL 15

Author Comment

ID: 39798210
sdstuber: Works like a charm with a small number of customers (10-20). When I ran it against the production database (250K rows), I killed the query after two and a half hours.
I then tried to 'pre-filter' (not very accurate, but for timing it was close enough) and restricted the query to a list of about 300 customer IDs (WHERE CustomerID IN (...)), and it ran in a little over 5 minutes. However, the pre-filtering is not accurate enough for my needs, and a 2-1/2 hour+ run time is not acceptable.

I may just have to accept your answer (it does work, just too much data for my needs) and try to figure out another way to do it. YUCK! I had such hopes.
0

LVL 74

Expert Comment

ID: 39798270
You could run it in batches of distinct customers and kick off several of them at a time.

Customers 1-100 in one query
Customers 101-200 in another
etc
0
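The batching idea might look like this (hypothetical helper; each batch's IDs would be substituted into the query's IN list and the batches run in parallel):

```python
def batches(ids, size=100):
    """Yield successive fixed-size slices of a customer ID list."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

# e.g. 250 customer IDs split into runs of 100, 100 and 50,
# each run feeding one copy of the query
sizes = [len(b) for b in batches(list(range(250)), 100)]
```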

LVL 74

Expert Comment

ID: 39798342
I wonder...

You might be able to do a simpler query for big changes and then the STUFF/XML query for the more subtle ones.

By simple changes I mean where it's obvious "something" changed, even if you don't dig into the details of exactly which line items did. For example: SUM(qty), COUNT(DISTINCT item), MIN(item) or MAX(item).

By subtle changes, I mean all orders that aren't identified as obvious by the previous criteria. That would entail a NOT IN or NOT EXISTS type subquery using the previous results.

I know it would be possible; I'm not sure it would actually be more efficient to triple-dip into the table like that.
0
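The two-pass idea could be sketched like this: a cheap coarse fingerprint per date first, with the expensive string comparison reserved only for dates whose coarse fingerprints tie (illustrative Python using the aggregates named in the comment above):

```python
from collections import defaultdict

# Abbreviated sample: the first and last orders from the question
rows = [
    (123, 20130612, 20, 182, 100), (123, 20130612, 20, 145, 50), (123, 20130612, 20, 111, 62),
    (123, 20130912, 20, 182, 60),  (123, 20130912, 20, 132, 18), (123, 20130912, 20, 111, 62),
]

orders = defaultdict(list)
for cust, date, cls, item, qty in rows:
    orders[(cust, date)].append((item, qty))

def coarse(items):
    # cheap aggregates: sum(qty), count(distinct item), min(item), max(item)
    qtys = [q for _, q in items]
    ids = [i for i, _ in items]
    return (sum(qtys), len(set(ids)), min(ids), max(ids))

fingerprints = {key: coarse(items) for key, items in orders.items()}
# a date whose coarse fingerprint is new is an "obvious" change;
# only dates that tie would fall through to the full STUFF/XML comparison
```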

LVL 15

Author Comment

ID: 39799996
The actual structure of the database is rather more complex than I made the question (why muddle the point?). There are, in fact, six tables involved in a join structure to pull together the five columns mentioned above.

Indexes are in place, although not the most efficient for this particular task. This is basically a data conversion effort that will be run 4-5 times in the next few months into a production environment, so having a clustered index definition change is not going to happen. The 250K rows mentioned above are just from 'stage 1' of the conversion effort. I expect that number to jump well over 1.5 million by the final conversion, although the 'set' of data I am interested in will still be roughly the same.

To some degree I can create additional indexes on the fly for the conversion effort, then drop them afterwards, but I believe I've done about the best I can to optimize from that standpoint.

I may look at executing the FOR XML stand-alone into a temp table and working off it, creating sufficient indexes, and then do the ranking off the temp table.
0

LVL 15

Author Closing Comment

ID: 39799999
We're on the right track if I can optimize it.
0
