Link to home
Start Free TrialLog in
Avatar of Steve Berger
Steve BergerFlag for United States of America

asked on

How to remove the overlapping data?

Hi,
      I have data like below.
MODEL           PRICE            START_DATE_ACTIVE            END_DATE_ACTIVE
SONY             100                         1/1/1901                           12/31/2005
SONY             200                         1/1/2006                           12/31/2007
SONY             300                         1/1/2008                           12/31/2008
SONY             400                         2/1/2008                           12/31/2009
SONY             500                         1/1/2009                           12/31/2010
                     The above the model SONY is having the price range that depends on the start_date_active and end_date_active column. For example in the first row, the model SONY price is 100 from 1/1/2006 to 12/31/2007. The same model SONY price is 200 from 1/1/2008 to 12/31/2008. So the model price will come like this. But in some cases the date will be overlapping. See the row number 3 and 4. I should remove the rownumber 4. How can i remove the overlapping date data?

Thanks
Avatar of Sean Stuber
Sean Stuber

try this....
DELETE FROM   your_table
      WHERE   ROWID IN
                      (SELECT   rid
                         FROM   (SELECT   ROWID rid,
                                          start_date_active sda,
                                          end_date_active eda,
                                          LAG(end_date_active)
                                              OVER (
                                                  PARTITION BY model
                                                  ORDER BY start_date_active, end_date_active
                                              )
                                              prevend,
                                          LEAD(start_date_active)
                                              OVER (
                                                  PARTITION BY model
                                                  ORDER BY start_date_active, end_date_active
                                              )
                                              nextstart
                                   FROM   your_table)
                        WHERE   sda < prevend AND eda > nextstart)

Open in new window

Avatar of Steve Berger

ASKER

Hi thanks for your reply. It is returning no rows.

Can you clarify:
Row number 3 and 4 are overlapping but rownumber 4 exceeds the end dat of rownumber 3. Also rownumber 4 and 5 are overlapping but rownumber 5 end date exceeds the end date of row number4.
What is your ouptut if you don't have rownumber 5 in your sample set?
Hi sdstuber, i added one more condition in this query. It is not fetching any rows. In my table, i have 10000 rows. I want to see only the particular item. So i have added one AND condition in that query like below. Please verify my below query is correct or not.

SELECT * FROM   sample_table
      WHERE   ROWID IN
                      (SELECT   rid
                         FROM   (SELECT   ROWID rid,
                                          new_start_date sda,
                                          new_end_date eda,
                                          LAG(new_end_date)
                                              OVER (
                                                  PARTITION BY ITEM_NAME
                                                  ORDER BY new_start_date, new_end_date
                                              )
                                              prevend,
                                          LEAD(new_start_date)
                                              OVER (
                                                  PARTITION BY ITEM_NAME
                                                  ORDER BY new_start_date, new_end_date
                                              )
                                              nextstart
                                   FROM   sample_table)
                        WHERE   sda < prevend AND eda > nextstart)                                                
                        AND ITEM_NAME = 'PHILIPS'
that will work or you can put it in the inner query
SELECT   *
  FROM   sample_table
 WHERE   ROWID IN
                 (SELECT   rid
                    FROM   (SELECT   ROWID rid,
                                     new_start_date sda,
                                     new_end_date eda,
                                     LAG(new_end_date)
                                         OVER (
                                             PARTITION BY item_name
                                             ORDER BY new_start_date, new_end_date
                                         )
                                         prevend,
                                     LEAD(new_start_date)
                                         OVER (
                                             PARTITION BY item_name
                                             ORDER BY new_start_date, new_end_date
                                         )
                                         nextstart
                              FROM   sample_table
                             WHERE   item_name = 'PHILIPS')
                   WHERE   sda < prevend AND eda > nextstart)

Open in new window

Hi Sharath,
      you are correct. In my case overlapping means, it should not start with same year again with another row. For the above scenario, rownum 3 is ended with 12/31/2008. So that particular model price is deactivated after this date. So it has to start the new price from  1/1/2009  date. So nownum 5 is valid. In between rownum 3 and 5, again the start date started from the year which is already mentioned in rownum 3. So we should not consider this right. I thing now you can understand right.

Thanks
Hi sdstuber,
         If i change the condition like below, it is giving the overlapping data for me. See my attached file for my actual data.
WHERE   sda < prevend AND eda < nextstart). Can you tell me what is the reason?
TESTING-DATA.xls
the problem was your original sample violated both ends
so,  sda < prevend AND eda > nextstart  found it.

but your new sample only violates on one end.
since all "good data will have eda < nextstart  that's not really a helpful condition.
so  sda < prevend AND eda < nextstart
would fail to find bad data like your original sample


changing the AND to OR will work except then you'll get 2 rows returns for each overlap
because a violation on either end will guarantee the other row will fail on the other end.

from your last example

2008/12/31 (end) on line 6 > 2008/12/01 (start) on line 7 so that's an overlap
but by the same logic  
2008/12/01 (start) on line 7 < 2008/12/31 (end) on line 6 so that's an overlap
we could check on just one end, but then if you have cases like in your original example then you'll get false positives around the problem row.

To illustrate, from your first example, the 400 line is the problem.
if we check only the end then the 300 line will show up as a problem as well as the 400 line.
if we check only the start then the 500 line will show up as a problem as well as the 400 line

SONY             300                         1/1/2008                           12/31/2008
SONY             400                         2/1/2008                           12/31/2009
SONY             500                         1/1/2009                           12/31/2010
I don't want to remove 2 overlapping records. I want remove only the line number 7. becasue that start date is not ended to other rows. For example, if the end_date is 2008/12/31, the next row should start from 1/1/2009. But line number 6 is starting with 2008/12/01 right. So we need to consider this as a overlap. So i need to remove only those rows. But my first example also applicable for some cases. It should ended with after the end date of previous row. We need to consider both the cases.
Unless i can give only like below right.

WHERE   sda < prevend
easiest solution is probably to run two deletes...

first delete all rows that have a double violation
then delete rows where the new_start_date < previous row's end date
DELETE FROM sample_table
      WHERE ROWID IN
                    (SELECT rid
                       FROM (SELECT ROWID rid,
                                    new_start_date,
                                    new_end_date,
                                    LAG(new_start_date)
                                        OVER (
                                            PARTITION BY item_name
                                            ORDER BY new_start_date, new_end_date
                                        )
                                        prevstart,
                                    LAG(new_end_date)
                                        OVER (
                                            PARTITION BY item_name
                                            ORDER BY new_start_date, new_end_date
                                        )
                                        prevend,
                                    LEAD(new_start_date)
                                        OVER (
                                            PARTITION BY item_name
                                            ORDER BY new_start_date, new_end_date
                                        )
                                        nextstart,
                                    LEAD(new_end_date)
                                        OVER (
                                            PARTITION BY item_name
                                            ORDER BY new_start_date, new_end_date
                                        )
                                        nextend
                               FROM sample_table
                              WHERE item_name = 'SONY')
                      WHERE new_start_date < prevend AND new_end_date > nextstart);
 
DELETE FROM sample_table
      WHERE ROWID IN
                    (SELECT rid
                       FROM (SELECT ROWID rid,
                                    new_start_date,
                                    new_end_date,
                                    LAG(new_start_date)
                                        OVER (
                                            PARTITION BY item_name
                                            ORDER BY new_start_date, new_end_date
                                        )
                                        prevstart,
                                    LAG(new_end_date)
                                        OVER (
                                            PARTITION BY item_name
                                            ORDER BY new_start_date, new_end_date
                                        )
                                        prevend,
                                    LEAD(new_start_date)
                                        OVER (
                                            PARTITION BY item_name
                                            ORDER BY new_start_date, new_end_date
                                        )
                                        nextstart,
                                    LEAD(new_end_date)
                                        OVER (
                                            PARTITION BY item_name
                                            ORDER BY new_start_date, new_end_date
                                        )
                                        nextend
                               FROM sample_table
                              WHERE item_name = 'SONY')
                      WHERE new_start_date < prevend );

Open in new window

you can't "just" do the delete where next_start_date < prevend  because rows with double violation will cause a double row delete.  themselves plus the following row.

that's why I suggest doing it in two steps.
Ya it was great. I am putting those two queries in union and it is working for both condition. But if i make it as UNION, how should i delete?
don't combine them.

do two deletes

if you do both at once then you'll get extra deletes like I described above.

first delete the rows that violate on both ends
then delete the rows that violate on one end.
I mean how should i delete those records from the table?
if you want to try to combine them then you'll have to write a much more complicated set of filters in order to determine if rows that violate on only one end should be kept anyway because their violation is caused by another row's double ended violation.

much simpler to just do two deletes
oh sorry, didn't see you posted again.

how to delete?
with the two deletes I posted above.
can you try this?
select t1.MODEL,t1.PRICE,t1.START_DATE_ACTIVE,t1.END_DATE_ACTIVE
  from (select *,ROW_NUMBER() over (partition by Model order by Price) RN from your_table) t1
  left join (select *,ROW_NUMBER() over (partition by Model order by Price) RN from your_table) t2
    on t1.Model = t2.Model and t1.RN - 1 = t2.RN
  left join (select *,ROW_NUMBER() over (partition by Model order by Price) RN from your_table ) t3
    on t1.Model = t3.Model and t1.RN + 1 = t3.RN
 where case when (t1.START_DATE_ACTIVE < t2.END_DATE_ACTIVE) and 
                 (t1.END_DATE_ACTIVE > t3.START_DATE_ACTIVE) and 
                 (t1.END_DATE_ACTIVE < t3.END_DATE_ACTIVE)  then 1 else 0 end = 0
 order by t1.Model,t1.RN

Open in new window

why query the table 3 times?

I don't understand the strategy behind the query,
I do understand the functionality of it, just not the why.


but, speaking of "why",  in my previously posted deletes, I included a couple of extra analytics that aren't really necessary.
I was experimenting and didn't take them out before posting.  They aren't harmful but aren't helpful either.

here are cleaner versions.  I'll admit, still "bigger" code than the previous post, but it should be much more efficient.
DELETE FROM sample_table
      WHERE ROWID IN
                    (SELECT rid
                       FROM (SELECT ROWID rid,
                                    new_start_date,
                                    new_end_date,
                                    LAG(new_end_date)
                                        OVER (
                                            PARTITION BY item_name
                                            ORDER BY new_start_date, new_end_date
                                        )
                                        prevend,
                                    LEAD(new_start_date)
                                        OVER (
                                            PARTITION BY item_name
                                            ORDER BY new_start_date, new_end_date
                                        )
                                        nextstart
                               FROM sample_table
                              WHERE item_name = 'SONY')
                      WHERE new_start_date < prevend AND new_end_date > nextstart);
 
DELETE FROM sample_table
      WHERE ROWID IN
                    (SELECT rid
                       FROM (SELECT ROWID rid,
                                    new_start_date,
                                    new_end_date,                                   
                                    LAG(new_end_date)
                                        OVER (
                                            PARTITION BY item_name
                                            ORDER BY new_start_date, new_end_date
                                        )
                                        prevend
                               FROM sample_table
                              WHERE item_name = 'SONY')
                      WHERE new_start_date < prevend );

Open in new window

<<I don't understand the strategy behind the query,
I do understand the functionality of it, just not the why. >>
Do you want explanation?


select t1.MODEL,t1.PRICE,t1.START_DATE_ACTIVE,t1.END_DATE_ACTIVE
  from (select *,ROW_NUMBER() over (partition by Model order by Price) RN from your_table) t1
  left join (select *,ROW_NUMBER() over (partition by Model order by Price) RN from your_table) t2
    on t1.Model = t2.Model and t1.RN - 1 = t2.RN
  left join (select *,ROW_NUMBER() over (partition by Model order by Price) RN from your_table ) t3
    on t1.Model = t3.Model and t1.RN + 1 = t3.RN
 where case when (t1.START_DATE_ACTIVE < t2.END_DATE_ACTIVE) and 
                 (t1.END_DATE_ACTIVE > t3.START_DATE_ACTIVE) and 
                 (t1.END_DATE_ACTIVE < t3.END_DATE_ACTIVE)  then 1 else 0 end = 0
 order by t1.Model,t1.RN
/* query result
MODEL	PRICE	START_DATE_ACTIVE	END_DATE_ACTIVE
SONY	100	1901-01-01 00:00:00.000	2005-12-31 00:00:00.000
SONY	200	2006-01-01 00:00:00.000	2007-12-31 00:00:00.000
SONY	300	2008-01-01 00:00:00.000	2008-12-31 00:00:00.000
SONY	500	2009-01-01 00:00:00.000	2010-12-31 00:00:00.000
*/

Open in new window

not an explanation of the functionality (results) no.  I understand what it does and how it works.

but what are you trying to do?  it clearly produces different results than what the asker is looking for.
I assume you are doing something different intentionally, so what is it/why that approach?

Particularly since it's so "heavy" with 3 outer joins.

So, I know what it does, and even if I didn't, I ran it myself (after fixing small syntax errors) and saw the results.
I'm just not sure what direction you're going.

It might be a good approach but "as is" I don't know how to use what you've posted.
>> it clearly produces different results than what the asker is looking for.

Is it? Can you provide some example set where you are getting different results.
Asker - can you test this on your whole sample data and let me know where its failing. I don' say that its the perfect solution, i just tested this with the sample set provided but check against your complete table.
>> Particularly since it's so "heavy" with 3 outer joins.

JOINing 3 times may be over head, but to achieve the required output, i used this approach.
>> I assume you are doing something different intentionally, so what is it/why that approach?

I don't understand your meaning of "intentionally" here. I am trying to achieve the asker's expected result set.  
To remove the overlapping records, i am checking previous record and next records Start /End dates with current record's Start/End dates. (The condition in my CASE statement).
Try this...

I've modified your query to fit the most recent structure posted by the asker
I populated it with the data from the spreadsheet, but changed the first row to not go back to 1901 then added more historical data


the goal is to delete the rows with overlaps
with the sample data below there are 2 overlaps
the row with a price of 400 and the second row with a price of 17500

If you run the 2 deletes I posted earlier you'll be left with 10 non-overlapping rows.

Your query doesn't return the 2 rows that should be deleted

I tried looking at in reverse, maybe you were trying to return the data that would be left after deletes; but it doesn't do that either.  

That's what I meant by "doing something intentionally".  It didn't appear to be a small mistake pursuing the same goal I had ben and simply got some unexpected results.  Your approach appeared so radically different from where the thread had been going that it seemed like you must be doing something different and on purpose and I simply don't understand.


drop table sample_table;
 
CREATE TABLE sample_table
AS
    SELECT price_list_name,
           item_name,
           price,
           TO_DATE(nsd, 'yyyy/mm/dd') new_start_date,
           TO_DATE(ned, 'yyyy/mm/dd') new_end_date
      FROM (SELECT 'Global Price List All' price_list_name,
                   'SONY' item_name,
                   15500 price,
                   '2005/01/01' nsd,
                   '2005/05/14' ned
              FROM DUAL
            UNION ALL
            SELECT 'Global Price List All', 'SONY', 16000, '2005/05/15', '2006/02/04' FROM DUAL
            UNION ALL
            SELECT 'Global Price List All', 'SONY', 16800, '2006/02/05', '2007/06/09' FROM DUAL
            UNION ALL
            SELECT 'Global Price List All', 'SONY', 17000, '2007/06/10', '2007/12/31' FROM DUAL
            UNION ALL
            SELECT 'Global Price List All', 'SONY', 17500, '2008/01/01', '2008/12/31' FROM DUAL
            UNION ALL
            SELECT 'Global Price List All', 'SONY', 17500, '2008/12/01', '2008/12/31' FROM DUAL
            UNION ALL
            SELECT 'Global Price List All', 'SONY', 18000, '2009/01/01', '2010/12/31' FROM DUAL
            union all
            SELECT 'Global Price List All', 'SONY', 100, '1901/01/01', '1995/12/31' FROM DUAL
            UNION ALL
            SELECT 'Global Price List All', 'SONY', 200, '1996/01/01', '1997/12/31' FROM DUAL
            UNION ALL
            SELECT 'Global Price List All', 'SONY', 300, '1998/01/01', '1998/12/31' FROM DUAL
            UNION ALL
            SELECT 'Global Price List All', 'SONY', 400, '1998/02/01', '1999/12/31' FROM DUAL
            UNION ALL
            SELECT 'Global Price List All', 'SONY', 500, '1999/01/01', '2000/12/31' FROM DUAL);
            
select t1.item_name,t1.PRICE,t1.new_START_DATE,t1.new_END_DATE
  from (select sample_table.*,ROW_NUMBER() over (partition by item_name order by Price) RN from sample_table) t1
  left join (select sample_table.*,ROW_NUMBER() over (partition by item_name order by Price) RN from sample_table) t2
    on t1.item_name = t2.item_name and t1.RN - 1 = t2.RN
  left join (select sample_table.*,ROW_NUMBER() over (partition by item_name order by Price) RN from sample_table ) t3
    on t1.item_name = t3.item_name and t1.RN + 1 = t3.RN
 where case when (t1.new_START_DATE < t2.new_END_DATE) and 
                 (t1.new_END_DATE > t3.new_START_DATE) and 
                 (t1.new_END_DATE < t3.new_END_DATE)  then 1 else 0 end = 0
 order by t1.item_name,t1.RN            

Open in new window

I missed one more condition in my previous post.
yes, I am trying to return the data that would be left after deletes.

select t1.item_name,t1.PRICE,t1.new_START_DATE,t1.new_END_DATE
  from (select sample_table.*,ROW_NUMBER() over (partition by item_name order by Price) RN from sample_table) t1
  left join (select sample_table.*,ROW_NUMBER() over (partition by item_name order by Price) RN from sample_table) t2
    on t1.item_name = t2.item_name and t1.RN - 1 = t2.RN
  left join (select sample_table.*,ROW_NUMBER() over (partition by item_name order by Price) RN from sample_table ) t3
    on t1.item_name = t3.item_name and t1.RN + 1 = t3.RN
 where case when t1.new_START_DATE <= t2.new_END_DATE 
            and ((t1.new_END_DATE <= t2.new_END_DATE) or 
                (t1.new_END_DATE >= t2.new_END_DATE and t1.new_END_DATE >= t3.new_START_DATE)) then 1 else 0 end = 0
 order by t1.item_name,t1.RN 

Open in new window

ok, maybe I'm misunderstanding the request

I thought the goal was to actually "do" the deletes of the invalid rows

Suriyaraj_Sudalaiappan, can you please clarify?
ASKER CERTIFIED SOLUTION
Avatar of Sean Stuber
Sean Stuber

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Hi,
  Shall i use the query like below. In my table, i have 9000 rows. I did not give any where condition to select the particular model.

SELECT rid
FROM (SELECT ROWID rid,
            price_list_name,
            item_name,
            uom_code,
            new_start_date,
            new_end_date,
            formula_name,
            LAG(new_end_date)
                OVER (
                    PARTITION BY item_name
                    ORDER BY new_start_date, new_end_date
                )
                prevend,
            LEAD(new_start_date)
                OVER (
                    PARTITION BY item_name
                    ORDER BY new_start_date, new_end_date
                )
                nextstart
        FROM SAMPLE_TABLE)
        WHERE (new_start_date < prevend)
        OR (new_start_date < prevend AND new_end_date > nextstart);
what do you want to do with that query?


as you can see from the threads above there is some confusion as to what you are really trying to accomplish.
I have two guesses as to your goal and I've posted answers to both of them.


if you want to do a delete of data from your table, then no
use the two deletes I posted before in 24034314, run them in that same order

if you want to show records that will be kept if you were to do a delete then use the query I posted in 24035849 (you could also use Sharath's but it's less efficient)


if neither of those goals is really what you're trying to do, please try explaining again.
you can remove the model or item_name condition, of course
essdbdu33 $cp l3109359.dbg $XXOM_TOP/log
essdbdu33 $ls -l $XXOM_TOP/log
total 800
-rw-r--r--   1 rhuggi   aosimp    401791 Apr  1 12:53 l3109359.dbg
essdbdu33 $>l3109359.dbg
bash: l3109359.dbg: Permission denied
essdbdu33 $pwd
/apps/betsyp1/betsyp1db/10.2.0/temp/betsyp1_essdbdu33

Hi,
Actually i want to update my custom table if any overlapping data are there.
        Actually my requirement is, we need to identify the overlapping data from the table and update the same. Here overlapping means, we have to consider only the start date and end date columns. We have 'n' number of items in my table and it contains different prices. Price is different for same item with different start and end date. That means, price is applicable for only with in that period. So we should be start from one date and end with other date. That date should not be repeated multiple times. For example my sample data, it start with 1/1/2000 and end with 12/31/2001. Then the next date should start with 1/1/2002. It should not start 1/2/2002 and more than that. because my data is like that. If any start like 1/2/2002 and more than that we should update that row in the table as overlapping.
      And i got two scenarios from my data sheet like below. The start date is same for two rows and end date is different. Some data start date is different and end date is same. I will give the example. In this we don't know which one we should remove. For that discussion is going on right now. I will let you know after they confirmed.

For first scenario:
START_DATE      END_DATE
10/28/2006      12/31/2006
10/28/2006      12/31/2010
For second scenario:
START_DATE      END_DATE
1/1/2008            12/31/2010
1/2/2008            12/31/2010

 Below query i modified from your query. Please verify that and let me know if any changes. And i included one more condition here like pps1.new_end_date <> prevend. It will helpful for my second scenario to remove same end date and consider only one. But not decided this one.
UPDATE /*+index(pps) */
                XXQP_PAM_PRICE_STG pps
            SET
                pps.process_flag = 3,
                pps.error_msg = 'Overlapping data, not consider for Import'
            WHERE
                pps.process_flag = 1
                AND pps.ROWID IN (
                            SELECT /*+index(pps1) */ rid
                            FROM (
                                SELECT ROWID rid,
                                        price_list_name,
                                        item_name,
                                        uom_code,
                                        new_start_date,
                                        new_end_date,
                                        formula_name,
                                        LAG(new_end_date)
                                            OVER (
                                                PARTITION BY item_name
                                                ORDER BY new_start_date, new_end_date
                                            )
                                            prevend,
                                        LEAD(new_start_date)
                                            OVER (
                                                PARTITION BY item_name
                                                ORDER BY new_start_date, new_end_date
                                            )
                                            nextstart
                                    FROM XXQP_PAM_PRICE_STG
                                    ) pps1
                            WHERE ((pps1.new_start_date < prevend AND pps1.new_end_date <> prevend) OR (pps1.new_start_date < prevend AND pps1.new_end_date > nextstart))
                            AND pps1.item_name = pps.item_name
                            );

Thanks      
don't consider from the staring in my previous comment. some copy pase mistake. Please read from "HI"
so,  this entire question has been wrong then it seems.

you're not trying to do a delete at all
and, you're not just trying to find overlaps, but you're also trying to find holes?

your description of one record ending 12/31/2001 but the next record should not start 1/2/2002
is that because it would leave a "hole" of 1/1/2001?


please post complete example of data (not a query)
please use a create table like I posted above with your entire table structure and enough sample data to show all scenarios (good and bad) you want to consider.

then post what you want the table to look like afterward.  Don't worry about trying to guess what query might be used, just manually create the data in its final form and we'll try to figure out what query will produce the results you are looking for






by the way, please close your old questions, you have open questions, including apparent repeats (I just posted to one of them) over 2 months old.
GOOD ONE
glad I could help