Avatar of Steve Berger
Steve Berger
Flag for United States of America asked on

How to find the duplicate records in oracle SQL?

Hi,
   I have table which contains huge number of data. Attached is the sample data for my scenario. If you see the sample data file,  the column name ( NAME, FACTOR_ATTR_CONTEXT, FACTOR_ATTRIBUTE, ATTR_VALUE, START_DATE_ACTIVE, END_DATE_ACTIVE) which is contains the same data. It is only differ in ADJUSTMENT_FACTOR. Thing is, all the columns except ADJUSTMENT_FACTOR should not repeat again, if the ADJUSTMENT_FACTOR value is different also. See the LINE_INDEX 1,2 data. I want to get the combination of column data which are not present again with different ADJUSTMENT_FACTOR value. If it comes again both the combination are duplicate. For identifying the combination we can use the LINE_INDEX column. For example LINE_INDEX 1 is one combination, 2 is another combination like that. Another thing is the ADJUSTMENT_FACTOR value is same for different combination also. If it is there, that is unique data. See the line_index value 3. Here the ADJUSTMENT_FACTOR value is same 0.4. But it is differ in ATTR_VALUE. So this is unique record. I want to write a SQL Query to achieve that. I tried so many times. Below is my query. can any one share with your thoughts?

Thanks
SELECT a.* FROM 
    FACTOR_STG_SAM a
WHERE 
   NOT EXISTS (SELECT 
                    'x'
                FROM 
                    FACTOR_STG_SAM b
                WHERE 
                   a.NAME = b.NAME
                    AND a.FACTOR_ATTR_CONTEXT = b.FACTOR_ATTR_CONTEXT
                    AND a.FACTOR_ATTRIBUTE = b.FACTOR_ATTRIBUTE
                    AND a.ATTR_VALUE = b.ATTR_VALUE
                    AND NVL(a.ADJUSTMENT_FACTOR,0) <> NVL(b.ADJUSTMENT_FACTOR,0)
                    AND a.START_DATE_ACTIVE = b.START_DATE_ACTIVE
                    AND a.END_DATE_ACTIVE = b.END_DATE_ACTIVE
                );

Open in new window

sample-date.xls
Oracle Database

Avatar of undefined
Last Comment
Steve Berger

8/22/2022 - Mon
johnsone

I would hope there is an index that contains at leas a few of the most selective fields.

You are missing 1 thing from your sub query.  There is no way to distinguish the original row you want to keep.  This is a way to do it, but by no means the only way.
SELECT a.* FROM 
    FACTOR_STG_SAM a
WHERE 
   rowid != (SELECT 
                    min(rowid)
                FROM 
                    FACTOR_STG_SAM b
                WHERE 
                   a.NAME = b.NAME
                    AND a.FACTOR_ATTR_CONTEXT = b.FACTOR_ATTR_CONTEXT
                    AND a.FACTOR_ATTRIBUTE = b.FACTOR_ATTRIBUTE
                    AND a.ATTR_VALUE = b.ATTR_VALUE
                    AND NVL(a.ADJUSTMENT_FACTOR,0) <> NVL(b.ADJUSTMENT_FACTOR,0)
                    AND a.START_DATE_ACTIVE = b.START_DATE_ACTIVE
                    AND a.END_DATE_ACTIVE = b.END_DATE_ACTIVE
                );

Open in new window

Steve Berger

ASKER
Hi thanks for your reply. This query also is returning the same result. I thing we need to identify the combination first then we nee to check the table whether the same combination are available or not. I also trying this.
johnsone

The query I posted should give you the duplicate rows.

This one should give you the unique rows.
SELECT a.* FROM 
    FACTOR_STG_SAM a
WHERE 
   rowid = (SELECT 
                    min(rowid)
                FROM 
                    FACTOR_STG_SAM b
                WHERE 
                   a.NAME = b.NAME
                    AND a.FACTOR_ATTR_CONTEXT = b.FACTOR_ATTR_CONTEXT
                    AND a.FACTOR_ATTRIBUTE = b.FACTOR_ATTRIBUTE
                    AND a.ATTR_VALUE = b.ATTR_VALUE
                    AND NVL(a.ADJUSTMENT_FACTOR,0) <> NVL(b.ADJUSTMENT_FACTOR,0)
                    AND a.START_DATE_ACTIVE = b.START_DATE_ACTIVE
                    AND a.END_DATE_ACTIVE = b.END_DATE_ACTIVE
                );

Open in new window

Experts Exchange is like having an extremely knowledgeable team sitting and waiting for your call. Couldn't do my job half as well as I do without it!
James Murphy
Steve Berger

ASKER
Hi, this query is getting no rows. The output will be, it will eliminate both line_index 1 and 2. And it will return only the line_index 3 combination. Because LINE_INDEX 1 and 2 contains the same rows with different ADJUSTMENT_FACTOR value. So this duplicate. In LINE_INDEX 3 the ATTR_VALUE is different right. So it will not eliminate.
Steve Berger

ASKER
Hi, this query is getting no rows. i will tell you, how the output is? The output will be, it will eliminate both line_index 1 and 2. And it will return only the line_index 3 combination. Because LINE_INDEX 1 and 2 contains the same rows with different ADJUSTMENT_FACTOR value. So this duplicate. In LINE_INDEX 3 the ATTR_VALUE is different right. So it will not eliminate.
Sean Stuber

your attachment is 0 bytes, can you post sample data again?
⚡ FREE TRIAL OFFER
Try out a week of full access for free.
Find out why thousands trust the EE community with their toughest problems.
johnsone

Sorry, I cut an pasted your code incorrectly.  It should be an =, not a <>.

This should give you the unique values.

FYI - your attached sample file is empty.
SELECT a.* FROM 
    FACTOR_STG_SAM a
WHERE 
   rowid = (SELECT 
                    min(rowid)
                FROM 
                    FACTOR_STG_SAM b
                WHERE 
                   a.NAME = b.NAME
                    AND a.FACTOR_ATTR_CONTEXT = b.FACTOR_ATTR_CONTEXT
                    AND a.FACTOR_ATTRIBUTE = b.FACTOR_ATTRIBUTE
                    AND a.ATTR_VALUE = b.ATTR_VALUE
                    AND NVL(a.ADJUSTMENT_FACTOR,0) = NVL(b.ADJUSTMENT_FACTOR,0)
                    AND a.START_DATE_ACTIVE = b.START_DATE_ACTIVE
                    AND a.END_DATE_ACTIVE = b.END_DATE_ACTIVE
                );

Open in new window

awking00

Please re-post your sample data (as indicated by sdstuber, the file was empty) and which records you want to see in the result set.
Steve Berger

ASKER
Sorry for the wrong file. I have added the sample data file again. Then again it is giving the same output.
sample-data.xls
Your help has saved me hundreds of hours of internet surfing.
fblack61
Steve Berger

ASKER
See some of the data in ADJUSTMENT_FACTOR column null. The output i am getting like only the not null rows in ADJUSTMENT_FACTOR column.
johnsone

I loaded your sample data and the last query I posted gives back rows 2, 3, 4, 5 and 9.  That seems correct to me.
SQL> SELECT a.* FROM
  2      FACTOR_STG_SAM a
  3  WHERE
  4     rowid = (SELECT
  5                      min(rowid)
  6                  FROM
  7                      FACTOR_STG_SAM b
  8                  WHERE
  9                     a.NAME = b.NAME
 10                      AND a.FACTOR_ATTR_CONTEXT = b.FACTOR_ATTR_CONTEX
 11                      AND a.FACTOR_ATTRIBUTE = b.FACTOR_ATTRIBUTE
 12                      AND a.ATTR_VALUE = b.ATTR_VALUE
 13                      AND NVL(a.ADJUSTMENT_FACTOR,0) = NVL(b.ADJUSTMENR,0)
 14                      AND a.START_DATE_ACTIVE = b.START_DATE_ACTIVE
 15                      AND a.END_DATE_ACTIVE = b.END_DATE_ACTIVE
 16                  );
 
NAME
--------------------------------------------------
FACTOR_ATTR_CONTEXT
--------------------------------------------------
FACTOR_ATTRIBUTE
--------------------------------------------------
ATTR_VALUE                                         ADJUSTMENT_FACTOR
-------------------------------------------------- -----------------
START_DATE_ACTIVE    END_DATE_ACTIVE      B LINE_INDEX
-------------------- -------------------- - ----------
LEVEL3.2
CUST_HIERARCHY
CUST_LEVEL3
CHW                                                               .4
10-SEP-1996 00:00:00 31-DEC-2010 00:00:00 B          1
 
LEVEL3.2
PRODUCT_ LINE
PRODUCT_LINE
AM
10-SEP-1996 00:00:00 31-DEC-2010 00:00:00 A          1
 
LEVEL3.2
WRLD_REGION_HIERARCHY
WRLDREGION_LEVEL2
AP
10-SEP-1996 00:00:00 31-DEC-2010 00:00:00 A          1
 
LEVEL3.2
WRLD_REGION_HIERARCHY
WRLDREGION_LEVEL2
NASA
10-SEP-1996 00:00:00 31-DEC-2010 00:00:00 A          3
 
LEVEL3.2
CUST_HIERARCHY
CUST_LEVEL3
CHW                                                            .4834
10-SEP-1996 00:00:00 31-DEC-2010 00:00:00 B          2

Open in new window

Steve Berger

ASKER
Hi all,
    Thanks for your replies.

If you see my data file, column (NAME, FACTOR_ATTR_CONTEXT, FACTOR_ATTRIBUTE, ATTR_VALUE, START_DATE_ACTIVE, END_DATE_ACTIVE) data are same in some cases. Some cases these data will be different. And other thing these are the combination of records. We can identify the combination by use of line_index column. Line_index 1 is one combination, 2 is other combination like that. My requirement is, the data should not return the same combination again with different ADJUSTMENT_FACTOR or same ADJUSTMENT_FACTOR also. Because one combination must contain only one adjustment factor value. It should not contain multiple value. But It may contain the same combination with different start date active and end date active. My data file Line_index 1 and 2 contains the same combination with different ADJUSTMENT_FACTOR value. So we need to identify those two group of combination as a duplicate. See line_index 3, the ATTR_VALUE is different compare with line_index 1,2. So it is a unique combination.

Now you guys understand my requirement clearly right. Let me know how can i handle that?
⚡ FREE TRIAL OFFER
Try out a week of full access for free.
Find out why thousands trust the EE community with their toughest problems.
Sean Stuber

can you post what you want the results to look like?
rather than explaining in words, can you post a spreadsheet that shows what you want the output of the query to be?
Steve Berger

ASKER
I should get only line_index 1 and 2. Because this is the combination is repeated again with different ADJUSTMENT_FACTOR value. Line_index 3 will not come in to the picture. Because that combination does not exist in the data file.
Steve Berger

ASKER
Hi all,
      I will give you one clear example. Below is some sample data.
NAME      TEXT      PATH      DATE                  INDEX
  XX           A            AB        12/3/2009              1
  XX           A            AB                                      1
  XX           A            AB        12/3/2009              2
  XX           A            AB                                      2
  XX           B            AB         12/3/2009              3
  XX           B            AB                                       3

                The above sample data, in index colum 1 and 2 has same data. But index 3 is different data. Now i want to delete any one of the index like 1 or 2. Because all the rows are same expect index. My requirement is one index should have to only one set of data. If it comes again in some other index, that is duplicate.

               Can anyone share with me the solution?

Thanks
Suriya
I started with Experts Exchange in 2004 and it's been a mainstay of my professional computing life since. It helped me launch a career as a programmer / Oracle data analyst
William Peck
Steve Berger

ASKER
Hi all,
      I will give you one clear example. Below is some sample data.
NAME      TEXT      PATH      DATE                  INDEX
  XX           A            AB        12/3/2009              1
  XX           A            AB                                      1
  XX           A            AB        12/3/2009              2
  XX           A            AB                                      2
  XX           B            AB         12/3/2009              3
  XX           B            AB                                       3

                The above sample data, in index colum 1 and 2 has same data. But index 3 is different data. Now i want to delete any one of the index like 1 or 2. Because all the rows are same except index. My requirement is one index should have only one set of data. If it comes again in some other index, that is duplicate.

               Can anyone share with me the solution?

Thanks
Suriya
Sean Stuber

ok, that's more sample input data.

What do you want the output to look like?
Steve Berger

ASKER
The output looks like only index (1 or 2) and 3. Because index 3 is not a duplicate. Only index 1 and 2 contains the duplicate data. The output like below.
NAME      TEXT      PATH      DATE                  INDEX
  XX           A            AB        12/3/2009              2
  XX           A            AB                                      2
  XX           B            AB         12/3/2009              3
  XX           B            AB                                       3
⚡ FREE TRIAL OFFER
Try out a week of full access for free.
Find out why thousands trust the EE community with their toughest problems.
ASKER CERTIFIED SOLUTION
Sean Stuber

THIS SOLUTION ONLY AVAILABLE TO MEMBERS.
View this solution by signing up for a free trial.
Members can start a 7-Day free trial and enjoy unlimited access to the platform.
See Pricing Options
Start Free Trial
GET A PERSONALIZED SOLUTION
Ask your own question & get feedback from real experts
Find out why thousands trust the EE community with their toughest problems.
awking00

See attached.
dupes.txt
awking00

sdstuber,
The old attach file method got me again! :-)
Sean Stuber

:)

i was in a conference call and didn't see the email come in right away.  I could have had it answered faster and maybe saved you some time.  :)

I'm not sure the partition clause is correct.  Should the date column really be part of the partition?

This is the best money I have ever spent. I cannot not tell you how many times these folks have saved my bacon. I learn so much from the contributors.
rwheeler23
Steve Berger

ASKER
No it is giving some different output. I want to partition with date column also.
Sean Stuber

how are you testing?

both queries return extactly the same output for the sample data you provided.
the output is exactly what you requested.

I can believe you that you need the date column in your partition by clause, but based on your input and output it doesn't match.  Can you provide an example that shows the problem?
Steve Berger

ASKER
Hi,
  Attached is my testing data what i am using here. Please run this query against these data. I need the output only line_index 2,3, and 4. In my real time i have 1600 line_index. Can you check this?

TESTING-DATA.xls
⚡ FREE TRIAL OFFER
Try out a week of full access for free.
Find out why thousands trust the EE community with their toughest problems.
Steve Berger

ASKER
If i add one more same rows for line_index 4, it is giving some different output.
NAME      TEXT      PATH      DATE                  INDEX
  XX           A            AB        12/3/2009              1
  XX           A            AB                                      1
  XX           A            AB        12/3/2009              2
  XX           A            AB                                      2
  XX           B            AB         12/3/2009              3
  XX           B            AB                                       3
  XX           A            AB        12/3/2009              4
  XX           A            AB                                      4
OUTPUT is:
  XX           B            AB         12/3/2009              3
  XX           B            AB                                       3
  XX           A            AB        12/3/2009              4
  XX           A            AB                                      4
But i need the output like below:
  XX           A            AB        12/3/2009              2
  XX           A            AB                                      2
  XX           B            AB         12/3/2009              3
  XX           B            AB                                       3
  XX           A            AB        12/3/2009              4
  XX           A            AB                                      4

Steve Berger

ASKER
Sorry, my previous comment is wrong. i need the output like below.
 XX           B            AB         12/3/2009              3
  XX           B            AB                                       3
  XX           A            AB        12/3/2009              4
  XX           A            AB                                      4
Sean Stuber

that's what I get with either of the queries posted before


for the previous xls file,  what output are you expecting?
Experts Exchange has (a) saved my job multiple times, (b) saved me hours, days, and even weeks of work, and often (c) makes me look like a superhero! This place is MAGIC!
Walt Forbes
awking00

>>Sorry, my previous comment is wrong. i need the output like below.
 XX           B            AB         12/3/2009              3
  XX           B            AB                                       3
  XX           A            AB        12/3/2009              4
  XX           A            AB                                      4<<
Which seems to be precisely what the two queries return?
Steve Berger

ASKER
Hi,
  Attached is the another sample data to check the above query. It is the real data which i am using in my system. Please check this data with the prevoius queries.

Thanks
TESTING-DATA-1.xls
Steve Berger

ASKER
From my above comments, i need to fetch the output only line_index 2,3 and 4. Because line_index 1 is duplicate.
⚡ FREE TRIAL OFFER
Try out a week of full access for free.
Find out why thousands trust the EE community with their toughest problems.
awking00

Are the values shown on sheet 2 of the attached file what you want to see and are they based on the different values shown for factor, start_date, or both?
TESTING-DATA-1.xls
Steve Berger

ASKER
Ya, that is correct. The data must be differ for factor, start_date and end_date. It means the same data should not repeat again with same start_date and end_date. It should differ by different start_date and end_date. In sheet 2 the line_index 1 is repeated again with same date. So that is duplication. We need to remove any one of the line_index.

Then, what ever attached query result that is correct. I need the same output.
Steve Berger

ASKER
Hi,
  In my above comment (23979380). That data is bit wrong. I have modified that data and attached here. This is my actual scenario. The previous data is wrong. Sorry for the wrong file. If you run your query for this data sheet. It is getting duplicate data as well as other data. I know, why? For example for all line_index column FLAG = 'A' is common. See my attached data sheet. So it is returning all FLAG='A' values. But i need only line_index 1 or 2. I tell again my requirement. In my attached data sheet some of the combination of data are duplicate. I need to identify those combination and delete those data. Combination based on the line_index column. Can you try now for the solution?

Thanks
TESTING-DATA-1.xls
All of life is about relationships, and EE has made a viirtual community a real community. It lifts everyone's boat
William Peck
Steve Berger

ASKER
Hi sdstuber and awking00, did you see my post? Can you find the solution for my comment ID 24104782?
Thanks
Steve Berger

ASKER
GOOD ONE