Link to home
Create AccountLog in
Avatar of Steve Berger
Steve BergerFlag for United States of America

asked on

How to find the duplicate records in oracle SQL?

Hi,
   I have table which contains huge number of data. Attached is the sample data for my scenario. If you see the sample data file,  the column name ( NAME, FACTOR_ATTR_CONTEXT, FACTOR_ATTRIBUTE, ATTR_VALUE, START_DATE_ACTIVE, END_DATE_ACTIVE) which is contains the same data. It is only differ in ADJUSTMENT_FACTOR. Thing is, all the columns except ADJUSTMENT_FACTOR should not repeat again, if the ADJUSTMENT_FACTOR value is different also. See the LINE_INDEX 1,2 data. I want to get the combination of column data which are not present again with different ADJUSTMENT_FACTOR value. If it comes again both the combination are duplicate. For identifying the combination we can use the LINE_INDEX column. For example LINE_INDEX 1 is one combination, 2 is another combination like that. Another thing is the ADJUSTMENT_FACTOR value is same for different combination also. If it is there, that is unique data. See the line_index value 3. Here the ADJUSTMENT_FACTOR value is same 0.4. But it is differ in ATTR_VALUE. So this is unique record. I want to write a SQL Query to achieve that. I tried so many times. Below is my query. can any one share with your thoughts?

Thanks
SELECT a.* FROM 
    FACTOR_STG_SAM a
WHERE 
   NOT EXISTS (SELECT 
                    'x'
                FROM 
                    FACTOR_STG_SAM b
                WHERE 
                   a.NAME = b.NAME
                    AND a.FACTOR_ATTR_CONTEXT = b.FACTOR_ATTR_CONTEXT
                    AND a.FACTOR_ATTRIBUTE = b.FACTOR_ATTRIBUTE
                    AND a.ATTR_VALUE = b.ATTR_VALUE
                    AND NVL(a.ADJUSTMENT_FACTOR,0) <> NVL(b.ADJUSTMENT_FACTOR,0)
                    AND a.START_DATE_ACTIVE = b.START_DATE_ACTIVE
                    AND a.END_DATE_ACTIVE = b.END_DATE_ACTIVE
                );

Open in new window

sample-date.xls
Avatar of johnsone
johnsone
Flag of United States of America image

I would hope there is an index that contains at leas a few of the most selective fields.

You are missing 1 thing from your sub query.  There is no way to distinguish the original row you want to keep.  This is a way to do it, but by no means the only way.
SELECT a.* FROM 
    FACTOR_STG_SAM a
WHERE 
   rowid != (SELECT 
                    min(rowid)
                FROM 
                    FACTOR_STG_SAM b
                WHERE 
                   a.NAME = b.NAME
                    AND a.FACTOR_ATTR_CONTEXT = b.FACTOR_ATTR_CONTEXT
                    AND a.FACTOR_ATTRIBUTE = b.FACTOR_ATTRIBUTE
                    AND a.ATTR_VALUE = b.ATTR_VALUE
                    AND NVL(a.ADJUSTMENT_FACTOR,0) <> NVL(b.ADJUSTMENT_FACTOR,0)
                    AND a.START_DATE_ACTIVE = b.START_DATE_ACTIVE
                    AND a.END_DATE_ACTIVE = b.END_DATE_ACTIVE
                );

Open in new window

Avatar of Steve Berger

ASKER

Hi thanks for your reply. This query also is returning the same result. I thing we need to identify the combination first then we nee to check the table whether the same combination are available or not. I also trying this.
The query I posted should give you the duplicate rows.

This one should give you the unique rows.
SELECT a.* FROM 
    FACTOR_STG_SAM a
WHERE 
   rowid = (SELECT 
                    min(rowid)
                FROM 
                    FACTOR_STG_SAM b
                WHERE 
                   a.NAME = b.NAME
                    AND a.FACTOR_ATTR_CONTEXT = b.FACTOR_ATTR_CONTEXT
                    AND a.FACTOR_ATTRIBUTE = b.FACTOR_ATTRIBUTE
                    AND a.ATTR_VALUE = b.ATTR_VALUE
                    AND NVL(a.ADJUSTMENT_FACTOR,0) <> NVL(b.ADJUSTMENT_FACTOR,0)
                    AND a.START_DATE_ACTIVE = b.START_DATE_ACTIVE
                    AND a.END_DATE_ACTIVE = b.END_DATE_ACTIVE
                );

Open in new window

Hi, this query is getting no rows. The output will be, it will eliminate both line_index 1 and 2. And it will return only the line_index 3 combination. Because LINE_INDEX 1 and 2 contains the same rows with different ADJUSTMENT_FACTOR value. So this duplicate. In LINE_INDEX 3 the ATTR_VALUE is different right. So it will not eliminate.
Hi, this query is getting no rows. i will tell you, how the output is? The output will be, it will eliminate both line_index 1 and 2. And it will return only the line_index 3 combination. Because LINE_INDEX 1 and 2 contains the same rows with different ADJUSTMENT_FACTOR value. So this duplicate. In LINE_INDEX 3 the ATTR_VALUE is different right. So it will not eliminate.
Avatar of Sean Stuber
Sean Stuber

your attachment is 0 bytes, can you post sample data again?
Sorry, I cut an pasted your code incorrectly.  It should be an =, not a <>.

This should give you the unique values.

FYI - your attached sample file is empty.
SELECT a.* FROM 
    FACTOR_STG_SAM a
WHERE 
   rowid = (SELECT 
                    min(rowid)
                FROM 
                    FACTOR_STG_SAM b
                WHERE 
                   a.NAME = b.NAME
                    AND a.FACTOR_ATTR_CONTEXT = b.FACTOR_ATTR_CONTEXT
                    AND a.FACTOR_ATTRIBUTE = b.FACTOR_ATTRIBUTE
                    AND a.ATTR_VALUE = b.ATTR_VALUE
                    AND NVL(a.ADJUSTMENT_FACTOR,0) = NVL(b.ADJUSTMENT_FACTOR,0)
                    AND a.START_DATE_ACTIVE = b.START_DATE_ACTIVE
                    AND a.END_DATE_ACTIVE = b.END_DATE_ACTIVE
                );

Open in new window

Please re-post your sample data (as indicated by sdstuber, the file was empty) and which records you want to see in the result set.
Sorry for the wrong file. I have added the sample data file again. Then again it is giving the same output.
sample-data.xls
See some of the data in ADJUSTMENT_FACTOR column null. The output i am getting like only the not null rows in ADJUSTMENT_FACTOR column.
I loaded your sample data and the last query I posted gives back rows 2, 3, 4, 5 and 9.  That seems correct to me.
SQL> SELECT a.* FROM
  2      FACTOR_STG_SAM a
  3  WHERE
  4     rowid = (SELECT
  5                      min(rowid)
  6                  FROM
  7                      FACTOR_STG_SAM b
  8                  WHERE
  9                     a.NAME = b.NAME
 10                      AND a.FACTOR_ATTR_CONTEXT = b.FACTOR_ATTR_CONTEX
 11                      AND a.FACTOR_ATTRIBUTE = b.FACTOR_ATTRIBUTE
 12                      AND a.ATTR_VALUE = b.ATTR_VALUE
 13                      AND NVL(a.ADJUSTMENT_FACTOR,0) = NVL(b.ADJUSTMENR,0)
 14                      AND a.START_DATE_ACTIVE = b.START_DATE_ACTIVE
 15                      AND a.END_DATE_ACTIVE = b.END_DATE_ACTIVE
 16                  );
 
NAME
--------------------------------------------------
FACTOR_ATTR_CONTEXT
--------------------------------------------------
FACTOR_ATTRIBUTE
--------------------------------------------------
ATTR_VALUE                                         ADJUSTMENT_FACTOR
-------------------------------------------------- -----------------
START_DATE_ACTIVE    END_DATE_ACTIVE      B LINE_INDEX
-------------------- -------------------- - ----------
LEVEL3.2
CUST_HIERARCHY
CUST_LEVEL3
CHW                                                               .4
10-SEP-1996 00:00:00 31-DEC-2010 00:00:00 B          1
 
LEVEL3.2
PRODUCT_ LINE
PRODUCT_LINE
AM
10-SEP-1996 00:00:00 31-DEC-2010 00:00:00 A          1
 
LEVEL3.2
WRLD_REGION_HIERARCHY
WRLDREGION_LEVEL2
AP
10-SEP-1996 00:00:00 31-DEC-2010 00:00:00 A          1
 
LEVEL3.2
WRLD_REGION_HIERARCHY
WRLDREGION_LEVEL2
NASA
10-SEP-1996 00:00:00 31-DEC-2010 00:00:00 A          3
 
LEVEL3.2
CUST_HIERARCHY
CUST_LEVEL3
CHW                                                            .4834
10-SEP-1996 00:00:00 31-DEC-2010 00:00:00 B          2

Open in new window

Hi all,
    Thanks for your replies.

If you see my data file, column (NAME, FACTOR_ATTR_CONTEXT, FACTOR_ATTRIBUTE, ATTR_VALUE, START_DATE_ACTIVE, END_DATE_ACTIVE) data are same in some cases. Some cases these data will be different. And other thing these are the combination of records. We can identify the combination by use of line_index column. Line_index 1 is one combination, 2 is other combination like that. My requirement is, the data should not return the same combination again with different ADJUSTMENT_FACTOR or same ADJUSTMENT_FACTOR also. Because one combination must contain only one adjustment factor value. It should not contain multiple value. But It may contain the same combination with different start date active and end date active. My data file Line_index 1 and 2 contains the same combination with different ADJUSTMENT_FACTOR value. So we need to identify those two group of combination as a duplicate. See line_index 3, the ATTR_VALUE is different compare with line_index 1,2. So it is a unique combination.

Now you guys understand my requirement clearly right. Let me know how can i handle that?
can you post what you want the results to look like?
rather than explaining in words, can you post a spreadsheet that shows what you want the output of the query to be?
I should get only line_index 1 and 2. Because this is the combination is repeated again with different ADJUSTMENT_FACTOR value. Line_index 3 will not come in to the picture. Because that combination does not exist in the data file.
Hi all,
      I will give you one clear example. Below is some sample data.
NAME      TEXT      PATH      DATE                  INDEX
  XX           A            AB        12/3/2009              1
  XX           A            AB                                      1
  XX           A            AB        12/3/2009              2
  XX           A            AB                                      2
  XX           B            AB         12/3/2009              3
  XX           B            AB                                       3

                The above sample data, in index colum 1 and 2 has same data. But index 3 is different data. Now i want to delete any one of the index like 1 or 2. Because all the rows are same expect index. My requirement is one index should have to only one set of data. If it comes again in some other index, that is duplicate.

               Can anyone share with me the solution?

Thanks
Suriya
Hi all,
      I will give you one clear example. Below is some sample data.
NAME      TEXT      PATH      DATE                  INDEX
  XX           A            AB        12/3/2009              1
  XX           A            AB                                      1
  XX           A            AB        12/3/2009              2
  XX           A            AB                                      2
  XX           B            AB         12/3/2009              3
  XX           B            AB                                       3

                The above sample data, in index colum 1 and 2 has same data. But index 3 is different data. Now i want to delete any one of the index like 1 or 2. Because all the rows are same except index. My requirement is one index should have only one set of data. If it comes again in some other index, that is duplicate.

               Can anyone share with me the solution?

Thanks
Suriya
ok, that's more sample input data.

What do you want the output to look like?
The output looks like only index (1 or 2) and 3. Because index 3 is not a duplicate. Only index 1 and 2 contains the duplicate data. The output like below.
NAME      TEXT      PATH      DATE                  INDEX
  XX           A            AB        12/3/2009              2
  XX           A            AB                                      2
  XX           B            AB         12/3/2009              3
  XX           B            AB                                       3
ASKER CERTIFIED SOLUTION
Avatar of Sean Stuber
Sean Stuber

Link to home
membership
Create a free account to see this answer
Signing up is free and takes 30 seconds. No credit card required.
See answer
See attached.
dupes.txt
sdstuber,
The old attach file method got me again! :-)
:)

i was in a conference call and didn't see the email come in right away.  I could have had it answered faster and maybe saved you some time.  :)

I'm not sure the partition clause is correct.  Should the date column really be part of the partition?

No it is giving some different output. I want to partition with date column also.
how are you testing?

both queries return extactly the same output for the sample data you provided.
the output is exactly what you requested.

I can believe you that you need the date column in your partition by clause, but based on your input and output it doesn't match.  Can you provide an example that shows the problem?
Hi,
  Attached is my testing data what i am using here. Please run this query against these data. I need the output only line_index 2,3, and 4. In my real time i have 1600 line_index. Can you check this?

TESTING-DATA.xls
If i add one more same rows for line_index 4, it is giving some different output.
NAME      TEXT      PATH      DATE                  INDEX
  XX           A            AB        12/3/2009              1
  XX           A            AB                                      1
  XX           A            AB        12/3/2009              2
  XX           A            AB                                      2
  XX           B            AB         12/3/2009              3
  XX           B            AB                                       3
  XX           A            AB        12/3/2009              4
  XX           A            AB                                      4
OUTPUT is:
  XX           B            AB         12/3/2009              3
  XX           B            AB                                       3
  XX           A            AB        12/3/2009              4
  XX           A            AB                                      4
But i need the output like below:
  XX           A            AB        12/3/2009              2
  XX           A            AB                                      2
  XX           B            AB         12/3/2009              3
  XX           B            AB                                       3
  XX           A            AB        12/3/2009              4
  XX           A            AB                                      4

Sorry, my previous comment is wrong. i need the output like below.
 XX           B            AB         12/3/2009              3
  XX           B            AB                                       3
  XX           A            AB        12/3/2009              4
  XX           A            AB                                      4
that's what I get with either of the queries posted before


for the previous xls file,  what output are you expecting?
>>Sorry, my previous comment is wrong. i need the output like below.
 XX           B            AB         12/3/2009              3
  XX           B            AB                                       3
  XX           A            AB        12/3/2009              4
  XX           A            AB                                      4<<
Which seems to be precisely what the two queries return?
Hi,
  Attached is the another sample data to check the above query. It is the real data which i am using in my system. Please check this data with the prevoius queries.

Thanks
TESTING-DATA-1.xls
From my above comments, i need to fetch the output only line_index 2,3 and 4. Because line_index 1 is duplicate.
Are the values shown on sheet 2 of the attached file what you want to see and are they based on the different values shown for factor, start_date, or both?
TESTING-DATA-1.xls
Ya, that is correct. The data must be differ for factor, start_date and end_date. It means the same data should not repeat again with same start_date and end_date. It should differ by different start_date and end_date. In sheet 2 the line_index 1 is repeated again with same date. So that is duplication. We need to remove any one of the line_index.

Then, what ever attached query result that is correct. I need the same output.
Hi,
  In my above comment (23979380). That data is bit wrong. I have modified that data and attached here. This is my actual scenario. The previous data is wrong. Sorry for the wrong file. If you run your query for this data sheet. It is getting duplicate data as well as other data. I know, why? For example for all line_index column FLAG = 'A' is common. See my attached data sheet. So it is returning all FLAG='A' values. But i need only line_index 1 or 2. I tell again my requirement. In my attached data sheet some of the combination of data are duplicate. I need to identify those combination and delete those data. Combination based on the line_index column. Can you try now for the solution?

Thanks
TESTING-DATA-1.xls
Hi sdstuber and awking00, did you see my post? Can you find the solution for my comment ID 24104782?
Thanks
GOOD ONE