Steve Berger
asked on
How to find the duplicate records in oracle SQL?
Hi,
  I have table which contains huge number of data. Attached is the sample data for my scenario. If you see the sample data file,  the column name ( NAME, FACTOR_ATTR_CONTEXT, FACTOR_ATTRIBUTE, ATTR_VALUE, START_DATE_ACTIVE, END_DATE_ACTIVE) which is contains the same data. It is only differ in ADJUSTMENT_FACTOR. Thing is, all the columns except ADJUSTMENT_FACTOR should not repeat again, if the ADJUSTMENT_FACTOR value is different also. See the LINE_INDEX 1,2 data. I want to get the combination of column data which are not present again with different ADJUSTMENT_FACTOR value. If it comes again both the combination are duplicate. For identifying the combination we can use the LINE_INDEX column. For example LINE_INDEX 1 is one combination, 2 is another combination like that. Another thing is the ADJUSTMENT_FACTOR value is same for different combination also. If it is there, that is unique data. See the line_index value 3. Here the ADJUSTMENT_FACTOR value is same 0.4. But it is differ in ATTR_VALUE. So this is unique record. I want to write a SQL Query to achieve that. I tried so many times. Below is my query. can any one share with your thoughts?
Thanks
  I have table which contains huge number of data. Attached is the sample data for my scenario. If you see the sample data file,  the column name ( NAME, FACTOR_ATTR_CONTEXT, FACTOR_ATTRIBUTE, ATTR_VALUE, START_DATE_ACTIVE, END_DATE_ACTIVE) which is contains the same data. It is only differ in ADJUSTMENT_FACTOR. Thing is, all the columns except ADJUSTMENT_FACTOR should not repeat again, if the ADJUSTMENT_FACTOR value is different also. See the LINE_INDEX 1,2 data. I want to get the combination of column data which are not present again with different ADJUSTMENT_FACTOR value. If it comes again both the combination are duplicate. For identifying the combination we can use the LINE_INDEX column. For example LINE_INDEX 1 is one combination, 2 is another combination like that. Another thing is the ADJUSTMENT_FACTOR value is same for different combination also. If it is there, that is unique data. See the line_index value 3. Here the ADJUSTMENT_FACTOR value is same 0.4. But it is differ in ATTR_VALUE. So this is unique record. I want to write a SQL Query to achieve that. I tried so many times. Below is my query. can any one share with your thoughts?
Thanks
SELECT a.* FROM
FACTOR_STG_SAM a
WHERE
NOT EXISTS (SELECT
'x'
FROM
FACTOR_STG_SAM b
WHERE
a.NAME = b.NAME
AND a.FACTOR_ATTR_CONTEXT = b.FACTOR_ATTR_CONTEXT
AND a.FACTOR_ATTRIBUTE = b.FACTOR_ATTRIBUTE
AND a.ATTR_VALUE = b.ATTR_VALUE
AND NVL(a.ADJUSTMENT_FACTOR,0) <> NVL(b.ADJUSTMENT_FACTOR,0)
AND a.START_DATE_ACTIVE = b.START_DATE_ACTIVE
AND a.END_DATE_ACTIVE = b.END_DATE_ACTIVE
);
sample-date.xls
ASKER
Hi thanks for your reply. This query also is returning the same result. I thing we need to identify the combination first then we nee to check the table whether the same combination are available or not. I also trying this.
The query I posted should give you the duplicate rows.
This one should give you the unique rows.
This one should give you the unique rows.
SELECT a.* FROM
FACTOR_STG_SAM a
WHERE
rowid = (SELECT
min(rowid)
FROM
FACTOR_STG_SAM b
WHERE
a.NAME = b.NAME
AND a.FACTOR_ATTR_CONTEXT = b.FACTOR_ATTR_CONTEXT
AND a.FACTOR_ATTRIBUTE = b.FACTOR_ATTRIBUTE
AND a.ATTR_VALUE = b.ATTR_VALUE
AND NVL(a.ADJUSTMENT_FACTOR,0) <> NVL(b.ADJUSTMENT_FACTOR,0)
AND a.START_DATE_ACTIVE = b.START_DATE_ACTIVE
AND a.END_DATE_ACTIVE = b.END_DATE_ACTIVE
);
ASKER
Hi, this query is getting no rows. The output will be, it will eliminate both line_index 1 and 2. And it will return only the line_index 3 combination. Because LINE_INDEX 1 and 2 contains the same rows with different ADJUSTMENT_FACTOR value. So this duplicate. In LINE_INDEX 3 the ATTR_VALUE is different right. So it will not eliminate.
ASKER
Hi, this query is getting no rows. i will tell you, how the output is? The output will be, it will eliminate both line_index 1 and 2. And it will return only the line_index 3 combination. Because LINE_INDEX 1 and 2 contains the same rows with different ADJUSTMENT_FACTOR value. So this duplicate. In LINE_INDEX 3 the ATTR_VALUE is different right. So it will not eliminate.
your attachment is 0 bytes, can you post sample data again?
Sorry, I cut an pasted your code incorrectly. Â It should be an =, not a <>.
This should give you the unique values.
FYI - your attached sample file is empty.
This should give you the unique values.
FYI - your attached sample file is empty.
SELECT a.* FROM
FACTOR_STG_SAM a
WHERE
rowid = (SELECT
min(rowid)
FROM
FACTOR_STG_SAM b
WHERE
a.NAME = b.NAME
AND a.FACTOR_ATTR_CONTEXT = b.FACTOR_ATTR_CONTEXT
AND a.FACTOR_ATTRIBUTE = b.FACTOR_ATTRIBUTE
AND a.ATTR_VALUE = b.ATTR_VALUE
AND NVL(a.ADJUSTMENT_FACTOR,0) = NVL(b.ADJUSTMENT_FACTOR,0)
AND a.START_DATE_ACTIVE = b.START_DATE_ACTIVE
AND a.END_DATE_ACTIVE = b.END_DATE_ACTIVE
);
Please re-post your sample data (as indicated by sdstuber, the file was empty) and which records you want to see in the result set.
ASKER
Sorry for the wrong file. I have added the sample data file again. Then again it is giving the same output.
sample-data.xls
sample-data.xls
ASKER
See some of the data in ADJUSTMENT_FACTOR column null. The output i am getting like only the not null rows in ADJUSTMENT_FACTOR column.
I loaded your sample data and the last query I posted gives back rows 2, 3, 4, 5 and 9. Â That seems correct to me.
SQL> SELECT a.* FROM
2 FACTOR_STG_SAM a
3 WHERE
4 rowid = (SELECT
5 min(rowid)
6 FROM
7 FACTOR_STG_SAM b
8 WHERE
9 a.NAME = b.NAME
10 AND a.FACTOR_ATTR_CONTEXT = b.FACTOR_ATTR_CONTEX
11 AND a.FACTOR_ATTRIBUTE = b.FACTOR_ATTRIBUTE
12 AND a.ATTR_VALUE = b.ATTR_VALUE
13 AND NVL(a.ADJUSTMENT_FACTOR,0) = NVL(b.ADJUSTMENR,0)
14 AND a.START_DATE_ACTIVE = b.START_DATE_ACTIVE
15 AND a.END_DATE_ACTIVE = b.END_DATE_ACTIVE
16 );
NAME
--------------------------------------------------
FACTOR_ATTR_CONTEXT
--------------------------------------------------
FACTOR_ATTRIBUTE
--------------------------------------------------
ATTR_VALUE ADJUSTMENT_FACTOR
-------------------------------------------------- -----------------
START_DATE_ACTIVE END_DATE_ACTIVE B LINE_INDEX
-------------------- -------------------- - ----------
LEVEL3.2
CUST_HIERARCHY
CUST_LEVEL3
CHW .4
10-SEP-1996 00:00:00 31-DEC-2010 00:00:00 B 1
LEVEL3.2
PRODUCT_ LINE
PRODUCT_LINE
AM
10-SEP-1996 00:00:00 31-DEC-2010 00:00:00 A 1
LEVEL3.2
WRLD_REGION_HIERARCHY
WRLDREGION_LEVEL2
AP
10-SEP-1996 00:00:00 31-DEC-2010 00:00:00 A 1
LEVEL3.2
WRLD_REGION_HIERARCHY
WRLDREGION_LEVEL2
NASA
10-SEP-1996 00:00:00 31-DEC-2010 00:00:00 A 3
LEVEL3.2
CUST_HIERARCHY
CUST_LEVEL3
CHW .4834
10-SEP-1996 00:00:00 31-DEC-2010 00:00:00 B 2
ASKER
Hi all,
  Thanks for your replies.
If you see my data file, column (NAME, FACTOR_ATTR_CONTEXT, FACTOR_ATTRIBUTE, ATTR_VALUE, START_DATE_ACTIVE, END_DATE_ACTIVE) data are same in some cases. Some cases these data will be different. And other thing these are the combination of records. We can identify the combination by use of line_index column. Line_index 1 is one combination, 2 is other combination like that. My requirement is, the data should not return the same combination again with different ADJUSTMENT_FACTOR or same ADJUSTMENT_FACTOR also. Because one combination must contain only one adjustment factor value. It should not contain multiple value. But It may contain the same combination with different start date active and end date active. My data file Line_index 1 and 2 contains the same combination with different ADJUSTMENT_FACTOR value. So we need to identify those two group of combination as a duplicate. See line_index 3, the ATTR_VALUE is different compare with line_index 1,2. So it is a unique combination.
Now you guys understand my requirement clearly right. Let me know how can i handle that?
  Thanks for your replies.
If you see my data file, column (NAME, FACTOR_ATTR_CONTEXT, FACTOR_ATTRIBUTE, ATTR_VALUE, START_DATE_ACTIVE, END_DATE_ACTIVE) data are same in some cases. Some cases these data will be different. And other thing these are the combination of records. We can identify the combination by use of line_index column. Line_index 1 is one combination, 2 is other combination like that. My requirement is, the data should not return the same combination again with different ADJUSTMENT_FACTOR or same ADJUSTMENT_FACTOR also. Because one combination must contain only one adjustment factor value. It should not contain multiple value. But It may contain the same combination with different start date active and end date active. My data file Line_index 1 and 2 contains the same combination with different ADJUSTMENT_FACTOR value. So we need to identify those two group of combination as a duplicate. See line_index 3, the ATTR_VALUE is different compare with line_index 1,2. So it is a unique combination.
Now you guys understand my requirement clearly right. Let me know how can i handle that?
can you post what you want the results to look like?
rather than explaining in words, can you post a spreadsheet that shows what you want the output of the query to be?
rather than explaining in words, can you post a spreadsheet that shows what you want the output of the query to be?
ASKER
I should get only line_index 1 and 2. Because this is the combination is repeated again with different ADJUSTMENT_FACTOR value. Line_index 3 will not come in to the picture. Because that combination does not exist in the data file.
ASKER
Hi all,
   I will give you one clear example. Below is some sample data.
NAME Â Â Â TEXT Â Â Â PATH Â Â Â DATE Â Â Â Â Â Â Â Â Â INDEX
 XX      A       AB     12/3/2009        1
 XX      A       AB                    1
 XX      A       AB     12/3/2009        2
 XX      A       AB                    2
 XX      B       AB     12/3/2009        3
 XX      B       AB                    3
        The above sample data, in index colum 1 and 2 has same data. But index 3 is different data. Now i want to delete any one of the index like 1 or 2. Because all the rows are same expect index. My requirement is one index should have to only one set of data. If it comes again in some other index, that is duplicate.
        Can anyone share with me the solution?
Thanks
Suriya
   I will give you one clear example. Below is some sample data.
NAME Â Â Â TEXT Â Â Â PATH Â Â Â DATE Â Â Â Â Â Â Â Â Â INDEX
 XX      A       AB     12/3/2009        1
 XX      A       AB                    1
 XX      A       AB     12/3/2009        2
 XX      A       AB                    2
 XX      B       AB     12/3/2009        3
 XX      B       AB                    3
        The above sample data, in index colum 1 and 2 has same data. But index 3 is different data. Now i want to delete any one of the index like 1 or 2. Because all the rows are same expect index. My requirement is one index should have to only one set of data. If it comes again in some other index, that is duplicate.
        Can anyone share with me the solution?
Thanks
Suriya
ASKER
Hi all,
   I will give you one clear example. Below is some sample data.
NAME Â Â Â TEXT Â Â Â PATH Â Â Â DATE Â Â Â Â Â Â Â Â Â INDEX
 XX      A       AB     12/3/2009        1
 XX      A       AB                    1
 XX      A       AB     12/3/2009        2
 XX      A       AB                    2
 XX      B       AB     12/3/2009        3
 XX      B       AB                    3
        The above sample data, in index colum 1 and 2 has same data. But index 3 is different data. Now i want to delete any one of the index like 1 or 2. Because all the rows are same except index. My requirement is one index should have only one set of data. If it comes again in some other index, that is duplicate.
        Can anyone share with me the solution?
Thanks
Suriya
   I will give you one clear example. Below is some sample data.
NAME Â Â Â TEXT Â Â Â PATH Â Â Â DATE Â Â Â Â Â Â Â Â Â INDEX
 XX      A       AB     12/3/2009        1
 XX      A       AB                    1
 XX      A       AB     12/3/2009        2
 XX      A       AB                    2
 XX      B       AB     12/3/2009        3
 XX      B       AB                    3
        The above sample data, in index colum 1 and 2 has same data. But index 3 is different data. Now i want to delete any one of the index like 1 or 2. Because all the rows are same except index. My requirement is one index should have only one set of data. If it comes again in some other index, that is duplicate.
        Can anyone share with me the solution?
Thanks
Suriya
ok, that's more sample input data.
What do you want the output to look like?
What do you want the output to look like?
ASKER
The output looks like only index (1 or 2) and 3. Because index 3 is not a duplicate. Only index 1 and 2 contains the duplicate data. The output like below.
NAME Â Â Â TEXT Â Â Â PATH Â Â Â DATE Â Â Â Â Â Â Â Â Â INDEX
 XX      A       AB     12/3/2009        2
 XX      A       AB                    2
 XX      B       AB     12/3/2009        3
 XX      B       AB                    3
NAME Â Â Â TEXT Â Â Â PATH Â Â Â DATE Â Â Â Â Â Â Â Â Â INDEX
 XX      A       AB     12/3/2009        2
 XX      A       AB                    2
 XX      B       AB     12/3/2009        3
 XX      B       AB                    3
ASKER CERTIFIED SOLUTION
membership
Create a free account to see this answer
Signing up is free and takes 30 seconds. No credit card required.
See attached.
dupes.txt
dupes.txt
sdstuber,
The old attach file method got me again! :-)
The old attach file method got me again! :-)
:)
i was in a conference call and didn't see the email come in right away. Â I could have had it answered faster and maybe saved you some time. Â :)
I'm not sure the partition clause is correct. Â Should the date column really be part of the partition?
i was in a conference call and didn't see the email come in right away. Â I could have had it answered faster and maybe saved you some time. Â :)
I'm not sure the partition clause is correct. Â Should the date column really be part of the partition?
ASKER
No it is giving some different output. I want to partition with date column also.
how are you testing?
both queries return extactly the same output for the sample data you provided.
the output is exactly what you requested.
I can believe you that you need the date column in your partition by clause, but based on your input and output it doesn't match. Â Can you provide an example that shows the problem?
both queries return extactly the same output for the sample data you provided.
the output is exactly what you requested.
I can believe you that you need the date column in your partition by clause, but based on your input and output it doesn't match. Â Can you provide an example that shows the problem?
ASKER
Hi,
 Attached is my testing data what i am using here. Please run this query against these data. I need the output only line_index 2,3, and 4. In my real time i have 1600 line_index. Can you check this?
TESTING-DATA.xls
 Attached is my testing data what i am using here. Please run this query against these data. I need the output only line_index 2,3, and 4. In my real time i have 1600 line_index. Can you check this?
TESTING-DATA.xls
ASKER
If i add one more same rows for line_index 4, it is giving some different output.
NAME Â Â Â TEXT Â Â Â PATH Â Â Â DATE Â Â Â Â Â Â Â Â Â INDEX
 XX      A       AB     12/3/2009        1
 XX      A       AB                    1
 XX      A       AB     12/3/2009        2
 XX      A       AB                    2
 XX      B       AB     12/3/2009        3
 XX      B       AB                    3
 XX      A       AB     12/3/2009        4
 XX      A       AB                    4
OUTPUT is:
 XX      B       AB     12/3/2009        3
 XX      B       AB                    3
 XX      A       AB     12/3/2009        4
 XX      A       AB                    4
But i need the output like below:
 XX      A       AB     12/3/2009        2
 XX      A       AB                    2
 XX      B       AB     12/3/2009        3
 XX      B       AB                    3
 XX      A       AB     12/3/2009        4
 XX      A       AB                    4
NAME Â Â Â TEXT Â Â Â PATH Â Â Â DATE Â Â Â Â Â Â Â Â Â INDEX
 XX      A       AB     12/3/2009        1
 XX      A       AB                    1
 XX      A       AB     12/3/2009        2
 XX      A       AB                    2
 XX      B       AB     12/3/2009        3
 XX      B       AB                    3
 XX      A       AB     12/3/2009        4
 XX      A       AB                    4
OUTPUT is:
 XX      B       AB     12/3/2009        3
 XX      B       AB                    3
 XX      A       AB     12/3/2009        4
 XX      A       AB                    4
But i need the output like below:
 XX      A       AB     12/3/2009        2
 XX      A       AB                    2
 XX      B       AB     12/3/2009        3
 XX      B       AB                    3
 XX      A       AB     12/3/2009        4
 XX      A       AB                    4
ASKER
Sorry, my previous comment is wrong. i need the output like below.
 XX      B       AB     12/3/2009        3
 XX      B       AB                    3
 XX      A       AB     12/3/2009        4
 XX      A       AB                    4
 XX      B       AB     12/3/2009        3
 XX      B       AB                    3
 XX      A       AB     12/3/2009        4
 XX      A       AB                    4
that's what I get with either of the queries posted before
for the previous xls file, Â what output are you expecting?
for the previous xls file, Â what output are you expecting?
>>Sorry, my previous comment is wrong. i need the output like below.
 XX      B       AB     12/3/2009        3
 XX      B       AB                    3
 XX      A       AB     12/3/2009        4
 XX      A       AB                    4<<
Which seems to be precisely what the two queries return?
 XX      B       AB     12/3/2009        3
 XX      B       AB                    3
 XX      A       AB     12/3/2009        4
 XX      A       AB                    4<<
Which seems to be precisely what the two queries return?
ASKER
Hi,
 Attached is the another sample data to check the above query. It is the real data which i am using in my system. Please check this data with the prevoius queries.
Thanks
TESTING-DATA-1.xls
 Attached is the another sample data to check the above query. It is the real data which i am using in my system. Please check this data with the prevoius queries.
Thanks
TESTING-DATA-1.xls
ASKER
From my above comments, i need to fetch the output only line_index 2,3 and 4. Because line_index 1 is duplicate.
Are the values shown on sheet 2 of the attached file what you want to see and are they based on the different values shown for factor, start_date, or both?
TESTING-DATA-1.xls
TESTING-DATA-1.xls
ASKER
Ya, that is correct. The data must be differ for factor, start_date and end_date. It means the same data should not repeat again with same start_date and end_date. It should differ by different start_date and end_date. In sheet 2 the line_index 1 is repeated again with same date. So that is duplication. We need to remove any one of the line_index.
Then, what ever attached query result that is correct. I need the same output.
Then, what ever attached query result that is correct. I need the same output.
ASKER
Hi,
 In my above comment (23979380). That data is bit wrong. I have modified that data and attached here. This is my actual scenario. The previous data is wrong. Sorry for the wrong file. If you run your query for this data sheet. It is getting duplicate data as well as other data. I know, why? For example for all line_index column FLAG = 'A' is common. See my attached data sheet. So it is returning all FLAG='A' values. But i need only line_index 1 or 2. I tell again my requirement. In my attached data sheet some of the combination of data are duplicate. I need to identify those combination and delete those data. Combination based on the line_index column. Can you try now for the solution?
Thanks
TESTING-DATA-1.xls
 In my above comment (23979380). That data is bit wrong. I have modified that data and attached here. This is my actual scenario. The previous data is wrong. Sorry for the wrong file. If you run your query for this data sheet. It is getting duplicate data as well as other data. I know, why? For example for all line_index column FLAG = 'A' is common. See my attached data sheet. So it is returning all FLAG='A' values. But i need only line_index 1 or 2. I tell again my requirement. In my attached data sheet some of the combination of data are duplicate. I need to identify those combination and delete those data. Combination based on the line_index column. Can you try now for the solution?
Thanks
TESTING-DATA-1.xls
ASKER
Hi sdstuber and awking00, did you see my post? Can you find the solution for my comment ID 24104782?
Thanks
Thanks
ASKER
GOOD ONE
You are missing 1 thing from your sub query. Â There is no way to distinguish the original row you want to keep. Â This is a way to do it, but by no means the only way.
Open in new window