Solved

SQL- identify duplicate records by analysis

Posted on 2010-09-14
7
211 Views
Last Modified: 2012-05-10
Find duplicates in a table "Users" where you must analyze certain fields like "first_name", "last_name", "address", "email" to conclude if they are indeed valid duplicates (obviously, the unique field "user_identification_number" has different values, hence the need to find and identify duplicate accounts)
The only criteria that is s MUST (where clause) is that at least one of the duplicate record/s must have a value of 0 in the field "criteria".

 Examples: "email" in a record is identical to "email" in another record is a guaranteed duplicate.
"FirstName" in a record in a record is identical to "FirstName" in another record is NOT a guaranteed duplicate (for this the "LastName" must also be identical).


Bottom line: there is no sure fire way to query for accurate duplicates- I just want a generic guidance so that I can eyeball and identify the duplicates myself after setting some basic criteria.

Thank you.

0
Comment
Question by:spirose
  • 4
  • 2
7 Comments
 
LVL 92

Expert Comment

by:Patrick Matthews
ID: 33674803
spirose,With respect, this question has the look and feel of an academic assignment.  Per the Member Agreement, Members are not allowed to ask for and the Experts are not allowed to offer to do your academic assignments for you.What you *can* do, however, is post the code/answers you have so far, and ask the Experts for tips and hints to complete the answer.  You can also ask the Experts for help in understanding key concepts and principles, but in either case, you have to do the heavy lifting yourself.If I am mistaken and this is not an academic assignment, then please post a clarification of your situation and intention to put the Experts at ease.Patrick
0
 

Author Comment

by:spirose
ID: 33674869
This is not an academic assignment- sorry if it came across that way. This work involves major data sleuthing and I just wanted to narrow down my options using some sort of query(from thousands of records to possibly hundreds) so that I can identify them more efficiently. I hope that makes sense.

Let me try and make the question more clear:

select first, last, address, email from Users A
inner join Users B
on (??what do I join this to)
where A.criteria = 0
and A.email like .email (??)


I am taking an example of emails since email is a guaranteed way of identifying duplicates.
0
 

Author Comment

by:spirose
ID: 33674888
EDIT: last line of the code should be
A.email = B.email (I do not think like is an appropriate operator here).

Please correct me if I am wrong..
0
Master Your Team's Linux and Cloud Stack

Come see why top tech companies like Mailchimp and Media Temple use Linux Academy to build their employee training programs.

 
LVL 41

Expert Comment

by:ralmada
ID: 33674892
Agree with matthewspatrick. No to answer your question below:
>>Bottom line: there is no sure fire way to query for accurate duplicates- I just want a generic guidance so that I can eyeball and identify the duplicates myself after setting some basic criteria.<<
Hint for you: You can use the count() over (partition by ... ) to get the counts of same first_name, last_name combination on one side and the email on the other....
0
 

Author Comment

by:spirose
ID: 33674939
Again, this is NOT an academic assignment. I do not know where I gave that vibe.
0
 
LVL 41

Accepted Solution

by:
ralmada earned 500 total points
ID: 33674950
so basically follow this structure
select first, last, address, email from (
 select first, last, address, email, count() over (partition by first, last) cn1, count() over (partition by .... something else you complete this...) cn2  from Users
) A
where cn1 > 1 or cn2 > 1

Open in new window

0
 

Author Comment

by:spirose
ID: 33675115
Thank you for providing an efficient concept of count() over (partition by...)

P.S. I hope I provided enough clarification as to this NOT being an academic assignment to put the Experts at ease.
0

Featured Post

NAS Cloud Backup Strategies

This article explains backup scenarios when using network storage. We review the so-called “3-2-1 strategy” and summarize the methods you can use to send NAS data to the cloud

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Occasionally there is a need to clean table columns, especially if you have inherited legacy data. There are obviously many ways to accomplish that, including elaborate UPDATE queries with anywhere from one to numerous REPLACE functions (even within…
Composite queries are used to retrieve the results from joining multiple queries after applying any filters. UNION, INTERSECT, MINUS, and UNION ALL are some of the operators used to get certain desired results.​
Nobody understands Phishing better than an anti-spam company. That’s why we are providing Phishing Awareness Training to our customers. According to a report by Verizon, only 3% of targeted users report malicious emails to management. With compan…
With Secure Portal Encryption, the recipient is sent a link to their email address directing them to the email laundry delivery page. From there, the recipient will be required to enter a user name and password to enter the page. Once the recipient …

777 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question