SQL Duplicates Help

I have a database full of employees and their associated information which is output from our ERP system on a daily basis as a csv file, then imported into the MySQL database on a Linux server using the mysqlimport function.  The data coming from the ERP system cannot be modified in any other way so this process has to be completed either within the MySQL database itself or a php script to run at time of import.

The problem I am having is that there are duplicates being imported into the database and I am wanting to remove them.  The issue I am running into is I cannot seem to find a common generalized way to search, find, and delete these duplicates.

The username standard used to be firstNameLastName and now it is firstNameEmpID with a few exceptions to the first rule as seen in example 2.

A few examples:
empID      empFirstName      empLastName      empUserName
5801      TI                          SIMMONS           TI5801
5801      TI                          SIMMONS           HUDOP008

So in this case I want to eliminate the 2nd one.

empID      empFirstName      empLastName      empUserName
2875      LYNNE                TEETERS      LYNNE2
2875      LYNNE                TEETERS      LYNNE

So in this case I want to eliminate the 1st one.

empID      empFirstName      empLastName      empUserName
5598      MORGAN               WENTWORTH         MORGANW
5598      MORGAN               WENTWORTH         MORGAN5598

In this case I want to remove the first one.

Any ideas on how to overcome all of these special cases would be greatly appreciated!
Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

Guy Hengel [angelIII / a3]Billing EngineerCommented:
please check this article on how to do this:
PhillipsPlasticsAuthor Commented:
Not really sure how that article would relate to what I am attempting to do here but I don't think it applies since I cannot add a limit of 1 or use min or max due to the changing nature of the end result.
Frankly neither do I.. can't wait for the explanation..

Have you considered defining a UNIQUE INDEX for the columns you want to be unique?  One INDEX can cover multiple colums, e.g. FirstName, LastName, etc..

By defining a unique index, when an attempt is made to insert a new record that matches it, it will give you an error, which you can handle.. and decide what to do from that?
- Do you ignore it and continue?  (I already have that one)
- Do you delete the existing one and reattempt an insert?
- ... ?

Here is the MySQL documentation page for how to define it: MySQL: CREATE INDEX Syntax

Does that look about like what you're looking to do?  If so, let me know, and we'll explore that.
Powerful Yet Easy-to-Use Network Monitoring

Identify excessive bandwidth utilization or unexpected application traffic with SolarWinds Bandwidth Analyzer Pack.

Guy Hengel [angelIII / a3]Billing EngineerCommented:
actually, I now see your problem.
still the article explains that you need some clear rule. in the examples, you just posted that: examples, but no clear "rule".
WHY do you want to keep/remove a line over the other one.
examples would be:
* longer/st value to be kept
* numeric data over non-numeric
* ...

and then, when the rules are defined, you can try to implement that into sql.
so, the first step is yours: define (and post) the exact rules.
if you then don't get that translated into SQL (mysql), we can try to help
Ray PaseurCommented:
If you make a UNIQUE index, MySQL will throw error #1062 upon an attempt to insert a row with a duplicate UNIQUE value.  But that will not help you with LYNNE2, which violates both the old and new rules.  

This makes sense:
empFirstName + empID = empUserName

And these rows obey the pattern:
5801 TI  SIMMONS yields TI5801

My general sense about this would be that you might consider flagging duplicates for manual intervention if you cannot get conformity to the rule patterns.


Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
what you can also do without creating an index is prior to any insert do something like
DELETE FROM table WHERE empID = '$empId'

as empID is likely unique in itself, there should not be two employees with the same ID

which will erase the old record prior to importing the more up to date record

i use this method when synchronizing a very large oracle table into mysql
^^^ what i said above will also remove the duplicates on the first sync
here is some code that should work for your implementation

//load CSV and convert into two dimensional array
$csv_rows = file("file.csv");
foreach($csv_rows as $row_num => $data) {
  $cols = explode(',',$data);
  foreach($cols as $col_num => $col_data) {
    $cols[$col_num] = trim($col_data);
  $csv_rows[$row_num] = $cols;

//convert rows to standard strings and import
foreach($csv_rows as $current_row) {
   $employee_id = $current_row[0];
   $import_string = '\''.implode('\',\'',$current_row).'\'';
   mysql_query("DELETE FROM employees where empID='{$employee_id}'");
   mysql_query("INSERT INTO employees ($import_string)");
PhillipsPlasticsAuthor Commented:
@ZHaver: Unfortunately that is the biggest issue I am having with regard to this problem, the empID is not unique, only the usernames are unique with duplicated empIDs.

@angelll: Unfortunately I cannot define rules to fit every situation I have out there, part of the issue is nonconformity in the data as standards have changed over the years.  I am thinking I may take the route of making rules for the majority of instances then flag the rest for manual inclusion and deletion.

@Ray_Paseur: Is there a way you would recommend for flagging duplicates?

So here are the rules as I see them currently:
Username =
firstname+last initial OR
firstname+empID OR
Flag for manual deletion

Currently I use the SQL Statement:
SELECT empid FROM peopleTEST group by empid having (count(empid) > 1 );
to obtain all of the duplicated empIDs.
Is there a way within MySQL to iterate through the result set and do the above rules or am I better off using a different script outside of the DB to accomplish this?
Guy Hengel [angelIII / a3]Billing EngineerCommented:
If you cannot put rules, you cannot expect SQL alone to solve this.
PhillipsPlasticsAuthor Commented:
Alright PHP it is then.  Thank you everyone for the help.
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
MySQL Server

From novice to tech pro — start learning today.