Solved

Find near duplicates in mysql table based on specific criteria, merge data (or update  one row with data from other row, delete 2nd row)

Posted on 2011-03-07
16
378 Views
Last Modified: 2012-05-11
Hello,

I need to work with data that will be used to produce a printed class schedule. For bureaucratic reasons, a class with a lab and a lecture component is split into 2 separate class rows, one for the lab, and one for the lecture. But it is essentially the same class. So for the purposes of display, I want to merge the two rows, either in the mysql database itself, or when selecting the data for the display/download query. The criteria is this: The Lecture Row and the Lab row have the same section number, except that the Lab row has an X at the end of the Section, and, the Lab ClassNbr is always 1 greater than the Lecture class number. The Lab Start time is always one minute after the Lecture end time. So for display purposes I want to take the Class End time (MtgEnd) from the Lab row, suppress all the other data in the row, and merge the Class End Time (MtgEnd) to the Class End time for the Lecture row. I have attached a sample bit of data showing the before and after pictures.

Not all of the rows have a lab/lecture pairing, so those should be left alone.

Thanks for any help you can provide.

Hope this is clear. If you have any questions, please let me know.
ee2.xlsx
0
Comment
Question by:aberns
  • 8
  • 4
  • 3
  • +1
16 Comments
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 35065334
For pairings between two different but related elements, you usually want to use a thing called a "junction table" or a "pivot table".  Normally (in normalized data bases) the junction is used to coordinate information in two separate tables.  You might have a table of class lectures and a separate table of labs.  Each of these tables has a key for every row.   A relational link exists when the junction table contains a row with both a class key and a lab key.  Check this search for more information on normalization of data bases.  It is part art and part science:
http://lmgtfy.com?q=Should+I+normalize+my+database

If I were to have to deal with your current rule set, I would first add new table of labs to complement the lectures table.  One of the most important facts to know would be whether "this is a lecture" or "this is a lab" and that kind of information would make sense (to me, at least) to keep in different tables.  Then I would run the old table and test all the rules.  For those that are labs, I would add a row to the lab table with the appropriate and accurate lab information.  And I would add a row to the junction table that tied this lab to the lecture.

What you will be creating in the junction table is a "many-to-many relationship" between lectures and labs.  In that way zero, one, or more lecturers can send their students to the same labs, and lab results can be sent back to zero, one, or more lecturers.  It is a very flexible design.

Does that make sense for your needs?  HTH, ~Ray
0
 
LVL 18

Expert Comment

by:BigSchmuh
ID: 35068980
I won't modify the db structure for this need...

I suggest you to query 2 times the same table. First one is a "base" and the eventually joined one can be the Lecture(Lab) when base is Lab(Lecture) or null row otherwise

SELECT MainC.ClassNbr, MainC.Subject, MainC.Catalog, MainC.Section, MainC.Begin Date, Coalesce(SubC.End Date, MainC.End Date), MainC.Lname, MainC.Fname, MainC.MtgStart, MainC.MtdEnd, MainC.Mon, MainC.Tues, MainC.Wed, MainC.Thur, MainC.Fri, MainC.Sat, MainC.Sun, MainC.Descr, MainC.Min CH, MainC.Max CH, MainC.PreReq, MainC.Room, MainC.IAI, MainC.Notes
FROM EE2 MainC
   Left join EE2 SubC
      ON   (MainC.ClassNbr = SubC.ClassNbr-1 AND Left(SubC.Section,1)="X")
        OR (MainC.ClassNbr = SubC.ClassNbr+1 AND Left(MainC.Section,1)="X")
WHERE SubC.ClassNbr IS NULL OR MainC.ClassNbr = SubC.ClassNbr-1

Coalesce is a classic function that returns the first non empty field
If your DBMS does not support it, you can replace it easily (or ask)
0
 

Author Comment

by:aberns
ID: 35070046
Hello Ray,
Thank you for your thoughts. I just spoke with the person who is getting me the data (which I am going to import into MySQL), and he can separate the labs from the lectures. I can join the rows on Catalog=Catalog and Subject=Subject and Section = Section Contact X at the end, but I think I will still need help just pulling the MtgEnd from the lab table and putting that in place of the end section of the lecture table. Based on what I just said, assuming that the data is put into the two tables, can you help me with the subsequent query?
0
 
LVL 18

Expert Comment

by:BigSchmuh
ID: 35070163
You don't need to modify the db structure to get your expected results...
Please have a look at my post above
0
 

Author Comment

by:aberns
ID: 35070174
@bigschmuh,
OK....let me try it and get back to you. Thanks!
0
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 35070191
How much data are we talking about here?  Penn State or a small teachers' college?
0
 
LVL 18

Expert Comment

by:BigSchmuh
ID: 35070216
@Ray: you don't care of the rows count because the above query can handle many WHERE criterias and INDEXes based on the MainC table
0
 

Author Comment

by:aberns
ID: 35070258
Between 5800 - 10,000 K rows of data. I can also break the data into subsets, emptying it out each time if need be. I am already joining two other tables, one containing notes, and one containing prereqs.
0
How to run any project with ease

Manage projects of all sizes how you want. Great for personal to-do lists, project milestones, team priorities and launch plans.
- Combine task lists, docs, spreadsheets, and chat in one
- View and edit from mobile/offline
- Cut down on emails

 
LVL 18

Expert Comment

by:BigSchmuh
ID: 35070480
...in my SQL above, you should have the join criteria for "a X at the end" uses a RIGHT() operator instead of LEFT() one

SELECT MainC.ClassNbr, MainC.Subject, MainC.Catalog, MainC.Section, MainC.Begin Date, Coalesce(SubC.End Date, MainC.End Date), MainC.Lname, MainC.Fname, MainC.MtgStart, MainC.MtdEnd, MainC.Mon, MainC.Tues, MainC.Wed, MainC.Thur, MainC.Fri, MainC.Sat, MainC.Sun, MainC.Descr, MainC.Min CH, MainC.Max CH, MainC.PreReq, MainC.Room, MainC.IAI, MainC.Notes
FROM EE2 MainC
   Left join EE2 SubC
      ON   (MainC.ClassNbr = SubC.ClassNbr-1 AND Right(SubC.Section,1)='X')
        OR (MainC.ClassNbr = SubC.ClassNbr+1 AND Right(MainC.Section,1)='X')
WHERE (SubC.ClassNbr IS NULL OR MainC.ClassNbr = SubC.ClassNbr-1)
0
 

Author Comment

by:aberns
ID: 35070502
alright...I will not be able to try this until about 2 pm CST, but I will post back at that time. Thanks.
0
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 35076953
Hi, @BigSchmuh.  I have no objection to doing it your way for this query, so I will sign off on this question now.  Have fun.  In my initial post, I was taking the longer view of whether the data base should be normalized and relational.  My belief is that a one-time modification that encourages "best practices" is a good step in the right direction.  Eventually issues will arise with disorganized data - sorting, gouping, etc.  The queries will get complex and complicated and difficult to debug or explain to other programmers.  Any time you have a column in a table that can have more than one meaning, you are asking for trouble.  And trouble has a persistent and nasty way of finding you.

If a programmer had come to me in a job interview and said that her design included this: The Lecture Row and the Lab row have the same section number, except that the Lab row has an X at the end of the Section, and, the Lab ClassNbr is always 1 greater than the Lecture class number. she would be thanked for her time and escorted to the exit.  You can program around that kind of design for a while.  But eventually the house of cards gets caught in a breeze.  As good as Martin Fowler's "Refactoring" book might be, I have never been able to make a business case for refactoring!  It is far easier to use the simple and pedestrian designs that lazy programmers prefer.

best to all, ~Ray
0
 

Author Comment

by:aberns
ID: 35078147
ahem, just want to put in that I did not choose the data structure, but was trying to work with what I was given on a very tight time frame. Ray, since your suggestion I have been trying to get more normalized data from the providers. At the time I posted the question, I did not know that was an option.
0
 
LVL 10

Accepted Solution

by:
Mathiyazhagan earned 500 total points
ID: 35082003
Hi,

Try the below query for update & delete.

UPDATE test_duplicate a, test_duplicate b 
SET  a.MtdEnd = b.MtdEnd 
WHERE	a.subject = b.subject AND a.catalog = b.catalog
	AND b.section = CONCAT(a.section, 'X')
	AND a.begin_date = b.begin_date
	AND a.end_date = b.end_date
    AND b.id = (a.id+1) 

Open in new window


It will update end time of the theory class with end time of lab class. You can use additional conditions in the ON clause. Assumed the Lab Id (b.id) follows immediately after theory Id(a.id).

Delete the Lab Row
DELETE b FROM test_duplicate b INNER JOIN  test_duplicate a ON a.subject = b.subject AND a.catalog = b.catalog
	AND b.section = CONCAT(a.section, 'X')
	AND a.begin_date = b.begin_date
	AND a.end_date = b.end_date
    AND b.id = (a.id+1)

Open in new window


Have tested the query in mysql 5+.
0
 

Author Comment

by:aberns
ID: 35086420
Mathiyazhagan:,

By Jove, I think it's working! I just tried it on a subset of data and it seemed to work. I will do a more extensive test later and get back to you.

Thank you so much.
0
 

Author Comment

by:aberns
ID: 35088058
Mathiyazhagan:,

It seems to be working beautifully. I'm going to have another set of eyes look at the data tomorrow before signing off.

Thanks to everyone who chimed in.

0
 

Author Closing Comment

by:aberns
ID: 35144022
Sorry for the delay in replying. I was implementing and reimplementing this query to try and meet our crazy deadline...anyway, it's simple and does the trick. Thanks so much.
0

Featured Post

6 Surprising Benefits of Threat Intelligence

All sorts of threat intelligence is available on the web. Intelligence you can learn from, and use to anticipate and prepare for future attacks.

Join & Write a Comment

Suggested Solutions

Title # Comments Views Activity
spacing 5 27
Data center mess 4 46
How do I access property of nested object in PHP? 3 21
session dropped in IE 10 19
This article describes how to use the timestamp of existing data in a database to allow Tableau to calculate the prior work day instead of relying on case statements or if statements to calculate the days of the week.
Many companies are looking to get out of the datacenter business and to services like Microsoft Azure to provide Infrastructure as a Service (IaaS) solutions for legacy client server workloads, rather than continuing to make capital investments in h…
The viewer will learn how to look for a specific file type in a local or remote server directory using PHP.
The viewer will learn how to create a basic form using some HTML5 and PHP for later processing. Set up your basic HTML file. Open your form tag and set the method and action attributes.: (CODE) Set up your first few inputs one for the name and …

708 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

12 Experts available now in Live!

Get 1:1 Help Now