Find near duplicates in mysql table based on specific criteria, merge data (or update one row with data from other row, delete 2nd row)

Hello,

I need to work with data that will be used to produce a printed class schedule. For bureaucratic reasons, a class with a lab and a lecture component is split into 2 separate class rows, one for the lab, and one for the lecture. But it is essentially the same class. So for the purposes of display, I want to merge the two rows, either in the mysql database itself, or when selecting the data for the display/download query. The criteria is this: The Lecture Row and the Lab row have the same section number, except that the Lab row has an X at the end of the Section, and, the Lab ClassNbr is always 1 greater than the Lecture class number. The Lab Start time is always one minute after the Lecture end time. So for display purposes I want to take the Class End time (MtgEnd) from the Lab row, suppress all the other data in the row, and merge the Class End Time (MtgEnd) to the Class End time for the Lecture row. I have attached a sample bit of data showing the before and after pictures.

Not all of the rows have a lab/lecture pairing, so those should be left alone.

Thanks for any help you can provide.

Hope this is clear. If you have any questions, please let me know.
ee2.xlsx
abernsAsked:
Who is Participating?
 
MathiyazhaganConnect With a Mentor Commented:
Hi,

Try the below query for update & delete.

UPDATE test_duplicate a, test_duplicate b 
SET  a.MtdEnd = b.MtdEnd 
WHERE	a.subject = b.subject AND a.catalog = b.catalog
	AND b.section = CONCAT(a.section, 'X')
	AND a.begin_date = b.begin_date
	AND a.end_date = b.end_date
    AND b.id = (a.id+1) 

Open in new window


It will update end time of the theory class with end time of lab class. You can use additional conditions in the ON clause. Assumed the Lab Id (b.id) follows immediately after theory Id(a.id).

Delete the Lab Row
DELETE b FROM test_duplicate b INNER JOIN  test_duplicate a ON a.subject = b.subject AND a.catalog = b.catalog
	AND b.section = CONCAT(a.section, 'X')
	AND a.begin_date = b.begin_date
	AND a.end_date = b.end_date
    AND b.id = (a.id+1)

Open in new window


Have tested the query in mysql 5+.
0
 
Ray PaseurCommented:
For pairings between two different but related elements, you usually want to use a thing called a "junction table" or a "pivot table".  Normally (in normalized data bases) the junction is used to coordinate information in two separate tables.  You might have a table of class lectures and a separate table of labs.  Each of these tables has a key for every row.   A relational link exists when the junction table contains a row with both a class key and a lab key.  Check this search for more information on normalization of data bases.  It is part art and part science:
http://lmgtfy.com?q=Should+I+normalize+my+database

If I were to have to deal with your current rule set, I would first add new table of labs to complement the lectures table.  One of the most important facts to know would be whether "this is a lecture" or "this is a lab" and that kind of information would make sense (to me, at least) to keep in different tables.  Then I would run the old table and test all the rules.  For those that are labs, I would add a row to the lab table with the appropriate and accurate lab information.  And I would add a row to the junction table that tied this lab to the lecture.

What you will be creating in the junction table is a "many-to-many relationship" between lectures and labs.  In that way zero, one, or more lecturers can send their students to the same labs, and lab results can be sent back to zero, one, or more lecturers.  It is a very flexible design.

Does that make sense for your needs?  HTH, ~Ray
0
 
BigSchmuhCommented:
I won't modify the db structure for this need...

I suggest you to query 2 times the same table. First one is a "base" and the eventually joined one can be the Lecture(Lab) when base is Lab(Lecture) or null row otherwise

SELECT MainC.ClassNbr, MainC.Subject, MainC.Catalog, MainC.Section, MainC.Begin Date, Coalesce(SubC.End Date, MainC.End Date), MainC.Lname, MainC.Fname, MainC.MtgStart, MainC.MtdEnd, MainC.Mon, MainC.Tues, MainC.Wed, MainC.Thur, MainC.Fri, MainC.Sat, MainC.Sun, MainC.Descr, MainC.Min CH, MainC.Max CH, MainC.PreReq, MainC.Room, MainC.IAI, MainC.Notes
FROM EE2 MainC
   Left join EE2 SubC
      ON   (MainC.ClassNbr = SubC.ClassNbr-1 AND Left(SubC.Section,1)="X")
        OR (MainC.ClassNbr = SubC.ClassNbr+1 AND Left(MainC.Section,1)="X")
WHERE SubC.ClassNbr IS NULL OR MainC.ClassNbr = SubC.ClassNbr-1

Coalesce is a classic function that returns the first non empty field
If your DBMS does not support it, you can replace it easily (or ask)
0
Improve Your Query Performance Tuning

In this FREE six-day email course, you'll learn from Janis Griffin, Database Performance Evangelist. She'll teach 12 steps that you can use to optimize your queries as much as possible and see measurable results in your work. Get started today!

 
abernsAuthor Commented:
Hello Ray,
Thank you for your thoughts. I just spoke with the person who is getting me the data (which I am going to import into MySQL), and he can separate the labs from the lectures. I can join the rows on Catalog=Catalog and Subject=Subject and Section = Section Contact X at the end, but I think I will still need help just pulling the MtgEnd from the lab table and putting that in place of the end section of the lecture table. Based on what I just said, assuming that the data is put into the two tables, can you help me with the subsequent query?
0
 
BigSchmuhCommented:
You don't need to modify the db structure to get your expected results...
Please have a look at my post above
0
 
abernsAuthor Commented:
@bigschmuh,
OK....let me try it and get back to you. Thanks!
0
 
Ray PaseurCommented:
How much data are we talking about here?  Penn State or a small teachers' college?
0
 
BigSchmuhCommented:
@Ray: you don't care of the rows count because the above query can handle many WHERE criterias and INDEXes based on the MainC table
0
 
abernsAuthor Commented:
Between 5800 - 10,000 K rows of data. I can also break the data into subsets, emptying it out each time if need be. I am already joining two other tables, one containing notes, and one containing prereqs.
0
 
BigSchmuhCommented:
...in my SQL above, you should have the join criteria for "a X at the end" uses a RIGHT() operator instead of LEFT() one

SELECT MainC.ClassNbr, MainC.Subject, MainC.Catalog, MainC.Section, MainC.Begin Date, Coalesce(SubC.End Date, MainC.End Date), MainC.Lname, MainC.Fname, MainC.MtgStart, MainC.MtdEnd, MainC.Mon, MainC.Tues, MainC.Wed, MainC.Thur, MainC.Fri, MainC.Sat, MainC.Sun, MainC.Descr, MainC.Min CH, MainC.Max CH, MainC.PreReq, MainC.Room, MainC.IAI, MainC.Notes
FROM EE2 MainC
   Left join EE2 SubC
      ON   (MainC.ClassNbr = SubC.ClassNbr-1 AND Right(SubC.Section,1)='X')
        OR (MainC.ClassNbr = SubC.ClassNbr+1 AND Right(MainC.Section,1)='X')
WHERE (SubC.ClassNbr IS NULL OR MainC.ClassNbr = SubC.ClassNbr-1)
0
 
abernsAuthor Commented:
alright...I will not be able to try this until about 2 pm CST, but I will post back at that time. Thanks.
0
 
Ray PaseurCommented:
Hi, @BigSchmuh.  I have no objection to doing it your way for this query, so I will sign off on this question now.  Have fun.  In my initial post, I was taking the longer view of whether the data base should be normalized and relational.  My belief is that a one-time modification that encourages "best practices" is a good step in the right direction.  Eventually issues will arise with disorganized data - sorting, gouping, etc.  The queries will get complex and complicated and difficult to debug or explain to other programmers.  Any time you have a column in a table that can have more than one meaning, you are asking for trouble.  And trouble has a persistent and nasty way of finding you.

If a programmer had come to me in a job interview and said that her design included this: The Lecture Row and the Lab row have the same section number, except that the Lab row has an X at the end of the Section, and, the Lab ClassNbr is always 1 greater than the Lecture class number. she would be thanked for her time and escorted to the exit.  You can program around that kind of design for a while.  But eventually the house of cards gets caught in a breeze.  As good as Martin Fowler's "Refactoring" book might be, I have never been able to make a business case for refactoring!  It is far easier to use the simple and pedestrian designs that lazy programmers prefer.

best to all, ~Ray
0
 
abernsAuthor Commented:
ahem, just want to put in that I did not choose the data structure, but was trying to work with what I was given on a very tight time frame. Ray, since your suggestion I have been trying to get more normalized data from the providers. At the time I posted the question, I did not know that was an option.
0
 
abernsAuthor Commented:
Mathiyazhagan:,

By Jove, I think it's working! I just tried it on a subset of data and it seemed to work. I will do a more extensive test later and get back to you.

Thank you so much.
0
 
abernsAuthor Commented:
Mathiyazhagan:,

It seems to be working beautifully. I'm going to have another set of eyes look at the data tomorrow before signing off.

Thanks to everyone who chimed in.

0
 
abernsAuthor Commented:
Sorry for the delay in replying. I was implementing and reimplementing this query to try and meet our crazy deadline...anyway, it's simple and does the trick. Thanks so much.
0
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.