Checking for similarity in records

Posted on 2005-04-02
Medium Priority
Last Modified: 2009-08-11
Im new to MySQL and Im looking for some advice.
Im creating a database of job advertisements from several different websites. The job advertisements are parsed from webpages using Coldfusion and then entered into a MySQ database.
All the jos have a Primary jodID (AUTO_INCREMENT) and a unique identifier that consists of websitename and jobref (the job ref is taken from the website).

Problem 1: The same job may be advertised at one or more websites, but with one or more different jobrefs. This means I can have duplicates in the database and Im not sure how to address the problem
I need to know if there is a method that will allow me to compare the similarity of the data I have parsed with the data already in the database.

Question by:VHSB
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
LVL 17

Expert Comment

ID: 13688301
Solving this problem is not easy without understanding how exactly you are parsing the job description and what are you storing in the database.  However, I would check if in your database a job already exists that has the same company, job title, and same description. However, I am not sure how effective this would be.

Author Comment

ID: 13688555
I have a set of variable that store the parsed results, eg titleresults, descriptionresults, refresults etc. Those variables are then inserted into the database. EG
<cfquery name="insJobRecord" datasource="project">
                      insert into job (ref, Description, Title)
               (<cfqueryparam value="#titleresults#" cfsqltype="CF_SQL_VARCHAR">
               <cfqueryparam value="#descriptionresults#" cfsqltype="CF_SQL_LONGTEXT">,
      <cfqueryparam value="#refresults#" cfsqltype="CF_SQL_VARCHAR">)
Im not sure how i could compare the individual database records against the parse results. Am I making sense?
LVL 16

Accepted Solution

ellandrd earned 2000 total points
ID: 13691021
try this:

DELETE tblname
     FROM tblname,
         ( SELECT MIN(ident) AS minIdent, name
             FROM tblname
             GROUP BY name
             HAVING COUNT(1) > 1 ) AS derived
     WHERE tblname.name = derived.name
     AND ident > minIdent      
Either Min or Max could have been used.  The idea is to find just one of the "ident" values out of the group of rows that are otherwise duplicates.

"having" can be thought of as a "where" clause for "group by"... in this case we're selecting only those rows that have a count of "ident" values > 1 (just the ones which are duplicated).  The "1" refers to the first column in the query, "minIdent".  Column references are 1-based so "0" would be an error.

This decides which of the selected rows will be deleted... all of the ones with an ident greater than the smallest ident in the group.  If we'd used "max(ident)" then this statement would have been "AND ident < maxIdent"  (the other references to minIdent would have been changed to maxIdent... the name is arbitrary and could have been anything so long as it was the same throughout the query).

LVL 22

Expert Comment

ID: 13694207
Hmmmm, that code looks familiar :-).  

I don't think it applies to this question however since it was designed to remove duplicate records where the "name" field is identical.  The trick here is to recognize "sameness" that might not be "identical" before the row is inserted.  Of course, once "sameness" has been determined, this query might be useful with the proper modifications to remove those "similarly duplicate" rows which have already been inserted.

-- Lynn
LVL 143

Expert Comment

by:Guy Hengel [angelIII / a3]
ID: 13696130
I don't think you can really determine from outside which jobs are "identical" and which are not.
Of course, if a company searches via several sites, it might also ask from a head-hunting or body-shopping which could also generate other entries etc.

I think, you should "solve" the problem by listing all the offers found, group them by either source site or offering company (users's choice), and a clear indication on YOUR site that some of the lines could refer to in fact the same job.


Featured Post

Get your Disaster Recovery as a Service basics

Disaster Recovery as a Service is one go-to solution that revolutionizes DR planning. Implementing DRaaS could be an efficient process, easily accessible to non-DR experts. Learn about monitoring, testing, executing failovers and failbacks to ensure a "healthy" DR environment.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

In this blog post, we’ll look at how using thread_statistics can cause high memory usage.
In this blog, we’ll look at how improvements to Percona XtraDB Cluster improved IST performance.
In this video, Percona Solution Engineer Dimitri Vanoverbeke discusses why you want to use at least three nodes in a database cluster. To discuss how Percona Consulting can help with your design and architecture needs for your database and infras…
In this video, Percona Solution Engineer Rick Golba discuss how (and why) you implement high availability in a database environment. To discuss how Percona Consulting can help with your design and architecture needs for your database and infrastr…
Suggested Courses

764 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question