Want to protect your cyber security and still get fast solutions? Ask a secure question today.Go Premium

  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 327
  • Last Modified:

Checking for similarity in records

Im new to MySQL and Im looking for some advice.
Im creating a database of job advertisements from several different websites. The job advertisements are parsed from webpages using Coldfusion and then entered into a MySQ database.
All the jos have a Primary jodID (AUTO_INCREMENT) and a unique identifier that consists of websitename and jobref (the job ref is taken from the website).

Problem 1: The same job may be advertised at one or more websites, but with one or more different jobrefs. This means I can have duplicates in the database and Im not sure how to address the problem
I need to know if there is a method that will allow me to compare the similarity of the data I have parsed with the data already in the database.

1 Solution
Solving this problem is not easy without understanding how exactly you are parsing the job description and what are you storing in the database.  However, I would check if in your database a job already exists that has the same company, job title, and same description. However, I am not sure how effective this would be.
VHSBAuthor Commented:
I have a set of variable that store the parsed results, eg titleresults, descriptionresults, refresults etc. Those variables are then inserted into the database. EG
<cfquery name="insJobRecord" datasource="project">
                      insert into job (ref, Description, Title)
               (<cfqueryparam value="#titleresults#" cfsqltype="CF_SQL_VARCHAR">
               <cfqueryparam value="#descriptionresults#" cfsqltype="CF_SQL_LONGTEXT">,
      <cfqueryparam value="#refresults#" cfsqltype="CF_SQL_VARCHAR">)
Im not sure how i could compare the individual database records against the parse results. Am I making sense?
try this:

DELETE tblname
     FROM tblname,
         ( SELECT MIN(ident) AS minIdent, name
             FROM tblname
             GROUP BY name
             HAVING COUNT(1) > 1 ) AS derived
     WHERE tblname.name = derived.name
     AND ident > minIdent      
Either Min or Max could have been used.  The idea is to find just one of the "ident" values out of the group of rows that are otherwise duplicates.

"having" can be thought of as a "where" clause for "group by"... in this case we're selecting only those rows that have a count of "ident" values > 1 (just the ones which are duplicated).  The "1" refers to the first column in the query, "minIdent".  Column references are 1-based so "0" would be an error.

This decides which of the selected rows will be deleted... all of the ones with an ident greater than the smallest ident in the group.  If we'd used "max(ident)" then this statement would have been "AND ident < maxIdent"  (the other references to minIdent would have been changed to maxIdent... the name is arbitrary and could have been anything so long as it was the same throughout the query).

Hmmmm, that code looks familiar :-).  

I don't think it applies to this question however since it was designed to remove duplicate records where the "name" field is identical.  The trick here is to recognize "sameness" that might not be "identical" before the row is inserted.  Of course, once "sameness" has been determined, this query might be useful with the proper modifications to remove those "similarly duplicate" rows which have already been inserted.

-- Lynn
Guy Hengel [angelIII / a3]Billing EngineerCommented:
I don't think you can really determine from outside which jobs are "identical" and which are not.
Of course, if a company searches via several sites, it might also ask from a head-hunting or body-shopping which could also generate other entries etc.

I think, you should "solve" the problem by listing all the offers found, group them by either source site or offering company (users's choice), and a clear indication on YOUR site that some of the lines could refer to in fact the same job.


Featured Post

Free Tool: Port Scanner

Check which ports are open to the outside world. Helps make sure that your firewall rules are working as intended.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

Tackle projects and never again get stuck behind a technical roadblock.
Join Now