ROM

PHP - compare strings and find the closest match and score them with a number.

Hi Everyone,


As part of an import routine I am working on I am trying to find the MOST likely data to match a bunch of fields from the items listed in the tables.


So really I am trying to do a sounds like with an integer value return to work out which value may be the closest to the input string.


Then I will sort a list and present it in MOST similar to LEAST similar order.


My actual project involves lots of fields and reading in lines by lines. So below I have put together a simple version that when I get this right then I can apply to my project.


I have tried the following so far:

<?php
$myinput = "jelly";
$mytest1 = "nelly";
$mytest2 = "jelly";
$mytest3 = "jellies";
$mytest4 = "Bellyse";

// similar_text() returns the number of matching characters and sets $perc
// to a similarity percentage (higher is closer).
echo similar_text($mytest1, $myinput, $perc) . " perc: " . $perc . "<br>";
echo similar_text($mytest2, $myinput, $perc) . " perc: " . $perc . "<br>";
echo similar_text($mytest3, $myinput, $perc) . " perc: " . $perc . "<br>";
echo similar_text($mytest4, $myinput, $perc) . " perc: " . $perc . "<br>";

// levenshtein() returns the edit distance (lower is closer, 0 is identical).
echo levenshtein($mytest1, $myinput) . "<br>";
echo levenshtein($mytest2, $myinput) . "<br>";
echo levenshtein($mytest3, $myinput) . "<br>";
echo levenshtein($mytest4, $myinput) . "<br>";

// strcmp() only reports sort order (negative, 0, positive); it does not
// measure similarity.
echo strcmp($mytest1, $myinput) . "<br>";
echo strcmp($mytest2, $myinput) . "<br>";
echo strcmp($mytest3, $myinput) . "<br>";
echo strcmp($mytest4, $myinput) . "<br>";
?>



similar_text with a percentage comparison seems the best so far. 
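For reference, the sort step described above might look like the following minimal sketch. The helper name `rankBySimilarText()` is hypothetical (it is not a PHP built-in), and lowercasing both strings before comparing is an assumption about what the import needs.

```php
<?php
// Rank candidate strings against an input by similar_text() percentage.
// rankBySimilarText() is a hypothetical helper name, not a PHP built-in.
function rankBySimilarText(string $input, array $candidates): array
{
    $scored = [];
    foreach ($candidates as $candidate) {
        // Assumption: compare case-insensitively so "JELLY" and "jelly" score the same.
        similar_text(strtolower($input), strtolower($candidate), $percent);
        $scored[] = ['value' => $candidate, 'percent' => $percent];
    }
    // Highest percentage first.
    usort($scored, fn($a, $b) => $b['percent'] <=> $a['percent']);
    return $scored;
}

$ranked = rankBySimilarText('jelly', ['nelly', 'jelly', 'jellies', 'Bellyse']);
foreach ($ranked as $row) {
    printf("%-8s %.1f%%\n", $row['value'], $row['percent']);
}
```

With these four values, "jelly" scores 100% and "nelly" scores 80%, ahead of "jellies" at roughly 66.7%, because the percentage is computed from the combined length of both strings.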

Anyone else have any better examples of comparison and other functions please ?


Many thanks in advance 


R


ROM

ASKER

Using similar_text is not brilliant.

If I search for Tin I get the following ranking:

Tin
Thin
Tin Hat

Whereas Tin and Tin Hat should be the top results.

Any ideas please? I have seen reference to sounds like.... Please advise.

Thanks
R
ROM

ASKER

I am really finding poor results here, so please let me know your ideas, history and experience on this.
It's a big part of an interpretation module I need to build. So far the results are not that favourable.

Thank you

R
Scott Fell

I am confused by
Tin
Thin
Tin Hat

That is not in your input?

Said differently, you want to search for "jelly", then look through a list like the one below and come up with the best match?
nelly
jelly  
jellies
Bellyse  

In the above case, that would be jelly.

What if the list did not contain 'jelly'?
nelly
jellies
Bellyse  

Are you expecting jellies to be returned?
ROM

ASKER

hi Scott,

That was an example showing how poor the ranking is.

I want the ranking not to skew so badly when the length of the names changes.

So the ranking for Denim should give something like the following order:

denim

denim belt

denim trousers

zenim

zenith

blenin

or something similar.
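One way to get close to that order is to compare word by word rather than string by string, so that extra words in a candidate do not drag the score down. The sketch below is only an illustration under that assumption; `wordLevelScore()` and `rankByWords()` are hypothetical helper names, and the tie-break (shorter candidates first, then alphabetical) is a design choice, not something from the thread.

```php
<?php
// Hypothetical helper: for each input word, take its best similar_text()
// percentage against any word of the candidate, then average the results.
function wordLevelScore(string $input, string $candidate): float
{
    $inputWords = preg_split('/\s+/', strtolower($input), -1, PREG_SPLIT_NO_EMPTY);
    $candidateWords = preg_split('/\s+/', strtolower($candidate), -1, PREG_SPLIT_NO_EMPTY);
    if (!$inputWords || !$candidateWords) {
        return 0.0;
    }
    $total = 0.0;
    foreach ($inputWords as $inputWord) {
        $best = 0.0;
        foreach ($candidateWords as $candidateWord) {
            similar_text($inputWord, $candidateWord, $percent);
            $best = max($best, $percent);
        }
        $total += $best;
    }
    return $total / count($inputWords);
}

// Hypothetical helper: sort candidates from most to least similar.
function rankByWords(string $input, array $candidates): array
{
    usort($candidates, function (string $a, string $b) use ($input) {
        $byScore = wordLevelScore($input, $b) <=> wordLevelScore($input, $a);
        if ($byScore !== 0) {
            return $byScore;
        }
        // Tie-break: shorter candidates first, then alphabetical.
        $byLength = strlen($a) <=> strlen($b);
        return $byLength !== 0 ? $byLength : strcmp($a, $b);
    });
    return $candidates;
}

$list = ['zenith', 'denim trousers', 'blenin', 'zenim', 'denim belt', 'denim'];
print_r(rankByWords('Denim', $list));
```

For this list the result is denim, denim belt, denim trousers, zenim, then blenin and zenith, which tie on score and fall back to alphabetical order.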

thanks

R
Scott Fell

To keep this simple, let's work with one set of examples. Switching between Jelly, Tin and Denim is going to make this very confusing.

Does it have to rank in order or can it just pick the best option?
ROM

ASKER

I want it to rank from most likely to least likely, just like similar_text with a percentage or difference value, so I can sort it.

Pick one example, I am happy with that. My current results are very unfavourable using the functions I mentioned, so I think I need to do more. I think the length of the values is skewing the comparison.

thanks in advance

R
Scott Fell

I am also thinking this may be best handled in the database, such as MySQL or SQL Server.
ROM

ASKER

The environment is MySQL and PHP.

I am trying to see whether soundex with a difference score might be better.

However, I have not got there yet :)
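For anyone following along, a soundex-based difference could be sketched in PHP as below. `soundex()` is a PHP built-in; `soundexAgreement()` is a hypothetical helper that simply counts how many of the four code positions agree, which is one possible reading of "a difference" and not a standard function.

```php
<?php
// Hypothetical helper: count the positions (0 to 4) at which the Soundex
// codes of two strings agree. Higher means the strings sound more alike.
function soundexAgreement(string $a, string $b): int
{
    $codeA = soundex($a);
    $codeB = soundex($b);
    // Guard against inputs for which no full four-character code is produced.
    if (strlen($codeA) < 4 || strlen($codeB) < 4) {
        return 0;
    }
    $same = 0;
    for ($i = 0; $i < 4; $i++) {
        if ($codeA[$i] === $codeB[$i]) {
            $same++;
        }
    }
    return $same;
}

foreach (['Thin', 'Tin Hat', 'nelly'] as $candidate) {
    printf("%-8s %s agreement with Tin: %d\n", $candidate, soundex($candidate), soundexAgreement('Tin', $candidate));
}
```

Note that "Tin" and "Thin" produce the same Soundex code, so a phonetic code on its own cannot rank "Tin Hat" above "Thin"; it would probably need to be combined with another measure.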

Thanks in advance

R
Scott Fell

I tried playing around with https://www.php.net/manual/en/function.metaphone.php and that didn't get much better.

To keep it from getting complicated, I think you have to narrow down the search in your code.

If you have below:

$myinput = "Jelly";


$strings = [
    "nelly", "jellies", "jelly", "nelly", "kelly", "JELLY", "jilly", "jel", "jell"
];



You could first check whether anything shares the first several characters with the input. As you are finding out, these algorithms treat kelly and jelly as very similar.
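A minimal sketch of that first pass, assuming a case-insensitive comparison and a prefix length of three characters (both are assumptions, and `filterByPrefix()` is a hypothetical helper name):

```php
<?php
// Hypothetical helper: keep only candidates that share the input's first
// $prefixLength characters, compared case-insensitively.
function filterByPrefix(string $input, array $candidates, int $prefixLength = 3): array
{
    $matches = array_filter(
        $candidates,
        fn(string $candidate) => strncasecmp($input, $candidate, $prefixLength) === 0
    );
    // Re-index so the result is a plain list.
    return array_values($matches);
}

$myinput = "Jelly";
$strings = [
    "nelly", "jellies", "jelly", "nelly", "kelly", "JELLY", "jilly", "jel", "jell"
];

print_r(filterByPrefix($myinput, $strings));
```

Here the filter keeps jellies, jelly, JELLY, jel and jell, and drops nelly, kelly and jilly before any similarity scoring is done. The trade-off is that a typo in the first few characters excludes an otherwise good match.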

What is the actual use case?
ROM

ASKER

Hi Scott,

Yes, I tried metaphone; it was no good.
The use case is that several files get imported every single day and they provide key information for this organisation.

The data is NOT accessible digitally apart from text files.

I import and then I need to marry up several fields of data and make database records out of this for a larger system that provides daily operations.

So for a location I need to for example match: Warehouse, Recreational, Smith's Flat, Smith's Supermarket, The Smith's yard, Ned's Yard.

With the input being possibly: "Smith's apartment". So nothing will match up here on this occasion 100%.
So I need to decide on a likeness threshold and put the list of options in order for a data clerk to review and make a decision on as part of the continual importation task.

So for example if the match was 95% .. I would set that in the database record as I am fairly certain and the data clerk will validate. If I only get 94%, 92%, 89%, 77% I want to present these as options within the application and the data clerk at a fast rate can just select and not start searching etc...
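The threshold logic described above could be sketched roughly as follows. `triageMatches()` is a hypothetical helper, the scores are assumed to be percentages keyed by candidate name from whichever similarity function ends up being used, and the example numbers are placeholders taken from the percentages mentioned above, not real results.

```php
<?php
// Hypothetical helper: pick an automatic match when the best score reaches
// the threshold, and always return the top options for the data clerk.
// $scores maps candidate name => similarity percentage.
function triageMatches(array $scores, float $autoThreshold = 95.0, int $maxOptions = 5): array
{
    arsort($scores); // highest score first, candidate names kept as keys
    $best = array_key_first($scores);
    $auto = ($best !== null && $scores[$best] >= $autoThreshold) ? $best : null;
    return [
        'auto' => $auto,
        'options' => array_slice($scores, 0, $maxOptions, true),
    ];
}

// Placeholder scores for illustration only.
$scores = ["Smith's Flat" => 94.0, "The Smith's yard" => 89.0, "Smith's Supermarket" => 92.0, "Ned's Yard" => 77.0];
print_r(triageMatches($scores));
```

With these placeholder scores nothing reaches 95, so `auto` is null and the four candidates are returned in descending order for review.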

Many thanks in advance.

R


ROM

ASKER

I think I will make my own engine by stripping the punctuation from the search location and exploding it into an array of words.

Then I will iterate through each stored location, strip its punctuation and explode the location name into an array as well.

Then I will do an in_array check on each element, cycling through the words included in the location, and rank the locations by number of matches, whilst also maintaining an array of excluded words like THE etc.

This will at least get me matches when, for example, Office three and office kitchen are compared. They will at least get a score of 1.
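As a rough sketch of the engine outlined above (not the asker's actual implementation, which is not shown in the thread): `tokenize()` and `overlapScore()` are hypothetical helpers, the stop-word list is a placeholder, and the punctuation stripping assumes plain ASCII names.

```php
<?php
// Hypothetical helper: lowercase, strip punctuation, split into words and
// drop excluded words. Assumes ASCII text ("Smith's" becomes "smiths").
function tokenize(string $text, array $stopWords): array
{
    $clean = preg_replace('/[^a-z0-9\s]/', '', strtolower($text));
    $words = preg_split('/\s+/', $clean, -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_diff($words, $stopWords));
}

// Hypothetical helper: number of distinct input words found in the candidate.
function overlapScore(string $input, string $candidate, array $stopWords): int
{
    $inputWords = array_unique(tokenize($input, $stopWords));
    $candidateWords = tokenize($candidate, $stopWords);
    return count(array_intersect($inputWords, $candidateWords));
}

$stopWords = ['the', 'a', 'an', 'of']; // placeholder list of excluded words
$locations = ["Warehouse", "Recreational", "Smith's Flat", "Smith's Supermarket", "The Smith's yard", "Ned's Yard"];

$scores = [];
foreach ($locations as $location) {
    $scores[$location] = overlapScore("Smith's apartment", $location, $stopWords);
}
arsort($scores); // most shared words first
print_r($scores);
```

For the input "Smith's apartment", the three Smith's locations each score 1 and the rest score 0. Ties like this are where a second measure, such as a per-word similarity percentage, would be needed to order the options.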

Best way probably unless anyone has anything else?

Many thanks in advance

R
ROM

ASKER

Maybe I will also rate on the number of characters for the second and third words, and maybe apply a soundex after that. Who knows.

R
Scott Fell

Romolo, keep this open for now.  I would ask this question using either the MySQL or SQL Server topic depending on which database you are using. There may be a more elegant solution using the database?
ROM

ASKER

Hi Scott,
I did look into it as I planned and it is still not brilliant. It is just like doing similar_text with a percentage.

So I have embarked upon my own engine.. working quite well so far.
Thanks

R