ROM (United Kingdom) asked:

PHP - compare strings and find the closest match and score them with a number.

Hi Everyone,


As part of an import routine I am working on, I am trying to find the data most likely to match a set of fields from the items listed in the tables.


So really I am trying to do a "sounds like" comparison that returns an integer score, so I can work out which value is closest to the input string.


Then I will sort a list and present it in MOST similar to LEAST similar order.


My actual project involves lots of fields and reads the data in line by line. So below I have put together a simple version; once I get this right I can apply it to my project.


I have tried the following so far:

<?php
$myinput = "jelly";
$mytest1 = "nelly";
$mytest2 = "jelly";
$mytest3 = "jellies";
$mytest4 = "Bellyse";

// similar_text() returns the number of matching characters and sets $perc to a percentage.
echo similar_text($mytest1, $myinput, $perc) . " perc: " . $perc . "<br>";
echo similar_text($mytest2, $myinput, $perc) . " perc: " . $perc . "<br>";
echo similar_text($mytest3, $myinput, $perc) . " perc: " . $perc . "<br>";
echo similar_text($mytest4, $myinput, $perc) . " perc: " . $perc . "<br>";
// levenshtein() returns the edit distance (0 means identical).
echo levenshtein($mytest1, $myinput) . "<br>";
echo levenshtein($mytest2, $myinput) . "<br>";
echo levenshtein($mytest3, $myinput) . "<br>";
echo levenshtein($mytest4, $myinput) . "<br>";
// strcmp() only reports ordering (negative, zero, or positive), not similarity.
echo strcmp($mytest1, $myinput) . "<br>";
echo strcmp($mytest2, $myinput) . "<br>";
echo strcmp($mytest3, $myinput) . "<br>";
echo strcmp($mytest4, $myinput) . "<br>";
?>



similar_text with a percentage comparison seems the best so far. 
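
A minimal sketch of how the sort step could look with similar_text, assuming the candidates arrive as a plain array. The helper name rankBySimilarText is hypothetical, and lower-casing both strings is an added assumption so that case does not affect the score.

```php
<?php
// Hypothetical helper: rank candidates by similar_text() percentage, best first.
function rankBySimilarText(string $input, array $candidates): array
{
    $scores = [];
    foreach ($candidates as $candidate) {
        // Assumption: compare case-insensitively so "JELLY" and "jelly" score the same.
        similar_text(strtolower($input), strtolower($candidate), $percent);
        $scores[$candidate] = $percent;
    }
    arsort($scores); // highest percentage first, keys (the candidates) preserved
    return $scores;
}

$ranked = rankBySimilarText("jelly", ["nelly", "jelly", "jellies", "Bellyse"]);
foreach ($ranked as $candidate => $percent) {
    echo $candidate . ": " . round($percent, 1) . "%\n";
}
```

With this input, jelly scores 100 and nelly 80, while jellies and Bellyse tie at roughly 66.7, which shows how little the percentage alone separates a plural from an unrelated word of the same length.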

Does anyone have better examples of comparison functions, please?


Many thanks in advance 


R


PHP

Last Comment
ROM

8/22/2022 - Mon
ASKER
ROM

Using similar_text is not brilliant.

If I search for Tin I get the following ranking:

Tin
Thin
Tin Hat

I would expect Tin and Tin Hat to be the top results.

Any ideas please? I have seen reference to sounds like.... Please advise.

Thanks
R
ASKER
ROM

I am really finding poor results here, so please let me know your ideas, history and experience on this.
It's a big part of an interpretation module I need to build. So far the results are not that favourable.

Thank you

R
Scott Fell

I am confused by
Tin
Thin
Tin Hat

That is not in your input?

Said differently, you want to search for, "jelly" and then look through a list like below and come up with the best match?  
nelly
jelly  
jellies
Bellyse  

In the above case, that would be jelly.

What if the list did not contain 'jelly'?
nelly
jellies
Bellyse  

Are you expecting jellies to be returned?
ASKER
ROM

Hi Scott,

That was an example showing how poor the ranking is.

I want the ranking to not skew so badly when the length of the names changes.

So the ranking for Denim should give something like the following order:

denim

denim belt

denim trousers

zenim

zenith

blenin

or something similar.

thanks

R
Scott Fell

To keep this simple, let's work with one set of examples. When we have Jelly, Tin and Denim, it is going to make this very confusing.

Does it have to rank in order or can it just pick the best option?
ASKER
ROM

I want it to rank from most likely to least likely, just like similar_text with a percentage or difference value, so I can sort it.

Pick one example; I am happy with that. My current results are very unfavourable using the functions I mentioned, so I think I need to do more. I think the length of the values is skewing the comparison.
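
One way to reduce the length skew, sketched under assumptions rather than offered as a definitive fix: compare the input with each word of a candidate, normalise the Levenshtein distance by the longer string, and keep the best word score. The names normalisedLevenshtein, bestWordScore and rankCandidates are hypothetical, and the tie-break on length difference is an added assumption.

```php
<?php
// Hypothetical helper: similarity in [0, 1] from Levenshtein distance,
// normalised by the longer string so scores are comparable across lengths.
function normalisedLevenshtein(string $a, string $b): float
{
    $maxLen = max(strlen($a), strlen($b));
    if ($maxLen === 0) {
        return 1.0; // two empty strings are identical
    }
    return 1.0 - levenshtein($a, $b) / $maxLen;
}

// Hypothetical scorer: compare the input with each word of the candidate and
// keep the best word score, so extra words do not drag the score down.
function bestWordScore(string $input, string $candidate): float
{
    $best = 0.0;
    $words = preg_split('/\s+/', strtolower(trim($candidate)));
    foreach ($words as $word) {
        $best = max($best, normalisedLevenshtein(strtolower($input), $word));
    }
    return $best;
}

// Sort candidates by best word score, highest first.
function rankCandidates(string $input, array $candidates): array
{
    usort($candidates, function (string $x, string $y) use ($input): int {
        $byScore = bestWordScore($input, $y) <=> bestWordScore($input, $x);
        if ($byScore !== 0) {
            return $byScore;
        }
        // Assumed tie-break: prefer the candidate whose length is closer to the input.
        return abs(strlen($x) - strlen($input)) <=> abs(strlen($y) - strlen($input));
    });
    return $candidates;
}

print_r(rankCandidates("Denim", ["blenin", "zenith", "denim trousers", "zenim", "denim belt", "denim"]));
```

For the Denim example this puts denim, denim belt, denim trousers and zenim in the first four places. It treats the input as a single word, so a multi-word input would need to be split and scored word by word as well.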

thanks in advance

R
Scott Fell

I am also thinking this may be best handled in the database, such as MySQL or SQL Server.
ASKER
ROM

The environment is MySQL and PHP.

I am trying to see if soundex and a difference might be better.

However, I have not got there yet :)
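
A rough sketch of what a soundex-plus-difference check could look like in PHP. It uses the built-in soundex() and counts how many of the four code positions agree, giving 0 to 4. The helper name soundexDifference is hypothetical, and the position-by-position count is an assumption loosely inspired by SQL Server's DIFFERENCE(), not a reimplementation of it.

```php
<?php
// Hypothetical helper: count how many of the four Soundex positions agree (0 to 4).
function soundexDifference(string $a, string $b): int
{
    $codeA = soundex($a);
    $codeB = soundex($b);
    $matches = 0;
    for ($i = 0; $i < 4; $i++) {
        // isset() guards against an empty code when a string has no letters.
        if (isset($codeA[$i], $codeB[$i]) && $codeA[$i] === $codeB[$i]) {
            $matches++;
        }
    }
    return $matches;
}

foreach (["Tin", "Thin", "Tin Hat"] as $candidate) {
    echo $candidate . " => " . soundex($candidate)
        . ", difference: " . soundexDifference("Tin", $candidate) . "\n";
}
```

Tin and Thin produce the same Soundex code, so this measure cannot rank Tin above Thin on its own; it is probably more useful as a coarse filter or tie-breaker than as the main score.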

Thanks in advance

R
Scott Fell

I tried playing around with https://www.php.net/manual/en/function.metaphone.php and that didn't get much better.

To keep it from getting complicated, I think you have to narrow down the search in your code.

If you have below:

$myinput = "Jelly";


$strings = [
    "nelly", "jellies", "jelly", "nelly", "kelly", "JELLY", "jilly", "jel", "jell"
];



You could first check whether anything shares the first several characters. As you are finding out, these algorithms treat kelly and jelly as very similar.
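
A minimal sketch of that first-characters check, assuming a case-insensitive comparison and a prefix length of 3. Both the helper name filterByPrefix and the default length are hypothetical choices.

```php
<?php
// Hypothetical helper: keep only candidates sharing the first $length characters
// with the input, compared case-insensitively.
function filterByPrefix(string $input, array $candidates, int $length = 3): array
{
    $matches = array_filter($candidates, function (string $candidate) use ($input, $length): bool {
        return strlen($candidate) >= $length
            && strncasecmp($input, $candidate, $length) === 0;
    });
    return array_values($matches); // reindex from 0
}

$myinput = "Jelly";
$strings = ["nelly", "jellies", "jelly", "nelly", "kelly", "JELLY", "jilly", "jel", "jell"];
print_r(filterByPrefix($myinput, $strings));
```

With a prefix of 3 this keeps jellies, jelly, JELLY, jel and jell, and drops nelly, kelly and jilly before any similarity scoring is run on what remains.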

What is the actual use case?
ASKER
ROM

Hi Scott,

Yes, I tried Metaphone; it was no good.
The use case is that several files get imported every single day, and they provide key information for this organisation.

The data is NOT accessible digitally apart from text files.

I import them, and then I need to marry up several fields of data and make database records out of them for a larger system that supports daily operations.

So for a location I need to match against, for example: Warehouse, Recreational, Smith's Flat, Smith's Supermarket, The Smith's yard, Ned's Yard.

The input might be: "Smith's apartment". So nothing will match 100% on this occasion.
So I need to decide on a likeness threshold and put the list of options in order for a data clerk to review and decide on as part of the ongoing import task.

So, for example, if the match was 95%, I would set that in the database record, as I am fairly certain, and the data clerk would validate it. If I only get 94%, 92%, 89%, 77%, I want to present these as options within the application so the data clerk can quickly select one without having to start searching.
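
The threshold logic described above can be sketched roughly as follows. The constant AUTO_MATCH_THRESHOLD and the helper proposeMatches are hypothetical names, similar_text() is only a placeholder scorer, and array_key_first() assumes PHP 7.3 or later.

```php
<?php
// Hypothetical threshold: at or above this percentage the match is pre-selected.
const AUTO_MATCH_THRESHOLD = 95.0;

// Hypothetical helper: score every location and decide whether to pre-select one.
function proposeMatches(string $input, array $locations): array
{
    $scores = [];
    foreach ($locations as $location) {
        // similar_text() is a placeholder; any scorer returning a percentage would fit.
        similar_text(strtolower($input), strtolower($location), $percent);
        $scores[$location] = $percent;
    }
    arsort($scores); // best score first

    $best = array_key_first($scores);
    $autoMatch = ($best !== null && $scores[$best] >= AUTO_MATCH_THRESHOLD) ? $best : null;

    // 'options' is always returned so the data clerk can review or override.
    return ['auto_match' => $autoMatch, 'options' => $scores];
}

$locations = ["Warehouse", "Recreational", "Smith's Flat", "Smith's Supermarket", "The Smith's yard", "Ned's Yard"];
$result = proposeMatches("Smith's apartment", $locations);

if ($result['auto_match'] !== null) {
    echo "Pre-selected: " . $result['auto_match'] . "\n";
} else {
    foreach ($result['options'] as $location => $percent) {
        echo $location . ": " . round($percent) . "%\n";
    }
}
```

For "Smith's apartment" nothing reaches the threshold, so the sorted options are listed for the clerk rather than a value being pre-selected.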

Many thanks in advance.

R


ASKER
ROM

I think I will make my own engine by stripping the punctuation and exploding the search location into an array.

Then I will iterate through each location, strip its punctuation and explode the location name into an array.

Then I will do an in_array on each element, cycling through the words included in the location, and rank the locations by number of matches, whilst also maintaining an array of excluded words like THE etc.

This will at least get me matches when Office three and office kitchen are compared. They will at least get a score of 1.
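
The steps above could be sketched like this. It is an illustration under assumptions, not the engine that was actually built: the names EXCLUDED_WORDS, tokenize, wordMatchScore and rankLocations are hypothetical, and the excluded-word list is a made-up starting point.

```php
<?php
// Hypothetical excluded-word list; extend as needed.
const EXCLUDED_WORDS = ['the', 'a', 'an', 'of'];

// Hypothetical helper: lower-case, strip punctuation, split into words,
// then drop the excluded words.
function tokenize(string $text): array
{
    $clean = preg_replace('/[^a-z0-9\s]/', '', strtolower($text));
    $words = preg_split('/\s+/', trim($clean), -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_diff($words, EXCLUDED_WORDS));
}

// Count how many distinct input words also appear in the location name.
function wordMatchScore(string $input, string $location): int
{
    $locationWords = tokenize($location);
    $score = 0;
    foreach (array_unique(tokenize($input)) as $word) {
        if (in_array($word, $locationWords, true)) {
            $score++;
        }
    }
    return $score;
}

// Score every location and sort with the most shared words first.
function rankLocations(string $input, array $locations): array
{
    $scores = [];
    foreach ($locations as $location) {
        $scores[$location] = wordMatchScore($input, $location);
    }
    arsort($scores);
    return $scores;
}

print_r(rankLocations("Smith's apartment", ["Warehouse", "Smith's Flat", "The Smith's yard", "Ned's Yard"]));
```

Here "Office three" against "office kitchen" scores 1, and "Smith's apartment" scores 1 against both Smith's Flat and The Smith's yard but 0 against Ned's Yard. Whole-word matching does not catch misspellings, so a fuzzy per-word comparison might still be needed on top.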

Probably the best way, unless anyone has anything else?

Many thanks in advance

R
ASKER
ROM

Maybe also rate on the number of characters for the second and third word, and maybe a soundex after that. Who knows.

R
Scott Fell

Romolo, keep this open for now. I would ask this question using either the MySQL or SQL Server topic, depending on which database you are using. There may be a more elegant solution using the database.
ASKER
ROM

Hi Scott,
I did look into it as I planned, and it is still not brilliant. Just like doing similar_text with a percentage.

So I have embarked upon my own engine. It is working quite well so far.
Thanks

R
ASKER CERTIFIED SOLUTION
ROM
