[Last Call] Learn how to a build a cloud-first strategyRegister Now

x
  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 943
  • Last Modified:

fuzzy matching an address using mysql's match against (if possible using weights for better results ranking)

Hi ,

I have a myISAM table with FULLTEXT index , trying to do

"SELECT
    lk.id,
    lk.address
FROM
    lk
WHERE MATCH
    lk.address
AGAINST('235 regent street, london w1b 2et');"
I get results but only the ones who got the word "london" inside, or ones who got the word "street" inside. I know that 3 (ft_min_word_len) character words aren't indexed so "235","w1b","2et" are ignored, but
1) what about "regent" ?
2) What is the STANDARD way of doing this?
 fuzzy matching an address.
thanks
0
betaM
Asked:
betaM
  • 5
  • 3
2 Solutions
 
Jeffrey CoachmanCommented:
<What is the STANDARD way of doing this?>
There is no "standard" way of doing this.
There are dozens of "Fuzzy" search algorithms available.

The bottom line is that Fuzzy means just that, ...Fuzzy, ...or not exact.
The tighter (more exact) you make the search, the fewer results you will get, possibly excluding the ones you want.

If you make the search to loose you may start to get too many results, some to far afield from what you wanted.

Your requirement of having this work with multiple words is your main roadblock, as most of these algorithms are designed to search for one word.


One of the most popular Fuzzy Search algorithms is SOUNDEX
http://msdn.microsoft.com/en-us/library/ms187384.aspx
For more info see here:
http://en.wikipedia.org/wiki/Soundex

There are also many sites that publish various SOUNDEX variants, so you can Google form more info

JeffCoachman
0
 
betaMAuthor Commented:
boag2000, thanks for the answer. Ok, so there is no standard and there are dozens of "Fuzzy" search algorithms available. I need a more specific answer.

Given mySQL as your tool-of-the-trade, what is the optimal way of querying from an address column, from your experience?

I know about soundex, but it is irrelevant here, as:
a) it only works with the English language (or is language specific).
b) addresses are built from various properties not sound-alike by their nature, unlike spoken language.

I've googled as much as i could, but couldn't find the solution thats why i came here, to ask the experts :)

 
0
 
Jeffrey CoachmanCommented:
Then please post a sample table with addresses, and then specify the results you want returned
0
Configuration Guide and Best Practices

Read the guide to learn how to orchestrate Data ONTAP, create application-consistent backups and enable fast recovery from NetApp storage snapshots. Version 9.5 also contains performance and scalability enhancements to meet the needs of the largest enterprise environments.

 
betaMAuthor Commented:
Ok, here is a sample 10 rows table name lk.
id | address
==================================================
2, '1600 Amphitheatre Pkwy, Mountain View, CA 94043, USA'
3, 'citadel dr # 343'
4, '105 fulham rd, hammersmith, greater london sw6 1, uk'
5, '107 fulham rd, hammersmith, greater london sw6 1, uk'
19, 'champs elysees dr, bonne terre, mo 63628, usa'
21, 'champs elysee's rd, gould city, mi 49838, usa'
22, 'champs elysees, kenner, la 70065, usa'
25, ' san francisco'
27, 'brooklyn, ny 11210, usa'
34, '1821b atlantic ave, brooklyn, ny 11233, usa'

for '235 regent street, london w1b 2et' i would want the result of 0 rows and not getting row ids 4,5 like now.
0
 
Jeffrey CoachmanCommented:
Again, this may not be possible to do because of:
The multiple "strings" in your text
Some of your stings not being "Words"
Inherent limitations of any fuzzy matching algorithm or function.

If you want to query for "235 regent street, london w1b 2et" on the filed you listed above and get 0 rows, then why not just do this:

SELECT ID, Address
FROM YourTable
WHERE Address='235 regent street, london w1b 2et'

This will return 0 rows.

0
 
betaMAuthor Commented:
This isn't the solution and for getting 0 rows as result i could do simpler queries :)

Thanks anyway, i'll wait for other answer.
0
 
betaMAuthor Commented:
The solution i've figured out, is using:

SELECT lk.id,lk.address
FROM lk
MATCH(lk.address) AGAINST ('+129 +West +81st +street' IN BOOLEAN MODE);

so you need to go over the searched string and plus ("+") in each starting words, and then "send" it to the query.
http://dev.mysql.com/doc/refman/5.5/en/fulltext-boolean.html

Want to see if theres a better solution. BTW , its not like double quoting it.
MATCH(lk.address) AGAINST ('"129 West 81st street"' IN BOOLEAN MODE);
0
 
virmaiorCommented:
You can try:

SELECT lk.id,lk.address, MATCH(lk.address) AGAINST ('+129 +West +81st +street' ) as likelihood
FROM lk
MATCH(lk.address) AGAINST ('+129 +West +81st +street' IN BOOLEAN MODE);

ORDER BY
MATCH(lk.address) AGAINST ('+129 +West +81st +street' ) DESC;

He's right that there are no universal ways of doing this.  There are going to be three approaches:

(1) use LIKE rather than FULLTEXT and use wildcards (slow on large data sets)
(2) make a new field that simplifies the data so that it can be matched against a simplification of the search criteria (i.e. replace all numbers with 8 digit numbers, rewrite w. as west in all cases, etc).
(3) normalize stored data and searched data [a species of the above]
0
 
betaMAuthor Commented:
Hope to hear other ideas..
0

Featured Post

Vote for the Most Valuable Expert

It’s time to recognize experts that go above and beyond with helpful solutions and engagement on site. Choose from the top experts in the Hall of Fame or on the right rail of your favorite topic page. Look for the blue “Nominate” button on their profile to vote.

  • 5
  • 3
Tackle projects and never again get stuck behind a technical roadblock.
Join Now