• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 1024
  • Last Modified:

fuzzy matching an address using mysql's match against (if possible using weights for better results ranking)

Hi ,

I have a myISAM table with FULLTEXT index , trying to do

"SELECT
    lk.id,
    lk.address
FROM
    lk
WHERE MATCH
    lk.address
AGAINST('235 regent street, london w1b 2et');"
I get results but only the ones who got the word "london" inside, or ones who got the word "street" inside. I know that 3 (ft_min_word_len) character words aren't indexed so "235","w1b","2et" are ignored, but
1) what about "regent" ?
2) What is the STANDARD way of doing this?
 fuzzy matching an address.
thanks
0
betaM
Asked:
betaM
  • 5
  • 3
2 Solutions
 
Jeffrey CoachmanMIS LiasonCommented:
<What is the STANDARD way of doing this?>
There is no "standard" way of doing this.
There are dozens of "Fuzzy" search algorithms available.

The bottom line is that Fuzzy means just that, ...Fuzzy, ...or not exact.
The tighter (more exact) you make the search, the fewer results you will get, possibly excluding the ones you want.

If you make the search to loose you may start to get too many results, some to far afield from what you wanted.

Your requirement of having this work with multiple words is your main roadblock, as most of these algorithms are designed to search for one word.


One of the most popular Fuzzy Search algorithms is SOUNDEX
http://msdn.microsoft.com/en-us/library/ms187384.aspx
For more info see here:
http://en.wikipedia.org/wiki/Soundex

There are also many sites that publish various SOUNDEX variants, so you can Google form more info

JeffCoachman
0
 
betaMAuthor Commented:
boag2000, thanks for the answer. Ok, so there is no standard and there are dozens of "Fuzzy" search algorithms available. I need a more specific answer.

Given mySQL as your tool-of-the-trade, what is the optimal way of querying from an address column, from your experience?

I know about soundex, but it is irrelevant here, as:
a) it only works with the English language (or is language specific).
b) addresses are built from various properties not sound-alike by their nature, unlike spoken language.

I've googled as much as i could, but couldn't find the solution thats why i came here, to ask the experts :)

 
0
 
Jeffrey CoachmanMIS LiasonCommented:
Then please post a sample table with addresses, and then specify the results you want returned
0
Network Scalability - Handle Complex Environments

Monitor your entire network from a single platform. Free 30 Day Trial Now!

 
betaMAuthor Commented:
Ok, here is a sample 10 rows table name lk.
id | address
==================================================
2, '1600 Amphitheatre Pkwy, Mountain View, CA 94043, USA'
3, 'citadel dr # 343'
4, '105 fulham rd, hammersmith, greater london sw6 1, uk'
5, '107 fulham rd, hammersmith, greater london sw6 1, uk'
19, 'champs elysees dr, bonne terre, mo 63628, usa'
21, 'champs elysee's rd, gould city, mi 49838, usa'
22, 'champs elysees, kenner, la 70065, usa'
25, ' san francisco'
27, 'brooklyn, ny 11210, usa'
34, '1821b atlantic ave, brooklyn, ny 11233, usa'

for '235 regent street, london w1b 2et' i would want the result of 0 rows and not getting row ids 4,5 like now.
0
 
Jeffrey CoachmanMIS LiasonCommented:
Again, this may not be possible to do because of:
The multiple "strings" in your text
Some of your stings not being "Words"
Inherent limitations of any fuzzy matching algorithm or function.

If you want to query for "235 regent street, london w1b 2et" on the filed you listed above and get 0 rows, then why not just do this:

SELECT ID, Address
FROM YourTable
WHERE Address='235 regent street, london w1b 2et'

This will return 0 rows.

0
 
betaMAuthor Commented:
This isn't the solution and for getting 0 rows as result i could do simpler queries :)

Thanks anyway, i'll wait for other answer.
0
 
betaMAuthor Commented:
The solution i've figured out, is using:

SELECT lk.id,lk.address
FROM lk
MATCH(lk.address) AGAINST ('+129 +West +81st +street' IN BOOLEAN MODE);

so you need to go over the searched string and plus ("+") in each starting words, and then "send" it to the query.
http://dev.mysql.com/doc/refman/5.5/en/fulltext-boolean.html

Want to see if theres a better solution. BTW , its not like double quoting it.
MATCH(lk.address) AGAINST ('"129 West 81st street"' IN BOOLEAN MODE);
0
 
virmaiorCommented:
You can try:

SELECT lk.id,lk.address, MATCH(lk.address) AGAINST ('+129 +West +81st +street' ) as likelihood
FROM lk
MATCH(lk.address) AGAINST ('+129 +West +81st +street' IN BOOLEAN MODE);

ORDER BY
MATCH(lk.address) AGAINST ('+129 +West +81st +street' ) DESC;

He's right that there are no universal ways of doing this.  There are going to be three approaches:

(1) use LIKE rather than FULLTEXT and use wildcards (slow on large data sets)
(2) make a new field that simplifies the data so that it can be matched against a simplification of the search criteria (i.e. replace all numbers with 8 digit numbers, rewrite w. as west in all cases, etc).
(3) normalize stored data and searched data [a species of the above]
0
 
betaMAuthor Commented:
Hope to hear other ideas..
0
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

Join & Write a Comment

Featured Post

Free Tool: Path Explorer

An intuitive utility to help find the CSS path to UI elements on a webpage. These paths are used frequently in a variety of front-end development and QA automation tasks.

One of a set of tools we're offering as a way of saying thank you for being a part of the community.

  • 5
  • 3
Tackle projects and never again get stuck behind a technical roadblock.
Join Now