• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 2136
  • Last Modified:

Similarity search through 10000 italian city names

I have a list of 10000 italian city names.
I would like to "fuzzy search" through them.
What's the most popular or fast algorithm for doing it?
0
lucavilla
Asked:
lucavilla
3 Solutions
 
harfangCommented:
Try the Levenshtein distance. See http://en.wikipedia.org/wiki/Levenshtein_distance
It's especially good for languages like Italian.

(°v°)
0
 
lucavillaAuthor Commented:
I need to contain every search within 2 seconds of CPU time. Wouldn't the Levenshtein algorithm be too slow to be used on 10000 words every time?
0
 
harfangCommented:
For English, you could store simplified pattern strings, removing all the orthographic variants. You would then apply Levenshtein only to those words with the same pattern. That doesn't work too well for Italian, though.

I'm sure there are preprocessing algorithms to reduce the number of strings for which you perform the exact Levenshtein distance, even for Italian, but I don't know them.

You could use the suggestion on the page, namely start your search with strings of about the same length, and expand until you have enough matches (you know the minimal distance of two strings of different lengths to be the difference in lengths).

But I do know that dictionary searches (which is your case) use some form of Levenshtein to show "relevant matches" or "suggested corrections". Multi-word searches like for internet search engines are different, naturally.

(°v°)
0
Industry Leaders: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 
0xC0DEB07Commented:
Hi lucavilla!

I guess what you need is something based on the double metaphone algorithm.
http://en.wikipedia.org/wiki/Double_Metaphone.

If you use PHP you may want to take a look at these functions:
levenshtein(), soundex(), similar_text(), and metaphone().
0
 
ozoCommented:
agrep does a fuzzy search through 10000 names in less than 2 seconds
ftp://ftp.cs.arizona.edu/agrep/
0
 
ozoCommented:
0
 
ozoCommented:
agrep satisfies the requirements, but this may also be of interest  http://www-db.deis.unibo.it/Mtree/
0
 
Computer101Commented:
Forced accept.

Computer101
EE Admin
0

Featured Post

VIDEO: THE CONCERTO CLOUD FOR HEALTHCARE

Modern healthcare requires a modern cloud. View this brief video to understand how the Concerto Cloud for Healthcare can help your organization.

Tackle projects and never again get stuck behind a technical roadblock.
Join Now