Solved

MySQL similar strings SELECT

Posted on 2011-02-11
6
978 Views
Last Modified: 2012-05-11
Hello all,

I would like to find all similar strings on a table column given an input. For instance, say that I have the following strings:

cabernet sauvingon
cabermet sauvingon
caberet sauvingon
cabernet franc

In this case, if I search for Cabernet Sauvignon it should return all but cabernet franc which is not the same.

One way to achieve that is through character level biagrams. I wonder if there will be a function on MYSQL (not LIKE) in which we could do that.

A table would be:

wine { Description, quantity}

The wine table description column is what I am interested on. If I have those values and I need to do a select to everything that looks like a Cabernet Sauvignon despite the typos that might be in there, how can I do it?

The same goes for words like: Description and Desc that are similiar but the second is abbreviated compared to the first one.

This article here somewhat describes the issue http://www.nearinfinity.com/blogs/page/seths?entry=finding_similar_strings_using_character

Basically I would like to match similar strings. The strings I am passing are not phrases, are smaller than that. Basically like the wine scope.

Would a stored procedure be necessary for a problem like this?

Would a regex search do the trick? I somewhat doubt that.

0
Comment
Question by:CarlosScheidecker
  • 3
  • 2
6 Comments
 
LVL 20

Expert Comment

by:virmaior
ID: 34878814
you could probably accomplish many of these comparison with RLIKE (regular expressions), but I think there's a much faster way.

You should build a common "synonym" table that indexes things like cabermet, cabernet, caberet all to the same term and make sure your own data is normalized.

This way when you get a raw search you do a replace to the right terms when possible.

Another component could be using the SOUNDEX function which reduces a word into a number that reflects its expected pronunciation.  These functions are imprecise.
0
 
LVL 1

Author Comment

by:CarlosScheidecker
ID: 34881501
It seems to me that Bigrams are the best way to go.
0
 
LVL 20

Expert Comment

by:virmaior
ID: 35012740
Carlos did not address any of the solutions I suggested.
0
 
LVL 1

Accepted Solution

by:
CarlosScheidecker earned 500 total points
ID: 35013142
yes I did. I have implemented Bigrams and that was the best way to go.


public class BigramGenerator {

	public static List<char[]> bigram(String input) {
		ArrayList<char[]> bigram = new ArrayList<char[]>();
		for (int i = 0; i < input.length() - 1; i++) {
			char[] chars = new char[2];
			chars[0] = input.charAt(i);
			chars[1] = input.charAt(i + 1);
			bigram.add(chars);
		}
		return bigram;
	}

	public static double dice(List<char[]> bigram1, List<char[]> bigram2) {
		List<char[]> copy = new ArrayList<char[]>(bigram2);
		int matches = 0;
		for (int i = bigram1.size(); --i >= 0;) {
			char[] bigram = bigram1.get(i);
			for (int j = copy.size(); --j >= 0;) {
				char[] toMatch = copy.get(j);
				if (bigram[0] == toMatch[0] && bigram[1] == toMatch[1]) {
					copy.remove(j);
					matches += 2;
					break;
				}
			}
		}
		return (double) matches / (bigram1.size() + bigram2.size());
	}

	public static double compareString(String s1, String s2, boolean ignoreCase) {
		if (ignoreCase) {
			s1 = s1.toLowerCase();
			s2 = s2.toLowerCase();
		}
		return dice(bigram(s1), bigram(s2));
	}

Open in new window

0
 
LVL 20

Expert Comment

by:virmaior
ID: 35013207
still not dealt with.
0

Featured Post

Free Trending Threat Insights Every Day

Enhance your security with threat intelligence from the web. Get trending threat insights on hackers, exploits, and suspicious IP addresses delivered to your inbox with our free Cyber Daily.

Join & Write a Comment

Introduction In this article, I will by showing a nice little trick for MySQL similar to that of my previous EE Article for SQLite (http://www.sqlite.org/), A SQLite Tidbit: Quick Numbers Table Generation (http://www.experts-exchange.com/A_3570.htm…
I use MySQL for many of my development projects in a Windows environment. To manage my databases (and perform queries) for years I used a tool called MySQL administrator.  This tool has since been replaced by MySQL Workbench. So I decided to m…
Excel styles will make formatting consistent and let you apply and change formatting faster. In this tutorial, you'll learn how to use Excel's built-in styles, how to modify styles, and how to create your own. You'll also learn how to use your custo…
In this seventh video of the Xpdf series, we discuss and demonstrate the PDFfonts utility, which lists all the fonts used in a PDF file. It does this via a command line interface, making it suitable for use in programs, scripts, batch files — any pl…

757 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

15 Experts available now in Live!

Get 1:1 Help Now