Solved

MySQL similar strings SELECT

Posted on 2011-02-11
6
989 Views
Last Modified: 2012-05-11
Hello all,

I would like to find all similar strings on a table column given an input. For instance, say that I have the following strings:

cabernet sauvingon
cabermet sauvingon
caberet sauvingon
cabernet franc

In this case, if I search for Cabernet Sauvignon it should return all but cabernet franc which is not the same.

One way to achieve that is through character level biagrams. I wonder if there will be a function on MYSQL (not LIKE) in which we could do that.

A table would be:

wine { Description, quantity}

The wine table description column is what I am interested on. If I have those values and I need to do a select to everything that looks like a Cabernet Sauvignon despite the typos that might be in there, how can I do it?

The same goes for words like: Description and Desc that are similiar but the second is abbreviated compared to the first one.

This article here somewhat describes the issue http://www.nearinfinity.com/blogs/page/seths?entry=finding_similar_strings_using_character

Basically I would like to match similar strings. The strings I am passing are not phrases, are smaller than that. Basically like the wine scope.

Would a stored procedure be necessary for a problem like this?

Would a regex search do the trick? I somewhat doubt that.

0
Comment
Question by:CarlosScheidecker
  • 3
  • 2
6 Comments
 
LVL 20

Expert Comment

by:virmaior
ID: 34878814
you could probably accomplish many of these comparison with RLIKE (regular expressions), but I think there's a much faster way.

You should build a common "synonym" table that indexes things like cabermet, cabernet, caberet all to the same term and make sure your own data is normalized.

This way when you get a raw search you do a replace to the right terms when possible.

Another component could be using the SOUNDEX function which reduces a word into a number that reflects its expected pronunciation.  These functions are imprecise.
0
 
LVL 1

Author Comment

by:CarlosScheidecker
ID: 34881501
It seems to me that Bigrams are the best way to go.
0
 
LVL 20

Expert Comment

by:virmaior
ID: 35012740
Carlos did not address any of the solutions I suggested.
0
 
LVL 1

Accepted Solution

by:
CarlosScheidecker earned 500 total points
ID: 35013142
yes I did. I have implemented Bigrams and that was the best way to go.


public class BigramGenerator {

	public static List<char[]> bigram(String input) {
		ArrayList<char[]> bigram = new ArrayList<char[]>();
		for (int i = 0; i < input.length() - 1; i++) {
			char[] chars = new char[2];
			chars[0] = input.charAt(i);
			chars[1] = input.charAt(i + 1);
			bigram.add(chars);
		}
		return bigram;
	}

	public static double dice(List<char[]> bigram1, List<char[]> bigram2) {
		List<char[]> copy = new ArrayList<char[]>(bigram2);
		int matches = 0;
		for (int i = bigram1.size(); --i >= 0;) {
			char[] bigram = bigram1.get(i);
			for (int j = copy.size(); --j >= 0;) {
				char[] toMatch = copy.get(j);
				if (bigram[0] == toMatch[0] && bigram[1] == toMatch[1]) {
					copy.remove(j);
					matches += 2;
					break;
				}
			}
		}
		return (double) matches / (bigram1.size() + bigram2.size());
	}

	public static double compareString(String s1, String s2, boolean ignoreCase) {
		if (ignoreCase) {
			s1 = s1.toLowerCase();
			s2 = s2.toLowerCase();
		}
		return dice(bigram(s1), bigram(s2));
	}

Open in new window

0
 
LVL 20

Expert Comment

by:virmaior
ID: 35013207
still not dealt with.
0

Featured Post

Efficient way to get backups off site to Azure

This user guide provides instructions on how to deploy and configure both a StoneFly Scale Out NAS Enterprise Cloud Drive virtual machine and Veeam Cloud Connect in the Microsoft Azure Cloud.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Title # Comments Views Activity
MySQL query to show different levels from a table. 5 68
updating table data with inner join 9 37
mysql between clause 2 24
Create a MySQL table as easily as possible 6 38
Foreword In the years since this article was written, numerous hacking attacks have targeted password-protected web sites.  The storage of client passwords has become a subject of much discussion, some of it useful and some of it misguided.  Of cou…
Introduction In this article, I will by showing a nice little trick for MySQL similar to that of my previous EE Article for SQLite (http://www.sqlite.org/), A SQLite Tidbit: Quick Numbers Table Generation (http://www.experts-exchange.com/A_3570.htm…
Microsoft Active Directory, the widely used IT infrastructure, is known for its high risk of credential theft. The best way to test your Active Directory’s vulnerabilities to pass-the-ticket, pass-the-hash, privilege escalation, and malware attacks …
Finds all prime numbers in a range requested and places them in a public primes() array. I've demostrated a template size of 30 (2 * 3 * 5) but larger templates can be built such 210  (2 * 3 * 5 * 7) or 2310  (2 * 3 * 5 * 7 * 11). The larger templa…

809 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question