Solved

MySQL similar strings SELECT

Posted on 2011-02-11
6
995 Views
Last Modified: 2012-05-11
Hello all,

I would like to find all similar strings on a table column given an input. For instance, say that I have the following strings:

cabernet sauvingon
cabermet sauvingon
caberet sauvingon
cabernet franc

In this case, if I search for Cabernet Sauvignon it should return all but cabernet franc which is not the same.

One way to achieve that is through character level biagrams. I wonder if there will be a function on MYSQL (not LIKE) in which we could do that.

A table would be:

wine { Description, quantity}

The wine table description column is what I am interested on. If I have those values and I need to do a select to everything that looks like a Cabernet Sauvignon despite the typos that might be in there, how can I do it?

The same goes for words like: Description and Desc that are similiar but the second is abbreviated compared to the first one.

This article here somewhat describes the issue http://www.nearinfinity.com/blogs/page/seths?entry=finding_similar_strings_using_character

Basically I would like to match similar strings. The strings I am passing are not phrases, are smaller than that. Basically like the wine scope.

Would a stored procedure be necessary for a problem like this?

Would a regex search do the trick? I somewhat doubt that.

0
Comment
Question by:CarlosScheidecker
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 3
  • 2
6 Comments
 
LVL 20

Expert Comment

by:virmaior
ID: 34878814
you could probably accomplish many of these comparison with RLIKE (regular expressions), but I think there's a much faster way.

You should build a common "synonym" table that indexes things like cabermet, cabernet, caberet all to the same term and make sure your own data is normalized.

This way when you get a raw search you do a replace to the right terms when possible.

Another component could be using the SOUNDEX function which reduces a word into a number that reflects its expected pronunciation.  These functions are imprecise.
0
 
LVL 1

Author Comment

by:CarlosScheidecker
ID: 34881501
It seems to me that Bigrams are the best way to go.
0
 
LVL 20

Expert Comment

by:virmaior
ID: 35012740
Carlos did not address any of the solutions I suggested.
0
 
LVL 1

Accepted Solution

by:
CarlosScheidecker earned 500 total points
ID: 35013142
yes I did. I have implemented Bigrams and that was the best way to go.


public class BigramGenerator {

	public static List<char[]> bigram(String input) {
		ArrayList<char[]> bigram = new ArrayList<char[]>();
		for (int i = 0; i < input.length() - 1; i++) {
			char[] chars = new char[2];
			chars[0] = input.charAt(i);
			chars[1] = input.charAt(i + 1);
			bigram.add(chars);
		}
		return bigram;
	}

	public static double dice(List<char[]> bigram1, List<char[]> bigram2) {
		List<char[]> copy = new ArrayList<char[]>(bigram2);
		int matches = 0;
		for (int i = bigram1.size(); --i >= 0;) {
			char[] bigram = bigram1.get(i);
			for (int j = copy.size(); --j >= 0;) {
				char[] toMatch = copy.get(j);
				if (bigram[0] == toMatch[0] && bigram[1] == toMatch[1]) {
					copy.remove(j);
					matches += 2;
					break;
				}
			}
		}
		return (double) matches / (bigram1.size() + bigram2.size());
	}

	public static double compareString(String s1, String s2, boolean ignoreCase) {
		if (ignoreCase) {
			s1 = s1.toLowerCase();
			s2 = s2.toLowerCase();
		}
		return dice(bigram(s1), bigram(s2));
	}

Open in new window

0
 
LVL 20

Expert Comment

by:virmaior
ID: 35013207
still not dealt with.
0

Featured Post

Free Tool: ZipGrep

ZipGrep is a utility that can list and search zip (.war, .ear, .jar, etc) archives for text patterns, without the need to extract the archive's contents.

One of a set of tools we're offering as a way to say thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Title # Comments Views Activity
Excel - SQL export question 3 53
mysql db 3 82
MySQL recovery 7 30
How to select record with In and condition together based on more then one coulmn 7 37
More Fun with XML and MySQL – Parsing Delimited String with a Single SQL Statement Are you ready for another of my SQL tidbits?  Hopefully so, as in this adventure, I will be covering a topic that comes up a lot which is parsing a comma (or other…
This article shows the steps required to install WordPress on Azure. Web Apps, Mobile Apps, API Apps, or Functions, in Azure all these run in an App Service plan. WordPress is no exception and requires an App Service Plan and Database to install
Although Jacob Bernoulli (1654-1705) has been credited as the creator of "Binomial Distribution Table", Gottfried Leibniz (1646-1716) did his dissertation on the subject in 1666; Leibniz you may recall is the co-inventor of "Calculus" and beat Isaac…
The Email Laundry PDF encryption service allows companies to send confidential encrypted  emails to anybody. The PDF document can also contain attachments that are embedded in the encrypted PDF. The password is randomly generated by The Email Laundr…

749 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question