Solved

MySQL similar strings SELECT

Posted on 2011-02-11
6
984 Views
Last Modified: 2012-05-11
Hello all,

I would like to find all similar strings on a table column given an input. For instance, say that I have the following strings:

cabernet sauvingon
cabermet sauvingon
caberet sauvingon
cabernet franc

In this case, if I search for Cabernet Sauvignon it should return all but cabernet franc which is not the same.

One way to achieve that is through character level biagrams. I wonder if there will be a function on MYSQL (not LIKE) in which we could do that.

A table would be:

wine { Description, quantity}

The wine table description column is what I am interested on. If I have those values and I need to do a select to everything that looks like a Cabernet Sauvignon despite the typos that might be in there, how can I do it?

The same goes for words like: Description and Desc that are similiar but the second is abbreviated compared to the first one.

This article here somewhat describes the issue http://www.nearinfinity.com/blogs/page/seths?entry=finding_similar_strings_using_character

Basically I would like to match similar strings. The strings I am passing are not phrases, are smaller than that. Basically like the wine scope.

Would a stored procedure be necessary for a problem like this?

Would a regex search do the trick? I somewhat doubt that.

0
Comment
Question by:CarlosScheidecker
  • 3
  • 2
6 Comments
 
LVL 20

Expert Comment

by:virmaior
ID: 34878814
you could probably accomplish many of these comparison with RLIKE (regular expressions), but I think there's a much faster way.

You should build a common "synonym" table that indexes things like cabermet, cabernet, caberet all to the same term and make sure your own data is normalized.

This way when you get a raw search you do a replace to the right terms when possible.

Another component could be using the SOUNDEX function which reduces a word into a number that reflects its expected pronunciation.  These functions are imprecise.
0
 
LVL 1

Author Comment

by:CarlosScheidecker
ID: 34881501
It seems to me that Bigrams are the best way to go.
0
 
LVL 20

Expert Comment

by:virmaior
ID: 35012740
Carlos did not address any of the solutions I suggested.
0
 
LVL 1

Accepted Solution

by:
CarlosScheidecker earned 500 total points
ID: 35013142
yes I did. I have implemented Bigrams and that was the best way to go.


public class BigramGenerator {

	public static List<char[]> bigram(String input) {
		ArrayList<char[]> bigram = new ArrayList<char[]>();
		for (int i = 0; i < input.length() - 1; i++) {
			char[] chars = new char[2];
			chars[0] = input.charAt(i);
			chars[1] = input.charAt(i + 1);
			bigram.add(chars);
		}
		return bigram;
	}

	public static double dice(List<char[]> bigram1, List<char[]> bigram2) {
		List<char[]> copy = new ArrayList<char[]>(bigram2);
		int matches = 0;
		for (int i = bigram1.size(); --i >= 0;) {
			char[] bigram = bigram1.get(i);
			for (int j = copy.size(); --j >= 0;) {
				char[] toMatch = copy.get(j);
				if (bigram[0] == toMatch[0] && bigram[1] == toMatch[1]) {
					copy.remove(j);
					matches += 2;
					break;
				}
			}
		}
		return (double) matches / (bigram1.size() + bigram2.size());
	}

	public static double compareString(String s1, String s2, boolean ignoreCase) {
		if (ignoreCase) {
			s1 = s1.toLowerCase();
			s2 = s2.toLowerCase();
		}
		return dice(bigram(s1), bigram(s2));
	}

Open in new window

0
 
LVL 20

Expert Comment

by:virmaior
ID: 35013207
still not dealt with.
0

Featured Post

PRTG Network Monitor: Intuitive Network Monitoring

Network Monitoring is essential to ensure that computer systems and network devices are running. Use PRTG to monitor LANs, servers, websites, applications and devices, bandwidth, virtual environments, remote systems, IoT, and many more. PRTG is easy to set up & use.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Fore-Foreword Today (2016) Maxmind has a new approach to the distribution of its data sets.  This article may be obsolete.  Instead of using the examples here, have a look at the MaxMind API (https://www.maxmind.com/en/geolite2-developer-package). …
Introduction This article is intended for those who are new to PHP error handling (https://www.experts-exchange.com/articles/11769/And-by-the-way-I-am-New-to-PHP.html).  It addresses one of the most common problems that plague beginning PHP develop…
This Micro Tutorial will give you a basic overview how to record your screen with Microsoft Expression Encoder. This program is still free and open for the public to download. This will be demonstrated using Microsoft Expression Encoder 4.
With the power of JIRA, there's an unlimited number of ways you can customize it, use it and benefit from it. With that in mind, there's bound to be things that I wasn't able to cover in this course. With this summary we'll look at some places to go…

896 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

15 Experts available now in Live!

Get 1:1 Help Now