Solved

In need for a Regex

Posted on 2006-10-23
8
357 Views
Last Modified: 2010-04-16
Hi,

This may be easy for somebody familiar with regex's, but I am not and it's hard for me.
I need a regex that will match 1 or more words in a string starting from the beginning of each word.
Words are separated with a space char.

For example, for this:

"one red apple"

the following should be a match since all these substrings exist in the source and they all start from the beginning of words:

one
app
red one
one red ap
re
apple red o
apple red on

the following should NOT be a match since one of more words don't start from the first character of a word or simply doesn't exist in the string:

ppl
one ed
blah
on red
one red orange

help is very appreciated!
Thanks.
0
Comment
Question by:alongoldshuv
8 Comments
 
LVL 84

Expert Comment

by:ozo
ID: 17787494
#!/usr/bin/perl
for( split/^/m,<<END
one
app
red one
one red ap
re
apple red o
apple red on
ppl
one ed
blah
on red
one red orange
END
){
   print "match $_" if /^(\b(o(ne?)?|r(ed?)?|a(p(p(le?)?)?)?)\b\s*)*$/
}
0
 

Author Comment

by:alongoldshuv
ID: 17789417
ozo,

thanks for the answer, but I am sorry as it seems like I wasn't very clear in my question. These were only examples I gave in order to show how the regex should work. I need it to be generic and to apply the rules for any given string, and therefore will not have actuall alphabetical characters but mostly symbols.
The input strings will be matched with a regex against a database of 1000's of entries...


0
 
LVL 41

Expert Comment

by:HonorGod
ID: 17790648
 If you have 1000's of entries (I "think" that this means "words to be matched"), then I don't believe that regexp are what you need.
If you are trying to find which words from a given input text are in fact "valid", then what I think you need is to consider using something like a DAWG (Directed Acyclic Word Graph).

  Given the database of "words", you would build the DAWG.  Then you would use the DAWG to process each input word in order to find out of a match exists in the DAWG.  One that is available in Python can be found at

http://sourceforge.net/search/?type_of_search=soft&words=WordUtils
0
 

Author Comment

by:alongoldshuv
ID: 17791061
thanks. I need to use this functionality in a SQL query though, and therefore I'm looking into doing it with regexp. Not sure that DAWG will be the right answer (or am I wrong?)

I think I'll need to give a better example of what I am trying to achieve. This is really for an autocomplete field in a web app. The autocomplete currently works but the search logic is not as smart as I would like it to be.

Say my database has the following 4 rows of a single column table (it actually has thousands of entries, but for the sake of example...):

john doe
neville smith
jane doe
dean johnson mcdonald

then, for each input there will be a SELECT name FROM table REGEXP <regexp here>; and only when an input string (or strings separated by space) matches any of the DB entries the matches results should return from the query (this is pretty standard in many relational db systems).

and the results should be as I described initially:

input              results (matches)
------             -----------------------
ohn               none. because no word from the 4 entries start with "ohn"
doe               "john doe" and "jane doe"
doe j             "john doe" and "jane doe"
j doe             "john doe" and "jane doe"
john d           "dean johnson mcdonald" and "john doe"      
doe ne          none. because for the entries that has a word starting with "doe" there is no word starting with "ne" (even though "ne" is a part of "jane", it doesn't start with it)

hope this makes it clearer...

thanks!

0
Highfive + Dolby Voice = No More Audio Complaints!

Poor audio quality is one of the top reasons people don’t use video conferencing. Get the crispest, clearest audio powered by Dolby Voice in every meeting. Highfive and Dolby Voice deliver the best video conferencing and audio experience for every meeting and every room.

 

Author Comment

by:alongoldshuv
ID: 17796651
if the input was always just one word, the regex: '\b<input>' would have done it. However, for 2 words or more it's more complicated as I want them to be unordered...
0
 
LVL 4

Accepted Solution

by:
chhokra_expert earned 250 total points
ID: 17800389
now this may not be exactly what you want or had in mind, but like you said, splitting this problem to single words really makes this simple. so in the spirit of "divide and conquer", can you break your search terms? e.g. if you had to search for "john d", could you first searc for all matches of \b"john" and then search those results for \b"d"?

i don't know sql or the driving language you are using, but breaking your search stirng would really make this simple for you. you could search for a leading space (\b) in your search terms to find the diff words. this way length of your search term wouldn't be important and you will work incrementally less with each term (as you search only on the results).

I am assuming conjunction in your search terms becaue "john d" didn't match "jane doe"

alternatively, maybe change your query to use conjunction for the terms or something e.g. "john d" would be broekn to search for (\b"john") AND (\b"d") and so on...
0
 

Author Comment

by:alongoldshuv
ID: 17800899
I actually did exactly this yesterday and it worked great. I was hoping to find a way to combine the arguments but I guess doing so and making them unordered is really hard in regexp, so breaking them up with AND clause per word is good enough.

I am using mysql in which '\b' is actaully something like '[:<:]' and '\b' closing word boundary is something like '[:>:]' (too lazy to look it up again now for the exact syntax) :-)

thanks!
0
 
LVL 4

Expert Comment

by:chhokra_expert
ID: 17804406
good to know you'd already found the answer :) thanks for the points!

anyways, i think doing what the unordered thing in regexp in not difficult. just darn impossible. the problem is that there is no way to specify permutations in a regexp and even if there were, it would give a horrendously long string so that you'd have a very computationally expensive search process. e.g. if you had "meet john doe aeon flux" as your search string, you are looking at 120 orderings (5! = 5 * 4 * 3 * 2 * 1) meaning your search string is about 120 times as long as what you started with (ignoring the regexp punctuation). That would lead to seriously big search / backtracking tables, losing out all efficiency.

cheers!
kg
0

Featured Post

What Security Threats Are You Missing?

Enhance your security with threat intelligence from the web. Get trending threat insights on hackers, exploits, and suspicious IP addresses delivered to your inbox with our free Cyber Daily.

Join & Write a Comment

Suggested Solutions

Entering a date in Microsoft Access can be tricky. A typo can cause month and day to be shuffled, entering the day only causes an error, as does entering, say, day 31 in June. This article shows how an inputmask supported by code can help the user a…
In this post we will learn how to connect and configure Android Device (Smartphone etc.) with Android Studio. After that we will run a simple Hello World Program.
Learn how to match and substitute tagged data using PHP regular expressions. Demonstrated on Windows 7, but also applies to other operating systems. Demonstrated technique applies to PHP (all versions) and Firefox, but very similar techniques will w…
Viewers will learn how to properly install Eclipse with the necessary JDK, and will take a look at an introductory Java program. Download Eclipse installation zip file: Extract files from zip file: Download and install JDK 8: Open Eclipse and …

747 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

12 Experts available now in Live!

Get 1:1 Help Now