Link to home
Start Free TrialLog in
Avatar of alongoldshuv
alongoldshuv

asked on

In need for a Regex

Hi,

This may be easy for somebody familiar with regex's, but I am not and it's hard for me.
I need a regex that will match 1 or more words in a string starting from the beginning of each word.
Words are separated with a space char.

For example, for this:

"one red apple"

the following should be a match since all these substrings exist in the source and they all start from the beginning of words:

one
app
red one
one red ap
re
apple red o
apple red on

the following should NOT be a match since one of more words don't start from the first character of a word or simply doesn't exist in the string:

ppl
one ed
blah
on red
one red orange

help is very appreciated!
Thanks.
Avatar of ozo
ozo
Flag of United States of America image

#!/usr/bin/perl
for( split/^/m,<<END
one
app
red one
one red ap
re
apple red o
apple red on
ppl
one ed
blah
on red
one red orange
END
){
   print "match $_" if /^(\b(o(ne?)?|r(ed?)?|a(p(p(le?)?)?)?)\b\s*)*$/
}
Avatar of alongoldshuv
alongoldshuv

ASKER

ozo,

thanks for the answer, but I am sorry as it seems like I wasn't very clear in my question. These were only examples I gave in order to show how the regex should work. I need it to be generic and to apply the rules for any given string, and therefore will not have actuall alphabetical characters but mostly symbols.
The input strings will be matched with a regex against a database of 1000's of entries...


 If you have 1000's of entries (I "think" that this means "words to be matched"), then I don't believe that regexp are what you need.
If you are trying to find which words from a given input text are in fact "valid", then what I think you need is to consider using something like a DAWG (Directed Acyclic Word Graph).

  Given the database of "words", you would build the DAWG.  Then you would use the DAWG to process each input word in order to find out of a match exists in the DAWG.  One that is available in Python can be found at

http://sourceforge.net/search/?type_of_search=soft&words=WordUtils
thanks. I need to use this functionality in a SQL query though, and therefore I'm looking into doing it with regexp. Not sure that DAWG will be the right answer (or am I wrong?)

I think I'll need to give a better example of what I am trying to achieve. This is really for an autocomplete field in a web app. The autocomplete currently works but the search logic is not as smart as I would like it to be.

Say my database has the following 4 rows of a single column table (it actually has thousands of entries, but for the sake of example...):

john doe
neville smith
jane doe
dean johnson mcdonald

then, for each input there will be a SELECT name FROM table REGEXP <regexp here>; and only when an input string (or strings separated by space) matches any of the DB entries the matches results should return from the query (this is pretty standard in many relational db systems).

and the results should be as I described initially:

input              results (matches)
------             -----------------------
ohn               none. because no word from the 4 entries start with "ohn"
doe               "john doe" and "jane doe"
doe j             "john doe" and "jane doe"
j doe             "john doe" and "jane doe"
john d           "dean johnson mcdonald" and "john doe"      
doe ne          none. because for the entries that has a word starting with "doe" there is no word starting with "ne" (even though "ne" is a part of "jane", it doesn't start with it)

hope this makes it clearer...

thanks!

if the input was always just one word, the regex: '\b<input>' would have done it. However, for 2 words or more it's more complicated as I want them to be unordered...
ASKER CERTIFIED SOLUTION
Avatar of chhokra_expert
chhokra_expert

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
I actually did exactly this yesterday and it worked great. I was hoping to find a way to combine the arguments but I guess doing so and making them unordered is really hard in regexp, so breaking them up with AND clause per word is good enough.

I am using mysql in which '\b' is actaully something like '[:<:]' and '\b' closing word boundary is something like '[:>:]' (too lazy to look it up again now for the exact syntax) :-)

thanks!
good to know you'd already found the answer :) thanks for the points!

anyways, i think doing what the unordered thing in regexp in not difficult. just darn impossible. the problem is that there is no way to specify permutations in a regexp and even if there were, it would give a horrendously long string so that you'd have a very computationally expensive search process. e.g. if you had "meet john doe aeon flux" as your search string, you are looking at 120 orderings (5! = 5 * 4 * 3 * 2 * 1) meaning your search string is about 120 times as long as what you started with (ignoring the regexp punctuation). That would lead to seriously big search / backtracking tables, losing out all efficiency.

cheers!
kg