Solved

# Regular Expression - Big Array!?!

Posted on 2004-04-26
390 Views
Hi friends

I have a VERY big array and a text file. I've to go thru every line of the text file and see if any ONE of the element of the array is in that. It is ENOUGH if atleast one match is there. Thats it. Now I use the following way:

\$Pattern=join ('|',@BigArray);
while (<IN>) { if (m#\$Pattern#) {print \$&;} }

I also tried this:
\$Pattern=join ('|',@BigArray);
undef(\$/);  \$_=<IN>;
if (m#\$Pattern#) {print \$&}

Both the method consumes some 37 - 45 seconds. Is there any way to optimize this?

Thanx
0
Question by:vi_srikanth
[X]
###### Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

• Help others & share knowledge
• Earn cash & points
• 8
• 4
• 3
• +4

LVL 3

Assisted Solution

PerlKing earned 50 total points
ID: 10925854
Ideally you should be using a HASH like this:
%patterns = map { \$_, 1 } @BigArray;
while (<IN>) { if ( exists \$patterns{\$_} ) {print \$&;} }

But your requirement is that "if you see if any ONE of the element" it it sufficient. So, using a HASH may not be all that beneficial but still this MAY BE BETTER than having a HUGE regular expression like that one that you created.

Did you already try this method?
0

LVL 84

Expert Comment

ID: 10926116
Are all the elements of the form \b\w+\b
Can there be more than one element per line?

Usiing \$& slows down all matching operations, you might try just removing all \$& in your program
0

LVL 84

Expert Comment

ID: 10926168
unless the elements in @BigArray end with \n
you may want to chomp; or map { "\$_\n", 1 } when using PerlKing's method.
0

LVL 4

Author Comment

ID: 10926597
Perl King> Ideally you should be using a HASH
But, even this consumes more time. Is there any other way?

ozo> Are all the elements of the form \b\w+\b
No. Not necessarily. Some elements can be a part of a word.

ozo> Can there be more than one element per line?
There can be. But, I just want to match any one of the element in a line. That is enough. Once any one of the element is found, I will just stop processing that line.

Thanx
0

LVL 4

Author Comment

ID: 10926746
I also tried splitting the whole array in to 20 parts, and then joining them, and then matching them. But, this consumes more time than the previous methods.
0

LVL 18

Assisted Solution

kandura earned 50 total points
ID: 10926789
You might benefit from using a Bloom Filter: http://www.perl.com/pub/a/2004/04/08/bloom_filters.html
Bloom filters are a very efficient way to represent sets, and it is very easy to test whether a given object is a member of that set.
So if all you're looking for is
if(\$value in @array)
then use a Bloom filter.
0

LVL 4

Author Comment

ID: 10926901
I think I haven't made my point clear. I'm looking for a code which will search/match the elements of an array in a file. For eg., if my array has qw(hi bye see) and if my file is:
"hi how r u?
this is just a test
see u"

the first and the second line have "hi" and "see" which are there in my array. I'm looking for a code that will find any one of the element of the array - in the file and print that. If I'm not clear plz tell me.
0

LVL 12

Assisted Solution

stefan73 earned 100 total points
ID: 10926906
Hi kandura,
If one of the requirements is a partial word match, a Bloom filter won't do.

Assuming you don't do anything fancy in your regex ( you just want to see if something matches somewhere), try a suffix tree instead. Once you reach a leaf, you have a match.

See:
http://cs.haifa.ac.il/~shlomo/suffix_tree/

this is C, but there is a Perl module that can use the C code.

http://datacompression.info/SuffixTrees.shtml

Cheers,
Stefan
0

LVL 4

Author Comment

ID: 10926909
Sorry. I typed it wrongly in my previous post. It is not
> the first and the second line have "hi" and "see"
but, it is
the first and the THIRD line have "hi" and "see"
0

LVL 18

Expert Comment

ID: 10927041
Stefan73, thanks for the reference. That definitely seems to be a good way to taclke this problem.
0

LVL 10

Assisted Solution

Mercantilum earned 150 total points
ID: 10927242
Hi vi_srikanth,

Talking about a C/tree program: this is the code I used for fast searching words in a similar tree ; two functions
1. tree_add (w) adds a word w to the tree (no pb with doubles)
2. tree_in (w) search for w and returns -1(not found), 0(incomplete) or 1(found)

The tree_add function uses malloc() to allocate a Ttrees item.
The tree_in function should be fast.
You need #include <stdlib.h>

#define LETTER(a)       (unsigned)((a)-'a')

typedef struct tree {
struct tree *tree[26];
int done;
} Ttrees;

Ttrees tree;
int NBtree=0;

// take a word as input

{
Ttrees *t,*n;
int i,x;

for (t=&tree ; *w ; w++)
{
x = LETTER(*w);
if ( ! t->tree[x])
{
n = malloc(sizeof(Ttrees));
if (!n) exit(printf("Malloc error Ttrees\n")); // error, you can change that :)
for (i=0 ; i<26 ; i++) n->tree[i] = NULL;  // could be replaced with memset, not much difference
n->done = 0;
t->tree[x] = n;
NBtree++;
}
t = t->tree[x];
if ( ! w[1]) t->done = 1;
}
}

// tree_in: return  -1:no such word  0:incomplete word  1:complete(found)
// take a *lowercase* word as input

int tree_in (char *w)
{
Ttrees *t;
int r=0;

for (t=&tree ; !r && *w ; w++)
{
if  ( ! (t=t->tree[LETTER(*w)])) r = -1;
else if (!w[1] && t->done) r = 1;
}
return (r);
}
0

LVL 84

Assisted Solution

ozo earned 100 total points
ID: 10927638
#this may shave a few seconds off
\$Pattern=join ('|',@BigArray);
\$Pattern=qr/\$Pattern/;
while (<IN>) { if (m#\$Pattern#) {print \$_;} }
0

LVL 4

Author Comment

ID: 10928070
Hi Stefan and Mercantilum, I haven't yet tried this suffix tree method. I will do it tommorow.

ozo, this qr// saves me around 4 seconds. Previously it runs for 40 seconds, but now it runs in 36 seconds.

Thanx
0

LVL 10

Expert Comment

ID: 10928152
Just ensure you feed  tree_add  and  tree_in  with lowercase  words  (only [a-z]+ ) while you may modify the code to allocate 256 pointers (instead of 26 :)
0

LVL 20

Assisted Solution

jmcg earned 50 total points
ID: 10931405
Do you have a version of the 'fgrep' utility available? How does the running time of your perl solution compare with the run time of the following command?

fgrep -f big-array-file text-file

0

LVL 4

Author Comment

ID: 10935503
Hi stefan73

I installed that SuffixTree module and trie the following:

use SuffixTree;
my \$str = "mississippi";
my @bigarray=qw(s z x);     my \$Pattern=join('|',@bigarray);
my \$tree=create_tree(\$str);
my \$position = find_substring(\$tree, \$Pattern);
print \$position;
delete_tree(\$tree);

But, it doesn't work. I dont know how to make it to work for me. I think, find_substring doesn't support regular expression. Is there anything that I'm doing wrong?

Hi Mercantilum, Do I have to build an .exe for the .c that u gave, and then use that in my perl? Could u tell me what I should do?

Hi jmcg, I'm working on Windows 2000, and my perl program will also work on Windows 2000 systems.  I don't have fgrep. Do I have to download it and use that in my perl?

Thank u all
0

LVL 10

Expert Comment

ID: 10935667
The question is : do you still need Perl ?

Anyway, yes, build an exe (with ms vc++ or see http://www.cygwin.com/ in you need gcc and co)
prog   arrayfile  textfile

Taking arrayfile arg as the content of previous @array

Algo:
for each word of arrayfile, do tree_add (word)
for each word of textfile, if (tree_in (textword)) print line
0

LVL 4

Author Comment

ID: 10936230
These trees are working fine for plain text. But, is there anyway by which it can handle a little regular expressions too? Bcos by using trees separately and then using regex on them, consumes more time - around 63 seconds.
0

LVL 10

Accepted Solution

Mercantilum earned 150 total points
ID: 10936319
Ok so if you want something fast, stay with C, do:

man regex

(not sure you have it with ms vc++, you can get it from  http://www.cygwin.com/)
0

LVL 4

Author Comment

ID: 10983600
As of now I'm not going to use C for this, but, later i might. Thank u all for ur comments.

Srik
0

## Featured Post

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

On Microsoft Windows, if Â when you click or type the name of a .pl file, you get an error "is not recognized as an internal or external command, operable program or batch file", then this means you do not have the .pl file extension associated with â€¦
Email validation in proper way is  very important validation required in any web pages. This code is self explainable except that Regular Expression which I used for pattern matching. I originally published as a thread on my website : http://wwwâ€¦
Explain concepts important to validation of email addresses with regular expressions. Applies to most languages/tools that uses regular expressions. Consider email address RFCs: Look at HTML5 form input element (with type=email) regex pattern: Tâ€¦
Six Sigma Control Plans
###### Suggested Courses
Course of the Month6 days, 23 hours left to enroll