Solved

Regular Expression - Big Array!?!

Posted on 2004-04-26
20
382 Views
Last Modified: 2012-05-04
Hi friends

I have a VERY big array and a text file. I've to go thru every line of the text file and see if any ONE of the element of the array is in that. It is ENOUGH if atleast one match is there. Thats it. Now I use the following way:

$Pattern=join ('|',@BigArray);
while (<IN>) { if (m#$Pattern#) {print $&;} }

I also tried this:
$Pattern=join ('|',@BigArray);
undef($/);  $_=<IN>;
if (m#$Pattern#) {print $&}

Both the method consumes some 37 - 45 seconds. Is there any way to optimize this?

Thanx
0
Comment
Question by:vi_srikanth
  • 8
  • 4
  • 3
  • +4
20 Comments
 
LVL 3

Assisted Solution

by:PerlKing
PerlKing earned 50 total points
ID: 10925854
Ideally you should be using a HASH like this:
%patterns = map { $_, 1 } @BigArray;
while (<IN>) { if ( exists $patterns{$_} ) {print $&;} }

But your requirement is that "if you see if any ONE of the element" it it sufficient. So, using a HASH may not be all that beneficial but still this MAY BE BETTER than having a HUGE regular expression like that one that you created.

Did you already try this method?
0
 
LVL 84

Expert Comment

by:ozo
ID: 10926116
Are all the elements of the form \b\w+\b
Can there be more than one element per line?

Usiing $& slows down all matching operations, you might try just removing all $& in your program
0
 
LVL 84

Expert Comment

by:ozo
ID: 10926168
unless the elements in @BigArray end with \n
you may want to chomp; or map { "$_\n", 1 } when using PerlKing's method.
0
 
LVL 4

Author Comment

by:vi_srikanth
ID: 10926597
Perl King> Ideally you should be using a HASH
But, even this consumes more time. Is there any other way?

ozo> Are all the elements of the form \b\w+\b
No. Not necessarily. Some elements can be a part of a word.

ozo> Can there be more than one element per line?
There can be. But, I just want to match any one of the element in a line. That is enough. Once any one of the element is found, I will just stop processing that line.

Thanx
0
 
LVL 4

Author Comment

by:vi_srikanth
ID: 10926746
I also tried splitting the whole array in to 20 parts, and then joining them, and then matching them. But, this consumes more time than the previous methods.
0
 
LVL 18

Assisted Solution

by:kandura
kandura earned 50 total points
ID: 10926789
You might benefit from using a Bloom Filter: http://www.perl.com/pub/a/2004/04/08/bloom_filters.html
Bloom filters are a very efficient way to represent sets, and it is very easy to test whether a given object is a member of that set.
So if all you're looking for is
 if($value in @array)
then use a Bloom filter.
0
 
LVL 4

Author Comment

by:vi_srikanth
ID: 10926901
I think I haven't made my point clear. I'm looking for a code which will search/match the elements of an array in a file. For eg., if my array has qw(hi bye see) and if my file is:
"hi how r u?
this is just a test
see u"

the first and the second line have "hi" and "see" which are there in my array. I'm looking for a code that will find any one of the element of the array - in the file and print that. If I'm not clear plz tell me.
0
 
LVL 12

Assisted Solution

by:stefan73
stefan73 earned 100 total points
ID: 10926906
Hi kandura,
If one of the requirements is a partial word match, a Bloom filter won't do.

Assuming you don't do anything fancy in your regex ( you just want to see if something matches somewhere), try a suffix tree instead. Once you reach a leaf, you have a match.

See:
http://cs.haifa.ac.il/~shlomo/suffix_tree/

this is C, but there is a Perl module that can use the C code.

More links:
http://datacompression.info/SuffixTrees.shtml

Cheers,
Stefan
0
 
LVL 4

Author Comment

by:vi_srikanth
ID: 10926909
Sorry. I typed it wrongly in my previous post. It is not
> the first and the second line have "hi" and "see"
but, it is
the first and the THIRD line have "hi" and "see"
0
 
LVL 18

Expert Comment

by:kandura
ID: 10927041
Stefan73, thanks for the reference. That definitely seems to be a good way to taclke this problem.
0
Why You Should Analyze Threat Actor TTPs

After years of analyzing threat actor behavior, it’s become clear that at any given time there are specific tactics, techniques, and procedures (TTPs) that are particularly prevalent. By analyzing and understanding these TTPs, you can dramatically enhance your security program.

 
LVL 10

Assisted Solution

by:Mercantilum
Mercantilum earned 150 total points
ID: 10927242
Hi vi_srikanth,

Talking about a C/tree program: this is the code I used for fast searching words in a similar tree ; two functions
1. tree_add (w) adds a word w to the tree (no pb with doubles)
2. tree_in (w) search for w and returns -1(not found), 0(incomplete) or 1(found)

The tree_add function uses malloc() to allocate a Ttrees item.
The tree_in function should be fast.
You need #include <stdlib.h>

#define LETTER(a)       (unsigned)((a)-'a')

typedef struct tree {
  struct tree *tree[26];
  int done;
} Ttrees;

Ttrees tree;
int NBtree=0;

// tree_add: add a *lowercase* word to the tree [a-z]+
// take a word as input

void tree_add (char *w)
{
  Ttrees *t,*n;
  int i,x;

  for (t=&tree ; *w ; w++)
  {
    x = LETTER(*w);
    if ( ! t->tree[x])
    {
      n = malloc(sizeof(Ttrees));
      if (!n) exit(printf("Malloc error Ttrees\n")); // error, you can change that :)
      for (i=0 ; i<26 ; i++) n->tree[i] = NULL;  // could be replaced with memset, not much difference
      n->done = 0;
      t->tree[x] = n;
      NBtree++;
    }
    t = t->tree[x];
    if ( ! w[1]) t->done = 1;
  }
}

// tree_in: return  -1:no such word  0:incomplete word  1:complete(found)
// take a *lowercase* word as input

int tree_in (char *w)
{
  Ttrees *t;
  int r=0;

  for (t=&tree ; !r && *w ; w++)
  {
    if  ( ! (t=t->tree[LETTER(*w)])) r = -1;
    else if (!w[1] && t->done) r = 1;
  }
  return (r);
}
0
 
LVL 84

Assisted Solution

by:ozo
ozo earned 100 total points
ID: 10927638
#this may shave a few seconds off
$Pattern=join ('|',@BigArray);
$Pattern=qr/$Pattern/;
while (<IN>) { if (m#$Pattern#) {print $_;} }
0
 
LVL 4

Author Comment

by:vi_srikanth
ID: 10928070
Hi Stefan and Mercantilum, I haven't yet tried this suffix tree method. I will do it tommorow.

ozo, this qr// saves me around 4 seconds. Previously it runs for 40 seconds, but now it runs in 36 seconds.

Thanx
0
 
LVL 10

Expert Comment

by:Mercantilum
ID: 10928152
Just ensure you feed  tree_add  and  tree_in  with lowercase  words  (only [a-z]+ ) while you may modify the code to allocate 256 pointers (instead of 26 :)
0
 
LVL 20

Assisted Solution

by:jmcg
jmcg earned 50 total points
ID: 10931405
Do you have a version of the 'fgrep' utility available? How does the running time of your perl solution compare with the run time of the following command?

fgrep -f big-array-file text-file

0
 
LVL 4

Author Comment

by:vi_srikanth
ID: 10935503
Hi stefan73

I installed that SuffixTree module and trie the following:

use SuffixTree;
my $str = "mississippi";
my @bigarray=qw(s z x);     my $Pattern=join('|',@bigarray);
my $tree=create_tree($str);
my $position = find_substring($tree, $Pattern);
print $position;
delete_tree($tree);

But, it doesn't work. I dont know how to make it to work for me. I think, find_substring doesn't support regular expression. Is there anything that I'm doing wrong?

Hi Mercantilum, Do I have to build an .exe for the .c that u gave, and then use that in my perl? Could u tell me what I should do?

Hi jmcg, I'm working on Windows 2000, and my perl program will also work on Windows 2000 systems.  I don't have fgrep. Do I have to download it and use that in my perl?

Thank u all
0
 
LVL 10

Expert Comment

by:Mercantilum
ID: 10935667
The question is : do you still need Perl ?

Anyway, yes, build an exe (with ms vc++ or see http://www.cygwin.com/ in you need gcc and co)
Then your exe could be
   prog   arrayfile  textfile

Taking arrayfile arg as the content of previous @array
and textfile as your textile

Algo:
   for each word of arrayfile, do tree_add (word)
   for each word of textfile, if (tree_in (textword)) print line
0
 
LVL 4

Author Comment

by:vi_srikanth
ID: 10936230
These trees are working fine for plain text. But, is there anyway by which it can handle a little regular expressions too? Bcos by using trees separately and then using regex on them, consumes more time - around 63 seconds.
0
 
LVL 10

Accepted Solution

by:
Mercantilum earned 150 total points
ID: 10936319
Ok so if you want something fast, stay with C, do:

  man regex

(not sure you have it with ms vc++, you can get it from  http://www.cygwin.com/)
0
 
LVL 4

Author Comment

by:vi_srikanth
ID: 10983600
As of now I'm not going to use C for this, but, later i might. Thank u all for ur comments.

Srik
0

Featured Post

Enabling OSINT in Activity Based Intelligence

Activity based intelligence (ABI) requires access to all available sources of data. Recorded Future allows analysts to observe structured data on the open, deep, and dark web.

Join & Write a Comment

I've just discovered very important differences between Windows an Unix formats in Perl,at least 5.xx.. MOST IMPORTANT: Use Unix file format while saving Your script. otherwise it will have ^M s or smth likely weird in the EOL, Then DO NOT use m…
On Microsoft Windows, if  when you click or type the name of a .pl file, you get an error "is not recognized as an internal or external command, operable program or batch file", then this means you do not have the .pl file extension associated with …
Explain concepts important to validation of email addresses with regular expressions. Applies to most languages/tools that uses regular expressions. Consider email address RFCs: Look at HTML5 form input element (with type=email) regex pattern: T…
This demo shows you how to set up the containerized NetScaler CPX with NetScaler Management and Analytics System in a non-routable Mesos/Marathon environment for use with Micro-Services applications.

747 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

8 Experts available now in Live!

Get 1:1 Help Now