Solved

search script

Posted on 1999-07-05
Last Modified: 2010-03-04
Hi there, I want to prevent my search script from matching words inside HTML tags. How can I do it?
Urgent!
Thanks.
If you have an answer, please reply ASAP. Thanks.
Question by:prinx

Author Comment

by:prinx
ID: 1213877
Note: the script searches for strings in HTML files.
 

Expert Comment

by:Hugh_Jerpenis
ID: 1213879
One way would be to strip the tags before searching, like:

open(FH, $file) or die "Cannot open $file: $!";  # $file = file to search
while ($line = <FH>) {
  $line =~ s/<[^>]*>//g;  # remove anything between < and >
  # do your searching on $line
}
close(FH);

You could use a minimal match (s/<.*?>//g) too, if you prefer...
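
To tie this back to the original question, here is a minimal sketch of a search that looks only at the text left after stripping tags. The helper name matching_lines, the sample HTML, and the in-memory filehandle are assumptions made for illustration; a real script would open the HTML file instead. It shares the limitation of the loop above: a tag must start and end on the same line.

```perl
use strict;
use warnings;

# Hypothetical helper: return the numbers of the lines whose visible
# text (tags removed) contains $term, ignoring case.
sub matching_lines {
    my ($fh, $term) = @_;
    my @hits;
    my $n = 0;
    while (my $line = <$fh>) {
        $n++;
        (my $text = $line) =~ s/<[^>]*>//g;       # drop tags that start and end on this line
        push @hits, $n if $text =~ /\Q$term\E/i;  # \Q...\E treats $term as literal text
    }
    return @hits;
}

# Demo on an in-memory file (assumed sample data).
my $html = qq{<a href="font.html">link</a>\n<p>Pick a font here</p>\n<font size="2">small</font>\n};
open(my $fh, '<', \$html) or die "Cannot open in-memory file: $!";
my @hits = matching_lines($fh, 'font');
close($fh);
print "matched lines: @hits\n";
```

Only the second line is reported, because the word "font" on the first and third lines appears inside tags only.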
 

Expert Comment

by:Hugh_Jerpenis
ID: 1213880
Oh, another note (pressed submit too quickly): this won't work if a tag is spread over multiple lines, like
<start of a multiline tag
that ends on next line>

but I'm sure you get the gist of it...
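
A small sketch of the multiline problem and one possible way around it: apply the same substitution to the whole document as a single string. The sample HTML is an assumption made for illustration. Because [^>] also matches newlines, a tag split across lines is removed once the text is no longer processed line by line.

```perl
use strict;
use warnings;

# Assumed sample: the opening <a ...> tag is split across two lines.
my $html = "<a\n href=\"font.html\">link</a> plain text\n";

# Line by line: the first line has no closing >, so the split tag survives.
my $by_line = join '', map { (my $l = $_) =~ s/<[^>]*>//g; $l } split /^/, $html;

# Whole string: [^>] matches newlines too, so the split tag is removed.
(my $whole = $html) =~ s/<[^>]*>//g;

print "by line: $by_line";
print "whole:   $whole";
```

For a real file, the whole content can be read into one string by undefining the input record separator first, e.g. { local $/; $html = <FH>; }.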
 
LVL 8

Accepted Solution

by:
jhurst earned 240 total points
ID: 1213881
Give some of the points to Hugh; he deserves them, he was on the right track.

There is a second problem with his idea: lines with multiple tags, such as <b>hello</b>, will be eaten entirely, since the hello sits between the first < and the last >.

So,

open(FILE, "<your_file") or die "Cannot open your_file: $!";
@page = <FILE>;
close(FILE);
$page = join("", @page);  # one VERY long line
$page =~ s/>/>\n/g;       # so each tag ends a line, never more than one per line
$page =~ s/<.*>//g;       # remove the tags (works only because of the line breaks added above)

You now have a string with no tags but everything else.

 
LVL 85

Expert Comment

by:ozo
ID: 1213882
<IMG SRC = "foo.gif" ALT = "A > B">
<IMG SRC = "foo.gif"
 ALT = "A > B">
<!-- <A comment> -->
<script>if (a<b && a>c)</script>
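
These four inputs break the simple patterns above: a > inside a quoted attribute ends the match too early, and comments and scripts contain < and > that are not tags at all. Below is a rough regex-only sketch that copes with these particular cases. The helper name strip_html and the trailing "text" in each sample are assumptions made for illustration, and the approach is still not a real HTML parser: unbalanced quotes or a stray < in ordinary text will confuse it.

```perl
use strict;
use warnings;

# Hypothetical helper: strip markup from an HTML string using regexes only.
sub strip_html {
    my ($html) = @_;
    $html =~ s/<!--.*?-->//gs;                      # comments, possibly spanning lines
    $html =~ s/<script\b.*?<\/script\s*>//gis;      # script elements, content included
    $html =~ s/<(?>[^>'"]+|"[^"]*"|'[^']*')*>//g;   # tags, skipping over quoted attribute values
    return $html;
}

# The four cases above, each followed by some visible text.
my @samples = (
    qq{<IMG SRC = "foo.gif" ALT = "A > B">text},
    qq{<IMG SRC = "foo.gif"\n ALT = "A > B">text},
    qq{<!-- <A comment> -->text},
    qq{<script>if (a<b && a>c)</script>text},
);
my @stripped = map { strip_html($_) } @samples;
print "$_\n" for @stripped;

# For comparison, the simple pattern stops at the > inside the quotes.
(my $naive = $samples[0]) =~ s/<[^>]*>//g;
print "naive: $naive\n";
```

The order matters in this sketch: comments and scripts are removed first so that the < and > inside them never reach the tag pattern.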
 

Expert Comment

by:Hugh_Jerpenis
ID: 1213883
1) Nope, my idea wouldn't fail on
<B>Hello</B>
since the regexp was
s/<[^>]*>//g;

As you can see, it removes everything from a < to the next >. The * is of course greedy, so if I had used
s/<.*>//g;
it would have removed the Hello as well. That is why I use [^>]* instead: it matches only characters that are not >, so the match stops at the first >. This way it behaves like a 'minimal match' and won't stretch over multiple tags.
However, neither of our proposals properly deals with multiline tags... Give me a few hours, and I'll give you something neater though.
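
The difference between the three patterns can be checked on a one-line sample. This is only an illustrative sketch; the sample string and variable names are assumptions.

```perl
use strict;
use warnings;

my $line = '<B>Hello</B> world';

(my $greedy  = $line) =~ s/<.*>//g;     # .* runs to the LAST > on the line
(my $negated = $line) =~ s/<[^>]*>//g;  # [^>]* cannot cross a >, so it stops at the first one
(my $minimal = $line) =~ s/<.*?>//g;    # .*? takes as little as possible

print "greedy:  '$greedy'\n";   # ' world'
print "negated: '$negated'\n";  # 'Hello world'
print "minimal: '$minimal'\n";  # 'Hello world'
```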
 
LVL 85

Expert Comment

by:ozo
ID: 1213884
perldoc -q "remove HTML"