searchi script

Posted on 1999-07-05
Last Modified: 2010-03-04
Hi there, I want to prevent my search script from matching words inside HTML tags. How can I do it?
Urgent!
If you have an answer, please reply ASAP. Thanks.
Question by: prinx

Author Comment

ID: 1213877
Note: the script searches for strings in HTML files.

Expert Comment

ID: 1213879
One way would be to take away tags before searching, like:

open(FH, '<', $file) or die "Cannot open $file: $!";  # $file = file to search
while ($line = <FH>) {
  $line =~ s/<[^>]*>//g;  # take away anything between < and >
  # do your searching on $line
}
close(FH);

You could use a minimal match too, if you prefer...
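
The minimal-match variant mentioned above could look like the sketch below. The sample string and variable names are made up for illustration; `.*?` is Perl's non-greedy quantifier.

```perl
use strict;
use warnings;

# Hypothetical sample line, not taken from the original question.
my $line = 'Click <b>here</b> for <i>more</i> info';

# Negated character class, as in the loop above.
(my $by_class = $line) =~ s/<[^>]*>//g;

# Non-greedy ("minimal") match: .*? stops at the first > it can reach.
(my $by_minimal = $line) =~ s/<.*?>//g;

print "$by_class\n";     # Click here for more info
print "$by_minimal\n";   # Click here for more info
```

Note that `.` does not match a newline unless the `/s` modifier is added, while `[^>]` does.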

Expert Comment

ID: 1213880
Oh, another note (pressed submit too quickly), this won't work if the tags are spread out on multiple lines, like
<start of a multiline tag
that ends on next line>

but I'm sure you get the gist of it...
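
One possible way around the multiline limitation is to read the whole file into a single string before stripping, since `[^>]` also matches newlines. This is only a sketch: `slurp_file` and `strip_tags` are hypothetical helper names, and the sample input is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: read a whole file into one string.
sub slurp_file {
    my ($file) = @_;
    open(my $fh, '<', $file) or die "Cannot open $file: $!";
    local $/;                 # undefined record separator: read everything at once
    my $content = <$fh>;
    close($fh);
    return $content;
}

# Hypothetical helper: strip tags from a document held in one string.
sub strip_tags {
    my ($html) = @_;
    $html =~ s/<[^>]*>//g;    # [^>] matches newlines too, so tags may span lines
    return $html;
}

# Sample input with a tag split across two lines.
my $page = "before <a\n  href=\"x.html\">link</a> after\n";
print strip_tags($page);      # before link after
```

For a real file, something like `strip_tags(slurp_file($file))` would replace the sample string.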

Accepted Solution

jhurst earned 240 total points
ID: 1213881
Give some of the points to Hugh, he deserves them, he was on the right track.

There is a second problem with his idea: lines with multiple tags, such as <b>hello</b>, will be entirely eaten, since the hello is included between the < and >.


$page = join("", @page);  # one VERY long line
$page =~ s/>/>\n/g;       # so each tag ends a line, never more than one per line
$page =~ s/<.*>//g;       # remove the tags

You now have a string with no tags but everything else.

Expert Comment

ID: 1213882
<IMG SRC = "foo.gif" ALT = "A > B">
<IMG SRC = "foo.gif"
 ALT = "A > B">
<!-- <A comment> -->
<script>if (a<b && a>c)</script>
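
To see why these inputs are troublesome, the sketch below runs the simple `s/<[^>]*>//g` stripper from earlier in the thread over each of them. The array and variable names are made up for illustration.

```perl
use strict;
use warnings;

# The problem inputs listed above, each as one string.
my @cases = (
    '<IMG SRC = "foo.gif" ALT = "A > B">',
    "<IMG SRC = \"foo.gif\"\n ALT = \"A > B\">",
    '<!-- <A comment> -->',
    '<script>if (a<b && a>c)</script>',
);

my @results;
for my $html (@cases) {
    (my $text = $html) =~ s/<[^>]*>//g;   # the simple tag stripper
    push @results, $text;
    print "[$text]\n";
}
```

None of the four inputs holds text a search should match, yet the stripper leaves ` B">` for both IMG tags, ` -->` for the comment, and `if (ac)` for the script, because it treats the first `>` after a `<` as the end of a tag.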

Expert Comment

ID: 1213883
1) Nope, my idea wouldn't bail on
<b>hello</b>
since the regexp was
s/<[^>]*>//g

As you see, it will take away anything from a < to a >. The * is of course greedy, so if I had used
s/<.*>//g
it would have removed the hello as well. Therefore I use [^>]* instead, to take away anything from a < to a > that does not itself include a >. This way it behaves like a 'minimal match': it won't stretch over multiple tags.
However, neither of our proposals properly deals with multiline tags... Give me a few hours, and I'll give you something neater, though.
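
The difference between a greedy `.*` and the negated character class described above can be checked with a small sketch like this one. The sample line is made up for illustration.

```perl
use strict;
use warnings;

my $line = '<b>hello</b> world';   # sample line, made up for illustration

(my $greedy = $line) =~ s/<.*>//g;      # greedy: runs from the first < to the last >
(my $class  = $line) =~ s/<[^>]*>//g;   # stops at the first > after each <

print "greedy: [$greedy]\n";   # greedy: [ world]
print "class:  [$class]\n";    # class:  [hello world]
```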

Expert Comment

ID: 1213884
perldoc -q "remove HTML"
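
That command looks up the Perl FAQ entry on removing HTML from a string. One regex-only refinement in the same spirit is to let a tag contain whole quoted strings, so a `>` inside an attribute value no longer ends the tag. The sketch below assumes a hypothetical helper name, `strip_tags_quoted`, and a made-up sample input; it is not a real HTML parser.

```perl
use strict;
use warnings;

# Hypothetical helper; a regex sketch, not a real HTML parser.
sub strip_tags_quoted {
    my ($html) = @_;
    # Inside a tag, accept either characters that are not quotes or >,
    # or a complete quoted string, so a > inside "..." or '...' is skipped.
    # /s lets . match newlines inside quoted values.
    $html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
    return $html;
}

my $html = "<IMG SRC = \"foo.gif\"\n ALT = \"A > B\">caption";
print strip_tags_quoted($html), "\n";   # caption
```

This still mishandles comments such as `<!-- <A comment> -->` and the contents of `<script>` blocks; for those, a real parser module from CPAN such as HTML::Parser is likely the sturdier choice.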
