kine
Using REGEXP with VBScript to ignore HTML

Using regular expressions, I am trying to replace a pattern within a string, but the replacement should have no effect on text between "<" and ">". The result should be HTML whose tags are unaffected by the changes while the text between the tags is altered.
I am unfamiliar with the syntax and use of the RegExp object, so your help would be appreciated.
avner

I don't think you can do that. The only way is to get the data into a string, then test it character by character to verify that you are not inside an HTML tag.
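
The character-by-character test described above can be sketched as follows. This is a minimal illustration in JavaScript, not code from the thread: the function name `replaceOutsideTags` and its parameters are hypothetical, and it assumes that every "<" opens a tag and every ">" closes one (so comments, scripts, and a ">" inside an attribute value are not handled).

```javascript
// Hypothetical helper: replace `word` with `replacement` only when the
// scanner is not between "<" and ">".
function replaceOutsideTags(html, word, replacement) {
  if (!word) {
    return html; // nothing to search for
  }
  var out = "";
  var insideTag = false;
  var i = 0;
  while (i < html.length) {
    var ch = html.charAt(i);
    if (ch === "<") {
      insideTag = true;   // entering a tag
    } else if (ch === ">") {
      insideTag = false;  // leaving a tag
    } else if (!insideTag && html.substring(i, i + word.length) === word) {
      out += replacement; // found the word in text content
      i += word.length;
      continue;
    }
    out += ch;            // copy everything else through unchanged
    i += 1;
  }
  return out;
}
```

The same loop translates to VBScript with Mid and Len; the only state that has to be carried along is the inside-a-tag flag.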
ASKER

I take your point, but I think I am getting close to my objective by using:
dim objregx
set objregx = New RegExp
objregx.Pattern =  ">\w\b"& wordtochange&"\b\w<"
objregx.Global = True
back = objregx.Replace(stringtobechanged, replacementstring)

Set objregx = nothing

Any ideas?
ASKER CERTIFIED SOLUTION
avner

instead of join and split, you can also use :

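function changeTextOnly(obj, sFrom, sTo)
{
    // Work on the markup so that the ">" and "<" anchors are present in the string.
    var sText = obj.innerHTML;

    // A run of text between ">" and "<" that contains sFrom.
    // sFrom is treated as a regular expression, so escape special characters first.
    var iRegExp1 = new RegExp(">[^<]*" + sFrom + "[^<]*<", "g");
    var iRegExp2 = new RegExp(sFrom, "g");

    // Replace inside each matched run only. A function replacement avoids
    // re-testing a global RegExp in a loop, which can skip matches.
    sText = sText.replace(iRegExp1, function (sRun) {
        return sRun.replace(iRegExp2, function () { return sTo; });
    });

    obj.innerHTML = sText;
}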
ASKER

Thanks, that looks good. I will give it a spin and let you know how it works.
ASKER

Thanks for those links. It does seem pretty easy to remove the HTML tags, but it's not so simple when you just want to ignore them and alter specific text between them. I did try to convert the great script suggested by avner from client-side JavaScript to server-side VBScript. Unfortunately, I could not get it to work as well as the script above does.
kine, please post your VBScript and I'll try to look at it.
A very convoluted way (using the replace-HTML-tags method) would be to replace all the HTML blocks with markers, e.g. #1#, then use your regexp on the remaining text, and then replace the markers with the original HTML tags.
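
The marker idea above might look roughly like this in JavaScript. It is a sketch under stated assumptions rather than anything posted in the thread: `replaceWithMarkers` is a hypothetical name, the markers use a NUL character as the delimiter instead of "#" so that ordinary page text cannot collide with them, and the search word is assumed to contain no digits and no NUL character (otherwise it could match inside a marker).

```javascript
// Hypothetical helper: hide the tags behind numbered markers, replace in
// the remaining text, then put the original tags back.
function replaceWithMarkers(html, word, replacement) {
  var tags = [];

  // Step 1: swap each tag for a marker such as NUL + "0" + NUL and remember the tag.
  var masked = html.replace(/<[^>]*>/g, function (tag) {
    tags.push(tag);
    return "\u0000" + (tags.length - 1) + "\u0000";
  });

  // Step 2: only text is left between the markers, so a plain literal
  // replacement cannot touch any markup.
  var changed = masked.split(word).join(replacement);

  // Step 3: restore each tag from its marker number.
  return changed.replace(/\u0000(\d+)\u0000/g, function (whole, n) {
    return tags[Number(n)];
  });
}
```

It takes three passes over the string, which is why it feels convoluted, but each pass is simple enough to port to VBScript with RegExp.Execute and Replace.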
ASKER

Yeah, markhov, a friend of mine says the same. My biggest problem (amongst many) is the variable number of spaces before and after the piece of text that I wish to alter. This is the pattern:
objregx.Pattern =  ">[^>]*\b[ ]{1}"&  wordtoreplace &"\b[ ]{1}[^<]*<"
Regular expressions look as if they will be simple to understand; if only that were the case.
ASKER

I see one of the problems that I'm having: the code is replacing everything between ">" and "<" when it sees the first error, so the other misspellings and the correct words are being overwritten.
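
One way to keep the rest of the text run is to capture what comes before and after the word and write both back around the replacement. The sketch below shows the idea in JavaScript; `wrapWordInTextRun` and its parameters are hypothetical names, and the word is assumed to contain only letters, so it needs no escaping inside the pattern. In VBScript the equivalent would be to put the same parentheses in the pattern and use $1 and $2 in the replacement string.

```javascript
// Hypothetical helper: match a whole text run between ">" and "<", but
// re-emit the captured text on either side of the word.
function wrapWordInTextRun(html, word, wrapped) {
  var re = new RegExp("(>[^<>]*\\b)" + word + "(\\b[^<>]*<)", "g");
  return html.replace(re, function (whole, before, after) {
    // `before` and `after` are the parts of the run that the
    // uncaptured pattern would have overwritten.
    return before + wrapped + after;
  });
}
```

Because each match consumes the whole run, a single pass changes only one occurrence of the word per run; repeated words in the same run would need another pass or one of the other approaches in this thread.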
ASKER

Here is the whole ugly thing

<html>
<head>
<title>spell checker</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

</head>
<!--#include file="dictionary/diction1.inc" -->
'all the words in the world ordered by their initial letters
<%
'

test= request.form("html")
back = test
'back=replace(back,">","> ")
'back=replace(back,"<"," <")
back=replace(back,"&nbsp;"," &nbsp; ")
  Dim objRegE
  Set objRegE = New RegExp
  objRegE.Pattern = "[0-9]?"
  objRegE.IgnoreCase = True
 objRegE.Global = true    
 test= objRegE.Replace(test, "")
  Set objRegE = Nothing


' Clears the string of "<>" and contents
Function clearTags(str)    
    Dim re    
    Set re = New RegExp    
    re.Pattern = "<[^>]+>"
    re.Global = true
    clearTags = re.Replace(str," ")
End Function

test = clearTags(test)

test=replace(test,"&nbsp;"," ")    
test=replace(test,Chr(9),"")
test=replace(test,Chr(10),"")
test=replace(test,Chr(11),"")
test=replace(test,Chr(12),"")
test=replace(test,Chr(13),"")
test=replace(test,"-"," ")
test=replace(test,".","")
test=replace(test,",","")
test=replace(test,"?","")
test=replace(test,"!","")
test=replace(test,";","")
test=replace(test,":","")
test=replace(test,"\","")
test=replace(test,"|","")
test=replace(test,"("," ")
test=replace(test,")"," ")
test=replace(test,"[","")
test=replace(test,"]","")
test=replace(test,"{","")
test=replace(test,Chr(34),"")
test=replace(test,"&lt;","")
'test=replace(test,">","")
'test=replace(test,"<","")
test=replace(test,"=","")
test=replace(test,">>","")
test=replace(test,"+","")
test=replace(test,"_","")
test=replace(test,"'s","")
'test=replace(test,Chr(32),"")


test=replace(test,"    "," ")
'replaces some other characters
test=replace(test,"   "," ")
test=replace(test,"  "," ")

'creates an array from the string          

test2=split(test," ")
i=0

'Go through the contents of the array one by one, trimming spaces and getting the initial letter and the first two characters,
'these are used to find the dictionary variables
do while i < ubound(test2)
lal=test2(i)
lal=trim(lal)
lala=lcase(left(lal,2))
initials=lcase(left(lal,1))
'response.write ubound(test2)-i &" ; lal="& lal &" test3 ="
'use the array.inc to find which variable in the diction.inc to check
%>
<!--#include file="dictionary/array.inc" -->

<%
' run through the dictionary variable and see if the current array item has a match
  Dim objRegExp
  Set objRegExp = New RegExp
  objRegExp.Pattern = "\b"& lal &"\b"
  objRegExp.IgnoreCase = True  
  objRegExp.Global = false
  Dim strStringToSearch
  strStringToSearch = dict  
  loo=objRegExp.Test(strStringToSearch)
  Set objRegExp = Nothing    


  Dim objRegEx
  Set objRegEx = New RegExp
  objRegEx.Pattern = "\b"& lal &"\b"
  objRegEx.IgnoreCase = True
  objRegEx.Global = false    
  zoo=objRegEx.Test(passed)
  Set objRegEx = Nothing
  if zoo = false then
'if InStr(passed,lal)=0 then

'boo=inStr (dict,test2(i))
if loo=false then

test3="<a href='#'>"& lal &"</a>"
'response.write test3 & lal &"<br>"



dim objregx
set objregx = New RegExp
'>[^<] *("+sFrom+")[^<]  *<)","[^<(.+?)>]>
'\b"&  lal &"\b
objregx.Pattern =  ">[^<>]*\b "&  lal &" \b[^<>]*<"
'objregx.Pattern =  ">[^>]*[ ]{1,5}\b"&  lal &"\b[ ]{1,5}[^<]*<"
objregx.Global = True
back = objregx.Replace(back,test3)

Set objregx = nothing




end if
end if
i=i+1
passed =passed & lal & " " 
'response.write loo & lal &"<br>"

response.write  test3 &" loo = "& loo&"<br>"
loop

response.write back
'response.write now
'response.write "<br>"& passed
%>


</body>
</html>
kine, why can't you use the JavaScript code?

I'm unable to test your code since I am not running IIS.
ASKER

Learning experience, I'm afraid. I have set myself a series of tasks to be completed with specific methods. I'm getting closer anyway. I will post the results up here when it's done.
Let us know if you need additional specific help with RegExp.
Try this:

Dim myRegExp
Set myRegExp = New RegExp
myRegExp.Pattern = "<[^\>]*>"
myRegExp.Global = True
myRegExp.IgnoreCase = True
Finding text that is NOT in HTML tags:

First the function:
--------------
Function RegExpReplace(strInput, strPattern, strReplace)
    ' Use <?> to indicate the match you wish to replace

    ' Create and setup several variables:
    Dim regEx, Match, Matches, Position, strReturn
    Position = 1
    strReturn = "" 

    ' Set up the regular expression:
    Set regEx = New RegExp
    regEx.Pattern = strPattern
    regEx.IgnoreCase = True
    regEx.Global = True

    ' Get all the matches for it:
    Set Matches = regEx.Execute(strInput)

    ' Go through the Matches collection
    ' and build the output string:
    For Each Match in Matches
        strReturn = strReturn & Mid(strInput, Position, Match.FirstIndex+1-Position)
        strReturn = strReturn & Replace(strReplace, "<?>", Match.Value)
        Position = Len(Match.Value) + Match.FirstIndex + 1
    Next

    ' Add any text after the last match
    strReturn = strReturn & Mid(strInput, Position, Len(strInput))

    RegExpReplace = strReturn
End Function
--------------
This was grabbed right from the previous post which came from this article: http://www.aspfaqs.com/aspfaqs/ShowFAQ.asp?FAQID=66 

But the example in the previous article didn't quite work for me...
--------------
strHTML = RegExpReplace(strHTML, "strong(?![^<]+>)", "*<?>*")
--------------
...it kept replacing text in html tags as well as text not in html tags. So, I banged around with it and came up with this modification:
--------------
strHTML = RegExpReplace(strHTML, "(?![^<]+>)" + strSearch + "(?![^<]+>)", strReplace)
--------------
strSearch is the text that I'm looking for and strReplace is what I want to replace it with. You might be able to modify this for your needs?
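
For comparison, here is the negative-lookahead idea as a small JavaScript sketch. The name `replaceOutsideTagsLookahead` is hypothetical, and the word is assumed to contain no regular-expression metacharacters. The lookahead uses "*" rather than "+": with "+", a word that is immediately followed by ">" (as in the tag name of <strong>) does not seem to be recognised as being inside a tag, which may be why the first version above also changed text in tags.

```javascript
// Hypothetical helper: match `word` only if the next angle bracket after
// it is not ">", i.e. only if the word is not inside a tag.
function replaceOutsideTagsLookahead(html, word, replacement) {
  var re = new RegExp(word + "(?![^<>]*>)", "g");
  // A function replacement keeps "$" in `replacement` literal.
  return html.replace(re, function () {
    return replacement;
  });
}
```

Like the patterns above, this assumes that "<" and ">" only ever appear as tag delimiters in the input.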



kine, do you need any further help with this question, or can it be closed?
ASKER

Yeah, it's done and dusted. Your code showed the way. Cheers.