Using REGEXP with VBScript to ignore HTML

Using regular expressions, I am trying to replace a pattern within a string, but the replacement should have no effect on text between "<" and ">". The result should be HTML whose tags are unaffected by the changes while the text between the tags is altered.
I am unfamiliar with the syntax and use of the RegExp object, so your help would be appreciated.

avner

I don't think you can do that. The only way is to read the data into a string, then test it character by character to verify you are not inside an HTML tag.
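
For readers following along, the character-by-character idea can be sketched like this. It is only an illustration in JavaScript, not code from the thread: the function name replaceOutsideTags is made up, and it assumes that tags never contain a ">" inside an attribute value.

```javascript
// Hypothetical helper (not from the thread): replace whole-word occurrences
// of `word`, but only in text that is outside <...> tags.
// Assumption: no ">" appears inside an attribute value.
function replaceOutsideTags(html, word, replacement) {
  // Escape regex metacharacters so `word` is matched literally.
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const wordRe = new RegExp("\\b" + escaped + "\\b", "g");
  let out = "";
  let text = "";     // text collected since the last tag ended
  let inTag = false; // true while we are between "<" and ">"
  for (const ch of html) {
    if (!inTag && ch === "<") {
      // A tag starts: flush the collected text with replacements applied.
      out += text.replace(wordRe, () => replacement);
      text = "";
      inTag = true;
      out += ch;
    } else if (inTag) {
      // Inside a tag: copy characters through untouched.
      out += ch;
      if (ch === ">") inTag = false;
    } else {
      text += ch;
    }
  }
  // Flush whatever text follows the final tag.
  return out + text.replace(wordRe, () => replacement);
}

console.log(replaceOutsideTags('<a href="teh.html">teh cat</a>', "teh", "the"));
```

The attribute value teh.html is left alone because the scanner is inside a tag at that point; only the link text is changed.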

kine

ASKER

I take your point, but I think that I am getting close to my objective by using
dim objregx
set objregx = New RegExp
objregx.Pattern = ">\w\b"& wordtochange&"\b\w<"
objregx.Global = True
back = objregx.Replace(stringtobechanged,replacementstring)

Set objregx = nothing

Any ideas?

ASKER CERTIFIED SOLUTION

avner

avner

instead of join and split, you can also use :

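function changeTextOnly(obj, sFrom, sTo)
{
    // Work on innerHTML: innerText has the tags stripped, so there is no ">" or "<" to match.
    var sHtml = obj.innerHTML;

    // Each run of text between ">" and "<"; sFrom is used as a regular expression source.
    var reText = new RegExp(">[^<]*<", "g");
    var reFrom = new RegExp(sFrom, "g");

    // Replace sFrom inside each text run only. A single pass cannot loop forever
    // the way repeated re-testing could when sTo contains sFrom.
    obj.innerHTML = sHtml.replace(reText, function (sRun) {
        return sRun.replace(reFrom, sTo);
    });
}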

kine

ASKER

Thanks, that looks good. I will give it a spin and let you know how it works.

markhoy

Have a look at these if you haven't already:

http://www.aspfaqs.com/aspfaqs/ShowFAQ.asp?FAQID=155
http://www.aspfaqs.com/aspfaqs/ShowFAQ.asp?FAQID=99

http://www.aspfaqs.com/aspfaqs/ShowCategory.asp?CatID=16

markhoy

http://www.codeproject.com/asp/removehtml.asp

kine

ASKER

Thanks for those links. It does seem pretty easy to remove the HTML tags, but it's not so simple when you just want to ignore them and alter specific text between them. I did try to convert the great script suggested by avner from client-side JavaScript to server-side VBScript. Unfortunately I could not get it to work as well as it does in the above script.

avner

kine, please post your VBScript and I'll try to look at it.

markhoy

A very convoluted way (using the replace html tags method) would be to replace all the HTML blocks with markers e.g #1#, and then to use your regexp on the remaining text, and then to replace the markers with the original HTML tags...
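
A rough sketch of that marker approach, written in JavaScript only for illustration. The function name replaceWithMarkers is hypothetical, and the marker uses a control character (\u0001) instead of "#" on the assumption that it is less likely to collide with real page text. It also assumes the search pattern never matches digits or the marker character itself.

```javascript
// Hypothetical sketch of the marker idea (names are not from the thread).
// Assumptions: the page text never contains \u0001, and `pattern` does not
// match digits or \u0001, so the markers survive step 2 untouched.
function replaceWithMarkers(html, pattern, replacement) {
  const tags = [];
  // Step 1: swap every tag for a numbered marker and remember the tag.
  const marked = html.replace(/<[^>]+>/g, (tag) => {
    tags.push(tag);
    return "\u0001" + (tags.length - 1) + "\u0001";
  });
  // Step 2: run the real replacement on what is now tag-free text.
  const changed = marked.replace(pattern, replacement);
  // Step 3: put the original tags back in place of the markers.
  return changed.replace(/\u0001(\d+)\u0001/g, (m, i) => tags[Number(i)]);
}

console.log(replaceWithMarkers('<p class="teh">teh end</p>', /\bteh\b/g, "the"));
```

The class attribute keeps its original value because it is hidden inside a marker while the replacement runs.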

kine

ASKER

Yeah, markhoy, a friend of mine says the same. My biggest problem (amongst many) is the variable number of spaces before and after the piece of text that I wish to alter. This is the pattern:
objregx.Pattern = ">[^>]*\b[ ]{1}"& wordtoreplace &"\b[ ]{1}[^<]*<"
Regular expressions look as if they will be simple to understand. If only that were the case.

kine

ASKER

I see one of the problems that I'm having: the code is replacing everything between ">" and "<" when it sees the first error, so the other misspellings and the correct words are being overwritten.
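
One common way around this kind of overwrite is to capture the text on either side of the word and put it back in the replacement. The sketch below is in JavaScript for illustration only; the function names are made up, and `word` is assumed to contain no regex metacharacters. The VBScript RegExp object also accepts $1 and $2 in the replacement string, so the same idea should carry over.

```javascript
// Hypothetical illustration (names are not from the thread).
// Assumption: `word` contains no regex metacharacters.

// Replacing the whole match throws away everything between ">" and "<".
function replaceWhole(html, word) {
  const link = "<a href='#'>" + word + "</a>";
  const re = new RegExp(">[^<>]*\\b" + word + "\\b[^<>]*<", "g");
  return html.replace(re, link);
}

// Capturing the surroundings and re-inserting them keeps the other words.
function linkWord(html, word) {
  const link = "<a href='#'>" + word + "</a>";
  const re = new RegExp("(>[^<>]*\\b)" + word + "(\\b[^<>]*<)", "g");
  return html.replace(re, "$1" + link + "$2");
}

console.log(replaceWhole("<p>the quick teh fox</p>", "teh"));
console.log(linkWord("<p>the quick teh fox</p>", "teh"));
```

Note that this changes only one occurrence per run of text on each pass, because the groups swallow the rest of the run.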

kine

ASKER

Here is the whole ugly thing

<html>
<head>
<title>spell checker</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

</head>
<body>

<%
'all the words in the world, ordered by their initial letters
'

test= request.form("html")
back = test
'back=replace(back,">","> ")
'back=replace(back,"<"," <")
back=replace(back," ","   ")
Dim objRegE
Set objRegE = New RegExp
objRegE.Pattern = "[0-9]?"
objRegE.IgnoreCase = True
objRegE.Global = true
test= objRegE.Replace(test, "")
Set objRegE = Nothing

' Clears the string of "<>" and contents
Function clearTags(str)
Dim re
Set re = New RegExp
re.Pattern = "<[^>]+>"
re.Global = true
clearTags = re.Replace(str," ")
End Function

test = clearTags(test)

test=replace(test," "," ")
test=replace(test,Chr(9),"")
test=replace(test,Chr(10),"")
test=replace(test,Chr(11),"")
test=replace(test,Chr(12),"")
test=replace(test,Chr(13),"")
test=replace(test,"-"," ")
test=replace(test,".","")
test=replace(test,",","")
test=replace(test,"?","")
test=replace(test,"!","")
test=replace(test,";","")
test=replace(test,":","")
test=replace(test,"\","")
test=replace(test,"|","")
test=replace(test,"("," ")
test=replace(test,")"," ")
test=replace(test,"[","")
test=replace(test,"]","")
test=replace(test,"{","")
test=replace(test,Chr(34),"")
test=replace(test,"<","")
'test=replace(test,">","")
'test=replace(test,"<","")
test=replace(test,"=","")
test=replace(test,">>","")
test=replace(test,"+","")
test=replace(test,"_","")
test=replace(test,"'s","")
'test=replace(test,Chr(32),"")

test=replace(test," "," ")
'replaces some other characters
test=replace(test," "," ")
test=replace(test," "," ")

'creates an array from the string

test2=split(test," ")
i=0

'Go through the contents of the array one by one, trimming spaces and getting the initial letter and the first two characters;
'these are to be used to find the dictionary variables
do while i < ubound(test2)
lal=test2(i)
lal=trim(lal)
lala=lcase(left(lal,2))
initials=lcase(left(lal,1))
'response.write ubound(test2)-i &" ; lal="& lal &" test3 ="
'use the array.inc to find which variable in the diction.inc to check
%>


<%
' run through the dictionary variable and see if the current array item has a match
Dim objRegExp
Set objRegExp = New RegExp
objRegExp.Pattern = "\b"& lal &"\b"
objRegExp.IgnoreCase = True
objRegExp.Global = false
Dim strStringToSearch
strStringToSearch = dict
loo=objRegExp.Test(strStringToSearch)
Set objRegExp = Nothing

Dim objRegEx
Set objRegEx = New RegExp
objRegEx.Pattern = "\b"& lal &"\b"
objRegEx.IgnoreCase = True
objRegEx.Global = false
zoo=objRegEx.Test(passed)
Set objRegEx = Nothing
if zoo = false then
'if InStr(passed,lal)=0 then

'boo=inStr (dict,test2(i))
if loo=false then

test3="<a href='#'>"& lal &"</a>"
'response.write test3 & lal &"<br>"

dim objregx
set objregx = New RegExp
'>[^<] *("+sFrom+")[^<] *<)","[^<(.+?)>]>
'\b"& lal &"\b
objregx.Pattern = ">[^<>]*\b "& lal &" \b[^<>]*<"
'objregx.Pattern = ">[^>]*[ ]{1,5}\b"& lal &"\b[ ]{1,5}[^<]*<"
objregx.Global = True
back = objregx.Replace(back,test3)

Set objregx = nothing

end if
end if
i=i+1
passed =passed & lal & " "
'response.write loo & lal &"<br>"

response.write test3 &" loo = "& loo&"<br>"
loop

response.write back
'response.write now
'response.write "<br>"& passed
%>

</body>
</html>

avner

kine, why can't you use the JavaScript code?

I'm unable to test your code since I am not running IIS.

kine

ASKER

Learning experience, I'm afraid. I have set myself a series of tasks to be completed with specific methods. I'm getting closer anyway. I will post the results up here when it's done.

avner

Let us know if you need additional specific help with RegExp.

markhoy

Try this:

Dim myRegExp
Set myRegExp = New RegExp
myRegExp.Pattern = "<[^\>]*>"
myRegExp.Global = True
myRegExp.IgnoreCase = True

markhoy

Finding text that is NOT in HTML tags:

First the function:
--------------
Function RegExpReplace(strInput, strPattern, strReplace)
    ' Use <?> to indicate the match you wish to replace

    ' Create and setup several variables:
    Dim regEx, Match, Matches, Position, strReturn
    Position = 1
    strReturn = ""

    ' Set up the regular expression:
    Set regEx = New RegExp
    regEx.Pattern = strPattern
    regEx.IgnoreCase = True
    regEx.Global = True

    ' Get all the matches for it:
    Set Matches = regEx.Execute(strInput)

    ' Go through the Matches collection
    ' and build the output string:
    For Each Match in Matches
        strReturn = strReturn & Mid(strInput, Position, Match.FirstIndex+1-Position)
        strReturn = strReturn & Replace(strReplace, "<?>", Match.Value)
        Position = Len(Match.Value) + Match.FirstIndex + 1
    Next

    ' Add any text after the last match
    strReturn = strReturn & Mid(strInput, Position, Len(strInput))

    RegExpReplace = strReturn
End Function
--------------
This was grabbed right from the previous post which came from this article: http://www.aspfaqs.com/aspfaqs/ShowFAQ.asp?FAQID=66

But the example in the previous article didn't quite work for me...
--------------
strHTML = RegExpReplace(strHTML, "strong(?![^<]+>)", "*<?>*")
--------------
...it kept replacing text in html tags as well as text not in html tags. So, I banged around with it and came up with this modification:
--------------
strHTML = RegExpReplace(strHTML, "(?![^<]+>)" + strSearch + "(?![^<]+>)", strReplace)
--------------
strSearch is the text that I'm looking for and strReplace is what I want to replace it with. You might be able to modify this for your needs?
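
For anyone wondering why the first pattern misbehaves, here is a small JavaScript sketch (the function names are made up for illustration). The lookahead (?![^<]+>) means "not followed by one or more non-< characters and then a >". When the search word sits right before the ">" of a tag, as in <strong>, the only way for [^<]+ to match is to consume that ">", and then no second ">" follows before the next "<", so the lookahead does not block the match. The variant below uses [^<>]*> instead, which is my own variation and not the pattern from the article.

```javascript
// Hypothetical illustration (names are not from the thread).
// Assumption: `search` contains no regex metacharacters.

// Pattern as quoted from the article: the lookahead needs at least one
// character before ">", so a word directly before ">" slips through.
function replaceArticleStyle(html, search, replacement) {
  const re = new RegExp(search + "(?![^<]+>)", "gi");
  return html.replace(re, replacement);
}

// Variation: skip a match if the next angle bracket after it is ">",
// which means the match is inside a tag.
function replaceOutsideTagsLookahead(html, search, replacement) {
  const re = new RegExp(search + "(?![^<>]*>)", "gi");
  return html.replace(re, replacement);
}

const sample = "<strong>strong</strong> text";
console.log(replaceArticleStyle(sample, "strong", "*strong*"));
console.log(replaceOutsideTagsLookahead(sample, "strong", "*strong*"));
```

Like the patterns above, this assumes no ">" appears in the text itself or inside attribute values.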

markhoy

Also,

https://www.experts-exchange.com/questions/20433094/regex-text-outside-of-a-href-tags.html

And if you are happy to use Perl see here:

http://www.perlmonks.org/index.pl?node_id=246935

avner

kine, do you need any further help with this question, or can it be closed?

kine

ASKER

Yeah, it's done and dusted. Your code showed the way. Cheers