Eddie Shipman
asked on
rlibby -> RegExpr
Russell, I'm trying to convert some VB.Net code to Delphi and it includes LOTS of .Net RegEx class calls in it.
Can you help me convert the expression to work with your class? If so, I'll post the particular VB.Net regex
expressions.
The code in question is a MHT builder. They use regex to retrieve parts of an html page for downloading.
Would I be able to use your code in an article if I write one about this MHTBuilder class I'm building?
Can you help me convert the expression to work with your class? If so, I'll post the particular VB.Net regex
expressions.
The code in question is a MHT builder. They use regex to retrieve parts of an html page for downloading.
Would I be able to use your code in an article if I write one about this MHTBuilder class I'm building?
ASKER
Well, not sure what you mean by "backreferencing" as I know next to nothing about RegEx.
Glad you were able to find that source. I have got a whole lot of it converted already but have been
vacillating over how to replace that regex stuff. I thought about using TRegExpr but since I don't
know anything about expressions or even what they are doing in the VB.Net code, I am stuck.
The major problems I've been having is the parsing of the relative URLs to create absolute
local URLs for use in the MHT.
On anther note:
I am using idHTTP to do the retrieval of the items but I think I'm going to switch to using WinInet
so any developers won't have to have Indy to use it; what do you think?
Glad you were able to find that source. I have got a whole lot of it converted already but have been
vacillating over how to replace that regex stuff. I thought about using TRegExpr but since I don't
know anything about expressions or even what they are doing in the VB.Net code, I am stuck.
The major problems I've been having is the parsing of the relative URLs to create absolute
local URLs for use in the MHT.
On anther note:
I am using idHTTP to do the retrieval of the items but I think I'm going to switch to using WinInet
so any developers won't have to have Indy to use it; what do you think?
A little descriptive from MS on backreferencing:
---
Backreferences provide a convenient way to find repeating groups of characters. They can be thought of as a shorthand instruction to match the same string again.
For instance, to find repeating adjacent characters such as the two Ls in the word "tall", you would use the regular expression (?<char>\w)\k<char>, which uses the metacharacter \w to find any single-word character. The grouping construct (?<char> ) encloses the metacharacter to force the regular expression engine to remember a subexpression match (which in this case will be any single character) and save it under the name "char". The backreference construct \k<char> causes the engine to compare the current character to the previously matched character stored under "char". The entire regular expression successfully finds a match wherever a single character is the same as the preceding character.
---
It can make life easier, as you only have to write a single expression and let the parser do the rest. It can also make life harder, as it is very hard to track what the actual expression is doing. Anyways, you could also easily do this with a few lines of code using a basic parser (like the one I posted code for).
Regarding using TRegExpr in your source, for articles, etc. I don't mind in the least how/where it is used, or even if any credit is given. I only posted it (in another q) because it is a nice example of a state machine that is:
1.) Pretty easy to read and follow
2.) Can easily be extended because of (1)
3.) Handles 90% of things that we most commonly use regular expression parsers for
4.) Is very fast because it is implemented as a non-lazy/non-greedy parser.
All (3) means is that the programmer has to do a little more manual parsing (like in the backreferencing example). This is not necessarily a negative, because in many cases its faster to parse a string, then parse the substring using a strict parser than it is to parse the string using a lazy parser. Or maybe it was because I was too lazy to implement the lazy routines? <g>....
Regarding Indy or WinInet; If it was me, I would probably use the greatest common denominator - which would be wininet, as long as it provided me with all the functionality I required. Its just a case where more delphi developers will have wininet vs indy. Plus, its a whole lot less baggage to carry around. Speaking only from my personal experience, I will go right down to the socket api level if I don't need the extra overhead.
-- please note, this is my opinion only, and I do not presume to speak for anyone else. --
But back to your problem; if you want to post the section of code that is causing you problems, then I could take a look at it for you and try and provide a good/solid solution to it.
Russell
---
Backreferences provide a convenient way to find repeating groups of characters. They can be thought of as a shorthand instruction to match the same string again.
For instance, to find repeating adjacent characters such as the two Ls in the word "tall", you would use the regular expression (?<char>\w)\k<char>, which uses the metacharacter \w to find any single-word character. The grouping construct (?<char> ) encloses the metacharacter to force the regular expression engine to remember a subexpression match (which in this case will be any single character) and save it under the name "char". The backreference construct \k<char> causes the engine to compare the current character to the previously matched character stored under "char". The entire regular expression successfully finds a match wherever a single character is the same as the preceding character.
---
It can make life easier, as you only have to write a single expression and let the parser do the rest. It can also make life harder, as it is very hard to track what the actual expression is doing. Anyways, you could also easily do this with a few lines of code using a basic parser (like the one I posted code for).
Regarding using TRegExpr in your source, for articles, etc. I don't mind in the least how/where it is used, or even if any credit is given. I only posted it (in another q) because it is a nice example of a state machine that is:
1.) Pretty easy to read and follow
2.) Can easily be extended because of (1)
3.) Handles 90% of things that we most commonly use regular expression parsers for
4.) Is very fast because it is implemented as a non-lazy/non-greedy parser.
All (3) means is that the programmer has to do a little more manual parsing (like in the backreferencing example). This is not necessarily a negative, because in many cases its faster to parse a string, then parse the substring using a strict parser than it is to parse the string using a lazy parser. Or maybe it was because I was too lazy to implement the lazy routines? <g>....
Regarding Indy or WinInet; If it was me, I would probably use the greatest common denominator - which would be wininet, as long as it provided me with all the functionality I required. Its just a case where more delphi developers will have wininet vs indy. Plus, its a whole lot less baggage to carry around. Speaking only from my personal experience, I will go right down to the socket api level if I don't need the extra overhead.
-- please note, this is my opinion only, and I do not presume to speak for anyone else. --
But back to your problem; if you want to post the section of code that is causing you problems, then I could take a look at it for you and try and provide a good/solid solution to it.
Russell
Just as an update; not sure where you are with this....
Taking a look at my old code, I realized that I never really addressed sub-expressions that well. Didn't handle anything more that (xxx|xxx), and no post modifiers were allowed. Not very useful for complex parsing. Anyways, I spent some time updating the source to handle multiple nesting levels of ( ) expressions, and they also allow for post modifiers on them as well. Also, in order to provide something similar to the ".*" syntax (that I totally despise), the code now handles negated sub expressions. For example, to find "<tag>" + everything until a "</tag>" you could use the following:
// match-not match zero or more times-match
Pattern:='<tag>(!</tag>)*<
Makes parsing a little bit easier anyways. To me, it also makes a little more sense, because its explicitly stating "match this", then match while NOT that, and then match that.
Now taking the above VB example of
"(?<Key>href=[""'](?!#|htt
which is used to match only relative hrefs, we can now use TRegExpr with the following:
href=["'](!#|http|ftp|mail
I still don't handle backreferencing (not used here), or named expressions (eg Key, Value), but with the bead usage, splitting it is pretty simple, eg;
href=@1["'](!#|http|ftp|ma
When a match is found, the string starting from BeadPos[0] to BeadPos[1] is what will need to be replaced with your absolute href.
I'm also including some other regular expression examples, just so that you get a feel for it:
Email name regular expression :
([a-zA-Z0-9_\-\.]+)\@((\[[
Password of 4 to 15 chars, starts with a letter :
^[a-zA-Z][%a_]{3,14}$
Phone number in a wide variety of formats :
\(?[0-9]{3}\)?(\-|\s)?[0-9
IP Address matching :
%w(25[0-5]|2[0-4][0-9]|[01
Html tag matching :
<tag[^>]*>(!</tag>)*</tag>
Relative HREF parsing :
href=["'](!#|http|ftp|mail
And of course, the updated source. (please read the commented header, as it contains the latest changes and additions)
Regards,
Russell
----------
unit RegExpr;
//////////////////////////
//
// Unit : REGEXPR
// Description : Regular expression parser for Delphi
// Date : 10.14.2003
// 12.02.2004
//
// Description :
//
// ^ = Start anchor indicator (if used in first position)
// $ = End anchor indicator (must be final character)
// ( ) = Parse sub expression. No limitations to nest depth.
// (! = Negate the sub expression, eg (!subexpr), (!subexpr|subexp). For
// a match to occur, the sub expression CANNOT match the string, thus allowing
// the string to be incremented. This provides a feature similar to .* when
// parsing.
// | = OR the 2 sub expressions. Parens must be used to start the ORing
// of two or more sub expressions. For multiple ORing the following
// syntax can be used: (subexpr1|subexpr2|subexpr
// ^ = Must be used after [, negates a class
// . = Match any char (cannot be negated, and cannot have post modifiers)
// * = Post modifier, match zero or more occurrences of char/char class
// ? = Post modifier, match zero or one occurences of char/char class
// + = Post modifier, match one or more occurences of char/char class
// {} = Post modifier, allows for the patterns {m}, {m,} and {m, n} where
// {m} matches exactly m occurrences of the char/char class
// {m,} matches m occurences or more of the char/char class
// {m, n} matches m to n occurences of the char / char class
// [] = Char class, cannot contain ".", pre/post modifiers, or other meta
// characters except for "-".
// - = Used to define char ranges in char class
// @0..9 = Mark the current bead position (integer value). Useful for seperating
// sub parts of the expression
// \ = Escaped char - the following have special meaning
// \a = Bell
// \b = Backspace
// \e = Escape
// \f = Formfeed
// \n = Newline
// \r = Carriage return
// \t = Tab
// \v = Vertical tab
// \xHH = Hex code, must be 2 byte trailer in hex.
// All other escaped charactes are translated as literals. This allows
// the use of what would otherwise be meta characters. (ie: *, ., [, ...)
// % = Extended char class range. Must be trailed by one of the following. Uppercase
// character will negate the class range (eg %A is the same as [^%a])
// %a = 0-9, A-Z, a-z
// %c = A-Z, a-z
// %d = 0-9
// %h = 0-9, A-F, a-f
// %i = \_, \-, \$, 0-9, A-Z, a-z
// %n = 0-9
// %s = " ", \t
// %w = \n, \r, " ", \t
// Any other trailer will cause an expression error.
// x = All other characters are taken as literal values.
//
//////////////////////////
interface
{$Z4}
uses
Windows, SysUtils, Classes;
//////////////////////////
// Regular expression exception class
//////////////////////////
type
ERegExpr = class(Exception);
//////////////////////////
// Regular expression data types
//////////////////////////
type
PCharSet = ^TCharSet;
TCharSet = Set of Char;
//////////////////////////
// Regular expression instruction types
//////////////////////////
type
TRegType = (rtTerminal, rtBead, rtMark, rtMarkOr, rtMarkEnd, rtMarkNegEnd,
rtAnyChar, rtChar, rtClass, rtNegClass);
//////////////////////////
// Regular expression post modifier types
//////////////////////////
type
TRegPost = (rpNone, rpOne, rpZMore, rpZOne, rpOneMore, rpExact, rpCMore, rpCtoC);
//////////////////////////
// Regular expression state structure
//////////////////////////
type
PRegState = ^TRegState;
TRegState = packed record
regType: TRegType;
regPost: TRegPost;
dwMin: Integer;
dwMax: Integer;
case Integer of
0 : (lpClass: PCharSet);
1 : (cChar: Char);
2 : (dwJump: Integer);
end;
//////////////////////////
// Regular expression sub-expression state
//////////////////////////
type
PRegSubStack = ^TRegSubStack;
TRegSubStack = packed record
lpPrevious: PRegSubStack;
szCursor: PChar;
dwBranch: Integer;
dwMatched: Integer;
end;
//////////////////////////
// Regular expression constants
//////////////////////////
const
REG_LOOPING = (-1);
REG_NEGEXPR: Array [False..True] of TRegType = (rtMarkEnd, rtMarkNegEnd);
REG_NEGCLASS: Array [False..True] of TRegType = (rtClass, rtNegClass);
REG_TERMCHARS: Set of Char = [#0, '$', '(', ')', '|'];
REG_HEXCHARS: Set of Char = ['0'..'9', 'A'..'F', 'a'..'f'];
REG_METACHARS: Set of Char = ['$', '(', ')', '|', '^', '*', '?', '+', '-',
'@', '[', ']', '.', '{', '}', ',', '%', '!'];
REG_PARSEERRORS: Array [0..18] of PChar =
('Expression pattern has not been set', // 0
'Expression pattern cannot be blank', // 1
'Invalid pattern expression', // 2
'Unexpected end of expression at position (%d)', // 3
'Unexpected meta character at position (%d)', // 4
'Characters after end anchor at position (%d)', // 5
'Invalid special identifier at position (%d)', // 6
'Invalid empty set at position (%d)', // 7
'Min count must be less than max count (%d)', // 8
'Expected a closing brace at position (%d)', // 9
'Exact match count cannot be set to zero (%d)', // 10
'Expected a numeric at position (%d)', // 11
'Invalid numeric expression at position (%d)', // 12
'Unexpected terminator found in sub expression (%d)',// 13
'Expected a closing paren at position (%d)', // 14
'Bead markers must be bewteen 0 and 9', // 15
'Stack expression underflow', // 16
'Empty sub expressesion encountered (%d)', // 17
'Invalid hex char encounter (%d)' // 18
);
//////////////////////////
// Regular expression class
//////////////////////////
type
TRegExpr = class(TObject)
private
// Private declarations
FBeadPos: Array [0..9] of Integer;
FAncStart: Boolean;
FAncEnd: Boolean;
FMatchCase: Boolean;
FMatched: Boolean;
FMatchPos: Integer;
FMatchLen: Integer;
FTable: TList;
FError: PChar;
FPos: PChar;
FParsed: Boolean;
FPattern: PChar;
FSource: PChar;
FCursor: PChar;
FAnchor: PChar;
protected
// Protected declarations
function GetSource: PChar;
function GetPattern: PChar;
function GetPosition: Integer;
function GetBeadPos(Index: Integer): Integer;
procedure SetSource(Source: PChar);
procedure SetPattern(Pattern: PChar);
procedure SetMatchCase(CaseSensitive
procedure ClearStateTable;
function AllocStackEntry(Stack: PRegSubStack; Jump: Integer; Cursor: PChar): PRegSubStack;
procedure FreeStack(Stack: PRegSubStack);
function PopStackEntry(Stack: PRegSubStack): PRegSubStack;
function AllocState(StateType: TRegType): PRegState;
procedure FreeState(State: PRegState);
procedure ParseSubExpression;
procedure ParseParens;
procedure ParseExpression;
procedure ParseAnyChar;
procedure ParseCharClass;
procedure ParseChar;
procedure ParseCharRange(CClass: PCharSet; var Negated: Boolean);
procedure ParseSpecialRange(CClass: PCharSet; var Negated: Boolean);
function ParseCCChar: Char;
function ParseEscChar: Char;
function ParseNum: Integer;
procedure ParsePostModifier(State: PRegState);
procedure ResetMatched;
procedure ResetAll;
function Match(Start: Boolean): Boolean;
function MatchSubString: Boolean;
function ScanLineStart: Boolean;
procedure RaiseError(Error: Integer; InsertPos: Boolean = False);
public
// Public declarations
constructor Create;
constructor CreatePattern(Pattern: PChar);
destructor Destroy; override;
function Test(Pattern: PChar): Boolean;
function MatchFirst: Boolean;
function MatchNext: Boolean;
property BeadPos[Index: Integer]: Integer read GetBeadPos;
property CaseSensitive: Boolean read FMatchCase write SetMatchCase;
property MatchPos: Integer read FMatchPos;
property MatchLen: Integer read FMatchLen;
property Matched: Boolean read FMatched;
property Pattern: PChar read GetPattern write SetPattern;
property Position: Integer read GetPosition;
property Source: PChar read GetSource write SetSource;
end;
implementation
//////////////////////////
// TRegExpr
//////////////////////////
procedure TRegExpr.RaiseError(Error:
begin
// Check to see if we should be formatting this
if InsertPos then
// Formatted
StrFmt(FError, REG_PARSEERRORS[Error], [Succ(FPos-FPattern)])
else
// Straight copy
StrCopy(FError, REG_PARSEERRORS[Error]);
// Raise the exception with the detailed information
raise ERegExpr.Create(FError);
end;
function TRegExpr.GetBeadPos(Index:
begin
// Check index range for bead position
if not(Index in [0..9]) then RaiseError(15);
// Return the bead marker
result:=FBeadPos[Index];
end;
function TRegExpr.Test(Pattern: PChar): Boolean;
begin
// Set the pattern to test
SetPattern(Pattern);
// Call MatchFirst and return results
result:=MatchFirst;
end;
function TRegExpr.GetPosition: Integer;
begin
// Return the current anchor position within the source text
result:=FAnchor-FSource;
end;
function TRegExpr.MatchNext: Boolean;
begin
// Reset matched state
ResetMatched;
// Call Match
result:=Match(False);
end;
function TRegExpr.MatchFirst: Boolean;
begin
// Reset matched state and source starting position
ResetAll;
// Call Match
result:=Match(True);
end;
function TRegExpr.Match(Start: Boolean): Boolean;
begin
// Set default result
result:=False;
// Check to make sure the expression was parsed
if FParsed then
begin
// Check anchor start
if FAncStart then
begin
// If not the first time then scan to a new line
if not(Start) then ScanLineStart;
// Keep looping until match, or cursor is at the end
while (FCursor^ > #0) do
begin
// Search for the pattern
result:=MatchSubString and ((FCursor-FAnchor) > 0);
if result then break;
// Increment the anchor
Inc(FAnchor);
// Scan to the next line
if not(ScanLineStart) then break;
end;
end
else
begin
// Keep looping until match, or cursor is at the end
while (FCursor^ > #0) do
begin
// Search for the pattern
result:=MatchSubString and ((FCursor-FAnchor) > 0);
if result then break;
// Increment the anchor
Inc(FAnchor);
// Set the cursor to the anchor
FCursor:=FAnchor;
end;
end;
// If we did not succeed then we must perform a full reset
if result then
begin
// Set match pos and length
FMatched:=True;
FMatchPos:=FAnchor-FSource
FMatchLen:=FCursor-FAnchor
// If match length is zero, which can happen when matching zero times, then
// we may run into cases where the cursor does not get incremented. We need
// to increment for those cases
if ((FMatchLen = 0) and (FCursor^ > #0)) then Inc(FCursor);
// Set the next starting search point
FAnchor:=FCursor;
end
else
ResetAll;
end
else
// Not parsed
RaiseError(0);
end;
function TRegExpr.MatchSubString: Boolean;
var prssStack: PRegSubStack;
prsNode: PRegState;
ppaTable: PPointerList;
dwCharLoop: Integer;
dwIndex: Integer;
bMatch: Boolean;
cChar: Char;
begin
// Set default result (optimisitic)
result:=True;
// Set node pointer and stack pointer
prssStack:=nil;
prsNode:=nil;
// Map in the table as a straight pointer array
ppaTable:=FTable.List;
// Clear bead positions (and conveniently set dwIndex = zero)
dwIndex:=10;
repeat
Dec(dwIndex);
FBeadPos[dwIndex]:=0;
until (dwIndex = 0);
// Resource protection
try
// Loop until we are done
while True do
begin
// Get the state record and make sure the index is incremented
prsNode:=ppaTable^[dwIndex
Inc(dwIndex);
// Check for terminating state or expression break
case prsNode^.regType of
// End of compiled expression
rtTerminal : break;
// Bead handling
rtBead : FBeadPos[prsNode^.dwJump]:
// Sub expression start
rtMark :
begin
// Check to see if we are in a loop
if ((prssStack = nil) or (prssStack^.dwBranch > REG_LOOPING)) then
// Allocate a new stack entry
prssStack:=AllocStackEntry
else
// Update the branch link in the stack record
prssStack^.dwBranch:=prsNo
end;
// OR marker
rtMarkOr :
begin
// Get stack entry
if Assigned(prssStack) then
begin
// Check current result
if result then
// Jump out of here
dwIndex:=prsNode^.dwJump
else
begin
// Reset the cursor and set new jump point
FCursor:=prssStack^.szCurs
prssStack^.dwBranch:=prsNo
// Reset the result
result:=True;
end;
end
else
result:=False;
end;
// Sub expression end
rtMarkEnd,
rtMarkNegEnd:
begin
// Get stack entry
if Assigned(prssStack) then
begin
// Check for negation
if (prsNode^.regType = rtMarkNegEnd) then
begin
// Flip result
result:=not(result);
// Success?
if result then
begin
// Reset the cursor
FCursor:=prssStack^.szCurs
// Push next if anything is left
if (FCursor^ > #0) then
// Next char
Inc(FCursor)
else
// Failure
result:=False;
end
end;
// Post operator can be handled by min and max match values
if result then
begin
// Increment the stack match counter
Inc(prssStack^.dwMatched);
// Do we need to keep looping?
if (prssStack^.dwMatched < prsNode^.dwMax) then
begin
// Set stack cursor
prssStack^.szCursor:=FCurs
// Setup for next loop
prssStack^.dwBranch:=REG_L
// Jump back
dwIndex:=prsNode^.dwJump;
end
else
// Free the stack entry
prssStack:=PopStackEntry(p
end
else
begin
// Did we match at least the minimum number of times?
result:=(prssStack^.dwMatc
// Reset the cursor
FCursor:=prssStack^.szCurs
// Free the stack entry
prssStack:=PopStackEntry(p
end;
end;
// Check result and stack level, must still have string to process as well
if (result or Assigned(prssStack)) then Continue;
// Failure and stack is empty, or not string left
Break;
end;
else
// Simple char state loop counter to handle the multiple matching
dwCharLoop:=0;
while True do
begin
// Set default match
bMatch:=False;
// Perform the character matching (if we have string)
if (FCursor^ > #0) then
begin
// Check case sensitive
if FMatchCase then
cChar:=FCursor^
else
cChar:=UpCase(FCursor^);
// Check the compare type
case prsNode^.regType of
rtAnyChar : bMatch:=not(cChar = #10);
rtChar : bMatch:=(cChar = prsNode^.cChar);
rtClass : bMatch:=(cChar in prsNode^.lpClass^);
rtNegClass : bMatch:=not(cChar in prsNode^.lpClass^);
end;
// Increment the cursor position
Inc(FCursor, Ord(bMatch));
end;
// Handle the post operator
if bMatch then
begin
// Increment the match counter
Inc(dwCharLoop);
// Do we need to keep matching?
if (dwCharLoop >= prsNode^.dwMax) then
begin
// Finished matching
result:=True;
break;
end;
end
else
begin
// Did we match the minimum required?
result:=(dwCharLoop >= prsNode^.dwMin);
// Done matching
break;
end;
end;
// Check for failure
if not(result) then
begin
// If no jump is set then we fail
if Assigned(prssStack) then
// Update the current index with the branch jump
dwIndex:=prssStack^.dwBran
else
// Failure
break;
end;
end;
end;
// Success, we still need to check against the terminal state in case we
// ran out of string to process
if result and Assigned(prsNode) then
begin
// Make sure we processed the terminal state
result:=(prsNode^.regType = rtTerminal);
// Still need to check against an end anchor if one was specified
if (result and FAncEnd) then result:=(FCursor^ in [#0, #13, #10]);
end
else
result:=False;
finally
// Free the stack
FreeStack(prssStack);
end;
end;
function TRegExpr.ScanLineStart: Boolean;
begin
// Set default result
result:=False;
// Need to position the cursor right after a line break
while not(FCursor^ in [#0, #10]) do Inc(FCursor);
// If we are at a null then there is no next line so return false
if (FCursor^ > #0) then
begin
// Increment past the line feed
Inc(FCursor);
// Reset the anchor to match the cursor
FAnchor:=FCursor;
// Success
result:=True;
end;
end;
procedure TRegExpr.ParseSubExpressio
var prsNode: PRegState;
bNegated: Boolean;
begin
// Parse the sub expression until we hit a terminating char
while not(FPos^ in REG_TERMCHARS) do
begin
// Handle the character
case FPos^ of
// Any character
'.' : ParseAnyChar;
// Character class
'[' : ParseCharClass;
// Parse a bead position
'@' :
begin
// Skip next
Inc(FPos);
// Bead must be bewteen 0..9
if not(FPos^ in ['0'..'9']) then RaiseError(15);
// Allocate a new state
prsNode:=AllocState(rtBead
prsNode^.dwJump:=Ord(FPos^
FTable.Add(prsNode);
// Skip next
Inc(FPos);
end;
// Parse an extended char class
'%' :
begin
// Resource protection
prsNode:=nil;
try
// Allocate a new state
prsNode:=AllocState(rtClas
prsNode^.lpClass:=AllocMem
// Parse the range and post modifier
ParseSpecialRange(prsNode^
if bNegated then prsNode^.regType:=rtNegCla
ParsePostModifier(prsNode)
// Add to the state table
FTable.Add(prsNode);
except
// Free the state
FreeState(prsNode);
// Re-raise the exception
raise;
end;
end;
else
// It is a failure to be sitting at a meta char at this point
if (FPos^ in REG_METACHARS) then
RaiseError(4, True)
else
// Parse a literal or escaped character
ParseChar;
end;
end;
end;
procedure TRegExpr.ParseParens;
var prsMark: PRegState;
prsMarkOr: PRegState;
prsMarkEnd: PRegState;
bNegate: Boolean;
dwHead: Integer;
dwMark: Integer;
begin
// Skip starting paren
Inc(FPos);
// Add the mark record (hold pointer so we can fill the jump). Also maintain the
// head so we can use post modifiers against the sub expression
prsMark:=AllocState(rtMark
dwHead:=FTable.Add(prsMark
dwMark:=dwHead;
// Can' be sitting at an empty expression
if (FPos^ in [')', '|']) then RaiseError(17, True);
// Is this a negated sub expression?
bNegate:=(FPos^ = '!');
if bNegate then Inc(FPos);
// Parse the sub expression
ParseSubExpression;
// We need to be be sitting at one of the following
// "(" = Start new sub expression
// "|" = Or the expression with a secondary sub expression
// ")" = End the current sub expression
// Better be sitting at a sub expression delimiter
if not(FPos^ in ['(', '|', ')']) then RaiseError(13, True);
// Keep looping while one of the following is encountered "(", "|"
while (FPos^ in ['(', '|']) do
begin
// Handle the char
case FPos^ of
// Recurse and parse next set of parens
'(' : ParseParens;
// Parse "or" statement
'|' :
begin
Inc(FPos);
// Add the mark OR
prsMarkOr:=AllocState(rtMa
// Add to the table and set the mark jump point
prsMark^.dwJump:=FTable.Ad
// Switch
prsMark:=prsMarkOr;
// Parse the second part of the expression. Handle cases where the
// secondary expression starts with a "("
if (FPos^ = '(') then
// Parse paren statement
ParseParens
else
// Parse the actual sub expression
ParseSubExpression;
end;
else
// Sanity check
RaiseError(13, True);
end;
end;
// End the sub expression
if (FPos^ <> ')') then RaiseError(14, True);
Inc(FPos);
// Allocate the mark end and set post modifier loop jump. If a post modifier is
// encountered, then we will jump back to the start of the sub expression after the
// "(".
prsMarkEnd:=AllocState(REG
prsMarkEnd^.dwJump:=dwMark
// Parse any post modifiers
ParsePostModifier(prsMarkE
// Add the end and update the last jump
prsMark^.dwJump:=FTable.Ad
end;
procedure TRegExpr.ParseExpression;
var prsNode: PRegState;
begin
// Clear the current state table and reparse
ClearStateTable;
// Clear match settings
ResetMatched;
// Reset expression parser anchors
FAncStart:=False;
FAncEnd:=False;
// Set parser variable
FPos:=FPattern;
// Check for null pattern string
if (FPos^ = #0) then RaiseError(1);
// Check for starting anchor
if (FPos^ = '^') then
begin
// Set anchor start flag
FAncStart:=True;
Inc(FPos);
// Check for null
if (FPos^ = #0) then RaiseError(2);
end
else
// Check for just an ending anchor
if (FPos^ = '$') then RaiseError(2);
// Catch exceptions so we can clear the state table on failure
try
// Parse until we hit an ending char
while not(FPos^ in [#0, '$']) do
begin
case FPos^ of
'(' : ParseParens;
')' : RaiseError(4, True);
'|' : RaiseError(4, True);
else
ParseSubExpression;
end;
end;
// Check for tail end anchor
if (FPos^ = '$') then
begin
FAncEnd:=True;
// Push next
Inc(FPos);
// We must be at the null
if (FPos^ > #0) then RaiseError(5, True);
end;
// Add the terminal state
prsNode:=AllocState(rtTerm
FTable.Add(prsNode);
except
// Clear the anchor flags
FAncStart:=False;
FAncEnd:=False;
// Clear the state table
ClearStateTable;
// Re-raise the exception again
raise;
end;
end;
procedure TRegExpr.ParseChar;
var prsNode: PRegState;
cChar: Char;
begin
// Is the pattern null
if (FPos^ = #0) then RaiseError(3, True);
// Check for invalid meta characters
if (FPos^ in REG_METACHARS) then RaiseError(4, True);
// Check for escaped character
if (FPos^ = '\') then
cChar:=ParseEscChar
else
// Literal character
cChar:=FPos^;
// Push next
Inc(FPos);
// Allocate a new state
prsNode:=AllocState(rtChar
if not(FMatchCase) then
prsNode^.cChar:=UpCase(cCh
else
prsNode^.cChar:=cChar;
try
// Parse out any post modifier
ParsePostModifier(prsNode)
except
// Chance to free the memory
FreeState(prsNode);
// Re-raise the exception
raise;
end;
// Add to state table
FTable.Add(prsNode);
end;
function TRegExpr.ParseEscChar: Char;
var lpszHex: PCHar;
begin
// Push next
Inc(FPos);
// Check the next character
case FPos^ of
#0 :
begin
result:=#0;
RaiseError(3, True);
end;
'a' : result:=#7;
'b' : result:=#8;
'e' : result:=#27;
'f' : result:=#12;
'n' : result:=#10;
'r' : result:=#13;
't' : result:=#9;
's' : result:=#32;
'v' : result:=#11;
'x' :
begin
Inc(FPos);
lpszHex:=FPos;
if not(FPos^ in REG_HEXCHARS) then RaiseError(18, True);
Inc(FPos);
if not(FPos^ in REG_HEXCHARS) then RaiseError(18, True);
result:=Chr(StrToInt('$'+l
end;
else
// Escape as is
result:=FPos^;
end;
end;
function TRegExpr.ParseCCChar: Char;
begin
// Is the pattern null
if (FPos^ = #0) then RaiseError(3, True);
// Check for invalid meta characters
if (FPos^ in REG_METACHARS) then RaiseError(4, True);
// Check for escaped character
if (FPos^ = '\') then
result:=ParseEscChar
else
// Literal character
result:=FPos^;
// Push next
Inc(FPos);
end;
procedure TRegExpr.ParseSpecialRange
var cChar: Char;
begin
// Push next
Inc(FPos);
// Is the pattern null
if (FPos^ = #0) then RaiseError(3, True);
// Check for invalid meta characters
if (FPos^ in REG_METACHARS) then RaiseError(4, True);
// Reverse the negation passed in if the char is upper case. This will correctly
// handle cases where ^%D is specified, which equates to NOT NOT CHAR CLASS
if (Ord(FPos^) < 97) then Negated:=not(Negated);
// Handle the special char range
case FPos^ of
// Alpha, numeric
'A',
'a' :
begin
for cChar:='0' to '9' do Include(CClass^, cChar);
for cChar:='A' to 'Z' do Include(CClass^, cChar);
for cChar:='a' to 'z' do Include(CClass^, cChar);
end;
// Alpha
'C',
'c' :
begin
for cChar:='A' to 'Z' do Include(CClass^, cChar);
for cChar:='a' to 'z' do Include(CClass^, cChar);
end;
// Numeric
'D',
'd' : for cChar:='0' to '9' do Include(CClass^, cChar);
// Hexadecimal
'H',
'h' :
begin
for cChar:='0' to '9' do Include(CClass^, cChar);
for cChar:='A' to 'F' do Include(CClass^, cChar);
for cChar:='a' to 'f' do Include(CClass^, cChar);
end;
// Alpha, numeric, underscore, dash, dollar
'I',
'i' :
begin
Include(CClass^, '_');
Include(CClass^, '-');
Include(CClass^, '$');
for cChar:='0' to '9' do Include(CClass^, cChar);
for cChar:='A' to 'Z' do Include(CClass^, cChar);
for cChar:='a' to 'z' do Include(CClass^, cChar);
end;
// White space but not line breaks
'S',
's' :
begin
Include(CClass^, #9);
Include(CClass^, #32);
end;
// All white space
'W',
'w' :
begin
Include(CClass^, #9);
Include(CClass^, #10);
Include(CClass^, #13);
Include(CClass^, #32);
end;
else
// Invalid identifier
RaiseError(6, True);
end;
// Push next
Inc(FPos);
end;
procedure TRegExpr.ParseCharRange(CC
var cStart: Char;
cEnd: Char;
cSwap: Char;
begin
// Check for special char range
if (FPos^ = '%') then
// Parse special character set
ParseSpecialRange(CClass, Negated)
else
begin
// Parse a single character
cStart:=ParseCCChar;
// Check for character range
if (FPos^ = '-') then
begin
// Range of characters
Inc(FPos);
// Parse the next character
cEnd:=ParseCCChar;
// Build the range as a character set
if (cStart > cEnd) then
begin
// Exchange first and last
cSwap:=cStart;
cStart:=cEnd;
cEnd:=cSwap;
end;
// Include the characters in the character set
for cSwap:=cStart to cEnd do
begin
Include(CClass^, cSwap);
if not(FMatchCase) then Include(CClass^, UpCase(cSwap));
end;
end
else
begin
// Single character
if FMatchCase then
Include(CClass^, cStart)
else
Include(CClass^, UpCase(cStart));
end;
end;
end;
procedure TRegExpr.ParseCharClass;
var prsNode: PRegState;
bNegate: Boolean;
bNewNegate: Boolean;
begin
// Push next
Inc(FPos);
// Is this a negated class?
bNegate:=(FPos^ = '^');
if bNegate then Inc(FPos);
bNewNegate:=bNegate;
// Check for an immediate ], which means an empty set (which is not allowed)
if (FPos^ = ']') then RaiseError(7, True);
// Allocate a state record and char class
prsNode:=AllocState(REG_NE
prsNode^.lpClass:=AllocMem
// Start parsing the character ranges until we hit the closing bracket
try
// Parse char range
while not(FPos^ in [#0, ']']) do ParseCharRange(prsNode^.lp
// Check for change in negation
if (bNewNegate <> bNegate) then prsNode^.regType:=REG_NEGC
// We must have the ending bracket
if (FPos^ = #0) then RaiseError(3, True);
// Push next
Inc(FPos);
// Parse out any post modifier
ParsePostModifier(prsNode)
except
// Gives us the chance to clean up
FreeState(prsNode);
// Re-raise the exception
raise;
end;
// Add to state table
FTable.Add(prsNode);
end;
procedure TRegExpr.ParseAnyChar;
var prsNode: PRegState;
begin
// Push next
Inc(FPos);
// Add the new state record
prsNode:=AllocState(rtAnyC
FTable.Add(prsNode);
end;
procedure TRegExpr.ParsePostModifier
begin
// Set default post modifier
State^.regPost:=rpOne;
State^.dwMin:=1;
State^.dwMax:=1;
// If not a post modifier then just exit
if not(FPos^ in ['*', '?', '+', '{']) then exit;
// Set the post modifier
case FPos^ of
'*' :
begin
State^.regPost:=rpZMore;
State^.dwMin:=0;
State^.dwMax:=MaxInt;
end;
'?' :
begin
State^.regPost:=rpZOne;
State^.dwMin:=0;
end;
'+' :
begin
State^.regPost:=rpOneMore;
State^.dwMax:=MaxInt;
end;
'{' : State^.regPost:=rpExact;
end;
// Push next
Inc(FPos);
// If *, ?, or + then we are done
if (State^.regPost <> rpExact) then exit;
// Get the first number
State^.dwMin:=ParseNum;
// Check for end of expression
if (FPos^ = #0) then RaiseError(3, True);
// Next character needs to be either a , or }
case FPos^ of
',' :
begin
// Push next
Inc(FPos);
// Check for end of expression
if (FPos^ = #0) then RaiseError(3, True);
// We may have either a closing } or a numeric
if (FPos^ = '}') then
State^.regPost:=rpCMore
else
begin
// Going to be a min to max range
State^.regPost:=rpCtoC;
// Get max range count
State^.dwMax:=ParseNum;
// Check for end of expression
if (FPos^ = #0) then RaiseError(3, True);
// Min has to be less than Max
if (State^.dwMin >= State^.dwMax) then RaiseError(8, True);
// Check for closing }
if (FPos^ <> '}') then RaiseError(9, True);
end;
end;
'}' : if (State^.dwMin = 0) then RaiseError(10, True);
else
RaiseError(9, True);
end;
// Update the min and max values (Count to count is already set)
case State^.regPost of
// Exactly
rpExact : State^.dwMax:=State^.dwMin
// Count or more
rpCMore : State^.dwMax:=MaxInt;
end;
// Push next
Inc(FPos);
end;
function TRegExpr.ParseNum: Integer;
var lpszHold: PChar;
cMark: Char;
dwErr: Integer;
begin
// Skip white space
if (FPos^ in [#9, #32]) then Inc(FPos);
// Save the current position
lpszHold:=FPos;
// Better be a number waiting
if not(FPos^ in ['0'..'9']) then RaiseError(11, True);
// Keep incrementing while numeric
while (FPos^ in ['0'..'9']) do Inc(FPos);
// Terminate and convert to number
cMark:=FPos^;
FPos^:=#0;
Val(lpszHold, result, dwErr);
FPos^:=cMark;
// Check for error
if (dwErr > 0) then
begin
FPos:=lpszHold;
Inc(FPos, dwErr);
RaiseError(12, True);
end;
end;
procedure TRegExpr.ResetAll;
begin
// Reset scanning positions
FCursor:=FSource;
FAnchor:=FSource;
// Reset matched state
FMatched:=False;
FMatchPos:=0;
FMatchLen:=0;
end;
procedure TRegExpr.ResetMatched;
begin
// Set anchor and cursor to equal source
FMatched:=False;
FMatchPos:=0;
FMatchLen:=0;
end;
procedure TRegExpr.SetMatchCase(Case
begin
// Check current setting
if (CaseSensitive <> FMatchCase) then
begin
// Set new case sensitive setting
FMatchCase:=CaseSensitive;
// Reparse if we were in a parsed state
if FParsed then ParseExpression;
end;
end;
function TRegExpr.GetSource: PChar;
begin
// Return the source
result:=FSource;
end;
function TRegExpr.GetPattern: PChar;
begin
// Return the pattern
result:=FPattern;
end;
procedure TRegExpr.SetSource(Source:
begin
// Assign the new source
FreeMem(FSource);
FSource:=AllocMem(Succ(Str
StrCopy(FSource, Source);
// Source has changed, update the new state
ResetAll;
end;
procedure TRegExpr.SetPattern(Patter
begin
// Assign the new pattern
FreeMem(FPattern);
FPattern:=AllocMem(Succ(St
StrCopy(FPattern, Pattern);
// Reset parsed state
FParsed:=False;
// Parse the expression
ParseExpression;
// Set parsed state to true
FParsed:=True;
end;
procedure TRegExpr.FreeState(State: PRegState);
begin
// Is the state record assigned?
if Assigned(State) then
begin
// If char class is allocated then we need to free the memory for the char set
if (State^.regType in [rtClass, rtNegClass]) then FreeMem(State^.lpClass);
// Free the state record
FreeMem(State);
end;
end;
procedure TRegExpr.ClearStateTable;
var dwIndex: Integer;
begin
// Clear all the NFA state records
for dwIndex:=Pred(FTable.Count
// Clear the table
FTable.Clear;
end;
procedure TRegExpr.FreeStack(Stack: PRegSubStack);
var lpStack: PRegSubStack;
begin
// Get current stack entry
lpStack:=Stack;
// Free all stack entries
while Assigned(lpStack) do
begin
// Free entry and get previous entry
lpStack:=PopStackEntry(lpS
end;
end;
function TRegExpr.PopStackEntry(Sta
begin
// Check for underflow
if Assigned(Stack) then
begin
// Set result (which is prior stack entry)
result:=Stack^.lpPrevious;
// Free memory for this entry
FreeMem(Stack);
end
else
begin
// Underflow
RaiseError(16);
// Pacify the compiler
result:=nil;
end;
end;
function TRegExpr.AllocStackEntry(S
begin
// Allocate memory for stack entry
result:=AllocMem(SizeOf(TR
// Set defaults
with result^ do
begin
lpPrevious:=Stack;
dwBranch:=Jump;
szCursor:=Cursor;
end;
end;
function TRegExpr.AllocState(StateT
begin
// Allocate a new regx state record
result:=AllocMem(SizeOf(TR
// Set the type
result^.regType:=StateType
result^.dwMin:=1;
result^.dwMax:=1;
end;
constructor TRegExpr.CreatePattern(Pat
begin
// Perform inherited for TRegExpr
Create;
// Set the pattern
SetPattern(Pattern);
end;
constructor TRegExpr.Create;
begin
// Perform inherited
inherited Create;
// Initialization
FPattern:=AllocMem(1);
FSource:=AllocMem(1);
FError:=AllocMem(256);
FPos:=FPattern;
FAncStart:=False;
FAncEnd:=False;
FMatchCase:=False;
FTable:=TList.Create;
FParsed:=False;
// Reset to default (unmatched) state
ResetAll;
end;
destructor TRegExpr.Destroy;
begin
// Clear the state records
ClearStateTable;
// Free the state table list
FTable.Free;
// Free all allocated memory
FreeMem(FPattern);
FreeMem(FSource);
FreeMem(FError);
// Perform inherited
inherited Destroy;
end;
end.
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
While I have not really taken a look at your solution, I will presume it has been tested because of your reputation.
I will begin retrofitting the MHTBuilder class I'm working on and will note you in the article I plan to write for
BDN.
Much appreciated...
I will begin retrofitting the MHTBuilder class I'm working on and will note you in the article I plan to write for
BDN.
Much appreciated...
No problem... and you can be assured that I have tested the regular expression parser VERY well (lots of little bugs that used to be there and no more).
Regarding the relative->absolute url conversion, I would suggest using something faster than Delphi's StringReplace function (as it is SLOW compared to others, like FastString's). Irregardless of that small note, the code example is clean and does "exactly" what the MHT builder code does. All you need to do is pass the text and the absolute url, and let it do the rest.
Regards,
Russell
As a final sidenote to this...
I plan on implementing a few more code routines in the parser class (this was on my list of things to do), and one of those is string replacing. This will offer the benefit of a high speed string replacement routine because the match pos/length, as well as the replacement length are known during the match phase. The replacement text can also be composed of tags %0..%9, which will act as format strings to the backreference tags. So, for example, the code I provided for you will actually turn into:
New expression patterns
-------------------------- ---
((:1href)=["'']!(\#|http|f tp|mailto| javascript )\/?(:2[^" '']*)["''] )
((:1src)=["'']!(http)\/?(: 2[^"'']*)[ "''])
Replacement Text
-------------------------- ---
'%1="'+AbsoluteURL+'%2"';
procedure ConvertHtml(Source: TStrings; AbsoluteURL: String);
var regParser: TRegExpr;
dwIndex: Integer;
const
Patterns: Array [0..1] of PChar =
(
'((:1href)=["'']!(\#|http| ftp|mailto |javascrip t)\/?(:2[^ "'']*)["'' ])',
'((:1src)=["'']!(http)\/?( :2[^"'']*) ["''])'
);
begin
// Fix the absolute url so it ends with a trailing slash
dwIndex:=Length(AbsoluteUR L);
if (dwIndex > 0) and (AbsoluteURL[dwIndex] <> '/') then AbsoluteURL:=AbsoluteURL+' /';
// Make sure source is assigned and is not empty
if Assigned(Source) and (Source.Count > 0) then
begin
// Create the parser
regParser:=TRegExpr.Create ;
// Resource protection
try
// Set source text to perform matching on
regParser.Source:=Source.G etText;
// Iterate both patterns
for dwIndex:=Low(Patterns) to High(Patterns) do
begin
// Set the pattern
regParser.Pattern:=Pattern s[dwIndex] ;
// Replace
regParser.Replace('%1="'+A bsoluteURL +'%2"');
end;
// Update the source text
Source.SetText(regParser.S ource);
finally
// Free all local objects
FreeAndNil(regParser);
end;
end;
end;
Which is obviously much shoter and more compact than the original code example. If you are interested in getting the final class code , then let me know and I will post it when finished.
Regards,
Russell
I plan on implementing a few more code routines in the parser class (this was on my list of things to do), and one of those is string replacing. This will offer the benefit of a high speed string replacement routine because the match pos/length, as well as the replacement length are known during the match phase. The replacement text can also be composed of tags %0..%9, which will act as format strings to the backreference tags. So, for example, the code I provided for you will actually turn into:
New expression patterns
--------------------------
((:1href)=["'']!(\#|http|f
((:1src)=["'']!(http)\/?(:
Replacement Text
--------------------------
'%1="'+AbsoluteURL+'%2"';
procedure ConvertHtml(Source: TStrings; AbsoluteURL: String);
var regParser: TRegExpr;
dwIndex: Integer;
const
Patterns: Array [0..1] of PChar =
(
'((:1href)=["'']!(\#|http|
'((:1src)=["'']!(http)\/?(
);
begin
// Fix the absolute url so it ends with a trailing slash
dwIndex:=Length(AbsoluteUR
if (dwIndex > 0) and (AbsoluteURL[dwIndex] <> '/') then AbsoluteURL:=AbsoluteURL+'
// Make sure source is assigned and is not empty
if Assigned(Source) and (Source.Count > 0) then
begin
// Create the parser
regParser:=TRegExpr.Create
// Resource protection
try
// Set source text to perform matching on
regParser.Source:=Source.G
// Iterate both patterns
for dwIndex:=Low(Patterns) to High(Patterns) do
begin
// Set the pattern
regParser.Pattern:=Pattern
// Replace
regParser.Replace('%1="'+A
end;
// Update the source text
Source.SetText(regParser.S
finally
// Free all local objects
FreeAndNil(regParser);
end;
end;
end;
Which is obviously much shoter and more compact than the original code example. If you are interested in getting the final class code , then let me know and I will post it when finished.
Regards,
Russell
ASKER
Russell,
Having problems with this one:
' remove the <base href=''> tag if present; causes problems when viewing locally.
Regex.Replace(strHtml, "<base[^>]*?>", "")
Is this right?
LRegParser := TRLRegExpr.Create;
try
LRegParser.Source := LHTMLStrings.Text; // StringList with HTML
LRegParser.Pattern := '<base[^>]*?>';
LRegParser.Replace(PChar(' %1%2'), True);
finally
LRegParser.Free;
end;
These are also giving me fits
' -- StripTagFromHtml("iframe", strHtml) removes all iframe tags.
Private Function StripTagFromHtml(ByVal strTag As String, ByVal strHtml As String) As String
Dim reg As Regex = New Regex( _
String.Format("<{0}[^>]*?> [\w|\t|\r| \W]*?</{0} >", strTag), _
RegexOptions.Multiline _
Or RegexOptions.IgnoreCase _
Or RegexOptions.Compiled)
Return reg.Replace(strHtml, "")
End Function
'-- get rid of all HTML tags
strFileContents = Regex.Replace(strFileConte nts, "<\w+(\s+[A-Za-z0-9_\-]+\s *=\s*(""([ ^""]*)""|' ([^']*)')) *\s*(/)*>| <[^>]+>", " ", RegexOptions.Compiled)
Having problems with this one:
' remove the <base href=''> tag if present; causes problems when viewing locally.
Regex.Replace(strHtml, "<base[^>]*?>", "")
Is this right?
LRegParser := TRLRegExpr.Create;
try
LRegParser.Source := LHTMLStrings.Text; // StringList with HTML
LRegParser.Pattern := '<base[^>]*?>';
LRegParser.Replace(PChar('
finally
LRegParser.Free;
end;
These are also giving me fits
' -- StripTagFromHtml("iframe",
Private Function StripTagFromHtml(ByVal strTag As String, ByVal strHtml As String) As String
Dim reg As Regex = New Regex( _
String.Format("<{0}[^>]*?>
RegexOptions.Multiline _
Or RegexOptions.IgnoreCase _
Or RegexOptions.Compiled)
Return reg.Replace(strHtml, "")
End Function
'-- get rid of all HTML tags
strFileContents = Regex.Replace(strFileConte
Little tied up at the moment, so I only have time to fully address part of this (I will address the remainder in the morning).
1.) The only tricky thing in the first example is the *? notation, which turns greediness into laziness, which is what my parser does anyways. So:
Regex.Replace(strHtml, "<base[^>]*?>", "")
becomes
LRegParser := TRLRegExpr.Create;
try
LRegParser.Source := PChar(LHTMLStrings.Text); // StringList with HTML
LRegParser.Pattern := '<base[^>]*>';
LRegParser.Replace('', True); // This will replace the match with blank
finally
LRegParser.Free;
end;
If you needed to use part of the match in the replacement, THEN you would use the %0..%9 tags. Eg:
Pattern:='<(:1\w+)(:2[^>]*
Replace('{%1%2}', True);
Which would match any tag in the format of "<[word chars][any chars]>" and make the following backreference
:1 = [word chars]
:2 = [any chars]
And then replace the whole match with
{[word chars][any chars]}
As a side note, I have been working on improvements to the parser that eek another 10-15% out of performance speed, and have also added named backreference matching as well (beyond the the :0 to :9 numbered matching) eg;
'<(:<TagName>\w+)(:<TagExt
I will send you a copy when changes are done, if you are interested. (another week or so). It will also include a text file with examples to go from. But.... back to the issues at hand <g>
2.) The second item does nothing more than match (you pass the format string) the
<%s[any chars]>[any chars]</%s>
So, to do this only requires the following:
function StripTagFromHtml(TagName, HtmlString: String): String;
begin
Parser.Source:=PChar(HtmlS
Parser.Pattern:=PChar(Form
Parser.Replace('', False);
result:=Parser.Source;
end;
and call it with "iframe" as the TagName param. The pattern does nothing more than match the start tag, matches everything until the end tag, then matches the end tag.
Im out of time, but a quick view of the last problem doesnt appear to be anything major. I will give you the pattern in the morning.
Regards,
Russell
The final question:
'-- get rid of all HTML tags
strFileContents = Regex.Replace(strFileConte nts, "<\w+(\s+[A-Za-z0-9_\-]+\s *=\s*(""([ ^""]*)""|' ([^']*)')) *\s*(/)*>| <[^>]+>", " ", RegexOptions.Compiled)
deals with removing ALL html tags and inserting a " ", which would leave only the text of the web page. I have enhanced the version above so it also catches the <!-- {text} -->, which tends to have other tags (< and >) embedded within it.
Parser.Pattern:='(<\w+(\s+ [A-Za-z0-9 _\-]+\s*=\ s*("([^"]* )"|''([^'' ]*)''))*\s *(/)*>|<\! \-\-!(\-\- >)*\-\->|< [^>]+>)';
Parser.Replace(' ', False); // Replace matches with space using non-expanded replacement string. Or simply put,
// the replacement string does not contain % tags.
You could also thow a match for " " into the mix, which would remove most of the "noise" from the textual result of the page. Eg:
Parser.Pattern:='(\ | <\w+(\s+[A -Za-z0-9_\ -]+\s*=\s* ("([^"]*)" |''([^'']* )''))*\s*( /)*>|<\!\- \-!(\-\->) *\-\->|<[^ >]+>)';
-------------
Russell
'-- get rid of all HTML tags
strFileContents = Regex.Replace(strFileConte
deals with removing ALL html tags and inserting a " ", which would leave only the text of the web page. I have enhanced the version above so it also catches the <!-- {text} -->, which tends to have other tags (< and >) embedded within it.
Parser.Pattern:='(<\w+(\s+
Parser.Replace(' ', False); // Replace matches with space using non-expanded replacement string. Or simply put,
// the replacement string does not contain % tags.
You could also thow a match for " " into the mix, which would remove most of the "noise" from the textual result of the page. Eg:
Parser.Pattern:='(\ |
-------------
Russell
ASKER
I'm getting close to having this thing done. It shouldn't be too much longer. I am also using
the Jazarsoft HTMLParser the fill my links list. Makes it very easy and I don't have to rely on MSHTML.
It also lets me get frame info which MSHTML would not do.
Seems I won't be using as much regex stuff as I expected due to using the Jazarsoft parser, too.
Thanks for your help...
the Jazarsoft HTMLParser the fill my links list. Makes it very easy and I don't have to rely on MSHTML.
It also lets me get frame info which MSHTML would not do.
Seems I won't be using as much regex stuff as I expected due to using the Jazarsoft parser, too.
Thanks for your help...
No problem on the help...
If you are parsing html, then using a parser specific to that task is better suited than regular expressions.
Not to say that regular expressions don't have their use/purpose, as they provide a powerful and generic means for parsing data.
And on the plus side, you got some deep exposure to pattern matching (never know when you might need it)
Regards,
Russell
If you are parsing html, then using a parser specific to that task is better suited than regular expressions.
Not to say that regular expressions don't have their use/purpose, as they provide a powerful and generic means for parsing data.
And on the plus side, you got some deep exposure to pattern matching (never know when you might need it)
Regards,
Russell
So things like:
reg = New Regex( _
"(?<Key>href=[""'](?!#|htt
RegexOptions.IgnoreCase _
Or RegexOptions.Multiline _
Or RegexOptions.Compiled)
nvc = AddMatchesToCollection(str
reg = New Regex( _
"(?<Key>src=[""'](?!http)(
RegexOptions.IgnoreCase _
Or RegexOptions.Multiline _
Or RegexOptions.Compiled)
nvc = AddMatchesToCollection(str
Will require a little bit more parsing (eg, match the first set, then match the next set if the first set was found). If you let me know which examples you are having problems with, Im sure I can give you a hand though. Not promising anything, but I may be able to extend the parsing class so it handles backreferencing....
Regards,
Russell