Link to home
Start Free TrialLog in
Avatar of Eddie Shipman
Eddie ShipmanFlag for United States of America

asked on

rlibby -> RegExpr

Russell, I'm trying to convert some VB.Net code to Delphi and it includes LOTS of .Net RegEx class calls in it.
Can you help me convert the expression to work with your class? If so, I'll post the particular VB.Net regex
expressions.

The code in question is a MHT builder. They use regex to retrieve parts of an html page for downloading.

Would I be able to use your code in an article if I write one about this MHTBuilder class I'm building?
Avatar of Russell Libby
Russell Libby
Flag of United States of America image

I took a look at the MHT Builder code, and my expression parser can be used to replace most, BUT not all the code.( I didn't add greedy parsing or backreferencing into my parser)

So things like:


     reg = New Regex( _
            "(?<Key>href=[""'](?!#|http|ftp|mailto|javascript)(?<Value>[^""']*)[""'])", _
            RegexOptions.IgnoreCase _
            Or RegexOptions.Multiline _
            Or RegexOptions.Compiled)

        nvc = AddMatchesToCollection(strHTML, reg, nvc)

        reg = New Regex( _
            "(?<Key>src=[""'](?!http)(?<Value>[^""']*)[""'])", _
            RegexOptions.IgnoreCase _
            Or RegexOptions.Multiline _
            Or RegexOptions.Compiled)
        nvc = AddMatchesToCollection(strHTML, reg, nvc)

Will require a little bit more parsing (eg, match the first set, then match the next set if the first set was found). If you let me know which examples you are having problems with, Im sure I can give you a hand though. Not promising anything, but I may be able to extend the parsing class so it handles backreferencing....

Regards,
Russell


Avatar of Eddie Shipman

ASKER

Well, not sure what you mean by "backreferencing" as I know next to nothing about RegEx.

Glad you were able to find that source. I have got a whole lot of it converted already but have been
vacillating over how to replace that regex stuff. I thought about using TRegExpr but since I don't
know anything about expressions or even what they are doing in the VB.Net code, I am stuck.

The major problems I've been having is the parsing of the relative URLs to create absolute
local URLs for use in the MHT.

On anther note:
I am using idHTTP to do the retrieval of the items but I think I'm going to switch to using WinInet
so any developers won't have to have Indy to use it; what do you think?
A little descriptive from MS on backreferencing:

---
Backreferences provide a convenient way to find repeating groups of characters. They can be thought of as a shorthand instruction to match the same string again.

For instance, to find repeating adjacent characters such as the two Ls in the word "tall", you would use the regular expression (?<char>\w)\k<char>, which uses the metacharacter \w to find any single-word character. The grouping construct (?<char> ) encloses the metacharacter to force the regular expression engine to remember a subexpression match (which in this case will be any single character) and save it under the name "char". The backreference construct \k<char> causes the engine to compare the current character to the previously matched character stored under "char". The entire regular expression successfully finds a match wherever a single character is the same as the preceding character.
---

It can make life easier, as you only have to write a single expression and let the parser do the rest. It can also make life harder, as it is very hard to track what the actual expression is doing. Anyways, you could also easily do this with a few lines of code using a basic parser (like the one I posted code for).

Regarding using TRegExpr in your source, for articles, etc. I don't mind in the least how/where it is used, or even if any credit is given. I only posted it (in another q)  because it is a nice example of a state machine that is:

1.) Pretty easy to read and follow
2.) Can easily be extended because of (1)
3.) Handles 90% of things that we most commonly use regular expression parsers for
4.) Is very fast because it is implemented as a non-lazy/non-greedy parser.

All (3) means is that the programmer has to do a little more manual parsing (like in the backreferencing example). This is not necessarily a negative, because in many cases its faster to parse a string, then parse the substring using a strict parser than it is to parse the string using a lazy parser. Or maybe it was because I was too lazy to implement the lazy routines? <g>....

Regarding Indy or WinInet; If it was me, I would probably use the greatest common denominator - which would be wininet, as long as it provided me with all the functionality I required. Its just a case where more delphi developers will have wininet vs indy. Plus, its a whole lot less baggage to carry around. Speaking only from my personal experience, I will go right down to the socket api level if I don't need the extra overhead.

-- please note, this is my opinion only, and I do not presume to speak for anyone else. --

But back to your problem; if you want to post the section of code that is causing you problems, then I could take a look at it for you and try and provide a good/solid solution to it.

Russell




Just as an update; not sure where you are with this....

Taking a look at my old code, I realized that I never really addressed sub-expressions that well. Didn't handle anything more that (xxx|xxx), and no post modifiers were allowed. Not very useful for complex parsing. Anyways, I spent some time updating the source to handle multiple nesting levels of ( ) expressions, and they also allow for post modifiers on them as well. Also, in order to provide something similar to the ".*" syntax (that I totally despise), the code now handles negated sub expressions. For example, to find "<tag>" + everything until a "</tag>" you could use the following:

// match-not match zero or more times-match
Pattern:='<tag>(!</tag>)*</tag>';

Makes parsing a little bit easier anyways. To me, it also makes a little more sense, because its explicitly stating "match this", then match while NOT that, and then match that.

Now taking the above VB example of

  "(?<Key>href=[""'](?!#|http|ftp|mailto|javascript)(?<Value>[^""']*)[""'])", _

which is used to match only relative hrefs, we can now use TRegExpr with the following:

   href=["'](!#|http|ftp|mailto|javascript)([^"']*)["']

I still don't handle backreferencing (not used here), or named expressions (eg Key, Value), but with the bead usage, splitting it is pretty simple, eg;

   href=@1["'](!#|http|ftp|mailto|javascript)([^"']*)["']@2

When a match is found, the string starting from BeadPos[0] to BeadPos[1] is what will need to be replaced with your absolute href.

I'm also including some other regular expression examples, just so that you get a feel for it:

Email name regular expression :
([a-zA-Z0-9_\-\.]+)\@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)

Password of 4 to 15 chars, starts with a letter :
^[a-zA-Z][%a_]{3,14}$

Phone number in a wide variety of formats :
\(?[0-9]{3}\)?(\-|\s)?[0-9]{3}\-[0-9]{4}

IP Address matching :
%w(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)%w

Html tag matching :
<tag[^>]*>(!</tag>)*</tag>

Relative HREF parsing :
href=["'](!#|http|ftp|mailto|javascript)([^"']*)["']

And of course, the updated source. (please read the commented header, as it contains the latest changes and additions)

Regards,
Russell

----------

unit RegExpr;
////////////////////////////////////////////////////////////////////////////////
//
//   Unit        :  REGEXPR
//   Description :  Regular expression parser for Delphi
//   Date        :  10.14.2003
//                  12.02.2004
//
//   Description :
//
//   ^        =  Start anchor indicator (if used in first position)
//   $        =  End anchor indicator (must be final character)
//   ( )      =  Parse sub expression. No limitations to nest depth.
//   (!       =  Negate the sub expression, eg (!subexpr), (!subexpr|subexp). For
//               a match to occur, the sub expression CANNOT match the string, thus allowing
//               the string to be incremented. This provides a feature similar to .* when
//               parsing.
//   |        =  OR the 2 sub expressions. Parens must be used to start the ORing
//               of two or more sub expressions. For multiple ORing the following
//               syntax can be used: (subexpr1|subexpr2|subexpr3|...)
//   ^        =  Must be used after [, negates a class
//   .        =  Match any char (cannot be negated, and cannot have post modifiers)
//   *        =  Post modifier, match zero or more occurrences of char/char class
//   ?        =  Post modifier, match zero or one occurences of char/char class
//   +        =  Post modifier, match one or more occurences of char/char class
//   {}       =  Post modifier, allows for the patterns {m}, {m,} and {m, n} where
//               {m} matches exactly m occurrences of the char/char class
//               {m,} matches m occurences or more of the char/char class
//               {m, n} matches m to n occurences of the char / char class
//   []       =  Char class, cannot contain ".", pre/post modifiers, or other meta
//               characters except for "-".
//   -        =  Used to define char ranges in char class
//   @0..9    =  Mark the current bead position (integer value). Useful for seperating
//               sub parts of the expression
//   \        =  Escaped char - the following have special meaning
//               \a    =  Bell
//               \b    =  Backspace
//               \e    =  Escape
//               \f    =  Formfeed
//               \n    =  Newline
//               \r    =  Carriage return
//               \t    =  Tab
//               \v    =  Vertical tab
//               \xHH  =  Hex code, must be 2 byte trailer in hex.
//               All other escaped charactes are translated as literals. This allows
//               the use of what would otherwise be meta characters. (ie: *, ., [, ...)
//   %        =  Extended char class range. Must be trailed by one of the following. Uppercase
//               character will negate the class range (eg %A is the same as  [^%a])
//               %a    =  0-9, A-Z, a-z
//               %c    =  A-Z, a-z
//               %d    =  0-9
//               %h    =  0-9, A-F, a-f
//               %i    =  \_, \-, \$, 0-9, A-Z, a-z
//               %n    =  0-9
//               %s    =  " ", \t
//               %w    =  \n, \r, " ", \t
//               Any other trailer will cause an expression error.
//   x        =  All other characters are taken as literal values.
//
////////////////////////////////////////////////////////////////////////////////
interface

{$Z4}
uses
  Windows, SysUtils, Classes;

////////////////////////////////////////////////////////////////////////////////
//   Regular expression exception class
////////////////////////////////////////////////////////////////////////////////
type
  ERegExpr          =  class(Exception);

////////////////////////////////////////////////////////////////////////////////
//   Regular expression data types
////////////////////////////////////////////////////////////////////////////////
type
  PCharSet          =  ^TCharSet;
  TCharSet          =  Set of Char;

////////////////////////////////////////////////////////////////////////////////
//   Regular expression instruction types
////////////////////////////////////////////////////////////////////////////////
type
  TRegType          =  (rtTerminal, rtBead, rtMark, rtMarkOr, rtMarkEnd, rtMarkNegEnd,
                        rtAnyChar,  rtChar, rtClass, rtNegClass);

////////////////////////////////////////////////////////////////////////////////
//   Regular expression post modifier types
////////////////////////////////////////////////////////////////////////////////
type
  TRegPost          =  (rpNone, rpOne, rpZMore, rpZOne, rpOneMore, rpExact, rpCMore, rpCtoC);

////////////////////////////////////////////////////////////////////////////////
//   Regular expression state structure
////////////////////////////////////////////////////////////////////////////////
type
  PRegState         =  ^TRegState;
  TRegState         =  packed record
     regType:       TRegType;
     regPost:       TRegPost;
     dwMin:         Integer;
     dwMax:         Integer;
     case Integer of
        0  :  (lpClass:       PCharSet);
        1  :  (cChar:         Char);
        2  :  (dwJump:        Integer);
  end;

////////////////////////////////////////////////////////////////////////////////
//   Regular expression sub-expression state
////////////////////////////////////////////////////////////////////////////////
type
  PRegSubStack      =  ^TRegSubStack;
  TRegSubStack      =  packed record
     lpPrevious:    PRegSubStack;
     szCursor:      PChar;
     dwBranch:      Integer;
     dwMatched:     Integer;
  end;

////////////////////////////////////////////////////////////////////////////////
//   Regular expression constants
////////////////////////////////////////////////////////////////////////////////
const
  REG_LOOPING       =  (-1);
  REG_NEGEXPR:      Array [False..True] of TRegType = (rtMarkEnd, rtMarkNegEnd);
  REG_NEGCLASS:     Array [False..True] of TRegType = (rtClass, rtNegClass);
  REG_TERMCHARS:    Set of Char = [#0, '$', '(', ')', '|'];
  REG_HEXCHARS:     Set of Char = ['0'..'9', 'A'..'F', 'a'..'f'];
  REG_METACHARS:    Set of Char = ['$', '(', ')', '|', '^', '*', '?', '+', '-',
                                   '@', '[', ']', '.', '{', '}', ',', '%', '!'];
  REG_PARSEERRORS:  Array [0..18] of PChar =
                    ('Expression pattern has not been set',               // 0
                     'Expression pattern cannot be blank',                // 1
                     'Invalid pattern expression',                        // 2
                     'Unexpected end of expression at position (%d)',     // 3
                     'Unexpected meta character at position (%d)',        // 4
                     'Characters after end anchor at position (%d)',      // 5
                     'Invalid special identifier at position (%d)',       // 6
                     'Invalid empty set at position (%d)',                // 7
                     'Min count must be less than max count (%d)',        // 8
                     'Expected a closing brace at position (%d)',         // 9
                     'Exact match count cannot be set to zero (%d)',      // 10
                     'Expected a numeric at position (%d)',               // 11
                     'Invalid numeric expression at position (%d)',       // 12
                     'Unexpected terminator found in sub expression (%d)',// 13
                     'Expected a closing paren at position (%d)',         // 14
                     'Bead markers must be bewteen 0 and 9',              // 15
                     'Stack expression underflow',                        // 16
                     'Empty sub expressesion encountered (%d)',           // 17
                     'Invalid hex char encounter (%d)'                    // 18
                    );

////////////////////////////////////////////////////////////////////////////////
//   Regular expression class
////////////////////////////////////////////////////////////////////////////////
type
  TRegExpr          =  class(TObject)
  private
     // Private declarations
     FBeadPos:      Array [0..9] of Integer;
     FAncStart:     Boolean;
     FAncEnd:       Boolean;
     FMatchCase:    Boolean;
     FMatched:      Boolean;
     FMatchPos:     Integer;
     FMatchLen:     Integer;
     FTable:        TList;
     FError:        PChar;
     FPos:          PChar;
     FParsed:       Boolean;
     FPattern:      PChar;
     FSource:       PChar;
     FCursor:       PChar;
     FAnchor:       PChar;
  protected
     // Protected declarations
     function       GetSource: PChar;
     function       GetPattern: PChar;
     function       GetPosition: Integer;
     function       GetBeadPos(Index: Integer): Integer;
     procedure      SetSource(Source: PChar);
     procedure      SetPattern(Pattern: PChar);
     procedure      SetMatchCase(CaseSensitive: Boolean);
     procedure      ClearStateTable;
     function       AllocStackEntry(Stack: PRegSubStack; Jump: Integer; Cursor: PChar): PRegSubStack;
     procedure      FreeStack(Stack: PRegSubStack);
     function       PopStackEntry(Stack: PRegSubStack): PRegSubStack;
     function       AllocState(StateType: TRegType): PRegState;
     procedure      FreeState(State: PRegState);
     procedure      ParseSubExpression;
     procedure      ParseParens;
     procedure      ParseExpression;
     procedure      ParseAnyChar;
     procedure      ParseCharClass;
     procedure      ParseChar;
     procedure      ParseCharRange(CClass: PCharSet; var Negated: Boolean);
     procedure      ParseSpecialRange(CClass: PCharSet; var Negated: Boolean);
     function       ParseCCChar: Char;
     function       ParseEscChar: Char;
     function       ParseNum: Integer;
     procedure      ParsePostModifier(State: PRegState);
     procedure      ResetMatched;
     procedure      ResetAll;
     function       Match(Start: Boolean): Boolean;
     function       MatchSubString: Boolean;
     function       ScanLineStart: Boolean;
     procedure      RaiseError(Error: Integer; InsertPos: Boolean = False);
  public
     // Public declarations
     constructor    Create;
     constructor    CreatePattern(Pattern: PChar);
     destructor     Destroy; override;
     function       Test(Pattern: PChar): Boolean;
     function       MatchFirst: Boolean;
     function       MatchNext: Boolean;
     property       BeadPos[Index: Integer]: Integer read GetBeadPos;
     property       CaseSensitive: Boolean read FMatchCase write SetMatchCase;
     property       MatchPos: Integer read FMatchPos;
     property       MatchLen: Integer read FMatchLen;
     property       Matched: Boolean read FMatched;
     property       Pattern: PChar read GetPattern write SetPattern;
     property       Position: Integer read GetPosition;
     property       Source: PChar read GetSource write SetSource;
  end;

implementation

////////////////////////////////////////////////////////////////////////////////
//   TRegExpr
////////////////////////////////////////////////////////////////////////////////
procedure TRegExpr.RaiseError(Error: Integer; InsertPos: Boolean = False);
begin

  // Check to see if we should be formatting this
  if InsertPos then
     // Formatted
     StrFmt(FError, REG_PARSEERRORS[Error], [Succ(FPos-FPattern)])
  else
     // Straight copy
     StrCopy(FError, REG_PARSEERRORS[Error]);

  // Raise the exception with the detailed information
  raise ERegExpr.Create(FError);

end;

function TRegExpr.GetBeadPos(Index: Integer): Integer;
begin

  // Check index range for bead position
  if not(Index in [0..9]) then RaiseError(15);

  // Return the bead marker
  result:=FBeadPos[Index];

end;

function TRegExpr.Test(Pattern: PChar): Boolean;
begin

  // Set the pattern to test
  SetPattern(Pattern);

  // Call MatchFirst and return results
  result:=MatchFirst;

end;

function TRegExpr.GetPosition: Integer;
begin

  // Return the current anchor position within the source text
  result:=FAnchor-FSource;

end;

function TRegExpr.MatchNext: Boolean;
begin

  // Reset matched state
  ResetMatched;

  // Call Match
  result:=Match(False);

end;

function TRegExpr.MatchFirst: Boolean;
begin

  // Reset matched state and source starting position
  ResetAll;

  // Call Match
  result:=Match(True);

end;

function TRegExpr.Match(Start: Boolean): Boolean;
begin

  // Set default result
  result:=False;

  // Check to make sure the expression was parsed
  if FParsed then
  begin
     // Check anchor start
     if FAncStart then
     begin
        // If not the first time then scan to a new line
        if not(Start) then ScanLineStart;
        // Keep looping until match, or cursor is at the end
        while (FCursor^ > #0) do
        begin
           // Search for the pattern
           result:=MatchSubString and ((FCursor-FAnchor) > 0);
           if result then break;
           // Increment the anchor
           Inc(FAnchor);
           // Scan to the next line
           if not(ScanLineStart) then break;
        end;
     end
     else
     begin
        // Keep looping until match, or cursor is at the end
        while (FCursor^ > #0) do
        begin
           // Search for the pattern
           result:=MatchSubString and ((FCursor-FAnchor) > 0);
           if result then break;
           // Increment the anchor
           Inc(FAnchor);
           // Set the cursor to the anchor
           FCursor:=FAnchor;
        end;
     end;
     // If we did not succeed then we must perform a full reset
     if result then
     begin
        // Set match pos and length
        FMatched:=True;
        FMatchPos:=FAnchor-FSource;
        FMatchLen:=FCursor-FAnchor;
        // If match length is zero, which can happen when matching zero times, then
        // we may run into cases where the cursor does not get incremented. We need
        // to increment for those cases
        if ((FMatchLen = 0) and (FCursor^ > #0)) then Inc(FCursor);
        // Set the next starting search point
        FAnchor:=FCursor;
     end
     else
        ResetAll;
  end
  else
     // Not parsed
     RaiseError(0);

end;

function TRegExpr.MatchSubString: Boolean;
var  prssStack:     PRegSubStack;
     prsNode:       PRegState;
     ppaTable:      PPointerList;
     dwCharLoop:    Integer;
     dwIndex:       Integer;
     bMatch:        Boolean;
     cChar:         Char;
begin

  // Set default result (optimisitic)
  result:=True;

  // Set node pointer and stack pointer
  prssStack:=nil;
  prsNode:=nil;

  // Map in the table as a straight pointer array
  ppaTable:=FTable.List;

  // Clear bead positions (and conveniently set dwIndex = zero)
  dwIndex:=10;
  repeat
     Dec(dwIndex);
     FBeadPos[dwIndex]:=0;
  until (dwIndex = 0);

  // Resource protection
  try
     // Loop until we are done
     while True do
     begin
        // Get the state record and make sure the index is incremented
        prsNode:=ppaTable^[dwIndex];
        Inc(dwIndex);
        // Check for terminating state or expression break
        case prsNode^.regType of
           // End of compiled expression
           rtTerminal  :  break;
           // Bead handling
           rtBead      :  FBeadPos[prsNode^.dwJump]:=FCursor-FSource;
           // Sub expression start
           rtMark      :
           begin
              // Check to see if we are in a loop
              if ((prssStack = nil) or (prssStack^.dwBranch > REG_LOOPING)) then
                 // Allocate a new stack entry
                 prssStack:=AllocStackEntry(prssStack, prsNode^.dwJump, FCursor)
              else
                 // Update the branch link in the stack record
                 prssStack^.dwBranch:=prsNode^.dwJump;
           end;
           // OR marker
           rtMarkOr    :
           begin
              // Get stack entry
              if Assigned(prssStack) then
              begin
                 // Check current result
                 if result then
                    // Jump out of here
                    dwIndex:=prsNode^.dwJump
                 else
                 begin
                    // Reset the cursor and set new jump point
                    FCursor:=prssStack^.szCursor;
                    prssStack^.dwBranch:=prsNode^.dwJump;
                    // Reset the result
                    result:=True;
                 end;
              end
              else
                 result:=False;
           end;
           // Sub expression end
           rtMarkEnd,
           rtMarkNegEnd:
           begin
              // Get stack entry
              if Assigned(prssStack) then
              begin
                 // Check for negation
                 if (prsNode^.regType = rtMarkNegEnd) then
                 begin
                    // Flip result
                    result:=not(result);
                    // Success?
                    if result then
                    begin
                       // Reset the cursor
                       FCursor:=prssStack^.szCursor;
                       // Push next if anything is left
                       if (FCursor^ > #0) then
                          // Next char
                          Inc(FCursor)
                       else
                          // Failure
                          result:=False;
                    end
                 end;
                 // Post operator can be handled by min and max match values
                 if result then
                 begin
                    // Increment the stack match counter
                    Inc(prssStack^.dwMatched);
                    // Do we need to keep looping?
                    if (prssStack^.dwMatched < prsNode^.dwMax) then
                    begin
                       // Set stack cursor
                       prssStack^.szCursor:=FCursor;
                       // Setup for next loop
                       prssStack^.dwBranch:=REG_LOOPING;
                       // Jump back
                       dwIndex:=prsNode^.dwJump;
                    end
                    else
                       // Free the stack entry
                       prssStack:=PopStackEntry(prssStack);
                 end
                 else
                 begin
                    // Did we match at least the minimum number of times?
                    result:=(prssStack^.dwMatched >= prsNode^.dwMin);
                    // Reset the cursor
                    FCursor:=prssStack^.szCursor;
                    // Free the stack entry
                    prssStack:=PopStackEntry(prssStack);
                 end;
              end;
              // Check result and stack level, must still have string to process as well
              if (result or Assigned(prssStack)) then Continue;
              // Failure and stack is empty, or not string left
              Break;
           end;
        else
           // Simple char state loop counter to handle the multiple matching
           dwCharLoop:=0;
           while True do
           begin
              // Set default match
              bMatch:=False;
              // Perform the character matching (if we have string)
              if (FCursor^ > #0) then
              begin
                 // Check case sensitive
                 if FMatchCase then
                    cChar:=FCursor^
                 else
                    cChar:=UpCase(FCursor^);
                 // Check the compare type
                 case prsNode^.regType of
                    rtAnyChar   :  bMatch:=not(cChar = #10);
                    rtChar      :  bMatch:=(cChar = prsNode^.cChar);
                    rtClass     :  bMatch:=(cChar in prsNode^.lpClass^);
                    rtNegClass  :  bMatch:=not(cChar in prsNode^.lpClass^);
                 end;
                 // Increment the cursor position
                 Inc(FCursor, Ord(bMatch));
              end;
              // Handle the post operator
              if bMatch then
              begin
                 // Increment the match counter
                 Inc(dwCharLoop);
                 // Do we need to keep matching?
                 if (dwCharLoop >= prsNode^.dwMax) then
                 begin
                    // Finished matching
                    result:=True;
                    break;
                 end;
              end
              else
              begin
                 // Did we match the minimum required?
                 result:=(dwCharLoop >= prsNode^.dwMin);
                 // Done matching
                 break;
              end;
           end;
           // Check for failure
           if not(result) then
           begin
              // If no jump is set then we fail
              if Assigned(prssStack) then
                 // Update the current index with the branch jump
                 dwIndex:=prssStack^.dwBranch
              else
                 // Failure
                 break;
           end;
        end;
     end;

     // Success, we still need to check against the terminal state in case we
     // ran out of string to process
     if result and Assigned(prsNode) then
     begin
        // Make sure we processed the terminal state
        result:=(prsNode^.regType = rtTerminal);
        // Still need to check against an end anchor if one was specified
        if (result and FAncEnd) then result:=(FCursor^ in [#0, #13, #10]);
     end
     else
        result:=False;

  finally
     // Free the stack
     FreeStack(prssStack);
  end;

end;

function TRegExpr.ScanLineStart: Boolean;
begin

  // Set default result
  result:=False;

  // Need to position the cursor right after a line break
  while not(FCursor^ in [#0, #10]) do Inc(FCursor);

  // If we are at a null then there is no next line so return false
  if (FCursor^ > #0) then
  begin
     // Increment past the line feed
     Inc(FCursor);
     // Reset the anchor to match the cursor
     FAnchor:=FCursor;
     // Success
     result:=True;
  end;

end;

procedure TRegExpr.ParseSubExpression;
var  prsNode:       PRegState;
     bNegated:      Boolean;
begin

  // Parse the sub expression until we hit a terminating char
  while not(FPos^ in REG_TERMCHARS) do
  begin
     // Handle the character
     case FPos^ of
        // Any character
        '.'   :  ParseAnyChar;
        // Character class
        '['   :  ParseCharClass;
        // Parse a bead position
        '@'   :
        begin
           // Skip next
           Inc(FPos);
           // Bead must be bewteen 0..9
           if not(FPos^ in ['0'..'9']) then RaiseError(15);
           // Allocate a new state
           prsNode:=AllocState(rtBead);
           prsNode^.dwJump:=Ord(FPos^)-48;
           FTable.Add(prsNode);
           // Skip next
           Inc(FPos);
        end;
        // Parse an extended char class
        '%'   :
        begin
           // Resource protection
           prsNode:=nil;
           try
              // Allocate a new state
              prsNode:=AllocState(rtClass);
              prsNode^.lpClass:=AllocMem(SizeOf(TCharSet));
              // Parse the range and post modifier
              ParseSpecialRange(prsNode^.lpClass, bNegated);
              if bNegated then prsNode^.regType:=rtNegClass;
              ParsePostModifier(prsNode);
              // Add to the state table
              FTable.Add(prsNode);
           except
              // Free the state
              FreeState(prsNode);
              // Re-raise the exception
              raise;
           end;
        end;
     else
        // It is a failure to be sitting at a meta char at this point
        if (FPos^ in REG_METACHARS) then
           RaiseError(4, True)
        else
           // Parse a literal or escaped character
           ParseChar;
     end;
  end;

end;

procedure TRegExpr.ParseParens;
var  prsMark:       PRegState;
     prsMarkOr:     PRegState;
     prsMarkEnd:    PRegState;
     bNegate:       Boolean;
     dwHead:        Integer;
     dwMark:        Integer;
begin

  // Skip starting paren
  Inc(FPos);

  // Add the mark record (hold pointer so we can fill the jump). Also maintain the
  // head so we can use post modifiers against the sub expression
  prsMark:=AllocState(rtMark);
  dwHead:=FTable.Add(prsMark);
  dwMark:=dwHead;

  // Can' be sitting at an empty expression
  if (FPos^ in [')', '|']) then RaiseError(17, True);

  // Is this a negated sub expression?
  bNegate:=(FPos^ = '!');
  if bNegate then Inc(FPos);

  // Parse the sub expression
  ParseSubExpression;

  // We need to be be sitting at one of the following
  // "("   =  Start new sub expression
  // "|"   =  Or the expression with a secondary sub expression
  // ")"   =  End the current sub expression

  // Better be sitting at a sub expression delimiter
  if not(FPos^ in ['(', '|', ')']) then RaiseError(13, True);

  // Keep looping while one of the following is encountered "(",  "|"
  while (FPos^ in ['(', '|']) do
  begin
     // Handle the char
     case FPos^ of
        // Recurse and parse next set of parens
        '('   :  ParseParens;
        // Parse "or" statement
        '|'   :
        begin
           Inc(FPos);
           // Add the mark OR
           prsMarkOr:=AllocState(rtMarkOr);
           // Add to the table and set the mark jump point
           prsMark^.dwJump:=FTable.Add(prsMarkOr);
           // Switch
           prsMark:=prsMarkOr;
           // Parse the second part of the expression. Handle cases where the
           // secondary expression starts with a "("
           if (FPos^ = '(') then
              // Parse paren statement
              ParseParens
           else
              // Parse the actual sub expression
              ParseSubExpression;
        end;
     else
        // Sanity check
        RaiseError(13, True);
     end;
  end;

  // End the sub expression
  if (FPos^ <> ')') then RaiseError(14, True);
  Inc(FPos);

  // Allocate the mark end and set post modifier loop jump. If a post modifier is
  // encountered, then we will jump back to the start of the sub expression after the
  // "(".
  prsMarkEnd:=AllocState(REG_NEGEXPR[bNegate]);
  prsMarkEnd^.dwJump:=dwMark;

  // Parse any post modifiers
  ParsePostModifier(prsMarkEnd);

  // Add the end and update the last jump
  prsMark^.dwJump:=FTable.Add(prsMarkEnd);

end;

procedure TRegExpr.ParseExpression;
var  prsNode:       PRegState;
begin

  // Clear the current state table and reparse
  ClearStateTable;

  // Clear match settings
  ResetMatched;

  // Reset expression parser anchors
  FAncStart:=False;
  FAncEnd:=False;

  // Set parser variable
  FPos:=FPattern;

  // Check for null pattern string
  if (FPos^ = #0) then RaiseError(1);

  // Check for starting anchor
  if (FPos^ = '^') then
  begin
     // Set anchor start flag
     FAncStart:=True;
     Inc(FPos);
     // Check for null
     if (FPos^ = #0) then RaiseError(2);
  end
  else
     // Check for just an ending anchor
     if (FPos^ = '$') then RaiseError(2);

  // Catch exceptions so we can clear the state table on failure
  try
     // Parse until we hit an ending char
     while not(FPos^ in [#0, '$']) do
     begin
        case FPos^ of
           '('   :  ParseParens;
           ')'   :  RaiseError(4, True);
           '|'   :  RaiseError(4, True);
        else
           ParseSubExpression;
        end;
     end;
     // Check for tail end anchor
     if (FPos^ = '$') then
     begin
        FAncEnd:=True;
        // Push next
        Inc(FPos);
        // We must be at the null
        if (FPos^ > #0) then RaiseError(5, True);
     end;
     // Add the terminal state
     prsNode:=AllocState(rtTerminal);
     FTable.Add(prsNode);
  except
     // Clear the anchor flags
     FAncStart:=False;
     FAncEnd:=False;
     // Clear the state table
     ClearStateTable;
     // Re-raise the exception again
     raise;
  end;

end;

procedure TRegExpr.ParseChar;
var  prsNode:       PRegState;
     cChar:         Char;
begin

  // Is the pattern null
  if (FPos^ = #0) then RaiseError(3, True);

  // Check for invalid meta characters
  if (FPos^ in REG_METACHARS) then RaiseError(4, True);

  // Check for escaped character
  if (FPos^ = '\') then
     cChar:=ParseEscChar
  else
     // Literal character
     cChar:=FPos^;

  // Push next
  Inc(FPos);

  // Allocate a new state
  prsNode:=AllocState(rtChar);
  if not(FMatchCase) then
     prsNode^.cChar:=UpCase(cChar)
  else
     prsNode^.cChar:=cChar;

  try
     // Parse out any post modifier
     ParsePostModifier(prsNode);
  except
     // Chance to free the memory
     FreeState(prsNode);
     // Re-raise the exception
     raise;
  end;

  // Add to state table
  FTable.Add(prsNode);

end;

function TRegExpr.ParseEscChar: Char;
var  lpszHex:       PCHar;
begin

  // Push next
  Inc(FPos);

  // Check the next character
  case FPos^ of
     #0    :
     begin
        result:=#0;
        RaiseError(3, True);
     end;
     'a'   :  result:=#7;
     'b'   :  result:=#8;
     'e'   :  result:=#27;
     'f'   :  result:=#12;
     'n'   :  result:=#10;
     'r'   :  result:=#13;
     't'   :  result:=#9;
     's'   :  result:=#32;
     'v'   :  result:=#11;
     'x'   :
     begin
        Inc(FPos);
        lpszHex:=FPos;
        if not(FPos^ in REG_HEXCHARS) then RaiseError(18, True);
        Inc(FPos);
        if not(FPos^ in REG_HEXCHARS) then RaiseError(18, True);
        result:=Chr(StrToInt('$'+lpszHex[0]+lpszHex[1]));
     end;
  else
     // Escape as is
     result:=FPos^;
  end;

end;

function TRegExpr.ParseCCChar: Char;
begin

  // Is the pattern null
  if (FPos^ = #0) then RaiseError(3, True);

  // Check for invalid meta characters
  if (FPos^ in REG_METACHARS) then RaiseError(4, True);

  // Check for escaped character
  if (FPos^ = '\') then
     result:=ParseEscChar
  else
     // Literal character
     result:=FPos^;

  // Push next
  Inc(FPos);

end;

procedure TRegExpr.ParseSpecialRange(CClass: PCharSet; var Negated: Boolean);
var  cChar:         Char;
begin

  // Push next
  Inc(FPos);

  // Is the pattern null
  if (FPos^ = #0) then RaiseError(3, True);

  // Check for invalid meta characters
  if (FPos^ in REG_METACHARS) then RaiseError(4, True);

  // Reverse the negation passed in if the char is upper case. This will correctly
  // handle cases where ^%D is specified, which equates to NOT NOT CHAR CLASS
  if (Ord(FPos^) < 97) then Negated:=not(Negated);

  // Handle the special char range
  case FPos^ of
     // Alpha, numeric
     'A',
     'a'   :
     begin
        for cChar:='0' to '9' do Include(CClass^, cChar);
        for cChar:='A' to 'Z' do Include(CClass^, cChar);
        for cChar:='a' to 'z' do Include(CClass^, cChar);
     end;
     // Alpha
     'C',
     'c'   :
     begin
        for cChar:='A' to 'Z' do Include(CClass^, cChar);
        for cChar:='a' to 'z' do Include(CClass^, cChar);
     end;
     // Numeric
     'D',
     'd'   :  for cChar:='0' to '9' do Include(CClass^, cChar);
     // Hexadecimal
     'H',
     'h'   :
     begin
        for cChar:='0' to '9' do Include(CClass^, cChar);
        for cChar:='A' to 'F' do Include(CClass^, cChar);
        for cChar:='a' to 'f' do Include(CClass^, cChar);
     end;
     // Alpha, numeric, underscore, dash, dollar
     'I',
     'i'   :
     begin
        Include(CClass^, '_');
        Include(CClass^, '-');
        Include(CClass^, '$');
        for cChar:='0' to '9' do Include(CClass^, cChar);
        for cChar:='A' to 'Z' do Include(CClass^, cChar);
        for cChar:='a' to 'z' do Include(CClass^, cChar);
     end;
     // White space but not line breaks
     'S',
     's'   :
     begin
        Include(CClass^, #9);
        Include(CClass^, #32);
     end;
     // All white space
     'W',
     'w'   :
     begin
        Include(CClass^, #9);
        Include(CClass^, #10);
        Include(CClass^, #13);
        Include(CClass^, #32);
     end;
  else
     // Invalid identifier
     RaiseError(6, True);
  end;

  // Push next
  Inc(FPos);

end;

procedure TRegExpr.ParseCharRange(CClass : PCharSet; var Negated: Boolean);
var  cStart:        Char;
     cEnd:          Char;
     cSwap:         Char;
begin

  // Check for special char range
  if (FPos^ = '%') then
     // Parse special character set
     ParseSpecialRange(CClass, Negated)
  else
  begin
     // Parse a single character
     cStart:=ParseCCChar;
     // Check for character range
     if (FPos^ = '-') then
     begin
        // Range of characters
        Inc(FPos);
        // Parse the next character
        cEnd:=ParseCCChar;
        // Build the range as a character set
        if (cStart > cEnd) then
        begin
           // Exchange first and last
           cSwap:=cStart;
           cStart:=cEnd;
           cEnd:=cSwap;
        end;
        // Include the characters in the character set
        for cSwap:=cStart to cEnd do
        begin
           Include(CClass^, cSwap);
           if not(FMatchCase) then Include(CClass^, UpCase(cSwap));
        end;
     end
     else
     begin
        // Single character
        if FMatchCase then
           Include(CClass^, cStart)
        else
           Include(CClass^, UpCase(cStart));
     end;
  end;

end;

procedure TRegExpr.ParseCharClass;
var  prsNode:       PRegState;
     bNegate:       Boolean;
     bNewNegate:    Boolean;
begin

  // Push next
  Inc(FPos);

  // Is this a negated class?
  bNegate:=(FPos^ = '^');
  if bNegate then Inc(FPos);
  bNewNegate:=bNegate;

  // Check for an immediate ], which means an empty set (which is not allowed)
  if (FPos^ = ']') then RaiseError(7, True);

  // Allocate a state record and char class
  prsNode:=AllocState(REG_NEGCLASS[bNegate]);
  prsNode^.lpClass:=AllocMem(SizeOf(TCharSet));

  // Start parsing the character ranges until we hit the closing bracket
  try
     // Parse char range
     while not(FPos^ in [#0, ']']) do ParseCharRange(prsNode^.lpClass, bNewNegate);
     // Check for change in negation
     if (bNewNegate <> bNegate) then prsNode^.regType:=REG_NEGCLASS[bNewNegate];
     // We must have the ending bracket
     if (FPos^ = #0) then RaiseError(3, True);
     // Push next
     Inc(FPos);
     // Parse out any post modifier
     ParsePostModifier(prsNode);
  except
     // Gives us the chance to clean up
     FreeState(prsNode);
     // Re-raise the exception
     raise;
  end;

  // Add to state table
  FTable.Add(prsNode);

end;

procedure TRegExpr.ParseAnyChar;
var  prsNode:       PRegState;
begin

  // Push next
  Inc(FPos);

  // Add the new state record
  prsNode:=AllocState(rtAnyChar);
  FTable.Add(prsNode);

end;

procedure TRegExpr.ParsePostModifier(State: PRegState);
begin

  // Set default post modifier
  State^.regPost:=rpOne;
  State^.dwMin:=1;
  State^.dwMax:=1;

  // If not a post modifier then just exit
  if not(FPos^ in ['*', '?', '+', '{']) then exit;

  // Set the post modifier
  case FPos^ of
     '*'   :
     begin
        State^.regPost:=rpZMore;
        State^.dwMin:=0;
        State^.dwMax:=MaxInt;
     end;
     '?'   :
     begin
        State^.regPost:=rpZOne;
        State^.dwMin:=0;
     end;
     '+'   :
     begin
        State^.regPost:=rpOneMore;
        State^.dwMax:=MaxInt;
     end;
     '{'   :  State^.regPost:=rpExact;
  end;

  // Push next
  Inc(FPos);

  // If *, ?, or + then we are done
  if (State^.regPost <> rpExact) then exit;

  // Get the first number
  State^.dwMin:=ParseNum;

  // Check for end of expression
  if (FPos^ = #0) then RaiseError(3, True);

  // Next character needs to be either a , or }
  case FPos^ of
     ','   :
     begin
        // Push next
        Inc(FPos);
        // Check for end of expression
        if (FPos^ = #0) then RaiseError(3, True);
        // We may have either a closing } or a numeric
        if (FPos^ = '}') then
           State^.regPost:=rpCMore
        else
        begin
           // Going to be a min to max range
           State^.regPost:=rpCtoC;
           // Get max range count
           State^.dwMax:=ParseNum;
           // Check for end of expression
           if (FPos^ = #0) then RaiseError(3, True);
           // Min has to be less than Max
           if (State^.dwMin >= State^.dwMax) then RaiseError(8, True);
           // Check for closing }
           if (FPos^ <> '}') then RaiseError(9, True);
        end;
     end;
     '}'   :  if (State^.dwMin = 0) then RaiseError(10, True);
  else
     RaiseError(9, True);
  end;

  // Update the min and max values (Count to count is already set)
  case State^.regPost of
     // Exactly
     rpExact     :  State^.dwMax:=State^.dwMin;
     // Count or more
     rpCMore     :  State^.dwMax:=MaxInt;
  end;

  // Push next
  Inc(FPos);

end;

function TRegExpr.ParseNum: Integer;
var  lpszHold:      PChar;
     cMark:         Char;
     dwErr:         Integer;
begin

  // Skip white space
  if (FPos^ in [#9, #32]) then Inc(FPos);

  // Save the current position
  lpszHold:=FPos;

  // Better be a number waiting
  if not(FPos^ in ['0'..'9']) then RaiseError(11, True);

  // Keep incrementing while numeric
  while (FPos^ in ['0'..'9']) do Inc(FPos);

  // Terminate and convert to number
  cMark:=FPos^;
  FPos^:=#0;
  Val(lpszHold, result, dwErr);
  FPos^:=cMark;

  // Check for error
  if (dwErr > 0) then
  begin
     FPos:=lpszHold;
     Inc(FPos, dwErr);
     RaiseError(12, True);
  end;

end;

procedure TRegExpr.ResetAll;
begin

  // Reset scanning positions
  FCursor:=FSource;
  FAnchor:=FSource;

  // Reset matched state
  FMatched:=False;
  FMatchPos:=0;
  FMatchLen:=0;

end;

procedure TRegExpr.ResetMatched;
begin

  // Set anchor and cursor to equal source
  FMatched:=False;
  FMatchPos:=0;
  FMatchLen:=0;

end;

procedure TRegExpr.SetMatchCase(CaseSensitive: Boolean);
begin

  // Check current setting
  if (CaseSensitive <> FMatchCase) then
  begin
     // Set new case sensitive setting
     FMatchCase:=CaseSensitive;
     // Reparse if we were in a parsed state
     if FParsed then ParseExpression;
  end;

end;

function TRegExpr.GetSource: PChar;
begin

  // Return the source
  result:=FSource;

end;

function TRegExpr.GetPattern: PChar;
begin

  // Return the pattern
  result:=FPattern;

end;

procedure TRegExpr.SetSource(Source: PChar);
begin

  // Assign the new source
  FreeMem(FSource);
  FSource:=AllocMem(Succ(StrLen(Source)));
  StrCopy(FSource, Source);

  // Source has changed, update the new state
  ResetAll;

end;

procedure TRegExpr.SetPattern(Pattern: PChar);
begin

  // Assign the new pattern
  FreeMem(FPattern);
  FPattern:=AllocMem(Succ(StrLen(Pattern)));
  StrCopy(FPattern, Pattern);

  // Reset parsed state
  FParsed:=False;

  // Parse the expression
  ParseExpression;

  // Set parsed state to true
  FParsed:=True;

end;

procedure TRegExpr.FreeState(State: PRegState);
begin

  // Is the state record assigned?
  if Assigned(State) then
  begin
     // If char class is allocated then we need to free the memory for the char set
     if (State^.regType in [rtClass, rtNegClass]) then FreeMem(State^.lpClass);
     // Free the state record
     FreeMem(State);
  end;

end;

procedure TRegExpr.ClearStateTable;
var  dwIndex:    Integer;
begin

  // Clear all the NFA state records
  for dwIndex:=Pred(FTable.Count) downto 0 do FreeState(FTable[dwIndex]);

  // Clear the table
  FTable.Clear;

end;

procedure TRegExpr.FreeStack(Stack: PRegSubStack);
var  lpStack:       PRegSubStack;
begin

  // Get current stack entry
  lpStack:=Stack;

  // Free all stack entries
  while Assigned(lpStack) do
  begin
     // Free entry and get previous entry
     lpStack:=PopStackEntry(lpStack);
  end;

end;

function TRegExpr.PopStackEntry(Stack: PRegSubStack): PRegSubStack;
begin

  // Check for underflow
  if Assigned(Stack) then
  begin
     // Set result (which is prior stack entry)
     result:=Stack^.lpPrevious;
     // Free memory for this entry
     FreeMem(Stack);
  end
  else
  begin
     // Underflow
     RaiseError(16);
     // Pacify the compiler
     result:=nil;
  end;

end;

function TRegExpr.AllocStackEntry(Stack: PRegSubStack; Jump: Integer; Cursor: PChar): PRegSubStack;
begin

  // Allocate memory for stack entry
  result:=AllocMem(SizeOf(TRegSubStack));

  // Set defaults
  with result^ do
  begin
     lpPrevious:=Stack;
     dwBranch:=Jump;
     szCursor:=Cursor;
  end;

end;

function TRegExpr.AllocState(StateType: TRegType): PRegState;
begin

  // Allocate a new regx state record
  result:=AllocMem(SizeOf(TRegState));

  // Set the type
  result^.regType:=StateType;
  result^.dwMin:=1;
  result^.dwMax:=1;

end;

constructor TRegExpr.CreatePattern(Pattern: PChar);
begin

  // Perform inherited for TRegExpr
  Create;

  // Set the pattern
  SetPattern(Pattern);

end;

constructor TRegExpr.Create;
begin

  // Perform inherited
  inherited Create;

  // Initialization
  FPattern:=AllocMem(1);
  FSource:=AllocMem(1);
  FError:=AllocMem(256);
  FPos:=FPattern;
  FAncStart:=False;
  FAncEnd:=False;
  FMatchCase:=False;
  FTable:=TList.Create;
  FParsed:=False;

  // Reset to default (unmatched) state
  ResetAll;

end;

destructor TRegExpr.Destroy;
begin

  // Clear the state records
  ClearStateTable;

  // Free the state table list
  FTable.Free;

  // Free all allocated memory
  FreeMem(FPattern);
  FreeMem(FSource);
  FreeMem(FError);

  // Perform inherited
  inherited Destroy;

end;

end.








ASKER CERTIFIED SOLUTION
Avatar of Russell Libby
Russell Libby
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
While I have not really taken a look at your solution, I will presume it has been tested because of your reputation.

I will begin retrofitting the MHTBuilder class I'm working on and will note you in the article I plan to write for
BDN.

Much appreciated...

No problem... and you can be assured that I have tested the regular expression parser VERY well (lots of little bugs that used to be there and no more).

Regarding the relative->absolute url conversion, I would suggest using something faster than Delphi's StringReplace function (as it is SLOW compared to others, like FastString's). Irregardless of that small note, the code example is clean and does "exactly" what the MHT builder code does. All you need to do is pass the text and the absolute url, and let it do the rest.

Regards,
Russell
As a final sidenote to this...

I plan on implementing a few more code routines in the parser class (this was on my list of things to do), and one of those is string replacing. This will offer the benefit of a high speed string replacement routine because the match pos/length, as well as the replacement length are known during the match phase. The replacement text can also be composed of tags %0..%9, which will act as format strings to the backreference tags. So, for example, the code I provided for you will actually turn into:

New expression patterns
-----------------------------
((:1href)=["'']!(\#|http|ftp|mailto|javascript)\/?(:2[^"'']*)["''])
((:1src)=["'']!(http)\/?(:2[^"'']*)["''])

Replacement Text
-----------------------------
'%1="'+AbsoluteURL+'%2"';

procedure ConvertHtml(Source: TStrings; AbsoluteURL: String);
var  regParser:     TRegExpr;
     dwIndex:       Integer;
const
  Patterns:         Array [0..1] of PChar =
                    (
                       '((:1href)=["'']!(\#|http|ftp|mailto|javascript)\/?(:2[^"'']*)["''])',
                       '((:1src)=["'']!(http)\/?(:2[^"'']*)["''])'
                    );
begin

  // Fix the absolute url so it ends with a trailing slash
  dwIndex:=Length(AbsoluteURL);
  if (dwIndex > 0) and (AbsoluteURL[dwIndex] <> '/') then AbsoluteURL:=AbsoluteURL+'/';

  // Make sure source is assigned and is not empty
  if Assigned(Source) and (Source.Count > 0) then
  begin
     // Create the parser
     regParser:=TRegExpr.Create;
     // Resource protection
     try
        // Set source text to perform matching on
        regParser.Source:=Source.GetText;
        // Iterate both patterns
        for dwIndex:=Low(Patterns) to High(Patterns) do
        begin
           // Set the pattern
           regParser.Pattern:=Patterns[dwIndex];
           // Replace
           regParser.Replace('%1="'+AbsoluteURL+'%2"');
        end;
        // Update the source text
        Source.SetText(regParser.Source);
     finally
        // Free all local objects
        FreeAndNil(regParser);
     end;
  end;

end;

Which is obviously much shoter and more compact than the original code example. If you are interested in getting the final class code , then let me know and I will post it when finished.

Regards,
Russell

Russell,
 Having problems with this one:

  ' remove the <base href=''> tag if present; causes problems when viewing locally.
  Regex.Replace(strHtml, "<base[^>]*?>", "")

Is this right?

  LRegParser := TRLRegExpr.Create;
  try
    LRegParser.Source  := LHTMLStrings.Text;  // StringList with HTML
    LRegParser.Pattern := '<base[^>]*?>';
    LRegParser.Replace(PChar('%1%2'), True);
  finally
    LRegParser.Free;
  end;

These are also giving me fits

  ' -- StripTagFromHtml("iframe", strHtml) removes all iframe tags.

    Private Function StripTagFromHtml(ByVal strTag As String, ByVal strHtml As String) As String
        Dim reg As Regex = New Regex( _
        String.Format("<{0}[^>]*?>[\w|\t|\r|\W]*?</{0}>", strTag), _
            RegexOptions.Multiline _
            Or RegexOptions.IgnoreCase _
            Or RegexOptions.Compiled)
        Return reg.Replace(strHtml, "")
    End Function


        '-- get rid of all HTML tags
        strFileContents = Regex.Replace(strFileContents, "<\w+(\s+[A-Za-z0-9_\-]+\s*=\s*(""([^""]*)""|'([^']*)'))*\s*(/)*>|<[^>]+>", " ", RegexOptions.Compiled)


Little tied up at the moment, so I only have time to fully address part of this (I will address the remainder in the morning).

1.) The only tricky thing in the first example is the *? notation, which turns greediness into laziness, which is what my parser does anyways. So:

Regex.Replace(strHtml, "<base[^>]*?>", "")

becomes

 LRegParser := TRLRegExpr.Create;
  try
    LRegParser.Source  := PChar(LHTMLStrings.Text);  // StringList with HTML
    LRegParser.Pattern := '<base[^>]*>';
    LRegParser.Replace('', True); // This will replace the match with blank
  finally
    LRegParser.Free;
  end;

If you needed to use part of the match in the replacement, THEN you would use the %0..%9 tags. Eg:

Pattern:='<(:1\w+)(:2[^>]*)>';
Replace('{%1%2}', True);

Which would match any tag in the format of "<[word chars][any chars]>" and make the following backreference

:1 = [word chars]
:2 = [any chars]

And then replace the whole match with

{[word chars][any chars]}

As a side note, I have been working on improvements to the parser that eek another 10-15% out of performance speed, and have also added named backreference matching as well (beyond the the :0 to :9 numbered matching) eg;

'<(:<TagName>\w+)(:<TagExtra>[^>]*)>'

I will send you a copy when changes are done, if you are interested. (another week or so). It will also include a text file with examples to go from. But.... back to the issues at hand <g>

2.) The second item does nothing more than match (you pass the format string) the

<%s[any chars]>[any chars]</%s>

So, to do this only requires the following:

function StripTagFromHtml(TagName, HtmlString: String): String;
begin

  Parser.Source:=PChar(HtmlString);
  Parser.Pattern:=PChar(Format('<%0:s[^>]*>!(</%0:s>)*</%0:s>', [TagName]));
  Parser.Replace('', False);
  result:=Parser.Source;

end;    

and call it with "iframe" as the TagName param. The pattern does nothing more than match the start tag, matches everything until the end tag, then matches the end tag.

Im out of time, but a quick view of the last problem doesnt appear to be anything major. I will give you the pattern in the morning.

Regards,
Russell
The final question:

   '-- get rid of all HTML tags
        strFileContents = Regex.Replace(strFileContents, "<\w+(\s+[A-Za-z0-9_\-]+\s*=\s*(""([^""]*)""|'([^']*)'))*\s*(/)*>|<[^>]+>", " ", RegexOptions.Compiled)

deals with removing ALL html tags and inserting a " ", which would leave only the text of the web page. I have enhanced the version above so it also catches the <!-- {text} -->, which tends to have other tags (< and >) embedded within it.

Parser.Pattern:='(<\w+(\s+[A-Za-z0-9_\-]+\s*=\s*("([^"]*)"|''([^'']*)''))*\s*(/)*>|<\!\-\-!(\-\->)*\-\->|<[^>]+>)';
Parser.Replace(' ', False); // Replace matches with space using non-expanded replacement string. Or simply put,
                                      // the replacement string does not contain % tags.

You could also thow a match for "&nbsp;" into the mix, which would remove most of the "noise" from the textual result of the page. Eg:

Parser.Pattern:='(\&nbsp;|<\w+(\s+[A-Za-z0-9_\-]+\s*=\s*("([^"]*)"|''([^'']*)''))*\s*(/)*>|<\!\-\-!(\-\->)*\-\->|<[^>]+>)';

-------------

Russell
I'm getting close to having this thing done. It shouldn't be too much longer. I am also using
the Jazarsoft HTMLParser the fill my links list. Makes it very easy and I don't have to rely on MSHTML.
It also lets me get frame info which MSHTML would not do.

Seems I won't be using as much regex stuff as I expected due to using the Jazarsoft parser, too.

Thanks for your help...
No problem on the help...

If you are parsing html, then using a parser specific to that task is better suited than regular expressions.
Not to say that regular expressions don't have their use/purpose, as they provide a powerful and generic means for parsing data.

And on the plus side, you got some deep exposure to pattern matching (never know when you might need it)

Regards,
Russell