Searching a HTML document without Webbrowser.. Is it possible?

Hello Experts,
In my application I need to search through a bunch of  *.mht documents and store the findings in a table. I managed to get this working very well, thanks to EddieShipman and all other help I have received. Anyhow, the procedure works fine but is VERY VERY time consuming when You search through lets say 100 quite large files. Is there a way to do this without loading the document to a TWebbrowser?

See my code snippet..


I have no ideas if this is possible at all but just asking&

-Hakan

procedure TForm1.btnFindClick(Sender: TObject);
var
  url: Widestring;
  i: integer;
  v: OleVariant;
 
begin
  x := Twebbrowser.Create(self);
  TWinControl(x).Parent := Form1;
  TWinControl(x).Visible := False;
  x.Silent := True;
  x.Cursor := crDefault;
  x.OnDocumentComplete := Form1.WebBrowserDocumentComplete;
  ResTable.Active := True;
 
  iCount := 0;
  iRecord := -1;
  ResTable.Close;
  ResTable.Open;
  ResTable.DisableControls;
  for i := 1 to 34 do
  begin
    OurNSHandler.ResUrl := GetString(1000 + i);
    ResTag := 1000 + i;
    Url := 'http://xxx/' + OurNSHandler.ResUrl + '.mht';
    x.navigate(Url, v, v, v, v);
    Application.ProcessMessages;
     TextRange := ((x.Document as IHTMLDocument2).Body as
    IHTMLBodyElement).CreateTextRange;
    WBLocateHighlight(x, Edit1.Text);
    Application.ProcessMessages;
  end;
 
  ResTable.First;
  ResTable.EnableControls;
  TWinControl(x).Free;
 
  lblFound.Caption := IntToStr(iCount) + ' results found';
  if iCount = 0 then
    Showmessage('No results found.');
end;
 
procedure TForm1.WebBrowserDocumentComplete(Sender: TObject;
  const pDisp: IDispatch; var URL: OleVariant);
begin
  TextRange := ((x.Document as IHTMLDocument2).Body as
    IHTMLBodyElement).CreateTextRange;
end;
 
procedure TForm1.WBLocateHighlight(WB: Twebbrowser; Text: string);
const
  prefix = '<span style="background-color: yellow; font-weight: bolder;">';
  suffix = '</span>';
var
  tr: IHTMLTxtRange;
  sDocTitle: string;
begin
  if Assigned(WB.Document) then
  begin
    tr := ((wb.Document as IHTMLDocument2).body as
      IHTMLBodyElement).createTextRange;
    sDocTitle := WB.OleObject.Document.Title;
    iElement := 0;
    while tr.findText(Text, 1, 0) do
    begin
      ResTable.AppendRecord([0, sDocTitle, tr.parentElement.innerText,
        tr.text, 0, iElement, ResTag]);
      Inc(IElement, 1);
      tr.moveStart('character', Length(tr.text) + 1);
      FElement := tr.parentElement.innerHTML;
      Inc(iCount, 1);
    end;
  end;
end;

Open in new window

hakanfaAsked:
Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

wildzeroCommented:
Hi there,

I just archived a web page (File -> Save As -> .mht) and then right clicked on this file and was able to open it via a text editor and still see all the html code in there.

So I am guessing you could load these files inside a TStringlist, then extract the information you want, probly with Regular Expressions (extra library required). All depends what you want to do.

If you are looking at just doing something like
 - Open .mht files
 - Locate string "<span style="background-color: yellow; font-weight: bolder;">"
 - Extract all until "</div>"
 - Process the extracted text, save to database

then I think you could get away without using TWebBrowser.

0
developmentguruPresidentCommented:
 It depends on what type of pages you are trying to analyze.  Are they local, or would you need to get them dynamically?  Either way you can do that without the browser component, but the server has a chance to ouput some very uninformative HTML back to you.  Getting the HTML is not the issue, it's just whether or not it will contain what you think it will.  From that point the processing is up to you.  Depending on what you want to do with it, the processing could be simple or... not so simple.  Could you give us an example HTML set along with what information you are attempting to extract?
0
hakanfaAuthor Commented:
Well, what I need is the "pure" text, not HTML. If there is a way to get the "text" without loading the file into a browser I can manage the rest by my self.
0
Become a CompTIA Certified Healthcare IT Tech

This course will help prep you to earn the CompTIA Healthcare IT Technician certification showing that you have the knowledge and skills needed to succeed in installing, managing, and troubleshooting IT systems in medical and clinical settings.

developmentguruPresidentCommented:
What you will need to do is get the HTML and parse out the text.  If all you need is to remove all formatting it should not be too rough.  I would make a class that can take a string or stream as input and produce a string as output.  Unfortunately I do not have the time to write this today.  If you would like me to see what I can do, let me know.  I do not know of any pre-built utilities to do that.  I could not try until Monday most likely.
0
hakanfaAuthor Commented:
Formatting not needed.. just amount of findings.. please let me know if I you come up with something...
0
wildzeroCommented:
Sure you could just do something like (drop two memo and button on your form)

function StripHTML(S: string): string;
var
  TagBegin, TagEnd, TagLength: integer;
begin
  TagBegin := Pos( '<', S);      // search position of first < 

  while (TagBegin > 0) do begin  // while there is a < in S
    TagEnd := Pos('>', S);              // find the matching > 
    TagLength := TagEnd - TagBegin + 1;
    Delete(S, TagBegin, TagLength);     // delete the tag
    TagBegin:= Pos( '<', S);            // search for next <
  end;
 
  Result := S;                   // give the result
end;

button1.click
begin
  memo1.lines.loadfromfile('path to .mht file');
  memo2.text := StripHTML(Memo1.text);
end

Code from here -> http://www.festra.com/eng/snip12.htm

and see how well that works.
The best solution would be to use stringlists and remove the HTML in memory - would be faster, but this should be a little test.

I would also use regular expressions to handle the removing of tags.
0
developmentguruPresidentCommented:
 Here is the DFM and code for a Form2 that will let you type a URL in and click GO to retrieve it into the top memo.  After that clicking the Strip HTML button will remove all the tags and show the results in the botttom memo.  The problem here is that you will likely need to find information in the body of the HTML while stripping out any scripting, etc.  This code is not quite that complete.  It does give you an example of how to start both retrieving the HTML from the internet without the web control, and some example code on how to strip the tags.  I will leave the tag matching (for doing a better job of removing scripting, etc) up to you.  Let me know if this will do it for you.
object Form2: TForm2
  Left = 0
  Top = 0
  Caption = 'Strip HTML Example'
  ClientHeight = 280
  ClientWidth = 426
  Color = clBtnFace
  Font.Charset = DEFAULT_CHARSET
  Font.Color = clWindowText
  Font.Height = -11
  Font.Name = 'Tahoma'
  Font.Style = []
  OldCreateOrder = False
  PixelsPerInch = 96
  TextHeight = 13
  object Label1: TLabel
    Left = 8
    Top = 11
    Width = 23
    Height = 13
    Caption = 'URL:'
  end
  object mRawHTML: TMemo
    Left = 8
    Top = 30
    Width = 410
    Height = 121
    TabOrder = 0
  end
  object eURL: TEdit
    Left = 37
    Top = 8
    Width = 348
    Height = 21
    TabOrder = 1
    Text = 'http://www.google.com'
  end
  object btnGO: TButton
    Left = 385
    Top = 8
    Width = 33
    Height = 22
    Caption = 'GO'
    TabOrder = 2
    OnClick = btnGOClick
  end
  object btnStipHTML: TButton
    Left = 343
    Top = 157
    Width = 75
    Height = 25
    Caption = 'Strip HTML'
    TabOrder = 3
    OnClick = btnStipHTMLClick
  end
  object mPlainText: TMemo
    Left = 8
    Top = 184
    Width = 410
    Height = 89
    TabOrder = 4
  end
  object IdHTTP1: TIdHTTP
    AllowCookies = True
    ProxyParams.BasicAuthentication = False
    ProxyParams.ProxyPort = 0
    Request.ContentLength = -1
    Request.Accept = 'text/html, */*'
    Request.BasicAuthentication = False
    Request.UserAgent = 'Mozilla/3.0 (compatible; Indy Library)'
    HTTPOptions = [hoForceEncodeParams]
    Left = 384
    Top = 40
  end
end
 
 
unit Unit2;
 
interface
 
uses
  Windows, Messages, SysUtils, Variants, Classes, Graphics, Controls, Forms,
  Dialogs, IdBaseComponent, IdComponent, IdTCPConnection, IdTCPClient, IdHTTP,
  StdCtrls;
 
type
  TForm2 = class(TForm)
    mRawHTML: TMemo;
    eURL: TEdit;
    Label1: TLabel;
    btnGO: TButton;
    IdHTTP1: TIdHTTP;
    btnStipHTML: TButton;
    mPlainText: TMemo;
    procedure btnGOClick(Sender: TObject);
    procedure btnStipHTMLClick(Sender: TObject);
  private
    { Private declarations }
  public
    { Public declarations }
    function StripHTML(HTML : string) : string;
  end;
 
var
  Form2: TForm2;
 
implementation
 
{$R *.dfm}
 
procedure TForm2.btnGOClick(Sender: TObject);
var
  SS : TStringStream;
 
begin
  SS := TStringStream.Create('');
  try
    with IdHTTP1, Request do
      begin
        Get(eURL.Text, SS);
        SS.Seek(0, soFromBeginning);
        mRawHTML.Lines.Text := SS.ReadString(SS.Size);
      end;
  finally
    SS.Free;
  end;
end;
 
procedure TForm2.btnStipHTMLClick(Sender: TObject);
begin
  mPlainText.Lines.Text := StripHTML(mRawHTML.Lines.Text);
end;
 
function TForm2.StripHTML(HTML: string): string;
var
  I : integer;
  HTMLLen : integer;
  Stripping : boolean;
  PlainText : string;
  First, Last : integer;
 
begin
  HTMLLen := Length(HTML);
  Stripping := false;
  PlainText := '';
 
  I := 1;
  while I < HTMLLen do
    begin
      First := I;
 
      //find the <
      while (I < HTMLLen) and (HTML[I] <> '<') do
        inc(I);
 
      //copy text up to that point.
      PlainText := PlainText + copy(HTML, First, I - First);
 
      //find the >
      while (I < HTMLLen) and (HTML[I] <> '>') do
        inc(I);
 
      //add a line break after >
      if HTML[I] = '>' then
        PlainText := PlainText + #13#10;
 
      //move on
      inc(I);
    end;
  Result := PlainText;
end;
 
end.

Open in new window

0
hakanfaAuthor Commented:
Dear fellows,
I was on a short business trip, so please accept apologies for the delay.
Any how, developmentguru, Im sorry to say but Your solution did not work for me. The files in question are MHT files and for some reason the tags are not exact the same as for HTML?. Strange though.. but believe me, I have spent hours trying to get it working.

Though I found a solution on the Delphi pages my self. (see code snippet) This did the trick&

But, thank You for Your kindness and help.

Kind regards,
-Hakan

function TForm1.GetValue(AURL: String): String;
var
  IDoc:    IHTMLDocument2;
  strHTML: String;
  v:       Variant;
  i, j, x: integer;
  ovTags:  OleVariant;
  oText:   String;
  idHTTP1: TidHTTP;
begin
  Idoc:=CreateComObject(Class_HTMLDOcument) as IHTMLDocument2;
  idHTTP1 := TidHTTP.Create(Self);
  try
    IDoc.designMode:='on';
    while IDoc.readyState<>'complete' do
      Application.ProcessMessages;
    v:=VarArrayCreate([0,0],VarVariant);
    strHTML := idHTTP1.Get(AURL);
    v[0]:= strHTML;
    IDoc.write(PSafeArray(System.TVarData(v).VArray));
    IDoc.designMode:='off';
    while IDoc.readyState<>'complete' do
      Application.ProcessMessages;
    ovTags := IDoc.all.tags('B');
    if ovTags.Length > 0 then
    begin
      for x := 0 to ovTags.Length-1 do
      begin
        oText := ovTags.Item(x).InnerText;
        if Pos('R$', oText) > 0 then
        begin
          Result := oText;
          Break;
        end;
      end;
    end;
  finally
    idHTTP1.Free;
    IDoc := nil;
  end;
end;

Open in new window

0

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
Delphi

From novice to tech pro — start learning today.