Analyzer for HTML-Pages

tetsuo21 asked:
Hi,
can anyone explain how I can quickly and recursively analyze HTML pages for links (html, zip, mail), for example with IHTMLDocument2? The routine has to follow the links through the HTML pages in the neighborhood of the initial URL. The pages may be with or without frames. Thanks.

Tet

All-around developer
CERTIFIED EXPERT
Commented:
I'd suggest downloading the THTMLPars component from
Torry's.

Add the THTMLPars unit to your uses clause, drop a Button, a Memo, and an IdHTTP
component on a form, then do this in the button's OnClick handler:

procedure TForm1.Button1Click(Sender: TObject);
var
  s: string;
  j, i: integer;
  obj: TObject;
  HTMLParser: THTMLParser;
  HTMLTag: THTMLTag;
  HTMLParam: THTMLParam;
begin
  Memo1.Clear;
  // Download the page and save it so the parser can load it from disk.
  Memo1.Lines.Text := IdHTTP1.Get('http://www.experts-exchange.com');
  Memo1.Lines.SaveToFile('c:\temp\blah.html');
  // Clear the page source so the memo ends up holding only the links.
  Memo1.Clear;

  HTMLParser := THTMLParser.Create;
  try
    HTMLParser.Lines.LoadFromFile('c:\temp\blah.html');
    HTMLParser.Execute;

    for i := 0 to HTMLParser.Parsed.Count - 1 do
    begin
      obj := HTMLParser.Parsed[i];

      // Skip anything that is not a tag.
      if obj.ClassType = THTMLTag then
      begin
        HTMLTag := THTMLTag(obj);
        // Compare case-insensitively: the tag may be written <a> or <A>.
        if SameText(HTMLTag.Name, 'A') then
        begin
          // Rebuild the tag text from its name and parameters.
          s := 'TAG: <' + HTMLTag.Name;
          for j := 0 to HTMLTag.Params.Count - 1 do
          begin
            HTMLParam := HTMLTag.Params[j];
            s := s + ' ' + HTMLParam.Key;
            if HTMLParam.Value <> '' then
              s := s + '="' + HTMLParam.Value + '"';
          end;
          Memo1.Lines.Add(s + '>');
        end;
      end;
    end;
  finally
    HTMLParser.Free;
  end;
end;

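This will get all links from the page into your memo: every <A> tag is
listed with its parameters, and the link target is the value of the
HREF parameter.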
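Since the question asks about html, zip, and mail links, here is a minimal sketch of how the HREF values could be sorted into those groups once you have them. It uses only SysUtils. The names TLinkKind, UrlExtension, and ClassifyLink are made up for this example and are not part of THTMLPars or Indy, and the rules (a "mailto:" prefix, a ".zip", ".htm", or ".html" extension) are assumptions about what you want to count as each kind.

```pascal
program ClassifyLinks;

{$IFDEF FPC}{$MODE DELPHI}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils;

type
  // Hypothetical categories for this sketch.
  TLinkKind = (lkMail, lkZip, lkHtml, lkOther);

// Returns the extension (including the dot) of the last path segment,
// or '' if there is none. Hypothetical helper, not a library routine.
function UrlExtension(const Url: string): string;
var
  Path: string;
  P, I: Integer;
begin
  Result := '';
  Path := Url;
  // Skip "scheme://host" so a dot in the host name is not taken as an extension.
  P := Pos('://', Path);
  if P > 0 then
  begin
    Delete(Path, 1, P + 2);
    P := Pos('/', Path);
    if P = 0 then
      Exit;                       // no path at all, e.g. http://example.com
    Delete(Path, 1, P);
  end;
  // Scan backwards through the last path segment only.
  for I := Length(Path) downto 1 do
  begin
    if Path[I] = '/' then
      Exit;
    if Path[I] = '.' then
    begin
      Result := Copy(Path, I, MaxInt);
      Exit;
    end;
  end;
end;

// Decides the kind of link from its HREF value.
function ClassifyLink(const Href: string): TLinkKind;
var
  Url, Ext: string;
  P: Integer;
begin
  Url := LowerCase(Trim(Href));
  if Copy(Url, 1, 7) = 'mailto:' then
  begin
    Result := lkMail;
    Exit;
  end;
  // Drop the fragment, then the query string, so only the path remains.
  P := Pos('#', Url);
  if P > 0 then
    Url := Copy(Url, 1, P - 1);
  P := Pos('?', Url);
  if P > 0 then
    Url := Copy(Url, 1, P - 1);
  Ext := UrlExtension(Url);
  if Ext = '.zip' then
    Result := lkZip
  else if (Ext = '.html') or (Ext = '.htm') then
    Result := lkHtml
  else
    Result := lkOther;
end;

const
  KindNames: array[TLinkKind] of string = ('mail', 'zip', 'html', 'other');
begin
  WriteLn(KindNames[ClassifyLink('MAILTO:someone@example.com')]);
  WriteLn(KindNames[ClassifyLink('http://example.com/files/a.zip')]);
  WriteLn(KindNames[ClassifyLink('page.HTM?id=3')]);
  WriteLn(KindNames[ClassifyLink('http://example.com')]);
end.
```

In the button handler you would call ClassifyLink on HTMLParam.Value whenever HTMLParam.Key is HREF. Note that pages served without an extension (a bare host name, or a script such as .jsp) end up in lkOther here, so you may want to treat those as HTML as well.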

Play around with the code...
P.S. I tried using a stream so I wouldn't have to save
the file to disk, but ran into problems with the
IdHTTP component. I am checking it out now and will
post an update if possible.

There are a couple of other HTML parsers out there that have
events for each tag, but I found they were not as fast.
Listening...
Eddie Shipman, All-around developer
CERTIFIED EXPERT

Commented:
HELLO,
tetsuo21...

Are you there????

Did you try my suggestion???

tetsuo21:
This old question needs to be finalized -- accept an answer, split points, or get a refund.  For information on your options, please click here-> http:/help/closing.jsp#1 
EXPERTS:
Post your closing recommendations!  No comment means you don't care.
CERTIFIED EXPERT

Commented:
No comment has been added lately, so it's time to clean up this TA.
I will leave a recommendation in the Cleanup topic area that this question is:

accept EddieShipman's comment as answer

Please leave any comments here within the next seven days.

PLEASE DO NOT ACCEPT THIS COMMENT AS AN ANSWER!

Thanks,

geobul
EE Cleanup Volunteer
Eddie Shipman, All-around developer
CERTIFIED EXPERT

Commented:
OK, try this, a little easier...

Download this unit:

http://www.euromind.com/iedelphi/download/uiless4.zip

Then do this:

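procedure TForm1.ParseResults(sl: TStringList);
var
  i, last: Integer;
  s: String;
begin
  if sl.Count = 0 then
    Exit;
  // Split the first string into separate lines by reassigning it to Text.
  s := sl.Strings[0];
  sl.Text := s;
  // Show lines 4..44, stopping early if the list is shorter.
  last := sl.Count - 1;
  if last > 44 then
    last := 44;
  for i := 4 to last do
    Memo1.Lines.Add(sl.Strings[i]);
end;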

procedure TForm1.btnGoClick(Sender: TObject);
var
  sh: TUILess;
  sl: TStringList;
begin
  sh := TUILess.Create(nil);
  try
    sl := TStringList.Create;
    try
      // I can't download it at this time,
      // but I think this is the function
      // that returns all the anchors.
      GetAnchorList(sh.Get('http://www.experts-exchange.com'), sl);
      ParseResults(sl);
    finally
      sl.Free;
    end;
  finally
    sh.Free;
  end;
end;
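
Neither snippet in this thread follows the links it finds, which is the recursive part of the question. The following is only a rough sketch of how that could be layered on top of either approach: keep a list of URLs already visited and stop at a maximum depth. GetLinks and Crawl are hypothetical names, and GetLinks returns a small hard-coded set of links so the sketch runs on its own. In a real program it would download the page and collect the HREF values with one of the parsers above.

```pascal
program CrawlSketch;

{$IFDEF FPC}{$MODE DELPHI}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils, Classes;

// Hypothetical stand-in for "download the page and extract its HREF values".
// The URLs are made-up sample data for this sketch.
procedure GetLinks(const Url: string; Links: TStrings);
begin
  Links.Clear;
  if Url = 'http://example.com/' then
  begin
    Links.Add('http://example.com/a.html');
    Links.Add('http://example.com/b.html');
  end
  else if Url = 'http://example.com/a.html' then
  begin
    Links.Add('http://example.com/');          // back link, must not loop forever
    Links.Add('http://example.com/files.zip');
  end;
end;

// Visits Url and then, recursively, every link found on it, down to MaxDepth.
// Visited records every URL already seen so that cycles terminate.
procedure Crawl(const Url: string; Depth, MaxDepth: Integer; Visited: TStrings);
var
  Links: TStringList;
  I: Integer;
begin
  if (Depth > MaxDepth) or (Visited.IndexOf(Url) >= 0) then
    Exit;
  Visited.Add(Url);
  Links := TStringList.Create;
  try
    GetLinks(Url, Links);
    for I := 0 to Links.Count - 1 do
      Crawl(Links[I], Depth + 1, MaxDepth, Visited);
  finally
    Links.Free;
  end;
end;

var
  Visited: TStringList;
  I: Integer;
begin
  Visited := TStringList.Create;
  try
    Crawl('http://example.com/', 0, 2, Visited);
    for I := 0 to Visited.Count - 1 do
      WriteLn(Visited[I]);
  finally
    Visited.Free;
  end;
end.
```

The depth limit is what keeps the crawl in the "neighborhood" of the initial URL. A real crawler would also resolve relative links against the page URL and would only fetch and parse the HTML links, recording zip and mail links without following them.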
