Solved

Parse HTML file and download listed files

Posted on 2004-08-26
Last Modified: 2010-04-04
I thought it would be simple to create an image grabber: a program that displays the source of a web page, extracts all image locations, and downloads those images.
The first step is simple, but the images cannot be downloaded, even though I try to build the correct location.
The example fetches the source of my web page and lists the image locations linked from it.
Then it builds absolute URLs.
But the download does not seem to work.
It might be easy if the images were in the same directory as index.html.
Sometimes I get an image, but the file seems corrupted.

Is an approach like this possible? How should I proceed?

(I tried getting the image info with a TWebBrowser, and that works fine so far. I could build on that, but I would like to understand this manual approach.)
Here is the code:

UNIT ImageGrabberU;

INTERFACE

USES
  Windows, Messages, filectrl, SysUtils, Classes, Graphics, Controls, Forms, Dialogs,
  StdCtrls, OleCtrls, SHDocVw_TLB;

TYPE
  TForm1 = CLASS(TForm)
    btnDownloadImg: TButton;
    Edit1: TEdit;
    Label1: TLabel;
    Memo1: TMemo;
    Label2: TLabel;
    btnGetImages: TButton;
    ListBox1: TListBox;
    Label3: TLabel;
    btnSource: TButton;
    PROCEDURE btnDownloadImgClick(Sender: TObject);
    PROCEDURE btnSourceClick(Sender: TObject);
    PROCEDURE btnGetImagesClick(Sender: TObject);

  Private
    { Private declarations }
    source, sourcepath, dest: STRING;
  Public
    { Public declarations }
  END;

VAR
  Form1: TForm1;

IMPLEMENTATION

{$R *.DFM}

USES
  URLMon, ShellApi, MSHTML_TLB;

FUNCTION DownloadFile(SourceFile, DestFile: STRING): Boolean;
BEGIN
  TRY
    Result := UrlDownloadToFile(NIL, PChar(SourceFile), PChar(DestFile), 0, NIL) = 0;
  EXCEPT
    Result := False;
  END;
END;

FUNCTION ExtractUrlFileName(CONST AUrl: STRING): STRING;
VAR
  i, po: Integer; s: STRING;
BEGIN
  i := LastDelimiter('/', AUrl);
  result := uppercase(Copy(AUrl, i + 1, Length(AUrl) - (i)));
END;

FUNCTION ExtractUrlFilePath(CONST AUrl: STRING): STRING;
VAR
  i, po: Integer; s: STRING;
BEGIN
  i := LastDelimiter('/', AUrl);
  s := uppercase(Copy(AUrl, 1, i - 1));
  result := (s);
END;

PROCEDURE TForm1.btnDownloadImgClick(Sender: TObject);
VAR i: integer; s, entry, d, SourceFile, DestFile: STRING;
BEGIN
  //this should get the ../out of the filename and return an absolute url
  FOR i := 0 TO listbox1.Items.count - 1 DO
  BEGIN
    sourcepath := source;
    entry := listbox1.items[i];
    IF pos('../', entry) > 0 THEN
    BEGIN
      Sourcepath := ExtractUrlFilePath(Sourcepath);
      WHILE pos('../', entry) > 0 DO
      BEGIN
        delete(entry, 1, 3);
        Sourcepath := ExtractUrlFilePath(Sourcepath);
      END;
      sourcefile := Sourcepath + '/' + entry;
    END
    ELSE SourceFile := sourcepath + listbox1.Items[i];

    d := includetrailingbackslash(extractfilepath(application.exename)) + 'Img';
    IF NOT DirectoryExists(d) THEN createDir(d);

    DestFile := includetrailingbackslash(d) +
      'Img' + inttostr(I) + extractfileext(listbox1.Items[i]);

    DownloadFile(SourceFile, DestFile);

  END;
END;



PROCEDURE TForm1.btnSourceClick(Sender: TObject);
VAR n: STRING;
BEGIN
  source := edit1.text;
  n := ExtractUrlFileName(source);
  sourcepath := copy(source, 1, length(source) - length(N));

  dest := includetrailingbackslash(extractfilepath(application.exename)) +
    ExtractUrlFileName(source);
  IF DownloadFile(Source, Dest) THEN
    memo1.Lines.LoadFromFile(dest);
END;

FUNCTION ExtractImgFileName(CONST AUrl: STRING): STRING;
VAR
  i, po, pb: Integer; s, subs: STRING;
BEGIN
  s := uppercase(aURL);
  subs := '<A HREF="';
  Po := pos(subs, s);
  IF po > 0 THEN
  BEGIN
    delete(s, 1, po + length(subs) - 1);

    pb := 0;
    s := uppercase(S);
    IF pos('.GIF"', s) > 0 THEN pb := pos('.GIF"', s) ELSE
      IF pos('.JPG"', s) > 0 THEN pb := pos('.JPG"', s) ELSE
        IF pos('.BMP"', s) > 0 THEN pb := pos('.BMP"', s);


    IF pb > 0 THEN
    BEGIN
      delete(s, pb + 4, length(s));
      result := S;
    END;

  END
  ELSE result := '';
END;

PROCEDURE TForm1.btnGetImagesClick(Sender: TObject);
VAR Ln, I, Po, st: integer; s, Nm: STRING;
BEGIN
  listbox1.Clear;
  Ln := memo1.lines.Count;
  WITH memo1 DO
    FOR i := 0 TO Ln - 1 DO
    BEGIN
      s := lines[i];
      Nm := ExtractImgFileName(s);
      IF Nm <> '' THEN listbox1.Items.Add(
          Nm);
    END;
END;


END.

//form
object Form1: TForm1
  Left = 109
  Top = 23
  Width = 668
  Height = 490
  Caption = 'Form1'
  Color = clBtnFace
  Font.Charset = DEFAULT_CHARSET
  Font.Color = clWindowText
  Font.Height = -11
  Font.Name = 'MS Shell Dlg 2'
  Font.Style = []
  OldCreateOrder = False
  PixelsPerInch = 96
  TextHeight = 13
  object Label1: TLabel
    Left = 48
    Top = 24
    Width = 49
    Height = 13
    Caption = 'Paste URL'
  end
  object Label2: TLabel
    Left = 144
    Top = 72
    Width = 88
    Height = 13
    Caption = 'HTML Page source'
  end
  object Label3: TLabel
    Left = 48
    Top = 376
    Width = 71
    Height = 26
    Caption = 'Extract Image locations'
    WordWrap = True
  end
  object btnDownloadImg: TButton
    Left = 560
    Top = 344
    Width = 75
    Height = 25
    Caption = 'Download Img'
    TabOrder = 0
    OnClick = btnDownloadImgClick
  end
  object Edit1: TEdit
    Left = 40
    Top = 40
    Width = 601
    Height = 21
    TabOrder = 1
    Text = 'http://www.hushpage.net/Fun/Gallery/gallery.html'
  end
  object Memo1: TMemo
    Left = 40
    Top = 96
    Width = 601
    Height = 209
    Lines.Strings = (
      'Memo1')
    ScrollBars = ssBoth
    TabOrder = 2
  end
  object btnGetImages: TButton
    Left = 40
    Top = 344
    Width = 75
    Height = 25
    Caption = 'Extract Img'
    TabOrder = 3
    OnClick = btnGetImagesClick
  end
  object ListBox1: TListBox
    Left = 128
    Top = 344
    Width = 409
    Height = 105
    ItemHeight = 13
    TabOrder = 4
  end
  object btnSource: TButton
    Left = 40
    Top = 64
    Width = 75
    Height = 25
    Caption = 'get source'
    TabOrder = 5
    OnClick = btnSourceClick
  end
end
Question by:hush021299
15 Comments
 
LVL 26

Expert Comment

by:EddieShipman
ID: 11932744
Much easier:

uses ..., MSHTML {mshtml2_TLB if using D5 or below};

var
  i: integer;
  ovImages: Variant;
  ovImage: Variant;
  sImageSrc: String;
begin
  // Get ALL IMG tags
  ovImages := WebBrowser1.OleObject.Document.all.tags('IMG');
  for i := 0 to (ovImages.Length - 1) do
  begin
    // iterate and get each img tag's src and add it to the memo.
    ovImage := ovImages.Item(i);
    sImageSrc := ovImage.src;
    // skip spacer.gif images.
    if Pos('spacer.gif', sImageSrc) = 0 then
    begin
      // even though on my site I have relative URL's,
      // the image src attribute contains full path info.
      // may need to modify that depending upon your server.
      {
      if Pos('http', sImageSrc) = 0 then
        sImageSrc := WebBrowser1.OleObject.Document.Location + '/' + sImageSrc;
      }
      Memo1.Lines.Add(sImageSrc);
    end;
  end;
end;
 
LVL 26

Expert Comment

by:EddieShipman
ID: 11932786
You can also create an IHTMLDocument2 directly and assign the source like this:

uses ..., ActiveX, ComObj, MSHTML, Variants; // Variants is needed for VarArrayCreate in D6 and later
var
  IDoc:         IHTMLDocument2;
  v:            Variant;
  HTML:         String;
  i:            Integer;
  ovImages:     Variant;
  ovImage:      Variant;
  sImageSrc:    String;
begin
  Idoc:=CreateComObject(Class_HTMLDocument) as IHTMLDocument2;
  try
    IDoc.designMode:='on';
    while IDoc.readyState<>'complete' do
      Application.ProcessMessages;
    v:=VarArrayCreate([0,0],VarVariant);
    v[0]:= HTML; // HTML must hold the page source before this point
    IDoc.write(PSafeArray(System.TVarData(v).VArray));
    IDoc.designMode:='off';
    while IDoc.readyState<>'complete' do
      Application.ProcessMessages;
    // Get ALL IMG tags
    ovImages := IDoc.all.tags('IMG');
    for i := 0 to (ovImages.Length - 1) do
    begin
      // iterate and get each img tag's src and add it to the memo.
      ovImage := ovImages.Item(i);
      sImageSrc := ovImage.src;
      // skip spacer.gif images.
      if Pos('spacer.gif', sImageSrc) = 0 then
      begin
        // even though on my site I have relative URL's,
        // the image src attribute contains full path info.
        // may need to modify that depending upon your server.
        {
        if Pos('http', sImageSrc) = 0 then
          sImageSrc := WebBrowser1.OleObject.Document.Location + '/' + sImageSrc;
        }
        Memo1.Lines.Add(sImageSrc);
      end;
    end;
  finally
    IDoc := nil;
  end;
end;

 
LVL 1

Author Comment

by:hush021299
ID: 11977433
Thank you for your answer.
I have not had time to try your code yet.
I will come back to this next week.

Cheers
hh
 
LVL 1

Author Comment

by:hush021299
ID: 12062272
Hello,
I have now tried the first code sample. Basically I had this solution in mind as a backup.
But there are reasons why I want to find a manual solution.

First, I would like to understand how this stuff works.
Second, I want to get the images from the links as well (which should also be possible with your approach).
Third, I do not actually want to browse to the page with the web browser.

But browsing is exactly what I do when using TWebBrowser. So on a slow connection it is like browsing. And if I choose not to show images in the IE options, I do not even get them with the web browser approach.

The basic idea was to understand the underlying link system of a page and walk through the hierarchy, e.g. listing all the zip files on my web page with their date and size, everything in text form.

It is a bit like those offline viewers. I used them a while ago to collect Delphi tips on my local machine. Now, however, I would like to get a list of the absolute locations of files in nested URLs.

Do you think this is possible? And also with TWebBrowser? (I can get the date and size of all listed images with TWebBrowser, but again I would prefer not to browse.)
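
For plain <img src="..."> tags, the manual approach can work without TWebBrowser. The following is only a minimal sketch, assuming the page source is already in a string and that every src value is quoted. ExtractImgSources is a hypothetical helper name, not part of any library. Unlike the unit in the question, it searches an upper-cased copy but copies each value from the original text, so the case of the URL is preserved (paths are case-sensitive on many servers).

```pascal
uses SysUtils;

// Hypothetical helper: returns the src value of every <img> tag in Html,
// one per line (CR/LF separated). Unquoted src values are not handled.
function ExtractImgSources(const Html: string): string;
var
  Upper: string;
  TagStart, TagEnd, p, q: Integer;
  Quote: Char;
begin
  Result := '';
  Upper := UpperCase(Html);  // used for searching only; same length as Html
  TagStart := 1;
  while TagStart <= Length(Upper) do
  begin
    // Find the next '<IMG' at or after TagStart.
    p := Pos('<IMG', Copy(Upper, TagStart, MaxInt));
    if p = 0 then
      Break;
    TagStart := TagStart + p - 1;
    // The tag ends at the next '>' (or at the end of the text).
    TagEnd := TagStart;
    while (TagEnd <= Length(Upper)) and (Upper[TagEnd] <> '>') do
      Inc(TagEnd);
    // Look for SRC= inside this tag only.
    p := Pos('SRC=', Copy(Upper, TagStart, TagEnd - TagStart));
    if p > 0 then
    begin
      p := TagStart + p - 1 + 4;  // first character after 'SRC='
      if (p < TagEnd) and ((Html[p] = '"') or (Html[p] = '''')) then
      begin
        Quote := Html[p];
        Inc(p);
        q := p;
        while (q < TagEnd) and (Html[q] <> Quote) do
          Inc(q);
        if q > p then
          Result := Result + Copy(Html, p, q - p) + #13#10;
      end;
    end;
    TagStart := TagEnd + 1;
  end;
end;
```

With the form from the question, the list box could then be filled with ListBox1.Items.Text := ExtractImgSources(Memo1.Lines.Text); the same scan could be adapted to <a href="..."> for zip files or other links. Real-world HTML has many more variations (unquoted attributes, spaces around '=', scripts), which is where a real parser earns its keep.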
 
LVL 1

Author Comment

by:hush021299
ID: 12062360
I tried the code for the webpage.
VarArrayCreate not found!

Could you provide this code too?

Thanks
hH
 
LVL 1

Author Comment

by:hush021299
ID: 12062681
Sorry, in D7 I had to add Variants to the uses clause.
It looks interesting so far; only the links do not work yet.
I will try to understand the location issue later.

By the way, this is how I want to look up image info:
var
  i: Integer;
  ImageWidth, ImageHeight: Integer;
  ImageHref, ImageFileSize, ImageTextAlternative: string;
  Document: IHtmlDocument2;
begin
  // Get the typed document interface once, before the loop
  Document := WebBrowser1.Document as IHtmlDocument2;
  // Loop through all images of a TWebbrowser
  for i := 0 to WebBrowser1.OleObject.Document.Images.Length - 1 do
  begin
    // Retrieves the calculated width of the image.
    ImageWidth := WebBrowser1.OleObject.Document.Images.Item(i).Width;
    // Retrieves the height of the image.
    ImageHeight := WebBrowser1.OleObject.Document.Images.Item(i).Height;
    // Retrieves the file size of the image.
    ImageFileSize := (Document.Images.Item(i, 0) as IHTMLImgElement).FileSize;
    // Retrieves the entire URL that the browser uses to locate the image
    ImageHref := (Document.Images.Item(i, 0) as IHTMLImgElement).Href;
    // Retrieves a text alternative to the graphic.
    ImageTextAlternative := (Document.Images.Item(i, 0) as IHTMLImgElement).alt;
    // Show image information in a TListbox
    ListBox2.Items.Add(Format('%s : %d x %d Pixels; %s Bytes; %s',
      [ImageHref, ImageWidth, ImageHeight, ImageFileSize, ImageTextAlternative]));
  end;
  label4.caption:=inttostr(listbox2.items.count);
end;
 
LVL 26

Expert Comment

by:EddieShipman
ID: 12064114
There is another option that I forgot about.
Download the Extended IEParser v2 from the Delphi-WebBrowser group:

http://f2.grp.yahoofs.com/v1/0DxIQceLxjLqtOcIg0YyDLLQ0WA6fCpN9U0Yokv1DxPN8oVj-WHSko8ryG2L7zy_iQBnus99vj11_8-kzIdzky8v27ndKKIQ2mIyoHDtWySV/extended%20ieparser%20v2.%20zip

You need to be a member but it is free to sign up.

I have used it before to parse HTML and it is very fast and easy to use.

 
LVL 1

Author Comment

by:hush021299
ID: 12072443
I have tried to log in, but there is still no confirmation email.
Can you send me the file?
hush4@web.de

Do you think we could get this running as described above? If I can do it without TWebBrowser, I can increase the points.
 
LVL 1

Author Comment

by:hush021299
ID: 12073032
Now I have tried the second approach. The idea is great.
It partly works nicely.
But it does not give me the correct image source,
e.g.
about:blank./clearpixel.gif
rather than www.hushpage.com/clearpixel.gif
and
when I try to load
www.hushpage.net it never stops running.

A problem for me is that I cannot find help on IHTMLDocument2 in Delphi. Even code completion partly does not show up,
e.g.
 ovImage := ovImages.Item(i);
      sImageSrc := ovImage.src;
Delphi does not offer src when I type ovImage. ???
 
LVL 26

Expert Comment

by:EddieShipman
ID: 12075094
See the comment:

  // even though on my site I have relative URL's,
  // the image src attribute contains full path info.
  // may need to modify that depending upon your server.

Now, I don't know if you can get the document.location from the IHTMLDocument2
object, especially if you are loading from a file. It looks like you are doing that, right?

Do you have a particular site that you want to retrieve the images from?

Delphi won't give you tooltip evaluation on ovImage because it is a Variant.

The help for IHTMLDocument2 is located on msdn:
http://msdn.microsoft.com/workshop/browser/mshtml/reference/ifaces/document2/document2.asp
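
When the HTML is written into the document from a string, the thread shows src values coming back relative to about:blank. If the URL of the page is known, one way around this is to resolve each relative src against it by hand. The following is only a minimal sketch: ResolveUrl is a hypothetical helper, it assumes BaseUrl is absolute (including 'http://'), and it ignores query strings, fragments, and protocol-relative references such as '//host/x'.

```pascal
uses SysUtils;

// Hypothetical helper: resolves Ref against the URL of the page it was
// found on. Handles absolute URLs, root-relative paths, './' and '../'.
function ResolveUrl(const BaseUrl, Ref: string): string;
var
  HostEnd, i: Integer;
  Root, Path, Rest, Seg: string;
begin
  // Already absolute: nothing to do.
  if Pos('://', Ref) > 0 then
  begin
    Result := Ref;
    Exit;
  end;

  // Split the base into 'scheme://host' (Root) and the path after it.
  HostEnd := Pos('://', BaseUrl) + 3;
  while (HostEnd <= Length(BaseUrl)) and (BaseUrl[HostEnd] <> '/') do
    Inc(HostEnd);
  Root := Copy(BaseUrl, 1, HostEnd - 1);
  Path := Copy(BaseUrl, HostEnd, MaxInt);
  if Path = '' then
    Path := '/';

  // Root-relative references replace the whole path; all others are
  // appended to the directory of the page.
  if (Ref <> '') and (Ref[1] = '/') then
    Rest := Ref
  else
    Rest := Copy(Path, 1, LastDelimiter('/', Path)) + Ref;

  // Walk the segments: skip '.', let '..' remove the previous directory.
  Result := '';
  while Rest <> '' do
  begin
    i := Pos('/', Rest);
    if i = 0 then
      i := Length(Rest);
    Seg := Copy(Rest, 1, i);  // keeps the trailing '/' if there is one
    Delete(Rest, 1, i);
    if (Seg = '../') or (Seg = '..') then
    begin
      if Length(Result) > 1 then  // never climb above the server root
      begin
        SetLength(Result, Length(Result) - 1);
        SetLength(Result, LastDelimiter('/', Result));
      end;
    end
    else if (Seg <> './') and (Seg <> '.') then
      Result := Result + Seg;
  end;
  Result := Root + Result;
end;
```

For example, ResolveUrl('http://www.hushpage.net/Fun/Gallery/gallery.html', '../img/pic.gif') gives 'http://www.hushpage.net/Fun/img/pic.gif' (pic.gif is just an illustrative name). Note that nothing is upper-cased, so the server sees the path exactly as the page wrote it.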
 
LVL 26

Expert Comment

by:EddieShipman
ID: 12323842
Are you loading the source from a website into the IHTMLDocument2? If so, how?
 
LVL 1

Author Comment

by:hush021299
ID: 12327907
I am not sure.
I tried both of your code samples.
The second one is the one with the IHTMLDocument2.
I did not know about that before.
So maybe my answer is not clear.
I would like to get the page info without loading the page in an Internet Explorer derived component. If I do that, I can only browse. The whole idea was born out of my dial-up connection. Let's say I want to get all PAD files from my homepage. Then I want to enter the URL, receive a list of the PAD files (XML), and maybe tick some of them to download. Basically this might be the idea for a new program, not a required feature for a program I am working on. So I am not in a hurry.

So I tried first with the memo, but I could not generate the proper URLs from the homepage, even though the information must be in the HTML file. This is still the thing I would most like to understand. If the HTML page points to another page on the same server, I will be able to extract the URL. So I could get all HTML files of my page. Unfortunately, on my other homepage I cannot get this running either.
By the way, you asked for a page:
1) www.hushpage.com   (this is the one that works)
2) www.hushpage.net    (this is the one that makes trouble)

When I tried the Internet Explorer ActiveX component, I realized that I only get what I see.

Now, with your IHTMLDocument2 I think we are on the right track. But I have been stuck since last time (I was very busy and had almost forgotten about this one, I have to admit).

Do you have any idea based on the IHTMLDocument2?

 
LVL 26

Accepted Solution

by:
EddieShipman earned 125 total points
ID: 12476607
OK, what I'm suggesting is to get the HTML source code using TIdHTTP and then assign it to
the IHTMLDocument2 as in my second example. Since you will be fetching the page
over HTTP, you already know its location.

I show how in this post (I'm MrBaseball34 on Delphi Pages):

http://www.delphipages.com/threads/thread.cfm?ID=124006&G=123839
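
The fetching step described above might look roughly like this. It is a minimal sketch, assuming Indy's TIdHTTP is installed; FetchHtml and DownloadToFile are hypothetical helper names, and error handling is left to the caller. The second routine writes the response straight into a file stream, so image bytes never pass through a string.

```pascal
uses Classes, SysUtils, IdHTTP;

// Hypothetical helper: returns the HTML source of AUrl as a string,
// without any browser control. AUrl must include the 'http://' prefix.
function FetchHtml(const AUrl: string): string;
var
  Http: TIdHTTP;
begin
  Http := TIdHTTP.Create(nil);
  try
    Http.HandleRedirects := True;  // follow redirect responses
    Result := Http.Get(AUrl);      // raises an exception on failure
  finally
    Http.Free;
  end;
end;

// Hypothetical helper: saves the resource at AUrl to AFileName as raw bytes.
procedure DownloadToFile(const AUrl, AFileName: string);
var
  Http: TIdHTTP;
  Stream: TFileStream;
begin
  Http := TIdHTTP.Create(nil);
  try
    Stream := TFileStream.Create(AFileName, fmCreate);
    try
      Http.Get(AUrl, Stream);
    finally
      Stream.Free;
    end;
  finally
    Http.Free;
  end;
end;
```

The string returned by FetchHtml can be passed as the HTML source in the IHTMLDocument2 example, and the URL given to FetchHtml is the location to resolve relative image paths against.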
 
LVL 1

Author Comment

by:hush021299
ID: 12503511
Another question:
I have downloaded the IEParser 2 component.
At the moment I can only use it in Delphi 5 (because of the Desgninf.pas).
Your example works in D7.
Do you have an example for this IEParser in D5, or can I also install it in D7?
 
LVL 26

Expert Comment

by:EddieShipman
ID: 12504409
Rename it DsgIntf, I think, and it should work for D5.