matelindonesia

asked on

Save a complete HTML page programmatically

Hi experts,

I want to write a program that uses the EmbeddedWB component to save a complete web page. How should I write the code to save the whole page, including its images? So far I have only found the SaveToFile procedure.

Thanks
Sirro
Hello Sir,

   Kindly go through the following site for information and a free download of an ActiveX control that takes a web screen snapshot of a given URL.

    http://www.ziplib.com/_software/Development--Active_X/download_4.html

with regards,
padmaja.
matelindonesia

ASKER

I've tried saving to an MHT file, but how can I convert it to HTML and images? If I save as an MHT file, it can't be opened on another computer because, as far as I know, an MHT file links into the IE cache.
I assume you never looked into the EmbeddedWB sources.

Have a look there.

Cheers,

Andrew
Eddie Shipman
matelindonesia,
So, you want to be able to open this "saved" website on another computer.
You have a couple of options. There is MHT or Mozilla Archive Format (MAF).
These two allow you to save the entire site in a single file. While I don't know
that much about the MAF, I do know about creating an MHT.

Basically, what is happening when an MHT is being created is that you are
downloading the source, parsing it for links, images, urls, etc, and downloading
each of them. Now when an image is downloaded, it is MIME encoded and the
image data is essentially a part of the MHT file. I have an example at home I can
post later to show you how it is done. I don't have Mozilla installed so I don't know
how it would handle MHT opened as HTML.

I am finishing up on a Delphi conversion of a MHT Builder that I found on CodeProject.com.
http://www.codeproject.com/vb/net/MhtBuilder.asp
It works exactly like IE in saving a website as a single file. I haven't worked on it in a couple
of months but am ready to get started again. I am about 60% finished.

I'm actually surprised that Mozilla hasn't embraced the MHT format instead of building their own.
MHT is actually based on a standard, RFC 2557, "MIME Encapsulation of Aggregate Documents,
such as HTML (MHTML)". http://www.ietf.org/rfc/rfc2557.txt
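
A minimal sketch of the MIME-encoding step described above: an image that has already been downloaded is base64-encoded and wrapped in one part of the multipart MHT file. The names Base64Encode and MhtImagePart are hypothetical helpers written for illustration, not part of EmbeddedWB or the MhtBuilder project, and the boundary string is whatever the caller chose for the file.

```pascal
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

// Encodes raw bytes (held in an AnsiString) as base64 text.
function Base64Encode(const Data: AnsiString): AnsiString;
const
  Alphabet: AnsiString =
    'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';
var
  i, n: Integer;
  b0, b1, b2: Byte;
begin
  Result := '';
  n := Length(Data);
  i := 1;
  while i <= n do
  begin
    // take up to three input bytes; missing ones count as zero
    b0 := Ord(Data[i]);
    b1 := 0;
    b2 := 0;
    if i + 1 <= n then b1 := Ord(Data[i + 1]);
    if i + 2 <= n then b2 := Ord(Data[i + 2]);
    Result := Result + Alphabet[(b0 shr 2) + 1]
                     + Alphabet[(((b0 and 3) shl 4) or (b1 shr 4)) + 1];
    // pad with '=' where the input ran out
    if i + 1 <= n then
      Result := Result + Alphabet[(((b1 and 15) shl 2) or (b2 shr 6)) + 1]
    else
      Result := Result + '=';
    if i + 2 <= n then
      Result := Result + Alphabet[(b2 and 63) + 1]
    else
      Result := Result + '=';
    Inc(i, 3);
  end;
end;

// Builds one MIME part for an image. Location is the image's original URL,
// which is how the HTML part of the MHT file refers to it.
function MhtImagePart(const Boundary, ContentType, Location: string;
  const ImageData: AnsiString): string;
var
  encoded: string;
  p: Integer;
begin
  Result := '--' + Boundary + #13#10 +
            'Content-Type: ' + ContentType + #13#10 +
            'Content-Transfer-Encoding: base64' + #13#10 +
            'Content-Location: ' + Location + #13#10 + #13#10;
  encoded := string(Base64Encode(ImageData));
  // MIME limits base64 lines to 76 characters
  p := 1;
  while p <= Length(encoded) do
  begin
    Result := Result + Copy(encoded, p, 76) + #13#10;
    Inc(p, 76);
  end;
end;
```

A complete MHT file would also need the top-level multipart/related headers, the HTML part, and the closing boundary line; this sketch covers only a single image part.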





Hi eddie,
Let me explain the problem. First, I need a function that saves a web page including all of its images, but I couldn't find any code that does this without showing the IE Save As dialog, and I want to build my own Save As dialog. So my solution was to save the page as an MHT file, because according to an article I read it also saves the images. Fortunately, I found code for that:

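// Saves the page currently shown in WB as a single MHT file using CDO.
// Assumes the imported type libraries CDO_TLB (IMessage, IConfiguration) and
// ADODB_TLB (_Stream) are in the uses clause.
// Note: CreateMHTMLBody fetches the URL again itself; it does not reuse the
// copy of the page already loaded in WB.
procedure WB_SaveAs_MHT(WB: TEmbeddedWB;
  const FileName: string);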
var
  Msg: IMessage;
  Conf: IConfiguration;
  Stream: _Stream;
  URL : widestring;
begin
  if not Assigned(WB.Document) then Exit;
  URL := WB.LocationURL;

  Msg := CoMessage.Create;
  Conf := CoConfiguration.Create;
  try
    Msg.Configuration := Conf;
    Msg.CreateMHTMLBody(URL, cdoSuppressAll, '', '');
    Stream := Msg.GetStream;
    Stream.SaveToFile(FileName, adSaveCreateOverWrite);
  finally
    Msg := nil;
    Conf := nil;
    Stream := nil;
  end;
end; (* WB_SaveAs_MHT *)

But the problem is that when I open the MHT file on another computer (using IE, of course), it shows only the HTML page, without the images.
Eddie, can you tell me why that happens? Is there any code to convert an MHT file back into a full HTML page? If your project solves exactly my second question, I would be pleased to hear about its progress :)


best regards
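
One likely cause, offered as an assumption rather than a confirmed diagnosis: the cdoSuppressAll flag passed to CreateMHTMLBody tells CDO to leave out images, style sheets, frames and other resources, so only the HTML ends up in the file. A minimal variant using cdoSuppressNone is sketched below. The procedure name WB_SaveAs_MHT_WithResources is made up for illustration, and the code assumes the same CDO_TLB and ADODB_TLB type-library imports as the procedure in the question.

```pascal
// Assumes: uses EmbeddedWB, CDO_TLB, ADODB_TLB (imported type libraries).
procedure WB_SaveAs_MHT_WithResources(WB: TEmbeddedWB;
  const FileName: string);
var
  Msg: IMessage;
  Conf: IConfiguration;
  Stream: _Stream;
begin
  if not Assigned(WB.Document) then Exit;

  Msg := CoMessage.Create;
  Conf := CoConfiguration.Create;
  Msg.Configuration := Conf;
  // cdoSuppressNone: let CDO download and embed images, style sheets, etc.
  Msg.CreateMHTMLBody(WB.LocationURL, cdoSuppressNone, '', '');
  Stream := Msg.GetStream;
  Stream.SaveToFile(FileName, adSaveCreateOverWrite);
  // the interface references are released automatically when they go out of scope
end;
```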

ASKER CERTIFIED SOLUTION
Eddie Shipman
You can also use this code from this delphi3000.com article. I have not tested it, however.

      Save a webpage with images
      URL:http://www.delphi3000.com/article.asp?ID=3464
      Category:Internet / Web
      Uploader:Ken Wilcox

      Question:Ever wanted to duplicate the functionality of your favorite
      browser and save a web page with images to disk, well here is a simple
      example that does just that. I've created two functions, the other
      function just lets you pass in a progress bar to show the status of
      the operation.

      Please note: It requires Indy to run.
     
unit URLGet;

interface

uses
  Classes, SysUtils, Forms, IdHTTP, ComCtrls;

procedure UrlDownloadToFile(URL, FileName: String); overload;

procedure UrlDownloadToFile(URL, FileName: String; PB: TProgressBar); overload;

implementation

// Collects the value of every SRC attribute found in the HTML into Images.
// Any tag carrying SRC= is matched (img, script, iframe, ...), not only <img>.
procedure GetImages(html: String; Images: TStringList);
var
  i, j, p: Integer;
  tag, link: String;
  quote: Char;
begin
  html := StringReplace(html, #13#10, ' ', [rfReplaceAll]);
  i := 1;
  while i <= Length(html) do
  begin
    // we have a begin tag
    if html[i] = '<' then
    begin
      // copy everything up to (not including) the closing '>'
      tag := '';
      while (i <= Length(html)) and (html[i] <> '>') do
      begin
        tag := tag + html[i];
        inc(i);
      end;

      // we have the tag, see if it carries a SRC attribute
      link := '';
      p := Pos('SRC=', UpperCase(tag));
      if p <> 0 then
      begin
        j := p + 4; // first character after 'SRC='
        if (j <= Length(tag)) and ((tag[j] = '"') or (tag[j] = '''')) then
        begin
          // quoted value: read up to the matching quote
          quote := tag[j];
          inc(j);
          while (j <= Length(tag)) and (tag[j] <> quote) do
          begin
            link := link + tag[j];
            inc(j);
          end;
        end
        else
          // unquoted value: read up to the next space
          while (j <= Length(tag)) and (tag[j] <> ' ') do
          begin
            link := link + tag[j];
            inc(j);
          end;

        // skip duplicates so each file is downloaded and rewritten only once
        if (link <> '') and (Images.IndexOf(link) < 0) then
          Images.Add(link);
      end;
    end;
    inc(i);
  end;
end;

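procedure UrlDownloadToFile(URL, FileName: String);
var
  s, dir, path, imgURL, localName: String;
  i: Integer;
  ms: TMemoryStream;
  imgs, sFile: TStringList;
  HTTP: TIdHTTP;
begin
  imgs := TStringList.Create;
  HTTP := TIdHTTP.Create(Application);
  sFile := TStringList.Create;
  try
    s := HTTP.Get(URL);
    if (s <> '') and (FileName <> '') then
    begin
      // images are stored in "<FileName without extension>_files"
      path := ChangeFileExt(FileName, '') + '_files';
      CreateDir(path);
      // relative folder name used inside the saved HTML
      dir := ExtractFileName(path) + '/';
      GetImages(s, imgs);
      ms := TMemoryStream.Create;
      try
        for i := 0 to pred(imgs.Count) do
        begin
          // absolute links are used as-is; relative ones are appended to URL,
          // which assumes URL ends with '/' (no full relative-URL resolution)
          if Pos('://', imgs[i]) > 0 then
            imgURL := imgs[i]
          else
            imgURL := URL + imgs[i];
          // local file name = part of the link after the last slash;
          // links with query strings may still yield invalid file names
          localName := Copy(imgs[i], LastDelimiter('/\', imgs[i]) + 1, MaxInt);
          if localName = '' then
            Continue;
          ms.Clear;
          HTTP.Get(imgURL, ms);
          if ms.Size <> 0 then
          begin
            ms.SaveToFile(path + '\' + localName);
            // point the saved HTML at the local copy
            s := StringReplace(s, imgs[i], dir + localName, [rfReplaceAll]);
          end;
        end;
      finally
        FreeAndNil(ms);
      end;

      sFile.Text := s;
      sFile.SaveToFile(FileName);
    end;
  finally
    FreeAndNil(sFile);
    FreeAndNil(HTTP);
    FreeAndNil(imgs);
  end;
end;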

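procedure UrlDownloadToFile(URL, FileName: String; PB: TProgressbar);
overload;
var
  s, dir, path, imgURL, localName: String;
  i: Integer;
  ms: TMemoryStream;
  imgs, sFile: TStringList;
  HTTP: TIdHTTP;
begin
  if not Assigned(PB) then
  begin
    // no progress bar supplied: fall back to the plain overload
    UrlDownloadToFile(URL, FileName);
    Exit;
  end;

  imgs := TStringList.Create;
  HTTP := TIdHTTP.Create(Application);
  sFile := TStringList.Create;
  try
    s := HTTP.Get(URL);
    if (s <> '') and (FileName <> '') then
    begin
      // images are stored in "<FileName without extension>_files"
      path := ChangeFileExt(FileName, '') + '_files';
      CreateDir(path);
      // relative folder name used inside the saved HTML
      dir := ExtractFileName(path) + '/';
      GetImages(s, imgs);
      ms := TMemoryStream.Create;
      try
        // one step per image; avoids Max = -1 when no images are found
        PB.Max := imgs.Count;
        for i := 0 to pred(imgs.Count) do
        begin
          PB.Position := i + 1;
          Application.ProcessMessages;
          // absolute links are used as-is; relative ones are appended to URL,
          // which assumes URL ends with '/' (no full relative-URL resolution)
          if Pos('://', imgs[i]) > 0 then
            imgURL := imgs[i]
          else
            imgURL := URL + imgs[i];
          // local file name = part of the link after the last slash
          localName := Copy(imgs[i], LastDelimiter('/\', imgs[i]) + 1, MaxInt);
          if localName = '' then
            Continue;
          ms.Clear;
          HTTP.Get(imgURL, ms);
          if ms.Size <> 0 then
          begin
            ms.SaveToFile(path + '\' + localName);
            // point the saved HTML at the local copy
            s := StringReplace(s, imgs[i], dir + localName, [rfReplaceAll]);
          end;
        end;
      finally
        FreeAndNil(ms);
      end;
      sFile.Text := s;
      sFile.SaveToFile(FileName);
    end;
  finally
    FreeAndNil(sFile);
    FreeAndNil(HTTP);
    FreeAndNil(imgs);
  end;
end;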

end.
Hi Eddie, thanks for your comment, I really appreciate it.
"..That may be a pretty cool little utility": yes, I think so too. I only need to save a single web page, so first I will try your preferred solution (http://www.delphi3000.com/article.asp?ID=3464), and if it goes well I will use it. I still hope there is a procedure or function that converts an MHT file into HTML with all of its images, because I don't want to fetch the files (HTML + images) twice (first when browsing, second when saving the page).

Thanks
Well, you can, instead of using the URLDownloadToFile, get the info from the cache but
it is difficult determining what is what in there.

You can take out the idHTTP stuff and assign the source of the TEmbeddedWB to a string
and just use the same string. You would, however, be required to retrieve all the images from
the web again.

I'll see if I can find anything on getting the correct images from the cache.
OK, you can use the IECache utilities from http://www.euromind.com/iedelphi/iecache.htm
to get the info for each entry in the cache. This way, you can iterate through them and
find the image you want by checking the URL against the URL from the HTML. Then
just copy the file to another directory and modify the URL in the HTML to show the new
location in the img tag's SRC attribute.

Look specifically at the GetEntryInfo link on the left side of the page.
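
Cache entries are keyed by absolute URL, while the SRC values in the HTML are often relative, so the comparison described above needs the links resolved against the page URL first. The following is a simplified sketch of that step; ResolveLink is a hypothetical helper, and it deliberately ignores '..' segments, protocol-relative links ('//host/...'), query strings containing slashes, and <base> tags.

```pascal
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
uses
  SysUtils;

// Turns a link found in the HTML into an absolute URL, given the page's URL.
function ResolveLink(const PageURL, Link: string): string;
var
  schemeEnd, hostEnd, lastSlash: Integer;
  root: string;
begin
  // already absolute, or no usable page URL: return the link unchanged
  schemeEnd := Pos('://', PageURL);
  if (Pos('://', Link) > 0) or (schemeEnd = 0) then
  begin
    Result := Link;
    Exit;
  end;

  // root = scheme plus host, e.g. http://example.com
  hostEnd := schemeEnd + 3;
  while (hostEnd <= Length(PageURL)) and (PageURL[hostEnd] <> '/') do
    Inc(hostEnd);
  root := Copy(PageURL, 1, hostEnd - 1);

  if (Link <> '') and (Link[1] = '/') then
    // root-relative link
    Result := root + Link
  else
  begin
    // document-relative link: append to the page URL's folder
    lastSlash := LastDelimiter('/', PageURL);
    if lastSlash < hostEnd then
      Result := root + '/' + Link
    else
      Result := Copy(PageURL, 1, lastSlash) + Link;
  end;
end;
```

The resolved URL can then be compared with the URL of each cache entry, and the original (unresolved) SRC text is what gets replaced in the HTML once the file has been copied.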
OK Eddie, while waiting for your additional help, I'll try to fix some bugs I found in UrlDownloadToFile.

regards
Phew... there is too much tag checking to be done; it is more difficult than I thought.
OK Eddie, I've used the DOM to get the image tags, but what about the background attribute, which can also be an image?
Get the background attribute of the body tag (IHTMLBodyElement).
If you get it working, I'd like to see the finished results, please...
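
A minimal sketch of reading that attribute through the DOM, assuming the MSHTML unit that ships with Delphi. GetBodyBackground is a hypothetical helper name; it takes the browser's Document property (an IDispatch) so that it does not depend on a particular browser component class.

```pascal
uses
  SysUtils, MSHTML;

// Returns the BACKGROUND attribute of the page's <body>, or '' if the
// document has no body element or no background attribute.
function GetBodyBackground(const Document: IDispatch): string;
var
  Doc: IHTMLDocument2;
  Body: IHTMLBodyElement;
begin
  Result := '';
  if Supports(Document, IHTMLDocument2, Doc) and
     Supports(Doc.body, IHTMLBodyElement, Body) then
    Result := Body.background;
end;
```

It would be called as GetBodyBackground(WB.Document) once the page has finished loading. A background image set through CSS rather than the BACKGROUND attribute would not show up here.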