Save complete web pages to disk using VBA

Richard Wein
Richard Wein used Ask the Experts™
on
Hello. I'm trying to write VBA code (in Excel) to browse web pages and save them to files.

I already have VBA code to automate Internet Explorer, navigate to web pages, and extract the text of the pages. But now I want to save the entire page, including images. In other words, I'd like to do the equivalent of manually printing the web page to PDF, saving it as an MHTML file, saving it as "Web page, Complete" (with the images in a separate folder), or possibly saving the whole web page as one image file.

I've tried...

1. Initiating a print job in IE, having made "Print to PDF" the default. I've tried various variations on this statement:

 IE.ExecWB OLECMDID_PRINT, OLECMDEXECOPT_DONTPROMPTUSER

The problem is that it opens a Print dialog box followed by a Save File dialog box, and the only way I can find to interact with these boxes is to use SendKeys. But that's unrelaible. I don't know how long to wait for the dialog box to appear before sending the keys.

2. Creating an MHTML file using CDO, with this statement:

cdoMsg.CreateMHTMLBody url, cdoSuppressNone

This works, but it produces a very poorly rendered version of the web page.

Can you suggest any other options, please? I'd be open to using method #1 above, if I could detect when the dialog boxes are ready to receive keys with SendKeys. Alternatively, perhaps I should use a different technology. I'm only really familiar with VBA, but I understand that there are VBA wrappers available for WebKit and Selenium. Perhaps I could use one of those, if it's not going to be too complicated.

Thanks,
Richard.
Comment
Watch Question

Do more with

Expert Office
EXPERT OFFICE® is a registered trademark of EXPERTS EXCHANGE®
Scott FellDeveloper & EE Moderator
Fellow 2018
Most Valuable Expert 2013

Commented:
As a standalone, https://www.httrack.com is a good option.  You can access it via command line if it is installed on the computer https://www.httrack.com/html/scripting.html and use vba to run command line.
Daniel PineaultPresident / Owner CARDA Consultants Inc.
Distinguished Expert 2018

Commented:
I normally wouldn't use SendKeys because they are unreliable, but you could try something along the lines of

Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)

Sub SaveURL()
    Dim oIE                   As Object

    Set oIE = CreateObject("InternetExplorer.Application")
    With oIE
        .Visible = True
        .Navigate "http://www.devhut.net"
        Do Until .ReadyState = 4    'READYSTATE_COMPLETE
            DoEvents
        Loop
        SendKeys "^(s)"
        Sleep (1000) '!!!This is critical to give the dialog a chance to load!!!
        SendKeys "(C:\...\ ...\...\YourFileName.html)"
        SendKeys "{TAB}"
        SendKeys "{TAB}"
        SendKeys "{TAB}"
        SendKeys "{TAB}"
        SendKeys "{ENTER}"
        .Quit
    End With
    Set oIE = Nothing
End Sub

Open in new window


You could also get into APIs to eliminate the SendKeys and bind to the menu bar, click on the buttons by hwnd, ... but that can get very complicated, but is the preferred approach.

Author

Commented:
Hi Scott. Thanks for the suggestion. I've had a look at htttrack, and I think it might do.

One concern I have is whether the saved files will constitute a complete snapshot of the page. I think the -n switch ensures that external images  will be saved. But Is it possible that scripts will fetch data from other sources, by Ajax for example, that won't be saved? (I hardly know anything about Ajax, so maybe I'm completely off-base here.)

By the way, sorry I haven't followed up on the Word Cloud advice you gave me. That project's on the back burner for now, but I hope to come back to it later.
Become a Microsoft Certified Solutions Expert

This course teaches how to install and configure Windows Server 2012 R2.  It is the first step on your path to becoming a Microsoft Certified Solutions Expert (MCSE).

Author

Commented:
Hi Daniel. I've already tried something along those lines (not quite the same). One second wasn't enough for the dialog box to appear. I was using 5 seconds! And that's not guaranteed to be enough. (Perhaps the user's PC is overworked.)

I had a look at using Windows APIs to detect when the dialog box was open, but I wasn't able to find a solution. I've had very little experience of handling windows using the APIs. I was half-hoping that someone here could give me an API solution. ;-)
Daniel PineaultPresident / Owner CARDA Consultants Inc.
Distinguished Expert 2018

Commented:
When you (manually) open the page, wait for it to load fully, then press Ctrl+S, it takes 5+ seconds for the Save dialog to open?

Regardless, you can simply adjust the Sleep to whatever works on your PC.  I tested the code and it does work, as long as you don't do anything else while it is running!
Scott FellDeveloper & EE Moderator
Fellow 2018
Most Valuable Expert 2013

Commented:
You can control if you want objects from other sources or not.

ut Is it possible that scripts will fetch data from other sources, by Ajax for example, that won't be saved?

Open in new window

I'm not sure what you are asking.

Ajax would be using javascript via the browser. Using vba/vbs you can use xmlhttpost https://support.microsoft.com/en-us/help/290591/how-to-submit-form-data-by-using-xmlhttp-or-serverxmlhttp-object or https://docs.microsoft.com/en-us/windows/win32/winhttp/winhttprequest. This would be similar to an ajax request but on the server side.

Author

Commented:
Hi again. Sorry I've taken so long to reply. Thanks for your suggestions.

In the end I decided that a SendKeys solution is not reliable enough.

I implemented a solution using Httrack, but I found it didn't seem able to cope with all web pages. Quite often the downloaded files weren't sufficient to render the page fully. And occasionally it failed to download the page at all. Maybe this was because I was downloading single web pages and not whole web sites. I was using command lines like this:
{programPath} {url} -O {folderPath} {url} -v -s0 --depth=1 -n
...and this worked correctly for some pages.

I discovered that there is software for saving images of full web pages (like the Windows Print Screen button, but saving the whole web page, not just the part curently on screen). But again the two I tried didn't produce acceptable results. Pages were often not fully rendered. I'm still hoping to find a solution of this sort (as it would be better than mirroring the original files, as Httrack does), but I haven't had time to look into it any further yet.
Scott FellDeveloper & EE Moderator
Fellow 2018
Most Valuable Expert 2013

Commented:
Forget about vba for a moment.  What is your end goal?  Do you want to capture a screenshot of a given web page each day and archive it for example?  Do you need to detect changes in the page?

There are third party options for this and that may be easier than trying to do this all via vb.  https://pagescreen.io/tool/webpage-screenshot-archive.htm is one example of many.  There is also the possibility of using archive.org's api https://blog.archive.org/developers/.  If you go to archive.org first and just look for some current pages,you may see not all images get added.  That is where a third party option may give the best results but it also means there would be a charge as a trade off.

Author

Commented:
Thanks, Scott. PageScreen looks like it might be a good option.

To answer your question, this is for an Excel/VBA application that I've already developed for a client. Basically, it searches for web pages of interest, looking at their text and scoring them according to various criteria. It records how a page's score has changed over time, as the page changes. Recently my client asked for the ability to save the pages themselves, not just the scores. I've added the ability to save the text of the pages for comparison, but ideally he would also like to be able to see the rendered pages too.

PageScreen looks like it might suit his purposes, either on its own, or perhaps integrated with my existing application via PageScreen's API.
Developer & EE Moderator
Fellow 2018
Most Valuable Expert 2013
Commented:
The api would be the way to go for what you want to do.  

Another option is browsershots http://api.browsershots.org/documentation#FactoryInterface where you can use vba to hit a command line.

https://www.thum.io/
https://urlbox.io/

LIke I said, there are a number of these you can use.

Author

Commented:
Hi. Thanks for your help.

Scott, I just tried out https://urlbox.io/, and that seems like it will do what I want. (I'd already tried a similar service, but it didn't seem to work well.) So I've marked my question as solved. But I'll be checking out the other options you've mentioned, as those look interesting too.

Thanks again.

Do more with

Expert Office
Submit tech questions to Ask the Experts™ at any time to receive solutions, advice, and new ideas from leading industry professionals.

Start 7-Day Free Trial