naleo96

Using PowerShell to open an MHTML document and save embedded images as separate files

Using a PowerShell script, how do I iterate through the images in the file and save each image to its own individual file?

My scenario is as follows:

An MHTML file is produced daily by SSRS; it comprises a number of charts (embedded as images) and text.
The images need to be extracted as individual files and uploaded to an external FTP site.

I would like to use PowerShell to extract the images from the MHTML file (once the image files are created, moving them to the appropriate location is not an issue), but as a PowerShell newbie I don't really know where to start with getting the images. I have opened the MHTML file as follows, but don't know what to do next.

$ie = New-Object -ComObject "InternetExplorer.Application"
$ie.Visible = $true
$ie.Navigate("file://c:/projects/reports/20150831/Report.mhtml")
# Wait until IE has finished loading the document
while ($ie.Busy) { Start-Sleep -Milliseconds 50 }

Hope someone can help ... thanks.
SOLUTION
Christopher Jay Wolff
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
SOLUTION
Qlemo
Good find, Christopher! The BITS approach is indeed a smart one.
The way the PS code is structured is also exactly what I would have used.

naleo96, if that is not sufficient to get the images, we will need an example MHTML file to try ourselves. Sadly, parsing HTML pages for particular information can get tricky.
The linked script downloads every image into the same folder, without keeping any folder hierarchy. Usually that is fine, but not if you need to keep some source information.

What is still missing is the FTP upload, but you said that is not an issue ;-). Do you only upload the images? If you need the report itself to be transferred as well, you'll have to adapt the paths, of course.

Sadly, it seems that you cannot use BITS with FTP, so you would have to use a System.Net.WebClient object for the upload if you want to run it in PS (or use a command-line FTP tool).
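
For illustration, the WebClient upload mentioned above might look roughly like the sketch below. The function names (Get-FtpTargetUri, Send-FolderToFtp), the FTP address, and the credentials are all hypothetical placeholders, not anything from this thread, and the sketch assumes a plain (non-TLS) FTP server that accepts a user name and password.

```powershell
# Hypothetical helper: builds the upload URI for one local file.
function Get-FtpTargetUri {
    param([string]$FtpBaseUri, [string]$LocalFile)
    $name = [System.IO.Path]::GetFileName($LocalFile)
    # Escape spaces and other special characters in the file name.
    return $FtpBaseUri.TrimEnd('/') + '/' + [System.Uri]::EscapeDataString($name)
}

# Hypothetical helper: uploads every file in a folder to the given FTP location.
function Send-FolderToFtp {
    param([string]$Folder, [string]$FtpBaseUri, [string]$User, [string]$Password)
    $client = New-Object System.Net.WebClient
    $client.Credentials = New-Object System.Net.NetworkCredential($User, $Password)
    try {
        Get-ChildItem -Path $Folder | Where-Object { -not $_.PSIsContainer } | ForEach-Object {
            $target = Get-FtpTargetUri -FtpBaseUri $FtpBaseUri -LocalFile $_.FullName
            # For ftp:// addresses, UploadFile stores the file on the server.
            $client.UploadFile($target, $_.FullName) | Out-Null
        }
    }
    finally {
        $client.Dispose()
    }
}
```

A call could then look like `Send-FolderToFtp -Folder 'C:\projects\reports\images' -FtpBaseUri 'ftp://ftp.example.com/reports' -User 'someuser' -Password 'secret'`, with every value replaced by your own.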
naleo96

ASKER

Hi All.

Thanks for your replies. I've been away a couple of days and will be able to review your suggestions later today.

Thanks for your help so far
naleo96

ASKER

Thanks for your help.

The theory is all great, but I've hit a bit of a brick wall. The BITS transfer method works great on a sample site. In my case, however, the path to the images is actually a parameterised query (I think that's the right term; see the example below), and as a result Start-BitsTransfer fails with a 'FileNotFoundException'.

IMG element:
<IMG BORDER="0" style="top:0px;left:0px;position:relative;" SRC="http://hansendbserver/ReportServer?%2FReservoir%20Level%20Reporting%2FReservoir%20Storage%20Level%20Report&rs%3ASessionID=zvai5c550yxoi1454paeu2yd&rs%3AFormat=HTML4.0&rs%3AImageID=IMGCON_1_0"/>


Error Message:
Start-BitsTransfer : The filename, directory name, or volume label syntax is incorrect. (Exception from HRESULT: 0x8007007B)
At C:\projects\Reservoir Level Reporting\Reports\20150831\Untitled5.ps1:18 char:19
+ Start-BitsTransfer <<<<  $sources $destinations -Prio Foreground # -Display $displayname
    + CategoryInfo          : NotSpecified: (:) [Start-BitsTransfer], FileNotFoundException
    + FullyQualifiedErrorId : System.IO.FileNotFoundException,Microsoft.BackgroundIntelligentTransfer.Management.NewBitsTransferCommand

One alternative I have is to load an MHTML file. However, this too is unsuccessful, because again the images don't exist as separate files: they're embedded in the MHTML file. The IMG source looks like this:

<IMG onerror="this.errored=true;" BORDER="0" class="a25" SRC="cid:C_7iT0R0x0S0T0_1"/>
Once loaded into IE, the properties of the image look like this:

mhtml:file://C:\projects\Reservoir Level Reporting\Reports\20150907\Reservoir Storage Level Report.mhtml!cid:C_17iT1_1

Once again, Start-BitsTransfer cannot be used to copy the image, since the physical file does not exist.

My next strategy will be to get the report output as a Word file, and try to extract the images from that.

Any other suggestions would be most welcome.
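
One option that needs neither IE nor BITS is to read the MHTML file directly. An MHTML file is a MIME multipart document, and embedded images are normally stored as base64-encoded parts. The sketch below is a minimal illustration under that assumption (one boundary, unfolded headers, base64 image parts); it has not been run against the SSRS output discussed here, and the function name Export-MhtmlImage and the imageNN file naming are hypothetical.

```powershell
# Hypothetical helper: saves every base64-encoded image part of an MHTML file.
function Export-MhtmlImage {
    param([string]$Path, [string]$OutputFolder)

    if (-not (Test-Path $OutputFolder)) {
        New-Item -ItemType Directory -Path $OutputFolder | Out-Null
    }
    # Resolve to full paths so the .NET file calls do not depend on the process directory.
    $source = (Resolve-Path $Path).ProviderPath
    $target = (Resolve-Path $OutputFolder).ProviderPath

    $text = [System.IO.File]::ReadAllText($source)

    # The top-level Content-Type header names the boundary that separates the parts.
    $m = [regex]::Match($text, 'boundary="?([^";\r\n]+)"?', 'IgnoreCase')
    if (-not $m.Success) { throw "No MIME boundary found in $source" }
    $delimiter = '--' + $m.Groups[1].Value

    $index = 0
    $parts = $text.Split([string[]]@($delimiter), [System.StringSplitOptions]::RemoveEmptyEntries)
    foreach ($part in $parts) {
        # Within a part, headers and body are separated by the first blank line.
        $pieces = @($part.TrimStart() -split '\r?\n\r?\n', 2)
        if ($pieces.Count -lt 2) { continue }
        $headers = $pieces[0]
        $body = $pieces[1]

        if ($headers -match '(?im)^Content-Type:\s*image/([a-z0-9]+)') {
            $extension = $Matches[1].ToLower()
        }
        else { continue }
        if ($headers -notmatch '(?im)^Content-Transfer-Encoding:\s*base64') { continue }

        $index++
        $bytes = [System.Convert]::FromBase64String(($body -replace '\s', ''))
        $file = Join-Path $target ('image{0:D2}.{1}' -f $index, $extension)
        [System.IO.File]::WriteAllBytes($file, $bytes)
        $file   # emit the path of each image written
    }
}
```

For example, `Export-MhtmlImage -Path '.\Report.mhtml' -OutputFolder '.\images'` would write image01.png, image02.png and so on (the extension follows each part's Content-Type) and return the paths. Files are numbered in order of appearance rather than named after the Content-ID, because a cid value may contain characters that are not valid in a file name.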
Qlemo
Correct, BITS needs static content, and the image sources are indeed dynamic reports.
I assume you are not able to post an example MHTML?
naleo96

ASKER

Qlemo, thanks for your help.

Attached is a sample MHTML file - the one used in the previous example but renamed.

Don't waste too much time on this, as I've actually moved on to a method that uses PowerShell to open the file in MS Word and perform a SaveAs to the HTML format (forcing the images to be saved as individual files). From there, the script copies the individual files to the relevant external website using FTP. This is working OK. It does have the drawback that it needs to run on a PC with Word installed, but I can live with that at the moment.
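
The asker's actual script is not shown in the thread, so the following is only a rough sketch of what the Word approach described above could look like. The function name Convert-MhtmlToHtmlWithWord is hypothetical, 8 is the value of wdFormatHTML in Word's WdSaveFormat enumeration, and the sketch assumes Word is installed and opens the MHTML file without prompting.

```powershell
# Hypothetical helper: opens an MHTML file in Word and saves it as HTML,
# which makes Word write the embedded images out as separate files.
function Convert-MhtmlToHtmlWithWord {
    param($MhtmlPath, $HtmlPath)   # use full paths; Word does not know the PowerShell location

    $wdFormatHTML = 8              # WdSaveFormat.wdFormatHTML
    $word = New-Object -ComObject Word.Application
    $word.Visible = $false
    try {
        $doc = $word.Documents.Open($MhtmlPath)
        # SaveAs takes its arguments by reference when called through COM.
        $doc.SaveAs([ref]$HtmlPath, [ref]$wdFormatHTML)
        $doc.Close()
    }
    finally {
        $word.Quit()
        [void][System.Runtime.InteropServices.Marshal]::ReleaseComObject($word)
    }
}
```

After a call such as `Convert-MhtmlToHtmlWithWord 'C:\reports\Report.mhtml' 'C:\reports\out\Report.html'`, Word normally places the images in a companion folder next to the HTML file (for example Report_files on an English installation; the suffix can differ by language).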

Thanks again
sample.zip
ASKER CERTIFIED SOLUTION