Using PowerShell to open an MHTML document and save embedded images as separate files

Using a PowerShell script, how do I go about iterating through the images in the file and saving each image to its own individual file?

My scenario is as follows:

An MHTML file is produced daily by SSRS; the file comprises a number of charts (embedded as images) and text
The images need to be extracted as individual files and uploaded to an external FTP site

I would like to use PowerShell to extract the images from the MHTML (once the image files are created, moving them to the appropriate location is not an issue), but as a PowerShell newbie I don't really know where to start with getting the images. I have opened the MHTML as per the following, but don't know what to do next.

$ie = new-object -com "InternetExplorer.Application"
$ie.visible = $true
$ie.navigate("file://c:/projects/reports/20150831/Report.mhtml")
while ($ie.busy) {sleep -milliseconds 50}

Hope someone can help ... thanks.
Asked by: naleo96

Christopher Jay Wolff (Wiggle My Legs, Owner) Commented:
Hi. I'm too new at this to be gobs of help, but since no one is here yet I wanted to point out the option of using Get-Content with its output piped to ForEach-Object (which can be abbreviated as "%"). Does this look like an attractive way to proceed with parsing your MHTML file? You may want to use the -Raw parameter. You may also like the examples in
PS>Get-Help about_split -full
Here is a version.
https://social.technet.microsoft.com/Forums/scriptcenter/en-US/3dac12b9-84f4-455f-99ed-7f20e42075dc/parse-a-log-file-with-powershell?forum=ITCG


Here is another parsing example to review.
http://blogs.technet.com/b/heyscriptingguy/archive/2012/03/26/use-powershell-to-parse-an-xml-file-and-sort-the-data.aspx

And the System.IO.Path class has the GetFileNameWithoutExtension and GetExtension methods.
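As a small illustration of those two System.IO.Path methods, here is a minimal sketch. The file name is taken from the question; the variable names are only placeholders.

```powershell
# Split a file name into its base name and extension
$name = 'Report.mhtml'
$base = [System.IO.Path]::GetFileNameWithoutExtension($name)   # Report
$ext  = [System.IO.Path]::GetExtension($name)                  # .mhtml
"$base / $ext"
```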

Probably Qlemo or some other Expert will be along here shortly to assist you better.  But you'll have something to do.  :)
Christopher Jay Wolff (Wiggle My Legs, Owner) Commented:
Never mind all that above.  I didn't know it was built in.  Very nice.  I think they're doing exactly what you want at the following link with a starting script just like yours.  I actually thought you got your script from there and stopped reading.  :-)

To summarize:
After your script above, use the document property of your IE object
$ie.document | Get-Member
to get at the methods.

The getElementsByTagName method lets you pull all img elements. Pipe the result to Select-Object to get the URLs of all images.
$ie.document.getElementsByTagName('img') | Select-Object -ExpandProperty src

Then use
Start-BitsTransfer
to get the images.
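Put together, the steps above might look roughly like the following sketch. It assumes Windows with Internet Explorer and the BitsTransfer module available, and that each src points at a static image file reachable by URL. Get-ImageFileName and Save-PageImage are hypothetical helper names, not part of the linked script.

```powershell
# Sketch only: requires Windows with Internet Explorer and the BitsTransfer module.
# Get-ImageFileName and Save-PageImage are hypothetical helper names.
function Get-ImageFileName([string] $Url) {
    # Use the last path segment of the URL as the local file name
    [System.IO.Path]::GetFileName(([uri] $Url).LocalPath)
}

function Save-PageImage([string] $PageUrl, [string] $OutDir) {
    Import-Module BitsTransfer
    $ie = New-Object -ComObject 'InternetExplorer.Application'
    try {
        $ie.Navigate($PageUrl)
        while ($ie.Busy) { Start-Sleep -Milliseconds 50 }

        # Collect the src attribute of every <img> element on the page
        $urls = @($ie.Document.getElementsByTagName('img') |
            Select-Object -ExpandProperty src)

        foreach ($url in $urls) {
            $target = Join-Path $OutDir (Get-ImageFileName $url)
            Start-BitsTransfer -Source $url -Destination $target
        }
    }
    finally {
        # Always close the hidden IE instance, even if a transfer fails
        $ie.Quit()
    }
}
```

Building the destination path per image keeps every download in one folder, named after the last segment of its URL.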


http://powershell.com/cs/blogs/tobias/archive/2010/03/17/downloading-images-from-webpages.aspx


Nice work from Tobias.
Qlemo (Batchelor, Developer and EE Topic Advisor) Commented:
Good find, Christopher! The BITS stuff is indeed something very smart to use.
Also, that approach to the PS code is exactly what I would have used.

naleo96, if that is not sufficient to get the images, we will need an example MHTML file to try ourselves. Sadly, parsing HTML pages for particular info can get tricky sometimes.
The linked script downloads every image into the same folder, not keeping any folder hierarchy. Usually that should be fine, but not if you need to keep some source info.

What is still missing from that is the FTP upload, but you said that is not important ;-). Do you only upload the images? If you need the report itself to be transferred, you'll have to adapt paths, of course.

Sadly it seems that you cannot use BITS with FTP, so you would have to do that using a System.Net.WebClient object if you want to run that in PS (or use a cmdline FTP tool for that).
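A minimal sketch of the System.Net.WebClient route could look like this. Send-FileToFtp is a hypothetical function name, and the server address, user name and password in the usage note are placeholders.

```powershell
# Sketch: upload one local file to an FTP location with System.Net.WebClient.
# Send-FileToFtp is a hypothetical helper name.
function Send-FileToFtp {
    param(
        [string] $LocalPath,
        [string] $FtpUri,                        # full target URI, including the file name
        [System.Net.ICredentials] $Credential
    )
    $client = New-Object System.Net.WebClient
    try {
        $client.Credentials = $Credential
        # For an ftp:// address, UploadFile sends the file with an FTP STOR command
        $null = $client.UploadFile($FtpUri, $LocalPath)
    }
    finally {
        $client.Dispose()
    }
}
```

It could then be called with something like Send-FileToFtp -LocalPath 'C:\temp\image1.png' -FtpUri 'ftp://ftp.example.com/charts/image1.png' -Credential (New-Object System.Net.NetworkCredential('user', 'password')).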
naleo96 (Author) Commented:
Hi All.

Thanks for your replies. I've been away a couple of days and will be able to review your suggestions later today.

Thanks for your help so far
naleo96 (Author) Commented:
Thanks for your help.

The theory is all great, but I've hit a bit of a brick wall. The BITS transfer method works great on a sample site. In my case, however, the path to the images is actually a parameterised query (I think that's the right term; see the example below), and as a result Start-BitsTransfer fails with a 'FileNotFoundException'.

IMG element:
<IMG BORDER="0" style="top:0px;left:0px;position:relative;" SRC="http://hansendbserver/ReportServer?%2FReservoir%20Level%20Reporting%2FReservoir%20Storage%20Level%20Report&rs%3ASessionID=zvai5c550yxoi1454paeu2yd&rs%3AFormat=HTML4.0&rs%3AImageID=IMGCON_1_0"/>

Error Message:
Start-BitsTransfer : The filename, directory name, or volume label syntax is incorrect. (Exception from HRESULT: 0x8007007B)
At C:\projects\Reservoir Level Reporting\Reports\20150831\Untitled5.ps1:18 char:19
+ Start-BitsTransfer <<<<  $sources $destinations -Prio Foreground # -Display $displayname
    + CategoryInfo          : NotSpecified: (:) [Start-BitsTransfer], FileNotFoundException
    + FullyQualifiedErrorId : System.IO.FileNotFoundException,Microsoft.BackgroundIntelligentTransfer.Management.NewBitsTransferCommand

One alternative I have is to load an MHTML file. However, this too is unsuccessful, and again the images don't actually exist as files ... they're embedded in the file. The IMG source looks like this:

<IMG onerror="this.errored=true;" BORDER="0" class="a25" SRC="cid:C_7iT0R0x0S0T0_1"/>
Once loaded into IE, the properties of the image look like this:

mhtml:file://C:\projects\Reservoir Level Reporting\Reports\20150907\Reservoir Storage Level Report.mhtml!cid:C_17iT1_1

Once again, Start-BitsTransfer cannot be used to copy the file, since the physical file does not exist.
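Because the images are embedded in the MHTML file itself, one possible alternative is to read the file as a MIME multipart document and decode the image parts directly, without IE or BITS. The following is only a sketch under assumptions that were not verified against the SSRS output in this thread: that each image is stored as a part with a Content-Type of image/... and a Content-Transfer-Encoding of base64. Export-MhtmlImage is a hypothetical function name, and the output files are simply numbered in order of appearance.

```powershell
# Sketch: split an MHTML (MIME multipart) file into its parts and save the
# base64-encoded image parts as files. Export-MhtmlImage is a hypothetical name.
function Export-MhtmlImage {
    param(
        [Parameter(Mandatory = $true)] [string] $Path,
        [Parameter(Mandatory = $true)] [string] $OutDir
    )

    $text = [System.IO.File]::ReadAllText((Resolve-Path $Path).ProviderPath)

    # The multipart boundary is declared in the top-level Content-Type header
    if ($text -notmatch 'boundary="?([^";\r\n]+)"?') {
        throw "No MIME boundary found in $Path"
    }
    $boundary = $Matches[1]

    # Create the output folder if needed and work with its full path
    $OutDir = (New-Item -ItemType Directory -Path $OutDir -Force).FullName
    $index = 0

    # Each part follows a delimiter line of the form "--<boundary>"
    foreach ($part in ($text -split [regex]::Escape("--$boundary"))) {
        # Headers and body are separated by the first blank line
        $pieces = $part -split '\r?\n\r?\n', 2
        if ($pieces.Count -lt 2) { continue }
        $headers, $body = $pieces

        # Keep only base64-encoded image parts
        if ($headers -notmatch '(?im)^Content-Type:\s*image/([\w.+-]+)') { continue }
        $ext = $Matches[1]
        if ($headers -notmatch '(?im)^Content-Transfer-Encoding:\s*base64') { continue }

        $index++
        $file = Join-Path $OutDir ("image{0}.{1}" -f $index, $ext)
        # FromBase64String ignores the line breaks inside the encoded body
        [System.IO.File]::WriteAllBytes($file, [Convert]::FromBase64String($body.Trim()))
        $file   # emit the path of each file written
    }
}
```

Called as Export-MhtmlImage -Path 'Report.mhtml' -OutDir 'images', it returns the paths of the files it wrote, which could then be handed to whatever upload step is used.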

My next strategy will be to get the report output as a Word file, and try to extract the images from that.

Any other suggestions would be most welcome.
Qlemo (Batchelor, Developer and EE Topic Advisor) Commented:
Correct, BITS needs static content, and the image sources are indeed dynamic reports.
I assume you are not able to post an example MHTML?
naleo96 (Author) Commented:
Qlemo, thanks for your help.

Attached is a sample MHTML file - the one used in the previous example but renamed.

Don't waste too much time on this, as I've actually moved on to a method that uses PowerShell to open the file in MS Word and perform a SaveAs to the HTML format (forcing the images to be saved as individual files). From there the script copies the individual files to the relevant external website using FTP. This is working OK, but it does have the drawback that it needs to run on a PC with Word installed. I can live with that at the moment.
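The actual script was not posted, so the following is only a rough sketch of what a Word-based conversion might look like. It assumes Word is installed and automated through its COM interface. Convert-ReportToHtml is a hypothetical function name, and whether the [ref] casts on the SaveAs arguments are needed can vary with the PowerShell and Word versions in use.

```powershell
# Sketch only: requires Windows with Microsoft Word installed.
# Convert-ReportToHtml is a hypothetical helper name.
function Convert-ReportToHtml {
    param([string] $SourcePath, [string] $HtmlPath)

    $wdFormatFilteredHTML = 10   # WdSaveFormat value for filtered HTML
    $word = New-Object -ComObject 'Word.Application'
    try {
        $word.Visible = $false
        $doc = $word.Documents.Open($SourcePath)
        # Saving as HTML makes Word write the embedded images out as separate
        # files, typically into a companion folder next to the HTML file
        $doc.SaveAs([ref] $HtmlPath, [ref] $wdFormatFilteredHTML)
        $doc.Close()
    }
    finally {
        # Always shut Word down so no hidden instance is left running
        $word.Quit()
    }
}
```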

Thanks again
sample.zip
naleo96 (Author) Commented:
Thanks for the assistance. It confirmed that I was on the right track. In the end the final part of the solution (use of Start-BitsTransfer) can't be applied because it doesn't work with MHTML (as posed in the original question), but it was a great suggestion nonetheless.

In the end, my solution has been to work with an alternate starting document type (MS Word). Not my preferred solution but acceptable.