Get HTML source using mshtml

Posted on 2004-11-17
Last Modified: 2010-05-18
I've been through this over and over again. What I'm trying to do is recreate an HTML document to send as an email attachment. Internet Explorer provides the Send To option, but it doesn't handle postbacks. For example, if I want to email a results page whose URL is the same as the inquiry page, IE will only grab the inquiry page. So I've been writing a side application that reads the source of the HTML document and recreates the page for an email attachment. I've been able to get to the page and view the source using mshtml. It seems that mshtml adds markup to the page (<TBODY>, for example). It also strips quotation marks, etc.

My code is able to get the active IE session, read the source, and output it. So that the page works in the email, I'm going through and adding the full address to items such as images. Everything seems to work fine except for HTML character entities. For example:

 &nbsp; = space
 &amp; = &

I've added System.Web as a reference and tried using System.Web.HttpUtility.HtmlDecode().

It handles the &amp;, but it's converting &nbsp; to Â.

Here is the code so far:

public string GetSelectedDocumentSource(int Index)
{
    string WorkingSource = "";
    string LowerWorkingSource = "";
    string ReturnSource = "";
    string ImageTag = "";
    string ImageSource = "";
    string FullImageSource = "";
    int OpenImgIndex = 0;
    int CloseImgIndex = 0;
    int OpenImgSourceIndex = 0;
    int CloseImgSourceIndex = 0;
    IWebBrowserApp Browser = (IWebBrowserApp) IExplorerCollection[Index];
    HTMLDocument Document = (HTMLDocument) Browser.Document;

    // Get Images in Document
    System.Collections.IEnumerator ImageEnumr = Document.images.GetEnumerator();

    WorkingSource = System.Web.HttpUtility.HtmlDecode(Document.documentElement.outerHTML); // document source code
    LowerWorkingSource = WorkingSource.ToLower();

    // Locate the first <img ...> tag
    OpenImgIndex = LowerWorkingSource.IndexOf("<img");
    if (OpenImgIndex != -1)
        CloseImgIndex = LowerWorkingSource.IndexOf(">", OpenImgIndex);
    else
        CloseImgIndex = -1;

    while (OpenImgIndex != -1 && CloseImgIndex != -1)
    {
        // Copy everything before the tag, then pull out the tag and its src value
        ReturnSource = ReturnSource + WorkingSource.Substring(0, OpenImgIndex);
        ImageTag = WorkingSource.Substring(OpenImgIndex, CloseImgIndex - OpenImgIndex);
        OpenImgSourceIndex = ImageTag.IndexOf("src=\"") + 5;
        CloseImgSourceIndex = ImageTag.IndexOf("\"", OpenImgSourceIndex);
        ImageSource = ImageTag.Substring(OpenImgSourceIndex, CloseImgSourceIndex - OpenImgSourceIndex);

        // Default to the unchanged tag so a stale value is never reused
        FullImageSource = ImageTag;
        ImageEnumr.Reset();
        while (ImageEnumr.MoveNext())
        {
            HTMLImg image = (HTMLImg) ImageEnumr.Current;
            // NOTE: the matching condition is missing from the post; a suffix match is assumed here
            if (image.src.EndsWith(ImageSource))
                FullImageSource = ImageTag.Replace(ImageSource, image.src);
            //Console.WriteLine("Original Source: " + ImageSource + " - Full Source: " + image.src);
        }
        ReturnSource = ReturnSource + FullImageSource;

        // Loop Clean Up
        LowerWorkingSource = LowerWorkingSource.Remove(0, CloseImgIndex);
        WorkingSource = WorkingSource.Remove(0, CloseImgIndex);
        OpenImgIndex = LowerWorkingSource.IndexOf("<img");
        if (OpenImgIndex != -1)
            CloseImgIndex = LowerWorkingSource.IndexOf(">", OpenImgIndex);
        else
            CloseImgIndex = -1;
    } // end while

    ReturnSource = ReturnSource + WorkingSource;
    return ReturnSource;
}

The code is still a bit sloppy, so I do apologize.
Question by: MischiefMadness
    LVL 5

    Expert Comment

    Can you post an example? I mean a string of HTML source before and after this function, so we can clearly understand the job. And why don't you use Regex?
    LVL 6

    Accepted Solution

    I also think there is a small problem with this encode/decode step for &nbsp;. Why don't you replace every &nbsp; with a space before decoding, and change it back to &nbsp; after encoding?
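A minimal sketch of the replace-around-decode idea suggested above. It swaps &nbsp; for a placeholder rather than a plain space, so that ordinary spaces are not turned into &nbsp; on the way back. The class name, method name, and placeholder character are hypothetical, and the sketch assumes the page does not already contain that placeholder character.

```csharp
using System;
using System.Web;

public static class EntityHelper
{
    // Hypothetical placeholder: a private-use character assumed not to occur in the page.
    private const string NbspPlaceholder = "\uE000";

    // Decodes HTML entities but leaves every literal "&nbsp;" in place.
    public static string DecodeKeepingNbsp(string html)
    {
        // 1. Hide &nbsp; from the decoder.
        string shielded = html.Replace("&nbsp;", NbspPlaceholder);
        // 2. Decode everything else (&amp;, &lt;, ...).
        string decoded = HttpUtility.HtmlDecode(shielded);
        // 3. Put &nbsp; back as an entity.
        return decoded.Replace(NbspPlaceholder, "&nbsp;");
    }
}
```

Only the literal text &nbsp; is protected here; a numeric form such as &#160; would still be decoded and would need the same treatment.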


    Author Comment


    That is exactly what is happening. Every &nbsp; is being 'converted' to Â. Is replacing it in a string about my only option?

    I'm not familiar with Regex, to be perfectly honest. How would it 'fix' my problem?
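On the Regex question: a regular expression would not change the &nbsp; behaviour, but it could replace the manual IndexOf/Substring scanning for image tags. The sketch below is one possible approach, not code from this thread. The class name, method name, and pattern are hypothetical. It assumes the page URL is available as a Uri, and it accepts quoted and unquoted src values because mshtml may drop the quotes.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ImgSrcRewriter
{
    // Group 1: the tag up to and including "src=".
    // Groups 2-4: the value, double-quoted, single-quoted, or unquoted.
    private static readonly Regex ImgSrc = new Regex(
        @"(<img\b[^>]*?\ssrc\s*=\s*)(?:""([^""]*)""|'([^']*)'|([^\s>]+))",
        RegexOptions.IgnoreCase);

    // Rewrites every <img> src in html to an absolute URL based on pageUri.
    public static string MakeImageSourcesAbsolute(string html, Uri pageUri)
    {
        return ImgSrc.Replace(html, m =>
        {
            string original = m.Groups[2].Success ? m.Groups[2].Value
                            : m.Groups[3].Success ? m.Groups[3].Value
                            : m.Groups[4].Value;
            Uri absolute;
            if (!Uri.TryCreate(pageUri, original, out absolute))
                return m.Value; // leave tags that cannot be resolved untouched
            // Always emit a double-quoted value.
            return m.Groups[1].Value + "\"" + absolute.AbsoluteUri + "\"";
        });
    }
}
```

For example, with a page URL of http://example.com/dir/page.aspx, the tag <IMG height=1 src=images/a.gif> would come back with src="http://example.com/dir/images/a.gif". Regular expressions are not a full HTML parser, so unusual markup may still slip through.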
    LVL 6

    Expert Comment

    Actually, when decoding, &nbsp; is changed to white space, but when encoding it is not changed back to &nbsp;. I think the easier way is to replace.
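One likely explanation for the Â, offered as an assumption since the thread never confirms it: HtmlDecode turns &nbsp; into the non-breaking space character U+00A0 rather than an ordinary space. If the result is then written as UTF-8 and read back as a single-byte encoding such as ISO-8859-1, the two bytes C2 A0 show up as Â followed by a non-breaking space. The sketch below reproduces that effect; the class and method names are hypothetical.

```csharp
using System;
using System.Text;
using System.Web;

public static class NbspDemo
{
    // Decodes entities, encodes the result as UTF-8, then misreads the bytes as ISO-8859-1.
    public static string DecodeThenMisread(string html)
    {
        string decoded = HttpUtility.HtmlDecode(html);      // "&nbsp;" becomes "\u00A0"
        byte[] utf8 = Encoding.UTF8.GetBytes(decoded);      // U+00A0 becomes bytes C2 A0
        return Encoding.GetEncoding("ISO-8859-1").GetString(utf8);
    }
}
```

If this is what is happening, declaring the attachment's character set consistently (for example, writing and labelling it as UTF-8) or converting U+00A0 back to &nbsp; before output should both avoid the stray character.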
