MischiefMadness

Get HTML source using mshtml

I've been through this over and over again. What I'm trying to do is recreate an HTML document to send as an email attachment. Internet Explorer provides the Send To option, but it doesn't handle postbacks. For example, if I wanted to email a results page whose URL is the same as the inquiry page's, IE will only grab the inquiry page. So I've been writing a side application that reads the source of the HTML document and recreates the page for an email attachment. I've been able to get to the page and view the source using mshtml. It seems that mshtml adds markup to the page, <TBODY> for example. It also drops quotation marks around attribute values, etc.

My code is able to get the active IE session, read the source, and output it. So that the page works in the email, I'm going through and adding the full address to items such as images. Everything seems to work fine except for special characters written as HTML entities. For example:

 &nbsp; = space
 &amp; = &

I've added System.Web as a reference and tried using System.Web.HttpUtility.HtmlDecode()

It handles the &amp;, but it's converting &nbsp; to Â.

Here is the code so far:

public string GetSelectedDocumentSource(int Index)
            {
                  string                  WorkingSource            = "";
                  string                  LowerWorkingSource      = "";
                  string                  ReturnSource            = "";
                  string                  ImageTag                  = "";
                  string                  ImageSource                  = "";
                  string                  FullImageSource            = "";
                  int                        OpenImgIndex            = 0;
                  int                        CloseImgIndex            = 0;
                  int                        OpenImgSourceIndex      = 0;
                  int                        CloseImgSourceIndex      = 0;
                  IWebBrowserApp      Browser                        = (IWebBrowserApp) IExplorerCollection[Index];
                  HTMLDocument      Document                  = (HTMLDocument) Browser.Document;

                  // Get Images in Document
                  System.Collections.IEnumerator ImageEnumr = Document.images.GetEnumerator();

                  WorkingSource = System.Web.HttpUtility.HtmlDecode(Document.documentElement.outerHTML); // document source code
                  LowerWorkingSource = WorkingSource.ToLower();

                  OpenImgIndex = LowerWorkingSource.IndexOf("<img");
                  if(OpenImgIndex != -1)
                  {
                        CloseImgIndex = LowerWorkingSource.IndexOf(">",OpenImgIndex);
                  }
                  else
                  {
                        CloseImgIndex = -1;
                  }
                  while( OpenImgIndex != -1 && CloseImgIndex != -1)
                  {
                        ReturnSource = ReturnSource + WorkingSource.Substring(0,OpenImgIndex);
                        
                        ImageTag = WorkingSource.Substring(OpenImgIndex,CloseImgIndex - OpenImgIndex);
                        OpenImgSourceIndex = ImageTag.IndexOf("src=\"") + 5;
                        CloseImgSourceIndex = ImageTag.IndexOf("\"",OpenImgSourceIndex);
                        ImageSource = ImageTag.Substring(OpenImgSourceIndex,CloseImgSourceIndex - OpenImgSourceIndex);
                        FullImageSource = ImageTag; // fall back to the original tag if no image matches
                        for(int Counter = 0; Counter < Document.images.length; Counter++)
                        {
                              ImageEnumr.MoveNext();
                              HTMLImg image = (HTMLImg) ImageEnumr.Current;
                              if(image.src.EndsWith(ImageSource))
                              {
                                    
                                    FullImageSource = ImageTag.Replace(ImageSource,image.src);
                              }
                              //Console.WriteLine("Original Source: " + ImageSource + " - Full Source: " + image.src);
                        }      
                        //Console.WriteLine("\r\n\r\n\r\n");
                        ReturnSource = ReturnSource + FullImageSource;

                        // Loop Clean Up
                        LowerWorkingSource = LowerWorkingSource.Remove(0,CloseImgIndex);
                        WorkingSource = WorkingSource.Remove(0,CloseImgIndex);
                        OpenImgIndex = LowerWorkingSource.IndexOf("<img");
                        if(OpenImgIndex != -1)
                        {
                              CloseImgIndex = LowerWorkingSource.IndexOf(">",OpenImgIndex);
                        }
                        else
                        {
                              CloseImgIndex = -1;
                        }
                        ImageEnumr.Reset();
                  } // end while
                  ReturnSource = ReturnSource + WorkingSource;
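                  // Strings are immutable: Replace returns a new string, so the result must be assigned.
                  // HtmlEncode is not applied to the whole document here because it would also escape the tags.
                  ReturnSource = ReturnSource.Replace("<TBODY>","");
                  ReturnSource = ReturnSource.Replace("</TBODY>","");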
                  return ReturnSource;
            }


The code is still a bit sloppy, so I do apologize.
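
A possible explanation for the Â, offered as a sketch and not a confirmed diagnosis: HtmlDecode turns &nbsp; into the real non-breaking space character (U+00A0). If the attachment is then written as UTF-8 but opened as Latin-1 or Windows-1252, the two UTF-8 bytes 0xC2 0xA0 show up as "Â" followed by a non-breaking space. The example below reproduces that effect. The class and method names are made up for illustration, and System.Net.WebUtility.HtmlDecode stands in for System.Web.HttpUtility.HtmlDecode so that no System.Web reference is needed.

```csharp
using System;
using System.Net;
using System.Text;

public static class NbspDemo
{
    // Decodes the entities, encodes the result as UTF-8, then reads those
    // bytes back as ISO-8859-1 to mimic a viewer that assumes the wrong charset.
    public static string DecodeThenMisread(string html)
    {
        string decoded = WebUtility.HtmlDecode(html);     // "&nbsp;" becomes U+00A0
        byte[] utf8 = Encoding.UTF8.GetBytes(decoded);    // U+00A0 is 0xC2 0xA0 in UTF-8
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        return latin1.GetString(utf8);                    // 0xC2 reads as "Â" in Latin-1
    }
}
```

Calling NbspDemo.DecodeThenMisread("a&nbsp;b") gives "a", "Â", a non-breaking space, and "b". If this is what is happening, writing the attachment with an encoding that matches its declared charset, or not decoding the document at all, should avoid the stray character.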
buraksarica

Can you post a working example? I mean a string of HTML source before and after this function, so we can clearly understand the job. Also, why don't you use RegEx?
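
To illustrate the RegEx suggestion, here is a minimal sketch of how a regular expression could replace the manual IndexOf/Substring loop for making image addresses absolute. It is not the asker's method: the class name, method name, and the example.com address are hypothetical, and the pattern is deliberately simple (it handles quoted and unquoted src values without spaces and would also match attributes such as data-src).

```csharp
using System;
using System.Text.RegularExpressions;

public static class ImgSrcRewriter
{
    // Group 1: the tag up to and including "src=".
    // Group 2: the optional opening quote. Group 3: the src value itself.
    private static readonly Regex ImgSrc = new Regex(
        @"(<img\b[^>]*?\bsrc\s*=\s*)([""']?)([^""'\s>]+)\2",
        RegexOptions.IgnoreCase);

    // Resolves every matched src against the address of the page it came from.
    public static string MakeImageSourcesAbsolute(string html, Uri pageUri)
    {
        return ImgSrc.Replace(html, delegate(Match m)
        {
            Uri absolute;
            if (!Uri.TryCreate(pageUri, m.Groups[3].Value, out absolute))
            {
                return m.Value; // leave the tag untouched if the value cannot be resolved
            }
            // Always emit a quoted value, since mshtml may have dropped the quotes.
            return m.Groups[1].Value + "\"" + absolute.AbsoluteUri + "\"";
        });
    }
}
```

For example, passing "<IMG src=images/logo.gif border=0>" with a page address of http://example.com/app/results.aspx yields a tag whose src is "http://example.com/app/images/logo.gif". Because the pattern ignores case, the separate lower-cased copy of the source is not needed.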
ASKER CERTIFIED SOLUTION
Chester_M_Ragel

This solution is only available to members.
MischiefMadness

ASKER

Chester,

That is exactly what is happening. Every &nbsp; is being 'converted' to Â. Is replacing it in a string about my only option?

Buraksarica,
I'm not familiar with RegEx, to be perfectly honest. How will this 'fix' my problem?
Actually, when it decodes &nbsp; it changes it to white space, but when it encodes, it does not change it back to &nbsp;. I think the easier way is to replace it.
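
The replace approach mentioned above could look something like the sketch below. It assumes the "white space" produced by decoding is the non-breaking space character (U+00A0), which is what HtmlDecode returns for &nbsp;. The class and method names are hypothetical.

```csharp
using System;

public static class NbspFix
{
    // Turns literal non-breaking spaces (U+00A0) back into the &nbsp; entity,
    // so the output stays plain ASCII at those positions whatever the file encoding.
    public static string RestoreNbsp(string html)
    {
        if (html == null) throw new ArgumentNullException("html");
        return html.Replace("\u00A0", "&nbsp;"); // Replace returns a new string
    }
}
```

In the function from the question this would be applied as ReturnSource = NbspFix.RestoreNbsp(ReturnSource); just before the return. Ordinary spaces (U+0020) are left alone, so only the characters that came from &nbsp; are affected.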