  • Status: Solved
  • Priority: Medium
  • Security: Public

Get HTML source using mshtml

I've been through this over and over again. What I'm trying to do is recreate an HTML document to send as an email attachment. Internet Explorer provides the Send To option, but it doesn't handle postbacks. For example, if I want to email a results page whose URL is the same as the inquiry page, IE will only grab the inquiry page. So I've been writing a side application that reads the source of the HTML document and recreates the page as an email attachment. I've been able to get to the page and view the source using mshtml. It seems that mshtml adds code to the page, <TBODY> for example. It also drops quotation marks, etc.

My code is able to get the active IE session, read the source, and output it. So that the page works in the email, I'm going through and adding the full address to items such as images. Everything seems to work fine except for special characters written as HTML entities. For example:

 &nbsp; = space
 &amp; = &

I've added System.Web as a reference and tried using System.Web.HttpUtility.HtmlDecode().

It handles &amp;, but it's converting &nbsp; to Â.

Here is the code so far:

public string GetSelectedDocumentSource(int Index)
{
    string WorkingSource       = "";
    string LowerWorkingSource  = "";
    string ReturnSource        = "";
    string ImageTag            = "";
    string ImageSource         = "";
    string FullImageSource     = "";
    int    OpenImgIndex        = 0;
    int    CloseImgIndex       = 0;
    int    OpenImgSourceIndex  = 0;
    int    CloseImgSourceIndex = 0;
    IWebBrowserApp Browser     = (IWebBrowserApp) IExplorerCollection[Index];
    HTMLDocument   Document    = (HTMLDocument) Browser.Document;

    // Document source as serialized by mshtml, with HTML entities decoded
    WorkingSource = System.Web.HttpUtility.HtmlDecode(Document.documentElement.outerHTML);
    LowerWorkingSource = WorkingSource.ToLower();

    OpenImgIndex = LowerWorkingSource.IndexOf("<img");
    CloseImgIndex = (OpenImgIndex != -1) ? LowerWorkingSource.IndexOf(">", OpenImgIndex) : -1;

    while (OpenImgIndex != -1 && CloseImgIndex != -1)
    {
        // Copy everything before the <img tag through unchanged
        ReturnSource = ReturnSource + WorkingSource.Substring(0, OpenImgIndex);

        // Tag text without its closing '>', which stays at the start of the remainder
        ImageTag = WorkingSource.Substring(OpenImgIndex, CloseImgIndex - OpenImgIndex);
        FullImageSource = ImageTag; // default: keep the tag as-is if nothing matches

        OpenImgSourceIndex = ImageTag.IndexOf("src=\"");
        if (OpenImgSourceIndex != -1)
        {
            OpenImgSourceIndex += 5; // skip past src="
            CloseImgSourceIndex = ImageTag.IndexOf("\"", OpenImgSourceIndex);
            if (CloseImgSourceIndex > OpenImgSourceIndex) // quoted, non-empty src found
            {
                ImageSource = ImageTag.Substring(OpenImgSourceIndex, CloseImgSourceIndex - OpenImgSourceIndex);

                // Find the DOM image whose resolved URL ends with this src value
                foreach (HTMLImg image in Document.images)
                {
                    if (image.src.EndsWith(ImageSource))
                    {
                        FullImageSource = ImageTag.Replace(ImageSource, image.src);
                        break;
                    }
                }
            }
        }
        ReturnSource = ReturnSource + FullImageSource;

        // Loop clean up: drop the processed text and look for the next <img
        LowerWorkingSource = LowerWorkingSource.Remove(0, CloseImgIndex);
        WorkingSource = WorkingSource.Remove(0, CloseImgIndex);
        OpenImgIndex = LowerWorkingSource.IndexOf("<img");
        CloseImgIndex = (OpenImgIndex != -1) ? LowerWorkingSource.IndexOf(">", OpenImgIndex) : -1;
    } // end while

    ReturnSource = ReturnSource + WorkingSource;
    // NOTE: the result of this call is discarded, so it has no effect on ReturnSource
    System.Web.HttpUtility.HtmlEncode(ReturnSource);
    // String.Replace returns a new string, so the result must be assigned back
    ReturnSource = ReturnSource.Replace("<TBODY>", "");
    ReturnSource = ReturnSource.Replace("</TBODY>", "");
    return ReturnSource;
}


The code is still a bit sloppy, so I do apologize.
Asked by: MischiefMadness
1 Solution
 
buraksarica commented:
Can you post an example? I mean a string of the HTML source before and after this function, so we can clearly understand the job. And why don't you use RegEx?
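
To make the RegEx suggestion concrete, here is a minimal sketch of rewriting image URLs with a regular expression instead of manual IndexOf/Substring bookkeeping. The class and method names (ImgSrcRewriter, MakeImageUrlsAbsolute) and the base URL in the usage note are hypothetical and not from the thread. It assumes the page's own URL is available as the base, and it handles quoted and unquoted src values, since mshtml may drop the quotes.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ImgSrcRewriter
{
    // Group 1: everything from "<img" up to and including "src=".
    // Groups 2-4: the src value when double-quoted, single-quoted, or unquoted.
    private static readonly Regex ImgSrc = new Regex(
        @"(<img\b[^>]*?\bsrc\s*=\s*)(?:""([^""]*)""|'([^']*)'|([^\s>]+))",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: resolves every <img> src against baseUri.
    public static string MakeImageUrlsAbsolute(string html, Uri baseUri)
    {
        return ImgSrc.Replace(html, m =>
        {
            string src = m.Groups[2].Success ? m.Groups[2].Value
                       : m.Groups[3].Success ? m.Groups[3].Value
                       : m.Groups[4].Value;

            Uri absolute;
            // Leave the tag untouched if the value cannot be resolved.
            if (!Uri.TryCreate(baseUri, src, out absolute))
            {
                return m.Value;
            }
            // Always emit a double-quoted, absolute URL.
            return m.Groups[1].Value + "\"" + absolute.AbsoluteUri + "\"";
        });
    }
}
```

For example, with a base of http://example.com/app/results.aspx, the tag <IMG src=images/logo.gif> would come back as <IMG src="http://example.com/app/images/logo.gif">. A regular expression is still only an approximation of an HTML parser, so unusual markup may slip through.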
 
Chester_M_Ragel commented:
I also think there is a small problem with this encode/decode handling of &nbsp;. Why don't you replace every &nbsp; with a space before decoding, and change it back to &nbsp; after encoding?

Chester
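
A possible variation on this replace-before-decoding idea is sketched below. It swaps &nbsp; for a marker rather than a plain space, because once &nbsp; becomes an ordinary space there is no way to tell which spaces to turn back. The marker value and the names NbspPreserver and DecodeKeepingNbsp are made up for illustration, and the marker is assumed never to occur in a real page.

```csharp
using System;
using System.Web;

public static class NbspPreserver
{
    // Hypothetical marker built from a private-use character; assumed absent from real pages.
    private const string Marker = "\uE000NBSP\uE000";

    // Decodes HTML entities but leaves every &nbsp; in its entity form.
    public static string DecodeKeepingNbsp(string html)
    {
        string shielded = html.Replace("&nbsp;", Marker);   // hide &nbsp; from the decoder
        string decoded = HttpUtility.HtmlDecode(shielded);  // decodes &amp;, &lt;, and so on
        return decoded.Replace(Marker, "&nbsp;");           // put the entity back
    }
}
```

With this, DecodeKeepingNbsp("a&nbsp;&amp;&nbsp;b") gives a&nbsp;&&nbsp;b: the ampersand is decoded while the non-breaking spaces stay as entities.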
 
MischiefMadness (author) commented:
Chester,

That is exactly what is happening. Every &nbsp; is being 'converted' to Â. Is replacing it in a string about my only option?

Buraksarica,
I'm not familiar with RegEx, to be perfectly honest. How will this 'fix' my problem?
 
Chester_M_Ragel commented:
Actually, when it decodes &nbsp;, it changes it to white space, but when it encodes, it does not change it back to &nbsp;. I think the easier way is to replace.
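
As a rough illustration of the replace approach: HtmlDecode turns &nbsp; into the non-breaking space character U+00A0 rather than an ordinary space, so it can be replaced back after decoding. One plausible source of the Â, which the thread does not confirm, is that U+00A0 is written out as UTF-8 (bytes C2 A0) and then read back as a single-byte encoding such as ISO-8859-1, where byte C2 displays as Â. The class and method names below are hypothetical.

```csharp
using System;
using System.Text;
using System.Web;

public static class NbspDemo
{
    // Turns decoded non-breaking spaces (U+00A0) back into the &nbsp; entity.
    public static string RestoreNbsp(string decoded)
    {
        return decoded.Replace("\u00A0", "&nbsp;");
    }

    // Simulates a reader that treats UTF-8 bytes as ISO-8859-1 text.
    public static string MisreadUtf8AsLatin1(string text)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding("iso-8859-1").GetString(utf8);
    }
}
```

Here HttpUtility.HtmlDecode("a&nbsp;b") yields a three-character string whose middle character has code 160, RestoreNbsp turns it back into a&nbsp;b, and MisreadUtf8AsLatin1 on that middle character produces a string starting with Â. If the encoding mismatch is the actual cause, declaring the attachment's charset to match the bytes written would be another way to avoid the stray character.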