MischiefMadness
asked on
Get HTML source using mshtml
I've been through this over and over again. What I'm trying to do is basically recreate an HTML document to send as an attachment in an email. Internet Explorer provides the Send To option, but it doesn't handle postbacks. For example, if I wanted to email a results page whose URL is the same as the inquiry page, IE will only grab the inquiry page. So I've been writing a side application that reads the source of the HTML document and recreates the page for an email attachment. I've been able to get to the page and view the source using mshtml. It seems that mshtml adds code to the page, <TBODY> for example. It also drops quotation marks, etc.
My code is able to get the active IE session, read the source, and output it. So that the page works in the email, I'm going through and adding the full address to items such as images. Everything seems to work fine except for special characters (HTML entities). For example:
&nbsp; = space
&amp; = &
I've added System.Web as a reference and tried using System.Web.HttpUtility.HtmlDecode().
It handles the &amp;, but it's converting &nbsp; to Â.
Here is the code so far:
public string GetSelectedDocumentSource(int Index)
{
    string WorkingSource = "";
    string LowerWorkingSource = "";
    string ReturnSource = "";
    string ImageTag = "";
    string ImageSource = "";
    string FullImageSource = "";
    int OpenImgIndex = 0;
    int CloseImgIndex = 0;
    int OpenImgSourceIndex = 0;
    int CloseImgSourceIndex = 0;
    IWebBrowserApp Browser = (IWebBrowserApp) IExplorerCollection[Index];
    HTMLDocument Document = (HTMLDocument) Browser.Document;
    // Get Images in Document
    System.Collections.IEnumerator ImageEnumr = Document.images.GetEnumerator();
    WorkingSource = System.Web.HttpUtility.HtmlDecode(Document.documentElement.outerHTML); // document source code
    LowerWorkingSource = WorkingSource.ToLower();
    OpenImgIndex = LowerWorkingSource.IndexOf("<img");
    if(OpenImgIndex != -1)
    {
        CloseImgIndex = LowerWorkingSource.IndexOf(">", OpenImgIndex);
    }
    else
    {
        CloseImgIndex = -1;
    }
    while(OpenImgIndex != -1 && CloseImgIndex != -1)
    {
        ReturnSource = ReturnSource + WorkingSource.Substring(0, OpenImgIndex);
        ImageTag = WorkingSource.Substring(OpenImgIndex, CloseImgIndex - OpenImgIndex);
        OpenImgSourceIndex = ImageTag.IndexOf("src=\"") + 5;
        CloseImgSourceIndex = ImageTag.IndexOf("\"", OpenImgSourceIndex);
        ImageSource = ImageTag.Substring(OpenImgSourceIndex, CloseImgSourceIndex - OpenImgSourceIndex);
        for(int Counter = 0; Counter < Document.images.length; Counter++)
        {
            ImageEnumr.MoveNext();
            HTMLImg image = (HTMLImg) ImageEnumr.Current;
            if(image.src.EndsWith(ImageSource))
            {
                FullImageSource = ImageTag.Replace(ImageSource, image.src);
            }
            //Console.WriteLine("Original Source: " + ImageSource + " - Full Source: " + image.src);
        }
        //Console.WriteLine("\r\n\r\n\r\n");
        ReturnSource = ReturnSource + FullImageSource;
        // Loop Clean Up
        LowerWorkingSource = LowerWorkingSource.Remove(0, CloseImgIndex);
        WorkingSource = WorkingSource.Remove(0, CloseImgIndex);
        OpenImgIndex = LowerWorkingSource.IndexOf("<img");
        if(OpenImgIndex != -1)
        {
            CloseImgIndex = LowerWorkingSource.IndexOf(">", OpenImgIndex);
        }
        else
        {
            CloseImgIndex = -1;
        }
        ImageEnumr.Reset();
    } // end while
    ReturnSource = ReturnSource + WorkingSource;
    System.Web.HttpUtility.HtmlEncode(ReturnSource);
    ReturnSource.Replace("<TBODY>", "");
    ReturnSource.Replace("</TBODY>", "");
    return ReturnSource;
}
The code is still a bit sloppy, so I do apologize.
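One plausible explanation for the Â, offered here as an assumption rather than something the thread confirms: HtmlDecode turns &nbsp; into the single character U+00A0, and if the output is later written as UTF-8 but read as a single-byte encoding, the two UTF-8 bytes of U+00A0 show up as Â followed by a no-break space. A minimal sketch of that effect (the class and method names are hypothetical):

```csharp
using System;
using System.Text;
using System.Web;

public static class NbspDemo
{
    // Decode HTML entities; &nbsp; becomes the single character U+00A0.
    public static string Decode(string html)
    {
        return HttpUtility.HtmlDecode(html);
    }

    // Simulate a reader that treats UTF-8 bytes as ISO-8859-1.
    // This is an assumption about what the mail client or viewer does.
    public static string MisreadUtf8AsLatin1(string text)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding("ISO-8859-1").GetString(utf8);
    }

    public static void Main()
    {
        string decoded = Decode("a&nbsp;&amp;&nbsp;b");
        Console.WriteLine(decoded.Length);              // 5: a, U+00A0, &, U+00A0, b

        string misread = MisreadUtf8AsLatin1(decoded);
        Console.WriteLine(misread.Contains("\u00C2"));  // True: U+00A0 now reads as Â + U+00A0
    }
}
```

If this is what is happening, either the output encoding has to match what the reader expects, or the U+00A0 characters have to be turned back into &nbsp; before the page is saved.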
Can you post a working example? I mean a string of HTML source before and after this function, so we can clearly understand the job. Also, why don't you use RegEx?
ASKER
Chester,
That is exactly what is happening. Every &nbsp; is being 'converted' to Â. Is replacing it in a string about my only option?
Buraksarica,
I'm not familiar with RegEx, to be perfectly honest. How will this 'fix' my problem?
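On the RegEx question: a regular expression would probably not address the &nbsp; issue by itself, but it could shorten the image rewriting loop. The sketch below is one possible shape, not the approach anyone in the thread posted. The names ImgSrcRewriter and MakeAbsolute are hypothetical, and like the posted code it assumes every src value is wrapped in double quotes, which mshtml output may not guarantee.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ImgSrcRewriter
{
    // Group 1: everything from "<img" up to the opening quote of src.
    // Group 2: the src value. Group 3: the closing quote.
    private static readonly Regex ImgSrc = new Regex(
        "(<img\\b[^>]*?\\bsrc\\s*=\\s*\")([^\"]*)(\")",
        RegexOptions.IgnoreCase);

    // Resolve each src value against the page address.
    // Uri(Uri, string) leaves already-absolute addresses unchanged.
    public static string MakeAbsolute(string html, Uri pageUri)
    {
        return ImgSrc.Replace(html, m =>
            m.Groups[1].Value
            + new Uri(pageUri, m.Groups[2].Value).AbsoluteUri
            + m.Groups[3].Value);
    }

    public static void Main()
    {
        Uri page = new Uri("http://example.com/app/results.aspx");
        string html = "<P><IMG alt=\"logo\" src=\"images/logo.gif\"></P>";
        Console.WriteLine(MakeAbsolute(html, page));
    }
}
```

This avoids walking Document.images for every tag, at the cost of depending on the page address being known, for example from the browser object's current location.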
Actually, when it decodes, it changes &nbsp; to white space, but when it encodes, it does not change it back to &nbsp;. I think the easier way is to replace it.
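A minimal sketch of the replace approach, assuming the white space in question is the no-break space character U+00A0. The helper name Clean is hypothetical. Note that string.Replace returns a new string rather than changing the original, so the result has to be assigned; in the posted code the results of the two TBODY Replace calls are discarded.

```csharp
using System;

public static class SourceCleanup
{
    // Turn decoded no-break spaces back into the &nbsp; entity and
    // strip the TBODY tags that mshtml adds. Each Replace result is
    // assigned, because strings are immutable.
    public static string Clean(string source)
    {
        string result = source.Replace("\u00A0", "&nbsp;");
        result = result.Replace("<TBODY>", "");
        result = result.Replace("</TBODY>", "");
        return result;
    }

    public static void Main()
    {
        string decoded = "<TABLE><TBODY><TR><TD>a\u00A0b</TD></TR></TBODY></TABLE>";
        Console.WriteLine(Clean(decoded));
        // <TABLE><TR><TD>a&nbsp;b</TD></TR></TABLE>
    }
}
```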