• Status: Solved

Http Question...

Dear Experts,

I want to get information from a remote website. I am using the MFC classes to fetch it, and that works fine: I am able to get the whole HTML file. What I want, though, is only the text content of the web page. I am using the following code.

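try
{
    CInternetSession *strSess = new CInternetSession("new");
    DWORD dwServiceType = AFX_INET_SERVICE_HTTP;
    CString pServer;
    CString pObject;
    INTERNET_PORT nport;
    CString strUrl = "http://www.google.com";
    AfxParseURL(strUrl, dwServiceType, pServer, pObject, nport);
    CHttpConnection *httpcon = strSess->GetHttpConnection(pServer, nport);
    // The verb value 1 in the original call is CHttpConnection::HTTP_VERB_GET
    CHttpFile *httpFile = httpcon->OpenRequest(CHttpConnection::HTTP_VERB_GET, pObject, NULL, 1, NULL, NULL, INTERNET_FLAG_EXISTING_CONNECT);
    BOOL result = httpFile->SendRequest(NULL, 0, NULL, 0);
    CStdioFile str("c:\\test.txt", CFile::modeCreate | CFile::modeReadWrite);

    CString text;   // line buffer (not declared in the snippet as posted)
    while (httpFile->ReadString(text))
    {
        str.WriteString(text + "\n");
    }

    // Close and free the objects in reverse order of creation
    str.Close();
    httpFile->Close();  delete httpFile;
    httpcon->Close();   delete httpcon;
    strSess->Close();   delete strSess;
}
catch (CInternetException *thro)
{
    TCHAR strError[255];
    cout << "There was some error";
    thro->GetErrorMessage(strError, 255);
    cout << strError << endl;
    thro->Delete();   // MFC exceptions caught by pointer must be deleted
}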

How can I do this?

Thanks.
Asked by: jamesasp
 
jhance Commented:
>>whole html file but what i want is only text content in the webpage

Can you be more specific?  The "text content" and the "html file" for a web page are the SAME THING.
 
KangaRoo Commented:
You mean you want the markup tags removed (along with all comment, script and style elements)?
Then you'd parse the file and discard anything that's not between <body> and </body>.
Within the body element, remove the markup tags, basically anything between the angle brackets '<' and '>'.
This also takes care of comments and most script and style declarations, since most HTML designers place those inside comments.
Finally, replace &xxx; codes (like &nbsp; and &lt;) with their proper characters.
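
The steps above could be sketched in portable C++ roughly as follows. This is only an illustration under stated assumptions, not a real HTML parser: the names htmlToText, findNoCase, eraseElements and eraseComments are made up for this sketch, only five entities are decoded, and comments, scripts and style sheets are erased explicitly because a '>' inside them would otherwise confuse the simple bracket scan. Text outside <body> (a title in <head>, for instance) is dropped by the first step.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Case-insensitive search; `needle` must already be lower case.
static std::size_t findNoCase(const std::string& hay, const std::string& needle,
                              std::size_t from = 0)
{
    for (std::size_t i = from; i + needle.size() <= hay.size(); ++i) {
        std::size_t j = 0;
        while (j < needle.size() &&
               std::tolower(static_cast<unsigned char>(hay[i + j])) == needle[j])
            ++j;
        if (j == needle.size()) return i;
    }
    return std::string::npos;
}

// Remove every <name ...> ... </name> element together with its content.
static void eraseElements(std::string& html, const std::string& name)
{
    const std::string open = "<" + name, close = "</" + name;
    std::size_t start;
    while ((start = findNoCase(html, open)) != std::string::npos) {
        std::size_t end = findNoCase(html, close, start);
        if (end != std::string::npos) end = html.find('>', end);
        // An unterminated element is dropped up to the end of the input.
        html.erase(start, end == std::string::npos ? std::string::npos
                                                   : end - start + 1);
    }
}

// Remove <!-- ... --> comments, which may themselves contain '>'.
static void eraseComments(std::string& html)
{
    std::size_t start;
    while ((start = html.find("<!--")) != std::string::npos) {
        std::size_t end = html.find("-->", start + 4);
        html.erase(start, end == std::string::npos ? std::string::npos
                                                   : end - start + 3);
    }
}

// Hypothetical helper: reduce an HTML page to its visible body text.
std::string htmlToText(std::string html)
{
    // Step 1: keep only what lies between <body ...> and </body>, if present.
    std::size_t b = findNoCase(html, "<body");
    if (b != std::string::npos) {
        std::size_t gt = html.find('>', b);
        if (gt != std::string::npos) {
            std::size_t first = gt + 1;
            std::size_t e = findNoCase(html, "</body", first);
            html = html.substr(first, e == std::string::npos ? std::string::npos
                                                             : e - first);
        }
    }

    // Step 2: drop comments, scripts and style sheets with their content.
    eraseComments(html);
    eraseElements(html, "script");
    eraseElements(html, "style");

    // Step 3: drop the remaining tags; each becomes a space so words stay apart.
    std::string text;
    bool inTag = false;
    for (char c : html) {
        if (c == '<') { inTag = true; text += ' '; }
        else if (c == '>' && inTag) inTag = false;
        else if (!inTag) text += c;
    }

    // Step 4: decode a few common entities; &amp; goes last so that
    // "&amp;lt;" is decoded once ("&lt;") and not twice.
    static const char* const entities[][2] = {
        {"&nbsp;", " "}, {"&lt;", "<"}, {"&gt;", ">"}, {"&quot;", "\""}, {"&amp;", "&"}
    };
    for (const auto& ent : entities) {
        const std::string from = ent[0], to = ent[1];
        for (std::size_t p = 0; (p = text.find(from, p)) != std::string::npos;
             p += to.size())
            text.replace(p, from.size(), to);
    }

    // Step 5: collapse runs of whitespace and trim both ends.
    std::string out;
    for (char c : text) {
        if (std::isspace(static_cast<unsigned char>(c))) {
            if (!out.empty() && out.back() != ' ') out += ' ';
        } else {
            out += c;
        }
    }
    if (!out.empty() && out.back() == ' ') out.pop_back();
    return out;
}
```

In the MFC code from the question, the lines read with ReadString could be appended to one string and passed to htmlToText before writing the result to the file. Malformed markup, attribute values containing '>' and the many other named or numeric entities are not handled here, which is the usual reason to reach for a real parser instead.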
 
jamesasp (Author) Commented:
Hi,

Thanks for your comments. Yes, what KangaRoo describes is right.
I want only the text, excluding the HTML markup.
For example, consider the following page.

<html>
<head>Welcome</head>
<body>
<table>
<tr>
<td>Hello welcome to this page</td>
</tr>
</table>
</body>
</html>

From this page I want to extract only

Welcome and Hello welcome to this page.

How can I do this? Is there any command in MFC to get the inner text, as we can when using the IE application object?

Thanks.
 
jhance Commented:
No, there are no MFC classes or functions to parse the HTML.  For this you need an HTML parser.  See:

http://www.w3.org/MarkUp/implementations.html

for information and software to do this.

What you are trying to do is NON-TRIVIAL!
 
tyronen Commented:
There is a sample showing how to use Microsoft's MSHTML as an HTML parser. You must have IE 4.0 or later installed.

http://msdn.microsoft.com/downloads/samples/internet/default.asp?url=/downloads/samples/internet/browser/walkall/default.asp

- tyronen
 
jamesasp (Author) Commented:
Thanks to all.