Solved

Http Question...

Posted on 2001-07-14
Last Modified: 2012-06-21
Dear Experts,

I want to get information from a remote website, and I am using the MFC classes to fetch it. That part works fine: I can download the whole HTML file. What I want, though, is only the text content of the web page. I am using the following code.

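try
{
    // Stack object: the session is closed automatically when it goes out of scope.
    CInternetSession session("new");
    DWORD dwServiceType = AFX_INET_SERVICE_HTTP;
    CString pServer;
    CString pObject;
    INTERNET_PORT nport;
    CString strUrl = "http://www.google.com";
    AfxParseURL(strUrl, dwServiceType, pServer, pObject, nport);
    CHttpConnection* httpcon = session.GetHttpConnection(pServer, nport);
    // HTTP_VERB_GET replaces the magic number 1 used originally.
    CHttpFile* httpFile = httpcon->OpenRequest(CHttpConnection::HTTP_VERB_GET, pObject, NULL, 1, NULL, NULL, INTERNET_FLAG_EXISTING_CONNECT);
    BOOL result = httpFile->SendRequest(NULL, 0, NULL, 0);
    CStdioFile str("c:\\test.txt", CFile::modeCreate | CFile::modeReadWrite);

    CString text;   // was used without being declared
    while (httpFile->ReadString(text))
    {
        str.WriteString(text + "\n");
    }

    // Release the request and the connection (both were leaked before).
    httpFile->Close();
    delete httpFile;
    httpcon->Close();
    delete httpcon;
}
catch (CInternetException* thro)
{
    TCHAR strError[255];
    cout << "There was some error";
    thro->GetErrorMessage(strError, 255);
    cout << strError << endl;
    thro->Delete();   // MFC exceptions caught by pointer must be deleted
}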

How can I do this?

Thanks.
Question by:jamesasp
6 Comments
 
LVL 32

Expert Comment

by:jhance
ID: 6282288
>>whole html file but what i want is only text content in the webpage

Can you be more specific?  The "text content" and the "html file" for a web page are the SAME THING.
 
LVL 7

Expert Comment

by:KangaRoo
ID: 6283154
You mean you want the markup tags removed (along with all comment, script and style elements)?
Then you'd parse the file and discard anything that's not between <body> and </body>.
Within the body element, remove the markup tags, basically anything between the angle brackets '<' and '>'.
This also takes care of comments and most script and style declarations, since most HTML designers place those inside comments.
Finally, replace &xxx; codes (like &nbsp; and &lt;) with their proper characters.
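
The steps described above can be sketched in portable C++ roughly as follows. This is only an illustration under stated assumptions, not a real HTML parser: the helper names (ExtractBody, StripTags, DecodeEntities, HtmlToText) are hypothetical, tag names are assumed to be lowercase, and only a handful of named entities are decoded.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the text between <body ...> and </body>,
// or the whole input when no body element is found (lowercase tags assumed).
std::string ExtractBody(const std::string& html)
{
    std::size_t open = html.find("<body");
    if (open == std::string::npos) return html;
    std::size_t start = html.find('>', open);
    if (start == std::string::npos) return html;
    ++start;
    std::size_t end = html.find("</body", start);
    if (end == std::string::npos) end = html.size();
    return html.substr(start, end - start);
}

// Hypothetical helper: drops everything between '<' and the next '>'.
std::string StripTags(const std::string& html)
{
    std::string out;
    bool inTag = false;
    for (std::size_t i = 0; i < html.size(); ++i) {
        char c = html[i];
        if (c == '<') inTag = true;
        else if (c == '>' && inTag) inTag = false;
        else if (!inTag) out += c;
    }
    return out;
}

// Hypothetical helper: decodes a few common named entities in one pass,
// so "&amp;lt;" becomes "&lt;" rather than "<".
std::string DecodeEntities(const std::string& text)
{
    static const char* const names[] = { "&nbsp;", "&lt;", "&gt;", "&amp;", "&quot;" };
    static const char values[] = { ' ', '<', '>', '&', '"' };
    std::string out;
    std::size_t i = 0;
    while (i < text.size()) {
        bool matched = false;
        if (text[i] == '&') {
            for (std::size_t k = 0; k < 5; ++k) {
                const std::string name(names[k]);
                if (text.compare(i, name.size(), name) == 0) {
                    out += values[k];
                    i += name.size();
                    matched = true;
                    break;
                }
            }
        }
        if (!matched) out += text[i++];
    }
    return out;
}

// Tags are stripped before entities are decoded, so text such as
// "&lt;b&gt;" survives as the literal characters "<b>".
std::string HtmlToText(const std::string& html)
{
    return DecodeEntities(StripTags(ExtractBody(html)));
}
```

Each line read from the CHttpFile could be appended to one std::string and passed to HtmlToText before writing the result to disk. Note that anything outside the body element, such as text in the head, is dropped by this sketch.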
 

Author Comment

by:jamesasp
ID: 6283239
Hi,

Thanks for your comments. Yes, KangaRoo understood correctly.
I want only the text, without the HTML tags.
For example, consider the following page:

<html>
<head>Welcome</head>
<body>
<table>
<tr>
<td>Hello welcome to this page</td>
</tr>
</table>
</body>
</html>

From this page I want to extract only

Welcome and Hello welcome to this page.

How can I do this? Is there any command in MFC to get the inner text, the way we can get innerText when using ieapplication?

Thanks

 
LVL 32

Expert Comment

by:jhance
ID: 6283338
No, there are no MFC classes or functions to parse the HTML.  For this you need an HTML parser.  See:

http://www.w3.org/MarkUp/implementations.html

for information and software to do this.

What you are trying to do is NON-TRIVIAL!
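
One reason this is harder than it looks: script and style content can itself contain '<' and '>', so blindly deleting everything between angle brackets can mangle the output. A rough sketch of a pre-pass that removes comments, script elements and style elements before any tag stripping is shown below. The names RemoveSpans and RemoveNonText are hypothetical, lowercase tag names and exact closing tags are assumed, and a real HTML parser handles many more cases than this.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: removes every span that begins with `open` and ends
// with `close`, delimiters included. An unterminated span is removed through
// the end of the input.
std::string RemoveSpans(const std::string& html,
                        const std::string& open,
                        const std::string& close)
{
    std::string out;
    std::size_t pos = 0;
    for (;;) {
        std::size_t start = html.find(open, pos);
        if (start == std::string::npos) {
            out.append(html, pos, std::string::npos);   // copy the remainder
            return out;
        }
        out.append(html, pos, start - pos);             // copy text before the span
        std::size_t end = html.find(close, start + open.size());
        if (end == std::string::npos) return out;       // span never closes
        pos = end + close.size();
    }
}

// Hypothetical helper: drops comments first, then script and style elements.
std::string RemoveNonText(const std::string& html)
{
    std::string s = RemoveSpans(html, "<!--", "-->");
    s = RemoveSpans(s, "<script", "</script>");
    s = RemoveSpans(s, "<style", "</style>");
    return s;
}
```

Running RemoveNonText on the downloaded page first means that a script such as `if (x < y) f();` never reaches the tag-stripping step.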
 

Accepted Solution

by:tyronen (earned 30 total points)
ID: 6288496
There is a sample showing how to use Microsoft's MSHTML as an HTML parser. You must have IE 4.0 or later.

http://msdn.microsoft.com/downloads/samples/internet/default.asp?url=/downloads/samples/internet/browser/walkall/default.asp

- tyronen
 

Author Comment

by:jamesasp
ID: 6322031
Thanks to all.