Converting a web page to a text file in C++

Joeyman
My idea is that I want to take the news articles on the front page of Slashdot (or any other website, but let's stick with /. for now) and convert them to plain text, in a .txt file. I have a couple of general ideas of how to do them, but I'm not quite sure how to implement them. The first is to somehow have the program copy right from the screen, and paste into a text file. That is probably impossible, however. The second, more feasible idea is to have it scan the source code for a certain string, and then copy everything until it finds another certain string. Another idea is for it to start at a certain line in the source code, and copy to a certain line. The only problem I see with that is getting the HTML tags out of the text. Any ideas? I'm using Visual C++ 6.0, if that makes any difference. Thank you.
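The second idea, scanning for one string and copying until another string turns up, can be sketched with std::string::find. This is only a rough illustration: ExtractBetween is a made-up helper name, and the marker strings are placeholders, since the thread never says what markup actually surrounds the articles. It assumes the whole page is already in memory as a std::string.

```cpp
#include <string>

// Hypothetical helper: returns the text between the first occurrence of
// startMarker and the next occurrence of endMarker.
// Returns an empty string if either marker is missing.
std::string ExtractBetween(const std::string& page,
                           const std::string& startMarker,
                           const std::string& endMarker)
{
    std::string::size_type begin = page.find(startMarker);
    if (begin == std::string::npos)
        return "";
    begin += startMarker.size();            // skip past the marker itself

    std::string::size_type end = page.find(endMarker, begin);
    if (end == std::string::npos)
        return "";

    return page.substr(begin, end - begin); // text between the two markers
}
```

Calling ExtractBetween(page, "<!--start-->", "<!--end-->") would return whatever sits between those two placeholder comments. The result still contains HTML tags, so they have to be stripped in a separate step.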
You know HTML tags are enclosed by < and >, so just ignore those.

For JavaScript, just ignore everything between <script> and </script>.

Author

Commented:
CoolBreeze, perhaps you misunderstood my question. I know what I need to ignore. I just don't know HOW to ignore them, or even how to automatically get the text in a form I can deal with.

Commented:
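#include <stdio.h>
#include <string.h>
#include <ctype.h>

/* Copies fptrIN to fptrOUT, dropping tags and everything between
   <script> and </script>. Returns the number of characters written. */
long StripHtml(FILE *fptrIN, FILE *fptrOUT)
{
    int InTag = 0;          /* 1 while between '<' and '>' */
    int IsJavaScript = 0;   /* 1 while between <script> and </script> */
    char name[8];           /* start of the current tag name, lower-cased */
    int n = 0;              /* characters stored in name */
    int NameDone = 0;       /* 1 once the tag name has ended */
    long x = 0;             /* characters written; could be used to fseek */
    int ch;                 /* int, not char, so EOF can be detected */

    while ((ch = fgetc(fptrIN)) != EOF)
    {
        if (ch == '<')
        {
            InTag = 1;      /* start ignoring until '>' */
            n = 0;
            NameDone = 0;
        }
        else if (InTag && ch == '>')
        {
            name[n] = '\0';
            if (strcmp(name, "script") == 0)  IsJavaScript = 1;
            if (strcmp(name, "/script") == 0) IsJavaScript = 0;
            InTag = 0;
        }
        else if (InTag)
        {
            /* keep only the tag name, i.e. the text before the first space */
            if (isspace(ch) || n == 7) NameDone = 1;
            else if (!NameDone) name[n++] = (char)tolower(ch);
        }
        else if (!IsJavaScript)
        {
            fputc(ch, fptrOUT);
            x++;
        }
    }
    return x;
}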

Commented:
You would also want to add a carriage return and newline (\r\n) at the end of each line of text. When you run into white space while you're writing to your file, look ahead to the next word and keep a running total of the line length. If the word would take the line over, say, 80 characters, write \r\n first; otherwise just write the word. The \r is there for editors that expect it before the \n. (Or you could open the output file in text mode and write only \n.)

Doing it char by char and building strings with those chars allows you to create logic that can extract out what you want.
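The line-wrapping suggestion above could look roughly like this. It is a sketch under a few assumptions: WrapText is a hypothetical name, the text is already in a std::string, and it is fine to collapse every run of white space (including existing line breaks) into a single space. It emits a plain '\n', on the assumption that the output file is opened in text mode on Windows, where the runtime writes it out as \r\n.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: re-flows text so that no line is longer than width
// characters, breaking only at white space. A single word longer than
// width is left on a line of its own.
std::string WrapText(const std::string& text, std::string::size_type width)
{
    std::istringstream words(text);
    std::string word;
    std::string result;
    std::string::size_type lineLen = 0;     // length of the current line

    while (words >> word)
    {
        // Would this word (plus a separating space) overflow the line?
        if (lineLen > 0 && lineLen + 1 + word.size() > width)
        {
            result += '\n';                 // start a new line
            lineLen = 0;
        }
        if (lineLen > 0)
        {
            result += ' ';                  // separator between words
            ++lineLen;
        }
        result += word;
        lineLen += word.size();
    }
    return result;
}
```

WrapText(text, 80) gives the 80-column limit mentioned above. If paragraph breaks need to survive, each paragraph would have to be wrapped separately.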

Author

Commented:
The important part is getting the HTML file off of the internet automatically, turning it into a text file automatically, and removing a certain number of lines at the top and bottom automatically. Can anyone help me with that?
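Of those three steps, removing a fixed number of lines from the top and bottom is the easiest to show on its own. The sketch below assumes the page has already been downloaded and converted to text held in a std::string. TrimLines and its parameter names are made up for illustration. It is written against standard C++, so an older compiler such as Visual C++ 6.0 may need small adjustments.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: returns text without its first skipTop lines
// and its last skipBottom lines. Every kept line ends with '\n'.
std::string TrimLines(const std::string& text,
                      std::size_t skipTop, std::size_t skipBottom)
{
    // Split the text into individual lines.
    std::vector<std::string> lines;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line))
        lines.push_back(line);

    std::string result;
    if (skipTop + skipBottom >= lines.size())
        return result;                      // nothing left to keep

    // Copy only the lines in the middle.
    for (std::size_t i = skipTop; i < lines.size() - skipBottom; ++i)
    {
        result += lines[i];
        result += '\n';
    }
    return result;
}
```

For example, TrimLines(text, 10, 5) would drop the first ten and last five lines. The right counts depend on the page and have to be found by looking at the converted output.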

Commented:
This question didn't show any activity for more than 21 days. I will ask Community Support to close it unless you finalize it yourself within 7 days.
You can always request to keep this question open. But remember, experts can only help if you provide feedback to their comments.
Unless there is objection or further activity,  I will suggest to

    "refund the points and delete this question"

since nobody had a satisfying answer for you.

PLEASE DO NOT ACCEPT THIS COMMENT AS AN ANSWER!
========
Werner

Author

Commented:
Please DO NOT close this question.
It is time to close this question since there wasn't any feedback for a long time. You are free to repost to get it to the start of the queue.

Points refunded and moved to PAQ

** Mindphaser - Community Support Moderator **
