Joeyman

Converting a web page to a text file in C++

I want to take the news articles on the front page of Slashdot (or any other website, but let's stick with /. for now) and convert them to plain text in a .txt file. I have a couple of general ideas for how to do this, but I'm not quite sure how to implement them. The first is to somehow have the program copy right from the screen and paste into a text file. That is probably impossible, however. The second, more feasible idea is to have it scan the source code for a certain string, and then copy everything until it finds another certain string. Another idea is for it to start at a certain line in the source code and copy up to a certain line. The only problem I see with that is getting the HTML tags out of the text. Any ideas? I'm using Visual C++ 6.0, if that makes any difference. Thank you.
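A minimal sketch of the second idea (scan for one string, copy until another), assuming the page source has already been read into a std::string. The function name ExtractBetween and the marker strings are hypothetical; the real markers depend on what the page's HTML looks like.

```cpp
#include <string>

// Return the text between the first occurrence of startMarker and the
// next occurrence of endMarker after it. Returns an empty string if
// either marker is missing. The marker values are placeholders that
// depend on the page being processed.
std::string ExtractBetween(const std::string& page,
                           const std::string& startMarker,
                           const std::string& endMarker)
{
    std::string::size_type begin = page.find(startMarker);
    if (begin == std::string::npos)
        return "";
    begin += startMarker.size();            // skip past the start marker

    std::string::size_type end = page.find(endMarker, begin);
    if (end == std::string::npos)
        return "";

    return page.substr(begin, end - begin);
}
```

The result still contains whatever HTML tags sat between the two markers, so the tags have to be stripped in a separate pass.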
CoolBreeze

You know HTML tags are enclosed by < and >, so just ignore those.

For JavaScript, just ignore everything between <script> and </script>.
Joeyman

ASKER

CoolBreeze, perhaps you misunderstood my question. I know what I need to ignore. I just don't know HOW to ignore them, or even how to automatically get the text in a form I can deal with.
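#include <stdio.h>
#include <string.h>
#include <ctype.h>

/* Copy fptrIN to fptrOUT, leaving out HTML tags and anything between
   <script ...> and </script>. Returns the number of characters written.
   Simplification: a '<' inside the script body itself is treated as the
   start of a tag. */
long StripTags(FILE *fptrIN, FILE *fptrOUT)
{
    int IsJavaScript = 0;  /* 1 while between <script> and </script> */
    long x = 0;            /* characters written so far */
    int ch;                /* int rather than char so EOF is detectable */
    char tag[8];           /* lower-cased first characters of the current tag */
    size_t n;

    while ((ch = fgetc(fptrIN)) != EOF)
    {
        if (ch == '<')
        {
            /* Ignore everything up to the closing '>', remembering just
               enough of the tag to recognize script and /script. */
            n = 0;
            while ((ch = fgetc(fptrIN)) != EOF && ch != '>')
            {
                if (n < sizeof(tag) - 1)
                    tag[n++] = (char)tolower(ch);
            }
            tag[n] = '\0';

            if (strncmp(tag, "script", 6) == 0)
                IsJavaScript = 1;
            else if (strncmp(tag, "/script", 7) == 0)
                IsJavaScript = 0;
        }
        else if (!IsJavaScript)
        {
            fputc(ch, fptrOUT);
            x++;
        }
    }
    return x;
}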
You would also want to end each line of text with a carriage return and newline (\r\n). If you run into white space while you're writing to your file, look ahead at the next word and total up the characters on the current line. If the total would go over, say, 80, write \r\n first; otherwise write the word. The \r is there so that some editors recognize the line break. (Alternatively, open the output file in text mode and write just \n; on Windows the runtime then translates it to \r\n.)

Doing it char by char and building strings with those chars allows you to create logic that can extract out what you want.
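The 80-column wrapping described above could look something like this. It is only a sketch under a few assumptions: the tag-free text is already in a std::string, any run of white space counts as one word break, and the function name WrapText is made up for illustration.

```cpp
#include <sstream>
#include <string>

// Re-flow text so that no output line is longer than width characters,
// breaking only at white space. A single word longer than width is
// left on a line of its own rather than being split.
std::string WrapText(const std::string& text, std::string::size_type width)
{
    std::istringstream words(text);
    std::string word;
    std::string out;
    std::string::size_type lineLen = 0;     // length of the current output line

    while (words >> word)
    {
        if (lineLen != 0 && lineLen + 1 + word.size() > width)
        {
            out += '\n';                    // word does not fit: start a new line
            lineLen = 0;
        }
        if (lineLen != 0)
        {
            out += ' ';                     // single space between words
            ++lineLen;
        }
        out += word;
        lineLen += word.size();
    }
    return out;
}
```

The output uses plain \n, so writing it to a file opened in text mode on Windows gives \r\n line endings without any extra work.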
Avatar of Joeyman

ASKER

The important part is getting the HTML file off of the internet automatically, turning it into a text file automatically, and removing a certain number of lines at the top and bottom automatically. Can anyone help me with that?
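For the last step, dropping a fixed number of lines from the top and bottom, here is a rough sketch. It assumes the converted text is already in memory as a std::string; the name TrimLines and its parameters are hypothetical. The download step is not shown. On Windows, one option for that part is URLDownloadToFile from urlmon, which saves a URL to a local file that can then be opened with fopen.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Drop the first skipTop and the last skipBottom lines of text and
// return what is left, one '\n' after each remaining line. Returns an
// empty string if there are not enough lines to keep any.
std::string TrimLines(const std::string& text,
                      std::size_t skipTop,
                      std::size_t skipBottom)
{
    std::vector<std::string> lines;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line))
        lines.push_back(line);

    if (skipTop + skipBottom >= lines.size())
        return "";

    std::string out;
    for (std::size_t i = skipTop; i < lines.size() - skipBottom; ++i)
    {
        out += lines[i];
        out += '\n';
    }
    return out;
}
```

How many lines to skip has to be worked out by looking at the page, and it will break whenever the site changes its layout, so searching for marker strings may hold up better than fixed line counts.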
This question didn't show any activity for more than 21 days. I will ask Community Support to close it unless you finalize it yourself within 7 days.
You can always request to keep this question open. But remember, experts can only help if you provide feedback to their comments.
Unless there is objection or further activity, I will suggest to

    "refund the points and delete this question"

since nobody had a satisfying answer for you.

PLEASE DO NOT ACCEPT THIS COMMENT AS AN ANSWER!
========
Werner
Joeyman

ASKER

Please DO NOT close this question.
ASKER CERTIFIED SOLUTION
Mindphaser