
Stripping HTML from a file

I need to strip all HTML from a file so I can work with the remaining text. I've been trying to use a while loop, as follows:

int x = outStr.Find( '<' );
while ( x != -1 )
{
      x = outStr.Find( '<' );
      int y = outStr.Find( '>' );
      outStr.Delete( x, y - x + 1);
}

This fails, hogging a massive amount of memory and CPU.

Anyone have any ideas?

zeek

monkesdb

ok, here goes. *cracks knuckles*...

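#include <algorithm>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>
using namespace std;

string stripTags(const string& oldStr)
{
    ostringstream newStr;
    ostream_iterator<char> newInsert(newStr);
    vector<string::const_iterator> v;   // pairs: start of a tag, one past its end
    string::const_iterator si = oldStr.begin();

    // record where every <...> pair begins and ends,
    // resuming each search from the end of the previous tag
    while (true)
    {
        string::const_iterator lt = find(si, oldStr.end(), '<');
        if (lt == oldStr.end()) break;
        string::const_iterator gt = find(lt, oldStr.end(), '>');
        if (gt == oldStr.end()) break;   // unterminated tag: keep the rest as text
        v.push_back(lt);
        v.push_back(gt + 1);
        si = gt + 1;
    }

    // copy the text that lies between the recorded tags
    string::const_iterator from = oldStr.begin();
    for (size_t i = 0; i + 1 < v.size(); i += 2)
    {
        copy(from, v[i], newInsert);
        from = v[i + 1];
    }
    copy(from, oldStr.end(), newInsert);

    return newStr.str(); // like magic, all the tags have vanished
}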

(I think.) This is of course untested.

monkesdb

You'll need

#include <sstream>

for ostringstream, plus #include <iterator> for ostream_iterator.

ASKER CERTIFIED SOLUTION

Dexstar


This solution is only available to members.


monkesdb

I don't know CString. Nevertheless, is it not possible to accomplish the same logic using CString that I have using STL? Create a list of indices which are the positions of the <'s and >'s, then concatenate every other pair.

However, I think I see your problem.

int x = outStr.Find( '<' );
int y = outStr.Find( '>', x );
while ( (x != -1) && (y != -1) )
{
outStr.Delete( x, y - x + 1);
x = outStr.Find( '<', x ); // without the x here you start from the start each time
y = outStr.Find( '>', x );
}
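
For anyone following along without MFC, the corrected loop above might be sketched with std::string roughly as follows. The helper name eraseTags is made up for illustration, and this is still the quick-and-dirty approach: it assumes every '<' starts a tag and every tag is closed by the next '>'.

```cpp
#include <string>

// Hypothetical helper (not from the thread): erases every "<...>" span in place.
void eraseTags(std::string& text)
{
    std::string::size_type open = text.find('<');
    while (open != std::string::npos)
    {
        std::string::size_type close = text.find('>', open);
        if (close == std::string::npos)
            break;                            // unterminated tag: leave the tail alone
        text.erase(open, close - open + 1);   // drop '<' through '>'
        open = text.find('<', open);          // resume where the tag used to be
    }
}
```

Calling eraseTags on a string holding "<p>Hello <b>world</b></p>" should leave "Hello world" behind. Resuming the search at the erased position avoids rescanning text that is already known to be tag-free.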

grg99

You can't just look for pairs of " < >" strings, as HTML has several context-dependent places where those characters can appear with other meanings, not as HTML tag delimiters.

For example, in a <script> </script> environment, almost anything can appear, including the use of the delimiters as "less than" and "greater than" symbols.

They can also appear inside any quoted string, as well as any comment.

So the only fully correct way is to write a (relatively simple) HTML parser. Basically, you have to look at each character in sequence and do the right thing in order to skip <script></script> sequences, quoted strings, and comments.

About 60 lines of code at most, I'd guess.

Regards,

grg99
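
A rough sketch of the kind of character-by-character scanner described above, in C++. The names stripHtml and matchesAt are invented for this example, and it is far from a full HTML parser: it only skips comments, <script> bodies, and '>' characters inside quoted attribute values, and it ignores things like <style>, entities, and CDATA.

```cpp
#include <cctype>
#include <string>

namespace {

// True if `text` contains `word` (lowercase) at `pos`, ignoring ASCII case.
bool matchesAt(const std::string& text, std::size_t pos, const std::string& word)
{
    if (text.size() - pos < word.size()) return false;
    for (std::size_t i = 0; i < word.size(); ++i)
        if (std::tolower(static_cast<unsigned char>(text[pos + i])) != word[i])
            return false;
    return true;
}

} // namespace

// Hypothetical helper: returns the text of `html` with tags removed.
std::string stripHtml(const std::string& html)
{
    enum State { Text, Tag, Comment, Script };
    State state = Text;
    char quote = '\0';        // active quote character while inside a tag
    bool scriptTag = false;   // true while inside an opening <script ...> tag
    std::string out;

    for (std::size_t i = 0; i < html.size(); ++i)
    {
        const char c = html[i];
        switch (state)
        {
        case Text:
            if (c != '<') { out += c; break; }
            if (matchesAt(html, i, "<!--")) { state = Comment; i += 3; }
            else if (i + 1 < html.size() &&
                     (std::isalpha(static_cast<unsigned char>(html[i + 1])) ||
                      html[i + 1] == '/' || html[i + 1] == '!' || html[i + 1] == '?'))
            {
                scriptTag = matchesAt(html, i, "<script");
                state = Tag;
            }
            else out += c;    // a bare '<', as in "1 < 2"
            break;
        case Tag:
            if (quote) { if (c == quote) quote = '\0'; }
            else if (c == '"' || c == '\'') quote = c;
            else if (c == '>') { state = scriptTag ? Script : Text; scriptTag = false; }
            break;
        case Comment:
            if (matchesAt(html, i, "-->")) { state = Text; i += 2; }
            break;
        case Script:
            if (matchesAt(html, i, "</script")) state = Tag;   // consume the closing tag
            break;
        }
    }
    return out;
}
```

The design is a plain state machine: one pass, one character at a time, with the current state deciding whether a character is copied or skipped. Note that the "<script" check is a prefix match, so a made-up tag such as <scripted> would be treated as a script too.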

Dexstar

@monkesdb: Yeah, he could port your algorithm, but maybe he doesn't know STL string... Anyway, I don't think you need that X in there because it is okay to start from the beginning each time because you remove the first pair of <>'s each time.

@grg99: Maybe that's true if he wants it to be 100% right, but if he just wants something quick and dirty and that will mostly work, then I think his algorithm should be fine.

D*

jkr

Are you on Win32? If so, copying the HTML to the clipboard and getting it back with CF_TEXT should do it (like copying from a web page and then pasting into Notepad).

jkr

BTW, there is also no need to reinvent the wheel - there are lots of HTML parsers out there, e.g. 'libwww' from the w3c http://www.w3.org/Library/ or http://www.odin-consulting.com/OPP/

gkatz

Switch programming languages to one with regular expressions, such as Perl. You can write your program in 5 to 10 lines and it will run fairly quickly. Perl can also be run inside of C++ if you have the rest of the program written in that language. C++ is a great language for some things, but why not use a language designed for tackling problems such as your own?

Remember, to someone with a hammer everything looks like a nail.

Don't be afraid to try a new language.

-gkatz
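
If switching languages is not an option, the regular-expression idea can also be tried without leaving C++, since C++11 and later ship std::regex. This is a substitute for the Perl suggestion, not what gkatz posted; the function name stripTagsRegex is made up, and the pattern shares the limits of the quick-and-dirty loop (it knows nothing about <script> bodies or a '>' inside a quoted attribute).

```cpp
#include <regex>
#include <string>

// Hypothetical helper: deletes anything that looks like <...>.
std::string stripTagsRegex(const std::string& html)
{
    static const std::regex tag("<[^>]*>");    // '<', any run of non-'>' characters, '>'
    return std::regex_replace(html, tag, "");  // replace every match with nothing
}
```

For example, stripTagsRegex("<p>Hello <b>world</b></p>") should return "Hello world".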

jconde

Will plain C code help you? The only difference is you'll be working with char * instead of complex string/streaming classes.

My code works perfectly well and it's used in major open-source applications. It's quite simple, by the way.

Just drop a line to say whether you want it or not!

jconde

Actually, here's the function and a simple demo.

(Even though my code looks much longer and maybe more complex than the previous suggestions, it's really light on CPU and works very fast.)

#include <stdio.h>
#include <stdlib.h>   /* free */
#include <string.h>   /* strlen, strcpy, strdup (POSIX) */

/* Strips tags from rbuf in place.
   state 0: plain text (copied to the output)
   state 1: inside <...>
   state 2: inside <? ... ?> */
void StripHtmlTags(char *rbuf)
{
    char *tbuf, *buf, *p, *tp, *rp, c, lc;
    int br, i=0, state=0, len;
    len = strlen(rbuf);
    buf = strdup(rbuf);   /* scan a copy, write the result back into rbuf */
    c = *buf;
    lc = '\0';
    p = buf;
    rp = rbuf;
    br = 0;
    tp=NULL;
    tbuf=NULL;
    while(i<len)
    {
        switch (c)
        {
            case '<':
                if (state == 0)
                {
                    lc = '<';
                    state = 1;
                }
                break;

            case '(':
                if (state == 2)
                {
                    if (lc != '\"')
                    {
                        lc = '(';
                        br++;
                    }
                }
                else if (state == 0)
                    *(rp++) = c;
                break;

            case ')':
                if (state == 2)
                {
                    if (lc != '\"')
                    {
                        lc = ')';
                        br--;
                    }
                }
                else if (state == 0)
                    *(rp++) = c;
                break;

            case '>':
                if (state == 1)
                {
                    lc = '>';
                    state = 0;
                }
                else if (state == 2)
                    if (!br && lc != '\"' && *(p-1)=='?')
                    {
                        state = 0;
                        tp = tbuf;
                    }
                break;

            case '\"':
                if (state == 2)
                {
                    if (lc == '\"')
                        lc = '\0';
                    else if (lc != '\\')
                        lc = '\"';
                }
                else if (state == 0)
                    *(rp++) = c;
                break;

            case '?':
                if (state==1 && *(p-1)=='<')
                {
                    br=0;
                    state=2;
                    break;
                }
                /* otherwise fall through to default */

            default:
                if (state == 0)
                    *(rp++) = c;
                break;
        }
        c = *(++p);
        i++;
    }
    *rp = '\0';
    free(buf);
}

int main()
{
    char x[100];
    char *xp;
    strcpy(x, "<html>This is an <b>HTML</b> test <font size=\"1\">that</font> <br>strips<p> tags</p></html>");
    xp = x;
    StripHtmlTags(xp);
    printf("%s\n", xp);
    return 0;
}

jconde

All that remains is for you to load the contents of the file into a char *, send it to StripHtmlTags, and write the result back out. Using basic C functions such as fopen, fdopen, or even open will be much faster than dealing with the streams and wrappers in the string class you're using!

DanRollins

zeek_ja,
Please post a comment acknowledging the help these experts have provided. Thanks!
-- DanRollins, EE Page Editor

grg99

jconde's code is a very good start.

But it's going to get tripped up by the <script> tag. You need about 10 more lines to handle that properly.

Also, <TABLE>s are going to be unrecognizable. Another 20 lines to render these somewhat readable.

zeek_ja

ASKER

Sorry guys, I was out of town and just got back this morning.

I had a chance to try Dexstar's method and it works great. I do realize, however, that it would get tripped up by <script>, but for my purposes that is not an issue, and I therefore find that it is the answer to my question.

Thank you all for posting.

zeek