Stripping HTML from a file

I need to strip all HTML from a file so I can work with the remaining text. I've been trying to use a while loop as follows:

int x = outStr.Find( '<' );
while ( x != -1 )
{
      x = outStr.Find( '<' );
      int y = outStr.Find( '>' );
      outStr.Delete( x, y - x  + 1);
}

This craps out, hogging a massive amount of memory and CPU.

Anyone have any ideas?

zeek
zeek_jaAsked:

monkesdbCommented:
ok, here goes. *cracks knuckles*...


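#include <string>
#include <algorithm>
#include <iterator>
#include <sstream>
#include <vector>
using namespace std;

string stripTags(const string& oldStr)
{
    ostringstream newStr;
    ostream_iterator<char> newInsert(newStr);
    vector<string::const_iterator> v;   // alternating positions of '<' and '>'
    string::const_iterator si = oldStr.begin();

    while (true)
    {
        si = find(si, oldStr.end(), '<');   // resume from the last match, not from begin()
        if (si == oldStr.end()) break;
        string::const_iterator close = find(si, oldStr.end(), '>');
        if (close == oldStr.end()) break;   // unterminated tag: leave it in the output
        v.push_back(si);
        v.push_back(close);
        si = close + 1;
    }

    // copy the text that lies between each '>' and the following '<'
    string::const_iterator from = oldStr.begin();
    for (size_t i = 0; i + 1 < v.size(); i += 2)
    {
        copy(from, v[i], newInsert);
        from = v[i + 1] + 1;
    }
    copy(from, oldStr.end(), newInsert);

    return newStr.str();  // like magic, all the tags have vanished
}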


(i think) this is of course untested.
0
monkesdbCommented:
you'll need

#include <sstream>    // ostringstream
#include <iterator>   // ostream_iterator
0
DexstarCommented:
@zeek_ja:

> int x = outStr.Find( '<' );
> while ( x != -1 )
> {
>      x = outStr.Find( '<' );
>      int y = outStr.Find( '>' );
>      outStr.Delete( x, y - x  + 1);
> }

Try this:
     int x = outStr.Find( '<' );
     while ( x != -1 )
     {
          int y = outStr.Find( '>', x );     // look for '>' only after the '<'
          if ( y == -1 )
               break;                        // unterminated tag: stop
          outStr.Delete( x, y - x  + 1);
          x = outStr.Find( '<' );
     }

@monkesdb:  Why would you provide an answer in STL strings when he was clearly using CStrings?  Switching libraries is not always an option.

Hope That Helps,
Dex*
0


monkesdbCommented:
I don't know CString. Nevertheless, is it not possible to accomplish the same logic using CString that I have using STL? Create a list of indices which are the positions of the <'s and >'s, then concatenate every other pair.

however, i think i see your problems.

     int x = outStr.Find( '<' );
     while ( x != -1 )
     {
          int y = outStr.Find( '>', x );
          if ( y == -1 )
               break;                        // unterminated tag: stop
          outStr.Delete( x, y - x  + 1);
          x = outStr.Find( '<', x );        // without the x here you start from the start each time
     }
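
For anyone who wants to try the loop without MFC, here is a minimal sketch of the same find-and-delete idea using std::string. The function name stripTagsInPlace is made up for illustration, and this is a stand-in for the CString version rather than a drop-in replacement. As far as I can tell, the original loop can spin forever when a '>' appears before the first '<', because the computed count is then zero or negative and nothing gets deleted. Searching for '>' only from the position of the '<' avoids that.

```cpp
#include <string>

// Removes every "<...>" run from `text` in place.
// Hypothetical helper name; std::string stands in for CString here.
void stripTagsInPlace(std::string& text)
{
    std::string::size_type x = text.find('<');
    while (x != std::string::npos)
    {
        // Search for '>' only after the '<', so a stray '>' earlier
        // in the text cannot produce a zero or negative length.
        std::string::size_type y = text.find('>', x);
        if (y == std::string::npos)
            break;                      // unterminated '<': leave the rest alone
        text.erase(x, y - x + 1);
        x = text.find('<', x);          // resume where the tag was, not at 0
    }
}
```

Resuming from x is only an optimization: restarting from 0 gives the same result, it just rescans text that is already known to be tag-free.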
0
grg99Commented:
You can't just look for pairs of " < >" strings, as HTML has several context-dependent places where those characters can appear with other meanings, not as HTML tag delimiters.

For example, in a <script> </script>  environment, almost anything can appear, including the use of the delimiters as "less than" and "greater than" symbols.

They can also appear inside any quoted string, as well as any comment.

So the only fully correct way is to write a (relatively simple) HTML parser.  Basically, you have to look at each character in sequence and do the right thing, in order to skip <script></script> sequences, quoted strings, and comments.

About 60 lines of code at most, I'd guess.
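
A rough sketch of what such a character-by-character scanner might look like is below. It is an illustration under simplifying assumptions, not a full HTML parser: the names stripHtml and startsWithNoCase are invented here, only <script> gets special treatment (not <style>), any tag whose name merely begins with "script" is treated as a script tag, and entities such as &lt; are left untouched.

```cpp
#include <cctype>
#include <string>

// Case-insensitive test: does `s` contain `word` (given in lower case) at index `i`?
static bool startsWithNoCase(const std::string& s, std::size_t i, const char* word)
{
    for (; *word; ++word, ++i)
        if (i >= s.size() ||
            std::tolower(static_cast<unsigned char>(s[i])) != *word)
            return false;
    return true;
}

// Removes tags, comments and <script>...</script> bodies from `html`.
std::string stripHtml(const std::string& html)
{
    std::string out;
    std::size_t i = 0;
    const std::size_t n = html.size();
    while (i < n)
    {
        if (html[i] != '<') { out += html[i++]; continue; }

        if (html.compare(i, 4, "<!--") == 0)        // comment: skip to "-->"
        {
            std::size_t end = html.find("-->", i + 4);
            i = (end == std::string::npos) ? n : end + 3;
            continue;
        }

        bool isScript = startsWithNoCase(html, i, "<script");
        char quote = '\0';                          // non-zero while inside a quoted attribute
        for (++i; i < n; ++i)                       // skip to the '>' that ends the tag
        {
            char c = html[i];
            if (quote) { if (c == quote) quote = '\0'; }
            else if (c == '"' || c == '\'') quote = c;
            else if (c == '>') { ++i; break; }
        }

        if (isScript)                               // drop the script body as well
        {
            while (i < n && !startsWithNoCase(html, i, "</script"))
                ++i;
            // the closing tag itself is removed by the next pass of the outer loop
        }
    }
    return out;
}
```

With this sketch, a '>' inside a quoted attribute value no longer ends the tag early, and '<' or '>' used as comparison operators inside a script body are skipped along with the rest of the script.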

Regards,

grg99

0
DexstarCommented:
@monkesdb:  Yeah, he could port your algorithm, but maybe he doesn't know STL string...  Anyway, I don't think you need that X in there because it is okay to start from the beginning each time because you remove the first pair of <>'s each time.

@grg99:  Maybe that's true if he wants it to be 100% right, but if he just wants something quick and dirty and that will mostly work, then I think his algorithm should be fine.

D*
0
jkrCommented:
Are you on Win32? If so, copying the HTML to the clipboard and getting it back with CF_TEXT should do it (like copying from a web page and then pasting into Notepad).
0
jkrCommented:
BTW, there is also no need to reinvent the wheel - there are lots of HTML parsers out there, e.g. 'libwww' from the w3c http://www.w3.org/Library/ or http://www.odin-consulting.com/OPP/
0
gkatzCommented:
Switch programming languages to one with regular expressions, such as Perl.  You can write your program in 5 to 10 lines and it will run fairly quickly.  Perl can also be run inside of C++ if you have the rest of the program written in that language.  C++ is a great language for some things, but why not use a language designed for tackling problems such as your own?
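
For comparison, the regular-expression approach can also be sketched without leaving C++, assuming a compiler that ships the C++11 <regex> header. The function name stripTagsRegex and the pattern are illustrative choices, and the pattern has the same blind spots as the simple find-and-delete loop: it knows nothing about <script> bodies, comments, or a '>' inside a quoted attribute.

```cpp
#include <regex>
#include <string>

// Replaces every "<...>" run with nothing.
// Hypothetical helper; assumes C++11 <regex> is available.
std::string stripTagsRegex(const std::string& html)
{
    static const std::regex tag("<[^>]*>");   // '<', any run of non-'>' characters, then '>'
    return std::regex_replace(html, tag, "");
}
```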

Remember, to someone with a hammer everything looks like a nail.  

Don't be afraid to try a new language.


-gkatz
0
jcondeCommented:
Will plain C code help you? The only difference is that you'll be working with char * instead of complex string/streaming classes.

My code works perfectly well and it's used in major open-source applications. It's quite simple, by the way.

Just drop a line if you want it.
0
jcondeCommented:
Actually, here's the function and a simple demo:

(Even though my code looks much longer and maybe more complex than the previous suggestions, it's really light on CPU and works very fast.)

#include <stdio.h>
#include <stdlib.h>   /* free */
#include <string.h>   /* strlen, strcpy, strdup */
         
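/* Strips tags from rbuf in place.
   state 0 = plain text, 1 = inside a <tag>, 2 = inside a <? ... ?> block. */
void StripHtmlTags(char *rbuf)
{
  char *tbuf, *buf, *p, *tp, *rp, c, lc;
  int br, i=0, state=0, len;
  len = strlen(rbuf);
  buf = strdup(rbuf);
  if (buf == NULL)
    return;             /* out of memory: leave rbuf untouched */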
  c = *buf;  
  lc = '\0';
  p = buf;
  rp = rbuf;
  br = 0;
  tp=NULL;
  tbuf=NULL;  
  while(i<len)
  {
    switch (c)
    {
    case '<':
      if (state == 0)
      {
        lc = '<';
        state = 1;  
      }
      break;
       
    case '(':
      if (state == 2)
      {
        if (lc != '\"')
        {  
          lc = '(';  
          br++;
        }  
      }
      else
        if (state == 0)
          *(rp++) = c;
        break;
       
    case ')':
      if (state == 2)
      {
        if (lc != '\"')
        {
          lc = ')';
          br--;
        }
      }
      else
        if (state == 0)
          *(rp++) = c;
        break;
       
    case '>':  
      if (state == 1)
      {
        lc = '>';
        state = 0;
      }
      else
        if (state == 2)
          if (!br && lc != '\"' && *(p-1)=='?')
          {
            state = 0;
            tp = tbuf;
          }  
          break;
 
    case '\"':
      if (state == 2)
      {
        if (lc == '\"')
         lc = '\0';
        else  
          if (lc != '\\')
            lc = '\"';
      }  
      else
        if (state == 0)
          *(rp++) = c;
        break;
   
    case '?':
      if (state==1 && *(p-1)=='<')
      {
        br=0;
        state=2;
        break;
      }
         
    default:
      if (state == 0)
        *(rp++) = c;  
      break;
    }
    c = *(++p);
    i++;
  }
  *rp = '\0';
  free(buf);
}
       
         
int main()
{
  char x[100];
  char *xp;
  strcpy(x, "<html>This is an <b>HTML</b> test <font size=\"1\">that</font> <br>strips<p> tags</p></html>");
  xp = x;
  StripHtmlTags(xp);
  printf("%s\n", xp);
  return 0;
}
0
jcondeCommented:
All that remains is for you to load the contents of the file into a char *, send it to StripHtmlTags and write the result back out.  Using basic C functions such as fopen, fdopen or even open will be much faster than dealing with streams and wrappers in the String class you're using!
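
The load, strip and write-back steps could be sketched roughly as follows, using the C stdio calls from C++. The name stripFile and its parameters are assumptions made for illustration; the stripper is passed in as a function pointer with the same signature as StripHtmlTags so the sketch stays self-contained. It treats the file as one C string, so any embedded NUL byte would cut the text short.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Loads inPath, runs an in-place stripper over the text, writes it to outPath.
// Hypothetical helper; `strip` is any function shaped like StripHtmlTags(char*).
bool stripFile(const std::string& inPath, const std::string& outPath,
               void (*strip)(char*))
{
    std::FILE* in = std::fopen(inPath.c_str(), "rb");
    if (!in)
        return false;

    std::vector<char> buf;
    char chunk[4096];
    std::size_t n;
    while ((n = std::fread(chunk, 1, sizeof chunk, in)) > 0)
        buf.insert(buf.end(), chunk, chunk + n);
    std::fclose(in);
    buf.push_back('\0');                 // the stripper expects a C string

    strip(&buf[0]);                      // shortens the text in place

    std::FILE* out = std::fopen(outPath.c_str(), "wb");
    if (!out)
        return false;
    bool ok = std::fputs(&buf[0], out) >= 0;
    return std::fclose(out) == 0 && ok;
}
```

A call would then look something like stripFile("page.html", "page.txt", StripHtmlTags), where both file names are placeholders.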
0
DanRollinsCommented:
zeek_ja,
Please post a comment acknowledging the help these experts have provided.  Thanks!
-- DanRollins, EE Page Editor
0
grg99Commented:
jconde's code is a very good start.

 But it's going to get tripped up by the <script> tag.  You need about 10 more lines to handle that properly.

Also, <TABLE>s are going to be unrecognizable. Another 20 lines to render these somewhat readable.

0
zeek_jaAuthor Commented:
Sorry guys, I was out of town and just got back this morning.

I had a chance to try Dexstar's method and it works great. I do realize, however, that it would get tripped up by <script>, but for my purposes that is not an issue, so I find that it is the answer to my question.

Thank you all for posting.

zeek
0