Solved

Stripping HTML from a file

Posted on 2003-11-28
Last Modified: 2010-04-01
I need to strip all HTML from a file so I can work with the remaining text. I've been trying to use a while loop as follows:

int x = outStr.Find( '<' );
while ( x != -1 )
{
      x = outStr.Find( '<' );
      int y = outStr.Find( '>' );
      outStr.Delete( x, y - x  + 1);
}

This craps out with massive memory and CPU hogging.

Anyone have any ideas?

zeek
Question by:zeek_ja
 
LVL 3

Expert Comment

by:monkesdb
ID: 9841567
ok, here goes. *cracks knuckles*...


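#include <algorithm>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

using namespace std;

string stripTags(const string& oldStr)
{
    ostringstream newStr;
    ostream_iterator<char> newInsert(newStr);
    vector<string::const_iterator> v;   // alternating tag start / one-past-tag-end
    string::const_iterator si = oldStr.begin();

    while (true)
    {
        // resume each search where the previous tag ended
        string::const_iterator lt = find(si, oldStr.end(), '<');
        string::const_iterator gt = find(lt, oldStr.end(), '>');
        if (gt == oldStr.end())
            break;                      // no complete tag left
        v.push_back(lt);
        v.push_back(gt + 1);
        si = gt + 1;
    }

    // copy the text in front of each tag, then whatever follows the last tag
    string::const_iterator from = oldStr.begin();
    for (vector<string::const_iterator>::size_type i = 0; i < v.size(); i += 2)
    {
        copy(from, v[i], newInsert);
        from = v[i + 1];
    }
    copy(from, oldStr.end(), newInsert);

    return newStr.str();  // like magic, all the tags have vanished
}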


(i think) this is of course untested.
0
 
LVL 3

Expert Comment

by:monkesdb
ID: 9841572
you'll need

#include <sstream>    // std::ostringstream is declared here
#include <iterator>   // for std::ostream_iterator
0
 
LVL 19

Accepted Solution

by:
Dexstar earned 500 total points
ID: 9841686
@zeek_ja:

> int x = outStr.Find( '<' );
> while ( x != -1 )
> {
>      x = outStr.Find( '<' );
>      int y = outStr.Find( '>' );
>      outStr.Delete( x, y - x  + 1);
> }

Try this:
     int x = outStr.Find( '<' );
     int y = ( x != -1 ) ? outStr.Find( '>', x ) : -1;   // never pass -1 as the start index
     while ( (x != -1) && (y != -1) )
     {
          outStr.Delete( x, y - x  + 1);
          x = outStr.Find( '<' );
          y = ( x != -1 ) ? outStr.Find( '>', x ) : -1;
     }

@monkesdb:  Why would you provide an answer in STL strings when he was clearly using CStrings?  Switching libraries is not always an option.

Hope That Helps,
Dex*
0
 
LVL 3

Expert Comment

by:monkesdb
ID: 9842193
I don't know CString. Nevertheless, is it not possible to accomplish the same logic using CString that I have using STL? Create a list of indices which are the positions of the <'s and >'s, then concatenate every other pair.

however, i think i see your problems.

     int x = outStr.Find( '<' );
     int y = ( x != -1 ) ? outStr.Find( '>', x ) : -1;   // never pass -1 as the start index
     while ( (x != -1) && (y != -1) )
     {
          outStr.Delete( x, y - x  + 1);
          x = outStr.Find( '<', x );        // without the x here you start from the start each time
          y = ( x != -1 ) ? outStr.Find( '>', x ) : -1;
     }
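
For anyone following along without MFC, the loop above can be sketched with std::string, whose find and erase play roughly the roles of CString::Find and CString::Delete. The function name stripTagsInPlace is made up for this example, and std::string::npos stands in for -1. Resuming the search at x should be safe because everything before x has already been scanned and contains no '<'.

```cpp
#include <string>

// Hypothetical helper: removes every "<...>" span from s in place.
// Like the CString loop, it knows nothing about scripts, comments or quoting.
void stripTagsInPlace(std::string& s)
{
    std::string::size_type x = s.find('<');
    while (x != std::string::npos)
    {
        std::string::size_type y = s.find('>', x);  // look for '>' only after the '<'
        if (y == std::string::npos)
            break;                                   // unterminated tag: leave the rest alone
        s.erase(x, y - x + 1);                       // drop '<' through '>' inclusive
        x = s.find('<', x);                          // text before x is already clean
    }
}
```

A stray '>' that appears before any '<' is left in the text, because the search for '>' always starts at the matching '<'.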
0
 
LVL 22

Expert Comment

by:grg99
ID: 9842224
You can't just look for pairs of " < >" strings, as HTML has several context-dependent places where those characters can appear with other meanings, not as HTML tag delimiters.

For example, in a <script> </script>  environment, almost anything can appear, including the use of the delimiters as "less than" and "greater than" symbols.

They can also appear inside any quoted string, as well as any comment.

So the only fully correct way is to write a (relatively simple) HTML parser.  Basically you have to look at each character in sequence and do the right thing, in order to skip <script></script> sequences, quoted strings, and comments.

About 60 lines of code at most, I'd guess.

Regards,

grg99
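
A rough sketch of that character-by-character idea, assuming std::string input. The names stripHtml and matchesAt are invented for this example, and it is nowhere near a complete HTML parser: it only skips comments, quoted attribute values inside tags, and everything between <script ...> and </script>. It also treats any tag whose name merely starts with "script" as a script block, which a real parser would not.

```cpp
#include <cctype>
#include <string>

// Case-insensitive check that lowercase `word` occurs in `s` at position `pos`.
static bool matchesAt(const std::string& s, std::string::size_type pos, const char* word)
{
    for (; *word != '\0'; ++word, ++pos)
    {
        if (pos >= s.size() ||
            std::tolower(static_cast<unsigned char>(s[pos])) != *word)
            return false;
    }
    return true;
}

// Hypothetical helper: returns a copy of `in` with tags, comments and scripts removed.
std::string stripHtml(const std::string& in)
{
    std::string out;
    std::string::size_type i = 0;
    while (i < in.size())
    {
        if (in[i] != '<')
        {
            out += in[i++];                        // ordinary text: keep it
            continue;
        }
        if (matchesAt(in, i, "<!--"))
        {
            // comment: skip to just past the closing "-->"
            std::string::size_type end = in.find("-->", i + 4);
            i = (end == std::string::npos) ? in.size() : end + 3;
            continue;
        }
        bool isScript = matchesAt(in, i, "<script");
        // skip the tag itself; a '>' inside a quoted attribute does not end it
        char quote = '\0';
        for (++i; i < in.size(); ++i)
        {
            char c = in[i];
            if (quote != '\0') { if (c == quote) quote = '\0'; }
            else if (c == '"' || c == '\'') quote = c;
            else if (c == '>') { ++i; break; }
        }
        if (isScript)
        {
            // discard the script body up to and including the closing tag
            while (i < in.size() && !matchesAt(in, i, "</script"))
                ++i;
            std::string::size_type end = in.find('>', i);
            i = (end == std::string::npos) ? in.size() : end + 1;
        }
    }
    return out;
}
```

Under these assumptions a call such as stripHtml("<p>a</p><script>if (a < b) x();</script>b") keeps only the text outside the tags and the script body.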

0
 
LVL 19

Expert Comment

by:Dexstar
ID: 9842703
@monkesdb:  Yeah, he could port your algorithm, but maybe he doesn't know STL string...  Anyway, I don't think you need that X in there because it is okay to start from the beginning each time because you remove the first pair of <>'s each time.

@grg99:  Maybe that's true if he wants it to be 100% right, but if he just wants something quick and dirty and that will mostly work, then I think his algorithm should be fine.

D*
0
 
LVL 86

Expert Comment

by:jkr
ID: 9843086
Are you on Win32? If so, copying the HTML to the clipboard and getting it back with CF_TEXT should do it (like copying from a web page and then pasting into Notepad).
0
LVL 86

Expert Comment

by:jkr
ID: 9843373
BTW, there is also no need to reinvent the wheel - there are lots of HTML parsers out there, e.g. 'libwww' from the w3c http://www.w3.org/Library/ or http://www.odin-consulting.com/OPP/
0
 
LVL 3

Expert Comment

by:gkatz
ID: 9843628
Switch programming languages to one with regular expressions, such as Perl.  You can write your program in 5 to 10 lines and it will run fairly quickly.  Perl can also be run inside of C++ if you have the rest of the program written in that language.  C++ is a great language for some things, but why not use a language designed for tackling problems such as your own?

Remember, to someone with a hammer everything looks like a nail.  

don't be afraid to try a new language
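
For readers who would rather stay in C++: since C++11 (which did not exist when this thread was written) the standard library ships <regex>, so the regular-expression approach can be sketched without switching languages. The pattern and the name stripTagsRegex are assumptions made for illustration, and the same caveats about scripts, comments and quoted '>' characters apply.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: deletes anything that looks like <...>.
std::string stripTagsRegex(const std::string& html)
{
    // '<', then any run of characters other than '>', then '>'
    static const std::regex tag("<[^>]*>");
    return std::regex_replace(html, tag, "");
}
```

The regex object is built once and reused, since constructing a std::regex is usually the expensive part.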


-gkatz
0
 
LVL 7

Expert Comment

by:jconde
ID: 9844424
Will plain C code help you? The only difference is you'll be working with char * instead of complex string/streaming classes.

My code works perfectly well and it's used in major open-source applications ... it's quite simple, btw.

Just drop a line to say whether you want it or not!
0
 
LVL 7

Expert Comment

by:jconde
ID: 9844457
Actually, here's the function and a simple demo:

(Even though my code looks much longer and maybe more complex than previous suggestions, it's really light on CPU and works very fast.)

#include <stdio.h>
#include <stdlib.h>   /* free */
#include <string.h>   /* strlen, strcpy; strdup is POSIX, not ISO C */
         
void StripHtmlTags(char *rbuf)
{
  char *tbuf, *buf, *p, *tp, *rp, c, lc;
  int br, i=0, state=0, len;
  len = strlen(rbuf);
  buf = strdup(rbuf);
  c = *buf;  
  lc = '\0';
  p = buf;
  rp = rbuf;
  br = 0;
  tp=NULL;
  tbuf=NULL;  
  while(i<len)
  {
    switch (c)
    {
    case '<':
      if (state == 0)
      {
        lc = '<';
        state = 1;  
      }
      break;
       
    case '(':
      if (state == 2)
      {
        if (lc != '\"')
        {  
          lc = '(';  
          br++;
        }  
      }
      else
        if (state == 0)
          *(rp++) = c;
        break;
       
    case ')':
      if (state == 2)
      {
        if (lc != '\"')
        {
          lc = ')';
          br--;
        }
      }
      else
        if (state == 0)
          *(rp++) = c;
        break;
       
    case '>':  
      if (state == 1)
      {
        lc = '>';
        state = 0;
      }
      else
        if (state == 2)
          if (!br && lc != '\"' && *(p-1)=='?')
          {
            state = 0;
            tp = tbuf;
          }  
          break;
 
    case '\"':
      if (state == 2)
      {
        if (lc == '\"')
         lc = '\0';
        else  
          if (lc != '\\')
            lc = '\"';
      }  
      else
        if (state == 0)
          *(rp++) = c;
        break;
   
    case '?':
      if (state==1 && *(p-1)=='<')
      {
        br=0;
        state=2;
        break;
      }
         
    default:
      if (state == 0)
        *(rp++) = c;  
      break;
    }
    c = *(++p);
    i++;
  }
  *rp = '\0';
  free(buf);
}
       
         
int main()
{
  char x[100];
  char *xp;
  strcpy(x, "<html>This is an <b>HTML</b> test <font size=\"1\">that</font> <br>strips<p> tags</p></html>");
  xp = x;
  StripHtmlTags(xp);
  printf("%s\n", xp);
  return 0;
}
0
 
LVL 7

Expert Comment

by:jconde
ID: 9844466
All that remains is for you to load the contents of the file into a char *, send it to StripHtmlTags and write the result back out.  Using basic C functions such as FILE * fopen, fdopen or even open will be much faster than dealing with streams and wrappers in the String class you're using!
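
The load and write-back steps might look roughly like this with the C stdio functions mentioned above. The helper names readWholeFile and writeWholeFile are invented for this sketch, and std::string is used only as a growable buffer.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: reads the whole file at `path` into `out`; false on failure.
bool readWholeFile(const char* path, std::string& out)
{
    std::FILE* f = std::fopen(path, "rb");
    if (!f)
        return false;
    out.clear();
    char chunk[4096];
    std::size_t n;
    while ((n = std::fread(chunk, 1, sizeof chunk, f)) > 0)
        out.append(chunk, n);               // keep appending until EOF or error
    bool ok = !std::ferror(f);
    std::fclose(f);
    return ok;
}

// Hypothetical helper: writes `text` to `path`, replacing any existing file.
bool writeWholeFile(const char* path, const std::string& text)
{
    std::FILE* f = std::fopen(path, "wb");
    if (!f)
        return false;
    bool ok = std::fwrite(text.data(), 1, text.size(), f) == text.size();
    return (std::fclose(f) == 0) && ok;     // a failed close can also lose data
}
```

StripHtmlTags needs a writable, NUL-terminated buffer, so the loaded text would have to be copied into one (for example a std::vector<char> with a trailing '\0') before the call. The stripped result is never longer than the input, so the same buffer can hold it.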
0
 
LVL 49

Expert Comment

by:DanRollins
ID: 9865003
zeek_ja,
Please post a comment acknowledging the help these experts have provided.  Thanks!
-- DanRollins, EE Page Editor
0
 
LVL 22

Expert Comment

by:grg99
ID: 9866581
jconde's code is a very good start.

 But it's going to get tripped up by the <script> tag.  You need about 10 more lines to handle that properly.

Also <TABLE>s are going to be unrecognizable. Another 20 lines to render these somewhat readable.

0
 

Author Comment

by:zeek_ja
ID: 9867321
Sorry guys, I was out of town... just got back this morning.

I had a chance to try Dexstar's method and it works great. I do realize, however, that it would get tripped up by <script>, but for my purposes that is not an issue, and I therefore find that it is the answer to my question.

Thank you all for posting.

zeek
0
