
Stripping HTML from a file

Posted on 2003-11-28
Last Modified: 2010-04-01
I need to strip all HTML from a file so I can work with the remaining text. I've been trying to use a while loop as follows:

int x = outStr.Find( '<' );
while ( x != -1 )
{
      x = outStr.Find( '<' );
      int y = outStr.Find( '>' );
      outStr.Delete( x, y - x  + 1);
}

This craps out, hogging massive amounts of memory and CPU.

Anyone have any ideas?

zeek
Question by:zeek_ja
 
LVL 3

Expert Comment

by:monkesdb
ID: 9841567
ok, here goes. *cracks knuckles*...


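#include <string>
#include <sstream>
#include <iterator>
#include <algorithm>
#include <vector>
using namespace std;

string yourHTMLStuff();   // placeholder: however you load the HTML

string stripTags()
{
    string oldStr = yourHTMLStuff();
    ostringstream newStr;
    ostream_iterator<string::value_type> newInsert(newStr);
    vector<string::iterator> v;   // alternating positions of '<' and '>'
    string::iterator si = oldStr.begin();

    // resume each search where the last one stopped; searching from
    // begin() every time would find the same tag forever
    while ((si = find(si, oldStr.end(), '<')) != oldStr.end())
    {
        string::iterator close = find(si, oldStr.end(), '>');
        if (close == oldStr.end())
            break;                // unterminated tag: keep the rest as text
        v.push_back(si);
        v.push_back(close);
        si = close + 1;
    }

    // copy the text that lies between the recorded tags
    string::iterator textStart = oldStr.begin();
    for (vector<string::iterator>::size_type i = 0; i + 1 < v.size(); i += 2)
    {
        copy(textStart, v[i], newInsert);
        textStart = v[i + 1] + 1;   // step past the '>'
    }
    copy(textStart, oldStr.end(), newInsert);

    return newStr.str();  // like magic, all the tags have vanished
}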


(i think) this is of course untested.
0
 
LVL 3

Expert Comment

by:monkesdb
ID: 9841572
you'll need

#include <sstream>   // the header that declares ostringstream
0
 
LVL 19

Accepted Solution

by:
Dexstar earned 2000 total points
ID: 9841686
@zeek_ja:

> int x = outStr.Find( '<' );
> while ( x != -1 )
> {
>      x = outStr.Find( '<' );
>      int y = outStr.Find( '>' );
>      outStr.Delete( x, y - x  + 1);
> }

Try this:
     int x = outStr.Find( '<' );
     int y = outStr.Find( '>', x );
     while ( (x != -1) && (y != -1) )
     {
          outStr.Delete( x, y - x  + 1);
          x = outStr.Find( '<' );
          y = outStr.Find( '>', x );
     }

@monkesdb:  Why would you provide an answer in STL strings when he was clearly using CStrings?  Switching libraries is not always an option.

Hope That Helps,
Dex*
0

 
LVL 3

Expert Comment

by:monkesdb
ID: 9842193
I don't know CString. Nevertheless, isn't it possible to accomplish the same logic with CString that I have with STL? Create a list of indices that are the positions of the <'s and >'s, then concatenate every other pair.

however, i think i see your problems.

     int x = outStr.Find( '<' );
     int y = outStr.Find( '>', x );
     while ( (x != -1) && (y != -1) )
     {
          outStr.Delete( x, y - x  + 1);
          x = outStr.Find( '<', x );        // without the x here you start from the start each time
          y = outStr.Find( '>', x );
     }
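
For comparison, here is roughly the same find-and-delete loop written against std::string, resuming each search at the spot where the last tag was removed. This is only a sketch: std::string stands in for CString, and the name stripTagsQuick is made up for illustration. It is still the quick and dirty approach, so it knows nothing about scripts, comments or quoted attributes.

```cpp
#include <string>

// Hypothetical helper: removes every <...> span from a copy of s.
std::string stripTagsQuick(std::string s)
{
    std::string::size_type x = s.find('<');
    while (x != std::string::npos)
    {
        std::string::size_type y = s.find('>', x);
        if (y == std::string::npos)
            break;                  // unterminated tag: leave the rest alone
        s.erase(x, y - x + 1);      // drop '<' through '>' inclusive
        x = s.find('<', x);         // everything before x is already tag-free
    }
    return s;
}
```

Checking for a missing '>' before erasing is what keeps the loop from spinning on malformed input.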
0
 
LVL 22

Expert Comment

by:grg99
ID: 9842224
You can't just look for pairs of " < >" strings, as HTML has several context-dependent places where those characters can appear with other meanings, not as HTML tag delimiters.

For example, in a <script> </script>  environment, almost anything can appear, including the use of the delimiters as "less than" and "greater than" symbols.

They can also appear inside any quoted string, as well as any comment.

So the only fully correct way is to write a (relatively simple) HTML parser.  Basically, you have to look at each character in sequence and do the right thing in order to skip <script></script> sequences, quoted strings, and comments.

About 60 lines of code at most, I'd guess.

Regards,

grg99
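
A character-by-character scanner along those lines might look something like the sketch below. It is not a full HTML parser: it only skips comments, ignores '>' inside quoted attribute values, and drops everything between <script> and </script>. The names stripHtml and matchesAt are invented for this example, and a bare '<' in ordinary text (as in "a < b") will still be mistaken for the start of a tag.

```cpp
#include <cctype>
#include <string>

// Case-insensitive check: does s contain the lowercase `word` at position pos?
static bool matchesAt(const std::string& s, std::string::size_type pos,
                      const char* word)
{
    for (; *word; ++word, ++pos)
        if (pos >= s.size() ||
            std::tolower(static_cast<unsigned char>(s[pos])) != *word)
            return false;
    return true;
}

// Hypothetical helper: copies text, skipping tags, comments and script bodies.
std::string stripHtml(const std::string& in)
{
    std::string out;
    std::string::size_type i = 0, n = in.size();
    while (i < n)
    {
        if (in[i] != '<') { out += in[i++]; continue; }   // ordinary text

        if (in.compare(i, 4, "<!--") == 0)                // comment: skip to "-->"
        {
            std::string::size_type end = in.find("-->", i + 4);
            i = (end == std::string::npos) ? n : end + 3;
            continue;
        }

        // prefix match only, so a tag such as <scripted> would also count
        bool isScript = matchesAt(in, i + 1, "script");

        // walk to the closing '>', ignoring any '>' inside quoted attributes
        char quote = 0;
        for (++i; i < n; ++i)
        {
            char c = in[i];
            if (quote) { if (c == quote) quote = 0; }
            else if (c == '"' || c == '\'') quote = c;
            else if (c == '>') { ++i; break; }
        }

        if (isScript)                                     // drop the script body
            while (i < n && !matchesAt(in, i, "</script"))
                ++i;
        // the closing </script> is then skipped as an ordinary tag
    }
    return out;
}
```

For instance, stripHtml("<a title=\"x>y\">link</a>") yields "link", where a plain search for '>' would cut the tag short at the quoted character.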

0
 
LVL 19

Expert Comment

by:Dexstar
ID: 9842703
@monkesdb:  Yeah, he could port your algorithm, but maybe he doesn't know STL string...  Anyway, I don't think you need that X in there because it is okay to start from the beginning each time because you remove the first pair of <>'s each time.

@grg99:  Maybe that's true if he wants it to be 100% right, but if he just wants something quick and dirty and that will mostly work, then I think his algorithm should be fine.

D*
0
 
LVL 86

Expert Comment

by:jkr
ID: 9843086
Are you on Win32? If so, copying the HTML to the clipboard and getting it back with CF_TEXT should do it (like copying from a web page and then pasting into Notepad).
0
 
LVL 86

Expert Comment

by:jkr
ID: 9843373
BTW, there is also no need to reinvent the wheel - there are lots of HTML parsers out there, e.g. 'libwww' from the w3c http://www.w3.org/Library/ or http://www.odin-consulting.com/OPP/
0
 
LVL 3

Expert Comment

by:gkatz
ID: 9843628
Switch programming languages to one with regular expressions, such as Perl.  You can write your program in 5 to 10 lines and it will run fairly quickly.  Perl can also be run inside of C++ if you have the rest of the program written in that language.  C++ is a great language for some things, but why not use a language designed for tackling problems such as your own?

Remember, to someone with a hammer everything looks like a nail.  

don't be afraid to try a new language


-gkatz
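
For readers who would rather stay in C++, the regular-expression idea can also be sketched with std::regex. Note the assumptions: this needs a C++11 compiler, the function name stripTagsRegex is made up, and the pattern is just as naive as the find-and-delete loops above (no handling of scripts, comments or quoted '>').

```cpp
#include <regex>
#include <string>

// Hypothetical helper: replaces every <...> span with nothing.
std::string stripTagsRegex(const std::string& html)
{
    static const std::regex tag("<[^>]*>");   // '<', any run of non-'>', then '>'
    return std::regex_replace(html, tag, "");
}
```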
0
 
LVL 7

Expert Comment

by:jconde
ID: 9844424
Will plain C code help you? The only difference is that you'll be working with char * instead of complex string/streaming classes.

My code works perfectly well and it's used in major open-source applications. It's quite simple, by the way.

Just drop a line if you want it or not!
0
 
LVL 7

Expert Comment

by:jconde
ID: 9844457
Actually, here's the function and a simple demo:

Even though my code looks much longer and maybe more complex than previous suggestions, it's really light on CPU and works very fast.

#include <stdio.h>
#include <stdlib.h>   /* free */
#include <string.h>   /* strlen, strdup, strcpy */
         
void StripHtmlTags(char *rbuf)
{
  char *tbuf, *buf, *p, *tp, *rp, c, lc;
  int br, i=0, state=0, len;
  len = strlen(rbuf);
  buf = strdup(rbuf);
  c = *buf;  
  lc = '\0';
  p = buf;
  rp = rbuf;
  br = 0;
  tp=NULL;
  tbuf=NULL;  
  while(i<len)
  {
    switch (c)
    {
    case '<':
      if (state == 0)
      {
        lc = '<';
        state = 1;  
      }
      break;
       
    case '(':
      if (state == 2)
      {
        if (lc != '\"')
        {  
          lc = '(';  
          br++;
        }  
      }
      else
        if (state == 0)
          *(rp++) = c;
        break;
       
    case ')':
      if (state == 2)
      {
        if (lc != '\"')
        {
          lc = ')';
          br--;
        }
      }
      else
        if (state == 0)
          *(rp++) = c;
        break;
       
    case '>':  
      if (state == 1)
      {
        lc = '>';
        state = 0;
      }
      else
        if (state == 2)
          if (!br && lc != '\"' && *(p-1)=='?')
          {
            state = 0;
            tp = tbuf;
          }  
          break;
 
    case '\"':
      if (state == 2)
      {
        if (lc == '\"')
         lc = '\0';
        else  
          if (lc != '\\')
            lc = '\"';
      }  
      else
        if (state == 0)
          *(rp++) = c;
        break;
   
    case '?':
      if (state==1 && *(p-1)=='<')
      {
        br=0;
        state=2;
        break;
      }
         
    default:
      if (state == 0)
        *(rp++) = c;  
      break;
    }
    c = *(++p);
    i++;
  }
  *rp = '\0';
  free(buf);
}
       
         
int main()
{
  char x[100];
  char *xp;
  strcpy(x, "<html>This is an <b>HTML</b> test <font size=\"1\">that</font> <br>strips<p> tags</p></html>");
  xp = x;
  StripHtmlTags(xp);
  printf("%s\n", xp);
  return 0;
}
0
 
LVL 7

Expert Comment

by:jconde
ID: 9844466
All that's left is for you to load the contents of the file into a char *, send it to StripHtmlTags, and write the result back out.  Using basic C functions such as fopen, fdopen or even open will be much faster than dealing with streams and wrappers in the String class you're using!
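
The load, strip and write-back steps could be wired together roughly as follows. This is a sketch under a few assumptions: stripFile is a made-up name, the stripper is passed in as a function pointer so that something with the shape of StripHtmlTags (in place, on a NUL-terminated buffer) can be plugged in, and a file containing embedded NUL bytes would be cut off at the first one.

```cpp
#include <cstdio>
#include <cstring>
#include <vector>

// Loads inPath, runs `strip` in place on the NUL-terminated contents,
// and writes the result to outPath. Returns false on any I/O failure.
bool stripFile(const char* inPath, const char* outPath, void (*strip)(char*))
{
    std::FILE* in = std::fopen(inPath, "rb");
    if (!in) return false;

    std::vector<char> text;
    char chunk[4096];
    std::size_t got;
    while ((got = std::fread(chunk, 1, sizeof chunk, in)) > 0)
        text.insert(text.end(), chunk, chunk + got);
    std::fclose(in);
    text.push_back('\0');               // the stripper expects a C string

    strip(&text[0]);                    // e.g. a function like StripHtmlTags

    std::FILE* out = std::fopen(outPath, "wb");
    if (!out) return false;
    std::size_t len = std::strlen(&text[0]);
    bool ok = std::fwrite(&text[0], 1, len, out) == len;
    return std::fclose(out) == 0 && ok;
}
```

A call would then look like stripFile("page.html", "page.txt", StripHtmlTags), with both file names being placeholders.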
0
 
LVL 49

Expert Comment

by:DanRollins
ID: 9865003
zeek_ja,
Please post a comment acknowledging the help these experts have provided.  Thanks!
-- DanRollins, EE Page Editor
0
 
LVL 22

Expert Comment

by:grg99
ID: 9866581
jconde's code is a very good start.

But it's going to get tripped up by the <script> tag.  You need about 10 more lines to handle that properly.

Also, <TABLE>s are going to be unrecognizable. Another 20 lines to render these somewhat readable.

0
 

Author Comment

by:zeek_ja
ID: 9867321
Sorry guys, I was out of town and just got back this morning.

I had a chance to try Dexstar's method and it works great. I do realize that it would get tripped up by <script>, but for my purposes that is not an issue, so I consider it the answer to my question.

Thank you all for posting.

zeek
0
