Solved

Using CString Left and Right, killing HTML

Posted on 2002-04-09
Last Modified: 2008-02-26
Is there an easy way to take the text read from a CInternetFile and remove ALL the HTML tags, so that only plain text is left?

Right now I am connecting to a URL, opening a CInternetFile and reading it line by line, but all the HTML is really getting in the way, making it very hard to parse the data. Any suggestions?

Thanks,
Dan
Question by:SuperMario

Expert Comment

by:Crius
ID: 6929425
Well... You can use a generic parser to rip out all the tags...

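char *WorkData, *WorkOutput;
bool InTag = false;

WorkData = InData;      //InData: NUL-terminated HTML text
WorkOutput = OutData;   //OutData: buffer at least as large as InData

while(*WorkData)
{
   if(InTag)
   {
      if(*WorkData++ == '>')
         InTag = false;
   }
   else   //Not in a tag
   {
      if(*WorkData == '<')
         InTag = true;
      else
         *WorkOutput++ = *WorkData;   //Write through the working pointer so OutData still marks the start
      WorkData++;
   }
}
*WorkOutput = '\0';   //Terminate the stripped text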

I haven't run this particular code, but that's the general idea... Hope it helps?

Expert Comment

by:Crius
ID: 6929430
Just another comment before I forget. This doesn't treat > or < inside double-quoted attribute values specially, and it tracks state with a single boolean rather than a nesting counter, so it can be improved if you need it to be...
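One way to cover the quoted-attribute case mentioned above is to remember which quote character is open while inside a tag. The following is only a minimal sketch under assumptions: the function name stripTagsQuoteAware is hypothetical, it uses std::string rather than raw char buffers, and it still does not handle comments, scripts, or nested '<'.

```cpp
#include <string>

// Hypothetical helper (not from the thread): strips <...> tags and treats
// a '>' that appears inside a quoted attribute value as part of the tag.
std::string stripTagsQuoteAware(const std::string& in)
{
    std::string out;
    bool inTag = false;
    char quote = '\0';          // quote character currently open inside a tag, or '\0'

    for (char c : in)
    {
        if (!inTag)
        {
            if (c == '<')
                inTag = true;   // start of a tag: stop copying
            else
                out += c;       // plain text: keep it
        }
        else if (quote != '\0')
        {
            if (c == quote)
                quote = '\0';   // closing quote ends the attribute value
        }
        else if (c == '"' || c == '\'')
        {
            quote = c;          // opening quote: ignore '>' until it closes
        }
        else if (c == '>')
        {
            inTag = false;      // end of the tag: resume copying
        }
    }
    return out;
}
```

With this version, a line such as <a title="x>y">link</a> yields just the word link, where the boolean-only loop would also emit y">.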

Accepted Solution

by:pagladasu (earned 300 total points)
ID: 6929493
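If the line is in a CString variable called strLine, then you can do something like the following to pull out the text between the first '<' and the first '>'. Note that the second argument of Mid is a character count, not an end index:

int nOpen = strLine.Find('<');
CString strExtract = strLine.Mid(nOpen + 1, strLine.Find('>') - nOpen - 1);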
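The same find-the-brackets idea can be repeated until no tags remain, which gets closer to the "remove ALL the HTML tags" goal of the question. This is a rough sketch, not the accepted answer's code: removeAllTags is a hypothetical name, and std::string stands in for CString so the example compiles without MFC. With CString, the corresponding calls would presumably be Find and Delete.

```cpp
#include <string>

// Hypothetical helper: removes every <...> span from a line by repeating
// the find-'<' / find-'>' step. std::string stands in for CString here.
std::string removeAllTags(std::string line)
{
    std::string::size_type open = 0;
    while ((open = line.find('<', open)) != std::string::npos)
    {
        std::string::size_type close = line.find('>', open + 1);
        if (close == std::string::npos)
            break;                           // unterminated tag: leave the rest alone
        line.erase(open, close - open + 1);  // erase takes a start index and a count
    }
    return line;
}
```

Because the search resumes at the position where the last tag was erased, adjacent tags such as </b></td> are removed one after another without rescanning the start of the line.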


Expert Comment

by:makerp
ID: 6930912
Once you have all your data from the internet function, run it through this and then create a CString from the result. The caller has to free() the returned buffer:

char *removeHTML(char *str)
{
     bool copy = true;
     int current = 0;

     char *retval = (char*)malloc(strlen(str) + 1);

     for(int i=0;i<strlen(str);i++)
     {
          if(str[i] == '<')
               copy = false;
          if(str[i] == '>')
               copy = true;

          if(copy && str[i] != '>') retval[current++] = str[i];
     }
     retval[current++] = '\0';

     return (retval = (char*)realloc(retval,current));
}

Expert Comment

by:Crius
ID: 6931389
Makerp's solution is a simpler one to understand than mine, but I'd like to mention if you do use it, you should at a minimum store the value of strlen(str) into a variable, and use it instead in the for loop and malloc. That way you don't have to evaluate the strlen each time.

As an alternative to using strlen() in the loop in any form, you could use str[i] as the loop condition. It evaluates to false at the terminating NUL character, which breaks you out of the loop.
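Both suggestions can be sketched together as follows. This is an illustrative variant, not makerp's posted code: the name removeHTMLOnePass is made up, strlen is evaluated once for the allocation, and the loop stops on str[i] instead of comparing against the length.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical variant of the tag stripper: strlen is called once, and the
// loop condition str[i] ends the scan at the terminating NUL character.
char *removeHTMLOnePass(const char *str)
{
    char *retval = (char*)std::malloc(std::strlen(str) + 1);
    if (retval == NULL)
        return NULL;                  // allocation failed

    bool copy = true;                 // false while inside <...>
    std::size_t current = 0;          // next write position in retval
    for (std::size_t i = 0; str[i]; i++)
    {
        if (str[i] == '<')
            copy = false;
        else if (str[i] == '>')
            copy = true;
        else if (copy)
            retval[current++] = str[i];
    }
    retval[current] = '\0';
    return retval;                    // caller releases it with free()
}
```

The output can never be longer than the input, so the single allocation of strlen(str) + 1 bytes is always large enough.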

Expert Comment

by:makerp
ID: 6931435
Good point. Yes, never keep calling routines like strlen in a loop condition: each call does a linear scan of the whole string, so the loop as a whole becomes quadratic. Imagine a 64k string.

Author Comment

by:SuperMario
ID: 6931525
Actually, the way my program is structured allows CString::Mid to work very well. I forgot that it has another overload that accepts 2 parameters! There was also a logic error on my part within the function that I sorted out.

Thank you all for your input... sorry it was such a simple answer. I really appreciate the help!

Expert Comment

by:Crius
ID: 6931554
Awesome to hear. :) Never a need to apologize for asking a simple question. Who knows what type of complicated answers could result. :)

#include <stdlib.h>
#include <string.h>

char *removeHTML(char *str)
{
    //calloc zero-fills the buffer, so the output is always NUL-terminated
    char *retval = (char*)calloc(strlen(str) + 1, 1);
    char *WorkBegin, *WorkEnd, *WorkOut;

    if(!retval)
        return NULL;
    WorkBegin = str;
    WorkOut = retval;
    while(1)
    {
         if((WorkEnd = strchr(WorkBegin, '<')) == NULL)
         {
             strcpy(WorkOut, WorkBegin);   //No more tags, copy the rest
             break;
         }
         //Append the text before the tag and advance the output pointer
         memcpy(WorkOut, WorkBegin, WorkEnd-WorkBegin);
         WorkOut += WorkEnd-WorkBegin;
         //Skip the tag itself, nothing between < and > is copied
         if((WorkEnd = strchr(WorkEnd + 1, '>')) == NULL)
             break;       //Malformed HTML or incomplete string
         WorkBegin = WorkEnd + 1;
    }
    return retval;
}

Question has a verified solution.
