Solved

Using CString Left and Right, killing HTML

Posted on 2002-04-09
Last Modified: 2008-02-26
Is there an easy way to take the text in a CInternetFile and remove ALL the HTML tags, so that only plain text is left?

Right now I am connecting to a URL, opening a CInternetFile and reading it line by line, but the HTML markup gets in the way and makes the data very hard to parse. Any suggestions?

Thanks,
Dan
Question by:SuperMario
8 Comments
 
LVL 3

Expert Comment

by:Crius
ID: 6929425
Well... You can use a generic parser to rip out all the tags...

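//InData is the HTML source; OutData is a buffer at least as large as InData
char *WorkData, *WorkOutput;
bool InTag = false;

WorkData = InData;
WorkOutput = OutData;

while(*WorkData)
{
   if(InTag)
   {
      if(*WorkData++ == '>')
         InTag = false;
   }
   else   //Not intag
   {
      if(*WorkData == '<')
         InTag = true;
      else
         *WorkOutput++ = *WorkData;   //write through the working pointer, not OutData
      WorkData++;
   }
}
*WorkOutput = '\0';   //terminate the output string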

I haven't run this particular code, but that's the general idea... Hope it helps?
 
LVL 3

Expert Comment

by:Crius
ID: 6929430
Just another comment before I forget. This won't ignore a > or < that appears inside double quotes (in an attribute value, for example), and it tracks state with a single boolean rather than a counter, so it can be improved if you need it to be...
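
The quoting caveat could be handled along these lines. This is only a sketch, not code from the thread: stripTagsQuoteAware is a hypothetical helper, it uses std::string rather than raw buffers, and it assumes attribute values are delimited by matching single or double quotes. It does not add the counter mentioned above.

```cpp
#include <string>

// Hypothetical helper: removes tags, but does not treat a '>' that sits
// inside a quoted attribute value as the end of the tag.
std::string stripTagsQuoteAware(const std::string &in)
{
    std::string out;
    bool inTag = false;
    char quote = '\0';   // the open quote character while inside a quoted value, else 0

    for (char c : in)
    {
        if (!inTag)
        {
            if (c == '<')
                inTag = true;
            else
                out += c;              // ordinary text is copied through
        }
        else if (quote != '\0')
        {
            if (c == quote)
                quote = '\0';          // closing quote ends the attribute value
        }
        else if (c == '"' || c == '\'')
            quote = c;                 // opening quote inside a tag
        else if (c == '>')
            inTag = false;             // unquoted '>' closes the tag
    }
    return out;
}
```

Quotes in ordinary text outside a tag are copied unchanged; only quotes seen while inside a tag affect the state.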
 
LVL 4

Accepted Solution

by:
pagladasu earned 100 total points
ID: 6929493
If the line is in a CString variable called strLine then you can do something like:

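// Mid takes (first index, count), so the second argument must be a length, not an end index
CString strExtract = strLine.Mid(strLine.Find('<') + 1, strLine.Find('>') - strLine.Find('<') - 1);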
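
For anyone without MFC at hand, the same index arithmetic can be tried with std::string, whose substr(pos, count) takes a start index and a length in the same way as CString::Mid. The sketch below is an illustration under that assumption; firstTagContents is a hypothetical name, and unlike the one-liner it checks for a missing '<' or '>' before extracting.

```cpp
#include <string>

// Hypothetical helper: returns the text between the first '<' and the
// next '>' on the line, or an empty string if either is missing.
std::string firstTagContents(const std::string &line)
{
    std::string::size_type open = line.find('<');
    if (open == std::string::npos)
        return std::string();

    std::string::size_type close = line.find('>', open + 1);
    if (close == std::string::npos)
        return std::string();

    // The count is the distance between the brackets, excluding both of them.
    return line.substr(open + 1, close - open - 1);
}
```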

 
LVL 10

Expert Comment

by:makerp
ID: 6930912
Once you have all your data from the internet function, run it through this and then create a CString from the result (the caller must free() the returned buffer):

char *removeHTML(char *str)
{
     bool copy = true;
     int current = 0;

     char *retval = (char*)malloc(strlen(str) + 1);

     for(int i=0;i<strlen(str);i++)
     {
          if(str[i] == '<')
               copy = false;
          if(str[i] == '>')
               copy = true;

          if(copy && str[i] != '>') retval[current++] = str[i];
     }
     retval[current++] = '\0';

     return (retval = (char*)realloc(retval,current));
}
 
LVL 3

Expert Comment

by:Crius
ID: 6931389
Makerp's solution is simpler to understand than mine, but if you do use it, you should at a minimum store the value of strlen(str) in a variable and use that in both the malloc and the for loop. That way strlen isn't re-evaluated on every iteration.

As an alternative to using strlen() for the loop at all, you could use str[i] as the loop condition; it evaluates to false at the terminating null character, which breaks you out of the loop.
 
LVL 10

Expert Comment

by:makerp
ID: 6931435
Good point. Yes, never call routines like strlen repeatedly in a loop condition, as each call does a linear scan through the string. Imagine a 64k string.
 
LVL 3

Author Comment

by:SuperMario
ID: 6931525
Actually, the way my program is structured allows CString::Mid to work very well. I forgot that it had another overload that accepts 2 parameters! There was also a logic error on my part within the function, which I have sorted out.

Thank you all for your input... sorry it was such a simple answer. I really appreciate the help!
 
LVL 3

Expert Comment

by:Crius
ID: 6931554
Awesome to hear. :) Never a need to apologize for asking a simple question. Who knows what type of complicated answers could result. :)

char *removeHTML(char *str)
{
    //calloc zero-fills the buffer, so the result is always terminated
    char *retval = (char*)calloc(strlen(str) + 1, 1);
    char *Out = retval;   //next write position in the output
    char *WorkBegin, *WorkEnd;

    if(!retval)
        return NULL;
    WorkBegin = str;
    while(1)
    {
         if((WorkEnd = strchr(WorkBegin, '<')) != NULL)
         {
             //Copy the text before the tag and advance the output
             memcpy(Out, WorkBegin, WorkEnd-WorkBegin);
             Out += WorkEnd-WorkBegin;
         }
         else
         {
             strcpy(Out, WorkBegin);   //No more tags: copy the rest
             break;
         }
         WorkBegin = WorkEnd + 1;
         if((WorkEnd = strchr(WorkBegin, '>')) == NULL)
         {
             //Malformed HTML or incomplete string: keep the rest as text
             strcpy(Out, WorkBegin);
             break;
         }
         WorkBegin = WorkEnd + 1;   //Skip the tag contents entirely
    }
    return (char*)realloc(retval, strlen(retval)+1);
}