Solved

Using CString Left and Right, killing HTML

Posted on 2002-04-09
Last Modified: 2008-02-26
Is there an easy way, if any, to take text in a CInternetFile and remove ALL the HTML tags, so that just plain text is there?

Right now I am connecting to a URL, opening a CInternetFile, and reading it line by line, but all the HTML is really getting in the way and makes the data very hard to parse. Any suggestions?

Thanks,
Dan
Question by:SuperMario
8 Comments
 
LVL 3

Expert Comment

by:Crius
ID: 6929425
Well... You can use a generic parser to rip out all the tags...

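char *WorkData, *WorkOutput;
bool InTag = false;

WorkData = InData;      //InData: NUL-terminated input buffer
WorkOutput = OutData;   //OutData: output buffer at least as large as the input

while(*WorkData)
{
   if(InTag)
   {
      if(*WorkData++ == '>')
         InTag = false;
   }
   else   //Not in a tag
   {
      if(*WorkData == '<')
         InTag = true;
      else
         *WorkOutput++ = *WorkData;   //Write through the working pointer, not OutData
      WorkData++;
   }
}
*WorkOutput = '\0';   //Terminate the output string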

I haven't run this particular code, but that's the general idea... Hope it helps?
 
LVL 3

Expert Comment

by:Crius
ID: 6929430
Just another comment before I forget. This won't ignore > or < inside double quotes and the like, and it tracks state with a single boolean rather than a counter, so it can be improved if you need it to be...
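A minimal sketch of the quote handling mentioned above, written against std::string rather than raw buffers. The function name stripTagsQuoteAware is hypothetical, and the rule it applies (a '>' inside a quoted attribute value does not end the tag) is an assumption about what "ignoring" quoted brackets should mean. It is not a full HTML parser: a stray '<' in ordinary text still starts a tag.

```cpp
#include <string>

// Strip tags from `in`. Inside a tag, anything between matching quote
// characters is skipped, so a '>' in an attribute value does not end the tag.
// stripTagsQuoteAware is a hypothetical name used only for this sketch.
std::string stripTagsQuoteAware(const std::string& in)
{
    std::string out;
    bool inTag = false;
    char quote = '\0';   // active quote character inside a tag, or '\0' if none

    for (std::string::size_type i = 0; i < in.size(); ++i) {
        const char c = in[i];
        if (!inTag) {
            if (c == '<')
                inTag = true;       // start of a tag: stop copying
            else
                out += c;           // ordinary text: keep it
        } else if (quote != '\0') {
            if (c == quote)
                quote = '\0';       // closing quote: back to plain tag state
        } else if (c == '"' || c == '\'') {
            quote = c;              // opening quote inside the tag
        } else if (c == '>') {
            inTag = false;          // real end of the tag
        }
    }
    return out;
}
```

For example, passing `<a title="x > y">link</a> text` should yield `link text`, where the boolean-only version would leave ` y">` in the output.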
 
LVL 4

Accepted Solution

by:
pagladasu earned 100 total points
ID: 6929493
If the line is in a CString variable called strLine then you can do something like:

int nOpen = strLine.Find('<');
CString strExtract = strLine.Mid(nOpen + 1, strLine.Find('>') - nOpen - 1);   // the second argument of Mid is a count, not an end index
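The line above pulls out what sits between the first '<' and the first '>'. If the goal is instead to remove every tag from a line, the same find-the-brackets idea can be repeated in a loop. The sketch below is an illustration only: it uses std::string::find and std::string::erase as stand-ins for CString::Find and CString::Delete, and stripTagsFromLine is a hypothetical name.

```cpp
#include <string>

// Remove every "<...>" span from one line by repeatedly locating the next
// '<' and the '>' that follows it. stripTagsFromLine is a hypothetical name.
std::string stripTagsFromLine(std::string line)
{
    std::string::size_type open = line.find('<');
    while (open != std::string::npos) {
        const std::string::size_type close = line.find('>', open + 1);
        if (close == std::string::npos)
            break;                            // unterminated tag: leave the rest untouched
        line.erase(open, close - open + 1);   // erase takes (start, count)
        open = line.find('<', open);          // resume where the tag used to be
    }
    return line;
}
```

Because it works on one line at a time, a tag that is split across two lines read from the file would not be removed by this sketch.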

LVL 10

Expert Comment

by:makerp
ID: 6930912
Once you have all your data from the internet function, run it through this and then create a CString from the result:

char *removeHTML(char *str)
{
     bool copy = true;
     int current = 0;

     char *retval = (char*)malloc(strlen(str) + 1);

     for(int i=0;i<strlen(str);i++)
     {
          if(str[i] == '<')
               copy = false;
          if(str[i] == '>')
               copy = true;

          if(copy && str[i] != '>') retval[current++] = str[i];
     }
     retval[current++] = '\0';

     return (retval = (char*)realloc(retval,current));
}
 
LVL 3

Expert Comment

by:Crius
ID: 6931389
Makerp's solution is a simpler one to understand than mine, but I'd like to mention if you do use it, you should at a minimum store the value of strlen(str) into a variable, and use it instead in the for loop and malloc. That way you don't have to evaluate the strlen each time.

As an alternative to using strlen() in the loop in any form, you could use str[i] as the condition, which becomes false at the terminating NUL character and breaks you out of the loop.
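A sketch of makerp's function with both suggestions applied: strlen is called once for the allocation, and the loop runs on str[i]. The name removeHTMLOnePass is hypothetical, the NULL check on malloc and the dropped realloc are additions for this sketch, and the caller is assumed to free() the result.

```cpp
#include <cstdlib>
#include <cstring>

// Same tag-skipping logic as removeHTML above, but strlen is evaluated once
// and the loop stops when str[i] reaches the terminating '\0'.
// removeHTMLOnePass is a hypothetical name; the caller must free() the result.
char *removeHTMLOnePass(const char *str)
{
    char *retval = (char*)std::malloc(std::strlen(str) + 1);   // single strlen call
    if (retval == NULL)
        return NULL;

    bool copy = true;
    std::size_t current = 0;

    for (std::size_t i = 0; str[i]; i++)
    {
        if (str[i] == '<')
            copy = false;                 // entering a tag
        else if (str[i] == '>')
            copy = true;                  // leaving a tag
        else if (copy)
            retval[current++] = str[i];   // plain text outside any tag
    }
    retval[current] = '\0';
    return retval;
}
```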
 
LVL 10

Expert Comment

by:makerp
ID: 6931435
Good point. Yes, never keep calling routines like strlen in a loop condition, as each call does a linear scan through the string. Imagine a 64K string.
 
LVL 3

Author Comment

by:SuperMario
ID: 6931525
Actually, the way my program is structured allows CString.Mid to work very well. I forgot that it had another overloaded prototype that accepts 2 parameters! There was also a logic error on my part within the function that I sorted out.

Thank you all for your input... sorry it was such a simple answer. I really appreciate the help!
 
LVL 3

Expert Comment

by:Crius
ID: 6931554
Awesome to hear. :) Never a need to apologize for asking a simple question. Who knows what type of complicated answers could result. :)

char *removeHTML(char *str)
{
    char *retval = (char*)calloc(strlen(str) + 1, 1);   //Zero-filled, so the result is always terminated
    char *WorkBegin, *WorkEnd, *WorkOut;

    if(!retval)
        return NULL;
    WorkBegin = str;
    WorkOut = retval;   //Next write position in the output
    while(1)
    {
         if((WorkEnd = strchr(WorkBegin, '<')) != NULL)
         {
             memcpy(WorkOut, WorkBegin, WorkEnd-WorkBegin);   //Copy the text in front of the tag
             WorkOut += WorkEnd-WorkBegin;
         }
         else
         {
             memcpy(WorkOut, WorkBegin, strlen(WorkBegin));   //No more tags: copy the rest
             break;
         }
         WorkBegin = WorkEnd + 1;
         if((WorkEnd = strchr(WorkBegin, '>')) == NULL)
             break;       //Malformed HTML or incomplete string: drop the unterminated tag
         WorkBegin = WorkEnd + 1;   //Skip the tag itself
    }
    return (retval = (char*)realloc(retval,strlen(retval)+1));
}