Solved

Using CString Left and Right, killing HTML

Posted on 2002-04-09
Last Modified: 2008-02-26
Is there an easy way to take the text from a CInternetFile and remove ALL the HTML tags, leaving just the plain text?

Right now I am connecting to a URL, opening a CInternetFile and reading it line by line, but all the HTML is really getting in the way and makes the data very hard to parse. Any suggestions?

Thanks,
Dan
Question by:SuperMario
8 Comments
 
LVL 3

Expert Comment

by:Crius
ID: 6929425
Well... You can use a generic parser to rip out all the tags...

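char *WorkData, *WorkOutput;
bool InTag = false;

WorkData = InData;      //Source buffer (NUL-terminated HTML)
WorkOutput = OutData;   //Destination buffer, at least as large as InData

while(*WorkData)
{
   if(InTag)
   {
      if(*WorkData++ == '>')
         InTag = false;
   }
   else   //Not in a tag
   {
      if(*WorkData == '<')
         InTag = true;
      else
         *WorkOutput++ = *WorkData;   //Write through the working pointer, not OutData
      WorkData++;
   }
}
*WorkOutput = '\0';     //Terminate the stripped text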

I haven't run this particular code, but that's the general idea... Hope it helps?
 
LVL 3

Expert Comment

by:Crius
ID: 6929430
Just another comment before I forget: this won't ignore a > or < that appears inside double quotes and the like, and it tracks the tag state with a boolean rather than a counter, so it can be improved if you need it to be...
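
One way the quote handling mentioned above might look, as a rough sketch only. It uses std::string instead of raw buffers, the function name stripTagsQuoteAware is made up for illustration, and it still does not use a nesting counter or handle comments and scripts:

```cpp
#include <string>

// Hypothetical helper (not from the thread): strips <...> tags, but treats
// '<' and '>' inside a quoted attribute value as ordinary tag content.
std::string stripTagsQuoteAware(const std::string& in)
{
    std::string out;
    bool inTag = false;   // currently between '<' and its closing '>'
    char quote = '\0';    // active quote character inside a tag, or '\0'

    for (char c : in)
    {
        if (!inTag)
        {
            if (c == '<')
                inTag = true;     // start of a tag, nothing is copied
            else
                out += c;         // plain text outside any tag
        }
        else if (quote != '\0')
        {
            if (c == quote)
                quote = '\0';     // closing quote ends the attribute value
        }
        else if (c == '"' || c == '\'')
        {
            quote = c;            // opening quote of an attribute value
        }
        else if (c == '>')
        {
            inTag = false;        // real end of the tag
        }
    }
    return out;
}
```

With this, input such as <a title="x > y">link</a> keeps only the word link, because the > inside the quotes no longer ends the tag.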
 
LVL 4

Accepted Solution

by:
pagladasu earned 100 total points
ID: 6929493
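If the line is in a CString variable called strLine, you can pull out the text between the first '<' and the first '>' with something like:

// Mid takes a start index and a character COUNT, so subtract the position of '<'.
// Assumes both characters were found and that '<' comes first.
int nOpen = strLine.Find('<');
int nClose = strLine.Find('>');
CString strExtract = strLine.Mid(nOpen + 1, nClose - nOpen - 1);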
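
For anyone without MFC at hand, here is a hedged sketch of the same idea with std::string, where substr(pos, count) plays the role of Mid(nFirst, nCount). The function name firstTagContents is invented for this example, and it returns an empty string when no complete tag is found:

```cpp
#include <string>

// Hypothetical helper: returns the text between the first '<' and the next
// '>' that follows it, or an empty string if there is no complete tag.
std::string firstTagContents(const std::string& line)
{
    const std::string::size_type open = line.find('<');
    if (open == std::string::npos)
        return std::string();

    const std::string::size_type close = line.find('>', open + 1);
    if (close == std::string::npos)
        return std::string();

    // The second argument is a length, not an end index.
    return line.substr(open + 1, close - open - 1);
}
```

The point to watch is the second argument: it is a length, so passing the index of '>' directly only gives the right result when '<' happens to be the first character of the line.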

 
LVL 10

Expert Comment

by:makerp
ID: 6930912
Once you have all your data from the internet function, run it through this and then create a CString from the result (the caller must free() the returned buffer):

char *removeHTML(char *str)
{
     bool copy = true;
     int current = 0;

     char *retval = (char*)malloc(strlen(str) + 1);

     for(int i=0;i<strlen(str);i++)
     {
          if(str[i] == '<')
               copy = false;
          if(str[i] == '>')
               copy = true;

          if(copy && str[i] != '>') retval[current++] = str[i];
     }
     retval[current++] = '\0';

     return (retval = (char*)realloc(retval,current));
}
 
LVL 3

Expert Comment

by:Crius
ID: 6931389
Makerp's solution is simpler to understand than mine, but if you do use it, you should at a minimum store the value of strlen(str) in a variable and use that in the for loop and the malloc. That way strlen is not re-evaluated on every iteration.

As an alternative to using strlen() in the loop in any form, you could use str[i] as the loop condition: it evaluates to false at the terminating NUL character, which breaks you out of the loop.
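
A minimal sketch of that suggestion, assuming the same strip-everything-between-< and > behaviour as makerp's routine. The name removeHTMLOnePass is made up here, strlen is called once for the allocation only, and the caller is expected to free() the result:

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical variant of makerp's routine: the loop stops at the
// terminating NUL instead of calling strlen() on every iteration.
char *removeHTMLOnePass(const char *str)
{
    // One strlen() call, used only to size the output buffer.
    char *retval = static_cast<char*>(std::malloc(std::strlen(str) + 1));
    if (retval == NULL)
        return NULL;

    std::size_t current = 0;   // next write position in retval
    bool copy = true;          // false while inside a tag

    for (std::size_t i = 0; str[i] != '\0'; ++i)
    {
        if (str[i] == '<')
            copy = false;                  // tag starts, stop copying
        else if (str[i] == '>')
            copy = true;                   // tag ends, resume copying
        else if (copy)
            retval[current++] = str[i];    // ordinary text
    }
    retval[current] = '\0';
    return retval;
}
```

The output can never be longer than the input, so a single allocation of strlen(str) + 1 bytes is always enough.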
 
LVL 10

Expert Comment

by:makerp
ID: 6931435
Good point. Yes, avoid calling routines like strlen repeatedly, as each call does a linear scan through the string. Imagine doing that on every iteration over a 64k string.
 
LVL 3

Author Comment

by:SuperMario
ID: 6931525
Actually, the way my program is structured allows CString.Mid to work very well. I forgot that it had another overloaded version that accepts 2 parameters! There was also a logic error on my part within the function, which I have sorted out.

Thank you all for your input... sorry it was such a simple answer. I really appreciate the help!
 
LVL 3

Expert Comment

by:Crius
ID: 6931554
Awesome to hear. :) Never a need to apologize for asking a simple question. Who knows what type of complicated answers could result. :)

char *removeHTML(char *str)
{
    char *retval = (char*)calloc(strlen(str) + 1, 1);
    char *WorkBegin, *WorkEnd, *Out;

    if(!retval)
        return NULL;
    WorkBegin = str;
    Out = retval;       //Next write position in the output
    while(1)
    {
         if((WorkEnd = strchr(WorkBegin, '<')) == NULL)
         {
             strcpy(Out, WorkBegin);   //No more tags: copy the remaining text
             break;
         }
         memcpy(Out, WorkBegin, WorkEnd-WorkBegin);   //Copy the text before the tag
         Out += WorkEnd-WorkBegin;
         if((WorkEnd = strchr(WorkEnd + 1, '>')) == NULL)
             break;     //Malformed HTML or incomplete string: drop the unterminated tag
         WorkBegin = WorkEnd + 1;      //Resume just after the tag
    }
    return (char*)realloc(retval,strlen(retval)+1);
}