Solved

Using CString Left and Right, killing HTML

Posted on 2002-04-09
Last Modified: 2008-02-26
Is there an easy way, if any, to take text in a CInternetFile and remove ALL the HTML tags, so that just plain text is there?

Right now I am connecting to a URL, opening a CInternetFile and reading it line by line, but all the HTML is really getting in the way, making it very hard to parse the data. Any suggestions?

Thanks,
Dan
Question by:SuperMario
 
LVL 3

Expert Comment

by:Crius
ID: 6929425
Well... You can use a generic parser to rip out all the tags...

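char *WorkData, *WorkOutput;
bool InTag = false;

WorkData = InData;      //Source buffer (null-terminated)
WorkOutput = OutData;   //Destination buffer, at least as large as the source

while(*WorkData)
{
   if(InTag)
   {
      if(*WorkData++ == '>')
         InTag = false;
   }
   else   //Not intag
   {
      if(*WorkData == '<')
         InTag = true;
      else
         *WorkOutput++ = *WorkData;   //Write through the working pointer so OutData keeps pointing at the start
      WorkData++;
   }
}
*WorkOutput = '\0';   //Terminate the stripped text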

I haven't run this particular code, but that's the general idea... Hope it helps?
 
LVL 3

Expert Comment

by:Crius
ID: 6929430
Just another comment before I forget. This won't ignore > or < inside double-quoted attribute values, and it tracks its state with a single boolean rather than a nesting counter, so it can be improved if you need it to be...
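To illustrate the quoting caveat, here is a minimal sketch of the same state machine that also remembers whether it is inside a quoted attribute value. It is not from the thread: it uses std::string instead of raw buffers, the name stripTags is hypothetical, and it only handles quotes (a stray '<' in ordinary text is still treated as the start of a tag).

```cpp
#include <string>

// Strip everything between '<' and '>' while treating '<' and '>' inside a
// quoted attribute value as ordinary tag content.
// Hypothetical helper, not part of MFC or of the code posted above.
std::string stripTags(const std::string& html)
{
    std::string text;
    bool inTag = false;
    char quote = '\0';   // active quote character while inside a tag, or '\0'

    for (char c : html)
    {
        if (!inTag)
        {
            if (c == '<')
                inTag = true;        // start skipping
            else
                text += c;           // ordinary text, keep it
        }
        else if (quote != '\0')
        {
            if (c == quote)
                quote = '\0';        // closing quote, back to plain tag content
        }
        else if (c == '"' || c == '\'')
            quote = c;               // opening quote inside the tag
        else if (c == '>')
            inTag = false;           // unquoted '>' ends the tag
    }
    return text;
}
```

A CString can be built from the result with CString(result.c_str()) if the rest of the program stays in MFC.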
 
LVL 4

Accepted Solution

by:
pagladasu earned 100 total points
ID: 6929493
If the line is in a CString variable called strLine then you can do something like:

int nOpen = strLine.Find('<');
CString strExtract = strLine.Mid(nOpen + 1, strLine.Find('>') - nOpen - 1);   // Mid takes (first index, count), not (first, last)
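For readers without MFC at hand, the same start-plus-count extraction can be sketched with std::string, which is swapped in here purely as a stand-in for CString. The function name firstTagContents is hypothetical, and the two guards for a missing '<' or '>' are an addition that the one-liner above does not have.

```cpp
#include <string>

// Returns the text between the first '<' and the next '>' in line,
// or an empty string if no complete "<...>" pair is found.
// Hypothetical helper; std::string::substr, like CString::Mid,
// takes a start index and a character count.
std::string firstTagContents(const std::string& line)
{
    std::string::size_type open = line.find('<');
    if (open == std::string::npos)
        return "";
    std::string::size_type close = line.find('>', open + 1);
    if (close == std::string::npos)
        return "";
    return line.substr(open + 1, close - open - 1);
}
```

Note that this pulls out what is inside the angle brackets, so it suits lines where the wanted data sits at a known position relative to a tag rather than general tag stripping.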

 
LVL 10

Expert Comment

by:makerp
ID: 6930912
Once you have all your data from the internet function, run it through this, then create a CString from the result:

char *removeHTML(char *str)
{
     bool copy = true;
     int current = 0;

     char *retval = (char*)malloc(strlen(str) + 1);

     for(int i=0;i<strlen(str);i++)
     {
          if(str[i] == '<')
               copy = false;
          if(str[i] == '>')
               copy = true;

          if(copy && str[i] != '>') retval[current++] = str[i];
     }
     retval[current++] = '\0';

     return (retval = (char*)realloc(retval,current));
}
 
LVL 3

Expert Comment

by:Crius
ID: 6931389
Makerp's solution is simpler to understand than mine, but if you do use it, you should at a minimum store the value of strlen(str) in a variable and use that in the for loop and malloc. That way you don't have to evaluate strlen on every iteration.

As an alternative to using strlen() in the loop in any form, you could use str[i] as the loop condition. It evaluates to false at the terminating null character, which breaks you out of the loop.
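A minimal sketch of both suggestions applied to makerp's routine: strlen is called once for the allocation, and the loop runs until str[i] hits the terminator. The name stripHTML, the const parameter, and the allocation check are assumptions added here, not part of the posted code, and the final realloc is left out for brevity.

```cpp
#include <cstdlib>
#include <cstring>

// Returns a malloc'd copy of str with everything between '<' and '>' removed.
// The caller must free() the result. Returns NULL if the allocation fails.
// Hypothetical variant of the removeHTML routine posted above.
char *stripHTML(const char *str)
{
    size_t len = strlen(str);                  // measured once, outside the loop
    char *retval = (char*)malloc(len + 1);     // output can never exceed the input
    size_t current = 0;
    bool copy = true;

    if (retval == NULL)
        return NULL;

    for (size_t i = 0; str[i]; i++)            // stops at the terminating '\0'
    {
        if (str[i] == '<')
            copy = false;                      // entering a tag
        else if (str[i] == '>')
            copy = true;                       // leaving a tag
        else if (copy)
            retval[current++] = str[i];        // ordinary text
    }
    retval[current] = '\0';
    return retval;
}
```

With this shape the loop does one pass over the input regardless of its length, instead of rescanning the string for its length on every iteration.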
 
LVL 10

Expert Comment

by:makerp
ID: 6931435
Good point. Yes, never keep calling routines like strlen in a loop condition, as they do a linear scan through the string on every call. Imagine a 64k string.
 
LVL 3

Author Comment

by:SuperMario
ID: 6931525
Actually, the way my program is structured allows CString::Mid to work very well. I forgot that it had another overload that accepts 2 parameters! There was also a logic error on my part within the function, which I have sorted out.

Thank you all for your input... sorry it was such a simple answer. I really appreciate the help!
 
LVL 3

Expert Comment

by:Crius
ID: 6931554
Awesome to hear. :) Never a need to apologize for asking a simple question. Who knows what type of complicated answers could result. :)

char *removeHTML(char *str)
{
    char *retval = (char*)calloc(strlen(str) + 1, 1);   //Zero-filled, so the result is always terminated
    char *WorkOut = retval;                             //Next write position in the output
    char *WorkBegin, *WorkEnd;

    if(!retval)
        return NULL;
    WorkBegin = str;
    while(1)
    {
         WorkEnd = strchr(WorkBegin, '<');
         if(!WorkEnd)       //No more tags, copy the remaining text
         {
             memcpy(WorkOut, WorkBegin, strlen(WorkBegin));
             break;
         }
         memcpy(WorkOut, WorkBegin, WorkEnd-WorkBegin);   //Copy the text before the tag
         WorkOut += WorkEnd-WorkBegin;
         WorkEnd = strchr(WorkEnd + 1, '>');
         if(!WorkEnd)       //Malformed HTML or incomplete string, drop the unterminated tag
             break;
         WorkBegin = WorkEnd + 1;   //Resume just past the tag
    }
    return (char*)realloc(retval,strlen(retval)+1);   //Shrink to fit; caller frees
}