  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 228

Remove tabs, double spaces and HTML


I need a function that accepts a character pointer and the length of the data it points to. It should remove all occurrences of tabs, collapse multiple spaces (no more than one space character between words), and strip all HTML from the input.

Here myDoc is the input and result is the final output stream.
void removeTabsNSpacesNHtml(char *myDoc, int len, char *result, int *resultLen)

To remove HTML, one option is to remove everything inside <>. The final result should be plain text.
Asked by: rohgan
 
ozo Commented:
Could the input look like this?
                <IMG SRC = "foo.gif"
                        ALT = "A > B">

                   <!-- <A comment> -->

                   <script>if (a<b && a>c)</script>

                   <# Just data #>

                   <![INCLUDE CDATA [ >>>>>>>>>>>> ]]>

                   <!-- This section commented out.
                       <B>You can't see me!</B>
                   -->
 
rohgan (Author) Commented:
It will mostly contain real HTML that needs to be removed. It's a good idea to also remove <!-- --> comment content, along with any scripts.
 
rohgan (Author) Commented:
I am just trying to capture the actual data displayed on screen and get rid of the HTML. The code would be run on a certain paragraph of the page. I would appreciate it if someone could provide a basic, simple solution to remove spaces, tabs and HTML. Optimization or foolproofing can be done later. Additional checks are always welcome.
 
sunnycoder Commented:
Hi rohgan,

Why reinvent the wheel :-) .. there are plenty of freeware tools which would do this for you ...
e.g.
http://www.snapfiles.com/get/killhtml.html
http://search.cpan.org/dist/HTML-Strip/

for more such utilities
http://www.google.com/search?hl=en&q=strip+html

Cheers!
sunnycoder
 
cryptosid Commented:
Sunnycoder is correct, it's easier to use a ready-made program.

To remove extra spaces you could use a simple technique:


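char sbuf[100], tbuf[100];   // sbuf is assumed to hold the NUL-terminated input
int i = 0, j = 0;

while (sbuf[i] != '\0')
{
    if (sbuf[i] == ' ')
    {
        // keep the first space of a run
        tbuf[j++] = sbuf[i++];
        // skip the remaining spaces in the run
        while (sbuf[i] == ' ')
            i++;
    }
    else
    {
        // ordinary character: copy it and advance both indexes
        tbuf[j++] = sbuf[i++];
    }
}
tbuf[j] = '\0';   // terminate the output string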

This is what I could quickly think of for removing extra spaces; you could work some things out better.

But this is the long way to do it; some guys might have a simpler way.

Regards,
Siddhesh
 
cryptosid Commented:
Removing HTML statements might require you to create a dynamic stack, maybe using some postfix algorithm or something like that, to exclude the nested HTML statements. That's a tedious job, I think.
 
grg99 Commented:
It's easy to do a half-bottomed job of this, rather hard to do it 100% correctly.  

The easy but flawed way is to just look at a character at a time, looking for spaces, tabs, and "<".

That will work for about 72% of web pages.
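
As a rough illustration of that easy character-at-a-time approach, here is a minimal sketch. The name strip_tags_simple and its length-based interface are assumptions made for this example, not code from the thread; it simply drops everything from a "<" up to the next ">".

```c
#include <stddef.h>

/* Hypothetical helper (not from the thread): copies src to dst while
 * dropping everything between '<' and the next '>'.
 * dst must have room for at least len bytes.
 * Returns the number of bytes written to dst. */
size_t strip_tags_simple(const char *src, size_t len, char *dst)
{
    size_t out = 0;
    int in_tag = 0;   /* non-zero while we are between '<' and '>' */

    for (size_t i = 0; i < len; i++) {
        char c = src[i];
        if (in_tag) {
            if (c == '>')
                in_tag = 0;      /* tag ends, resume copying */
        } else if (c == '<') {
            in_tag = 1;          /* tag starts, stop copying */
        } else {
            dst[out++] = c;      /* ordinary text */
        }
    }
    return out;
}
```

On input such as <b>Hi</b> there this leaves "Hi there", but a ">" inside a quoted attribute (as in ALT = "A > B") ends the tag early, so part of the attribute leaks into the output.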

The fully correct way requires you to accurately parse the HTML syntax, which is rather involved.

The things that can trip up the easy way include: quoted strings with "<" in them, block comments, #hex and %hex characters, embedded scripts, and other things that can throw your simple-minded parsers for a loop.
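
Two of those traps, quoted strings and block comments, could be handled with a small state machine. The following is only a sketch under stated assumptions: the name strip_tags_quoted and its interface are made up for this example, and it does not attempt embedded scripts, <![ ... ]]> sections, or character entities.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the thread): strips tags while honouring
 * quoted attribute values and <!-- ... --> comments.
 * dst must have room for at least len bytes.
 * Returns the number of bytes written to dst. */
size_t strip_tags_quoted(const char *src, size_t len, char *dst)
{
    enum { TEXT, TAG, QUOTE, COMMENT } state = TEXT;
    char quote = 0;    /* quote character that opened the QUOTE state */
    size_t out = 0;
    size_t i = 0;

    while (i < len) {
        char c = src[i];
        switch (state) {
        case TEXT:
            if (c == '<') {
                if (len - i >= 4 && memcmp(src + i, "<!--", 4) == 0) {
                    state = COMMENT;   /* skip the whole "<!--" opener */
                    i += 4;
                    continue;
                }
                state = TAG;
            } else {
                dst[out++] = c;        /* visible text */
            }
            break;
        case TAG:
            if (c == '"' || c == '\'') {
                quote = c;             /* '>' inside quotes must not end the tag */
                state = QUOTE;
            } else if (c == '>') {
                state = TEXT;
            }
            break;
        case QUOTE:
            if (c == quote)
                state = TAG;
            break;
        case COMMENT:
            if (len - i >= 3 && memcmp(src + i, "-->", 3) == 0) {
                state = TEXT;          /* skip the whole "-->" closer */
                i += 3;
                continue;
            }
            break;
        }
        i++;
    }
    return out;
}
```

With this sketch the IMG example with ALT = "A > B" and the multi-line comment are dropped cleanly, but script bodies still leak: <script>if (a<b && a>c)</script> comes out as "if (ac)". Skipping scripts would need an additional state that looks for the closing </script> tag.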

 
rohgan (Author) Commented:
Thanks all.

1. Could you please provide me code to remove tabs?

2. I don't mind using any ready-made code for stripping HTML, but it seems all of it is in other languages. Does anyone know of any C code for this? If not, could you provide some basic code?
 
sunnycoder Commented:
Hi rohgan ,

Is there any particular reason why you wish to do this in C only? Other languages are better suited for this work.

I browsed through your profile and question history and it seems you are currently attending some courses. Is this a course assignment/project ?

sunnycoder
 
rohgan (Author) Commented:
This is a commercial project; I think I am a commercial member here. The code is going to be used to provide input to a larger program that is already coded in C. I just need these additional modules.
 
rohgan (Author) Commented:
If an admin needs details of the project to verify, I could provide them via email (not posted on the forum).
 
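cryptosid Commented:
In the code above, just replace the space character with '\t', the tab character, so that the whole loop body becomes:

if (sbuf[i] == '\t')
{
    // keep the first tab of a run
    tbuf[j++] = sbuf[i++];
    // skip the remaining tabs in the run
    while (sbuf[i] == '\t')
        i++;
}
else
{
    // ordinary character: copy it and advance both indexes
    tbuf[j++] = sbuf[i++];
}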

Hope that helps.
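
Note that the snippet above collapses a run of tabs into one tab rather than removing tabs. If the goal is no tabs at all and at most one space between words, one option is to treat tabs and spaces as the same kind of blank. The sketch below assumes that a tab between two words should become a single space (so the words do not run together); the name squeeze_blanks and the length-based interface are hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper (not from the thread): copies src to dst, turning
 * every run of spaces and/or tabs into a single space.
 * dst must have room for at least len bytes.
 * Returns the number of bytes written to dst. */
size_t squeeze_blanks(const char *src, size_t len, char *dst)
{
    size_t out = 0;
    int prev_blank = 0;   /* was the previous input character a space or tab? */

    for (size_t i = 0; i < len; i++) {
        char c = src[i];
        if (c == ' ' || c == '\t') {
            if (!prev_blank)
                dst[out++] = ' ';   /* emit one space per run of blanks */
            prev_blank = 1;
        } else {
            dst[out++] = c;
            prev_blank = 0;
        }
    }
    return out;
}
```

Because it works on a pointer and a length, it should fit the (myDoc, len, result, resultLen) shape from the question: the return value is what would be stored in *resultLen.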