• C

Remove tabs, double spaces and HTML


I need a function that accepts a character pointer and the length of the data it points to. It should remove all occurrences of tabs, multiple spaces (no more than one space character between words), and all HTML from the input.

Here myDoc is input and result is the final output stream.
void removeTabsNSpacesNHtml(char *myDoc, int len, char *result, int *resultLen)

To remove HTML, one option is to remove everything inside <>. The final result should be plain text.
rohganAsked:

ozoCommented:
Could the input look like
                <IMG SRC = "foo.gif"
                        ALT = "A > B">

                   <!-- <A comment> -->

                   <script>if (a<b && a>c)</script>

                   <# Just data #>

                   <![INCLUDE CDATA [ >>>>>>>>>>>> ]]>

                   <!-- This section commented out.
                       <B>You can't see me!</B>
                   -->
rohganAuthor Commented:
The input mostly contains real HTML that needs to be removed. It's a good idea to remove <!-- content as well, along with any script.
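Two of the cases ozo lists, comments and a '>' inside a quoted attribute value, can be handled by a small scanning helper. The sketch below is only an illustration: `skipMarkup` is a hypothetical name, not something from the thread, and it assumes the caller has already found a '<' at index `i`. It does not handle script bodies, CDATA sections, or the `<# ... #>` form.

```c
#include <string.h>

/* Hypothetical helper: s[i] must be '<'. Returns the index just past the
   construct that starts there, or len if it is never closed. */
int skipMarkup(const char *s, int len, int i)
{
    /* Comment: everything up to the first "-->" is dropped. */
    if (len - i >= 4 && memcmp(s + i, "<!--", 4) == 0) {
        for (int k = i + 4; k + 3 <= len; k++)
            if (memcmp(s + k, "-->", 3) == 0)
                return k + 3;
        return len;
    }

    /* Ordinary tag: a '>' inside a quoted attribute value does not end it. */
    char quote = 0; /* 0 when outside quotes, else the opening quote char */
    for (int k = i + 1; k < len; k++) {
        char c = s[k];
        if (quote) {
            if (c == quote)
                quote = 0;
        } else if (c == '"' || c == '\'') {
            quote = c;
        } else if (c == '>') {
            return k + 1;
        }
    }
    return len;
}
```

A caller would copy text until it sees '<', then jump to the index this function returns. Dropping script content as well would need an extra case-insensitive search for the closing `</script>` tag, which is left out here.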
rohganAuthor Commented:
I am just trying to capture the actual data displayed on screen and get rid of the HTML. The code will be run on a certain paragraph of the page. I would appreciate it if someone could provide a basic, simple solution to remove spaces, tabs and HTML. Optimization or foolproofing can be done later. Additional checks are always welcome.
sunnycoderCommented:
Hi rohgan,

Why reinvent the wheel :-) .. there are plenty of freeware tools which would do this for you ...
e.g.
http://www.snapfiles.com/get/killhtml.html
http://search.cpan.org/dist/HTML-Strip/

for more such utilities
http://www.google.com/search?hl=en&q=strip+html

Cheers!
sunnycoder
cryptosidCommented:
Sunnycoder is correct, it's easier to use a ready-made program.

To remove extra spaces you could use a simple technique:


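char sbuf[100], tbuf[100];   // sbuf must already hold the NUL-terminated input
int i = 0, j = 0;

while (sbuf[i] != '\0')
{
    if (sbuf[i] == ' ')
    {
        // copy one space, then skip the rest of the run
        tbuf[j++] = sbuf[i++];
        while (sbuf[i] == ' ')
            i++;
    }
    else
    {
        // any other character is copied unchanged
        tbuf[j++] = sbuf[i++];
    }
}
tbuf[j] = '\0';              // terminate the output string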

This is what I could quickly think of for removing extra spaces; you could probably work some things out better.

It is the long way to do it, though. Someone might have a simpler way.

Regards,
Siddhesh
cryptosidCommented:
Removing HTML statements might require you to create a dynamic stack, or maybe use a postfix-style algorithm or something like that, to exclude the nested HTML statements. That's a tedious job, I think.
grg99Commented:
It's easy to do a half-bottomed job of this, rather hard to do it 100% correctly.  

The easy but flawed way is to just look at a character at a time, looking for spaces, tabs, and "<".

That will work for about 72% of web pages.

The fully correct way requires you to accurately parse the HTML syntax, which is rather involved.

The things that can trip up the easy way include: quoted strings with "<" in them, block comments, #hex and %hex characters, embedded scripts, and other things that can throw a simple-minded parser for a loop.
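The easy, character-at-a-time approach described above can be sketched roughly as follows. This is a minimal sketch, not a full HTML parser. It uses the signature from the question and assumes that `result` has room for at least `len` bytes, that newlines count as whitespace along with spaces and tabs, that leading and trailing whitespace is dropped, and that a removed tag leaves no space behind. It has all the weaknesses just listed (a quoted '>' ends the tag early, comment and script contents are not treated specially).

```c
#include <assert.h>

/* Signature taken from the question. The output is NOT NUL-terminated;
   its length is reported through resultLen. */
void removeTabsNSpacesNHtml(char *myDoc, int len, char *result, int *resultLen)
{
    int inTag = 0;        /* 1 while between '<' and '>' */
    int pendingSpace = 0; /* 1 if whitespace was seen since the last output */
    int out = 0;

    assert(myDoc && result && resultLen);

    for (int i = 0; i < len; i++) {
        char c = myDoc[i];
        if (inTag) {
            if (c == '>')
                inTag = 0;          /* tag ends, resume copying text */
        } else if (c == '<') {
            inTag = 1;              /* drop everything up to the next '>' */
        } else if (c == ' ' || c == '\t' || c == '\n' || c == '\r') {
            pendingSpace = 1;       /* remember the gap, emit nothing yet */
        } else {
            if (pendingSpace && out > 0)
                result[out++] = ' ';/* one space for the whole run */
            pendingSpace = 0;
            result[out++] = c;
        }
    }
    *resultLen = out;
}
```

For the input `<p>Hello,\t\tworld  </p>  <b>bye</b>` this produces `Hello, world bye` and sets `*resultLen` to 16. Deferring the space until the next visible character is what keeps whitespace on either side of a tag from producing two spaces.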

rohganAuthor Commented:
Thanks all.

1. Could you please provide code to remove tabs?

2. I don't mind using any ready-made code for stripping HTML, but it seems all of it is in other languages. Does anyone know of any C code for this? If not, could you provide some basic code?
sunnycoderCommented:
Hi rohgan ,

Is there any particular reason why you wish to do this in C only? Other languages are better suited for this work.

I browsed through your profile and question history and it seems you are currently attending some courses. Is this a course assignment/project ?

sunnycoder
rohganAuthor Commented:
This is a commercial project, and I think I am a commercial member here. The code is going to be used to provide input to a larger program that is already coded in C. I just need these additional modules.
rohganAuthor Commented:
If an admin needs details of project to verify, I could provide them via email (not post on forum)
cryptosidCommented:
In the code above, just replace the space character ' ' with '\t', the tab character:

while (sbuf[i] != '\0')
{
    if (sbuf[i] == '\t')
    {
        // keep one tab, then skip the rest of the run
        tbuf[j++] = sbuf[i++];
        while (sbuf[i] == '\t')
            i++;
    }
    else
    {
        tbuf[j++] = sbuf[i++];
    }
}
tbuf[j] = '\0';

Hope that helps.
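One thing to watch with separate space and tab passes is that a mixed run such as space, tab, space still leaves several blanks behind. A possible alternative, sketched here under the assumption that a tab should count as ordinary whitespace and become a single space, is to treat both characters in the same test. `squeezeBlanks` is a hypothetical name, and the function works in place on a NUL-terminated string.

```c
#include <stddef.h>

/* Hypothetical helper: rewrites s in place so that every run of spaces
   and/or tabs becomes one space. Returns the new length. */
size_t squeezeBlanks(char *s)
{
    size_t in = 0, out = 0;

    while (s[in] != '\0') {
        if (s[in] == ' ' || s[in] == '\t') {
            s[out++] = ' ';                 /* one space for the whole run */
            while (s[in] == ' ' || s[in] == '\t')
                in++;
        } else {
            s[out++] = s[in++];             /* copy everything else as is */
        }
    }
    s[out] = '\0';
    return out;
}
```

Working in place is safe because the write index never gets ahead of the read index. Leading and trailing blanks are reduced to one space rather than removed, so a caller that wants them gone would need to trim separately.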