Solved

remove CSS and HTML tags

Posted on 2011-04-22
Last Modified: 2012-05-11
Hello. Can anyone suggest an effective way to remove CSS and HTML tags from pages as they are being scraped?
Thanks.
"
Girl in the Bath, Peninsula Hotel, Tokyo | Surly Bastard
div#fancy_inner {border-color:#BBBBBB}
div#fancy_close {right:-15px;top:-12px}
div#fancy_bg {background-color:#FFFFFF}
.wp-rotator-wrap {
padding: 0; margin: 0;
}
.wp-rotator-wrap .pane {
height: 300px;
width: 400px;
overflow: hidden;
position: relative;
padding: 0px;
margin: 0px;
}
.wp-rotator-wrap .elements {
height: 300px;
padding: 0px;
margin: 0px;
}
.wp-rotator-wrap .featured-cell {
width: 400px;
height: 300px;
display: block;
position: absolute;
top: 0;
left: 0;
margin: 0px;
padding: 0px;
}
.wp-rotator-wrap .featured-cell .image {
position: absolute;
top: 0;
left: 0;
}
.wp-rotator-wrap .featured-cell .info {
position: absolute;
left: 0;
bottom: 0px;
width: 400px;
height: 50px;
padding: 8px 8px;
overflow: hidden;
background: url(http://surlybastard.org/wp-content/plugins/wp-rotator/feature-bg.png) transparent;
color: #ddd;  
}
.wp-rotator-wrap .featured-cell .info h1 {
margin: 0;
padding: 0;
font-size: 15px;
color: #CCD;
}
.wp-rotator-wrap .current-cell { z-index: 500; }
Surly Bastard<br>
You annoy me.
Girl in the Bath, Peninsula Hotel, Tokyo<br>
<br>
I forget who she was.
Recent entries
Me by Andreas<br>
Burlesque<br>
For Japan<br>
Ribcage<br>
Lil Bastard<br>
Meta
Log in<br>
XHTML<br>
Recent comments
Blogroll
Documentation<br>
Plugins<br>
Suggest Ideas<br>
Support Forum<br>
Themes<br>
WordPress Blog<br>
WordPress Planet<br>
Pages
Test<br>
Cats
My Shit<br> (89)
Shit I like<br> (10)
Things I didnt create<br> (10)
Things that dont suck<br> (8)
Archives
Select month
April 2011  (3)
March 2011  (8)
February 2011  (5)
January 2011  (29)
December 2010  (26)
November 2010  (28)
Tag Cloud
All photos © Jim O'Connell, unless otherwise noted. 
"

Question by: onyourmark

Accepted Solution

by: Ray Paseur
strip_tags() is your friend.
http://us.php.net/manual/en/function.strip-tags.php

If you want to post a sample of the inputs and your expected outputs, we might be able to give you more concrete assistance.  Best regards, ~Ray
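
A minimal sketch of what strip_tags() does on its own with a page that carries inline CSS. The sample markup in $html is invented for illustration and is not the asker's actual input. It suggests why CSS rules can still show up in scraped text: strip_tags() removes the tags but keeps the text that sat between them, including the contents of a style block.

```php
<?php
// Hypothetical sample page, invented for this illustration.
$html = '<html><head><title>Demo</title>'
      . '<style>div#fancy_bg {background-color:#FFFFFF}</style></head>'
      . '<body><p>You annoy me.</p></body></html>';

// strip_tags() drops the tags themselves but leaves the text between
// them, so the CSS rule inside <style> survives as plain text.
$plain = strip_tags($html);

echo $plain, "\n";
```

If the CSS text is unwanted, the style and script elements have to be removed together with their contents before strip_tags() is called.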

Assisted Solution

by: Mohamed Abowarda
strip_tags() is the default HTML tag stripper in PHP. However, it can't strip some of the tags, so this enhanced version, called strip_html_tags(), removes more HTML elements.

How to strip HTML tags, scripts, and styles from a web page:
http://nadeausoftware.com/articles/2007/09/php_tip_how_strip_html_tags_web_page
/**
 * Remove HTML tags, including invisible text such as style and
 * script code, and embedded objects.  Add line breaks around
 * block-level tags to prevent word joining after tag removal.
 * Note: with the /u modifier, preg_replace() returns NULL when
 * $text is not valid UTF-8, so convert the input first if needed.
 */
function strip_html_tags( $text )
{
    $text = preg_replace(
        array(
          // Remove invisible content (the element and everything inside it)
            '@<head[^>]*?>.*?</head>@siu',
            '@<style[^>]*?>.*?</style>@siu',
            '@<script[^>]*?>.*?</script>@siu',
            '@<object[^>]*?>.*?</object>@siu',
            '@<embed[^>]*?>.*?</embed>@siu',
            '@<applet[^>]*?>.*?</applet>@siu',
            '@<noframes[^>]*?>.*?</noframes>@siu',
            '@<noscript[^>]*?>.*?</noscript>@siu',
            '@<noembed[^>]*?>.*?</noembed>@siu',
          // Add line breaks before and after blocks
            '@</?((address)|(blockquote)|(center)|(del))@iu',
            '@</?((div)|(h[1-6])|(ins)|(isindex)|(p)|(pre))@iu',
            '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
            '@</?((table)|(th)|(td)|(caption))@iu',
            '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
            '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
            '@</?((frameset)|(frame)|(iframe))@iu',
        ),
        array(
          // One space for each of the 9 "invisible content" patterns
            ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
          // A newline plus the matched text for each of the 7 block patterns
            "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",
            "\n\$0",
        ),
        $text );
    // The remaining tags are ordinary markup, which strip_tags() can remove
    return strip_tags( $text );
}
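
As an alternative to the regular expressions above, the same goal can be sketched with PHP's DOM extension: parse the page, delete the style, script and noscript nodes, then read the text that remains. This is a swapped-in technique, not the method from the linked article, and html_to_text() is a hypothetical name chosen for this example. It assumes the DOM extension is enabled and that the input is a non-empty string. Note that loadHTML() assumes ISO-8859-1 unless the page declares its own charset.

```php
<?php
// Sketch only: html_to_text() is a hypothetical helper, not a PHP built-in.
function html_to_text(string $html): string
{
    $doc = new DOMDocument();

    // Scraped markup is often invalid; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Copy the matches to an array, then remove each node with its contents.
    $xpath = new DOMXPath($doc);
    $nodes = iterator_to_array($xpath->query('//style | //script | //noscript'));
    foreach ($nodes as $node) {
        $node->parentNode->removeChild($node);
    }

    // Collapse the whitespace runs left behind by the removed markup.
    return trim(preg_replace('/\s+/', ' ', $doc->documentElement->textContent));
}

// Usage with an invented sample page.
$out = html_to_text(
    '<html><head><style>p {color:red}</style></head>'
    . '<body><p>You annoy me.</p><script>var x = 1;</script></body></html>'
);
echo $out, "\n";
```

Unlike strip_html_tags(), this sketch adds no line breaks around block-level tags, so text from adjacent elements can run together when there is no whitespace between them in the source.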

Assisted Solution

by: Ray Paseur
"can't strip some of the tags"

Which ones?  Have you got an example showing how it fails?  While it's a fairly brittle function, I think it works well if you have valid HTML.
http://us.php.net/manual/en/function.strip-tags.php