webstuck5

Get Rid of Spaces between > and <

I use the following code:

        $page_entire_code =~ s/> +?</></g;

to remove spaces between > and < in my HTML web pages. However, I noticed that it messes up my web page's breadcrumbs. For example:

  <div id="breadcrumb" itemprop="breadcrumb">
    <b>
      You are here: <a href="http://www.romancestuck.com/">RomanceStuck</a> > <a href="http://www.romancestuck.com/marriage/love-and-marriage.htm">Marriage</a> > 11 Tips for Improving a Strained Relationship
    </b>
  </div>


gets compressed to:

<div id="breadcrumb" itemprop="breadcrumb"><b>You are here: <a href="http://www.romancestuck.com/">RomanceStuck</a> ><a href="http://www.romancestuck.com/marriage/love-and-marriage.htm">Marriage</a> > 11 Tips for Improving a Strained Relationship</b></div>


The > after the RomanceStuck link no longer has the space after it that it should have. How can I change my Perl substitution so that it doesn't mess up my breadcrumbs? I was thinking I could match only a > that comes directly after some character other than a space.

Thanks!
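
The behavior described in the question can be reproduced with a short script. This is only a sketch: the breadcrumb string is a shortened, hypothetical stand-in for the real markup, and the variable names are made up for illustration. It shows that the pattern cannot tell a > that closes a tag from a > that is plain text.

```perl
use strict;
use warnings;

# Shortened, hypothetical stand-in for the breadcrumb markup in the question.
my $html = '<b>Home: <a href="/">Site</a> > <a href="/m/">Marriage</a> > Tips</b>';

# The original substitution: remove spaces between ANY ">" and the next "<".
# The literal ">" separator is followed by " <a", so it matches too.
(my $squeezed = $html) =~ s/> +?</></g;

print "$squeezed\n";
# prints: <b>Home: <a href="/">Site</a> ><a href="/m/">Marriage</a> > Tips</b>
```

The second separator survives only because it is followed by text rather than by a tag.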
ozo

Have you read the output of perldoc -q html?
Using regexes on HTML is generally not a good idea. Something like HTML::Packer comes to mind for minifying HTML.

Though, if you really need something quick and dirty, you can try using &gt; instead of a literal >.
As Phil suggests, you should avoid a literal > (or <) in HTML text and use &gt; or &lt; instead. If you do that, your regex will work fine.
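
To illustrate the &gt; suggestion, here is a minimal sketch using the unchanged substitution from the question on a shortened, hypothetical breadcrumb string. Once the separator is written as an entity, the only > characters left in the input are the ones that close tags.

```perl
use strict;
use warnings;

# Hypothetical breadcrumb with the separators written as &gt; entities.
my $html = '<b> <a href="/">Site</a> &gt; <a href="/m/">Marriage</a> &gt; Tips</b>';

# Same substitution as in the question, unchanged.
(my $squeezed = $html) =~ s/> +?</></g;

print "$squeezed\n";
# prints: <b><a href="/">Site</a> &gt; <a href="/m/">Marriage</a> &gt; Tips</b>
```

The space between <b> and the first link is removed, while the spaces around each &gt; are left alone because the text "&gt; <" contains no > character.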
webstuck5

ASKER

I thought about using &gt;, but the example at https://support.google.com/webmasters/answer/185417?hl=en uses >. I don't want to risk Google ignoring my breadcrumb because I used &gt;.

$page_entire_code =~ s/([^ ]>) +?</$1</g; seems to work!

Thanks for all your help!
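
The fix above can be checked with a small sketch. The input string and variable names are hypothetical, and the negative lookbehind version is an alternative added here for comparison, not something proposed in the thread. Both rely on the same assumption: a > that follows a space is text, while a > that follows any other character closes a tag.

```perl
use strict;
use warnings;

# Hypothetical markup: tag-to-tag spaces plus literal ">" separators.
my $html = '<div> <b> <a href="/">Site</a> > <a href="/m/">Marriage</a> > Tips</b> </div>';

# The asker's version: capture the non-space character and the ">" and put them back.
(my $captured = $html) =~ s/([^ ]>) +?</$1</g;

# Alternative with a negative lookbehind: match ">" only when not preceded by a space.
(my $lookbehind = $html) =~ s/(?<! )> +</></g;

print "$captured\n";
# prints: <div><b><a href="/">Site</a> > <a href="/m/">Marriage</a> > Tips</b></div>
```

Both versions still handle only literal spaces (not tabs or newlines), and a text > written without a space before it would still be treated as the end of a tag.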
I've requested that this question be closed as follows:

Accepted answer: 0 points for webstuck5's comment #a39819291

for the following reason:

I figured out how to do it.
I am very, very surprised that Google says to use >. The HTML spec recommends escaping < and > in text as &lt; and &gt;. XHTML is based on XML, where a literal < in text makes the document invalid; a literal > is technically allowed there (except in the sequence ]]>), but escaping both is the safer habit.
That example is actually using a single right-pointing angle quotation mark (&rsaquo;), which looks awfully similar to a >.  I don't think they're saying that you *have* to do it that way though - it just happened to be the example.

I'd shy away from <> when possible, but glad to hear you came up with something that works for you.
I now see that the Google example doesn't actually use > but now I am more confused. In the Google example's HTML source, it shows the symbol as ›. I put &rsaquo; on my page as suggested but it shows as &rsaquo; when I view the page's HTML source. What is the difference between my page and the Google example page?
In the source, you'll see the raw HTML, so any entities (such as &rsaquo;) appear exactly as they were written.

On the Google example page, they used the actual symbol ›.
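
The difference between the entity and the actual symbol can be sketched in Perl. This is only an illustration with made-up variable names; it assumes the page is meant to be UTF-8. The entity is plain ASCII in the source, while the literal symbol is a single character (U+203A) that takes several bytes, so it only displays correctly when the file's encoding matches the encoding the page declares.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $entity  = '&rsaquo;';   # what the source contains when the entity is used
my $literal = "\x{203A}";   # the actual symbol as one Unicode character

# Encode the literal character the way a UTF-8 page would store it.
my $bytes = encode('UTF-8', $literal);
my $hex   = uc unpack 'H*', $bytes;

printf "entity:  %d ASCII characters\n", length $entity;
printf "literal: 1 character, %d bytes in UTF-8 (%s)\n", length $bytes, $hex;
# prints:
# entity:  8 ASCII characters
# literal: 1 character, 3 bytes in UTF-8 (E280BA)
```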
So, how can I use the actual › symbol? When I put it in my code, it showed as a weird question mark when I viewed the web page.
ASKER CERTIFIED SOLUTION
Phil Phillips