• Status: Solved

Separate One Very Long String into Many Short Lines of HTML Source by Inserting \n After Spaces or ">"

   My Perl script outputs HTML source for a whole page into one very l-o-n-g string named  $output.  Before saving it to a database text field, I want to separate this long string into many equal-length lines of HTML source code by inserting a newline "\n" after every 100 characters.   (I don't want to insert <br> or <br /> to make line breaks visible in the Web browser; I only want to affect the appearance of the source code itself.)
    However, if the 100th character is within a word, it should wait until the following *space* to insert the  \n .  
    If the 100th character is within a one-word tag (e.g. <strong> or <center> or <br> or </strong> or </center> or <ol>), it should wait until the ">" on the right side of the tag, before it inserts the  \n .  
    But if the tag has spaces, I think it's OK to insert a \n after the space, like <font \nface=Arial>.

    I want to have a function, like:   print wrap($output);  
and, for example, that would turn the string of 5,000 characters into 50 lines of about 100 characters each.

  In a similar question, an Expert offered the following code:
---------------------------------
use HTML::Parser;
$p = HTML::Parser->new( api_version => 3,
                         text_h => [\&text, "text"],
                         default_h => [sub { local $_=shift; $count -= length; print }, 'text'],
                        );
$limit=20;      
sub text{
    local $_=shift;
    $count = 0 if $count < 0;
    s/(\S{$count,})\s/$1\n/;
    s/(\n.{$limit,})\s/$1\n/g;
    $count = $limit - (rindex$_,"\n") + length;
    print;
}
$p->parse_file(*STDIN);
-----------------------------------
      However, it appears that code is designed to work with input from a file, rather than acting upon a string variable.
    Also, even if it could work as a function acting upon a string, I'm not sure how to arrange that, or whether it would break the lines according to my specifications above (after every 100 characters but not within a word or a one-word tag, etc.).  
    Can that code be arranged for use as a function like print wrap($output);  ?  If so, will it fit my specifications?  If not, what would? Expert guidance would be appreciated.
Asked by: Randall-B

mjcoyne Commented:
Instead of:

$p->parse_file(*STDIN);

you probably want:

$p->parse( $output )

Randall-B (Author) Commented:
mjcoyne,
   When I tried that, and then did  print $p;  , the only output was:
                HTML::Parser=HASH(0x82bbd98)
Why?

ozo Commented:
you don't do print $p
you call $p->parse( $output ), which calls sub text (or whatever HTML::Parser->new told it to call)
which does the print, or whatever else you wanted to do with the parse
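
A minimal sketch of that point, assuming HTML::Parser is installed: the handlers can append to a string instead of printing, so the parse can be wrapped in a function that returns its result. The function name rebuild_html and the whitespace-collapsing text handler are illustrative and not part of the code quoted in the question. Calling $p->eof after $p->parse flushes any text the parser is still holding.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: run $html through HTML::Parser and return the
# reassembled source as a string instead of printing it.
sub rebuild_html {
    my ($html) = @_;
    my $out = '';
    my $p = HTML::Parser->new(
        api_version => 3,
        # Text between tags: collapse runs of whitespace (illustrative only).
        text_h    => [ sub { my $t = shift; $t =~ s/\s+/ /g; $out .= $t }, 'text' ],
        # Everything else (tags, comments, declarations): copy unchanged.
        default_h => [ sub { $out .= shift }, 'text' ],
    );
    $p->unbroken_text(1);   # deliver each run of text as a single chunk
    $p->parse($html);       # invokes the handlers above
    $p->eof;                # flush any text still buffered
    return $out;
}

print rebuild_html("<b>one   two</b> three"), "\n";
```

Any wrapping logic would go in the text handler (or be applied to the returned string), and the caller decides whether to print the result or store it.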

Randall-B (Author) Commented:
That makes sense, but when I run  $p->parse( $output)  without a print statement, it doesn't show any results at all.  I see the line in that code which says "print;", but it does not seem to be printing anything.

Randall-B (Author) Commented:
Maybe the text-wrapping script at  http://www.infocopter.com/perl/recipe-string-wrap.htm  could be modified to avoid adding newlines inside a word or one-word tag?

Randall-B (Author) Commented:
After testing the function from  http://www.infocopter.com/perl/recipe-string-wrap.htm , I believe it would work perfectly, if it could be changed to add the "\n" only after a space or ">".
    How would I revise that script to do this:
1. At the designated breakpoint, test whether the preceding character is a space or ">" .
2. If it is not a space or ">", move forward to the closest space or ">" and designate the breakpoint as coming after that next space or > .
3. Insert the "\n" at this new breakpoint.
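
The three steps above can be sketched with a single substitution in plain Perl. This is only an illustration under stated assumptions: wrap_source is a hypothetical name, the width is passed as a second argument (default 100), and a line that contains no later space or ">" is simply left long.

```perl
use strict;
use warnings;

# Hypothetical helper: insert "\n" after the first space or ">" found at or
# beyond column $width. Existing newlines restart the count, because "."
# does not match "\n".
sub wrap_source {
    my ($text, $width) = @_;
    $width = 100 unless defined $width;
    my $min = $width - 1;    # characters required before the break character
    # Lazily take at least $min characters, then extend to the nearest
    # space or ">", and put the newline right after it.
    $text =~ s/(.{$min,}?[ >])/$1\n/g;
    return $text;
}

print wrap_source("<strong>bold</strong> text here", 5), "\n";
```

Because the break can follow any ">", a newline may still land between a tag and the text next to it, where the browser treats it as a space.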

Perl_Diver Commented:
this seems to be working pretty well:

sub PrintLine {
   if($Mode eq 'New') {
      $Word=~s|<font.*?>||ig;
      $Word=~s|<td>|<TD>|ig;
      $output .= "$Style{StartNew}$Word$Style{EndNew} ";
   }
   elsif($Mode eq 'Old') {
      $Word=~s|<[^<]+>||g;
      $output .= "$Style{StartOld}$Word$Style{EndOld} ";
   }
   else {
      $output .= $Word;
      $output .= ' ' if ($Word =~ />$/);
   }
   $output = wrap($output);
    print $output;
}
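sub wrap {
   my $wp = 60;   # maximum line length
   # Reverse the string so the greedy match below works backward from the end.
   my $line = reverse(shift);
   # After up to $wp characters, add \n after whitespace or "<". Once the
   # string is reversed back, the break sits just before that character.
   $line =~ s/(.{0,$wp})([\s<])/$1$2\n/g;
   $line = reverse($line);
   return($line);
}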

 I can't test with the PrintLine function but when I feed the wrap function some html encoded strings it produces fairly good output.
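
To see where this technique places its breaks, here is a small sketch of the same reverse-and-substitute approach with the width passed in as an argument. The name wrap_reversed and the extra parameter are assumptions added so that short strings can be tried; the regex is the one from wrap() above.

```perl
use strict;
use warnings;

# Same technique as wrap() above, with the width as a (hypothetical)
# second argument.
sub wrap_reversed {
    my ($text, $wp) = @_;
    my $line = reverse $text;                 # scalar context: reverses the string
    $line =~ s/(.{0,$wp})([\s<])/$1$2\n/g;    # "\n" goes after whitespace or "<"
    return scalar reverse $line;              # so it ends up before it once un-reversed
}

# Breaks land before each space, so continuation lines start with a space.
print wrap_reversed("aaa bbb ccc", 5), "\n";

# Breaks also land before "<", including the "<" of a closing tag.
print wrap_reversed("<U>add</U> x", 5), "\n";
```

With a width of 5, the first call returns "aaa\n bbb\n ccc" and the second returns "\n<U>add\n</U> x", so a newline can appear directly before a closing tag.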





Randall-B (Author) Commented:
Perl_Diver,
    Thanks. This wrap() function is great. I just need a minor adjustment. Although it can usually break before  <  , it should not break right before </U> or </STRIKE>.  It is currently doing things like:  

<STRIKE>delete
</STRIKE>
    or
<U>add
</U>
When that happens, the browser is treating the  \n  as a space, and is wrongly underlining or striking the "space."  How can we prevent it from breaking right before </U> or </STRIKE>  ?  Thanks.

Perl_Diver Commented:
This could get very ugly very quickly if you keep adding exceptions, but see how this works:

sub wrap {
   no warnings;#<-- just in case
   my $wp = 72;#<-- this should be the max line length
   my $line = reverse(shift);
   $line =~ s/(.{0,$wp})(\s|<\/?u>|<\/?strike>|<)/$1$2\n/ig;  
   $line =~ s/\n+/\n/g;#<-- remove line if not needed
   $line = reverse($line);
   return($line);
}

Remember, this is not an HTML formatter; it is just a crude way of wrapping HTML-encoded text so that the HTML code doesn't break. *This method will never be perfect*.

Randall-B (Author) Commented:
Perl_Diver,
   Thanks. In my test, the <\/?u> and <\/?strike> did not affect the output. It is still breaking right before </U> and </STRIKE>.  I also tried <\/?U> and <\/?STRIKE> in case it was a case-sensitivity problem, but that did not affect the output, either. Maybe just a minor adjustment is needed?
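
One possible explanation, offered as an observation about the code as posted rather than something confirmed in this thread: wrap() applies its regex to the reversed string, where </U> appears as >U/<. A pattern written in normal reading order would then not match the tag, while its mirror image would. A small sketch (variable names are illustrative):

```perl
use strict;
use warnings;

my $reversed = reverse "add</U>";   # the text as wrap() sees it: ">U/<dda"

# Forward-order pattern, as in the posted regex: no match on reversed text.
my $forward_hit  = $reversed =~ /<\/?u>/i ? 1 : 0;

# Mirror-image pattern: matches the reversed closing tag.
my $mirrored_hit = $reversed =~ />u\/?</i ? 1 : 0;

print "forward: $forward_hit, mirrored: $mirrored_hit\n";
```

Even with mirrored patterns, the bare "<" alternative would still allow a break before any tag, so this alone may not remove the unwanted breaks.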

Perl_Diver Commented:
My solution will never be perfect, but it will be very close. It would require much more than a minor adjustment to ensure there was never a break at a start or end <u> or <strike> tag. Maybe someone else will be up to the challenge.

ozo Commented:
Couldn't you just break only where there is already a \s?
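
A rough sketch of that idea, assuming it means turning an existing whitespace character into the newline once a minimum line length has been reached. The name wrap_at_whitespace and the $min parameter are hypothetical. Since no character is added next to a tag, the rendered text should not gain any new spaces.

```perl
use strict;
use warnings;

# Hypothetical helper: once a line holds at least $min characters, replace
# the next whitespace character with "\n". Tags are never split from
# adjacent text, because only existing whitespace is touched.
sub wrap_at_whitespace {
    my ($text, $min) = @_;
    $text =~ s/(.{$min,}?)\s/$1\n/g;
    return $text;
}

print wrap_at_whitespace("<U>add</U> more text", 5), "\n";
```

The trade-off is that lines can run well past the target width when the source has long stretches without whitespace.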

Perl_Diver Commented:
this might help:

replace this line:

 $line =~ s/(.{0,$wp})(\s|<\/?u>|<\/?strike>|<)/$1$2\n/ig;

with:

$line =~ s/(.{0,$wp})(\s|<u>\s*\S+\s|\s\S+\s*<\/u>|<?strike>\s*\S+\s|\s\S+\s*<\/?strike>|<)/$1$2\n/ig;

But as I said before, this will get real ugly real fast if you add any more exceptions. The above tries to get the first word after an opening tag or the last word before the closing tag and break on an internal space. But if you have something like this:

<u>nobreakanywhere</u>

it may not wrap correctly as the script will wrap it like this if there is no other wrap point found within the $wp limitation:

<u>
nobreakanywhere</u>

Perl_Diver Commented:
>>  Couldn't you just break only where there is already a \s?

See my post above. My original code already breaks on the first available space in the string:

$line =~ s/(.{0,$wp})(\s|<\/?u>|<\/?strike>|<)/$1$2\n/ig;

but if there is no wrap point in the string it will break on > as a last resort. My new regexp helps:

$line =~ s/(.{0,$wp})(\s|<u>\s*\S+\s|\s\S+\s*<\/u>|<?strike>\s*\S+\s|\s\S+\s*<\/?strike>|<)/$1$2\n/ig;

but it's ugly and getting unwieldy, and it produces funky source code depending on the $wp value.

Any comments or ideas appreciated.

 Test script (with contrived test data and a small value for $wp):


use CGI;   # needed for CGI->new below
my $q = CGI->new;
print $q->header(),$q->start_html();
my $text = do{local $/; <DATA>};
$text= wrap($text);
print "As seen in browser\n\n";
print $text;
print "\n\n------------------ source code ---------------- \n\n";
print qq~<plaintext>
$text~;

sub wrap {
   no warnings;#<-- just in case
   my $wp = 14;#<-- this should be the max line length
   my $line = reverse(shift);
   $line =~ s/(.{0,$wp})(\s|<u>\s*\S+\s|\s\S+\s*<\/u>|<?strike>\s*\S+\s|\s\S+\s*<\/?strike>|<)/$1$2\n/ig;
   $line =~ s/\n+/\n/g;#<-- remove line if not needed
   $line = reverse($line);
   return($line);
}
__DATA__
<head>
 <title>this is a test</title>
</head>
<body>
<h1><u>underlined text</u></h1>
<h1><strike>striked text</strike></h1>

ozo Commented:
You could break before the > instead of after

Randall-B (Author) Commented:
Perl_Diver,
    That seems to do just about what I need. I'll test it more this evening. Thanks.

Randall-B (Author) Commented:
Perl_Diver,
    I haven't been able to find any substantial bugs. Especially at a line width of about 90 (as I plan to use), it seems to work very well. Thanks!

Randall-B (Author) Commented:
Whoops, I spoke too soon. Now I know why I couldn't find any bugs:  I had an alternate RegEx line going, which was manually changing \n</STRIKE> to </STRIKE>\n  and  \n</U> to </U>\n .  When I commented out that extra RegEx, your code still produced line breaks before the </STRIKE> and </U> tags.
   However, after all this hard work, you deserve the points. Since this line

$line =~ s/(.{0,$wp})(\s|<u>\s*\S+\s|\s\S+\s*<\/u>|<?strike>\s*\S+\s|\s\S+\s*<\/?strike>|<)/$1$2\n/ig;

apparently isn't working, I've gone with the alternate idea of simply doing this later in the script:

$output=~s/\n<\/(STRIKE|U)>/<\/$1>\n/g;   Thanks.
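
For illustration, that post-processing step can be wrapped in a small function and tried on a sample string. The function name fix_closing_breaks is hypothetical, and the /i flag is an addition here (an assumption that lowercase tags should be handled too); the substitution itself is the one shown above.

```perl
use strict;
use warnings;

# Hypothetical helper: move a newline that sits directly before </STRIKE>
# or </U> to just after the closing tag.
sub fix_closing_breaks {
    my ($html) = @_;
    $html =~ s/\n<\/(STRIKE|U)>/<\/$1>\n/gi;   # /i added for lowercase tags
    return $html;
}

print fix_closing_breaks("<U>add\n</U> more"), "\n";
```

This leaves the wrapped output one line longer than the target width by the length of the closing tag, which seems an acceptable cost for keeping the newline out of the underlined or struck text.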