Solved

Separate One Very Long String into Many Short Lines of HTML Source by Inserting \n After Spaces or ">"

Posted on 2006-11-22
Last Modified: 2008-02-01
   My Perl script outputs HTML source for a whole page into one very l-o-n-g string named  $output.  Before saving it to a database text field, I want to separate this long string into many equal-length lines of HTML source code by inserting a newline "\n" after every 100 characters.   (I don't want to insert <br> or <br /> to make line breaks visible in the Web browser; I only want to affect the appearance of the source code itself.)
    However, if the 100th character is within a word, it should wait until the following *space* to insert the  \n .  
    If the 100th character is within a one-word tag (e.g. <strong> or <center> or <br> or </strong> or </center> or <ol>), it should wait until the ">" on the right side of the tag, before it inserts the  \n .  
    But if the tag has spaces, I think it's OK to insert a \n after the space, like <font \nface=Arial>.

    I want to have a function, like:   print wrap($output);  
and, for example, that would turn the string of 5,000 characters into 50 lines of about 100 characters each.

  In a similar question, an Expert offered the following code:
---------------------------------
use HTML::Parser;
$p = HTML::Parser->new( api_version => 3,
                         text_h => [\&text, "text"],
                         default_h => [sub { local $_=shift; $count -= length; print }, 'text'],
                        );
$limit=20;      
sub text{
    local $_=shift;
    $count = 0 if $count < 0;
    s/(\S{$count,})\s/$1\n/;
    s/(\n.{$limit,})\s/$1\n/g;
    $count = $limit - (rindex$_,"\n") + length;
    print;
}
$p->parse_file(*STDIN);
-----------------------------------
      However, it appears that code is designed to work with input from a file, rather than acting upon a string variable.
    Also, even if it could work as a function acting upon a string, I'm not sure how to arrange that, or whether it would break the lines according to my specifications above (after every 100 characters but not within a word or a one-word tag, etc.).  
    Can that code be arranged for use as a function like print wrap($output);  ?  If so, will it fit my specifications?  If not, what would? Expert guidance would be appreciated.
Question by:Randall-B
18 Comments
 
LVL 17

Expert Comment

by:mjcoyne
ID: 17996285
Instead of:

$p->parse_file(*STDIN);

you probably want:

$p->parse( $output );
$p->eof;   # flush any text the parser is still buffering
 

Author Comment

by:Randall-B
ID: 17996866
mjcoyne,
   When I tried that, and then did  print $p;  , the only output was:
                HTML::Parser=HASH(0x82bbd98)
Why?
 
LVL 84

Expert Comment

by:ozo
ID: 17997081
You don't print $p itself.
You call $p->parse( $output ), which calls sub text (or whatever handler HTML::Parser->new was told to call),
and that handler does the print, or whatever else you want to do with the parsed text.
 

Author Comment

by:Randall-B
ID: 17997276
That makes sense, but when I run  $p->parse( $output)  without a print statement, it doesn't show any results at all.  I see the line in that code which says "print;", but it does not seem to be printing anything.
 

Author Comment

by:Randall-B
ID: 17997529
Maybe the text-wrapping script at  http://www.infocopter.com/perl/recipe-string-wrap.htm  could be modified to avoid adding newlines inside a word or one-word tag?
 

Author Comment

by:Randall-B
ID: 17997912
After testing the function from  http://www.infocopter.com/perl/recipe-string-wrap.htm , I believe it would work perfectly, if it could be changed to add the "\n" only after a space or ">".
    How would I revise that script to do this:
1. At the designated breakpoint, test whether the preceding character is a space or ">" .
2. If it is not a space or ">", move forward to the closest space or ">" and designate the breakpoint as coming after that next space or > .
3. Insert the "\n" at this new breakpoint.
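The three steps above can be sketched with a single lazy match, assuming plain spaces (not tabs) are the only whitespace that matters. The function name wrap_after_space_or_gt and its width argument are hypothetical and do not come from the infocopter.com recipe:

```perl
use strict;
use warnings;

# Hypothetical helper: the name and the default width of 100 are assumptions.
sub wrap_after_space_or_gt {
    my ($text, $width) = @_;
    $width = 100 unless defined $width;
    my $min = $width - 1;
    # Consume at least $min characters, then lazily extend to the first
    # space or ">" at or beyond position $width and put the newline after it.
    $text =~ s/(.{$min,}?[ >])/$1\n/g;
    return $text;
}

my $output = '<p>one two three</p><strong>four</strong> five six';
print wrap_after_space_or_gt($output, 10), "\n";
```

Lines can run longer than the width when the next space or ">" is far away, and a newline placed right after an opening tag such as <u> may still be rendered as a space by the browser.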
 
LVL 8

Expert Comment

by:Perl_Diver
ID: 18000255
this seems to be working pretty well:

sub PrintLine {
   if($Mode eq 'New') {
      $Word=~s|<font.*?>||ig;
      $Word=~s|<td>|<TD>|ig;
      $output .= "$Style{StartNew}$Word$Style{EndNew} ";
   }
   elsif($Mode eq 'Old') {
      $Word=~s|<[^<]+>||g;
      $output .= "$Style{StartOld}$Word$Style{EndOld} ";
   }
   else {
      $output .= $Word;
      $output .= ' ' if ($Word =~ />$/);
   }
   $output = wrap($output);
   print $output;
}
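sub wrap {
   my $wp = 60;#<-- this should be the max line length
   my $line = reverse(shift);#<-- scalar assignment, so the string itself is reversed
   # Working on the reversed text, put a newline after whitespace or "<";
   # once the text is reversed back, each break sits just before that character.
   $line =~ s/(.{0,$wp})([\s<])/$1$2\n/g;
   $line = reverse($line);
   return($line);
}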

 I can't test with the PrintLine function but when I feed the wrap function some html encoded strings it produces fairly good output.




 

Author Comment

by:Randall-B
ID: 18000528
Perl_Diver,
    Thanks. This wrap() function is great. I just need a minor adjustment. Although it can usually break before  <  , it should not break right before </U> or </STRIKE>.  It is currently doing things like:  

<STRIKE>delete
</STRIKE>
    or
<U>add
</U>
When that happens, the browser is treating the  \n  as a space, and is wrongly underlining or striking the "space."  How can we prevent it from breaking right before </U> or </STRIKE>  ?  Thanks.
 
LVL 8

Assisted Solution

by:Perl_Diver
Perl_Diver earned 500 total points
ID: 18000853
This could get very ugly very quickly if you keep adding exceptions, but see how this works:

sub wrap {
   no warnings;#<-- just in case
   my $wp = 72;#<-- this should be the max line length
   my $line = reverse(shift);
   $line =~ s/(.{0,$wp})(\s|<\/?u>|<\/?strike>|<)/$1$2\n/ig;  
   $line =~ s/\n+/\n/g;#<-- remove line if not needed
   $line = reverse($line);
   return($line);
}

Remember, this is not an HTML formatter; it is just a crude way of wrapping HTML-encoded text so that the HTML code doesn't break. *This method will never be perfect*.

 

Author Comment

by:Randall-B
ID: 18002975
Perl_Diver,
   Thanks. In my test, the <\/?u> and <\/?strike> did not affect the output. It is still breaking right before </U> and </STRIKE>.  I also tried <\/?U> and <\/?STRIKE> in case it was a case-sensitivity problem, but that did not affect the output, either. Maybe just a minor adjustment is needed?
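One possible explanation, offered as a sketch rather than a confirmed diagnosis: wrap() reverses the string before the substitution runs, so a closing tag such as </U> appears there as >U/< and an alternative written in forward order has nothing to match. The sample string below is made up for illustration:

```perl
use strict;
use warnings;

my $html     = 'add</U> more';      # made-up sample text
my $reversed = reverse $html;       # scalar context: 'erom >U/<dda'

# The tag pattern as written in wrap() does not occur in the reversed text...
my $forward_hit  = $reversed =~ m{</?u>}i ? 1 : 0;
# ...but its mirror image does.
my $mirrored_hit = $reversed =~ m{>u/?<}i ? 1 : 0;

print "reversed text:            $reversed\n";
print "forward pattern matches:  $forward_hit\n";
print "mirrored pattern matches: $mirrored_hit\n";
```

With the forward alternatives never matching, only the \s and bare < branches ever fire, which would account for the output not changing.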
 
LVL 8

Expert Comment

by:Perl_Diver
ID: 18004143
My solution will never be perfect, but it will be very close. It would require much more than a minor adjustment to ensure there was never a break at a start or end <u> or <strike> tag. Maybe someone else will be up to the challenge.
 
LVL 84

Expert Comment

by:ozo
ID: 18004161
Couldn't you just break only where there is already a \s?
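A minimal sketch of that idea, assuming it is acceptable for a line to overflow when it contains no whitespace at all. The function name wrap_at_whitespace is hypothetical. Because an existing whitespace character is replaced by the newline rather than a newline being added, the browser should render the text the same way outside of <pre> blocks:

```perl
use strict;
use warnings;

# Hypothetical helper: break only where the text already has whitespace.
sub wrap_at_whitespace {
    my ($text, $width) = @_;
    # Take up to $width characters that are followed by whitespace (or by the
    # end of the string) and replace that whitespace with a newline. A run
    # with no whitespace within $width is left as one long line. A newline is
    # also appended at the very end of the text.
    $text =~ s/(.{1,$width})(?:\s|\z)/$1\n/g;
    return $text;
}

print wrap_at_whitespace('<u>add</u> <strike>delete</strike> text', 12);
```

A tag that contains spaces, such as <font face=Arial>, can still be split at one of them, which the original question allows. An attribute value that itself contains spaces would be split as well, so this is not safe for every document.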
 
LVL 8

Accepted Solution

by:Perl_Diver
Perl_Diver earned 500 total points
ID: 18004175
this might help:

replace this line:

 $line =~ s/(.{0,$wp})(\s|<\/?u>|<\/?strike>|<)/$1$2\n/ig;

with:

$line =~ s/(.{0,$wp})(\s|<u>\s*\S+\s|\s\S+\s*<\/u>|<?strike>\s*\S+\s|\s\S+\s*<\/?strike>|<)/$1$2\n/ig;

but as I said before, this will get real ugly real fast if you add any more exceptions. The above tries to get the first word after an opening tag or the last word before the closing tag and break on an internal space. But if you have something like this:

<u>nobreakanywhere</u>

it may not wrap correctly: if no other wrap point is found within the $wp limit, the script will wrap it like this:

<u>
nobreakanywhere</u>
 
LVL 8

Expert Comment

by:Perl_Diver
ID: 18004196
>>  Couldn't you just break only where there is already a \s?

See my post above. My original code already breaks on the first available space in the string:

$line =~ s/(.{0,$wp})(\s|<\/?u>|<\/?strike>|<)/$1$2\n/ig;

but if there is no wrap point in the string it will break before a < as a last resort. My new regexp helps:

$line =~ s/(.{0,$wp})(\s|<u>\s*\S+\s|\s\S+\s*<\/u>|<?strike>\s*\S+\s|\s\S+\s*<\/?strike>|<)/$1$2\n/ig;

but it's ugly, it's getting unwieldy, and it produces funky source code depending on the $wp value.

Any comments or ideas appreciated.

 test script (with contrived test data and a small value for $wp):


use strict;
use warnings;
use CGI;

my $q = CGI->new;
print $q->header(),$q->start_html();
my $text = do{local $/; <DATA>};
$text= wrap($text);
print "As seen in browser\n\n";
print $text;
print "\n\n------------------ source code ---------------- \n\n";
print qq~<plaintext>
$text~;

sub wrap {
   no warnings;#<-- just in case
   my $wp = 14;#<-- this should be the max line length
   my $line = reverse(shift);
   $line =~ s/(.{0,$wp})(\s|<u>\s*\S+\s|\s\S+\s*<\/u>|<?strike>\s*\S+\s|\s\S+\s*<\/?strike>|<)/$1$2\n/ig;
   $line =~ s/\n+/\n/g;#<-- remove line if not needed
   $line = reverse($line);
   return($line);
}
__DATA__
<head>
 <title>this is a test</title>
</head>
<body>
<h1><u>underlined text</u></h1>
<h1><strike>striked text</strike></h1>
 
LVL 84

Expert Comment

by:ozo
ID: 18004241
You could break before the > instead of after
 

Author Comment

by:Randall-B
ID: 18004611
Perl_Diver,
    That seems to do just about what I need. I'll test it more this evening. Thanks.
 

Author Comment

by:Randall-B
ID: 18005254
Perl_Diver,
    I haven't been able to find any substantial bugs. Especially at a line width of about 90 (as I plan to use), it seems to work very well. Thanks!
 

Author Comment

by:Randall-B
ID: 18005287
Whoops, I spoke too soon. Now I know why I couldn't find any bugs: I had an alternate RegEx line going, which was manually changing \n</STRIKE> to </STRIKE>\n and \n</U> to </U>\n . When I commented out that extra RegEx, your code still produced line breaks before the </STRIKE> and </U> tags.
   However, after all this hard work, you deserve the points. Since this line

$line =~ s/(.{0,$wp})(\s|<u>\s*\S+\s|\s\S+\s*<\/u>|<?strike>\s*\S+\s|\s\S+\s*<\/?strike>|<)/$1$2\n/ig;

apparently isn't working, I've gone with the alternate idea of simply doing this later in the script:

$output=~s/\n<\/(STRIKE|U)>/<\/$1>\n/g;   Thanks.
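That two-pass approach can be sketched as follows. wrap_reversed mirrors the first wrap() posted earlier in the thread, with the width passed in as a parameter, and fix_closing_tags applies the substitution shown just above. Both function names and the sample string are hypothetical:

```perl
use strict;
use warnings;

# Pass 1: wrap on the reversed text, so each break lands just before
# whitespace or "<" in the forward text.
sub wrap_reversed {
    my ($text, $wp) = @_;
    my $line = reverse $text;
    $line =~ s/(.{0,$wp})([\s<])/$1$2\n/g;
    return scalar reverse $line;
}

# Pass 2: move a newline that sits right before </STRIKE> or </U> to just
# after the tag (case-sensitive, as in the thread).
sub fix_closing_tags {
    my ($text) = @_;
    $text =~ s/\n<\/(STRIKE|U)>/<\/$1>\n/g;
    return $text;
}

my $html    = 'Please <STRIKE>delete</STRIKE> this and <U>add</U> that.';
my $wrapped = fix_closing_tags(wrap_reversed($html, 14));
print "$wrapped\n";
```

The second pass can push a line past the chosen width, since the closing tag is pulled back onto the previous line.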