
Separate One Very Long String into Many Short Lines of HTML Source by Inserting \n After Spaces or ">"

My Perl script outputs HTML source for a whole page into one very l-o-n-g string named $output. Before saving it to a database text field, I want to separate this long string into many equal-length lines of HTML source code by inserting a newline "\n" after every 100 characters. (I don't want to insert or to make line breaks visible in the Web browser; I only want to affect the appearance of the source code itself.)
However, if the 100th character is within a word, it should wait until the following *space* to insert the \n .
If the 100th character is within a one-word tag (e.g. <center>, </center>, or <ol>), it should wait until the ">" on the right side of the tag before it inserts the \n .
But if the tag contains spaces, I think it's OK to insert a \n after one of those spaces.

I want to have a function, like: print wrap($output);
and, for example, that would turn the string of 5,000 characters into 50 lines of about 100 characters each.

In a similar question, an Expert offered the following code:
---------------------------------
use HTML::Parser;
$p = HTML::Parser->new( api_version => 3,
text_h => [\&text, "text"],
default_h => [sub { local $_=shift; $count -= length; print }, 'text'],
);
$limit=20;
sub text{
local $_=shift;
$count = 0 if $count < 0;
s/(\S{$count,})\s/$1\n/;
s/(\n.{$limit,})\s/$1\n/g;
$count = $limit - (rindex$_,"\n") + length;
print;
}
$p->parse_file(*STDIN);
-----------------------------------
However, it appears that code is designed to work with input from a file, rather than acting upon a string variable.
Also, even if it could work as a function acting upon a string, I'm not sure how to arrange that, or whether it would break the lines according to my specifications above (after every 100 characters but not within a word or a one-word tag, etc.).
Can that code be arranged for use as a function like print wrap($output); ? If so, will it fit my specifications? If not, what would? Expert guidance would be appreciated.

mjcoyne

Instead of:

$p->parse_file(*STDIN);

you probably want:

$p->parse( $output );
$p->eof;   # signal end of document so any buffered text is flushed

Randall-B

ASKER

mjcoyne,
When I tried that, and then did print $p; , the only output was:
HTML::Parser=HASH(0x82bbd98)
Why?

ozo

You don't print $p; it is just the parser object.
You call $p->parse( $output ), which calls sub text (or whatever handlers HTML::Parser->new was told to call),
and those handlers do the print, or whatever else you want to do with the parsed pieces.

Randall-B

ASKER

That makes sense, but when I run $p->parse( $output) without a print statement, it doesn't show any results at all. I see the line in that code which says "print;", but it does not seem to be printing anything.

Randall-B

ASKER

Maybe the text-wrapping script at http://www.infocopter.com/perl/recipe-string-wrap.htm could be modified to avoid adding newlines inside a word or one-word tag?

Randall-B

ASKER

After testing the function from http://www.infocopter.com/perl/recipe-string-wrap.htm , I believe it would work perfectly, if it could be changed to add the "\n" only after a space or ">".
How would I revise that script to do this:
1. At the designated breakpoint, test whether the preceding character is a space or ">" .
2. If it is not a space or ">", move forward to the closest space or ">" and designate the breakpoint as coming after that next space or > .
3. Insert the "\n" at this new breakpoint.
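
The three steps above can be sketched as a single substitution. This is only a minimal sketch, not a revision of the infocopter script: the name wrap_after and its $width parameter are hypothetical, and it assumes a space or ">" anywhere in the string (including inside a tag that contains spaces) is an acceptable place to break.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the infocopter recipe): insert "\n" after the
# first space or ">" found at or beyond character number $width of each line.
sub wrap_after {
    my ($text, $width) = @_;
    $width = 100 unless defined $width;
    $width = 1 if $width < 1;
    my $min = $width - 1;    # characters that must precede the break character

    # .{$min,}? lazily takes at least $min characters, so the [ >] that follows
    # is the closest space or ">" at or after the designated breakpoint.
    $text =~ s/(.{$min,}?[ >])/$1\n/g;
    return $text;
}

# Small width so the effect is visible on a short string.
print wrap_after("aaa bbb <b>ccc</b> ddd", 5), "\n";
```

Lines produced this way are at least $width characters long rather than exactly $width, and a final piece shorter than $width is left as it is. Nothing is removed from the string; only "\n" characters are added.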

Perl_Diver

this seems to be working pretty well:

sub PrintLine {
if($Mode eq 'New') {
$Word=~s|<font.*?>||ig;
$Word=~s|<td>|<TD>|ig;
$output .= "$Style{StartNew}$Word$Style{EndNew} ";
}
elsif($Mode eq 'Old') {
$Word=~s|<[^<]+>||g;
$output .= "$Style{StartOld}$Word$Style{EndOld} ";
}
else {
$output .= $Word;
$output .= ' ' if ($Word =~ />$/);
}
$output = wrap($output);
print $output;
}
sub wrap {
my $wp = 60;#<-- this should be the max line length
my $line = reverse(shift);
$line =~ s/(.{0,$wp})([\s<])/$1$2\n/g;
$line = reverse($line);
return($line);
}

I can't test with the PrintLine function but when I feed the wrap function some html encoded strings it produces fairly good output.

Randall-B

ASKER

Perl_Diver,
Thanks. This wrap() function is great. I just need a minor adjustment. Although it can usually break before < , it should not break right before </U> or </STRIKE>. It is currently doing things like:

<STRIKE>delete
</STRIKE>
or
<U>add
</U>

When that happens, the browser treats the \n as a space and wrongly underlines or strikes that "space." How can we prevent it from breaking right before </U> or </STRIKE> ? Thanks.

SOLUTION

Perl_Diver


Randall-B

ASKER

Perl_Diver,
Thanks. In my test, the <\/?u> and <\/?strike> did not affect the output. It is still breaking right before </U> and </STRIKE>. I also tried <\/?U> and <\/?STRIKE> in case it was a case-sensitivity problem, but that did not affect the output, either. Maybe just a minor adjustment is needed?

Perl_Diver

My solution will never be perfect, but it will be very close. It would require much more than a minor adjustment to ensure there was never a break at a start or end <strike> tag. Maybe someone else will be up to the challenge.

ozo

Couldn't you just break only where there is already a \s?
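
One way to read that suggestion: instead of adding a newline, turn a run of whitespace that is already in the string into the newline. The following is a minimal sketch under that reading, not code posted by anyone in this thread; wrap_at_space and $width are hypothetical names, and a single token longer than $width is simply left on an overlong line.

```perl
use strict;
use warnings;

# Hypothetical helper: break only where whitespace already exists.
sub wrap_at_space {
    my ($text, $width) = @_;
    $width = 90 unless defined $width;
    $width = 1 if $width < 1;

    # Take up to $width characters that are followed by whitespace (or by the
    # end of the string) and replace that whitespace run with one newline.
    # The final chunk, which ends at \z, is returned unchanged.
    $text =~ s/(.{1,$width})(\s+|\z)/ length($2) ? "$1\n" : $1 /ge;
    return $text;
}

# Small width so the effect is visible on a short string.
print wrap_at_space("<STRIKE>delete</STRIKE> and <U>add</U> more", 10), "\n";
```

Because each newline replaces whitespace the browser would already collapse, the rendered page should look the same outside whitespace-sensitive elements such as <pre>, and a break can land before </STRIKE> only if the source already had whitespace there. Whitespace inside a tag's attribute value would also be eligible, which may or may not be acceptable.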

ASKER CERTIFIED SOLUTION

Perl_Diver


Perl_Diver

>> Couldn't you just break only where there is already a \s?

See my post above. My original code already breaks on the first available space in the string:

$line =~ s/(.{0,$wp})(\s|<\/?u>|<\/?strike>|<)/$1$2\n/ig;

but if there is no wrap point in the string it will break on > as a last resort. My new regexp helps:

$line =~ s/(.{0,$wp})(\s|\s*\S+\s|\s\S+\s*<\/u>|<?strike>\s*\S+\s|\s\S+\s*<\/?strike>|<)/$1$2\n/ig;

but it's ugly and getting unwieldy, and it produces funky source code depending on the $wp value.

Any comments or ideas appreciated.

Test script (with contrived test data and a small value for $wp):

use CGI;
my $q = CGI->new;
print $q->header(),$q->start_html();
my $text = do{local $/; <DATA>};
$text= wrap($text);
print "As seen in browser\n\n";
print $text;
print "\n\n------------------ source code ---------------- \n\n";
print qq~<plaintext>
$text~;

sub wrap {
no warnings;#<-- just in case
my $wp = 14;#<-- this should be the max line length
my $line = reverse(shift);
$line =~ s/(.{0,$wp})(\s|\s*\S+\s|\s\S+\s*<\/u>|<?strike>\s*\S+\s|\s\S+\s*<\/?strike>|<)/$1$2\n/ig;
$line =~ s/\n+/\n/g;#<-- remove line if not needed
$line = reverse($line);
return($line);
}
__DATA__
<head>
<title>this is a test</title>
</head>
<body>
<h1><u>underlined text</u></h1>
<h1><strike>striked text</strike></h1>

ozo

You could break before the > instead of after

Randall-B

ASKER

Perl_Diver,
That seems to do just about what I need. I'll test it more this evening. Thanks.

Randall-B

ASKER

Perl_Diver,
I haven't been able to find any substantial bugs. Especially at a line width of about 90 (as I plan to use), it seems to work very well. Thanks!

Randall-B

ASKER

Whoops, I spoke too soon. Now I know why I couldn't find any bugs: I had an alternate RegEx line going, which was manually changing \n</STRIKE> to </STRIKE>\n and \n</U> to </U>\n . When I commented out that extra RegEx, your code still produced line breaks before the </STRIKE> and </U> tags.
However, after all this hard work, you deserve the points. Since this line

$line =~ s/(.{0,$wp})(\s|\s*\S+\s|\s\S+\s*<\/u>|<?strike>\s*\S+\s|\s\S+\s*<\/?strike>|<)/$1$2\n/ig;

apparently isn't working, I've gone with the alternate idea of simply doing this later in the script:

$output=~s/\n<\/(STRIKE|U)>/<\/$1>\n/g;

Thanks.
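
For anyone reusing that post-processing step, here is a worked instance wrapped in a sub so it can be applied right after wrap(). The sub name fix_closing_breaks is hypothetical, and the /i flag is an addition (an assumption that lowercase </strike> and </u> should be handled the same way).

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution shown above.
sub fix_closing_breaks {
    my ($html) = @_;
    # Move a newline that landed immediately before </STRIKE> or </U>
    # to just after that closing tag.
    $html =~ s{\n</(STRIKE|U)>}{</$1>\n}gi;
    return $html;
}

print fix_closing_breaks("<STRIKE>delete\n</STRIKE> and <U>add\n</U> text"), "\n";
```

The line that held the closing tag's text grows by the length of the tag, so lines can end up slightly longer than the wrap width.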
