Randall-B

Split Long String of HTML Source Between the Two Words or Tags Closest to the Middle

I have string variables containing several hundred characters of HTML source code, and I want to divide them into two lines when they print to a Web page, by inserting a "\n" about halfway through the string -- but only between two words or between two HTML Tags, and only if the string exceeds a certain length.  
   For example, it should test to see if the string is over 300 characters. If so, it should insert a "\n" close to the middle of the string without breaking up a word, and without breaking up the contents of any HTML tag.
   How would this be done?
jingks03

Here's a quick attempt; I'm guessing there will be quite a few replies to this.
I think this does what you want, though it's a little bulky.

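#------------------------
use strict;
use warnings;

my $file = shift @ARGV;
open my $in, '<', $file or die "Cannot open $file: $!";
my $data = "";
while (<$in>) { $data .= $_; }         # store source in string
close $in;
my $slen = length($data);              # counts every character, including spaces and "\n"
my ($half1,$half2) = ('','');
if ($slen > 300) {
    # The capturing parentheses make split return the separators too, so no text is lost.
    my @words = split(/(>[^<]*\s|>)/,$data);
    my $i = 0;
    while (length($half1) < $slen/2) {
       $half1 .= shift @words;
       $i++;
    }
    # Odd $i: $half1 currently stops just before a ">", so append the separator that follows.
    if ($i%2 && @words) { $half1 .= shift @words; }
    $half2 = join('',@words);
    open my $out1, '>', 'first_half.html' or die "Cannot write first_half.html: $!";
    print $out1 $half1;
    close $out1;
    open my $out2, '>', 'second_half.html' or die "Cannot write second_half.html: $!";
    print $out2 $half2;
    close $out2;
}
#--------------------------------------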
ozo
"\n" will not divide lines on HTML unless it is in a <pre> tag
did you mean "<br />"?
Randall-B

ASKER

ozo,
   I only want to break the line as it appears in the source code file, not in the browser window.  For reasons that I won't go into here, I'm trying to make all of my HTML source code lines reasonably short, like screen-width (without using wrap-text).  So I'm just looking for a way to break long source code lines into shorter source code lines, without affecting the output as seen on the Web page. That's why I want "\n" instead of <br />.  Thanks.
Does that mean you also want to avoid splitting inside of <pre>...</pre>?
SOLUTION
Perl_Diver

This solution is only available to members.
jingks03,
   Maybe I should have explained that I'm not looking to perform this operation on a whole file, but rather on a few individual strings that are output from another function in my script.  For example, let's say the script does some operations and assigns a bunch of HTML code as the value of string variable "$String". I just need a function or regex or whatever to process that $String variable before "printing" it to STDOUT (which becomes part of the HTML source code for a Web page that the user will view in the browser). I'm trying to make the lines of source code shorter because it will also be used in an application that will store and compare different versions of the source code.
ozo,
   No, I don't think I would need to avoid splitting between any two tags, not even <pre> . . . </pre>. I just don't want to split inside the tags themselves.  For example, I don't want:  <pr
                                    e>

but this is fine:   <pre> . . .
                        </pre>
Perl Diver,
    Just by eyeballing your code, it looks like it will do what I need. I'm going to actually test it in my script now. I'll let you know the results. Thanks.
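For readers without access to the member-only solution, the requirement as clarified above (one "\n" near the middle of a string variable, never inside a word or a tag, and only past a length threshold) can be sketched roughly as follows. This is not the code from the thread: the name split_near_middle and its $limit parameter are hypothetical, and the tokenizer simply assumes that every "<...>" run is one tag.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): insert one "\n" at the
# token boundary closest to the middle of $html, but only when the
# string is longer than $limit characters.
sub split_near_middle {
    my ($html, $limit) = @_;
    return $html if length($html) <= $limit;

    # Tokens are whole tags (<...>), runs of whitespace, runs of other
    # text, or a stray "<". Boundaries between tokens therefore never
    # fall inside a word or inside a tag.
    my @tokens = $html =~ /(<[^>]*>|\s+|[^<\s]+|<)/g;

    my $middle = length($html) / 2;
    my ($pos, $best_index, $best_dist) = (0, undef, undef);
    for my $i (0 .. $#tokens - 1) {
        $pos += length $tokens[$i];          # offset of the boundary after token $i
        my $dist = abs($pos - $middle);
        if (!defined $best_dist or $dist < $best_dist) {
            ($best_index, $best_dist) = ($i, $dist);
        }
    }
    return $html unless defined $best_index; # a single token: nowhere to break

    return join '', @tokens[0 .. $best_index], "\n",
                    @tokens[$best_index + 1 .. $#tokens];
}

my $String = '<b>aaaa</b> <i>bbbb</i>';
print split_near_middle($String, 10), "\n";
```

Called as split_near_middle($String, 300), it would leave shorter strings untouched. One caveat: a newline placed directly between a word and an adjacent tag can render as a space in the browser, so a stricter version might only break where whitespace already exists.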
I didn't try to account for breaking strings only between HTML tags, because I believe HTML tags can be broken internally on spaces with no problem.
It is valid HTML to put a "\n" inside of
<pre
>
or
</pre
>
Do you still want to disallow this?
How about inside of
<!--
comments
 -->
or
<script>if (a<b && a>c)</script>
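The script example is worth spelling out. A simple "<[^>]*>" pattern (a common shortcut, assumed here for illustration and not taken from any answer in this thread) treats the comparison operators inside the script as tag delimiters. The helper name naive_tags is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: list everything a naive "<...>" pattern
# would consider to be a tag.
sub naive_tags {
    my ($html) = @_;
    return $html =~ /(<[^>]*>)/g;
}

# The middle "tag" reported here is really part of the JavaScript
# expression, not markup.
my @tags = naive_tags('<script>if (a<b && a>c)</script>');
print "$_\n" for @tags;
```

Any splitter built on that pattern would therefore regard "b && a" as the inside of a tag and refuse to break there, while treating the text around it as ordinary content.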

ASKER CERTIFIED SOLUTION
This solution is only available to members.
oops... and forget about the "my $slen = length($String);" line...

need more coffee
Uh-oh, I just discovered something that I had forgotten . . .  And because of this, none of the proposed solutions will work.
  Although Perl Diver's code may be just about right if the lines of HTML code were actually composed of single long strings, I now remember that the "lines" of HTML code are actually composed of many separate "print" statements, which print "horizontally" across the page of HTML source until a certain number of print statements have run. (After that number of print statements, it inserts a "\n".)
    The problem is, some lines are short when composed of 20 print statements (each of which is generally made up of only 1 word).  But other lines turn out very very long when composed of 20 print statements, because some of those statements contain a bunch of html formatting tags that take up a lot of space on the source-code page.
    So the situation is a lot more complicated than I originally thought. Probably what I need to do is get rid of my current loop that inserts a "\n" after every 20 print statements, and use some kind of character counter instead.  For example, a counter would keep track of the number of characters printed in a bunch of successive print statements, and the code would insert a "\n" after every so many (e.g. 300) characters (but without breaking individual words or html tags).
     To understand what's going on in the current Perl code, I am showing the print statement below (which needs to be modified as described in the previous paragraph):

$count=0;

sub PrintLine($$){
  $count=($count+1)%20;
  $Mode=shift;
  $Word=shift;
  chomp $Word;
  $Word=~s/^\s+//;
  $Word=~s/\s+$//;
  $Word=~s<</(html|body)>>~~ig;

  if($Word eq ''){
    return;
  }
  else{
    if($Mode eq 'New'){
      $Word=~s|<font.*?>||ig;
      if($Word=~/<td>/i){
        $Word=~s|<td>|<TD>$Style{StartNew}|ig;
        print $FH "$Word$Style{EndNew} " if $count<19;
        print $FH "$Word$Style{EndNew} \n" if $count==19;
      }
      else{
        print $FH "$Style{StartNew}$Word$Style{EndNew} " if $count<19;
        print $FH "$Style{StartNew}$Word$Style{EndNew} \n" if $count>=19;
      }
    }
    elsif($Mode eq 'Old'){
      $Word=~s|<[^<]+>||g;
      print $FH "$Style{StartOld}$Word$Style{EndOld} " if $count<19;
      print $FH "$Style{StartOld}$Word$Style{EndOld} \n" if $count>=19;
    }
    elsif($Mode eq 'Equal'){
      if($Word=~ />$/){
        print $FH "$Word" if $count<19;
        print $FH "$Word\n" if $count==19;
      }
      else{
        print $FH "$Word " if $count<19;
        print $FH "$Word \n" if $count==19;
      }
    }
    else{die"Illegal Mode for PrintLine: $Mode"}
#   print $FH $count;
  }
}

This print function is called hundreds of times in the script, and the counter makes it insert a " \n" (space and \n) after every 20 statements (otherwise, it only inserts a space, if less than 20 statements).  But, to make the lines of HTML code really about even length, it needs to be modified to add the "\n" after a set number of *characters* (without breaking inside of a word or inside of an html tag).
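The character-counting idea described above could look roughly like the sketch below. It is only an illustration under stated assumptions: make_line_printer, $limit and $column are hypothetical names, and it relies on each call passing one complete word or tag run, as PrintLine already does, so that breaking between calls never splits a word or a tag.

```perl
use strict;
use warnings;

# Hypothetical replacement for the 20-statement counter: returns a
# closure that prints chunks to $fh and starts a new source line
# whenever the next chunk would push the current line past $limit.
sub make_line_printer {
    my ($fh, $limit) = @_;
    my $column = 0;    # characters already printed on the current line

    return sub {
        my ($chunk) = @_;
        return if $chunk eq '';

        # Break between chunks only, so a word or tag passed in as one
        # chunk is never split. A chunk longer than $limit still gets
        # a line to itself.
        if ($column > 0 && $column + length($chunk) > $limit) {
            print {$fh} "\n";
            $column = 0;
        }
        print {$fh} $chunk;
        $column += length $chunk;
        return;
    };
}

# Example with a small limit so the wrapping is visible.
my $emit = make_line_printer(\*STDOUT, 20);
$emit->($_) for '<td>', 'alpha ', '<b>beta</b> ', 'gamma ', 'delta';
print "\n";
```

Inside PrintLine, each pair of "print $FH ... if $count<19" and "print $FH ... if $count==19" lines would then presumably collapse into a single call such as $emit->("$Word "), with $limit set to 300.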
ozo,
   Yes, it should avoid breaks within a tag, such as:
<pr
 e>
      or
</p
re>
  (It should maintain it as "<pre>".)  The other examples you gave probably also need to be maintained without breaks or spaces.  However, please see my revised question above. Sorry about the confusion.
jingks03
   That looks like it would have worked for the question as originally stated, but I'm sorry I mis-stated it. See the long revised question above. Thanks.
<pr
e>
is bad
but
<pre
>
is valid
(although I suppose it doesn't hurt to wait one more character before inserting the newline.)
between word1 and word2 in
<script>"word1 word2"</script>
is not inside <> but should probably not have a line break inserted,
<!-- comment --> is inside of <> but could safely have a line break inserted
do you need to take those into account?
ozo,
   I would like to avoid breaking up the <pre> or </pre> tag at all.
But I doubt that it would hurt to insert a "\n" inside of content between the script or comment tags, so it probably does not need to take those into account.  (And please see the revised question above.) Thanks.
Looks like no one is biting the bait for the revised question. Although I haven't tested jingks03's solution, it looks like it would do what I originally asked for, so I'll accept it.  
  The revised question is being moved to a separate listing: https://www.experts-exchange.com/questions/22067302/Add-Newline-Character-After-every-X-Number-of-Characters-Ouput-by-Many-Successive-Print-Statements-Without-Breaking-Up-Words-or-HTML-Tags.html , as it is so different from the original question.  Experts, please go to the new question. Thanks.
Ah, in that case, Randall, I think there has to be another slight change to it

$String =~ s/^(.{$half}.*?[\s\>])([^>]+\<.*\>[^>]*)$/$1\n$2/;

The pattern broken up should work as:
^(.{$half}     - match anything up to the halfway mark
.*?[\s\>])     - match up to the first space or > observed
([^>]+         - split point must not be followed by a >
\<.*\>         - if HTML tags exist after the split, match first < to last >
[^>]*)$        - match to the end of the string

Sorry about not looking into the second, revised question.  It looks a little too involved for a coffee break answer.
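To try the substitution above in isolation, it can be wrapped as in the sketch below. The wrapper name split_with_regex and the $limit parameter are hypothetical, and $half is assumed to be the integer midpoint of the string, since its definition is not visible in this thread.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution above. $half is assumed
# to be the integer midpoint of the string; $limit is the length
# threshold from the original question.
sub split_with_regex {
    my ($String, $limit) = @_;
    return $String if length($String) <= $limit;

    my $half = int(length($String) / 2);
    $String =~ s/^(.{$half}.*?[\s\>])([^>]+\<.*\>[^>]*)$/$1\n$2/;
    return $String;
}

print split_with_regex('<p>alpha beta</p> <p>gamma delta</p>', 30), "\n";
```

Two properties follow from the pattern itself: the break can only land at or after the midpoint, and because "." does not match "\n" without the /s modifier, a string that already contains a newline will not match and is returned unchanged.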