Solved

Split Long String of HTML Source Between the Two Words or Tags Closest to the Middle

Posted on 2006-11-20
Last Modified: 2008-02-01
I have string variables containing several hundred characters of HTML source code, and I want to divide them into two lines when they print to a Web page, by inserting a "\n" about halfway through the string -- but only between two words or between two HTML Tags, and only if the string exceeds a certain length.  
   For example, it should test to see if the string is over 300 characters. If so, it should insert a "\n" close to the middle of the string without breaking up a word, and without breaking up the contents of any HTML tag.
   How would this be done?
Question by:Randall-B
19 Comments
 
LVL 2

Expert Comment

by:jingks03
ID: 17981084
Here is a quick attempt; I'm guessing there will be quite a few replies to this.
I think this does what you want, though it is a little bulky.

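#------------------------
use strict;
use warnings;

my $file = shift @ARGV or die "Usage: $0 file.html\n";
open my $in, '<', $file or die "Cannot open $file: $!";
my $data = do { local $/; <$in> };   # slurp the whole source into one string
close $in;

my $slen = length($data);            # newlines and spaces count as characters
my ($half1, $half2) = ('', '');
if ($slen > 300) {
    # The capturing group keeps the delimiters, so @words alternates
    # text, delimiter, text, delimiter, ...
    my @words = split(/(>[^<]*\s|>)/, $data);
    my $i = 0;
    while (@words && length($half1) < $slen/2) {
        $half1 .= shift @words;
        $i++;
    }
    # An odd count means we stopped on a text piece; take its delimiter too
    # so the split does not land inside a tag.
    if ($i % 2 && @words) { $half1 .= shift @words; }
    $half2 = join('', @words);

    open my $out1, '>', 'first_half.html' or die "Cannot write first_half.html: $!";
    print $out1 $half1;
    close $out1;
    open my $out2, '>', 'second_half.html' or die "Cannot write second_half.html: $!";
    print $out2 $half2;
    close $out2;
}
#--------------------------------------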
0
 
LVL 84

Expert Comment

by:ozo
ID: 17981125
"\n" will not break lines in rendered HTML unless it is inside a <pre> element.
Did you mean "<br />"?
0
 

Author Comment

by:Randall-B
ID: 17981196
ozo,
   I only want to break the line as it appears in the source code file, not in the browser window.  For reasons that I won't go into here, I'm trying to make all of my HTML source code lines reasonably short, like screen-width (without using wrap-text).  So I'm just looking for a way to break long source code lines into shorter source code lines, without affecting the output as seen on the Web page. That's why I want "\n" instead of <br />.  Thanks.
0
 
LVL 84

Expert Comment

by:ozo
ID: 17981252
Does that mean you also want to avoid splitting inside of <pre>...</pre>?
0
 
LVL 8

Assisted Solution

by:Perl_Diver
Perl_Diver earned 225 total points
ID: 17981260
my $oldstring = 'your long string here';
my $newstring = '';
if (length($oldstring) > 300) {
   my $mid = int(length($oldstring) / 2);   # split near the middle
   my $f = substr $oldstring, 0, $mid;
   my $s = substr $oldstring, $mid;
   if ($f !~ /\s$/) {
      # The cut landed inside a word: move the partial word to the second half.
      $f =~ s/(\S+)$//;
      $newstring = "$f\n$1$s";
   }
   else {
      $newstring = "$f\n$s";
   }
   print $newstring;
}
else {
   print $oldstring;
}
0
 

Author Comment

by:Randall-B
ID: 17981263
jingks03,
   Maybe I should have explained that I'm not looking to perform this operation on a whole file, but rather on a few individual strings that are output from another function in my script.  For example, let's say the script does some operations and assigns a bunch of HTML code as the value of string variable "$String". I just need a function or regex or whatever to process that $String variable before "printing" it to STDOUT (which becomes part of the HTML source code for a Web page that the user will view in the browser). But I'm trying to make the lines of source code shorter, because it will also be used in an application that will store and compare different versions of the source code.
0
 

Author Comment

by:Randall-B
ID: 17981287
ozo,
   No, I don't think I would need to avoid splitting between any two tags, not even <pre> . . . </pre>. I just don't want to split inside the tags themselves.  For example, I don't want:  <pr
                                    e>

but this is fine:   <pre> . . .
                        </pre>
0
 

Author Comment

by:Randall-B
ID: 17981303
Perl Diver,
    Just by eyeballing your code, it looks like it will do what I need. I'm going to actually test it in my script now. I'll let you know the results. Thanks.
0
 
LVL 8

Expert Comment

by:Perl_Diver
ID: 17981326
I didn't try to restrict the break to points between HTML tags, because I believe HTML tags can be broken internally on spaces with no problem.
0
 
LVL 84

Expert Comment

by:ozo
ID: 17981386
It is valid HTML to put a "\n" inside of
<pre
>
or
</pre
>
Do you still want to disallow this?
How about inside of
<!--
comments
 -->
or
<script>if (a<b && a>c)</script>
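
ozo's examples suggest treating whole tags and comments as indivisible units, so that a break can only go between them. A minimal sketch of that idea, assuming a simple regex tokenizer is good enough for the strings in question (html_tokens is a hypothetical helper name, not something from this thread). It does not understand <script> content, so "a<b && a>c" would still be misread as a tag:

```perl
use strict;
use warnings;

# Split HTML source into tokens: comments, tags, and runs of text.
# Hypothetical helper, not from the thread.
sub html_tokens {
    my ($html) = @_;
    # Comments are tried first so a ">" inside a comment does not end the
    # token early. The final "<" alternative consumes a stray "<" that never
    # closes, so no input is silently dropped.
    return $html =~ /\G(<!--.*?-->|<[^>]*>|[^<]+|<)/gs;
}

my @tokens = html_tokens('<p>one <!-- a > b --><pre>two</pre>');
print join("|", @tokens), "\n";
# prints: <p>|one |<!-- a > b -->|<pre>|two|</pre>
```

A "\n" placed between any two of these tokens cannot land inside a tag or a comment.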

0
 
LVL 2

Accepted Solution

by:
jingks03 earned 275 total points
ID: 17981454
Ah... yeah, I wrote that a little too fast and missed some of the requirements. So, given the string $String:
# ----------------------------------
my $slen = length($String);
if (length($String) > 300) {
    my $half = int(length($String)/2);   # int() keeps the {n} quantifier valid for odd lengths
    $String =~ s/^(.{$half}.*?[\s\>])([^<]*<?.*)$/$1\n$2/s;   # /s lets "." match any existing newlines
}
print $String;
# ------------------------------------
That should insert a "\n" after the first ">" or whitespace character past the halfway point; the intent is to keep the break outside any "<...>" tag.
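
The substitution above takes the first candidate after the midpoint. For the "closest to the middle" wording in the question title, one possible alternative is to scan the string once and remember the safe position nearest the middle. This is only a sketch under stated assumptions: break_near_middle and $min_len are illustrative names, and a position counts as safe when it is outside any <...> tag and directly follows whitespace or a ">".

```perl
use strict;
use warnings;

# Insert one "\n" at the safe position closest to the middle of $html.
# Hypothetical helper, not from the thread.
sub break_near_middle {
    my ($html, $min_len) = @_;
    my $len = length $html;
    return $html if $len <= $min_len;

    my $mid = $len / 2;
    my ($best, $in_tag, $prev) = (undef, 0, '');
    for my $pos (0 .. $len - 1) {
        my $ch = substr($html, $pos, 1);
        # $pos is a candidate break point *before* the character at $pos.
        if (!$in_tag && $pos > 0 && $prev =~ /[\s>]/) {
            $best = $pos
                if !defined $best || abs($pos - $mid) < abs($best - $mid);
        }
        if    ($ch eq '<') { $in_tag = 1 }
        elsif ($ch eq '>') { $in_tag = 0 }
        $prev = $ch;
    }
    return $html unless defined $best;   # no safe position found
    return substr($html, 0, $best) . "\n" . substr($html, $best);
}

my $s = '<p>' . ('word ' x 10) . '<a href="x.html">link</a> '
      . ('more ' x 10) . '</p>';
print break_near_middle($s, 40), "\n";
```

Whitespace inside a tag (such as the space in <a href=...>) is skipped because $in_tag is still set there. Comments and <script> bodies are not treated specially.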
0
 
LVL 2

Expert Comment

by:jingks03
ID: 17981461
oops... and forget about the "my $slen = length($String);" line...

need more coffee
0
 

Author Comment

by:Randall-B
ID: 17981491
Uh-oh, I just discovered something that I had forgotten . . .  And because of this, none of the proposed solutions will work.
  Although Perl Diver's code may be just about right if the lines of HTML code were actually composed of single long strings, I now remember that the "lines" of HTML code are actually composed of many separate "print" statements, which print "horizontally" across the page of HTML source until a certain number of print statements have run. (After that number of print statements, it inserts a "\n".)
    The problem is, some lines are short when composed of 20 print statements (each of which is generally made up of only 1 word).  But other lines turn out very very long when composed of 20 print statements, because some of those statements contain a bunch of html formatting tags that take up a lot of space on the source-code page.
    So the situation is a lot more complicated than I originally thought. Probably what I need to do is get rid of my current loop that inserts a "\n" after every 20 print statements, and use some kind of character counter instead.  For example, a counter would keep track of the number of characters printed in a bunch of successive print statements, and the code would insert a "\n" after every so many (e.g. 300) characters (but without breaking individual words or html tags).
     To understand what's going on in the current Perl code, I am showing the print statement below (which needs to be modified as described in the previous paragraph):

$count=0;

sub PrintLine($$){
    $count=($count+1)%20;
    $Mode=shift;
    $Word=shift;
    chomp $Word;
    $Word=~s/^\s+//;
    $Word=~s/\s+$//;
    $Word=~s<</(html|body)>>~~ig;

    if($Word eq ''){
        return;
    }
    else{
        if($Mode eq 'New'){
            $Word=~s|<font.*?>||ig;
            if($Word=~/<td>/i){
                $Word=~s|<td>|<TD>$Style{StartNew}|ig;
                print $FH "$Word$Style{EndNew} " if $count<19;
                print $FH "$Word$Style{EndNew} \n" if $count==19;
            }
            else{
                print $FH "$Style{StartNew}$Word$Style{EndNew} " if $count<19;
                print $FH "$Style{StartNew}$Word$Style{EndNew} \n" if $count>=19;
            }
        }
        elsif($Mode eq 'Old'){
            $Word=~s|<[^<]+>||g;
            print $FH "$Style{StartOld}$Word$Style{EndOld} " if $count<19;
            print $FH "$Style{StartOld}$Word$Style{EndOld} \n" if $count>=19;
        }
        elsif($Mode eq 'Equal'){
            if($Word=~ />$/){
                print $FH "$Word" if $count<19;
                print $FH "$Word\n" if $count==19;
            }
            else{
                print $FH "$Word " if $count<19;
                print $FH "$Word \n" if $count==19;
            }
        }
        else{die"Illegal Mode for PrintLine: $Mode"}
        #  print $FH $count;
    }
}

This print function is called hundreds of times in the script, and the counter makes it insert a " \n" (space and \n) after every 20 statements (otherwise, it only inserts a space, if less than 20 statements).  But, to make the lines of HTML code really about even length, it needs to be modified to add the "\n" after a set number of *characters* (without breaking inside of a word or inside of an html tag).
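
The character-counter idea described above could be sketched roughly as follows. This is an assumption-laden outline rather than a tested replacement for PrintLine: emit, $MAX_COL and $col are hypothetical names, and the 300 limit is simply the figure mentioned in the question. Each piece is written whole, so a word or tag passed in one call is never split.

```perl
use strict;
use warnings;

my $MAX_COL = 300;   # assumed limit, taken from the 300 in the question
my $col     = 0;     # characters written on the current line so far

# Write $piece to $fh, starting a new line first if the piece would push
# the current line past $MAX_COL. Hypothetical helper, not from the thread.
sub emit {
    my ($fh, $piece) = @_;
    if ($col > 0 && $col + length($piece) > $MAX_COL) {
        print {$fh} "\n";
        $col = 0;
    }
    print {$fh} $piece;
    # If the piece itself contains newlines, count only what follows the last one.
    my $nl = rindex($piece, "\n");
    $col = $nl >= 0 ? length($piece) - $nl - 1 : $col + length($piece);
}

# Demonstration with an in-memory filehandle instead of the real $FH.
open my $fh, '>', \my $out or die "Cannot open in-memory handle: $!";
emit($fh, "<b>word$_</b> ") for 1 .. 100;
close $fh;
print $out, "\n";
```

Inside PrintLine, each pair of "print $FH ... if $count<19" and "print $FH ... \n if $count==19" lines could then collapse into a single emit($FH, ...) call, and $count would no longer be needed. A single piece longer than $MAX_COL still goes out on one line, since this sketch never splits a piece.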
0
 

Author Comment

by:Randall-B
ID: 17981527
ozo,
   Yes, it should avoid breaks within a tag, such as:
<pr
 e>
      or
</p
re>
  (It should maintain it as "<pre>".)  The other examples you gave probably also need to be maintained without breaks or spaces.  However, please see my revised question above. Sorry about the confusion.
0
 

Author Comment

by:Randall-B
ID: 17981535
jingks03
   That looks like it would have worked for the question as originally stated, but I'm sorry I mis-stated it. See the long revised question above. Thanks.
0
 
LVL 84

Expert Comment

by:ozo
ID: 17981588
<pr
e>
is bad
but
<pre
>
is valid
(although I suppose it doesn't hurt to wait one more character before inserting the newline.)
Between word1 and word2 in
<script>"word1 word2"</script>
is not inside <> but should probably not have a line break inserted;
<!-- comment --> is inside of <> but could safely have a line break inserted.
Do you need to take those into account?
0
 

Author Comment

by:Randall-B
ID: 17981622
ozo,
   I would like to avoid breaking up the <pre> or </pre> tag at all.
But I doubt that it would hurt to insert a "\n" inside of content between the script or comment tags, so it probably does not need to take those into account.  (And please see the revised question above.) Thanks.
0
 

Author Comment

by:Randall-B
ID: 17982190
Looks like no one is biting the bait for the revised question. Although I haven't tested jingks03's solution, it looks like it would do what I originally asked for, so I'll accept it.  
  The revised question is being moved to a separate listing: http://www.experts-exchange.com/Programming/Programming_Languages/Perl/Q_22067302.html , as it is so different from the original question.  Experts, please go to the new question. Thanks.
0
 
LVL 2

Expert Comment

by:jingks03
ID: 17982518
Ah, in that case, Randall, I think there has to be another slight change to it:

$String =~ s/^(.{$half}.*?[\s\>])([^>]+\<.*\>[^>]*)$/$1\n$2/;

The pattern, broken up, should work as:
^(.{$half}     - match anything up to the halfway mark
.*?[\s\>])     - match up to the first space or > observed
([^>]+         - split point must not be followed by a > before the next <
\<.*\>         - match from the first < after the split to the last >
[^>]*)$        - match any remaining text to the end of the string

Sorry about not looking into the second, revised question.  It looks a little too involved for a coffee-break answer.
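
Since neither version of the substitution was tested in the thread, a small check can help confirm on your own strings that the inserted "\n" did not land inside a tag. A rough sketch, assuming a tag is anything between "<" and the next ">" (breaks_outside_tags is a hypothetical name; text such as "a < b\nc > d" would be flagged as a false positive). The substitution below is the revised one, with int() and the /s modifier added as assumptions so it copes with odd lengths and existing newlines:

```perl
use strict;
use warnings;

# Returns true if no "\n" in $html falls between a "<" and the next ">".
# Hypothetical checker, not from the thread.
sub breaks_outside_tags {
    my ($html) = @_;
    return $html !~ /<[^>]*\n[^>]*>/;
}

my $String = '<p>' . ('lorem ipsum ' x 30)
           . '<a href="page.html">link</a> tail</p>';
if (length($String) > 300) {
    my $half = int(length($String) / 2);
    $String =~ s/^(.{$half}.*?[\s>])([^>]+<.*>[^>]*)$/$1\n$2/s;
}
print breaks_outside_tags($String) ? "ok\n" : "break landed inside a tag\n";
```

Running the same check over a sample of real output lines is a cheap way to catch cases where the first whitespace after the midpoint happens to be inside a tag.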
0
