Solved

Chopping/wrapping a string to a specific length which contains undisplayed control character sequences.

Posted on 2011-02-10
23
426 Views
Last Modified: 2012-05-11
Hi.

This one has thrown me for a while and I'm not sure how to proceed, so I'm asking ...

Take a string that will be displayed on a fixed width display. This could be on a line printer (old style dot matrix), or a text display (console or DOS).

The string will contain zero width elements. Think of them like HTML tags. The tags themselves don't occupy any space in the output.

The zero-width elements will wrap a part of the text, so, again like HTML, they resemble the <font>...</font> tags rather than the <br /> or <img /> tags.

To get the text to display nicely, it is necessary to wrap it, but to not break the words.

For example:

Original Text:
To get the text to display nicely, it is necessary to wrap it, but to not break the words.

Straight cut at 20 characters:
12345678901234567890
To get the text to d
isplay nicely, it is
 necessary to wrap i
t, but to not break 
the words.

Without breaking the words:
12345678901234567890
To get the text to 
display nicely, it 
is necessary to wrap 
it, but to not break 
the words.



Now let's add some "tags". For this example, I'll use <tag>...</tag> around all words that start with T.

Original text:
<tag>To</tag> get <tag>the</tag> <tag>text</tag> <tag>to</tag> display nicely, it is necessary <tag>to</tag> wrap it, but <tag>to</tag> not break <tag>the</tag> words.

Straight cut at 20 characters:
12345678901234567890
<tag>To</tag> get <t
ag>the</tag> <tag>te
xt</tag> <tag>to</ta
g> display nicely, i
t is necessary <tag>
to</tag> wrap it, bu
t <tag>to</tag> not 
break <tag>the</tag>
 words.

Without breaking the words (or the tags open):
12345678901234567890
<tag>To</tag> get 
<tag>the</tag> 
<tag>text</tag> 
<tag>to</tag> 
display nicely, it 
is necessary 
<tag>to</tag> wrap 
it, but 
<tag>to</tag> not 
break 
<tag>the</tag> 
words.

Without breaking the words (or the tags open) and then removed as they are zero width:
12345678901234567890
To get 
the 
text 
to 
display nicely, it 
is necessary 
to wrap 
it, but 
to not 
break 
the 
words.



As you can see, treating the <tag>s as text results in a mess.

What is wanted is the following output :

Required output:
12345678901234567890
<tag>To</tag> get <tag>the</tag> <tag>text</tag> <tag>to</tag> 
display nicely, it 
is necessary <tag>to</tag> wrap 
it, but <tag>to</tag> not break 
<tag>the</tag> words.



The tag is dynamic. In the examples I've used so far, the text is a fixed string (albeit 2 different strings). In the real application, the tags have a start element and an end element and any number of characters in between.

Here is a more realistic example.

The new GeoMap Redevelopment project has an access code of #[1;37;44m GM #[0m and uses the #[1;37;44m Default #[0m security tokens.



The tokens start with the '#' and end with the 'm'.

The white space around the tags is also important: the GM in the first tag requires its leading and trailing spaces. This means that the white space is contextual, which makes this a little harder to work with.

The final output needs to look like this ...

12345678901234567890
The new GeoMap
Redevelopment
project has an
access code of #[1;37;44m GM #[0m 
and uses the 
#[1;37;44m Default #[0m security
tokens.




If the content of a tag contains white space (other than the whitespace immediately adjacent to the inside of the tag), then it can be broken, but a trailing closing tag must be added to the end of the line and an opening tag must be added to the beginning of the line.

For example ...
Original text:
The new GeoMap Redevelopment project has an access code of #[1;37;44m GM #[0m and uses the #[1;37;44m Default security tokens #[0m.

Required output:
12345678901234567890
The new GeoMap
Redevelopment
project has an
access code of #[1;37;44m GM #[0m 
and uses the 
#[1;37;44m Default security #[0m
#[1;37;44m tokens #[0m.



I'm not looking for actual code (unless you have some), but an algorithm to allow me to write the code.

I'm using PHP, but the issue isn't a PHP issue per se, so any and all languages are acceptable - as I'm sure you'll comment your code!

So.

Any ideas/suggestions/etc. all gratefully accepted.

TIA

Richard Quadling.
0
Comment
Question by:RQuadling
23 Comments
 
LVL 34

Assisted Solution

by:Beverley Portlock
Beverley Portlock earned 101 total points
ID: 34863998
You can use str_replace to swap <br> tags to \n

You can use strip_tags to remove HTML tags,

You can use wordwrap to fold the code

http://www.php.net/str_replace
http://www.php.net/strip_tags
http://www.php.net/wordwrap

Finally you can ditch anything that does not match a given set of characters using a regex

$oldtext = strip_tags( $text );
$newtext = preg_replace( '#[^-a-z0-9\.,!]#is', '', $oldtext );

echo wordwrap( $newtext, 80 );
0
 
LVL 40

Author Comment

by:RQuadling
ID: 34864137
@bportlock, you've completely missed the point. The tags are zero width and must be preserved.

They control the behaviour of the output device (add colour, bold, underline, etc.).

My only route so far is to do a token by token analysis and track the in/out of the tags (no idea on nested tags as yet, but the quick analysis I've done on the historic data suggest not), with a look ahead for the closing tag so that if I have to break the tag's content, I've got the closing tag ready for before the break and the opening tag ready for the new line after the break.

The tags are NOT HTML, so strip_tags has no place - I only used that initially as an example to help get you all into the idea of what I'm doing.

Ditching the tags will simply convert everything to plain text.
0
 
LVL 86

Assisted Solution

by:jkr
jkr earned 34 total points
ID: 34864189
Well, basically you are dealing with strings that you split into substrings using 'space' as a delimiter to then rearrange them given the restrictions you cited above. The following (C++) snippet does that:
#include <iostream>
#include <list>
#include <sstream>
#include <string>
using namespace std;

template <typename T>
size_t SplitTextElements ( basic_string<T> strIn, const T cDelim, list<basic_string<T> >& lstResult) {

   size_t unPos = 0;
   size_t unCount = 0;
   size_t unFound;
   basic_string<T> strToken;

   while ( true) {

      unFound = strIn.find ( cDelim, unPos);

        if ( string::npos == unFound) {

        strToken = strIn.substr ( unPos, strIn.length() - unPos);
        lstResult.push_back ( strToken);
        break;
      }

      strToken = strIn.substr ( unPos, unFound - unPos);

      unPos = unFound + 1;

      ++unCount;

      lstResult.push_back ( strToken);
   }

   return unCount;
}

template<typename T>
basic_string<T> BreakIt(basic_string<T> strIn, const size_t szWidth) {

   basic_stringstream<T> ssRes;
   basic_stringstream<T> ss;
   list<basic_string<T> > lst;

   SplitTextElements<T>(strIn, T(' '), lst);

   for (typename list<basic_string<T> >::iterator i = lst.begin(); i != lst.end(); ++i) {

     size_t szCurLen = ss.str().length();
     size_t szToAdd = i->length();

     if ((szCurLen + szToAdd) < szWidth) {

       if (szCurLen) ss << ' ';

       ss << *i;

     } else {

       ssRes << ss.str() << "\r\n";

       ss.str("");

       ss << *i;
     }
   }


   // Flush the last, partially filled line; without this the final words are lost.
   ssRes << ss.str();

   return ssRes.str();
}

int main () {

   string s = "The new GeoMap Redevelopment project has an access code of #[1;37;44m GM #[0m and uses the #[1;37;44m Default security tokens #[0m.";

   cout << BreakIt<char>(s,20);

   return 0;
}

Output:
The new GeoMap
Redevelopment
project has an
access code of
#[1;37;44m GM #[0m
and uses the
#[1;37;44m Default
security tokens
#[0m.


0
 
LVL 20

Assisted Solution

by:Proculopsis
Proculopsis earned 100 total points
ID: 34864333

I hope I can improve on this but this is where I've got so far:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>http://www.experts-exchange.com/Programming/Languages/Scripting/JavaScript/Q__26812728.html</title>
<script src="http://code.jquery.com/jquery-1.4.3.min.js" language="javascript"></script>
<script language="JavaScript">

$(function() {

  $("#text").val().replace( /\w+|#\[[\d;]+m/g, formatter );
  $("#formatted").val( $("#formatted").val() + line + "\n" );

});

var escapeSequence = RegExp( /#\[[\d;]+m/ );
var ignore = 0;
var line = "";
function formatter() {
  var word = arguments[0];
  if ( escapeSequence.test( word ) ) {
    line += word + " ";
    ignore += word.length + 1;
  } else {
    if ( line.length + word.length - ignore > 20 ) {
      $("#formatted").val( $("#formatted").val() + line + "\n" );
      line = word + " ";
      ignore = 0;
    } else {
      line += word + " ";
    }
  }
}

</script>

</head>
<body>

<textarea id="text" cols="100" rows="20">The new GeoMap Redevelopment project has an access code of #[1;37;44m GM #[0m and uses the #[1;37;44m Default security tokens #[0m.</textarea>

<textarea id="formatted" cols="100" rows="20"></textarea>

</body>
</html>


0
 
LVL 34

Assisted Solution

by:Beverley Portlock
Beverley Portlock earned 101 total points
ID: 34864352
"...you've completely missed the point. The tags are zero width and must be preserved...."

Sorry - when I see tag-like things and the letters HTML, the red mist descends .....

I think in that case you're going to have to do what compiler writers do and that is to tokenise the text and then read the tokens according to some sort of lexical/symbol analyser.

So I think that you will have to parse the text into a series of tokens and then define a custom function that, when given a fragment of tokens and data, inspects it and returns a length. This will allow you to capture a logical length rather than the physical length of the text. You can then assemble the fragments based on their logical lengths.

class fragment {
     public $physicalText;
     public $logicalLength;

     function __construct( $text ) {
          $this->physicalText = $text;
          $this->logicalLength = $this->length();
     }

     function length() {
          .... code to determine logical length
     }
}


Then the parser can simply assemble an array of fragments. A 'coder' section can then take the array of fragments and read them in sequence and use the logical length basis and produce the correct output.

The tricky bit is then to produce the code that determines the logical length based on the fragment's contents.
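
As a minimal sketch of that logical-length step, assuming every control sequence has the `#[...m` shape used in the question's examples: the function name `logicalLength` is hypothetical and is not part of the class above.

```php
<?php
// Hypothetical helper: the logical (displayed) length of a fragment,
// assuming control sequences look like "#[1;37;44m" or "#[0m".
function logicalLength(string $physicalText): int
{
    // Delete every control sequence, then measure what remains.
    return strlen(preg_replace('/#\[[\d;]+m/', '', $physicalText));
}

echo logicalLength('#[1;37;44m GM #[0m'), "\n";   // prints 4
echo strlen('#[1;37;44m GM #[0m'), "\n";          // prints 18
```

Note that `strlen` counts bytes, so text containing multibyte characters would probably need `mb_strlen` instead.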
0
 
LVL 37

Assisted Solution

by:TommySzalapski
TommySzalapski earned 67 total points
ID: 34864360
Why don't you just do something like this? I won't bother writing actual code since I don't know PHP.

//functions
//gets the next word
string getWord(string mainString)
{
  loop through mainString
  if you find a '#' loop until an 'm'
  if you find a space (in the main loop) break out.
  return the string and remove it from mainString
}
//counts the characters
int wordLen(string word)
{
  loop through word
  count each character
  if you find a '#' loop until an 'm' (without counting)
  return the count
}

Then just keep adding words to a line until the sum of the counts (plus 1 each time for the spaces in between) gets too high

word = getWord(mainStr)
len = wordLen(word)
if currentLength + len > width
{
  add new line
  set currentLength = 0
}
add word to output
currentLength = currentLength + len


0
 
LVL 37

Assisted Solution

by:TommySzalapski
TommySzalapski earned 67 total points
ID: 34864377
The only thing you really need to do that you haven't yet is to write your own wordLen function that counts the string length but ignores the tags.
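
For PHP readers, the pseudocode above might translate roughly as follows. This is only a sketch: `wordLen` follows the pseudocode, `wrapWords` is a hypothetical name for the accumulation loop, and it assumes that a '#' never appears in ordinary text.

```php
<?php
// Count visible characters, skipping everything from '#' up to the next 'm'.
function wordLen(string $word): int
{
    $count = 0;
    $n = strlen($word);
    for ($i = 0; $i < $n; $i++) {
        if ($word[$i] === '#') {
            // Skip the control sequence without counting it.
            while ($i < $n && $word[$i] !== 'm') {
                $i++;
            }
            continue;
        }
        $count++;
    }
    return $count;
}

// Greedy wrap: keep adding space-separated words while the visible width fits.
function wrapWords(string $text, int $width): string
{
    $lines = [];
    $line = '';
    $lineLen = 0;
    foreach (explode(' ', $text) as $word) {
        $len = wordLen($word);
        if ($line === '') {
            $line = $word;
            $lineLen = $len;
        } elseif ($lineLen + 1 + $len > $width) {   // +1 for the joining space
            $lines[] = $line;
            $line = $word;
            $lineLen = $len;
        } else {
            $line .= ' ' . $word;
            $lineLen += 1 + $len;
        }
    }
    if ($line !== '') {
        $lines[] = $line;
    }
    return implode("\n", $lines);
}

echo wrapWords('The new GeoMap Redevelopment project has an access code of #[1;37;44m GM #[0m and uses the #[1;37;44m Default #[0m security tokens.', 20), "\n";
```

This keeps every line within the visible width, but it can still leave an opening tag at the end of one line with its content on the next, and it does not close and reopen a tag across a break.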
0
 
LVL 74

Assisted Solution

by:käµfm³d 👽
käµfm³d   👽 earned 33 total points
ID: 34864422
Here's my contribution. It is only a starting point, but hopefully furthers the discussion  = )
<?php

    $originaltext = "The new GeoMap Redevelopment project has an access code of #[1;37;44m GM #[0m and uses the #[1;37;44m Default security tokens #[0m.";
    
    echo "$originaltext<br /><br />\n\n";
    
    $parts = preg_split("@(?<!#)(?=#)@", $originaltext);    // $parts will contain a series of string which start with "#...", except possibly for index 0.
                                                            //  *All* characters are preserved in the split--even whitespace--due to the nature of the pattern
                                                            //  Pattern effectively says to split wherever you find a hash mark preceded by something not a hash mark
    
    for ($i = 0; $i < count($parts); $i++)
    {
        $temp = $parts[$i];                                 // Create a temporary, modifiable copy
        
        if (strlen($temp) > 20)                                
        {
            while (strlen($temp) > 20)
            {
                if (preg_match("@^(.{1,20})(?!\S)(.*)@", $temp, $matches))       // Find up to 20 characters where the character immediately following
                {                                                                //   is a whitespace--essentially, only break on whitespace
                    print $matches[1] . "<br />\n";                              // We captured our 1-20 characters in capture group 1, so we print
                    $temp = $matches[2];                                         //   that group and assign group 2 (the remainder of the string) back
                }                                                                //   to $temp for further processing
                else                                                             // If we couldn't find 1-20 characters followed by whitespace, dump the string
                {                                                                //   itself out.
                    print $temp . "<br />\n";
                    $temp = "";
                }
            }
            
            print $temp . "<br />\n";                                            // Dump the remainder (less than 20 remaining characters) out.
        }
        else
        {
            print $temp . "<br />\n";                                            // The string was already less than 20 characters, so dump it
        }
    }

?>


0
 
LVL 32

Assisted Solution

by:sarabande
sarabande earned 33 total points
ID: 34864709
try the following

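#include <string.h>

void wrap_ign_tags(const char * input, const char * tagbegin, const char * tagend, int wrap, char * output, int outlen)
{
   int inplen = (int)strlen(input);
   int tbglen = (int)strlen(tagbegin);
   int tenlen = (int)strlen(tagend);
   int count = 0;     // visible characters on the current line
   int posspc = -1;   // output index of the last space on this line (-1: none)
   int cntspc = 0;    // value of count when that space was stored
   int i, j = 0;
   char c;

   if (outlen <= 0)
       return;
   for (i = 0; i < inplen; i++)
   {
       const char * p = input + i;
       c = *p;

       if (strncmp(p, tagbegin, tbglen) == 0)
       {
           const char * q = strstr(p + tbglen, tagend);
           if (q != NULL)
           {
                int taglen = (int)(q + tenlen - p);
                if (j + taglen >= outlen)
                   break;  // error: output buffer too small
                memcpy(&output[j], p, taglen);  // copy the tag without counting it
                j += taglen;
                i += taglen - 1;  // skip the tag; the loop increment adds the last step
                continue;
           }
       }
       if (j + 1 >= outlen)
           break;  // error: output buffer too small
       if (c == ' ')
       {
           if (count == 0)
              continue;  // don't take leading space
           posspc = j;   // remember space position
           cntspc = count;
       }
       else if (count >= wrap)
       {
           if (posspc < 0)  // no space on this line: make a hard break
           {
                output[j++] = '\n';
                if (j + 1 >= outlen)
                   break;  // error: output buffer too small
                count = 0;
           }
           else
           {
                output[posspc] = '\n';  // turn the last space into the line break
                count -= cntspc + 1;    // visible characters already on the new line
                posspc = -1;
           }
       }
       output[j++] = c;
       count++;
   }
   output[j] = '\0';  // always terminate the output string
}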




The code is not tested, and it doesn't preserve needed spaces between tags. Actually, it doesn't care whether it is between tags at all.

Sara
0
 
LVL 6

Assisted Solution

by:MatthewP
MatthewP earned 132 total points
ID: 34864832
This may be a little limited it depends on what sort of tags you have, perhaps you can use it as the base of something if your incoming text doesn't fit this exactly. You'll get the idea here though.
<?php
$str="<tag>To</tag> get <tag>the</tag> <tag>text</tag> <tag>to</tag> display nicely, it is necessary <tag>to</tag> wrap it, but <tag>to</tag> not break <tag>the</tag> words.";
$array_words=preg_split("/ +/",$str);

$newstr=strip_tags($str);
$newstr=wordwrap($newstr,20);
$array_words_with_linebreaks=preg_split("/\n/",$newstr);

$final="";
$lbwordcount=0;
$i=0;
foreach ($array_words_with_linebreaks as $wordline){
        $lbwords=explode(" ",$wordline);
        $lbwordcount=$lbwordcount + count($lbwords);
        foreach ($lbwords as $discard){
                if ($array_words[$i]){
                        $final .= $array_words[$i];
                        $final .= " ";
                        $i++;
                }
        }
        $final .= "\n";
}
print $final;
?>


0
 
LVL 6

Assisted Solution

by:MatthewP
MatthewP earned 132 total points
ID: 34864971
Two un-necessary lines in the post above actually, and a couple of improvements.

Limitations - only deals with html style tags (but if your tags aren't html you can write a function to parse them out fairly easily). Also will not deal with tab characters, maybe you need to watch for these, replace with spaces etc but you should get the idea.

<?php
$str="<tag>To</tag> get <tag>the</tag> <tag>text</tag> <tag>to</tag> display nicely, it is necessary <tag>to</tag> wrap it, but <tag>to</tag> not break <tag>the</tag> words.";
$str=str_replace("\n","",$str); // remove linebreaks in original.

$max_characters=20; // set your max number of characters per line here

$array_words=preg_split("/ +/",$str);
$newstr=strip_tags($str);
$newstr=wordwrap($newstr,$max_characters);
$arr_words_no_linebreaks=preg_split("/\n/",$newstr);

$final="";
$lbwordcount=0;
$i=0;
foreach ($arr_words_no_linebreaks as $wordline){
        $lbwords=explode(" ",$wordline);
        $lbwordcount=$lbwordcount + count($lbwords);
        foreach ($lbwords as $discard){
                        $final .= $array_words[$i] . " ";
                        $i++;
        }
        $final .= "\n";
}
print $final;
?>


0

 
LVL 6

Assisted Solution

by:MatthewP
MatthewP earned 132 total points
ID: 34865135
Er... I hadn't read your bit further down about what format the tags were in; I was just imagining HTML. Maybe some of this is still useful. I've not got time right now to go any further, but it was a fun exercise nonetheless :p I may have another look tomorrow, depending on what else gets posted here.
0
 
LVL 40

Author Comment

by:RQuadling
ID: 34865258
WOW. A lot of excellent ideas here. I don't know if you read my initial question right to the end (or did you read it before I edited it).

The actual tags are unknown, but they are in pairs (opening and closing).

The output is going to be placed inline with other text.

I suppose an example output would be useful. The example is using ANSI text (as I can easily take a picture of it). The # characters are really the chr(27)/[ESC] character.

The importance of closing the tags where the content of the tag is split across multiple lines becomes relevant now. Hopefully.

Without the closing tag being assigned, the effect/behaviour introduced by the opening tag carries through to the next label (for want of a way of describing this issue).
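
One way to catch this while testing is to check each finished line for a tag that is still in effect at its end. The following is only a sketch: the function name `leavesTagOpen` is hypothetical, and it assumes `#[0m` is the only closing tag and that tags are never nested.

```php
<?php
// Hypothetical check: does a finished line end with a tag still in effect?
// Assumes "#[0m" is the only closing tag and tags are not nested.
function leavesTagOpen(string $line): bool
{
    preg_match_all('/#\[[\d;]+m/', $line, $matches);
    if ($matches[0] === []) {
        return false;               // no tags at all
    }
    $last = end($matches[0]);       // the last tag on the line decides
    return $last !== '#[0m';
}

var_dump(leavesTagOpen('#[1;37;44m Default security'));       // bool(true)
var_dump(leavesTagOpen('#[1;37;44m Default security #[0m'));  // bool(false)
```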

If you think of the line art as boxes on a form to be filled in, each box is unconnected to any other.

I've attached an image showing the valid output and what happens if the closing tags for the broken tagged content are not present.








Correct output
+-12345678901234567890-+ +-12345678901234567890-+ +-12345678901234567890-+
+----------------------+ +----------------------+ +----------------------+ 
| The new GeoMap       | | The new GeoMap       | | The new GeoMap       | 
| Redevelopment        | | Redevelopment        | | Redevelopment        |
| project has an       | | project has an       | | project has an       | 
| access code of #[1;37;44m GM #[0m  | | access code of #[1;37;44m GM #[0m  | | access code of #[1;37;44m GM #[0m  | 
| and uses the         | | and uses the         | | and uses the         | 
| #[1;37;44m Default security #[0m   | | #[1;37;44m Default security #[0m   | | #[1;37;44m Default security #[0m   | 
| #[1;37;44m tokens #[0m.            | | #[1;37;44m tokens #[0m.            | | #[1;37;44m tokens #[0m.            | 
+----------------------+ +----------------------+ +----------------------+ 

Incorrect output
+-12345678901234567890-+ +-12345678901234567890-+ +-12345678901234567890-+
+----------------------+ +----------------------+ +----------------------+ 
| The new GeoMap       | | The new GeoMap       | | The new GeoMap       | 
| Redevelopment        | | Redevelopment        | | Redevelopment        |
| project has an       | | project has an       | | project has an       | 
| access code of #[1;37;44m GM #[0m  | | access code of #[1;37;44m GM #[0m  | | access code of #[1;37;44m GM #[0m  | 
| and uses the         | | and uses the         | | and uses the         | 
| #[1;37;44m Default security    | | #[1;37;44m Default security    | | #[1;37;44m Default security    | 
| tokens #[0m.            | | tokens #[0m.            | | tokens #[0m.            | 
+----------------------+ +----------------------+ +----------------------+


ANSILabels.png
0
 
LVL 40

Author Comment

by:RQuadling
ID: 34865350
In my example above, I'm using the same text and the same box sizes. In the live system, these are all unconnected. One box is the description of a product, another box is a set of notes and warnings with highlighting on keywords, etc.

0
 
LVL 40

Author Comment

by:RQuadling
ID: 34865461
OK. Some implementation.

Based upon the various techniques you've all mentioned, I think I'm getting somewhere using the following key function ...

$a_Tokens = preg_split('`(#\[[\d;]++m)|\b`', $s_OrigString, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

This isn't perfect as I've not included the printer based codes, but essentially every escape sequence or word boundary is used to split the string.

I want to capture the splitter (that allows me to keep the escape sequences) but I want to throw away the empty strings (that gets rid of the word boundaries).

Using that I get output of ...

array(42) {
  [0]=>
  string(3) "The"
  [1]=>
  string(1) " "
  [2]=>
  string(3) "new"
  [3]=>
  string(1) " "
  [4]=>
  string(6) "GeoMap"
  [5]=>
  string(1) " "
  [6]=>
  string(13) "Redevelopment"
  [7]=>
  string(1) " "
  [8]=>
  string(7) "project"
  [9]=>
  string(1) " "
  [10]=>
  string(3) "has"
  [11]=>
  string(1) " "
  [12]=>
  string(2) "an"
  [13]=>
  string(1) " "
  [14]=>
  string(6) "access"
  [15]=>
  string(1) " "
  [16]=>
  string(4) "code"
  [17]=>
  string(1) " "
  [18]=>
  string(2) "of"
  [19]=>
  string(1) " "
  [20]=>
  string(10) "#[1;37;44m"
  [21]=>
  string(1) " "
  [22]=>
  string(2) "GM"
  [23]=>
  string(1) " "
  [24]=>
  string(4) "#[0m"
  [25]=>
  string(1) " "
  [26]=>
  string(3) "and"
  [27]=>
  string(1) " "
  [28]=>
  string(4) "uses"
  [29]=>
  string(1) " "
  [30]=>
  string(3) "the"
  [31]=>
  string(1) " "
  [32]=>
  string(10) "#[1;37;44m"
  [33]=>
  string(1) " "
  [34]=>
  string(7) "Default"
  [35]=>
  string(1) " "
  [36]=>
  string(8) "security"
  [37]=>
  string(1) " "
  [38]=>
  string(6) "tokens"
  [39]=>
  string(1) " "
  [40]=>
  string(4) "#[0m"
  [41]=>
  string(1) "."
}

From that, I think it is just a simple accumulation of strings, keeping track of length, open/close tags with the look ahead for the closing tag when an opening tag is found.
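
A minimal sketch of that accumulation, under stated assumptions: tags look like `#[...m`, `#[0m` is the only closing tag, tags are paired and never nested, and no single word is wider than the line. It uses a slightly different splitter (tags and whitespace runs as delimiters) so that punctuation stays attached to its word. The names `wrapTagged` and `CLOSE_TAG` are hypothetical.

```php
<?php
// Sketch only; see the assumptions in the paragraph above.
const CLOSE_TAG = '#[0m';

function wrapTagged(string $text, int $width): string
{
    // Split into tags, whitespace runs and words, keeping all three.
    $tokens = preg_split('/(#\[[\d;]+m|\s+)/', $text, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    $lines = [];
    $line = '';        // text committed to the current line
    $len = 0;          // visible width of $line
    $active = null;    // opening tag in effect at the end of $line
    $wsBefore = '';    // held-back whitespace (before any held-back opening tag)
    $open = null;      // held-back opening tag
    $wsAfter = '';     // held-back whitespace after that opening tag

    foreach ($tokens as $tok) {
        if ($tok === CLOSE_TAG) {
            // Whitespace just inside a closing tag is visible, so it stays with the tag.
            $line .= $wsBefore . ($open ?? '') . $wsAfter . $tok;
            $len += strlen($wsBefore) + strlen($wsAfter);
            $active = null;
        } elseif (preg_match('/^#\[[\d;]+m$/', $tok)) {
            $open = $tok;              // hold back until the next word is known
            continue;
        } elseif (ctype_space($tok)) {
            if ($open === null) { $wsBefore .= $tok; } else { $wsAfter .= $tok; }
            continue;
        } else {
            $need = strlen($wsBefore) + strlen($wsAfter) + strlen($tok);
            if ($len > 0 && $len + $need > $width) {
                if ($open === null && $active !== null) {
                    // Break inside a tag: close it here, reopen it on the next line.
                    if ($len + strlen($wsBefore) <= $width) {
                        $line .= $wsBefore;
                    }
                    $lines[] = $line . CLOSE_TAG;
                    $line = $active . $wsBefore;
                    $len = strlen($wsBefore);
                } else {
                    // Break outside a tag: drop the whitespace at the break and
                    // carry any held-back opening tag down with its content.
                    $lines[] = $line;
                    $line = ($open ?? '') . $wsAfter;
                    $len = strlen($wsAfter);
                }
                $line .= $tok;
                $len += strlen($tok);
            } else {
                $line .= $wsBefore . ($open ?? '') . $wsAfter . $tok;
                $len += $need;
            }
            if ($open !== null) { $active = $open; }
        }
        $wsBefore = '';
        $open = null;
        $wsAfter = '';
    }
    $line .= $open ?? '';   // a dangling opening tag, if the input ended with one
    if ($line !== '') { $lines[] = $line; }
    return implode("\n", $lines);
}

echo wrapTagged('The new GeoMap Redevelopment project has an access code of #[1;37;44m GM #[0m and uses the #[1;37;44m Default security tokens #[0m.', 20), "\n";
```

For the second example in the question and a width of 20, this gives the seven lines of the required output shown there, without the trailing spaces. Holding back whitespace and opening tags until the next word arrives is what allows the break to fall before a tag rather than after it.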

From what I've been reliably informed the application providing these strings cannot generate mismatched tags. No one has ever tried using nested tags. In an application that is over 10 years old.

So.

More in a bit.

Maybe.
0
 
LVL 34

Accepted Solution

by:
Beverley Portlock earned 101 total points
ID: 34866664
OK, I've had some spare time. Here is a tokenizer which partly works and assembles the data with tokens intact. You need to correct the part that breaks the tokens up if they break your rules (that bit is marked near the bottom of the code). The results look like this at present

The new GeoMap Redevelopment
project has an access code of
#[1;37;44m GM #[0m and uses
the #[1;37;44m Default security tokens #[0m
 ANd here is more text.

The linelength can be set using a variable near the bottom of the code fragment. I hope it is of some use.


<?php

ini_set('display_errors',1); error_reporting(E_ALL);



class fragment {
     public  $physicalText;

     private $start;
     private $end;

     function __construct( $text, $s, $e ) {
          $this->physicalText = $text;
          $this->start = $s;
          $this->end   = $e;
     }

     function length() {
          return strlen($this->physicalText) - strlen($this->start) - strlen($this->end);
     }

     function isTokenText() {
          return ($this->start != '' || $this->end != '');
     }


}



class parseTokens {

     public  $tokenArray;

     private $data;
     private $tokenPattern;
     private $startToken;


     function __construct( $data, $tokenStart, $tokenEnd ) {
          $this->tokenArray = array();
          $this->data = $data;
          $this->tokenPattern = "!^(".$tokenStart."[^$tokenEnd]+".$tokenEnd.")([^".$tokenStart."]+)(".$tokenStart."[^$tokenEnd]+".$tokenEnd.")!s";
          $this->startToken   = "!(".$tokenStart."[^$tokenEnd]+".$tokenEnd.")!s";

          while ( $this->data != "" )
               $this->nextToken();

          print_r($this->tokenArray); // for debug.....
     }


     // Get the next token from the string. If the current token is a delimiter
     // then find another delimiter to go with it
     //
     function nextToken() {
          $space = strpos( $this->data, ' ' );
          $start = "";
          $end   = "";

          if ( $space !== false ) {
               $token = substr( $this->data, 0, $space);

               if ( preg_match( $this->startToken, $token ) >= 1 ) {
                    // Replace my original token with a pairs
                    //
                    preg_match( $this->tokenPattern, $this->data, $match );
                    $token = $match[0];
                    $start = $match[1];
                    $end   = $match[3];
               }

               // Store the new token and shorten the string remaining to be processed
               //
               $this->tokenArray [] = new fragment( $token, $start, $end );
               $this->data = substr( $this->data, strlen( $token ) + 1 );
          }
          else
               if ( strlen($this->data) > 0 ) {
                    $this->tokenArray [] = new fragment( trim($this->data), $start, $end );
                    $this->data = '';
               }
     }

}


$data = 'The new GeoMap Redevelopment project has an access code of #[1;37;44m GM #[0m and uses the #[1;37;44m Default security tokens #[0m. ANd here is more text.';

$p = new ParseTokens( $data, '#', 'm');

// Assemble parsed code into output
//
$output     = '';
$lineLength = 30;
$text       = "";


foreach( $p->tokenArray as $aToken ) {

     if ( strlen($text) + $aToken->length() > $lineLength ) {
          $output .= "$text\n";
          $text = '';
     }

     if ( $aToken->isTokenText() ) {
          $text .=  $aToken->physicalText . ' ';  // this bit needs correcting to suit the rules for tags
     }
     else
          $text .= $aToken->physicalText . ' ';
}

if ( $text != "" )
     $output .= "$text\n";

echo $output;


0
 
LVL 6

Assisted Solution

by:MatthewP
MatthewP earned 132 total points
ID: 34867195
This is looking pretty good above from bportlock. I've written a bit of a scrawl compared to this nice clear code above! But I'm going to post it anyway as it does seem to be working if I've understood it right. I've just continued along the lines I was going on originally rather than think again from scratch having read the whole thing - ie it's a bit of a hack!

Oh the end tag is still hard coded in here. I've used a modification of your regex from your last post on the start tag as I couldn't get yours to work as I wanted. But whether it's any good to you I don't know, I've stuck to the number followed by semicolon rule and not purely the #-m thing.

<?php
$max_characters=20; // set max characters per line
$str="The new GeoMap Redevelopment project has an access code of #[1;37;44m GM #[0m and uses the #[1;37;44m Default #[0m security tokens.";
$str .= " Here is some more text just to #[1;37;44m check #[0m out a few other #[1;37;44m instances #[0m of the tags #[1;37;44m to #[0m check it is #[1;37;44m all #[0m working.";
$str=str_replace("\n","",$str); // remove linebreaks in original
$array_words=preg_split("/ +/",$str);
$newstr=strip_custom_tags($str);
$newstr=wordwrap($newstr,$max_characters);
$arr_words_no_linebreaks=preg_split("/\n/",$newstr);

$arr_output_lines=array();
$each_line="";
$word_count=0;
$i=0;
$prepend_to_next_line="";
$linecounter=0;
foreach ($arr_words_no_linebreaks as $wordline){
        if ($prepend_to_next_line){
                $each_line .= $prepend_to_next_line;
        }
        $words_in_line=explode(" ",$wordline);
        $word_count=$word_count + count($words_in_line);
        $each_words_in_line_count=0;
        foreach ($words_in_line as $discard){
                $each_line .= $array_words[$i] . " ";
                $i++;
                $each_words_in_line_count++;
        }
        $prepend_to_next_line="";
        if (preg_match("/#\[[\d+;]*m ?$/",$each_line)){ // end of line, need to move it to next
                $words_on_line=explode(" ",$each_line);
                $remove_index=count($words_on_line)-2;
                $rebuild_final=array();
                $j=0;
                foreach ($words_on_line as $each_word){
                        if ($j != $remove_index){
                                array_push($rebuild_final,$each_word);
                        }
                        $j++;
                }
                $each_line=join(" ",$rebuild_final);
                $prepend_to_next_line=$words_on_line[$remove_index] . " ";
        } else {
                $prepend_to_next_line="";
        }
        array_push($arr_output_lines,$each_line);
        $each_line="";
        $linecounter++;
}

$final_line_count=0;
foreach ($arr_output_lines as $line){
        if (preg_match("/^#\[0m/",$line)){
                $arr_output_lines[$final_line_count-1] .= "#[0m";
                $arr_output_lines[$final_line_count]=preg_replace("/#\[0m /","",$arr_output_lines[$final_line_count]);
        }
        $final_line_count++;
}
$output_str=join("\n",preg_replace("/ $/","",$arr_output_lines));
print $output_str;

function strip_custom_tags($str_in){

        $str_in=preg_replace("/#\[[\d+;]*m/","",$str_in);
        $str_in=preg_replace("/#\[0m/","",$str_in);
        return $str_in;

}
?>




0
 
LVL 40

Author Comment

by:RQuadling
ID: 34867485
I'm playing with the various logics. And as always, new "rules" appear.

Some of the text contains new lines. Go figure. I think adding |[\r\n]++ to the splitter regex should allow for them to be detected separately from other \s. And I suppose tabs too, but, at the moment, I've not found any data with a tab in it.

I've also found several instances where they have entered the data with extra spacing (just spaces) to get things to line up. So it may be necessary to replace multiple whitespace with a single whitespace as once the data has been chopped up, it won't fit to anything other than the new restricted width box. Easy enough to do before doing the line wrapping and wouldn't affect the code.
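A minimal sketch of that clean-up step in JavaScript, assuming line breaks should survive and only runs of spaces and tabs are collapsed. The function name `normalizeSpaces` is made up for this example.

```javascript
// Collapse each run of horizontal whitespace to a single space.
// Line breaks are left alone so they can still be detected later.
function normalizeSpaces(text) {
  return text.replace(/[^\S\r\n]+/g, " ");
}
```

Note that spaces on either side of a zero-width sequence are separate runs, so `"a  #[0m  b"` would still display two spaces after this step.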

The most significant element that seems to be working for me is tokenizing the string.
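To make the tokenizing idea concrete, here is one possible JavaScript sketch of a wrap built on tokens. It is not the code used in the project: `wrapVisible` and its variable names are invented for illustration. It assumes the `#[...m` sequences are zero width, that `#[0m` ends a highlight, that every whitespace run (line breaks included) collapses to one space, and that a highlight still open at a line break should be closed and reopened on the next line. Words longer than the width are left on a line of their own.

```javascript
// Sketch only: wrapVisible and its helper names are illustrative.
const RESET = "#[0m";
const IS_TAG = /^#\[[\d;]+m$/;

function wrapVisible(text, width) {
  // The capture group keeps tags and whitespace runs as separate tokens.
  const tokens = text.split(/(#\[[\d;]+m|\s+)/).filter((t) => t !== "");
  const lines = [];
  let line = "";    // raw text of the current line, tags included
  let visible = 0;  // displayed characters on the current line
  let active = "";  // highlight sequence currently in effect, if any
  let gap = false;  // whitespace seen since the last emitted token

  const flush = () => {
    const closed = active ? line + RESET : line;
    // Drop highlights that ended up wrapping nothing, then trailing spaces.
    lines.push(closed.replace(/#\[(?!0m)[\d;]+m#\[0m/g, "").trimEnd());
    line = active;  // reopen the highlight on the next line
    visible = 0;
  };

  for (const token of tokens) {
    if (IS_TAG.test(token)) {
      // Emit a pending space before the tag so it stays outside the highlight.
      if (gap && visible > 0 && visible < width) { line += " "; visible++; }
      gap = false;
      line += token;
      active = token === RESET ? "" : token;
    } else if (/^\s+$/.test(token)) {
      gap = true;
    } else {
      const space = gap && visible > 0 ? 1 : 0;
      if (visible > 0 && visible + space + token.length > width) {
        flush();
        line += token;
      } else {
        line += (space ? " " : "") + token;
        visible += space;
      }
      visible += token.length;
      gap = false;
    }
  }
  if (visible > 0) flush();
  return lines;
}

// wrapVisible("aa #[1mbb cc#[0m dd", 5)
// → ["aa #[1mbb#[0m", "#[1mcc#[0m dd"]
```

Only the `visible` counter is compared with the width, so the length of the sequences themselves never affects where a line breaks.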


I have to be off this for a day or so, so I'll be back at it on Monday.

Thank you all for your comments and ideas. I'm quite pleased that my initial assumptions were correct and that I hadn't missed the obvious solution.

It seems to be fairly complex and I'm guessing this is why the feature was never added in the first place.

0
 
LVL 20

Assisted Solution

by:Proculopsis
Proculopsis earned 100 total points
ID: 34869854
Escape
I think the logic is correct in this version:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/Q_26812728.html</title>
<script src="http://code.jquery.com/jquery-1.4.3.min.js" language="javascript"></script>
<script language="JavaScript">

$(function() {

  $("#text").val().replace( /\S+\s*/g, formatter );
  addLine( line );

});

var highlighting = false;
var escapeSequence = "";
var ignore = 0;
var line = "";

function formatter() {
  var escapeExpression = RegExp( /#\[[\d;]+m/ );
  var escapeExpressionOff = RegExp( /#\[0m/ );
  var word = arguments[0];
  if ( escapeExpression.test( word ) ) {
    line += word;
    ignore += word.length;
    highlighting = true;
    escapeSequence = word;
    if ( escapeExpressionOff.test( word ) ) {
      highlighting = false;
      escapeSequence = "";
    }
  } else {
    if ( line.length + word.length - ignore > 20 ) {
      line += (highlighting)? "#[0m" : "";

      addLine( line );

      line = escapeSequence + word;
      ignore = escapeSequence.length;
    } else {
      line += word;
    }
  }
}

function addLine( line ) {
  // Global replaces: a single line can contain more than one escape sequence.
  line = line.replace( /#\[[\d;]+m\s*#\[0m/g, "" );
  $("#formatted").val( $("#formatted").val() + line + "\n" );

  line = line.replace( /#\[0m/g, "</span>" );
  line = line.replace( /#\[[\d;]+m/g, "<span class='highlight'>" );
  var newLine = $(".line").clone().removeAttr( "class" );
  $(".content",newLine).html( line );
  $(".line").before( newLine );
}

</script>

<style>
table { font-family: terminal; background-color: #000; color: #fff; border-collapse:collapse; }
span { white-space: pre }
.line { display: none; }
.highlight { background-color: #00f; }
</style>

</head>
<body>

<textarea id="text" cols="100" rows="8">The new GeoMap Redevelopment project has an access code of #[1;37;44m GM #[0m and uses the #[1;37;44m Default security tokens #[0m.</textarea>
<textarea id="formatted" cols="100" rows="8"></textarea>

<table>
<tr><td>+</td><td>-</td><td>12345678901234567890</td><td>-</td><td>+</td></tr>
<tr><td>+</td><td>-</td><td>--------------------</td><td>-</td><td>+</td></tr>
<tr class="line"><td>|</td><td></td><td class="content"></td><td></td><td>|</td></tr>
<tr><td>+</td><td>-</td><td>--------------------</td><td>-</td><td>+</td></tr>
</table>

</body>
</html>


0
 
LVL 40

Author Comment

by:RQuadling
ID: 34870534
Proculopsis .... oh so close!!!

The trailing padding isn't being taken into account when a line break is going to be inserted whilst "in tag".

Changing the text to

<textarea id="text" cols="100" rows="8">The new GeoMap Redevelopment project has an access code of #[1;37;44m GM #[0m and uses the #[1;37;44m Default to security token 1 #[0m.</textarea>



shows the issue.
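One way to spot this kind of problem without eyeballing the output is to measure each wrapped line with the control sequences removed. The sketch below is only a checking aid, with made-up names (`visibleLength`, `overflowingLines`); it assumes every `#[...m` sequence is zero width and that padding spaces inside a highlight count as displayed characters.

```javascript
// Displayed width of a line: everything except the #[...m sequences.
function visibleLength(line) {
  return line.replace(/#\[[\d;]*m/g, "").length;
}

// Report the index and displayed width of every line wider than `width`.
function overflowingLines(lines, width) {
  return lines
    .map((line, index) => ({ index, width: visibleLength(line) }))
    .filter((entry) => entry.width > width);
}
```

Feeding the contents of the formatted textarea, split on `"\n"`, into `overflowingLines(lines, 20)` would list any line where the padding pushed the text past the box.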

0
 
LVL 20

Assisted Solution

by:Proculopsis
Proculopsis earned 100 total points
ID: 34871456

I think you just need to change this line to:

    if ( line.length + word.length - ignore >= 20 ) {
0
 
LVL 40

Author Comment

by:RQuadling
ID: 34993248
Been off project for a while - sorry about not getting back to you all.

I've got this working in so far as I can with the data I have available.

The user enters data and colours it. The colouring is translated to both ANSI and HP Colour LaserJet codes, which are of considerably different lengths. I'd forgotten just how hard it is to do line mode output for a page printer.
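Because the device codes differ in length, one option is to wrap on the neutral `#[...m` markers first and translate them afterwards, so the line breaks are the same for every output device. The JavaScript sketch below illustrates that ordering. It assumes the `#[` marker stands in for ESC `[` in the ANSI case; `translateMarkers` and `toAnsi` are invented names, and no printer encoder is shown because the real LaserJet codes are not given in this thread.

```javascript
// Replace each #[params]m marker with whatever encode(params) returns.
// `encode` is a caller-supplied function, one per output device.
function translateMarkers(text, encode) {
  return text.replace(/#\[([\d;]*)m/g, (match, params) => encode(params));
}

// ANSI encoder, assuming the marker maps directly onto ESC [ params m.
function toAnsi(text) {
  return translateMarkers(text, (params) => `\x1b[${params}m`);
}
```

Running `toAnsi` over each already-wrapped line leaves the displayed text untouched, and a printer version would only need a different `encode` function.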

Wrapping into the boxes all seems to be working.

I'm sure they'll come up with some new data which doesn't work.

Thanks to you all for your suggestions.

I'm doing a straight split for all contributors as this was an exercise in process rather than code.

Thanks again.

Richard.

0
 
LVL 40

Author Closing Comment

by:RQuadling
ID: 34993254
An excellent set of contributions to the process I needed to use to get the result I needed. Thank you all.
0
