jay28lee

String manipulation (Big5 characters) - now needs HTML entity support

This is based on the original question answered in:

https://www.experts-exchange.com/questions/21478455/string-manupulation-big5-characters.html

Original solution works great, but now need some tweaking:

$mystring =~ s/(([\x00-\x7F]|[\x80-\xff].){0,14}).*/$1/s;

In the above example, a double-byte Chinese character is treated as 2 characters.

Now $mystring also contains a mix of HTML entities, such as &#12398; for the Japanese character の.

I would like to treat each HTML entity as 2 characters as well. How can I integrate that into the above regular expression?
kaufmed
Try:
s/(([\x00-\x7F]|[\x80-\xff].|&#\w+;){0,14}).*/$1/s;

jay28lee

ASKER

doesn't quite work.  i would like to treat each occurrence of an html entity, such as &#12398;, as a 2-byte character.

eg. 3 occurrences in a row (&#12398;&#12398;&#12398;) should give me a match length of 6

the input string consists of a mix of 2-byte chinese characters, ASCII characters, and html entities.
kaufmed's solution appears to be consistent with the original regular expression, as it counts 0-14 occurrences of items (not bytes): a string of 14 Chinese characters is accepted, as is a string of 14 ASCII characters or a mix of the two.
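
To see the "items, not bytes" behaviour, here is a minimal sketch. The helper name keep_14_items and the test strings are made up for illustration; the byte pair is just an arbitrary two-byte sequence with a high first byte.

```perl
use strict;

# Hypothetical helper: applies the regex from the question and returns what is kept.
sub keep_14_items {
    my ($s) = @_;
    $s =~ s/(([\x00-\x7F]|[\x80-\xff].){0,14}).*/$1/s;
    return $s;
}

my $ascii   = 'A' x 20;          # 20 one-byte characters
my $twobyte = "\xA5\xC7" x 20;   # 20 two-byte characters

print length(keep_14_items($ascii)), "\n";     # 14 bytes kept
print length(keep_14_items($twobyte)), "\n";   # 28 bytes kept (14 items of 2 bytes)
```

Both inputs are cut to 14 items, so the kept byte length differs depending on the mix of characters.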

I don't think it's possible to do what you want in a single regular expression (mixing things counting as 1-byte and 2-bytes to end up with a string of 14 bytes).  Is it acceptable to add other lines before/after the regular expression?
yes, using multiple lines of regular expression is acceptable, in order to get the job done.
To verify, you want a max length of 14 bytes with ASCII characters as 1 byte, Chinese characters as 2 bytes, and HTML entities as 2 bytes - correct?

The original regex you posted counts each Chinese character as only 1, not 2.

This should do what you want (assuming the answer to the above is yes).  It is easier and more straightforward to replace your regex with this code.

my @chars = split //, $mystring;
$mystring = '';
my $i = 0;
while ($i <= 14 and @chars) {
    my $tmp = shift @chars;
    my $len = 1;
    if ($tmp =~ m{[\x80-\xff]}) {
        $tmp .= shift @chars;
        $len = 2;
    } elsif ($tmp eq '&') {
        $tmp .= shift @chars while (@chars and $tmp !~ m{;$});
        $len = 2;
    }
    last if ($i + $len > 14);
    $mystring .= $tmp;
}

Sigh - helps if I finish modifying the code.  Ignore the last code and use this.
my @chars = split //, $mystring;
$mystring = '';
while (@chars) {
    my $tmp = shift @chars;
    my $len = 1;
    if ($tmp =~ m{[\x80-\xff]}) {
        $tmp .= shift @chars;
        $len = 2;
    } elsif ($tmp eq '&') {
        $tmp .= shift @chars while (@chars and $tmp !~ m{;$});
        $len = 2;
    }
    last if (length($mystring) + $len > 14);
    $mystring .= $tmp;
}

the code looks right to me where you assigned the length of 2 for an html entity, but when i tested it, the chinese (big-5) and english portions work fine, while an html entity doesn't behave like a chinese character.

for an &#12398; i think it's being treated as more than 2 in length.

for the code:

$mystring .= $tmp;

doesn't $mystring grow with the addition of each html entity?  length($mystring) then goes up by 8 for every entity, causing the loop to exit early?
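
To make the miscount concrete, a small sketch follows. The helper name display_width is hypothetical; it only contrasts length() with a weighted count in which an entity or a two-byte character counts as 2 and an ASCII character as 1. The entity pattern is the one kaufmed suggested.

```perl
use strict;

# Hypothetical helper: weighted width of a byte string.
sub display_width {
    my ($s) = @_;
    my $width = 0;
    # Walk the string one unit at a time: entity, two-byte character, or single byte.
    while ($s =~ /\G(?:(&#\w+;)|([\x80-\xff].)|(.))/gs) {
        $width += (defined $1 || defined $2) ? 2 : 1;
    }
    return $width;
}

my $three = '&#12398;' x 3;
print length($three), "\n";          # 24: every byte of each entity is counted
print display_width($three), "\n";   # 6: each entity counts as 2
```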

another case to handle: a few chinese characters contain a special character, ascii 5C (\), as their second byte,

which gives me an internal server error with some test data. can handling for this case be implemented too?

thx.
however, the script did a good job of not cutting an html entity in the middle.
Oops.  You're right.  Modified code which should fix the problem of using length...

Sorry - it's been a busy day and I keep trying to update code in between doing other things (and apparently keep making obvious mistakes).
my @chars = split //, $mystring;
$mystring = '';
my $length = 0;
while (@chars) {
    my $tmp = shift @chars;
    my $len = 1;
    if ($tmp =~ m{[\x80-\xff]}) {
        $tmp .= shift @chars;
        $len = 2;
    } elsif ($tmp eq '&') {
        $tmp .= shift @chars while (@chars and $tmp !~ m{;$});
        $len = 2;
    }
    $length += $len;
    last if ($length > 14);
    $mystring .= $tmp;
}

works perfect now.

is it possible to handle an error case where the input string contains the "backslash" character, ascii code (5C)? this causes an internal server error.  some big-5 chinese characters contain this byte.

another thing: the string is a sentence, and if it contains english, some english words get cut in the middle, which makes them meaningless.

the chinese and html entity is perfect now.
Are you saying that 5c should be omitted even if it follows 80-ff (eg is the second byte of a big-5 character) or that it should only be omitted if it is by itself (eg ASCII 5c)?
5c only takes place as the second byte of the big-5 character.

currently, if I have ascii 5c by itself in the string, it gets omitted, which is fine.

if i have two consecutive ascii 5c, it's being treated as one occurrence of one 5c

but when i have big-5 character where the second byte is ascii 5c, it gives me a run time error.

i was thinking: would putting another ascii 5c as an escape character right after a big-5 character whose second byte is ascii 5c resolve this issue?

or it should simply be omitted as you suggested?
I'm not sure.

What exactly is giving you an error?  If you are getting an error on a big-5 character like \x80\x5c, then something is not processing correctly (the second byte of a 2-byte big-5 char should never be interpreted as a separate character).

What is omitting the ASCII 5c by itself in the string?

What specifically are you doing with this 14 "byte" string?  I might be able to give better advice if I know what it's being used for.
Following is my complete test code with results in the comments:

#!/usr/local/bin/perl

use strict;

my $mystring = "\String For Testing"; 	# output: String For Tes
my $mystring = "\\String For Testing";	# output: \String For Te
my $mystring = "¿String For Testing";	# output: ¿String For T (behaves correctly)
my $mystring = "¿String For Testing";	# output: ?String For Te (where ? is the first byte A5, the big-5 character ascii A55C is splited)
my $mystring = "¿"; 			# output: internal server error (while execute in browser, run time error in terminal)

my @chars = split //, $mystring;
$mystring = '';
my $length = 0;
while (@chars) {
    my $tmp = shift @chars;
    my $len = 1;
    if ($tmp =~ m{[\x80-\xff]}) {
        $tmp .= shift @chars;
        $len = 2;
    } elsif ($tmp eq '&') {
        $tmp .= shift @chars while (@chars and $tmp !~ m{;$});
        $len = 2;
    }
    $length += $len;
    last if ($length > 14);
    $mystring .= $tmp;
}

print "Content-type: text/html\n\n"."$mystring"."\n";

Basically, I have a full paragraph of text, with a mixture of Big5 Chinese, English, and HTML entities.  It's not necessarily 14 bytes; I'm just using 14 for testing.

To test the above code, the script needs to be saved as a Big5-encoded document.

My situation is that I need to give a preview of the paragraph of length 14 (in this test example), and then I'll append "..." at the end.

Other than the Big5 character issue mentioned above, I also need to avoid cutting an English word in half.  I would rather show fewer than 14 characters than cut up an English word.

Hope this clarifies the issue I'm having trouble with.

Thanks.
The submission form won't allow me to post Big5 characters on this UTF-8 page.

On line 7 of the script, the Big5 character is A5C7 (hex).
On lines 8 and 9, the Big5 character is A55C (hex), where the second byte is the backslash.
This should fix the English word issue.

What would you like to do about the 5c issue?  Should it be stripped out or have another 5c appended after or something else?

#!/usr/local/bin/perl

use strict;
use warnings;

my $mystring = "\String For Testing";   # output: String For Tes
#my $mystring = "\\String For Testing";  # output: \String For Te
#my $mystring = "¿String For Testing";   # output: ¿String For T (behaves correctly)
#my $mystring = "¿String For Testing";   # output: ?String For Te (where ? is the first byte A5, the big-5 character ascii A55C is splited)
#my $mystring = "¿";                     # output: internal server error (while execute in browser, run time error in terminal)

my @chars = split //, $mystring;
$mystring = '';
my $length = 0;
while (@chars) {
    my $tmp = shift @chars;
    my $len = 1;
    if ($tmp =~ m{[\x80-\xff]}) {
        $tmp .= shift @chars;
        $len = 2;
    } elsif ($tmp eq '&') {
        $tmp .= shift @chars while (@chars and $tmp !~ m{;$});
        $len = 2;
    } elsif ($tmp ne ' ') {
        while ($chars[0] ne ' ' and $chars[0] =~ m{[\x00-\x7f]}) {
            $tmp .= shift @chars;
        }
        $len = length $tmp;
    }
    $length += $len;
    last if ($length > 14);
    $mystring .= $tmp;
}

print "Content-type: text/html\n\n"."$mystring"."\n";

One problem I just noticed is that you are using double-quoted strings for testing.  This will not work and will give errors (as \ is treated as an escape character inside double quotes).  Try changing the double-quotes to single-quotes (eg '\String For Testing' and not "\String For Testing").
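
A short sketch of the quoting difference, using made-up variable names. In a double-quoted literal Perl processes backslash escapes, and an unrecognized escape such as \S simply loses its backslash; a single-quoted literal keeps it.

```perl
use strict;

my $double  = "\String";    # unrecognized escape \S: the backslash is dropped
my $single  = '\String';    # single quotes keep the backslash
my $escaped = "\\String";   # an escaped backslash survives in double quotes

print "$double\n";             # String
print "$single\n";             # \String
print length($double), "\n";   # 6
print length($single), "\n";   # 7
```

This matches the first test result in the script above, where the leading \ disappeared from the output.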
english portion doing exactly what i'm looking for.

changing from double to single-quotes works for up to line 9, it does successfully and correctly analyze line 9 with \ in the big-5 character.

however, it's still giving me a run time error for test data line 10, just the single big-5 character by itself (contains \).

but if i add anything right after the big-5 character, the error is gone, it seems like the string cannot end with a backslash \.

for the 5c issue, can you modify the code to append a trailing backslash?  i manually add trailing backslashes and it solves the problem.
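
The trailing-backslash behaviour can be reproduced without any Big5 text, since to the Perl parser a character ending in byte 5C looks like any character followed by a backslash. A minimal sketch with hypothetical variable names, assuming the error comes from the string literal in the source file:

```perl
use strict;

# 'A\' would not compile: the backslash escapes the closing quote.
# Doubling the backslash ends the literal properly and yields one backslash.
my $ends_in_backslash = 'A\\';
print length($ends_in_backslash), "\n";   # 2

# Writing the bytes as hex escapes avoids a raw 5C byte in the source file.
my $big5_char = "\xA5\x5C";               # the byte pair A5 5C
print length($big5_char), "\n";           # 2
print $big5_char eq "\xA5\\" ? "same\n" : "different\n";   # same
```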

one more question, in my real working script, the string is going to be assigned using

($id, $mystring) = split(/\|/,$article);

will the string variable have the issue of single/double quotes?  or it won't be an issue for me if using this method?

thanks.
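
On the split question, a hedged sketch: quoting rules apply only to literals written in the source file, so data that arrives at runtime (read from a file or form, or returned by split) keeps its backslashes untouched. Note also that | is the regex alternation operator and has to be escaped to split on a literal pipe. The $article value below is invented for the example.

```perl
use strict;

# Hypothetical record: an id, a pipe, then text ending in a backslash byte.
my $article = '42|Text ending in a backslash' . "\x5C";

# An unescaped | is an alternation of two empty patterns; it matches between
# every character, so split returns single characters.
my @wrong = split(/|/, $article);

# Escaping the pipe splits on the literal delimiter. The limit of 2 keeps any
# later pipes inside the text.
my ($id, $mystring) = split(/\|/, $article, 2);

print scalar(@wrong), "\n";          # one element per character
print "$id\n";                       # 42
print substr($mystring, -1), "\n";   # \
```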
ASKER CERTIFIED SOLUTION
wilcoxon
This solution is only available to members.
it's performing exactly what i need right now.  thx.

btw, i removed the use warnings; as it's causing me an error, i'm using an old server which doesn't seem to have the warnings module installed.

what kind of situation will give $chars[0] an error originally?
If the @chars array runs out of characters while looking for the end of the word (eg if it is the last word in the input string).
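
For reference, a sketch of how the guards could look when the loop from earlier in the thread is wrapped in a subroutine. This is not the accepted solution (which is not shown here); the name preview and its $limit parameter are made up, and stopping a word at '&' is an added assumption that '&' always starts an entity.

```perl
use strict;

# Hypothetical helper. Weights: ASCII byte = 1, Big5 byte pair = 2, entity = 2.
sub preview {
    my ($text, $limit) = @_;
    my @chars = split //, $text;
    my ($out, $width) = ('', 0);

    while (@chars) {
        my $tmp = shift @chars;
        my $len = 1;
        if ($tmp =~ m{[\x80-\xff]}) {
            # Big5 lead byte: take the trail byte too, if there is one.
            $tmp .= shift @chars if @chars;
            $len = 2;
        } elsif ($tmp eq '&') {
            # Entity: read up to the closing ';', stopping if the input runs out.
            $tmp .= shift @chars while (@chars and $tmp !~ m{;$});
            $len = 2;
        } elsif ($tmp ne ' ') {
            # English word: keep it whole. Testing @chars first avoids reading
            # $chars[0] after the array is empty (the last word of the input).
            while (@chars and $chars[0] ne ' ' and $chars[0] ne '&'
                   and $chars[0] =~ m{[\x00-\x7f]}) {
                $tmp .= shift @chars;
            }
            $len = length $tmp;
        }
        last if $width + $len > $limit;
        $width += $len;
        $out .= $tmp;
    }
    return $out;
}

print preview('String For Testing', 14) . "...\n";   # String For ...
```

The result can end in a space when the next word does not fit; trimming it before appending "..." is left out to keep the sketch close to the thread's code.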
Thanks.