Solved

regex - urgent

Posted on 2003-03-31
Medium Priority
Last Modified: 2010-03-05
Hi All,

Can you please tell me what this regex actually does?

$data =~ s/([\x{80}-\x{FFFF}])/'&#' . ord($1) . ';'/gse;

The regex seems to work only for Latin-1.

Can you please help me handle multibyte characters, such as text encoded in Shift-JIS or EUC-KR?

Regards,
Lakshmi
Question by:lakshminair
10 Comments
 
LVL 5

Expert Comment

by:burtdav
ID: 8244094
The regex replaces every character with code \x80 (128) or above with its HTML numeric character reference, e.g. &#233; for é.
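
To see the effect, here is a minimal sketch that runs the substitution on a sample string. The sample text is made up for illustration, and it assumes Perl 5.6 or later, where \x{...} escapes are available.

```perl
use strict;
use warnings;

# Hypothetical sample: "café" followed by HIRAGANA LETTER A (U+3042).
my $data = "caf\x{E9} \x{3042}";

# /g replaces every match, /e evaluates the replacement as Perl code,
# /s only affects "." and makes no difference for this pattern.
$data =~ s/([\x{80}-\x{FFFF}])/'&#' . ord($1) . ';'/gse;

print "$data\n";   # caf&#233; &#12354;
```

This gives meaningful numbers only when Perl already holds the text as characters. If $data contains raw Shift-JIS or EUC-KR bytes, each byte of 0x80 or above is matched and numbered separately, which is why the output looks right only for Latin-1.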

But Perl works with bytes - not multibyte characters. See:
http://gershwin.ens.fr/vdaniel/Doc-Locale/Outils-Gnu-Linux/Perl/Html-Doc-at-www.perl.com/perlfaq6.html#How_can_I_match_strings_with_mul

Enter String::Multibyte.
http://search.cpan.org/author/SADAHIRO/String-Multibyte-1.03/Multibyte.pm

# Try this:
use String::Multibyte;
$sjis = String::Multibyte->new('ShiftJIS');
# split the string into (possibly multibyte) chars
@chars = $sjis->strsplit('', $data);
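# replace multibyte and "extended" (byte value > 127) chars with HTML entity reference string
# note: ord() sees only the first byte of a multibyte char, so the number is not its Unicode code point
$data = join '', map { length($_) < 2 && ord($_) < 128 ? $_ : '&#' . ord($_) . ';' } @chars;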

# one last point - I hope you're escaping your ampersands before this step to avoid ambiguities?
# for example, run this first:
$data =~ s/&/&amp;/g;
 

Author Comment

by:lakshminair
ID: 8244165
I need some more information: is there an equivalent of the Encode module's from_to function in Perl 5.6?
 
LVL 5

Expert Comment

by:burtdav
ID: 8244200
Can you explain that a different way?
 

Author Comment

by:lakshminair
ID: 8244216
I seem to be having problems with the regex above for accented characters.
I tried the Encode module on Perl 5.8, but it is not available on Perl 5.6. I was told that the pack function can be used to convert a UTF-8 string to ISO-8859.
 
LVL 5

Expert Comment

by:burtdav
ID: 8249595
String::Multibyte should solve your problem - have you tried it?

If you want info on pack, check http://www.perldoc.com/perl5.6/pod/perlfunc.html
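
As a module-free alternative to pack, the two-byte UTF-8 sequences that encode U+0080 to U+00FF can be folded back into single Latin-1 bytes with a substitution. This is only a sketch, not the pack recipe mentioned above: the sub name utf8_to_latin1 is made up, and it assumes the input is a plain byte string whose characters all exist in ISO-8859-1.

```perl
use strict;
use warnings;

# Hypothetical helper: convert UTF-8 bytes to ISO-8859-1 bytes.
# Only lead bytes 0xC2 and 0xC3 encode code points up to U+00FF;
# longer sequences (anything above U+00FF) are left untouched.
sub utf8_to_latin1 {
    my ($bytes) = @_;
    $bytes =~ s/([\xC2\xC3])([\x80-\xBF])/chr(((ord($1) & 0x03) << 6) | (ord($2) & 0x3F))/ge;
    return $bytes;
}

my $latin1 = utf8_to_latin1("caf\xC3\xA9");   # "café" as UTF-8 bytes

# Show the resulting bytes in hex.
print join(' ', map { sprintf '%02X', ord($_) } split //, $latin1), "\n";   # 63 61 66 E9
```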
 

Author Comment

by:lakshminair
ID: 8251111
Just one more question: is there some way I can detect the encoding of a string?

For example, given $str="abcd"; I need to know whether the encoding of this string is UTF-8, ISO-8859, etc.
Please help.
 
LVL 5

Expert Comment

by:burtdav
ID: 8251347
All Perl knows is that it's a string; 1 char = 1 byte.

It shouldn't matter unless it's a multibyte encoding, in which case use String::Multibyte's islegal function. You could test the string against each candidate encoding and use any for which the check succeeds.
Using the above code's definition of $sjis:
if ($sjis->islegal($data)) {
  # $sjis should be able to handle the string
}
 

Author Comment

by:lakshminair
ID: 8251370
Thanks for the input. I should have mentioned earlier that I cannot use any Perl modules for this functionality. Is there some regex I can use to detect this?
 
LVL 5

Accepted Solution

by:
burtdav earned 400 total points
ID: 8252530
That's a tough problem, then... all I can suggest is that you read http://gershwin.ens.fr/vdaniel/Doc-Locale/Outils-Gnu-Linux/Perl/Html-Doc-at-www.perl.com/perlfaq6.html#How_can_I_match_strings_with_mul (the bit about martian), and see if you can implement one of those hacks.
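
One module-free heuristic along those lines is to check whether the bytes form well-formed UTF-8. The sketch below uses a byte-level pattern that follows the UTF-8 encoding rules; the sub name looks_like_utf8 is made up for illustration. It cannot prove an encoding: plain ASCII passes as UTF-8 too, and a failure only says the data is something else, such as ISO-8859-1 or Shift-JIS.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the byte string is well-formed UTF-8.
sub looks_like_utf8 {
    my ($bytes) = @_;
    return $bytes =~ /\A(?:
          [\x00-\x7F]                        # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # 2-byte sequences
        | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, no overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3-byte, general
        | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, no surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4-byte, planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # 4-byte, planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4-byte, plane 16
    )*\z/x ? 1 : 0;
}

print looks_like_utf8("caf\xC3\xA9"), "\n";   # 1 (UTF-8 bytes)
print looks_like_utf8("caf\xE9"), "\n";       # 0 (Latin-1 byte)
print looks_like_utf8("abcd"), "\n";          # 1 (ASCII is valid UTF-8)
```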
 

Author Comment

by:lakshminair
ID: 8252628
Thanks a lot for the prompt response.
We are using pack to convert UTF-8 to ISO-8859 in one particular module. We are still having trouble in the other modules, where we see some data corruption. It seems that Perl is setting some flag internally.
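
The flag in question is probably Perl's internal UTF-8 flag, although nothing in this thread confirms it. Here is a sketch of how that flag can corrupt data, assuming Perl 5.8.1 or later (utf8::is_utf8 does not exist in 5.6). When a byte string is joined with a string that Perl holds as characters, each byte is upgraded as if it were a Latin-1 character, so UTF-8 byte sequences end up double-encoded on output.

```perl
use strict;
use warnings;

my $bytes = "caf\xC3\xA9";   # "café" as raw UTF-8 bytes, flag off
my $chars = "\x{3042}";      # a character above 0xFF, flag on

# The result is flagged; each byte of $bytes now counts as one character.
my $mixed = $bytes . $chars;

my $bytes_state = utf8::is_utf8($bytes) ? 'flagged' : 'bytes';
my $mixed_state = utf8::is_utf8($mixed) ? 'flagged' : 'bytes';

print "$bytes_state\n";        # bytes
print "$mixed_state\n";        # flagged
print length($mixed), "\n";    # 6
```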