Solved

regex - urgent

Posted on 2003-03-31
Last Modified: 2010-03-05
Hi All,

Can you please tell me what this regex actually does?

$data =~ s/([\x{80}-\x{FFFF}])/'&#' . ord($1) . ';'/gse;

The regex seems to work only for Latin-1.

Can you please help me handle multibyte characters, such as words encoded in Shift-JIS and EUC-KR?

Regards,
Lakshmi
Question by:lakshminair
10 Comments
 
LVL 5

Expert Comment

by:burtdav
ID: 8244094
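The regex replaces every character from \x{80} (code point 128) up to \x{FFFF} with an HTML numeric character reference of the form &#NNN;, where NNN is the decimal value returned by ord.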
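As a rough illustration of that substitution (a minimal sketch, assuming $data already holds a character string rather than raw Shift-JIS or EUC-KR bytes; the sample text is made up):

```perl
use strict;
use warnings;

# Sample character string (an assumption for illustration): "cafe" with an
# accented e (U+00E9), followed by a euro sign (U+20AC).
my $data = "caf\x{E9} \x{20AC}5";

# /g replaces every match, /e evaluates the replacement as Perl code,
# /s has no effect here because the pattern contains no dot.
$data =~ s/([\x{80}-\x{FFFF}])/'&#' . ord($1) . ';'/gse;

print "$data\n";   # prints: caf&#233; &#8364;5
```

The numbers are only meaningful when each character really is one code point. If $data holds undecoded multibyte bytes, each byte above 127 is replaced on its own.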

But Perl works with bytes - not multibyte characters. See:
http://gershwin.ens.fr/vdaniel/Doc-Locale/Outils-Gnu-Linux/Perl/Html-Doc-at-www.perl.com/perlfaq6.html#How_can_I_match_strings_with_mul

Enter String::Multibyte.
http://search.cpan.org/author/SADAHIRO/String-Multibyte-1.03/Multibyte.pm

# Try this:
use String::Multibyte;
$sjis = String::Multibyte->new('ShiftJIS');
# split the string into (possibly multibyte) chars
@chars = $sjis->strsplit('', $data);
# replace multibyte and "extended" (byte value > 127) chars with an HTML entity reference string
# (caveat: ord returns the value of the first byte only, not a Unicode code point)
$data = join '', map { length($_) < 2 && ord($_) < 128 ? $_ : '&#' . ord($_) . ';' } @chars;

# one last point - I hope you're escaping your ampersands before this to avoid ambiguities?
# (run this before the conversion above, or the new '&#' prefixes get escaped too)
$data =~ s/&/&amp;/g;
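
To illustrate why the ampersand escaping has to come first, here is a small sketch using the regex from the question (the sample string and the variable names $right and $wrong are made up for the example):

```perl
use strict;
use warnings;

my $input = "R&D caf\x{E9}";   # sample text containing a literal ampersand

# Right order: escape literal ampersands, then create the numeric references.
my $right = $input;
$right =~ s/&/&amp;/g;
$right =~ s/([\x{80}-\x{FFFF}])/'&#' . ord($1) . ';'/ge;

# Wrong order: the '&' of each freshly created reference gets escaped as well.
my $wrong = $input;
$wrong =~ s/([\x{80}-\x{FFFF}])/'&#' . ord($1) . ';'/ge;
$wrong =~ s/&/&amp;/g;

print "$right\n";   # prints: R&amp;D caf&#233;
print "$wrong\n";   # prints: R&amp;D caf&amp;#233;
```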
 

Author Comment

by:lakshminair
ID: 8244165
I need some more information: is there an equivalent of Encode's from_to function in Perl 5.6?
 
LVL 5

Expert Comment

by:burtdav
ID: 8244200
Can you explain that a different way?
 

Author Comment

by:lakshminair
ID: 8244216
I seem to be having problems with the regex above for accented characters.
I tried the Encode module on Perl 5.8, but it is not supported on Perl 5.6. I was told that the pack command can be used to convert a UTF-8 string to ISO-8859.
 
LVL 5

Expert Comment

by:burtdav
ID: 8249595
String::Multibyte should solve your problem - have you tried it?

If you want info on pack, check http://www.perldoc.com/perl5.6/pod/perlfunc.html
 

Author Comment

by:lakshminair
ID: 8251111
One more question: is there some way I can detect the encoding of a string?

For example, given $str="abcd"; I need to know whether the encoding of this string is UTF-8, ISO-8859, etc.
Please help.
 
LVL 5

Expert Comment

by:burtdav
ID: 8251347
All Perl knows is that it's a string; 1 char = 1 byte.

It shouldn't matter unless it's a multibyte encoding, in which case use String::Multibyte's islegal function. You could test each encoding the string might be in, and use any for which the check succeeds.
Using the above code's definition of $sjis:
if ($sjis->islegal($data)) {
  # $sjis should be able to handle the string
}
 

Author Comment

by:lakshminair
ID: 8251370
Thanks for the input. I should have mentioned earlier that I cannot use any Perl modules for this functionality. Is there some regex I can use to detect this?
 
LVL 5

Accepted Solution

by:
burtdav earned 400 total points
ID: 8252530
That's a tough problem, then... all I can suggest is that you read http://gershwin.ens.fr/vdaniel/Doc-Locale/Outils-Gnu-Linux/Perl/Html-Doc-at-www.perl.com/perlfaq6.html#How_can_I_match_strings_with_mul (the bit about martian), and see if you can implement one of those hacks.
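
For the UTF-8 case specifically, one module-free approach is to test whether the bytes form well-formed UTF-8. This is not one of the perlfaq6 hacks but a widely circulated byte-level pattern, shown here as a sketch; the helper name looks_like_utf8 is made up for illustration. It is only a heuristic: a pass means the bytes are consistent with UTF-8, not that they were written as UTF-8, and pure ASCII passes under almost any encoding.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the byte string is well-formed UTF-8, else 0.
sub looks_like_utf8 {
    my ($bytes) = @_;
    return $bytes =~ m/\A(?:
          [\x00-\x7F]                        # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # 2-byte sequence, no overlongs
        | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4-byte, planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # 4-byte, planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4-byte, plane 16
    )*\z/x ? 1 : 0;
}

# "cafe" with an accented e, once as UTF-8 bytes and once as Latin-1 bytes
print looks_like_utf8("caf\xC3\xA9") ? "utf-8\n" : "not utf-8\n";   # prints: utf-8
print looks_like_utf8("caf\xE9")     ? "utf-8\n" : "not utf-8\n";   # prints: not utf-8
```

Telling Shift-JIS from EUC-KR this way is much less reliable, because their byte ranges overlap heavily.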
 

Author Comment

by:lakshminair
ID: 8252628
Thanks a lot for the prompt response.
We are using the pack command to convert UTF-8 to ISO-8859 in one particular module. We are still having trouble in other modules, where we see some data corruption. It seems that Perl is setting some flag internally.