Solved

Perl & Mysql UTF-8 Encoding (encoding flag, really) problem

Posted on 2009-04-10
7
2,230 Views
Last Modified: 2012-08-14
I've run into a problem with encoding (UTF-8, to be clear) and perl.  I know:  join the club.
The short of it is:  I'm trying to read UTF-8  data from a mysql database and post it to a web page.  Unfortunately, it's not working out quite like that.

I believe I've narrowed the problem to the perl/mysql interaction.  The data is stored in the database in UTF-8 encoding, and I've set up the mysqld settings (as below) to (try to) enusre all connections to the DB understand that it's talking utf-8.  Unfortunately, when I test retrieved data for the utf-8 flag, it's showing not UTF-8.  Below the mysqld section is my test perl script.  My output when I run this is "NOT UTF8".  

That's not what I expected, and I think my encoding problems are stemming from already-encoded text getting encoded again as it's being outputted.

Can anyone help?
---------------------- my.cnf --------------------------------
 
[mysqld]
<snip>
init_connect='SET collation_connection = utf8_general_ci'
init_connect='SET NAMES utf8'
default-character-set=utf8
character-set-server=utf8
collation-server=utf8_general_ci
skip-character-set-client-handshake
 
 
---------------------- mysqltst.pl --------------------------------
#!/usr/bin/perl -w -CS
#Usage: ./mysqltst.pl DB_password
 
#use LWP::Simple;
use LWP::UserAgent;
use Data::Dumper;
use POSIX qw(strftime);
#use HTML::GenerateUtil;
#use DBI;
use DBD::mysql;
use Encode;
use encoding "utf8";
#use utf8;
 
#instantiate our objects
# Database vars
my $db ="<db-name>";
my $user = "<DB-USER>";
my $pass = shift;
my $host="localhost";
my $query = "SELECT .....'";
 
# Set up DB connection
my $dbh = DBI->connect("DBI:mysql:$db:$host", $user, $pass);
$dbh->{'mysql_enable_utf8'} = 1;
$dbh->do("set character set utf8");
$dbh->do("SET NAMES 'utf8'");
$dbh->do("SHOW VARIABLES LIKE 'character_set_%'"); 
my $sqlQuery  = $dbh->prepare($query)
or die "Can't prepare $query: $dbh->errstr\n";
my $rv = $sqlQuery->execute
or die "can't execute the query: $sqlQuery->errstr";
 
while (@row= $sqlQuery->fetchrow_array()) {
#get page names from DB
#my $pagename = Encode::decode("UTF-8",$row[0]);
my $pagename = ($row[0] );
#$pagename = utf8::downgrade($pagename) ;
#Get page source from DB
#my $content=utf8::decode($row[1]);
#my $content = Encode::decode("utf-8",$row[1]);
my $content=$row[1];
#Encode::_utf8_on($content);
if (! utf8::is_utf8($content)) {
  print "NOT UTF8\n";
} 
#print $pagename."\n";
#print $content;
}

Open in new window

0
Comment
Question by:laneduncan
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 4
  • 2
7 Comments
 
LVL 3

Accepted Solution

by:
GarthSnyder earned 500 total points
ID: 24120710
Welcome to hell! :-)  Two suggestions, Lane:

Make sure your DBI and DBD::mysql Perl modules are up to date. Proper marking of UTF8 strings has only been around for a couple of years. See this bug and this one.

I am a little suspicious of setting the mysql_enable_utf8 option after establishing the connection. Is this affirmatively documented to work? You might try including it in the connection string as shown below.

my $dbh = DBI->connect("DBI:mysql:$db:$host;mysql_enable_utf8=1;", $user, $pass);

Open in new window

0
 
LVL 10

Author Comment

by:laneduncan
ID: 24134317
Thanks.  I did some digging, and I see that my DBD module is version 1.23.00 r11.21.  It's the latest that Red Hat supplies, of course.

So I switched back to DBI, which is at v1.52.  I tried, in any event, the "Ugly workaround" mentioned in the bug (a neat trick, BTW), but that didn't seem to make any difference.

Man, as widespread as is Perl, I'd have thought things like this would have been worked out long ago.

I added the mysql_enable_utf8=1 parm to the connect string, as well; that makes a lot more sense.  Unhappily, I'm still getting the same result.

I'll see if I can get some up-to-date .pm files for the DB connect module.  Who's betting I'll avoid dependency hell?  :)

Thanks!
0
 
LVL 39

Expert Comment

by:Adam314
ID: 24139508
If your system is properly configured (eg: you have compiler and make programs installed), you should be able to install the latest version of modules directly from CPAN, and not have to worry about what red hat supplies.

To install in the default location, as root, run:
    cpan Module::Name
So, for example:
    cpan DBI
    cpan DBD::mysql

If you want to install to a different location (leaving the originals alone), download the files, and when you run:
    perl Makefile.PL
add the prefix option, like so:
    perl Makefile.PL PREFIX=/some/other/path
Then run the rest of the installation as normal.
0
What is SQL Server and how does it work?

The purpose of this paper is to provide you background on SQL Server. It’s your self-study guide for learning fundamentals. It includes both the history of SQL and its technical basics. Concepts and definitions will form the solid foundation of your future DBA expertise.

 
LVL 10

Author Closing Comment

by:laneduncan
ID: 31569983
Thanks for those tips!
I did update the DBI module (DBD wouldn't compile, and even the CPAN install script wouldn't function for it), and things are better.

I don't think it's fixed the way things ought to work, but it works, and that's all I really care about at this point.

Here's what I had to do to get it to work:
I updated the DBI module to v1.607
I modified my connect string as you suggested (thanks for that!)

This, in itself, didn't work, but when I manually flagged the text as utf8, thusly:
Encode::_utf8_on($content);

I magically was able to decode it.  I should note that I'd tried this before updating the DBI module, and it hadn't worked, so that made a difference.

I don't think you're supposed to have to turn on the utf8 flag manually, but hey, it's working.  I'm all good with that.
0
 
LVL 10

Author Comment

by:laneduncan
ID: 24141003
Yeah, that's what I'd tried (both methods) on a reasonably stock and updated RHEL 5 box.  DBI went through fine, but DBD doesn't do good things, with output as below.

It finally ends with:
  /usr/bin/make -j3 -- NOT OK
Running make test
  Can't test without successful make
Running make install
  make had returned bad status, install seems impossible

Since things are functional, I'm not looking to spend much time to fix it (though I acknowledge it needs to be fixed, and it'll probably bite me again).  


dbdimp.c: At top level:
dbdimp.c:2153: error: expected \u2018)\u2019 before \u2018val\u2019
dbdimp.c: In function \u2018mysql_db_FETCH_attrib\u2019:
dbdimp.c:2182: error: \u2018imp_dbh_t\u2019 has no member named \u2018has_transactions\u2019
dbdimp.c:2199: error: \u2018imp_dbh_t\u2019 has no member named \u2018auto_reconnect\u2019
dbdimp.c:2199: warning: passing argument 2 of \u2018Perl_newSViv\u2019 makes integer from pointer without a cast
dbdimp.c:2204: error: \u2018imp_dbh_t\u2019 has no member named \u2018bind_type_guessing\u2019
dbdimp.c:2204: warning: passing argument 2 of \u2018Perl_newSViv\u2019 makes integer from pointer without a cast
dbdimp.c:2208: error: \u2018imp_dbh_t\u2019 has no member named \u2018pmysql\u2019
dbdimp.c:2212: error: \u2018imp_dbh_t\u2019 has no member named \u2018pmysql\u2019

Open in new window

0
 
LVL 3

Expert Comment

by:GarthSnyder
ID: 24141192
Well, it's you that's educating me, I'm afraid... I thought I understood how this works, but now that I go back and look at my UTF code, I realize that I'm getting exactly the same behavior you describe. And that's with DBI 1.607 and DBD::mysql 4.011 (released yesterday!), "SET NAMES utf8", and "SET CHARACTER SET utf8", and "mysql_utf8_enable=1" in the connect string. I get proper UTF8 data but Perl's UTF8 flag is not set. I just never realized it because I wasn't relying on the flag status.

I notice that the DBD::mysql man page claims the flag is set only "if necessary", but I have tried queries that return actual UTF code points and the flag is still not set. The encodings for the table and column are all fine.

It does seem to be working for this guy (see the comments) and some others. I wonder what the distinguishing feature is.
0
 
LVL 10

Author Comment

by:laneduncan
ID: 24141372
<GRIN> it's a mess, huh?  Maybe one day I'll get the gumption up to dig into this some more. As much pain as this has caused folks over the years, here's hoping it's a high priority on someone's plate!
0

Featured Post

Technology Partners: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

As a database administrator, you may need to audit your table(s) to determine whether the data types are optimal for your real-world data needs.  This Article is intended to be a resource for such a task. Preface The other day, I was involved …
Checking the Alert Log in AWS RDS Oracle can be a pain through their user interface.  I made a script to download the Alert Log, look for errors, and email me the trace files.  In this article I'll describe what I did and share my script.
Explain concepts important to validation of email addresses with regular expressions. Applies to most languages/tools that uses regular expressions. Consider email address RFCs: Look at HTML5 form input element (with type=email) regex pattern: T…
Six Sigma Control Plans

729 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question