
UTF8 HTML to UTF8 Text

rstaveley asked
I have a template system that generates UTF8 HTML, and I want to generate multipart/alternative e-mails with it. I've handled the issue of inline CSS in the HTML; it is the text part that I'm trying to get right. What I'd like is to be able to take my UTF8 HTML and generate UTF8 text from it.

http://search.cpan.org/~gaas/HTML-Format-1.23/lib/HTML/FormatText.pm looks like a good way to convert HTML e-mail to text, but it is limited to Latin1. I wonder what the best approach would be for handling UTF8?


Commented:
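When a Perl script contains UTF8 text in its own source, we need to enable the utf8 pragma, which tells Perl that the source file itself is encoded as UTF-8, so string literals are read as characters rather than raw bytes. I usually do this alongside strict and warnings at the start of the script, like so: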

#!/usr/bin/perl
use warnings;
use strict;
use utf8;

You should then be able to put UTF8 data into arrays, databases, text files and generally do whatever you want with it.
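As a minimal sketch of what the pragma changes (the sample word and variable names are placeholders, not from the thread): with use utf8 in effect a literal is a string of characters, and it still has to be encoded to bytes before it goes to a file, socket or database handle.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                      # string literals below are UTF-8 in the source file
use Encode qw/encode/;

my $word  = "中文";                      # placeholder text: two characters
my $bytes = encode('UTF-8', $word);      # the same text as UTF-8 bytes

my $char_count = length($word);          # counts characters
my $byte_count = length($bytes);         # counts bytes

binmode(STDOUT, ':encoding(UTF-8)');     # encode characters on the way out
print "$word: $char_count characters, $byte_count bytes\n";
```

Without the binmode line Perl would typically emit a "Wide character in print" warning, which is a useful hint that an output layer or an explicit encode() is missing.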

I say generally because on rare occasions you may run into hiccups where the UTF8 text matches a regex or field delimiter or something similar. Most Perl modules on CPAN work just fine with UTF8 text, but some have problems. To get around these problems I usually put in a simple hack that encodes the text into a safe format at that point and carry on regardless; the CPAN modules should handle the encoded text just fine.

I generally use the HTML::Entities module to do this 'encoding', or alternatively simple substitution routines like the ones attached, to encode and decode the text into something safe for handling within the Perl code.

I hope this knowledge and advice helps you with your problem. Let me know if you have any further questions.
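#!/usr/bin/perl
use warnings;
use strict;
use utf8;
use Encode qw/encode decode/;

# Percent-encode every byte that is not an ASCII letter or digit.
# The text is converted to UTF-8 bytes first so that multi-byte
# characters become one %xx escape per byte.
sub encodeUTF8 {
    my ($plaintext) = @_;
    my $bytes = encode('UTF-8', $plaintext);
    $bytes =~ s/([^a-zA-Z0-9])/sprintf('%%%02x', ord($1))/eg;
    return $bytes;
}

# Reverse of encodeUTF8: turn each %xx escape back into a byte,
# then decode the UTF-8 bytes into characters.
sub decodeUTF8 {
    my ($encoded) = @_;
    $encoded =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
    return decode('UTF-8', $encoded);
}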


Author

Commented:
You're right - it works.

I'm using...

  $text_body =~ s/<.+?>//sg;

...which is probably good enough for my purposes for the plain text. The trick is use utf8, as you pointed out, and I probably ought to have realised that.
#!/usr/bin/perl

use strict;
use warnings;
use MIME::Lite;
use Encode qw/encode/;
use Time::Format qw/%time/;
use utf8;

my $subject = "¿¿¿¿¿¿¿ $time{'hh:mm:ss'} - Greetings from The Occident";
my $body = <<EOT;
<html>
<body>
<h2 style='color: gold'>Message at $time{'yyyy:mm:dd hh:mm:ss'}</h2>
<p style='color: green'>¿¿¿¿¿¿¿</p>
<p style="color: blue">Greetings from The Occident</p>
</body>
</html>
EOT

my $text_body = $body;
$text_body =~ s/<.+?>//sg;

# Create the multipart container
my $msg = MIME::Lite->new(
	From	=> 'Barny Rubble <barny.rubble@sample.com>',
	To	=> 'Fred Bassett <fred.bassett@sample.com>',
	Subject	=> encode('MIME-Header', $subject),
	Type	=> 'multipart/alternative',
);

#print STDERR encode("UTF-8", $text_body);

$msg->attach(
	Type	=> 'text/plain',
	Data	=> encode("UTF-8", $text_body),
	Encoding=> 'quoted-printable',
);

$msg->attach(
	Type	=> 'text/html',
	Data	=> encode("UTF-8", $body),
	Encoding=> 'quoted-printable',
);

$msg->send(
	'smtp', 'mail.sample.com',
	Hello => 'www.sample.com',	# Public HELO address
);
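If the bare s/<.+?>//sg substitution used above turns out to be too crude, one possible refinement is sketched here. html_to_text is a hypothetical helper, not part of any module mentioned in this thread, and the sample markup is a placeholder. It is still regex-based rather than a real HTML parser, and it does not decode entities such as &amp;amp; (decode_entities from HTML::Entities could be added for that).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;

# Hypothetical helper: a rough HTML-to-text pass built around the same
# tag-stripping substitution. Not a real HTML parser.
sub html_to_text {
    my ($html) = @_;
    my $text = $html;
    $text =~ s/<br\s*\/?>/\n/gi;                      # <br> becomes a newline
    $text =~ s/<\/(?:p|div|h[1-6]|li|tr)\s*>/\n/gi;   # block ends become newlines
    $text =~ s/<.+?>//sg;                             # strip all remaining tags
    $text =~ s/[ \t]+/ /g;                            # collapse runs of spaces and tabs
    $text =~ s/^ +| +$//mg;                           # trim each line
    $text =~ s/\n{2,}/\n/g;                           # collapse blank lines
    $text =~ s/\A\n+|\n+\z//g;                        # trim leading and trailing newlines
    return $text;
}

# Placeholder markup for illustration only.
my $html = "<html>\n<body>\n<h2 style='color: gold'>你好</h2>\n"
         . "<p>Greetings from The Occident</p>\n</body>\n</html>\n";
my $text = html_to_text($html);

binmode(STDOUT, ':encoding(UTF-8)');
print "$text\n";
```

Because the string holds characters rather than bytes, the substitutions never split a multi-byte character; the result is encoded only when it is printed or attached.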


Author

Commented:
Those upside-down question marks of course had fine-looking Mandarin in them.
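For anyone wondering what encode('MIME-Header', $subject) in the snippet does with such a subject, here is a small sketch using only the core Encode module. The subject text is a placeholder, since the original characters did not survive posting.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Encode qw/encode decode/;

# Placeholder subject; the thread's original Mandarin text was lost.
my $subject = "中文 - Greetings from The Occident";

# Header fields must be ASCII, so non-ASCII text is wrapped in
# RFC 2047 "encoded words" of the form =?charset?encoding?data?=
my $header = encode('MIME-Header', $subject);

# A mail client reverses the process when it displays the subject.
my $round_trip = decode('MIME-Header', $header);

print "Subject: $header\n";
```

The encoded header is plain ASCII, so it can be printed or handed to MIME::Lite without any further encoding step.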

Author

Commented:
I failed to add charset=utf-8 in my code snippet, and it is needed, because the recipient needs to know that the character set is UTF-8. For the sake of the PAQ, if anyone happens upon this thread, please refer to the corrected code snippet below.
#!/usr/bin/perl

use strict;
use warnings;
use MIME::Lite;
use Encode qw/encode/;
use Time::Format qw/%time/;
use utf8;

my $subject = "¿¿¿¿¿¿¿ $time{'hh:mm:ss'} - Greetings from The Occident";
my $body = <<EOT;
<html>
<body>
<h2 style='color: gold'>Message at $time{'yyyy:mm:dd hh:mm:ss'}</h2>
<p style='color: green'>¿¿¿¿¿¿¿</p>
<p style="color: blue">Greetings from The Occident</p>
</body>
</html>
EOT

my $text_body = $body;
$text_body =~ s/<.+?>//sg;

# Create the multipart container
my $msg = MIME::Lite->new(
	From	=> 'Barny Rubble <barny.rubble@sample.com>',
	To	=> 'Fred Bassett <fred.bassett@sample.com>',
	Subject	=> encode('MIME-Header', $subject),
	Type	=> 'multipart/alternative',
);

#print STDERR encode("UTF-8", $text_body);

$msg->attach(
	Type	=> 'text/plain; charset=utf-8',
	Data	=> encode("UTF-8", $text_body),
	Encoding=> 'quoted-printable',
);

$msg->attach(
	Type	=> 'text/html; charset=utf-8',
	Data	=> encode("UTF-8", $body),
	Encoding=> 'quoted-printable',
);

$msg->send(
	'smtp', 'mail.sample.com',
	Hello => 'www.sample.com',	# Public HELO address
);
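To illustrate why the charset parameter matters, here is a small sketch. The sample text is a placeholder, and ISO-8859-1 is only an assumed example of what a mail client might guess when no charset is declared. The same bytes give different text depending on how the recipient decodes them.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Encode qw/encode decode/;

my $original = "中文";                            # placeholder body text
my $wire     = encode('UTF-8', $original);        # the bytes that go into the part

# A recipient told charset=utf-8 recovers the original characters.
my $read_as_utf8 = decode('UTF-8', $wire);

# A recipient that assumes ISO-8859-1 turns every byte into its own
# character, which shows up as mojibake.
my $read_as_latin1 = decode('ISO-8859-1', $wire);

my $utf8_ok       = $read_as_utf8 eq $original ? 1 : 0;
my $latin1_length = length($read_as_latin1);

print "UTF-8 reading matches the original: $utf8_ok\n";
print "ISO-8859-1 reading has $latin1_length characters instead of ",
      length($original), "\n";
```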
