beelineuk

Read/Write unicode files with ActivePerl on Windows

Hello,

Can anyone give me an example of how to handle Unicode when reading and writing files in Perl?

I'm using ActivePerl on Windows and need to read in a Unicode file, modify the contents, and then write the result to a new file while retaining the Unicode format. I have it working fine in ASCII, but Unicode is proving quite a challenge. A worked example would be ideal.

Thanks,

Mike
kandura

What kind of encoding is the file in? UTF-8?
Have a look at the -C command line switch. (http://www.perldoc.com/perl5.8.4/pod/perlrun.html#Command-Switches)
beelineuk

ASKER

The encoding is Unicode as defined by Windows Notepad. This is not UTF-8.

The -C switch doesn't help, nor does use Encode::unicode;
ASKER CERTIFIED SOLUTION
kandura

This solution is only available to members.
I get the following error, even when the source file contains plain text chars.

Wide character in print at C:\sample.pl line 6, <> line 1.

It's a warning, not an error. I saw it too, but I doubt it causes problems.
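For anyone wanting to see where that warning comes from, here is a minimal sketch. It assumes the warning appears whenever a character above 0xFF is printed to a handle that has no encoding layer; the temp files and variable names are only for illustration and are not from the hidden solution.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

my $text = "\x{0416}\n";    # one character above 0xFF (Cyrillic ZHE)

# 1. No encoding layer on the handle: Perl warns "Wide character in print".
my ($plain_fh, $plain_name) = tempfile(UNLINK => 1);
print {$plain_fh} $text;
close $plain_fh;
my @without_layer = @warnings;

# 2. Explicit encoding layer: the same print produces no warning.
@warnings = ();
my ($enc_fh, $enc_name) = tempfile(UNLINK => 1);
binmode $enc_fh, ':raw:encoding(UTF-16LE)';
print {$enc_fh} $text;
close $enc_fh;
my @with_layer = @warnings;

print "without layer: ", scalar(@without_layer), " warning(s)\n";
print "with layer:    ", scalar(@with_layer), " warning(s)\n";
```

So the warning is usually a hint that the handle being printed to is missing its encoding layer, which is worth checking before ignoring it.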
OK, I finally got the output to resemble the input with the following variation on the above. The problem I have now is that it's mangling the carriage returns. If I don't use chomp, I get an extra trailing byte that corrupts the data. With chomp, I lose the line breaks completely. This is probably simple, but how do I add the line break back in? Everything I've tried makes it go wrong again.

use Encode;

use open ":encoding(UTF-16LE)";
open my $out, ">:encoding(UTF-16LE)", "output.txt" or die;

while($line=<>) {
    chomp($line);
    print $out $line;
}
I noticed that too, which is why I did the binmode() on the output handle. I didn't need chomp after that.
I don't think you need to specify the output layer on your "open my $out" though, since the "use open" already takes care of that.
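A rough sketch of how the layers could be spelled out on both handles, so nothing depends on defaults. The assumption here is that the Windows default CRLF translation works on bytes and should be kept away from the 16-bit data, which is why each layer list starts with :raw. The sub name filter_utf16le and its callback argument are made up for this example.

```perl
use strict;
use warnings;

# Copy a UTF-16LE text file line by line, passing each line's text
# (without its line ending) through the $edit callback.
sub filter_utf16le {
    my ($in_name, $out_name, $edit) = @_;

    # ":raw" drops any default CRLF translation before the encoding
    # layer is added, so the bytes of each code unit are left alone.
    open my $in,  '<:raw:encoding(UTF-16LE)', $in_name
        or die "Cannot read $in_name: $!";
    open my $out, '>:raw:encoding(UTF-16LE)', $out_name
        or die "Cannot write $out_name: $!";

    while (my $line = <$in>) {
        # Take the line ending off by hand and put the same one back,
        # instead of relying on chomp and a bare "\n".
        my $eol = $line =~ s/(\r?\n)\z// ? $1 : '';
        print {$out} $edit->($line), $eol;
    }

    close $in;
    close $out or die "Cannot close $out_name: $!";
}
```

A call such as filter_utf16le('input.txt', 'output.txt', sub { $_[0] }) should then give a plain copy. A byte order mark at the start of the file comes through as the character U+FEFF at the start of the first line, so the callback needs to leave it in place.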
OK, but then some of my special characters, such as ß, start disappearing.
Why would that happen? I tested this with a bit of Russian, and even after adding a ß it kept working fine.
Did it happen after adding the binmode?
What kind of modifications are you doing to the text?
It's weird: if I use your exact example, the first ß is OK, but after that I lose them all. This is my test file:

Diamant-Weiß      Celica
Diamant-Weiß      Celica
Diamant-Weiß      Celica
Diamant-Weiß      Celica

This is the output:

Diamant-Weiß      Celica
Diamant-Wei      Celica
Diamant-Wei      Celica
Diamant-Wei      Celica
Odd. Works fine for me. How does a hex dump of the output look?
OK, the original file ends each line with:

0D 00 0A 00

The copy ends the first line like this:

0D 00 00
Looks like "\n" characters have been dropped. Do you still have a "chomp()" in there? Any other line ending modifications? Are you doing any other modifications?
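To make the hex dump comparison easier to reason about, here is a small sketch of what the line ending looks like in UTF-16LE before and after chomp. It assumes the decoded line still carries both "\r" and "\n", as it would when no CRLF translation has been applied. The helper name utf16le_hex is invented for this example, and the sketch does not try to reproduce the exact bytes in the dump above.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hex string of the UTF-16LE bytes for a piece of text.
sub utf16le_hex { return unpack 'H*', encode('UTF-16LE', $_[0]) }

my $line = "Celica\r\n";          # a decoded line with a Windows line ending

# The full ending is four bytes: 0D 00 0A 00.
my $full = utf16le_hex("\r\n");

# chomp only removes the trailing "\n", so the "\r" stays behind.
chomp(my $chomped = $line);
my $tail = utf16le_hex(substr $chomped, -1);

print "full ending: $full\n";     # 0d000a00
print "after chomp: $tail\n";     # 0d00
```

That would explain why chomp alone leaves a stray 0D 00 on every line and no real line break.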
I was using chomp without the binmode, as this gives me the closest result to what I'm looking for. However, I have since found that even once the initial problem of reading and writing Unicode files is overcome, as soon as you try to use regular expressions within the logic it all gets corrupted anyway, as they don't handle UTF-16.

I've now abandoned this technology for what I'm doing, as it doesn't offer the level of Unicode support I need and is more trouble than it's worth. Despite this, please have the points for your help on this matter, even though it did not reach a resolution.
I appreciate the gesture of giving me the points, but I'm not too thrilled about the C grade, or the fact that you gave up so easily.

I did get expected results with even the small snippet I gave you. I tried it with a regular expression that modified "Weiß" into "Schwartz", and that worked perfectly.
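A minimal sketch of that kind of test, this time with a replacement that is itself outside ASCII. The idea is that the regular expression runs on decoded characters rather than on the raw UTF-16 bytes. The sample line and the Russian replacement word are just illustration data.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they would sit in a UTF-16LE file: "Diamant-Weiß<TAB>Celica\r\n".
my $bytes = encode('UTF-16LE', "Diamant-Wei\x{DF}\tCelica\r\n");

# Decode to characters first; the regex then sees "ß" as one character.
my $text = decode('UTF-16LE', $bytes);

# Replace "Weiß" with the Russian word for white (five Cyrillic letters).
my $white = "\x{0411}\x{0435}\x{043B}\x{044B}\x{0439}";
(my $edited = $text) =~ s/Wei\x{DF}/$white/;

# Encode again only when writing; every character here takes two bytes.
my $out_bytes = encode('UTF-16LE', $edited);

printf "%d characters, %d bytes\n", length $edited, length $out_bytes;
```

If the pattern is applied to the undecoded bytes instead, every second byte is 00 and nothing lines up, which might be what went wrong in the larger script.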

Of course I respect that you considered it more trouble than you're willing to go through, but I have the feeling it must have been a small thing that went wrong in your version of the script.
Had you given me a chance to look at it, I'm sure we could have gotten it to work.

No hard feelings though, and it's okay if you don't want to pursue this any further. The only thing is I feel I don't deserve a C grade.
The C grade is only because I didn't get a working resolution for the question I originally asked; it's not meant to be a reflection of your help, which I do appreciate. Believe me, I didn't give up easily. I spent a few days on this, trying every trick and reading everything I could find on the subject. The ActivePerl documentation itself states that it does not fully implement UTF-16, so maybe it's not possible. Even if it is, I can't afford the time to investigate this any further. The example I gave is a very simplified version of a much more complex script that is effectively doing the same conversions, and it does not work with 16-bit Unicode input/output, though it works fine for ASCII. "Weiß" into "Schwartz" is not a valid scenario, as Schwartz in Unicode is the same as it is in ASCII.

In the simplest example of reading in a Unicode file created on a Windows platform and outputting a copy to a Unicode file on the Windows platform, the given solutions do not provide an identical match.
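For the identical-copy case, one possible approach is to skip the I/O layers entirely: read the file as raw bytes, decode, and encode again on the way out. This is only a sketch, assuming the Notepad "Unicode" file is UTF-16LE with a byte order mark; the sub names slurp_utf16le and spew_utf16le are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Read a whole file as bytes and return the decoded text.
# A leading byte order mark is kept as the character U+FEFF.
sub slurp_utf16le {
    my ($name) = @_;
    open my $fh, '<:raw', $name or die "Cannot read $name: $!";
    my $bytes = do { local $/; <$fh> };
    close $fh;
    return decode('UTF-16LE', $bytes);
}

# Encode text as UTF-16LE and write the bytes with no translation.
sub spew_utf16le {
    my ($name, $text) = @_;
    open my $fh, '>:raw', $name or die "Cannot write $name: $!";
    print {$fh} encode('UTF-16LE', $text);
    close $fh or die "Cannot close $name: $!";
}
```

With these, spew_utf16le('copy.txt', slurp_utf16le('original.txt')) should produce the same bytes as the original, and any edits can be made on the decoded text in between, keeping "\r\n" as the line ending.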