• Status: Solved

Read/Write Unicode files with ActivePerl on Windows

Hello,

Can anyone give me an example of how to read and write Unicode files in Perl?

I'm using ActivePerl on Windows and need to read in a Unicode file, modify the contents, and then write the result to a new file while retaining the Unicode format. I have it working fine in ASCII, but Unicode is proving quite a challenge. A worked example would be ideal.

Thanks,

Mike

kandura commented:
What kind of encoding is the file in? UTF8?
Have a look at the -C command line switch. (http://www.perldoc.com/perl5.8.4/pod/perlrun.html#Command-Switches)

beelineuk (author) commented:
The encoding is Unicode as defined by Windows notepad. This is not UTF-8.

The -C doesn't help, nor does use Encode::unicode;

kandura commented:
I see. Notepad uses the UTF-16LE encoding.

Here's something that worked for me:

use open ':encoding(UTF-16LE)';
open O, ">some.txt";
binmode O;

while(<>) {
    s/http/ftp/;
    print O $_;
}

beelineuk (author) commented:
I get the following error, even when the source file contains plain text chars.

Wide character in print at C:\sample.pl line 6, <> line 1.

kandura commented:
It's a warning, not an error. I saw it too, but I doubt it causes problems.
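
A possible explanation for that warning, offered as an assumption rather than something confirmed in this thread: binmode() called without a layer argument behaves like pushing :raw, and that removes an :encoding layer from the handle. After that, print has no encoding to apply to characters above 0xFF, such as the byte order mark that Notepad puts at the start of the file. The sketch below lists the layers on a handle before and after binmode() using the core PerlIO::get_layers function. The helper name layers_of and the temporary file are illustrative only.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Return the names of the PerlIO layers currently on a handle.
sub layers_of {
    my ($fh) = @_;
    return PerlIO::get_layers($fh);
}

# Scratch file used only for this demonstration; removed at exit.
my ($tmp_fh, $scratch) = tempfile(UNLINK => 1);
close $tmp_fh;

open my $out, '>:encoding(UTF-16LE)', $scratch
    or die "Cannot open $scratch: $!";
print "before binmode: @{[ layers_of($out) ]}\n";

binmode $out;    # no layer argument: same effect as pushing :raw
print "after binmode:  @{[ layers_of($out) ]}\n";

close $out;
```

If the encoding layer is missing from the second line of output, the file is no longer being written as UTF-16LE, which would make the warning worth taking seriously.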

beelineuk (author) commented:
OK, I finally got the output to resemble the input with the following variation on the above. The problem I have now is that it's mangling the line endings. If I don't use chomp I get an extra trailing byte that corrupts the data. With chomp I lose the line breaks completely. This is probably simple, but how do I add the line break back in? Everything I've tried corrupts the output again.

use Encode;

use open ":encoding(UTF-16LE)";
open my $out, ">:encoding(UTF-16LE)", "output.txt" or die;

while($line=<>) {
    chomp($line);
    print $out $line;
}

kandura commented:
I noticed that too, which is why I did the binmode() on the output handle. I didn't need chomp after that.
I don't think you need to specify the output layer on your "open my $out" though, since the "use open" already takes care of that.

beelineuk (author) commented:
OK, but then some of my special characters start disappearing, such as ß.

kandura commented:
Why would that happen? I tested this with a bit of Russian, and even after adding a ß it kept working fine.
Did it happen after adding the binmode?
What kind of modifications are you doing to the text?

beelineuk (author) commented:
It's weird: if I use your exact example, the first ß is OK, but after that I lose them all. This is my test file:

Diamant-Weiß      Celica
Diamant-Weiß      Celica
Diamant-Weiß      Celica
Diamant-Weiß      Celica

This is the output:

Diamant-Weiß      Celica
Diamant-Wei      Celica
Diamant-Wei      Celica
Diamant-Wei      Celica

kandura commented:
Odd. Works fine for me. How does a hex dump of the output look?

beelineuk (author) commented:
OK, the original file ends each line with

0D 00 0A 00

The copy ends each line like this

0D 00 00

kandura commented:
Looks like "\n" characters have been dropped. Do you still have a "chomp()" in there? Any other line ending modifications? Are you doing any other modifications?

beelineuk (author) commented:
I was using chomp without the binmode, as this gives me the closest result to what I'm looking for. However, I have since found that even once the initial problem of reading and writing Unicode files is overcome, as soon as you try to use regular expressions within the logic it all gets corrupted anyway, as they don't handle UTF-16.

I've now abandoned this technology for what I'm doing, as it doesn't offer the level of Unicode support I need and is more trouble than it's worth. Despite this, please have the points for your help on this matter, even though it did not reach a resolution.

kandura commented:
I appreciate the gesture of giving me the points, but I'm not too thrilled about the C grade, or the fact that you gave up so easily.

I did get expected results with even the small snippet I gave you. I tried it with a regular expression that modified "Weiß" into "Schwartz", and that worked perfectly.

Of course I respect that you considered it more trouble than you're willing to go through, but I have the feeling it must have been a small thing that went wrong in your version of the script.
Had you given me a chance to look at it, I'm sure we could have gotten it to work.

No hard feelings though, and it's okay if you don't want to pursue this any further. The only thing is I feel I don't deserve a C grade.

beelineuk (author) commented:
The C grade is only because I didn't get a working resolution for the question I originally asked; it's not meant to be a reflection of your help, which I do appreciate. Believe me, I didn't give up easily. I spent a few days on this trying every trick I could find and reading everything I could find on the subject. The ActivePerl documentation itself states that it does not fully implement UTF-16, so maybe it's not possible. Even if it is, I can't afford to spend more time investigating this. The example I gave is a very simplified version of a much more complex script that is effectively doing the same conversions, and it does not work with 16-bit Unicode input/output, though it works fine for ASCII. "Weiß" into "Schwartz" is not a valid scenario, as Schwartz in Unicode is the same as it is in ASCII.

In the simplest example of reading in a Unicode file created on a Windows platform and outputting a copy to a Unicode file on the Windows platform, the given solutions do not produce an identical match.
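
For readers who land on this thread with the same problem, one commonly suggested approach is to spell out the whole layer stack on both handles as :raw:encoding(UTF-16LE):crlf. The likely cause of the stray bytes described above is that the default Windows :crlf layer sits underneath the encoding layer and rewrites line endings on the encoded bytes; putting :crlf on top makes it work on decoded characters instead. The following is a minimal sketch under that assumption. It has not been verified against the ActivePerl 5.8 release discussed here, and the helper name copy_utf16le and the file names are hypothetical.

```perl
use strict;
use warnings;

# Layer stack, applied left to right from the file upwards:
#   :raw                 start without the default byte-level :crlf layer
#   :encoding(UTF-16LE)  convert between UTF-16LE bytes and characters
#   :crlf                translate "\r\n" <-> "\n" on decoded characters
my $LAYERS = ':raw:encoding(UTF-16LE):crlf';

# Hypothetical helper: copy $src to $dst, passing each line to $edit
# (an optional code reference that modifies its first argument in place).
sub copy_utf16le {
    my ($src, $dst, $edit) = @_;
    open my $in,  "<$LAYERS", $src or die "Cannot read $src: $!";
    open my $out, ">$LAYERS", $dst or die "Cannot write $dst: $!";
    while (my $line = <$in>) {
        $edit->($line) if $edit;    # $line still ends in "\n"; no chomp
        print {$out} $line;
    }
    close $in;
    close $out or die "Cannot close $dst: $!";
    return;
}

# Example call (file names are placeholders):
# copy_utf16le('input.txt', 'output.txt', sub { $_[0] =~ s/http/ftp/ });
```

Because the data is decoded on the way in, regular expressions inside the callback see ordinary characters such as ß rather than UTF-16 byte pairs. Note that UTF-16LE does not strip the byte order mark: it arrives as U+FEFF at the start of the first line and is written back unchanged, so a pattern anchored at the very start of the file needs to allow for it.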