Solved

Read/Write unicode files with ActivePerl on Windows

Posted on 2004-08-16
Last Modified: 2012-06-21
Hello,

Can anyone give me an example of how to deal with Unicode when reading and writing files in Perl?

I'm using ActivePerl on Windows and need to read in a Unicode file, modify the contents, and then write the result out to a new file whilst retaining the Unicode format. I have it working fine in ASCII, but Unicode is proving quite a challenge. A worked example would be ideal.

Thanks,

Mike
Question by:beelineuk
16 Comments
 
LVL 18

Expert Comment

by:kandura
ID: 11809724
What kind of encoding is the file in? UTF8?
Have a look at the -C command line switch. (http://www.perldoc.com/perl5.8.4/pod/perlrun.html#Command-Switches)
 
LVL 2

Author Comment

by:beelineuk
ID: 11809749
The encoding is "Unicode" as defined by Windows Notepad. It is not UTF-8.

The -C doesn't help, nor does use Encode::unicode;
 
LVL 18

Accepted Solution

by:
kandura earned 250 total points
ID: 11810341
I see. Notepad uses the UTF-16LE encoding.

Here's something that worked for me:

use open ':encoding(UTF-16LE)';
open O, ">some.txt";
binmode O;

while(<>) {
    s/http/ftp/;
    print O $_;
}

 
LVL 2

Author Comment

by:beelineuk
ID: 11810681
I get the following error, even when the source file contains plain text chars.

Wide character in print at C:\sample.pl line 6, <> line 1.





 
LVL 18

Expert Comment

by:kandura
ID: 11810824
It's a warning, not an error. I saw it too, but I doubt it causes problems.
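
To illustrate when Perl emits this warning, here is a small sketch that is not from the thread. It assumes the trigger is a character above 0xFF (such as the U+FEFF byte-order mark that Notepad puts at the start of a "Unicode" file) being printed to a handle that has no encoding layer. In that case Perl warns and writes the string's internal UTF-8 bytes rather than UTF-16. In-memory handles stand in for real files, and all variable names are illustrative.

```perl
use strict;
use warnings;

# Collect warnings instead of printing them, so they can be inspected.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# Mostly plain text, but the leading BOM (U+FEFF) is a "wide" character.
my $text = "\x{FEFF}plain text\n";

# Handle with no encoding layer: print warns and emits UTF-8 bytes.
my $raw_buf = '';
open my $raw, '>:raw', \$raw_buf or die "Cannot open in-memory handle: $!";
print {$raw} $text;
close $raw;

# Handle with an explicit UTF-16LE layer: no warning, two bytes per character.
my $enc_buf = '';
open my $enc, '>:raw:encoding(UTF-16LE)', \$enc_buf
    or die "Cannot open in-memory handle: $!";
print {$enc} $text;
close $enc;

printf "warnings: %d, raw bytes: %d, UTF-16LE bytes: %d\n",
    scalar @warnings, length $raw_buf, length $enc_buf;
```

If the warning shows up for an output handle, it is worth checking whether the encoding layer is still attached to that handle at the point of the print.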
 
LVL 2

Author Comment

by:beelineuk
ID: 11811070
OK, I finally got the output to resemble the input with the following variation on the above. The problem I have now is that it's screwing up the carriage returns. If I don't use chomp, I get an extra trailing byte that corrupts the data. With chomp, I lose the line breaks completely. This is probably simple, but how do I add the line break back in? Everything I've tried makes it go screwy again.

use Encode;

use open ":encoding(UTF-16LE)";
open my $out, ">:encoding(UTF-16LE)", "output.txt" or die;

while($line=<>) {
    chomp($line);
    print $out $line;
}
 
LVL 18

Expert Comment

by:kandura
ID: 11811183
I noticed that too, which is why I did the binmode() on the output handle. I didn't need chomp after that.
I don't think you need to specify the output layer on your "open my $out" though, since the "use open" already takes care of that.
 
LVL 2

Author Comment

by:beelineuk
ID: 11811443
Ok, but then some of my special chars start disappearing such as ß.
 
LVL 18

Expert Comment

by:kandura
ID: 11811682
Why would that happen? I tested this with a bit of Russian, and even after adding a ß it kept working fine.
Did it happen after adding the binmode?
What kind of modifications are you doing to the text?
 
LVL 2

Author Comment

by:beelineuk
ID: 11811736
It's weird: if I use your exact example, the first ß is OK, but after that I lose them all. This is my test file:

Diamant-Weiß      Celica
Diamant-Weiß      Celica
Diamant-Weiß      Celica
Diamant-Weiß      Celica

This is the output

Diamant-Weiß      Celica
Diamant-Wei      Celica
Diamant-Wei      Celica
Diamant-Wei      Celica
 
LVL 18

Expert Comment

by:kandura
ID: 11811963
Odd. Works fine for me. How does a hex dump of the output look?
 
LVL 2

Author Comment

by:beelineuk
ID: 11812180
Ok the original file ends each line with

0D 00 0A 00

The copy ends each line like this

0D 00 00
 
LVL 18

Expert Comment

by:kandura
ID: 11839626
Looks like "\n" characters have been dropped. Do you still have a "chomp()" in there? Any other line ending modifications? Are you doing any other modifications?
 
LVL 2

Author Comment

by:beelineuk
ID: 11846851
I was using chomp without the binmode, as this gives me the closest result to what I'm looking for. However, I have since found that even once the initial problem of reading and writing Unicode files is overcome, as soon as you try to use regular expressions within the logic it all gets corrupted anyway, as they don't handle UTF-16.

I've now abandoned this technology for what I'm doing, as it doesn't offer the level of Unicode support I need and is more trouble than it's worth. Despite this, please have the points for your help on this matter, even though it did not reach a resolution.
 
LVL 18

Expert Comment

by:kandura
ID: 11846911
I appreciate the gesture of giving me the points, but I'm not too thrilled about the C grade, or the fact that you gave up so easily.

I did get expected results with even the small snippet I gave you. I tried it with a regular expression that modified "Weiß" into "Schwartz", and that worked perfectly.

Of course I respect that you considered it more trouble than you're willing to go through, but I have the feeling it must have been a small thing that went wrong in your version of the script.
Had you given me a chance to look at it, I'm sure we could have gotten it to work.

No hard feelings though, and it's okay if you don't want to pursue this any further. The only thing is I feel I don't deserve a C grade.
 
LVL 2

Author Comment

by:beelineuk
ID: 11847023
The C grade is only because I didn't get a working resolution for the question I originally asked; it's not meant to be a reflection of your help, which I do appreciate. Believe me, I didn't give up easily: I spent a few days on this, trying every trick I could find and reading everything I could find on the subject. The ActivePerl documentation itself states that it does not fully implement UTF-16, so maybe it's not possible. Even if it is, I can't afford the time to investigate this any further. The example I gave is a very simplified version of a much more complex script that is effectively doing the same conversions, and it does not work with 16-bit Unicode input/output, though it works fine for ASCII. Changing "Weiß" into "Schwartz" is not a valid scenario, as "Schwartz" in Unicode is the same as it is in ASCII.

In the simplest example of reading in a Unicode file created on a Windows platform and outputting a copy to a Unicode file on the Windows platform, the given solutions do not produce an identical match.
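
For anyone hitting the same line-ending corruption, here is a minimal sketch of a layer ordering that is commonly suggested for UTF-16 text on Windows. It assumes the extra byte described earlier comes from the default :crlf layer sitting underneath the encoding layer, where it rewrites individual bytes of the UTF-16 stream instead of whole characters. The helper name copy_utf16le and the sample file names are hypothetical and not from the thread.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: copy a UTF-16LE text file, optionally editing each line.
sub copy_utf16le {
    my ($in_path, $out_path, $edit) = @_;

    # :raw removes the platform default layers (including :crlf on Windows),
    # :encoding(UTF-16LE) converts between bytes and characters, and the
    # trailing :crlf then translates "\r\n" <-> "\n" on decoded characters.
    open my $in, '<:raw:encoding(UTF-16LE):crlf', $in_path
        or die "Cannot read $in_path: $!";
    open my $out, '>:raw:encoding(UTF-16LE):crlf', $out_path
        or die "Cannot write $out_path: $!";

    while (my $line = <$in>) {
        $edit->($line) if $edit;    # callback edits the line through $_[0]
        print {$out} $line;
    }

    close $in;
    close $out or die "Cannot close $out_path: $!";
    return;
}

# Demo: build a small Notepad-style file (BOM, CRLF line endings) ...
my $sample = "\x{FEFF}" . ("Diamant-Wei\x{DF}      Celica\r\n" x 4);
open my $fh, '>:raw', 'sample_in.txt' or die "Cannot write sample_in.txt: $!";
print {$fh} encode('UTF-16LE', $sample);
close $fh or die "Cannot close sample_in.txt: $!";

# ... then copy it while applying a regular expression to every line.
copy_utf16le('sample_in.txt', 'sample_out.txt',
    sub { $_[0] =~ s/Celica/Corolla/ });
```

Because the BOM is decoded as an ordinary U+FEFF character at the start of the first line, it is written back unchanged, but a pattern anchored with ^ on the first line would need to allow for it.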


Question has a verified solution.
