Solved

Read/Write Unicode files with ActivePerl on Windows

Posted on 2004-08-16
Last Modified: 2012-06-21
Hello,

Can anyone give me an example of how to read and write Unicode files in Perl?

I'm using ActivePerl on Windows and need to read in a Unicode file, modify the contents, and then write the result to a new file while retaining the Unicode format. I have it working fine in ASCII, but Unicode is proving quite a challenge. A worked example would be ideal.

Thanks,

Mike
Question by:beelineuk
 
LVL 18

Expert Comment

by:kandura
ID: 11809724
What kind of encoding is the file in? UTF8?
Have a look at the -C command line switch. (http://www.perldoc.com/perl5.8.4/pod/perlrun.html#Command-Switches)
0
 
LVL 2

Author Comment

by:beelineuk
ID: 11809749
The encoding is Unicode as defined by Windows notepad. This is not UTF-8.

The -C doesn't help, nor does use Encode::unicode;
0
 
LVL 18

Accepted Solution

by:
kandura earned 250 total points
ID: 11810341
I see. Notepad uses the UTF-16LE encoding.

Here's something that worked for me:

use open ':encoding(UTF-16LE)';
open O, ">some.txt";
binmode O;

while(<>) {
    s/http/ftp/;
    print O $_;
}
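
A more explicit variant, offered as a sketch rather than a tested fix for this exact setup: read the file as raw bytes, decode it with the core Encode module, edit the decoded text, then encode it and write raw bytes back. This swaps the `use open` layers for explicit `decode`/`encode` calls, so no newline translation can touch the UTF-16 bytes. The file names, the sample text, and the Celica-to-Supra substitution are placeholders. Notepad's "Unicode" files normally begin with the byte order mark FF FE, so the sketch sets it aside and restores it on output.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use File::Temp qw(tempdir);

# Placeholder paths in a scratch directory; substitute real file names.
my $dir      = tempdir(CLEANUP => 1);
my $in_path  = "$dir/input.txt";
my $out_path = "$dir/output.txt";

# Create a sample input resembling a Notepad "Unicode" file:
# BOM (FF FE), UTF-16LE text, CRLF line endings. \x{DF} is the sharp s.
my $sample = "Diamant-Wei\x{DF}\tCelica\r\n" x 4;
open my $seed, '>:raw', $in_path or die "Cannot write $in_path: $!";
print {$seed} "\xFF\xFE", encode('UTF-16LE', $sample);
close $seed or die "Cannot close $in_path: $!";

# Slurp the input as bytes; ':raw' disables CRLF translation on Windows.
open my $in, '<:raw', $in_path or die "Cannot read $in_path: $!";
my $bytes = do { local $/; <$in> };
close $in;

# Decode to characters and set the BOM aside so it is not treated as text.
my $text    = decode('UTF-16LE', $bytes);
my $had_bom = ($text =~ s/\A\x{FEFF}//) ? 1 : 0;

# Placeholder edit; "\r\n" is left alone, so line endings survive as-is.
$text =~ s/Celica/Supra/g;

# Encode back to UTF-16LE and write bytes, restoring the BOM if there was one.
my $bom = $had_bom ? "\xFF\xFE" : '';
open my $out, '>:raw', $out_path or die "Cannot write $out_path: $!";
print {$out} $bom, encode('UTF-16LE', $text);
close $out or die "Cannot close $out_path: $!";
```

Because the line endings are never chomped or translated, each output line should still end in 0D 00 0A 00.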
0
 
LVL 2

Author Comment

by:beelineuk
ID: 11810681
I get the following error, even when the source file contains plain text chars.

Wide character in print at C:\sample.pl line 6, <> line 1.





0
 
LVL 18

Expert Comment

by:kandura
ID: 11810824
It's a warning, not an error. I saw it too, but I doubt it causes problems.
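
One plausible reason the warning shows up even for plain ASCII content (an assumption, since the actual input file isn't shown): Notepad's Unicode files begin with the bytes FF FE, and decoding those as UTF-16LE yields the character U+FEFF, which is above 255. Printing such a character to a handle that has no encoding layer triggers "Wide character in print". A small sketch that reproduces this with in-memory handles:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Bytes of a Notepad-style "Unicode" file containing just "hi": BOM + UTF-16LE.
my $bytes = "\xFF\xFE" . "h\x00i\x00";
my $text  = decode('UTF-16LE', $bytes);

# The BOM survives decoding as the character U+FEFF.
my $first = ord substr($text, 0, 1);

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# A byte-oriented handle (no encoding layer) cannot represent U+FEFF.
open my $plain, '>', \my $plain_buf or die "Cannot open in-memory handle: $!";
print {$plain} $text;
close $plain;

# The same text through an encoding layer is written without complaint.
my $count_before = @warnings;
open my $enc, '>:encoding(UTF-16LE)', \my $enc_buf
    or die "Cannot open in-memory handle: $!";
print {$enc} $text;
close $enc;
my $encoded_ok = (@warnings == $count_before) ? 1 : 0;
```

So the warning is worth taking seriously: it indicates that the handle being printed to is not encoding its output as UTF-16LE.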
0
 
LVL 2

Author Comment

by:beelineuk
ID: 11811070
Ok finally got it the output to resemble the input with the following variation on the above. The problem I have now is that it's screwing up the carriage returns. If I don't use chomp I get an extra trailing byte that corrupts the data. With chomp I loose them completely. This is prob simple but how do I add the line break back in? Everything I've tried makes it go screwy again.

use Encode;

use open ":encoding(UTF-16LE)";
open my $out, ">:encoding(UTF-16LE)", "output.txt" or die;

while($line=<>) {
    chomp($line);
    print $out $line;
}
0
 
LVL 18

Expert Comment

by:kandura
ID: 11811183
I noticed that too, which is why I did the binmode() on the output handle. I didn't need chomp after that.
I don't think you need to specify the output layer on your "open my $out" though, since the "use open" already takes care of that.
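
For what it's worth, `binmode` with no layer argument is equivalent to `binmode(FH, ':raw')`, which removes an `:encoding(...)` layer from the handle. That may be why the output handle in the earlier snippet produced "Wide character" warnings. A minimal sketch for checking which layers a handle really has, using the built-in `PerlIO::get_layers`; the in-memory handle is only a stand-in for a real output file:

```perl
use strict;
use warnings;

# In-memory handle standing in for the real output file.
open my $out, '>:encoding(UTF-16LE)', \my $buffer
    or die "Cannot open in-memory handle: $!";

# Layers right after open: an encoding(...) entry should be listed.
my @before          = PerlIO::get_layers($out);
my $encoding_before = grep { /^encoding/ } @before;

# binmode with no layer argument behaves like ':raw' and pops the encoding layer.
binmode $out;
my @after          = PerlIO::get_layers($out);
my $encoding_after = grep { /^encoding/ } @after;

print "before: @before\n";
print "after:  @after\n";

# To keep UTF-16LE output, name the layer explicitly instead.
binmode $out, ':encoding(UTF-16LE)';
my $encoding_restored = grep { /^encoding/ } PerlIO::get_layers($out);
close $out;
```

Printing the layer list for both the input and output handles of the real script is a quick way to confirm that each one still carries the encoding you expect.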
0
 
LVL 2

Author Comment

by:beelineuk
ID: 11811443
Ok, but then some of my special chars start disappearing such as ß.
0

 
LVL 18

Expert Comment

by:kandura
ID: 11811682
Why would that happen? I tested this with a bit of Russian, and even after adding a ß it kept working fine.
Did it happen after adding the binmode?
What kind of modifications are you doing to the text?
0
 
LVL 2

Author Comment

by:beelineuk
ID: 11811736
It's weird: if I use your exact example the first ß is OK, but after that I lose them all. This is my test file...

Diamant-Weiß      Celica
Diamant-Weiß      Celica
Diamant-Weiß      Celica
Diamant-Weiß      Celica

This is the output

Diamant-Weiß      Celica
Diamant-Wei      Celica
Diamant-Wei      Celica
Diamant-Wei      Celica
0
 
LVL 18

Expert Comment

by:kandura
ID: 11811963
Odd. Works fine for me. How does a hex dump of the output look?
0
 
LVL 2

Author Comment

by:beelineuk
ID: 11812180
Ok the original file ends each line with

0D 00 0A 00

The copy ends the first line like this

0D 00 00
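
For comparison, a short sketch of what a correctly encoded line ending looks like and how a byte-level newline translation could throw it off. The second half only simulates the effect with a substitution on the encoded bytes; it is an assumption about one way such damage can arise, not a diagnosis of this particular script.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Render a byte string as space-separated hex pairs.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

# CR LF encoded as UTF-16LE: two 16-bit units, four bytes.
my $good = hex_dump(encode('UTF-16LE', "\r\n"));

# Simulate a newline translation applied to the bytes *after* encoding:
# the lone 0A byte of an encoded "\n" becomes 0D 0A, giving three bytes.
my $lf_bytes = encode('UTF-16LE', "\n");
(my $translated = $lf_bytes) =~ s/\x0A/\x0D\x0A/;
my $bad = hex_dump($translated);

print "$good\n";   # 0D 00 0A 00
print "$bad\n";    # 0D 0A 00
```

An odd number of bytes in a line ending shifts every later 16-bit unit by one byte, which would be consistent with the first line surviving and later characters such as ß being lost.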
0
 
LVL 18

Expert Comment

by:kandura
ID: 11839626
Looks like "\n" characters have been dropped. Do you still have a "chomp()" in there? Any other line ending modifications? Are you doing any other modifications?
0
 
LVL 2

Author Comment

by:beelineuk
ID: 11846851
I was using chomp without the binmode, as this gives me the closest result to what I'm looking for. However, I have since found that even once the initial problem of reading and writing Unicode files is overcome, as soon as you try to use regular expressions within the logic it all gets corrupted anyway, as they don't handle UTF-16.

I've now abandoned this technology for what I'm doing, as it doesn't offer the level of Unicode support I need and is more trouble than it's worth. Despite this, please have the points for your help on this matter, even though it did not reach a resolution.
0
 
LVL 18

Expert Comment

by:kandura
ID: 11846911
I appreciate the gesture of giving me the points, but I'm not too thrilled about the C grade, or the fact that you gave up so easily.

I did get expected results with even the small snippet I gave you. I tried it with a regular expression that modified "Weiß" into "Schwartz", and that worked perfectly.
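
A minimal sketch of that kind of test, assuming the text is decoded to characters before the substitution runs. The sample data and the explicit Encode calls are illustrative, not the script from this thread. Only the first "Weiß" is replaced, so the second line shows that a remaining ß also survives the round trip.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Two UTF-16LE lines as they would sit on disk (BOM omitted for brevity).
# \x{DF} is the sharp s in "Weiß".
my $bytes = encode('UTF-16LE', "Diamant-Wei\x{DF}\tCelica\r\n" x 2);

# Decode first: the regex then sees characters, not 16-bit byte pairs.
my $text = decode('UTF-16LE', $bytes);

# Replace only the first "Weiß"; the second line keeps its sharp s.
$text =~ s/Wei\x{DF}/Schwartz/;

# Count the sharp s characters still present in the decoded text.
my $sharp_s_left = () = $text =~ /\x{DF}/g;

# Encode again, ready for writing to a ':raw' handle.
my $result = encode('UTF-16LE', $text);
```

The regular expression never sees UTF-16 at all here; it works on decoded characters, and UTF-16LE only exists at the point where bytes are read or written.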

Of course I respect that you considered it more trouble than you're willing to go through, but I have the feeling it must have been a small thing that went wrong in your version of the script.
Had you given me a chance to look at it, I'm sure we could have gotten it to work.

No hard feelings though, and it's okay if you don't want to pursue this any further. The only thing is I feel I don't deserve a C grade.
0
 
LVL 2

Author Comment

by:beelineuk
ID: 11847023
The C grade is only because I didn't get a working resolution for the question I originally asked; it's not meant to be a reflection of your help, which I do appreciate. Believe me, I didn't give up easily. I spent a few days on this, trying every trick and reading everything I could find on the subject. The ActivePerl documentation itself states that it does not fully implement UTF-16, so maybe it's not possible. Even if it is, I can't afford to spend more time investigating this. The example I gave is a very simplified version of a much more complex script that is effectively doing the same conversions, and it does not work with 16-bit Unicode input/output, though it works fine for ASCII. "Weiß" into "Schwartz" is not a valid scenario, as Schwartz in Unicode is the same as it is in ASCII.

In the simplest example of reading in a Unicode file created on a Windows platform and outputting a copy to a Unicode file on the Windows platform, the given solutions do not produce an identical match.
0
