Solved

Looking for a Linux command-line tool to determine the character encoding of a file

Posted on 2010-11-17
Last Modified: 2012-05-10
I need to convert a large number of files to UTF-8. The problem is that they do not all start out in the same encoding. I want to write a script that determines the current encoding of each file and then runs iconv on it to convert it to UTF-8.

I tried enca -i fileName but got:
$ enca -i fileName
enca: Cannot determine (or understand) your language preferences.
Please use `-L language', or `-L none' if your language is not supported
(only a few multibyte encodings can be recognized then).
Run `enca --list languages' to get a list of supported languages

Thanks,
Frank
Question by:ibanja
 
LVL 7

Expert Comment

by:deisrobinson
ID: 34155724
You are using enca on text files only, correct?
 
LVL 7

Accepted Solution

by:
deisrobinson earned 500 total points
ID: 34155742
Look,

"You can (or have to) use -L option to tell it the right language. Suppose, you downloaded some Russian HTML file, 'file.htm', it claims it's windows-1251 but it isn't. So you run

    enca -L ru file.htm"

In your case it would probably be "enca -L en file.htm"

Find out more here:

http://linux.die.net/man/1/enca
 
LVL 7

Expert Comment

by:Hatrix76
ID: 34156248
try file

file <filename>

it should give you something like:

filename: UTF-8 Unicode text

best
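
For scripting, file can print just an encoding name instead of a description. A minimal sketch, assuming a version of file that supports -b (brief) and --mime-encoding; the function name guess_encoding is made up for illustration:

```shell
# Print "encoding<TAB>path" for each argument, using file's own guess.
# guess_encoding is a hypothetical helper name, not part of file.
guess_encoding() {
    for f in "$@"; do
        printf '%s\t%s\n' "$(file -b --mime-encoding -- "$f")" "$f"
    done
}
```

Called as guess_encoding *.tex, this lists one guess per file. The names are lowercase charset names such as utf-8 or iso-8859-1. Files that file cannot place are typically reported as unknown-8bit, so a script should treat that value as "not detected" rather than pass it to iconv.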

 

Author Comment

by:ibanja
ID: 34156735
I tried the file command.  I get:

$ file file.tex
file.tex: Non-ISO extended-ASCII text

enca didn't like "en"
$ enca -L en file.tex
enca: Language `en' is unknown or not supported.

So I tried:

$ enca -L none file.tex
Unrecognized encoding

It seems to recognize some encodings but not all.  I guess this is the best I'll get.
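
Since detection only works for some files, a conversion script should leave anything unrecognized untouched. The following is a rough sketch of that idea, not a tested tool. The names convert_to_utf8 and convert_all are made up here. It assumes that enca -i prints an encoding name iconv understands and that enca exits non-zero when it cannot guess, which is how I read the man page:

```shell
# Convert FILE from encoding SRC_ENC to UTF-8 in place.
# The file is only overwritten if iconv succeeds on the whole input.
convert_to_utf8() {
    src_enc=$1
    target=$2
    tmp_out=$(mktemp) || return 1
    if iconv -f "$src_enc" -t UTF-8 "$target" > "$tmp_out"; then
        # Write back through the existing file to keep its permissions.
        cat "$tmp_out" > "$target"
        rm -f "$tmp_out"
    else
        rm -f "$tmp_out"
        return 1
    fi
}

# Ask enca (with language LANG_HINT) for each file's encoding and convert
# the files it recognizes. Everything else is reported on stderr.
convert_all() {
    lang_hint=$1
    shift
    for f in "$@"; do
        if enc=$(enca -L "$lang_hint" -i "$f" 2>/dev/null) && [ -n "$enc" ]; then
            convert_to_utf8 "$enc" "$f" || echo "conversion failed: $f" >&2
        else
            echo "unrecognized, left untouched: $f" >&2
        fi
    done
}
```

A call such as convert_all none *.tex would then convert what enca can identify and print the rest for manual inspection. Running it on copies first is advisable, because a wrong guess still produces valid-looking UTF-8 with the wrong characters.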
 

Author Comment

by:ibanja
ID: 34156773
// You are using enca on text files only correct?
Yes, I'm using it on text files only.

 
LVL 7

Expert Comment

by:deisrobinson
ID: 34156783
Great. Use the `enca --list languages' command to find out what the abbreviation for English is; I just guessed it would be 'en'. Good luck!
 

Author Closing Comment

by:ibanja
ID: 34167286
I didn't get any "English" listing, which is why I went with "none". It seems to work.

Thanks!

$ enca --list languages
belarussian: CP1251 IBM866 ISO-8859-5 KOI8-UNI maccyr IBM855
  bulgarian: CP1251 ISO-8859-5 IBM855 maccyr ECMA-113
      czech: ISO-8859-2 CP1250 IBM852 KEYBCS2 macce KOI-8_CS_2 CORK
   estonian: ISO-8859-4 CP1257 IBM775 ISO-8859-13 macce baltic
   croatian: CP1250 ISO-8859-2 IBM852 macce CORK
  hungarian: ISO-8859-2 CP1250 IBM852 macce CORK
 lithuanian: CP1257 ISO-8859-4 IBM775 ISO-8859-13 macce baltic
    latvian: CP1257 ISO-8859-4 IBM775 ISO-8859-13 macce baltic
     polish: ISO-8859-2 CP1250 IBM852 macce ISO-8859-13 ISO-8859-16 baltic CORK
    russian: KOI8-R CP1251 ISO-8859-5 IBM866 maccyr
     slovak: CP1250 ISO-8859-2 IBM852 KEYBCS2 macce KOI-8_CS_2 CORK
    slovene: ISO-8859-2 CP1250 IBM852 macce CORK
  ukrainian: CP1251 IBM855 ISO-8859-5 CP1125 KOI8-U maccyr
    chinese: GBK BIG5 HZ
       none:
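
A script can check this list before picking a value for -L. A small sketch, assuming the "name: encodings" layout shown above; lang_supported is a made-up helper that reads the list on standard input:

```shell
# Succeed if the language name given as $1 appears in a list formatted
# like the output of `enca --list languages` (read from stdin).
lang_supported() {
    awk -F: -v want="$1" '
        { name = $1; gsub(/^ +/, "", name) }   # strip the alignment padding
        name == want { found = 1 }
        END { exit !found }
    '
}
```

For example, enca --list languages | lang_supported russian succeeds with the list above, while the same check for english fails, in which case the script can fall back to -L none.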