Looking for a Linux command-line tool to determine the character encoding of a file

I need to convert a large number of files to UTF-8. The problem is that they do not all start out in the same encoding. I want to write a script that goes through the files, determines the current encoding of each one, and then runs iconv on it to convert it to UTF-8.

I tried enca -i fileName but got:
$ enca -i fileName
enca: Cannot determine (or understand) your language preferences.
Please use `-L language', or `-L none' if your language is not supported
(only a few multibyte encodings can be recognized then).
Run `enca --list languages' to get a list of supported languages

Thanks,
Frank
ibanjaAsked:
deisrobinsonCommented:
Look,

"You can (or have to) use -L option to tell it the right language. Suppose, you downloaded some Russian HTML file, 'file.htm', it claims it's windows-1251 but it isn't. So you run

    enca -L ru file.htm"

In your case it would probably be "enca -L en file.htm"

Find out more here:

http://linux.die.net/man/1/enca
 
deisrobinsonCommented:
You are using enca on text files only, correct?
 
Hatrix76Commented:
Try the file command:

file <filename>

It should print something like this (the exact wording varies between versions of file):

<filename>: UTF-8 Unicode text

Best
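
For use in a script, the human-readable description is awkward to parse. A minimal sketch, assuming a reasonably recent file(1) that supports the -b (brief) and --mime-encoding options; the function name detect_charset is made up for illustration:

```shell
# Print only the charset name that file(1) guesses for one file,
# for example utf-8, us-ascii or iso-8859-1.
# detect_charset is a hypothetical helper name, not part of any tool.
detect_charset() {
    # -b drops the "filename:" prefix, --mime-encoding prints just the charset
    file -b --mime-encoding -- "$1"
}
```

Called as detect_charset file.tex, it prints a single word that can be passed on to other tools. Keep in mind that this is a guess based on the bytes in the file, not a guarantee.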
 
ibanjaAuthor Commented:
I tried the file command.  I get:

$ file file.tex
file.tex: Non-ISO extended-ASCII text

enca didn't like "en":
$ enca -L en file.tex
enca: Language `en' is unknown or not supported.

So I tried:

$ enca -L none file.tex
Unrecognized encoding

It seems to recognize some encodings but not all.  I guess this is the best I'll get.
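
Since detection will not succeed for every file, a conversion script probably has to skip and report the files it cannot identify rather than guess. A rough sketch of such a loop, under these assumptions: it uses file -b --mime-encoding for detection instead of enca, it writes the result to NAME.utf8 next to the original (the suffix is arbitrary), and the function name convert_to_utf8 is made up for illustration:

```shell
#!/bin/sh
# Sketch: convert each file named on the command line to UTF-8.
# The result is written to NAME.utf8 so the original stays untouched
# (the .utf8 suffix is an arbitrary choice for this example).

convert_to_utf8() {
    src=$1
    # Ask file(1) for its best guess; it prints only the charset name.
    enc=$(file -b --mime-encoding -- "$src")
    case $enc in
        utf-8|us-ascii)
            # ASCII is a subset of UTF-8, so nothing to do in either case.
            echo "skipped: $src is already $enc"
            ;;
        binary|unknown-8bit|'')
            # file(1) could not name an encoding; do not guess.
            echo "unrecognized: $src ($enc), needs a manual check" >&2
            return 1
            ;;
        *)
            if iconv -f "$enc" -t UTF-8 < "$src" > "$src.utf8"; then
                echo "converted: $src ($enc) -> $src.utf8"
            else
                # Remove the partial output if iconv failed.
                rm -f -- "$src.utf8"
                return 1
            fi
            ;;
    esac
}

for f in "$@"; do
    convert_to_utf8 "$f"
done
```

Saved under a name of your choosing and run with the files as arguments, it leaves a list of "unrecognized" files on stderr. Files that file describes as "Non-ISO extended-ASCII text" are likely to end up in that list; they are often in a Windows code page such as CP1252, but that has to be confirmed by looking at the text, for example by trying iconv -f CP1252 by hand and checking the result.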
 
ibanjaAuthor Commented:
// You are using enca on text files only, correct?
Yes, I'm using it on text files only.

 
deisrobinsonCommented:
Great. Use the `enca --list languages' command to find out what the abbreviation for English is. I just guessed it would be 'en'. Good luck!
 
ibanjaAuthor Commented:
There was no "english" entry in the list, which is why I went with "none". It seems to work.

Thanks!

$ enca --list languages
belarussian: CP1251 IBM866 ISO-8859-5 KOI8-UNI maccyr IBM855
  bulgarian: CP1251 ISO-8859-5 IBM855 maccyr ECMA-113
      czech: ISO-8859-2 CP1250 IBM852 KEYBCS2 macce KOI-8_CS_2 CORK
   estonian: ISO-8859-4 CP1257 IBM775 ISO-8859-13 macce baltic
   croatian: CP1250 ISO-8859-2 IBM852 macce CORK
  hungarian: ISO-8859-2 CP1250 IBM852 macce CORK
 lithuanian: CP1257 ISO-8859-4 IBM775 ISO-8859-13 macce baltic
    latvian: CP1257 ISO-8859-4 IBM775 ISO-8859-13 macce baltic
     polish: ISO-8859-2 CP1250 IBM852 macce ISO-8859-13 ISO-8859-16 baltic CORK
    russian: KOI8-R CP1251 ISO-8859-5 IBM866 maccyr
     slovak: CP1250 ISO-8859-2 IBM852 KEYBCS2 macce KOI-8_CS_2 CORK
    slovene: ISO-8859-2 CP1250 IBM852 macce CORK
  ukrainian: CP1251 IBM855 ISO-8859-5 CP1125 KOI8-U maccyr
    chinese: GBK BIG5 HZ
       none: