Solved

Looking for a Linux command-line tool to determine the character encoding of a file

Posted on 2010-11-17
Last Modified: 2012-05-10
I need to convert a large number of files to UTF-8. The problem is that they do not all start out in the same encoding. I want to write a script that goes through the files, determines the current encoding of each one, and then runs iconv on it to convert it to UTF-8.

I tried enca -i fileName but got:
$ enca -i fileName
enca: Cannot determine (or understand) your language preferences.
Please use `-L language', or `-L none' if your language is not supported
(only a few multibyte encodings can be recognized then).
Run `enca --list languages' to get a list of supported languages

Thanks,
Frank
Question by:ibanja
7 Comments
 

Expert Comment

by:deisrobinson
ID: 34155724
You are using enca on text files only, correct?

Accepted Solution

by:
deisrobinson earned 500 total points
ID: 34155742
Look,

"You can (or have to) use -L option to tell it the right language. Suppose, you downloaded some Russian HTML file, 'file.htm', it claims it's windows-1251 but it isn't. So you run

    enca -L ru file.htm"

In your case it would probably be "enca -L en file.htm"

Find out more here:

http://linux.die.net/man/1/enca

Expert Comment

by:Hatrix76
ID: 34156248
Try the file command:

file <filename>

It should give you something like:

utf8-formatted text

best
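
For use in a script, the human-readable description is awkward to parse. The following is only a sketch, assuming a version of file that supports the -b (brief) and --mime-encoding options, which print just an encoding name such as iso-8859-1. The function name classify_encoding and its three action words are made up for illustration, and the names file reports are not guaranteed to be accepted by iconv.

```shell
#!/bin/sh
# Decide what to do with a file, based on the encoding name that
# `file -b --mime-encoding` reports. The action words are illustrative.
classify_encoding() {
    case "$1" in
        utf-8|us-ascii)      echo skip ;;     # already valid UTF-8 (ASCII is a subset)
        unknown-8bit|binary) echo manual ;;   # file could not identify the encoding
        *)                   echo convert ;;  # candidate name for iconv -f
    esac
}

# Print one line per file given on the command line.
for f in "$@"; do
    enc=$(file -b --mime-encoding "$f")
    printf '%s: %s (%s)\n' "$f" "$enc" "$(classify_encoding "$enc")"
done
```

Run against a directory of files, this gives a quick overview of which ones can be handed straight to iconv and which ones need a closer look.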

 

Author Comment

by:ibanja
ID: 34156735
I tried the file command.  I get:

$ file file.tex
file.tex: Non-ISO extended-ASCII text

enca didn't like "en":
$ enca -L en file.tex
enca: Language `en' is unknown or not supported.

So I tried:

$ enca -L none file.tex
Unrecognized encoding

It seems to recognize some encodings but not all.  I guess this is the best I'll get.
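
When neither tool can name the encoding, the remaining option is to guess one by hand and check the result by eye. Below is a minimal sketch of a conversion helper that takes the source encoding as an argument. The function name to_utf8 and the temporary file suffix are hypothetical, and any encoding passed to it (CP1252, for instance) is an assumption about the file, not something the tools have confirmed.

```shell
#!/bin/sh
# Convert one file to UTF-8 in place, given a source encoding chosen by hand.
# Usage: to_utf8 SOURCE_ENCODING FILE
to_utf8() (
    src=$1
    f=$2
    tmp="$f.utf8.tmp"    # hypothetical temporary name next to the original
    if iconv -f "$src" -t UTF-8 "$f" > "$tmp"; then
        # Overwrite the original only after iconv has succeeded.
        cat "$tmp" > "$f" && rm -f "$tmp"
    else
        rm -f "$tmp"
        echo "conversion from $src failed: $f" >&2
        exit 1
    fi
)
```

Before converting in place, it is worth previewing the guess with something like `iconv -f CP1252 -t UTF-8 file.tex | less`. Note that iconv succeeding does not prove the guess was right: most single-byte encodings assign a character to nearly every byte value, so a wrong guess usually produces wrong characters rather than an error.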
 

Author Comment

by:ibanja
ID: 34156773
// You are using enca on text files only correct?
Yes, I'm using it on text files only.


Expert Comment

by:deisrobinson
ID: 34156783
Great, use the `enca --list languages' command to find out what the abbreviation for English is. I just guessed it would be 'en'. Good luck!
 

Author Closing Comment

by:ibanja
ID: 34167286
I didn't get any "English" listing, which is why I went with "none". It seems to work.

Thanks!

$ enca --list languages
belarussian: CP1251 IBM866 ISO-8859-5 KOI8-UNI maccyr IBM855
  bulgarian: CP1251 ISO-8859-5 IBM855 maccyr ECMA-113
      czech: ISO-8859-2 CP1250 IBM852 KEYBCS2 macce KOI-8_CS_2 CORK
   estonian: ISO-8859-4 CP1257 IBM775 ISO-8859-13 macce baltic
   croatian: CP1250 ISO-8859-2 IBM852 macce CORK
  hungarian: ISO-8859-2 CP1250 IBM852 macce CORK
 lithuanian: CP1257 ISO-8859-4 IBM775 ISO-8859-13 macce baltic
    latvian: CP1257 ISO-8859-4 IBM775 ISO-8859-13 macce baltic
     polish: ISO-8859-2 CP1250 IBM852 macce ISO-8859-13 ISO-8859-16 baltic CORK
    russian: KOI8-R CP1251 ISO-8859-5 IBM866 maccyr
     slovak: CP1250 ISO-8859-2 IBM852 KEYBCS2 macce KOI-8_CS_2 CORK
    slovene: ISO-8859-2 CP1250 IBM852 macce CORK
  ukrainian: CP1251 IBM855 ISO-8859-5 CP1125 KOI8-U maccyr
    chinese: GBK BIG5 HZ
       none:
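
In a script, the "Unrecognized encoding" case has to be told apart from a successful detection. The sketch below relies on enca's exit status, which, as far as I understand the man page, is 0 on success and non-zero when the encoding cannot be guessed. The function name detect_with_enca is made up for illustration; -L and -i are the options already used in this thread.

```shell
#!/bin/sh
# Print the iconv name enca detects for FILE, or return non-zero if enca
# cannot recognize the encoding. LANGUAGE defaults to "none".
# Usage: detect_with_enca FILE [LANGUAGE]
detect_with_enca() (
    f=$1
    lang=${2:-none}
    if name=$(enca -L "$lang" -i "$f" 2>/dev/null); then
        printf '%s\n' "$name"
    else
        exit 1
    fi
)
```

A caller can then write `enc=$(detect_with_enca file.htm ru) && iconv -f "$enc" -t UTF-8 file.htm`, so that iconv only runs when enca actually produced a name. With "none", only a few multibyte encodings can be recognized, as enca's own message says, so failures are to be expected for 8-bit files.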

Question has a verified solution.
