sonic2000


Is Binary?

I read that it's possible to determine whether a file is binary or text by checking the first 1024 bytes. How do I do that? Some pointers on how to go about it would be appreciated.
pankajtiwary

Which system are you talking about? In Unix and other flavours of Unix there is no difference between a text file and a binary file. Sometimes an executable is called a binary, but there is no concept of a binary file in Unix.
sonic2000

ASKER

I am using SunOS.
By a text file I mean one that displays nicely when I use cat; a binary file is one that does not.
I hope that's clear.
sunnycoder
Hi sonic2000,

You can use the file command to get the type of a file:

file <filename>

Many file formats start with header information that identifies the type of file. In particular, most files generated by other programs, such as a compiled ELF binary, have a defined format and a magic number that helps in identification.

Look into the source code of the file utility for the exact implementation. The source code should be available at least for the Linux platform.
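
As an illustration of the magic-number idea, here is a minimal sketch in C that checks only for the ELF magic bytes (0x7F followed by 'E', 'L', 'F'). The function name has_elf_magic is made up for this example; the real file utility consults a much larger table of signatures.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the file starts with the ELF magic
   bytes 0x7F 'E' 'L' 'F', 0 if it does not, -1 if it cannot be opened. */
int has_elf_magic(const char *path)
{
    static const unsigned char magic[4] = { 0x7F, 'E', 'L', 'F' };
    unsigned char buf[4];
    size_t n;
    FILE *fp = fopen(path, "rb");

    if (fp == NULL)
        return -1;
    n = fread(buf, 1, sizeof buf, fp);   /* may read fewer than 4 bytes */
    fclose(fp);

    return n == sizeof buf && memcmp(buf, magic, sizeof magic) == 0;
}
```

Other formats can be recognised the same way by comparing against their own leading bytes.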

Sunnycoder
Windows has no advanced facility like this. You just have a file with data in it! Pretty crappy really but only a bad workman blames his tools.

The theory about checking the first 1024 bytes is only a guesstimate. What you do is read 1024 bytes from the file and make sure all characters in it are printable (most people use 'isprint', but write your own by all means). The idea is that if there is anything other than 0-9, a-z, A-Z, punctuation and white space, then it is probably a binary file. You can be a bit more intelligent, but that's the basic algorithm.
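
A minimal sketch of that basic algorithm in C, for illustration only. The name looks_like_text and the SAMPLE_SIZE constant are invented here, and isspace is used alongside isprint so that tabs and line endings are not counted against the file (isprint alone rejects them in the "C" locale).

```c
#include <ctype.h>
#include <stdio.h>

#define SAMPLE_SIZE 1024   /* how many leading bytes to inspect */

/* Hypothetical helper: returns 1 if the first SAMPLE_SIZE bytes are all
   printable or white space, 0 if any other byte is found, -1 if the
   file cannot be opened. */
int looks_like_text(const char *path)
{
    unsigned char buf[SAMPLE_SIZE];
    size_t n, i;
    FILE *fp = fopen(path, "rb");

    if (fp == NULL)
        return -1;
    n = fread(buf, 1, sizeof buf, fp);   /* short files give n < SAMPLE_SIZE */
    fclose(fp);

    for (i = 0; i < n; i++) {
        if (!isprint(buf[i]) && !isspace(buf[i]))
            return 0;                    /* probably binary */
    }
    return 1;                            /* an empty file is reported as text */
}
```

Because only the first 1024 bytes are read, a file with binary data further in would still be reported as text, which is why this is only a guess.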

Paul
How do I get the magic number?
sonic2000,

I don't think there is a 'magic number' - 1024 or something else - that will conclusively decide.

I think it should be pretty easy to check the entire file - the program will be quite fast.

You can either use the isprint() macro, or check whether any character (byte) has an ASCII value higher than 127. If it is a pure text file, all ASCII values must be 127 or below.
(You can be stricter by allowing only 32-127, plus 9 (tab), 13 (cr) and 10 (lf).)
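
For illustration, a sketch in C of the stricter whole-file check described above. The function names are invented for this example, and the accepted set is exactly the one listed in the post (32-127 plus tab, LF and CR), not a definitive rule.

```c
#include <stdio.h>

/* Hypothetical helper: 1 if c is in the strict set 32-127, or is
   tab (9), LF (10) or CR (13); 0 otherwise. */
static int is_strict_text_byte(int c)
{
    return (c >= 32 && c <= 127) || c == 9 || c == 10 || c == 13;
}

/* Scans the entire file. Returns 1 if every byte is in the strict set,
   0 if any byte is not, -1 if the file cannot be opened. */
int file_is_strict_text(const char *path)
{
    int c, ok = 1;
    FILE *fp = fopen(path, "rb");

    if (fp == NULL)
        return -1;
    while ((c = getc(fp)) != EOF) {
        if (!is_strict_text_byte(c)) {
            ok = 0;          /* stop at the first offending byte */
            break;
        }
    }
    fclose(fp);
    return ok;
}
```

Since the loop stops at the first byte outside the set, binary files are usually rejected after reading very little of them.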

Do you want code for how to do this?

- stochastic
btw, since you are using SunOS, sunnycoder's suggestion should work just fine.

- sto
stochastic,
I will code it myself :) Thanks a lot. Can you tell me more about which ASCII values are allowed and which are not?
It's not a matter of 'allowed' and 'not allowed'; it's more a matter of 'likely' or 'unlikely'. Take a look at the ASCII character set and decide for yourself which characters would suggest to you that the file was binary. 'cr' and 'lf' are common in text files, but you can also get 'ff' (form feed) sometimes. Even more rare but possible are FS, GS and RS. You then need to decide if DEL (0x7f) can appear. Generally speaking, 32 - 126 are OK and tab, cr and lf are common; the rest is your choice.
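
One way to express the 'likely' versus 'unlikely' idea in C is to sort each byte into a class rather than accept or reject it outright. This is only a sketch: the enum and function names are invented, and treating DEL, the remaining control characters and everything above 127 as suspect is an assumption made for the example, since the post leaves that choice open.

```c
/* Hypothetical classification of a single byte. */
enum byte_class { BYTE_COMMON, BYTE_RARE, BYTE_SUSPECT };

enum byte_class classify_byte(unsigned char c)
{
    if (c >= 32 && c <= 126)
        return BYTE_COMMON;          /* ordinary printable ASCII */

    switch (c) {
    case 9:                          /* tab */
    case 10:                         /* lf  */
    case 13:                         /* cr  */
        return BYTE_COMMON;
    case 12:                         /* ff (form feed) */
    case 28:                         /* FS */
    case 29:                         /* GS */
    case 30:                         /* RS */
        return BYTE_RARE;            /* possible in text, but unusual */
    default:
        /* Assumption for this sketch: other control characters,
           DEL (127) and bytes above 127 count as suspect. */
        return BYTE_SUSPECT;
    }
}
```

A caller could then count how many bytes fall into each class and decide from the proportions, instead of failing on the first unusual byte.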

Paul
What Paul said is absolutely right, but > 127 is certainly "not allowed". No file would qualify to be a text file if it has characters with the high bit set.

- sto
>> but > 127 is certainly "not allowed". No file would qualify to be a text file if it has characters with the high bit set.

Did you forget £, Yen and accented characters? I did!

sonic,

As you can see, even wise old codgers like Sto and I can't get it right first time.

Paul
ASKER CERTIFIED SOLUTION
stochastic
SOLUTION
Seconded!!
Though I'd add FF (12) and do some more research on the likely percentage of accented characters in various European countries.

Otherwise excellent idea Jaime.

All those in favour say Aye!!
Thanks Paul.
Please also add 26 (Ctrl-Z, the DOS end-of-file marker), as mentioned by stochastic.
This is probably the only thing that you can absolutely be sure of:

A text file will never contain '\0' - the ASCII zero, the first character of the ASCII table, value 0.
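
A sketch of that zero-byte check in C, scanning the file in chunks with memchr. The function name and buffer size are invented for the example, and it assumes the text is in a single-byte encoding or UTF-8 (UTF-16 text, for instance, does contain zero bytes).

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if a zero byte occurs anywhere in the
   file, 0 if there is none, -1 if the file cannot be opened. */
int file_contains_nul(const char *path)
{
    unsigned char buf[4096];
    size_t n;
    int found = 0;
    FILE *fp = fopen(path, "rb");

    if (fp == NULL)
        return -1;
    while (!found && (n = fread(buf, 1, sizeof buf, fp)) > 0) {
        if (memchr(buf, 0, n) != NULL)
            found = 1;       /* a zero byte: treat the file as binary */
    }
    fclose(fp);
    return found;
}
```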
Aye for the Jaime Olivares test for Textuality :-)
(doesn't it feel nice to give a name to the test?)

- sto