sonic2000
asked on
Is Binary?
I read that it's possible to determine if a file is a binary or a text file by checking the first 1024 bytes. How do I do that? Some directions on how to do it would be good.
Which system are you talking about? In Unix and its flavours there is no difference between a text file and a binary file at the OS level. Sometimes an executable is called a binary, but Unix itself has no concept of a binary file.
ASKER
I am using SunOS.
My concept of a text file here is one that displays nicely when I use cat; anything else is binary.
Hope that's clear.
Hi sonic2000,
You can use the file command to get the type of a file:
file <filename>
Many file formats begin with header information that identifies the type of file. In particular, most files generated by other programs, such as a compiled ELF binary, have a particular format and a magic number that helps in identification.
Look into the source code of the file utility for the exact implementation. The source code should be available for at least the Linux platform.
Sunnycoder
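To make the magic number idea concrete, here is a minimal sketch in C that tests for just one signature: the four bytes 0x7f 'E' 'L' 'F' that open every ELF file. The function names are made up for illustration, and this is not how the file utility itself is implemented; file consults a whole database of such signatures.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Every ELF file starts with these four bytes: 0x7f 'E' 'L' 'F'. */
static const unsigned char ELF_MAGIC[4] = { 0x7f, 'E', 'L', 'F' };

/* Hypothetical helper: returns 1 if buf starts with the ELF magic number. */
int has_elf_magic(const unsigned char *buf, size_t len)
{
    return len >= sizeof ELF_MAGIC &&
           memcmp(buf, ELF_MAGIC, sizeof ELF_MAGIC) == 0;
}

/*
 * Hypothetical helper: returns 1 if the file starts with the ELF magic,
 * 0 if it does not (including files shorter than four bytes),
 * and -1 if the file cannot be opened.
 */
int file_is_elf(const char *path)
{
    unsigned char head[sizeof ELF_MAGIC];
    size_t n;
    FILE *fp = fopen(path, "rb");

    if (fp == NULL)
        return -1;
    n = fread(head, 1, sizeof head, fp);
    fclose(fp);
    return has_elf_magic(head, n);
}
```

Other formats can be recognised the same way by comparing against their own leading bytes, but a plain text file has no signature at all, which is why the heuristics discussed in this thread are needed.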
Windows has no advanced facility like this. You just have a file with data in it! Pretty crappy really but only a bad workman blames his tools.
The theory about checking the first 1024 bytes is only a guesstimate. What you do is read 1024 bytes from the file and make sure all characters in it are printable (most people use 'isprint', but make your own by all means). The idea is that if there is anything other than 0-9, a-z, A-Z, punctuation and white space, then it is probably a binary file. You can be a bit more intelligent, but that's the basic algorithm.
Paul
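The basic algorithm could be sketched in C roughly as follows. The names SAMPLE_SIZE, looks_like_text and file_looks_like_text are invented for this example, and the result is only a guess based on the first 1024 bytes in the default "C" locale.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

#define SAMPLE_SIZE 1024  /* how many leading bytes to inspect */

/* Returns 1 if every byte in buf is printable or white space, else 0. */
int looks_like_text(const unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        /* buf[i] is an unsigned char, so it is a valid argument here. */
        if (!isprint(buf[i]) && !isspace(buf[i]))
            return 0;
    }
    return 1;
}

/* Returns 1 for "probably text", 0 for "probably binary", -1 on error. */
int file_looks_like_text(const char *path)
{
    unsigned char buf[SAMPLE_SIZE];
    size_t n;
    int read_failed;
    FILE *fp = fopen(path, "rb");

    if (fp == NULL)
        return -1;
    n = fread(buf, 1, sizeof buf, fp);  /* may return fewer than 1024 bytes */
    read_failed = ferror(fp);
    fclose(fp);
    if (read_failed)
        return -1;
    return looks_like_text(buf, n);
}
```

Note that isspace() also accepts form feed and vertical tab, so this version is slightly more lenient than a hand-made check limited to tab, cr and lf. An empty file comes out as text.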
ASKER
How to get the magic number?
sonic2000,
I don't think there is a 'magic number' - 1024 or something else - that will conclusively decide.
I think it should be pretty easy to check the entire file - the program will be quite fast.
You can either use the isprint() macro, or check if any character (byte) has a value
higher than 127. If it is a pure ASCII text file, all values must be 127 or below.
(You can be stricter by checking for only 32-127, plus 9 (tab), 13 (cr) and 10 (lf).)
Do you want code for how to do this?
- stochastic
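A minimal sketch of the whole-file check in C, assuming the simple rule that any byte above 127 disqualifies the file. The function names and the 4096-byte chunk size are arbitrary choices for this example.

```c
#include <stddef.h>
#include <stdio.h>

/* Returns 1 if no byte in buf has the high bit set (all values <= 127). */
int buffer_is_7bit(const unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (buf[i] > 127)
            return 0;
    }
    return 1;
}

/*
 * Scans the whole file in chunks and stops at the first offending byte.
 * Returns 1 if the file is 7-bit clean, 0 if not, -1 on error.
 */
int file_is_7bit(const char *path)
{
    unsigned char chunk[4096];
    size_t n;
    int result = 1;
    FILE *fp = fopen(path, "rb");

    if (fp == NULL)
        return -1;
    while (result == 1 && (n = fread(chunk, 1, sizeof chunk, fp)) > 0)
        result = buffer_is_7bit(chunk, n);
    if (ferror(fp))
        result = -1;
    fclose(fp);
    return result;
}
```

For the stricter variant, the test inside the loop would instead accept only 32-127 plus 9, 10 and 13.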
btw, since you are using SunOS, sunnycoder's suggestion should work just fine.
- sto
ASKER
stochastic,
I will code it myself :) Thanks a lot. Can you tell me more about which ASCII values are allowed and which are not?
It's not a matter of 'allowed' and 'not allowed', it's more a matter of 'likely' or 'unlikely'. Take a look at the ASCII character set and decide for yourself which characters would suggest to you that the file was binary. 'cr' and 'lf' are common in text files, but you can also get 'ff' (form feed) sometimes. Even more rare but possible are FS, GS and RS. You then need to decide if DEL (0x7f) can appear. Generally speaking 32 - 126 are OK and tab, cr and lf are common; the rest is your choice.
Paul
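One way to write down the 'likely' versus 'unlikely' idea is a small classifier like the sketch below. The three categories and their names are invented for this example, and where each byte lands is exactly the kind of choice described above, so adjust to taste.

```c
/* Hypothetical verdicts for a single byte. */
enum byte_class { BYTE_TEXT, BYTE_RARE, BYTE_SUSPECT };

enum byte_class classify_byte(unsigned char c)
{
    if (c >= 32 && c <= 126)
        return BYTE_TEXT;          /* ordinary printable ASCII */
    if (c > 127)
        return BYTE_RARE;          /* assumption: may be an accented letter
                                      in an 8-bit encoding */
    switch (c) {
    case 9:                        /* tab */
    case 10:                       /* lf */
    case 13:                       /* cr */
        return BYTE_TEXT;
    case 12:                       /* ff (form feed) */
    case 28:                       /* FS */
    case 29:                       /* GS */
    case 30:                       /* RS */
    case 127:                      /* DEL */
        return BYTE_RARE;
    default:
        return BYTE_SUSPECT;       /* remaining control characters */
    }
}
```

A caller could count how many bytes fall into each class and call the file binary as soon as a BYTE_SUSPECT turns up, or once the BYTE_RARE count gets too large.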
What Paul said is absolutely right, but > 127 is certainly "not allowed". No file would qualify to be a text file if it has characters with the high bit set.
- sto
>> but > 127 is certainly "not allowed". No file would qualify to be a text file if it has characters with the high bit set.
Did you forget £, Yen and accented characters? I did!
sonic,
As you can see, even wise old codgers like Sto and I can't get it right first time.
Paul
ASKER CERTIFIED SOLUTION
SOLUTION
Seconded!!
Though I'd add FF (12) and do some more research on the likely percentage of accented characters in various European countries.
Otherwise excellent idea Jaime.
All those in favour say Aye!!
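As an illustration only, and not to be read as the accepted solution, here is a sketch in C of how a percentage threshold for accented characters might be applied. The function name probably_text, the max_high_percent parameter and the choice to reject DEL are all assumptions made for this example.

```c
#include <stddef.h>

/*
 * Hypothetical heuristic: DEL and control characters other than tab (9),
 * lf (10), ff (12) and cr (13) mark the buffer as binary at once; bytes
 * above 127 are tolerated as long as they make up no more than
 * max_high_percent of the buffer.
 * Returns 1 for "probably text", 0 for "probably binary".
 */
int probably_text(const unsigned char *buf, size_t len,
                  unsigned max_high_percent)
{
    size_t i, high = 0;

    for (i = 0; i < len; i++) {
        unsigned char c = buf[i];

        if (c > 127)
            high++;
        else if (c == 127 ||
                 (c < 32 && c != 9 && c != 10 && c != 12 && c != 13))
            return 0;
    }
    /* Compare high/len with max_high_percent/100 without dividing. */
    return high * 100 <= (size_t)max_high_percent * len;
}
```

A sensible value for max_high_percent depends on the language of the text, which is where the research on accented characters would come in.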
Thanks Paul.
Please also add 26 (EOF) as mentioned by stochastic
This is probably the only thing that you can absolutely be sure of:
A text file will never contain '\0' - the ASCII zero, the first character of the ASCII table, value 0.
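A check for the zero byte is short enough to sketch with memchr(). The function name is made up for this example. The rule holds for ASCII and other 8-bit encodings; text stored as UTF-16 does contain zero bytes, so it would be flagged as binary by this test.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if buf contains at least one '\0' byte, else 0. */
int contains_nul(const unsigned char *buf, size_t len)
{
    return memchr(buf, '\0', len) != NULL;
}
```

The buffer has to come from fread() or similar together with its length; string functions such as strlen() would stop at the very byte being looked for.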
Aye for the Jaime Olivares test for Textuality :-)
(doesn't it feel nice to give a name to the test?)
- sto