• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 357

Is Binary?

I read that it's possible to determine whether a file is binary or text by checking the first 1024 bytes. How do I do that? Some pointers on how to go about it would be good.
0
Asked by: sonic2000
2 Solutions
 
pankajtiwaryCommented:
Which system are you talking about? In Unix and other flavours of Unix there is no difference between a text file and a binary file. Sometimes an executable is called a binary, but there is no concept of a binary file in Unix.
0
 
sonic2000Author Commented:
I am using SunOS.
My concept of a text file here is one that displays nicely when I use cat; anything else is a binary file.
Hope that's clear.
0
 
sunnycoderCommented:
Hi sonic2000,

you can use the file command to get the type of file

file <filename>

Many file formats begin with header information that identifies the type of file. Files generated by other programs, such as a compiled ELF binary, usually have a particular format and a magic number that helps in identification.

Look into the source code of the file utility for the exact implementation. Source code should be available for at least the Linux platform.

Sunnycoder
0
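For illustration, here is a minimal sketch of what a magic-number check looks like, using the ELF format mentioned above. ELF files begin with the byte 0x7F followed by the letters E, L, F. The function name has_elf_magic is made up for this example, and the file utility itself recognizes many more signatures than this one.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the buffer starts with the ELF
 * magic number (0x7F followed by 'E', 'L', 'F'), 0 otherwise. */
int has_elf_magic(const unsigned char *buf, size_t n)
{
    static const unsigned char elf_magic[4] = { 0x7F, 'E', 'L', 'F' };

    /* A buffer shorter than the signature cannot match. */
    return n >= sizeof elf_magic
        && memcmp(buf, elf_magic, sizeof elf_magic) == 0;
}
```

To use it, open the file with fopen(path, "rb"), fread the first few bytes into a buffer, and pass the buffer and the number of bytes actually read.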
 
PaulCaswellCommented:
Windows has no advanced facility like this. You just have a file with data in it! Pretty crappy really but only a bad workman blames his tools.

The theory about checking the first 1024 bytes is only a guesstimate. What you do is read 1024 bytes from the file and make sure all characters in it are printable (most use 'isprint', but make your own by all means). The idea is that if there is anything other than 0-9, a-z, A-Z, punctuation and white space, then it is probably a binary file. You can be a bit more intelligent, but that's the basic algorithm.

Paul
0
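A rough sketch of the basic algorithm Paul describes, assuming the default "C" locale. The names SAMPLE_SIZE, sample_looks_like_text and file_looks_like_text are invented for this example. Note that isprint() on its own rejects tab, CR and LF, so the sketch also accepts anything isspace() reports as white space.

```c
#include <ctype.h>
#include <stdio.h>

#define SAMPLE_SIZE 1024  /* prefix length from the question; an arbitrary choice */

/* Returns 1 if every byte in buf is printable or white space, else 0.
 * An empty buffer counts as text. */
int sample_looks_like_text(const unsigned char *buf, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (!isprint(buf[i]) && !isspace(buf[i]))
            return 0;
    }
    return 1;
}

/* Reads up to SAMPLE_SIZE bytes from path and applies the check.
 * Returns 1 for "probably text", 0 for "probably binary",
 * and -1 if the file cannot be opened. */
int file_looks_like_text(const char *path)
{
    unsigned char buf[SAMPLE_SIZE];
    size_t n;
    FILE *fp = fopen(path, "rb");

    if (fp == NULL)
        return -1;
    n = fread(buf, 1, sizeof buf, fp);
    fclose(fp);
    return sample_looks_like_text(buf, n);
}
```

This is only a guess about the file: a binary file whose first 1024 bytes happen to be printable would still be reported as text.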
 
sonic2000Author Commented:
How to get the magic number?
0
 
stochasticCommented:
sonic2000,

I don't think there is a 'magic number' - 1024 or something else - that will conclusively decide.

I think it should be pretty easy to check the entire file - the program will be quite fast.

You can either use the isprint() macro, or check whether any character (byte) has a value higher than 127. If it is a pure ASCII text file, all byte values must be 127 or below.
You can be stricter by checking for only 32-127, plus 9 (tab), 13 (cr) and 10 (lf).

Do you want code for how to do this?

- stochastic
0
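One way the whole-file variant could look, as a sketch only. It uses the stricter range from the post above (32-127 plus tab, LF and CR); the function names are made up for this example.

```c
#include <stdio.h>

/* Returns 1 if c is tab (9), LF (10), CR (13) or lies in 32..127,
 * which is the stricter range suggested in the post; 0 otherwise. */
int byte_is_plain_ascii(int c)
{
    return c == 9 || c == 10 || c == 13 || (c >= 32 && c <= 127);
}

/* Scans the stream from its current position to the end.
 * Returns 1 if every byte passes the check, 0 at the first byte that fails. */
int stream_is_plain_ascii(FILE *fp)
{
    int c;

    while ((c = getc(fp)) != EOF) {
        if (!byte_is_plain_ascii(c))
            return 0;
    }
    return 1;
}
```

The stream should be opened with fopen(path, "rb") so that no bytes are translated before the check sees them. Because the scan stops at the first offending byte, binary files are usually rejected quickly.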
 
stochasticCommented:
btw, since you are using SunOS, sunnycoder's suggestion should work just fine.

- sto
0
 
sonic2000Author Commented:
stochastic,
 I will code it myself :) Thanks a lot. Can you tell me more about which ASCII values are allowed and which are not?
0
 
PaulCaswellCommented:
It's not a matter of 'allowed' and 'not allowed', it's more a matter of 'likely' or 'unlikely'. Take a look at the ASCII character set and decide for yourself which characters would suggest to you that the file was binary. 'cr' and 'lf' are common in text files but you can also get 'ff' (form feed) sometimes. Even more rare but possible are FS, GS and RS. You then need to decide if DEL (0x7f) can appear. Generally speaking 32 - 126 are OK and tab, cr and lf are common, the rest is your choice.

Paul
0
 
stochasticCommented:
What Paul said is absolutely right, but > 127 is certainly "not allowed". No file would qualify to be a text file if it has characters with the high bit set.

- sto
0
 
PaulCaswellCommented:
>> but > 127 is certainly "not allowed". No file would qualify to be a text file if it has characters with the high bit set.

Did you forget £, Yen and accented characters? I did!

sonic,

As you can see, even wise old codgers like Sto and I can't get it right first time.

Paul
0
 
stochasticCommented:
> even wise old codgers like Sto  .....

I like the "wise" part :-)  LOL

True, about many printable characters >127.

On another track, consider the two file open modes in C, namely text (the default; "a" is actually append) and "b" (binary),
(remember - fopen(something, "r") vs. fopen(something, "rb")  )

the things that make the difference are (imho) for text mode, at least on DOS/Windows (on Unix the two modes behave the same):

* cr+lf are read as a single character, namely lf ('\n')
* ^z (ascii 26) is considered EOF
* null (ascii 0) is ignored? (I am not sure)

but to be sure, >127 are tolerated, and left alone.

- stochastic
0
 
Jaime OlivaresSoftware ArchitectCommented:
I propose this rule of thumb:

Consider valid characters:
-All between >=32 and <=127
-Also characters 9,10,13
-None other below 32
-Also, consider up to 5% of characters above 127 (or maybe up to 10%)

I think this will match more than 95% of cases. Just test it against many known files (non-Unicode, of course).

Good luck,
Jaime.
0
 
PaulCaswellCommented:
Seconded!!
0
 
PaulCaswellCommented:
Though I'd add FF (12) and do some more research on the likely percentage of accented characters in various European countries.

Otherwise excellent idea Jaime.

All those in favour say Aye!!
0
 
Jaime OlivaresSoftware ArchitectCommented:
Thanks Paul.
Please also add 26 (EOF) as mentioned by stochastic
0
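The rule of thumb above, together with the FF (12) and ^Z (26) amendments, could be sketched roughly as follows. The function name and the max_high_percent parameter are assumptions made for this example; the thread only suggests 5% (or maybe 10%) as the threshold.

```c
#include <stddef.h>

/* Returns 1 if buf passes the rule of thumb, 0 otherwise.
 * max_high_percent is the tolerated share of bytes above 127 (e.g. 5). */
int passes_rule_of_thumb(const unsigned char *buf, size_t n,
                         unsigned max_high_percent)
{
    size_t i, high = 0;

    for (i = 0; i < n; i++) {
        unsigned char c = buf[i];

        if (c > 127) {
            high++;  /* counted now, judged against the threshold below */
        } else if (c < 32) {
            /* Below 32, only tab, LF, FF, CR and ^Z (26) are accepted. */
            if (c != 9 && c != 10 && c != 12 && c != 13 && c != 26)
                return 0;
        }
        /* 32..127 is always accepted */
    }

    /* Integer form of: high / n <= max_high_percent / 100 */
    return high * 100 <= n * (size_t)max_high_percent;
}
```

Keeping the threshold as a parameter makes it easy to try 5% and 10% against a set of known files, as suggested above. The multiplication is fine for a sample of a few kilobytes but could overflow for extremely large buffers.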
 
aib_42Commented:
This is probably the only thing that you can absolutely be sure of:

A text file will never contain '\0' - the ASCII zero, the first character of the ASCII table, value 0.
0
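A zero-byte check is a one-liner with memchr. This is a sketch with a made-up function name, and it assumes single-byte text; as noted earlier in the thread for Unicode, text stored as UTF-16 would normally contain zero bytes and be misjudged.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if the buffer contains at least one zero byte, else 0. */
int contains_nul(const unsigned char *buf, size_t n)
{
    return n > 0 && memchr(buf, 0, n) != NULL;
}
```

A buffer for which contains_nul returns 1 can be treated as binary straight away; a result of 0 says nothing either way, so the other checks still apply.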
 
stochasticCommented:
Aye for the Jaime Olivares test for Textuality :-)
(doesn't it feel nice to give a name to the test?)

- sto
0
