Solved

Is Binary?

Posted on 2004-09-17
Last Modified: 2013-11-15
I read that it's possible to determine whether a file is a binary or a text file by checking the first 1024 bytes. How do I do that? Some pointers on how to go about it would be good.
Question by:sonic2000
18 Comments
 
LVL 4

Expert Comment

by:pankajtiwary
Which system are you talking about? In Unix and other flavours of Unix there is no difference between a text file and a binary file. Sometimes an executable is called a binary, but there is no concept of a binary file in Unix.
 

Author Comment

by:sonic2000
I am using SunOS.
By "text file" I mean a file that displays nicely when I use cat on it; anything else I would call binary.
Hope that's clear.
 
LVL 45

Expert Comment

by:sunnycoder
Hi sonic2000,

you can use the file command to get the type of file

file <filename>

Many file formats start with header information that identifies the type of file. Files generated by other programs, such as a compiled ELF binary, usually have a particular format and a magic number that helps in identification.

Look at the source code of the file utility for the exact implementation. The source should be available at least for the Linux platform.
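
For illustration, here is a minimal sketch in C of a magic-number check. It only recognises ELF, whose files start with the four bytes 0x7F 'E' 'L' 'F'. The names has_elf_magic and file_is_elf are made up for this example, and the real file utility consults a much larger table of signatures.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if the buffer starts with the ELF magic number, else 0. */
int has_elf_magic(const unsigned char *buf, size_t len)
{
    static const unsigned char elf_magic[4] = { 0x7F, 'E', 'L', 'F' };

    return len >= sizeof elf_magic &&
           memcmp(buf, elf_magic, sizeof elf_magic) == 0;
}

/* Hypothetical helper: 1 if the file is ELF, 0 if not, -1 if it cannot be opened. */
int file_is_elf(const char *path)
{
    unsigned char head[4];
    size_t got;
    FILE *fp = fopen(path, "rb");   /* binary mode: read the bytes untranslated */

    if (fp == NULL)
        return -1;
    got = fread(head, 1, sizeof head, fp);
    fclose(fp);
    return has_elf_magic(head, got);
}
```

Calling file_is_elf("/bin/ls") on a system that uses ELF executables would be one way to try it out.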

Sunnycoder
 
LVL 16

Expert Comment

by:PaulCaswell
Windows has no advanced facility like this. You just have a file with data in it! Pretty crappy really, but only a bad workman blames his tools.

The theory about checking the first 1024 bytes is only a guesstimate. What you do is read 1024 bytes from the file and make sure all characters in it are printable (most use 'isprint', but make your own by all means). The idea is that if there is anything other than 0-9, a-z, A-Z, punctuation and white space then it is probably a binary file. You can be a bit more intelligent but that's the basic algorithm.
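
A rough sketch of that basic algorithm in C, assuming the default "C" locale (where isprint accepts only 32-126). The names SAMPLE_SIZE, sample_is_printable and file_looks_like_text are invented for the example; treat it as a starting point rather than a finished test.

```c
#include <ctype.h>
#include <stdio.h>

#define SAMPLE_SIZE 1024   /* how many leading bytes to inspect */

/* Returns 1 if every byte is printable or white space, else 0. */
int sample_is_printable(const unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (!isprint(buf[i]) && !isspace(buf[i]))
            return 0;               /* something unprintable: probably binary */
    }
    return 1;
}

/* Hypothetical helper: 1 = looks like text, 0 = looks binary, -1 = cannot open. */
int file_looks_like_text(const char *path)
{
    unsigned char buf[SAMPLE_SIZE];
    size_t got;
    FILE *fp = fopen(path, "rb");

    if (fp == NULL)
        return -1;
    got = fread(buf, 1, sizeof buf, fp);   /* may be fewer than 1024 bytes */
    fclose(fp);
    return sample_is_printable(buf, got);
}
```

Note that an empty file comes out as "text" here, which may or may not be what you want.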

Paul
 

Author Comment

by:sonic2000
How do I get the magic number?
 
LVL 8

Expert Comment

by:stochastic
sonic2000,

I don't think there is a 'magic number' - 1024 or something else - that will conclusively decide.

I think it should be pretty easy to check the entire file - the program will be quite fast.

You can either use the isprint() macro, or check whether any character (byte) has an ASCII value
higher than 127. If it is a pure text file, all ASCII values must be 127 or below
(you can be stricter by checking for only 32-127, plus 9 (tab), 13 (cr) and 10 (lf)).

Do you want code for how to do this?
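
In case it helps, a minimal sketch of the whole-file check in C. The function names and the strict flag are my own invention for this example; it reads one byte at a time with getc, which is buffered and quick enough for ordinary files.

```c
#include <stdio.h>

/* Stricter rule: only 32-127 plus tab (9), lf (10) and cr (13) pass. */
int is_strict_text_byte(int c)
{
    return (c >= 32 && c <= 127) || c == 9 || c == 10 || c == 13;
}

/*
 * Scans an open stream to the end.
 * Returns 1 if every byte is 127 or below (and, when strict is non-zero,
 * also passes is_strict_text_byte), 0 if an offending byte is found,
 * and -1 on a read error.
 */
int stream_is_ascii(FILE *fp, int strict)
{
    int c;

    while ((c = getc(fp)) != EOF) {
        if (c > 127)
            return 0;
        if (strict && !is_strict_text_byte(c))
            return 0;
    }
    return ferror(fp) ? -1 : 1;
}
```

Open the file with fopen(name, "rb") before passing it in, so that no bytes are translated on the way.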

- stochastic
 
LVL 8

Expert Comment

by:stochastic
btw, since you are using SunOS, sunnycoder's suggestion should work just fine.

- sto
 

Author Comment

by:sonic2000
stochastic,
 I will code it myself :) Thanks a lot. Can you tell me more about which ASCII values are allowed and which are not?
 
LVL 16

Expert Comment

by:PaulCaswell
It's not a matter of 'allowed' and 'not allowed', it's more a matter of 'likely' or 'unlikely'. Take a look at the ASCII character set and decide for yourself which characters would suggest to you that the file was binary. 'cr' and 'lf' are common in text files but you can also get 'ff' (form feed) sometimes. Even more rare but possible are FS, GS and RS. You then need to decide if DEL (0x7f) can appear. Generally speaking 32 - 126 are OK and tab, cr and lf are common, the rest is your choice.
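
One way to express the 'likely'/'unlikely' idea is to sort each byte into a class and count how many fall into each. This is only a sketch: the enum and function names are invented, and putting everything above 127 into the suspect class is my assumption, not something settled here.

```c
#include <stddef.h>

typedef enum { BYTE_COMMON, BYTE_RARE, BYTE_SUSPECT } byte_class;

/* Classifies one byte by how likely it is to appear in a text file. */
byte_class classify_byte(unsigned char c)
{
    if (c >= 32 && c <= 126)
        return BYTE_COMMON;          /* ordinary printable ASCII */

    switch (c) {
    case 9:                          /* tab */
    case 10:                         /* lf */
    case 13:                         /* cr */
        return BYTE_COMMON;
    case 12:                         /* ff (form feed) */
    case 28:                         /* FS */
    case 29:                         /* GS */
    case 30:                         /* RS */
    case 127:                        /* DEL: your call */
        return BYTE_RARE;
    default:
        return BYTE_SUSPECT;         /* assumption: includes bytes above 127 */
    }
}

/* Adds the number of bytes in each class to counts[], indexed by byte_class. */
void tally_bytes(const unsigned char *buf, size_t len, size_t counts[3])
{
    size_t i;

    for (i = 0; i < len; i++)
        counts[classify_byte(buf[i])]++;
}
```

Zero the counts array before the first call; what proportion of rare or suspect bytes you tolerate is then up to you.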

Paul
 
LVL 8

Expert Comment

by:stochastic
What Paul said is absolutely right, but > 127 is certainly "not allowed". No file would qualify to be a text file if it has characters with the high bit set.

- sto
 
LVL 16

Expert Comment

by:PaulCaswell
>> but > 127 is certainly "not allowed". No file would qualify to be a text file if it has characters with the high bit set.

Did you forget £, Yen and accented characters? I did!

sonic,

As you can see, even wise old codgers like Sto and I can't get it right first time.

Paul
 
LVL 8

Accepted Solution

by:
stochastic earned 25 total points
> even wise old codgers like Sto  .....

I like the "wise" part :-)  LOL

True, about many printable characters >127.

On another track, consider the two file open modes in C, namely text (the default) and binary ("b")
(remember - fopen(something, "r") vs. fopen(something, "rb")  ).

The things that make the difference (imho) in text mode, on DOS/Windows at least (on Unix the two modes behave the same), are:

* cr+lf are read as a single character, namely lf ('\n')
* ^z (ascii 26) is considered EOF
* null (ascii 0) is ignored? (I am not sure)

but to be sure, >127 are tolerated, and left alone.

- stochastic
 
LVL 55

Assisted Solution

by:Jaime Olivares
Jaime Olivares earned 25 total points
I propose this rule of thumb:

Consider valid characters:
-All between >=32 and <=127
-Also characters 9,10,13
-None other below 32
-Also, consider up to 5% of characters above 127 (or maybe up to 10%)

I think this will match more than 95% of cases; just test it against many known files (non-Unicode, of course).
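
As a sketch only, the rule could be coded in C along these lines. MAX_HIGH_PERCENT and passes_rule_of_thumb are names made up for this example, and the 5% limit is just the starting value suggested above.

```c
#include <stddef.h>

#define MAX_HIGH_PERCENT 5   /* tolerated share of bytes above 127; tune to taste */

/* Returns 1 if the buffer passes the rule of thumb, else 0. */
int passes_rule_of_thumb(const unsigned char *buf, size_t len)
{
    size_t i;
    size_t high = 0;         /* number of bytes above 127 seen so far */

    for (i = 0; i < len; i++) {
        unsigned char c = buf[i];

        if (c > 127)
            high++;
        else if (c < 32 && c != 9 && c != 10 && c != 13)
            return 0;        /* a control character other than tab, lf, cr */
    }
    /* Integer form of high / len <= MAX_HIGH_PERCENT / 100. */
    return high * 100 <= len * MAX_HIGH_PERCENT;
}
```

Feed it either the whole file or just a leading sample; an empty buffer passes.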

Good luck,
Jaime.
 
LVL 16

Expert Comment

by:PaulCaswell
Seconded!!
 
LVL 16

Expert Comment

by:PaulCaswell
Though I'd add FF (12) and do some more research on the likely percentage of accented characters in various European countries.

Otherwise excellent idea Jaime.

All those in favour say Aye!!
 
LVL 55

Expert Comment

by:Jaime Olivares
Thanks Paul.
Please also add 26 (^Z, the EOF marker) as mentioned by stochastic.
 
LVL 7

Expert Comment

by:aib_42
This is probably the only thing that you can absolutely be sure of:

A text file will never contain '\0' - the ASCII zero, the first character of the ASCII table, value 0.
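
A minimal sketch of that check in C, using the standard memchr. The name contains_nul is made up for the example, and it assumes single-byte or UTF-8 text, since UTF-16 text files do contain zero bytes.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if the buffer contains at least one zero byte, else 0. */
int contains_nul(const unsigned char *buf, size_t len)
{
    return len > 0 && memchr(buf, '\0', len) != NULL;
}
```

A hit means "almost certainly binary"; no hit does not by itself prove the file is text.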
 
LVL 8

Expert Comment

by:stochastic
Aye for the Jaime Olivares test for Textuality :-)
(doesn't it feel nice to give a name to the test?)

- sto