Solved

Is Binary?

Posted on 2004-09-17
Last Modified: 2013-11-15
I read that it's possible to determine whether a file is binary or text by checking the first 1024 bytes. How can I do that? Some direction on how to go about it would be appreciated.
0
Question by:sonic2000
18 Comments
 
LVL 4

Expert Comment

by:pankajtiwary
ID: 12082331
Which system are you talking about? In Unix and other flavours of Unix, the operating system makes no distinction between a text file and a binary file. An executable is sometimes called a binary, but Unix itself has no concept of a binary file.
0
 

Author Comment

by:sonic2000
ID: 12082346
I am using SunOS.
My concept of a text file here is one that displays nicely when I use cat; anything else is binary.
Hope that's clear.
0
 
LVL 45

Expert Comment

by:sunnycoder
ID: 12082373
Hi sonic2000,

you can use the file command to get the type of file

file <filename>

Many file formats begin with header information that identifies the type of file. Files generated by other programs, such as a compiled ELF binary, usually have a particular layout and a magic number (a fixed byte sequence at the start of the file) that helps with identification.

Look into the source code of the file utility for the exact implementation. The source should be available at least for the Linux platform.
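
As a rough illustration of the magic-number idea (this is not how the file utility is actually implemented, and the function names here are made up for the example), the sketch below checks whether a file starts with the four-byte ELF magic number, 0x7F followed by 'E', 'L', 'F':

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if the buffer starts with the 4-byte ELF magic number
   (0x7F 'E' 'L' 'F'), 0 otherwise. Hypothetical helper name. */
int has_elf_magic(const unsigned char *buf, size_t len)
{
    static const unsigned char elf_magic[4] = { 0x7F, 'E', 'L', 'F' };

    return len >= sizeof elf_magic &&
           memcmp(buf, elf_magic, sizeof elf_magic) == 0;
}

/* Reads the first four bytes of the named file and checks them.
   Returns 1 for ELF, 0 for anything else, -1 if the file cannot be opened. */
int file_is_elf(const char *path)
{
    unsigned char head[4];
    size_t n;
    FILE *fp = fopen(path, "rb");   /* binary mode: no translation of bytes */

    if (fp == NULL)
        return -1;
    n = fread(head, 1, sizeof head, fp);
    fclose(fp);
    return has_elf_magic(head, n);
}
```

A magic number only identifies one specific format. It says nothing about files that have no such signature, which is the usual case for plain text.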

Sunnycoder
0
 
LVL 16

Expert Comment

by:PaulCaswell
ID: 12082413
Windows has no advanced facility like this. You just have a file with data in it! Pretty crappy really but only a bad workman blames his tools.

The theory about checking the first 1024 bytes is only a guesstimate. What you do is read 1024 bytes from the file and make sure all the characters in it are printable (most people use 'isprint', but write your own check by all means). The idea is that if there is anything other than 0-9, a-z, A-Z, punctuation and white space, then it is probably a binary file. You can be a bit more intelligent, but that's the basic algorithm.
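
A minimal sketch of that basic algorithm, assuming the default "C" locale and treating white space (which isprint alone rejects, apart from the space character) as acceptable. The function names and the SAMPLE_SIZE constant are made up for the example:

```c
#include <ctype.h>
#include <stdio.h>

#define SAMPLE_SIZE 1024   /* how much of the file to inspect */

/* Returns 1 if every byte in buf is printable or white space
   according to the current locale, 0 otherwise. */
int sample_is_printable(const unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (!isprint(buf[i]) && !isspace(buf[i]))
            return 0;
    }
    return 1;
}

/* Guess from the first SAMPLE_SIZE bytes of the file:
   1 = probably text, 0 = probably binary, -1 = cannot open.
   An empty file is reported as text. */
int guess_is_text(const char *path)
{
    unsigned char buf[SAMPLE_SIZE];
    size_t n;
    FILE *fp = fopen(path, "rb");

    if (fp == NULL)
        return -1;
    n = fread(buf, 1, sizeof buf, fp);
    fclose(fp);
    return sample_is_printable(buf, n);
}
```

Because only a sample is examined, a file with a long textual header followed by binary data would still be reported as text.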

Paul
0
 

Author Comment

by:sonic2000
ID: 12082528
How to get the magic number?
0
 
LVL 8

Expert Comment

by:stochastic
ID: 12082544
sonic2000,

I don't think there is a 'magic number' - 1024 or something else - that will conclusively decide.

I think it should be pretty easy to check the entire file - the program will be quite fast.

You can either use the isprint() macro, or check whether any character (byte) has a value
higher than 127. In a pure ASCII text file, every byte must be 127 or below.
(You can be stricter by accepting only 32-127, plus 9 (tab), 13 (CR) and 10 (LF).)
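
For what it's worth, the stricter whole-file check could look something like the sketch below. The function names are invented for the example, and the range follows the numbers given above; note that 127 is DEL, so you may prefer to stop at 126.

```c
#include <stdio.h>

/* Strict check: accepts 32-127 plus tab (9), LF (10) and CR (13).
   127 is DEL; change the upper bound to 126 to reject it. */
int is_strict_text_byte(int c)
{
    return (c >= 32 && c <= 127) || c == 9 || c == 10 || c == 13;
}

/* Scans the whole stream from its current position.
   Returns the offset of the first byte that fails the strict check,
   or -1 if every byte passes. The stream should be opened with "rb". */
long first_suspect_offset(FILE *fp)
{
    long offset = 0;
    int c;

    while ((c = getc(fp)) != EOF) {
        if (!is_strict_text_byte(c))
            return offset;
        offset++;
    }
    return -1;
}
```

Returning the offset rather than a plain yes/no makes it easy to see where a file stops looking like text.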

Do you want code for how to do this?

- stochastic
0
 
LVL 8

Expert Comment

by:stochastic
ID: 12082549
btw, since you are using SunOS, sunnycoder's suggestion should work just fine.

- sto
0
 

Author Comment

by:sonic2000
ID: 12082636
stochastic,
 I will code it myself :) Thanks a lot. Can you tell me more about which ASCII values are allowed and which are not?
0
 
LVL 16

Expert Comment

by:PaulCaswell
ID: 12082706
It's not a matter of 'allowed' and 'not allowed'; it's more a matter of 'likely' or 'unlikely'. Take a look at the ASCII character set and decide for yourself which characters would suggest to you that the file is binary. 'cr' and 'lf' are common in text files, but you can also get 'ff' (form feed) sometimes. Rarer still, but possible, are FS, GS and RS. You then need to decide whether DEL (0x7f) can appear. Generally speaking, 32 - 126 are OK and tab, cr and lf are common; the rest is your choice.

Paul
0
 
LVL 8

Expert Comment

by:stochastic
ID: 12083979
What Paul said is absolutely right, but > 127 is certainly "not allowed". No file would qualify to be a text file if it has characters with the high bit set.

- sto
0
 
LVL 16

Expert Comment

by:PaulCaswell
ID: 12084133
>> but > 127 is certainly "not allowed". No file would qualify to be a text file if it has characters with the high bit set.

Did you forget £, Yen and accented characters? I did!

sonic,

As you can see, even wise old codgers like Sto and I can't get it right first time.

Paul
0
 
LVL 8

Accepted Solution

by:
stochastic earned 100 total points
ID: 12084418
> even wise old codgers like Sto  .....

I like the "wise" part :-)  LOL

True, about many printable characters >127.

On another track, consider the two file open modes in C, namely text (the default) and binary ("b")
(remember - fopen(something, "r") vs. fopen(something, "rb")  )

The things that make the difference (imho) in text mode on DOS/Windows (on Unix the two modes behave identically) are:

* cr+lf are read as a single character, namely lf ('\n')
* ^z (ascii 26) is considered EOF
* null (ascii 0) is ignored? (I am not sure)

but to be sure, >127 are tolerated, and left alone.

- stochastic
0
 
LVL 55

Assisted Solution

by:Jaime Olivares
Jaime Olivares earned 100 total points
ID: 12084420
I propose this rule of thumb:

Consider valid characters:
-All between >=32 and <=127
-Also characters 9,10,13
-None other below 32
-Also, consider up to 5% of characters above 127 (or maybe up to 10%)

I think this will match more than 95% of cases. Just test it against many known files (non-Unicode, of course).
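
A rough sketch of this rule of thumb in C, working on a buffer that has already been read from the file. The function name and the max_high_ratio parameter are made up for the example; pass 0.05 or 0.10 for the 5% or 10% limit.

```c
#include <stddef.h>

/* Applies the rule of thumb to a buffer:
   - bytes 32..127 and 9, 10, 13 are valid
   - any other byte below 32 means "binary"
   - bytes above 127 are tolerated up to max_high_ratio of the total
   Returns 1 for "probably text", 0 for "probably binary". */
int rule_of_thumb_is_text(const unsigned char *buf, size_t len,
                          double max_high_ratio)
{
    size_t i;
    size_t high = 0;   /* count of bytes above 127 */

    for (i = 0; i < len; i++) {
        unsigned char c = buf[i];

        if (c > 127)
            high++;
        else if (c < 32 && c != 9 && c != 10 && c != 13)
            return 0;  /* disallowed control character */
    }
    /* Same as high / len <= max_high_ratio, but safe when len is 0. */
    return (double)high <= max_high_ratio * (double)len;
}
```

The set of accepted control characters is the obvious thing to tune: extending the `c != ...` test is all it takes to allow more of them.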

Good luck,
Jaime.
0
 
LVL 16

Expert Comment

by:PaulCaswell
ID: 12084572
Seconded!!
0
 
LVL 16

Expert Comment

by:PaulCaswell
ID: 12084654
Though I'd add FF (12) and do some more research on the likely percentage of accented characters in various European countries.

Otherwise excellent idea Jaime.

All those in favour say Aye!!
0
 
LVL 55

Expert Comment

by:Jaime Olivares
ID: 12084730
Thanks Paul.
Please also add 26 (EOF) as mentioned by stochastic
0
 
LVL 7

Expert Comment

by:aib_42
ID: 12091079
This is probably the only thing that you can absolutely be sure of:

A text file will never contain '\0' - the ASCII zero, the first character of the ASCII table, value 0.
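
Checking for that is a one-liner with memchr. A small sketch (the function name is made up); bear in mind it only holds for single-byte encodings, since UTF-16 text routinely contains zero bytes:

```c
#include <string.h>

/* Returns 1 if the buffer contains at least one '\0' byte, 0 otherwise. */
int contains_nul(const unsigned char *buf, size_t len)
{
    return len > 0 && memchr(buf, '\0', len) != NULL;
}
```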
0
 
LVL 8

Expert Comment

by:stochastic
ID: 12099459
Aye for the Jaime Olivares test for Textuality :-)
(doesn't it feel nice to give a name to the test?)

- sto
0
