headzoo

asked on

PHP: Detect if uploaded file is text based

Howdy,
  I'm wondering if there are already some functions or classes available to return whether a file (most likely uploaded) is text based or not.  I can probably roll my own, but if something is out there already I'll use that.
  Something like:

$textBased = is_text($fileName);

  The MIME type returned when uploading files isn't very good for this.  For example, when a PHP script is uploaded the MIME type is "application/octet-stream", but we know a PHP script is just plain text.
  Again, I could make my own, and I don't really want anyone going through the trouble of making something just to answer my question.  Just wondering if there is already something available that can do that.

Thanks!
sjohnstone1234

What would you define as being a "plain text" file?

I suppose it could be "a file containing only printable characters", so if there are any bytes below 32 except for 9, 10 or 13 (tab, newline, carriage return) then the file *isn't* text. But will you be accepting UTF-8 or other character sets which might contain characters that aren't normally printable? And does, for example, HTML count as "plain text"?

Anyway, if my first suggestion is what you're after, you could use something like this:

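<?php

$data = file_get_contents($fileName);

// Look for control bytes below 32, other than tab (9), newline (10) and carriage return (13).
if ($data !== false && preg_match('/[\x00-\x08\x0b-\x0c\x0e-\x1f]/', $data) === 0) {
    // file is text
}

?>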
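
If you want it in the is_text($fileName) form from the question, here is a minimal sketch of the same rule (no bytes below 32 other than 9, 10 and 13). The $sampleSize parameter and its 8192 default are my own assumptions, not anything standard; the idea is only to avoid loading a large upload in full, on the guess that binary files usually give themselves away early.

```php
<?php

/**
 * Sketch: returns true when the first $sampleSize bytes of the file contain
 * no control bytes other than tab, newline and carriage return.
 * $sampleSize is an assumed default, not a value from this thread.
 */
function is_text(string $fileName, int $sampleSize = 8192): bool
{
    // Read only a leading sample so large uploads are not loaded whole.
    $data = @file_get_contents($fileName, false, null, 0, $sampleSize);
    if ($data === false) {
        return false; // unreadable file: treat it as "not text"
    }

    // Any match means a disallowed control byte was found.
    return preg_match('/[\x00-\x08\x0b\x0c\x0e-\x1f]/', $data) === 0;
}
```

Note that an empty file counts as text here, and a binary file whose first $sampleSize bytes happen to be printable would slip through, so raise the sample size if that matters.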
headzoo

ASKER

The purpose of an "is_text" function is to determine whether it's safe to run a DIFF-type function on two files, to find any differences between the two.  Obviously you don't want to do something like that on binary files; it would be pointless anyway.

A preg_match also crossed my mind, but then I don't know how to work that with UTF-8.  I only plan on using DIFF on text files containing English, but that doesn't mean the files won't be encoded in UTF-8 or something else.
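
For what it's worth, here is a rough sketch of how a preg_match check might be combined with UTF-8, on the assumption that the files are either ASCII or UTF-8. The function name is_utf8_text is hypothetical. It relies on two things: preg_match() returns false when the /u modifier is used on invalid UTF-8, and UTF-8 multi-byte sequences only use bytes of 0x80 and above, so a byte-level control-character test still works unchanged.

```php
<?php

/**
 * Sketch: treats $data as text when it is valid UTF-8 (plain ASCII included)
 * and contains no control bytes other than tab, newline and carriage return.
 * The name is_utf8_text is hypothetical.
 */
function is_utf8_text(string $data): bool
{
    // With the /u modifier, preg_match() returns false on invalid UTF-8.
    if (preg_match('//u', $data) !== 1) {
        return false;
    }

    // UTF-8 multi-byte sequences only use bytes >= 0x80, so this byte-level
    // check cannot misfire on a non-ASCII character.
    return preg_match('/[\x00-\x08\x0b\x0c\x0e-\x1f]/', $data) === 0;
}
```

It would be called as is_utf8_text(file_get_contents($fileName)). The limitation is the "or something else" case: UTF-16 text contains NUL bytes for English characters, and most ISO-8859-1 text with accented characters is not valid UTF-8, so both would be reported as not text unless they are converted first.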
ASKER CERTIFIED SOLUTION
sjohnstone1234
