headzoo
asked on
PHP: Detect if uploaded file is text based
Howdy,
I'm wondering if there are already some functions or classes available to return whether a file (most likely uploaded) is text based or not. I can probably roll my own, but if something is out there already I'll use that.
Something like:
$textBased = is_text($fileName);
The MIME type returned when uploading files isn't very good for this. For example, when uploading a PHP script the MIME type is "application/octet-stream", but we know a PHP script is just plain text.
Again, I could make my own, and I don't really want anyone going through the trouble of making something just to answer my question. Just wondering if there is already something available that can do that.
Thanks!
ASKER
The purpose of an "is_text" function is to determine whether it's safe to run a DIFF type function on two files, to find any differences between the two. Obviously you don't want to do something like that on binary files. Besides, it would be pointless.
A preg_match also crossed my mind, but then I don't know how to work that with UTF-8. I only plan on using DIFF on text files containing English, but that doesn't mean the files won't be encoded in UTF-8 or something else.
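For the UTF-8 worry, one possible approach is PCRE's /u modifier: preg_match() returns false when the subject is not valid UTF-8, so an empty pattern can double as a validity check. The sketch below is only an illustration, and the function name looks_like_utf8_text is made up for this example. It assumes the file is either ASCII or UTF-8; other encodings such as UTF-16 would be rejected.

```php
<?php
// Hypothetical helper (the name is an assumption, not an existing PHP function).
// Returns true if $data is valid UTF-8 and contains no control bytes
// other than tab (9), newline (10) and carriage return (13).
function looks_like_utf8_text(string $data): bool
{
    // With the /u modifier, preg_match() returns false on invalid UTF-8,
    // and an empty pattern matches any valid string (returns 1).
    if (preg_match('//u', $data) !== 1) {
        return false;
    }

    // Reject the remaining ASCII control bytes.
    return preg_match('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', $data) === 0;
}

// Example call with a UTF-8 encoded "café" followed by a newline.
$ok = looks_like_utf8_text("caf\xC3\xA9\n");
```

Plain ASCII is a subset of UTF-8, so English-only files pass the same check without any special handling.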
ASKER CERTIFIED SOLUTION
I suppose it could be "a file containing only printable characters", so if there are any bytes below 32 except for 9, 10 or 13 (tab, newline, carriage return) then the file *isn't* text. But will you be accepting UTF-8 or other character sets which might contain characters that aren't normally printable? And does, for example, HTML count as "plain text"?
Anyway, if my first suggestion is what you're after, you could use something like this:
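<?php
// Read the whole file into a string
$data = file_get_contents($fileName);
// Look for control bytes other than tab (9), newline (10) or carriage return (13)
if (preg_match('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', $data) === 0) {
    // file is text
}
?>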
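If it helps, the same control-byte check could be wrapped into the is_text($fileName) shape from the question. This is a rough sketch, not a tested library function: the $sampleSize parameter and its 8192 default are assumptions added here so that large uploads are not read into memory in full, and only that leading sample is inspected.

```php
<?php
// Hypothetical is_text(), named after the call in the question.
// Only the first $sampleSize bytes are inspected (an assumption of this sketch).
function is_text(string $fileName, int $sampleSize = 8192): bool
{
    // Read at most $sampleSize bytes from the start of the file.
    $data = @file_get_contents($fileName, false, null, 0, $sampleSize);
    if ($data === false) {
        return false; // unreadable or missing file: treat as not text
    }

    // Text means no control bytes other than tab, newline, carriage return.
    return preg_match('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', $data) === 0;
}
```

For an upload this might be called as is_text($_FILES['userfile']['tmp_name']), where 'userfile' stands in for whatever the form field is actually called. Note that an empty file counts as text here, and a binary file whose first bytes happen to be printable would slip through.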