Detect whether a text file contains single-byte or Unicode text.

I need to detect whether a text file contains Unicode text or single-byte text. After a little research I suspect it might be as simple as counting the number of high-ASCII characters and, if there are more than some small number, assuming the file is Unicode. Is there a better way? Is there an API call to detect the type of text in a string or byte array?

zorvek (Kevin Jones), Consultant, asked:

Take a look at the article at codesnipers. It gives some good information about determining whether a file is Unicode or not. It does mention the IsTextUnicode API, though it seems that is not compatible with newer Unicode types. Looks like you're gonna have to build a function for it. I'd offer to help, but I know you know what you're doing.

Of course, there are similar functions at http:Q_21836497.html#16611812 though they don't look to be as detailed as the article at codesnipers seems to say they should be.

Does this work?

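Private Function IsUnicode(s As String) As Boolean
   ' Len counts characters, LenB counts bytes.
   ' Caveat: VB stores every String as UTF-16 internally, so for a
   ' non-empty string LenB(s) is normally 2 * Len(s) and this returns
   ' True regardless of how the source file was encoded.
   If Len(s) = LenB(s) Then
      IsUnicode = False
   Else
      IsUnicode = True
   End If
End Function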
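Option Explicit

' The buffer is declared As Any so the raw file bytes reach the API.
' A ByVal String argument is converted to ANSI by VB before the call,
' so the API would never see the original bytes.
Private Declare Function IsTextUnicode Lib "advapi32" ( _
    lpBuffer As Any, _
    ByVal cb As Long, _
    lpi As Long) As Long

' bytes() must be a dimensioned array holding the file's contents.
Public Function isUni(bytes() As Byte) As Boolean
    Dim cb As Long
    Dim flags As Long
    cb = UBound(bytes) - LBound(bytes) + 1
    ' You must supply at least 2 bytes to check
    If cb > 1 Then
        flags = &HF   ' request the four basic tests (bits 0 to 3)
        isUni = (IsTextUnicode(bytes(LBound(bytes)), cb, flags) <> 0)
    End If
End Function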
zorvek (Kevin Jones), Consultant, Author, commented:
Sorry...not done yet. I am still trying to get the code from egl1044 to function correctly. I'll be posting more comments soon seeking additional assistance with this.

fanpages, IT Services Consultant, commented:

"...Since the sequence 0xFEFF is exceedingly rare at the outset of regular non-Unicode text files, it can serve as an implicit marker or signature to identify the file as a Unicode file. Applications that read both Unicode and non-Unicode text files should use the presence of this sequence as an indicator that the file is most likely a Unicode file. (Compare this technique to using the MS-DOS EOF marker to terminate text files.)

When an application finds 0xFEFF at the beginning of a text file, it typically processes the file as though it were a Unicode file, although it may also perform further heuristic checks to verify that this is true. Such a check could be as simple as testing whether the variation in the low-order bytes is much higher than the variation in the high-order bytes. For example, if ASCII text is converted to Unicode text, every second byte is zero. Also, checking both for the linefeed and carriage-return characters (0x000A and 0x000D) and for even or odd file size can provide a strong indicator of the nature of the file.

When an application finds 0xFFFE at the beginning of a text file, it interprets it to mean the file is a byte-reversed Unicode file. The application can either swap the order of the bytes or alert the user that an error has occurred.

The Unicode byte-order mark character is not found in any code page, so it disappears if data is converted to ANSI. Unlike other Unicode characters, it is not replaced by a default character when it is converted. If a byte-order mark is found in the middle of a file, it is not interpreted as a Unicode character and has no effect on text output.

The Unicode value 0xFFFF is illegal in plain text files and cannot be passed between Win32 functions. The value 0xFFFF is reserved for an application's private use."
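
The checks described in the quoted passage can be sketched in VB6 roughly as follows. This is only an illustration under stated assumptions, not a tested solution: the function name GuessEncoding, the 4096-byte sample size, and the "more than half of the high-order bytes are zero" threshold are all hypothetical choices, and "Unicode" here means UTF-16 as Windows uses the term. UTF-8 is not considered. The 16-bit value 0xFEFF appears in a little-endian file as the bytes FF FE, and the byte-reversed 0xFFFE as FE FF.

```vb
Option Explicit

' Hypothetical helper: guess the encoding of a file from its first bytes.
Public Function GuessEncoding(ByVal path As String) As String
    Dim f As Integer
    Dim n As Long, i As Long, zeros As Long
    Dim buf() As Byte

    ' Read raw bytes in Binary mode so VB performs no ANSI/Unicode conversion.
    f = FreeFile
    Open path For Binary Access Read As #f
    n = LOF(f)
    If n > 4096 Then n = 4096      ' assumption: a sample is enough for a guess
    If n > 0 Then
        ReDim buf(0 To n - 1)
        Get #f, , buf
    End If
    Close #f

    ' Fewer than 2 bytes cannot hold a byte-order mark or a UTF-16 character.
    If n < 2 Then
        GuessEncoding = "single-byte"
        Exit Function
    End If

    If buf(0) = &HFF And buf(1) = &HFE Then
        ' 0xFEFF stored little-endian
        GuessEncoding = "UTF-16 LE (BOM)"
    ElseIf buf(0) = &HFE And buf(1) = &HFF Then
        ' byte-reversed Unicode file
        GuessEncoding = "UTF-16 BE (BOM)"
    Else
        ' No BOM: count zero bytes at odd offsets, which are the high-order
        ' bytes of little-endian UTF-16. ASCII text converted to Unicode
        ' has a zero in every one of those positions.
        For i = 1 To n - 1 Step 2
            If buf(i) = 0 Then zeros = zeros + 1
        Next i
        ' n \ 2 is the number of odd offsets in the sample.
        If zeros * 2 > (n \ 2) Then
            GuessEncoding = "UTF-16 LE (guess)"
        Else
            GuessEncoding = "single-byte"
        End If
    End If
End Function
```

Calling GuessEncoding with a file path returns one of the four strings above. The result is a heuristic, so a file without a byte-order mark that contains mostly non-Latin UTF-16 text could still be misreported as single-byte.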


zorvek (Kevin Jones), Consultant, Author, commented:
I still have not had time to get this to work. My tests thus far suggest that it does not work, but I do not yet have enough information to post follow-up information or questions. As none of the above answers has been proven to work, I cannot allow any of them to be selected as an answer, as that would provide false information to future viewers of this question. I also do not have the time right now, nor the appropriate Windows installations, to fully test the above scenarios or any derivatives of them.

I therefore ask that the question either be left alone for the time being or deleted. If deleted I will repost at a later date with as much of the information above as is relevant.

Remember that being a responsible EE member is not just maintaining questions, it's making sure the EE database provides good information to future viewers.

zorvek (Kevin Jones), Consultant, Author, commented:
I have not been able to get any of the above solutions to work yet, but I am confident an answer does lie somewhere above. The problem is that the machine I need for testing these potential solutions is only occasionally available to me, and I am being pulled in other directions. I, like you, like a clean TA and try to encourage askers to clean up sooner rather than later. But I also appreciate the occasional difficult situation and the need to add good content to the database.

So, for the record, I am confident that an answer to this problem lies above. However, I have been unable to get any of the above answers to work reliably. By closing the question I will be unable to post additional information after one week so the final correct answer will remain a challenge for any who follow.

Since you have forced my hand (I don't want the above information deleted) I'm going to mark all of the answers above as correct and you, Mr. Rollins, can live with the fact that the database now has one more incomplete thread.

Question has a verified solution.
