Nick_72

Read a unicode TextFile with ReadLn

Hi,

When I assign a TextFile to a Unicode-formatted txt file that contains only 'normal' characters which fit in one byte, I get strange results, since every other byte is #0. So when I try to get the result of a read line, I just get the first byte.
How should I read a Unicode text file? I would prefer to use ReadLn. Is that possible?

Thanks,
Nick
SOLUTION
Mike Littlewood
This solution is only available to members.
I'm afraid you're going to have to write your own readline to read Unicode text files, unless you have a specific need and can use a Unicode-enabled control. Perhaps you can enlighten us here.

Unicode files contain in the first two bytes in hex FF FE (the UTF-16 little-endian byte order mark) and you can use this fact to determine the file type. Thereafter you have to read TWO bytes at a time looking for the combination 0D 00 0A 00.
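For what it's worth, the two steps above (check for FF FE, then read two bytes at a time until the line end) could be sketched roughly like this in Delphi. The names SkipUtf16LEBom and ReadWideLine are made up for this illustration, and the sketch makes a simplifying assumption: it ends a line on the 16-bit LF (0A 00) and simply drops any CR (0D 00) rather than insisting on the exact four-byte sequence. It also reads straight from the stream without buffering, which is simple but may be slow on big files.

```pascal
program Utf16Lines;
{$APPTYPE CONSOLE}

uses
  Classes, SysUtils;

// Hypothetical helper: returns True if the stream starts with FF FE
// and leaves the position just after the marker. Otherwise it rewinds
// to the start of the stream and returns False.
function SkipUtf16LEBom(Stream: TStream): Boolean;
var
  Bom: array[0..1] of Byte;
begin
  Stream.Position := 0;
  Result := (Stream.Read(Bom, SizeOf(Bom)) = SizeOf(Bom)) and
            (Bom[0] = $FF) and (Bom[1] = $FE);
  if not Result then
    Stream.Position := 0;
end;

// Hypothetical helper: reads one line, two bytes (one WideChar) at a time.
// Assumption: LF ends the line and CR is skipped.
// Returns False once nothing more can be read.
function ReadWideLine(Stream: TStream; out Line: WideString): Boolean;
var
  Ch: WideChar;
begin
  Line := '';
  Result := False;
  while Stream.Read(Ch, SizeOf(Ch)) = SizeOf(Ch) do
  begin
    Result := True;
    if Ch = #10 then      // 0A 00: end of line
      Break;
    if Ch <> #13 then     // 0D 00: carriage return, not stored
      Line := Line + Ch;
  end;
end;

var
  FS: TFileStream;
  Line: WideString;
begin
  // ParamStr(1) is the file to read, passed on the command line
  FS := TFileStream.Create(ParamStr(1), fmOpenRead or fmShareDenyWrite);
  try
    if not SkipUtf16LEBom(FS) then
      WriteLn('No FF FE marker found, probably not a UTF-16 LE file')
    else
      while ReadWideLine(FS, Line) do
        WriteLn(string(Line));  // converted for console output
  finally
    FS.Free;
  end;
end.
```

A file without the marker is not necessarily an 8-bit file, so treat the check as a hint rather than proof.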

The question is, once you have read a line in, what are you going to do with it? Do you want to convert it to ANSI or what?
SOLUTION
This solution is only available to members.
Nick_72

ASKER

Although you might already have realized it, I feel I should clarify a bit further:

I managed to read a whole line into a string, but when I try to display it (for debugging purposes, with ShowMessage()) I get just the first byte in the message box, since there is this null terminator as the second byte of the first Unicode character. So ReadLn works for the whole line.

mikelittlewood:
StringReplace seems to be an option, and it seems easier to read the whole file first although it should work for the result of a call to ReadLn too.

BigRat:

>>Unicode files contain in the first two bytes in hex FF FE and you can use this fact to determine the file type
Great! I should implement this check.

>>Thereafter you have to read TWO bytes at a time looking for the combination 0D 00 0A 00.
Hmm... I'm with you on the 'two bytes' issue, but what is the combination 0D 00 0A 00 and what should I do with it?

>>Do you want to convert it to ANSI or what?
That was my initial thought, since I scan logfiles and check the lines for specific values - and if found, appropriate action is taken.
I have assumed that there are only ANSI characters in these files, but when I think of it, I can't be 100% sure.

What about the WideString type? Its purpose is to handle two-byte characters, isn't it?

Thanks,
Nick
Nick_72

ASKER

Mokule, didn't see your post, I'll check it out, thanks.

>>What about the WideString type.

Follow up the link, it reads the entire file into a string. That *might* cause you problems.

In any event, after scanning up to the Unicode CR/LF sequence and packing that into a WideString, the next step might be to use the Windows API WideCharToMultiByte to convert the 16-bit characters to 8 bits. I suppose you are searching only for ASCII sequences?
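As a rough illustration of that conversion step, here is a minimal sketch that wraps WideCharToMultiByte from the Windows unit. The function name WideToAnsi is made up for this example, and CP_ACP (the system's default ANSI code page) is an assumption about what "convert to 8 bits" should mean here. The API is called twice: once to ask for the required buffer size and once to do the conversion.

```pascal
program WideToAnsiDemo;
{$APPTYPE CONSOLE}

uses
  Windows;

// Hypothetical helper: converts a WideString to an AnsiString using the
// system's default ANSI code page (CP_ACP is an assumption, see above).
function WideToAnsi(const W: WideString): AnsiString;
var
  Len: Integer;
begin
  Result := '';
  if W = '' then
    Exit;
  // First call: no output buffer, just ask how many bytes are needed
  Len := WideCharToMultiByte(CP_ACP, 0, PWideChar(W), Length(W),
    nil, 0, nil, nil);
  if Len <= 0 then
    Exit;
  SetLength(Result, Len);
  // Second call: convert into the buffer we just allocated
  WideCharToMultiByte(CP_ACP, 0, PWideChar(W), Length(W),
    PAnsiChar(Result), Len, nil, nil);
end;

begin
  WriteLn(WideToAnsi('Hello'));
end.
```

Characters that have no equivalent in the target code page are replaced by a default character (typically '?'), so this is only safe if the log lines really are ASCII or ANSI. A plain assignment from a WideString to an AnsiString variable performs a similar implicit conversion in Delphi, which may be all that is needed.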
Nick_72

ASKER

Alright, it works OK, but the entire file is read into the variable. Now I need the lines one by one. I tried to use TStringList to split it with the Delimiter and DelimitedText properties, but I can't get that part to work. Even if I convert the WideString to an AnsiString it won't work. I have tried both #10 and #$A as the delimiter, but it just uses space as the delimiter.
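One possible way around the delimiter problem, sketched here under the assumption that the text has already been read into a WideString: assign it to the Text property of a TStringList instead of DelimitedText. Text splits on line breaks only, whereas DelimitedText in older Delphi versions, as far as I know, also breaks on spaces whatever Delimiter is set to (newer versions add a StrictDelimiter property for this). The sample text below is made up for the illustration.

```pascal
program SplitLines;
{$APPTYPE CONSOLE}

uses
  Classes;

var
  Lines: TStringList;
  W: WideString;
  I: Integer;
begin
  // Made-up sample standing in for the file contents already read
  W := 'first line'#13#10'second line'#13#10;
  Lines := TStringList.Create;
  try
    // Text splits on CR/LF; the WideString is converted implicitly
    Lines.Text := W;
    for I := 0 to Lines.Count - 1 do
      WriteLn(I, ': ', Lines[I]);
  finally
    Lines.Free;
  end;
end.
```

Spaces inside a line are left alone this way, so each entry in the list is one complete line of the file.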
ASKER CERTIFIED SOLUTION
This solution is only available to members.