User can't read UTF-8 encoded text file

Medium Priority
721 Views
Last Modified: 2012-05-06
I run an open source project that distributes SQL scripts that are run during installation. One user reported that when he opened the SQL script (which is just a text file ending in .sql), it looked corrupted (see attached snippet). After a little back and forth (read thread at http://www.galleryserverpro.com/forum/yaf_postsm1894.aspx) I was able to determine that if I changed the text encoding to ASCII (it was UTF-8 little endian), the user was able to view and run it.

I don't understand encoding issues very well, and I don't understand why this user could not open the SQL file. He reported that the issue occurred on Microsoft Windows Server 2003 R2, Standard x64 Edition, Service Pack 2. He *was* able to open the UTF-8 encoded version on an XP machine.

I have two questions:
1. Do you know why he could not open a file encoded in UTF-8 little endian? What would you have to do to a WinServer 2003 setup to cause this behavior?

2. Should I distribute my SQL scripts under a different encoding? All the characters can be encoded as ASCII, but I don't want to unintentionally introduce new issues by using an "old-fashioned" encoding.

Thanks,
Roger Martin
Gallery Server Pro - Open source web gallery for photos, video, audio, and documents
/**********************************************************************/


 


core Gallery Server Pro operation. This script runs on SQL Server 2005 and later.

Commented:
I wonder what would happen if the guy had opened the file in Notepad and changed the encoding. I presume that would fix it on his PC?
You might want to read up on the Unicode Byte Order Mark (BOM) "tag" that gets added to text files: http://en.wikipedia.org/wiki/Byte-order_mark
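
To make the BOM idea concrete, here is a minimal Python sketch that looks at the first bytes of a file and reports which encoding the BOM implies. It is only an illustration: the function name `sniff_bom` is made up for this example, and a file with no BOM simply returns None (it could still be ASCII, UTF-8 without a signature, or a legacy code page).

```python
import codecs

# Known BOM signatures. UTF-32 is checked first because the UTF-32 LE BOM
# begins with the same two bytes as the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none.

    sniff_bom is a hypothetical helper name used only for this sketch.
    """
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```

To check a real file, read its first four bytes in binary mode and pass them in, for example `sniff_bom(open(path, "rb").read(4))`, where `path` is whichever script you want to inspect.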
 

Author

Commented:
Yes, I presume it would have, since that is essentially what I did to the file before I gave it back to him.

Interestingly, the code snippet in my original post is not showing the strange characters that I pasted into it, so I attached a screenshot that represents the "corrupt" file the user saw when he opened it. When I copied all these strange characters into this post, Experts Exchange filtered them out.

My core question still stands about which is the best encoding to use for widely distributed SQL script files...

strange-encoding.jpg
Commented:
Well, I was wondering about the encoding of SQL Server itself, and whether that had anything to do with it. (The thinking was: does the encoding of the text file match the default encoding of SQL Server?)
UTF-8 is the default for Visual Studio and SQL Server Management Studio, so I'd stick with that.
So, tell us how and where the *.sql files were created and, specifically (if they were created via SQL Server), the "Language" and "Server Collation" values of SQL Server.

Author

Commented:
Not sure what you mean by "encoding of SQL Server itself". You aren't confusing encoding with collation, are you?

I believe I created the original SQL script by copying the output from the scripting tool built into Visual Studio 2005 Database Edition into a blank Notepad file. But that was long ago and I may have moved things around.

I based my original post on the conversation I had with that user many months ago. Looking at the file just now, I see that Notepad++ reports the encoding as "UCS-2 Little Endian"; it doesn't even have an option for UTF-8 Little Endian. To add to the confusion, Visual Studio 2008 reports the same file as "Unicode - CodePage 1200". I don't understand why the two programs report different values for the same file. Maybe those are two names that refer to the same thing?
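
For what it's worth, Windows code page 1200 is UTF-16 little endian, and "UCS-2 Little Endian" is the older fixed-width name for the same byte layout as long as the file only uses characters from the basic range, so the two reports appear to describe the same file. The Python sketch below illustrates why such a file can look corrupt. The sample text is made up, and the idea that the user's tool read the file as a single-byte code page (cp1252 here) is an assumption, not something confirmed in this thread.

```python
import codecs

text = "SELECT 1"  # made-up sample; any ASCII-only script text behaves the same

# What a "Unicode" / "UCS-2 Little Endian" save produces:
# a two-byte BOM followed by two bytes per character.
utf16_bytes = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# Decoded with the right encoding, the text round-trips
# (the "utf-16" codec consumes the BOM and uses it to pick the byte order).
round_tripped = utf16_bytes.decode("utf-16")

# Decoded as a single-byte code page (ASSUMPTION about the user's tool),
# the BOM becomes two stray characters and every letter is followed by a NUL.
misread = utf16_bytes.decode("cp1252")

print(len(text), len(utf16_bytes))  # character count vs. byte count
print(repr(misread))
```

The size alone is a hint: the UTF-16 bytes are twice the character count plus two for the BOM, whereas an ASCII or UTF-8 save of the same ASCII-only text is one byte per character.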

If I use Visual Studio to create a blank text file, it wants to use "Unicode (UTF-8 with signature) - Codepage 65001".

I just don't understand enough to decide whether to move to the encoding VS wants to use for txt files or stay with the current ("UCS-2 Little Endian" or "Unicode - CodePage 1200", depending on which program I use). I just want these SQL scripts to be readable by my web installer around the world.
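
If the decision is to move to the encoding Visual Studio proposes, the conversion itself is small. Below is a rough Python sketch, under the assumption that the existing files really are UTF-16 with a BOM. The helper names `convert_script` and `non_ascii_chars` are invented for this example, and the sample bytes stand in for a real script. The second helper lists any characters that would be lost in a plain ASCII save, which is one way to check the claim that everything in the scripts fits in ASCII.

```python
import codecs


def convert_script(data: bytes, source: str = "utf-16", target: str = "utf-8-sig") -> bytes:
    """Re-encode raw script bytes (hypothetical helper for this sketch).

    "utf-16" honours the BOM when decoding; "utf-8-sig" writes the UTF-8
    signature, matching "Unicode (UTF-8 with signature) - Codepage 65001".
    Pass target="ascii" to fail loudly if any character does not fit.
    """
    return data.decode(source).encode(target)


def non_ascii_chars(data: bytes, source: str = "utf-16") -> list:
    """Return (offset, character) pairs that would not survive an ASCII save."""
    text = data.decode(source)
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 127]


# Stand-in for the bytes of a script saved as "Unicode - CodePage 1200".
original = codecs.BOM_UTF16_LE + "SELECT 1;".encode("utf-16-le")

converted = convert_script(original)   # UTF-8 with signature
offenders = non_ascii_chars(original)  # empty list means ASCII is safe
```

For a real file, read it with `open(path, "rb").read()`, convert, and write the result back in binary mode so that nothing re-encodes it along the way.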

Commented:
So, the "Language" and "Server Collation" values of SQL Server are "English" and "SQL_Latin1_General_CP1_CI_AS"?

Author

Commented:
Yes, on my PC they are. I never found out what the user had.

Author

Commented:
Core questions were not fully addressed; user never followed up on my answer to his/her question...