Greek Characters convert to '?' s on export from SQL using VB.net
We have developed a web app hosted on Azure, which copies the records in a SQL View to a MySQL Db for a Wordpress Website.
All is working fine, but it is a scientific website and some of the records contain greek letters. These characters are not being carried over correctly, and display as question marks on the website.
On investigation, I have read that others had success with a similar issue by forcing conversion of the field containing the greek characters to UTF-8 with a SQL command on the My SQL destination Db, others found converting it to Latin1 resolved their issue. However, I had success with neither of these. I also discovered that the greek characters appear to be converted to question marks before being sent to the MYSQL: When outputting the field values with console.writeline() in the vb code, the values expressed display '?'s instead of greek characters before they are exported to the MySQL Db.
So I'm wondering if there is any known process which I can put those fields through to ensure the greek characters are kept and remain untranslated to '?'s in the export functionality.
Incase it helps, following is a sample of the code used to write the values from the SQL view to the MySQL table:
Dim DBAAW As New MasterDBEntities '(this is the SQL Db) 'first find the Antibody Record by its HCConstructNumber Dim WARecord = (From wq In DBAAW.WP_Whole_Antibodies_on_Sale Where wq.HCConstructNumber = HCC Select wq).FirstOrDefault() If (WARecord IsNot Nothing) Then With WARecord SyncHelp.AddNewMeta(postID, "wpcf-specificitystatement", .SpecificityStatement) Console.WriteLine("Synonyms: " & .Synonyms) SyncHelp.AddNewMeta(postID, "wpcf-synonyms", .Synonyms) End With End If
'And this is what SyncHelp.AddNewMeta does to update the post's record in MySQL:
Public Shared Sub AddNewMeta(p1d As Long, metakey As String, metaval As String) Dim PMMta As New absolute_fix1Entities '(the MySQL Db) Dim xdel As Integer Dim meta = (From s In PMMta.antibody_postmeta Where s.post_id = p1d And s.meta_key = metakey Select s).FirstOrDefault() If meta IsNot Nothing Then meta.meta_value = Mval End IfEnd Sub
So you see I am simply writing the value stored in the SQL View's [SpecificityStatement] and [Synonym] field, to the corresponding [meta_value] field in the MySQL's post_meta table for Wordpress.
This works fine for all string data types, but not for the greek characters in them.
Is there something else I should be doing here to keep Greek characters as they are (and not change them to '?'s) ?
Codepages and encodings are a whole book of a topic.
Besides database collation and codepage you also have a collation setting in making the connection to the db, there is a PHP default charset, then you have the web servers encoding default, which must match what you specify in htmls head section, too. And last not least you can also specify the input encoding via accept-charset attribute.
So I count 6 places where you might have a mismatch that leads to such conversion errors. A ? simple denotes a character not convertable. If you have all these things at UTF-8 or Latin-1, anything holding all charsyou need, the parts of the puzzle snap into each other. I'd go for UTF-8, as it is a norm in the web.
Unfortunately there's no one size fits all solution to such problems, so you better look at all these places and fix where there is a wrong setting. If you already have data, that's another part of the problem, because simply setting another charset doesn't change data to it, just like changing the label on a jar doesn't change content.
On top of that there isn't only a database server default charset, each database can have its own setting and each table and each field. In regard of data you might need to live with a data loss, to try and make the transition correctly, first step must be a backup of the data in it's current state, no matter how wrong it might be, if you can see greek in PHPMyAdmin or Workbench, there's hope for the data you have.
Some things to look at:
PHP.ini: default_charset = "utf-8";
Some PHP functions have parameters for a specific encoding: htmlspecialchars($str, ENT_NOQUOTES, "UTF-8")
http headers sent: header('Content-Type: text/html; charset=utf-8');
HTML head: <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
HTML forms: <form ... accept-charset="utf-8">
MySQL Connection: mysql_set_charset()
and the MySQL database itself, start with http://archive.oreilly.com/pub/post/turning_mysql_data_in_latin1_t.html
And last not least, this topic is a moving target, knowledge about it has to be kept up to date, this is changing too often and when Javascript and ajax is in play there's another point of failure, especially with third party APIs, which can output JSON or XML in any other encoding.
Bye, Olaf.
Tim Brocklehurst
ASKER
Hi Olaf
Thank you for your kind explanation of this area. It is clearly something for which there might be numerous explanations at various different points in the process. However, I believe we have narrowed down the source of the problem, so we can at least focus on the area we need to, to fix it.
Previously, before we automated the export/import of the data from SQL to MySQL, it was done by exporting comma delimited text files, and importing them to Wordpress using an importer plugin. The Greek Characters exported and imported fine then, and appeared as they should do in Wordpress.
Furthermore, the headers of the Wordpress pages contain the
line so I am in little doubt that the destination is able to display greek characters as it should.
The characters obviously display OK in the source SQL database. Here is how the Synonyms field looks as a record in SQL Server - (the 'γ' of 'Fcγ III/II Receptor' is a greek character):
However, when we export the value of that field using VB.NET - and expose the synonyms field to the console in MS Azure, it displays like this in the log, while the code is running and before it is output to MySQL. Note the 'γ' of 'Fcγ III/II Receptor is converted to '?' .
So this surely suggests that whatever is converting the greek characters to question marks, is doing it at the time when the field value is grabbed from SQL db by the VB code... isn't that correct?
In which case, can anyone give a solution in which the VB code can be modified to take account of this change, and keep the greek characters in place. It was possible previously when exporting them to .txt files using MS Access, so surely it must be possible using VB too, but just using a different method than we are currently, to write the value direct to the MySQL Db.
Is it possible to emulate the process of exporting comma delimited txt files for those fields in which greek characters appear? - Or is there some other method we can use at this stage which might keep the greek characters greek?
Thank you
Olaf Doschke
What's between the initial SQL Server data and the display in Azure surely is more than VB, you export to a file, I assume? Did you look at it before this uploads to Azure? Also this display of the ? might be due to Azures web interface, when its encoding is not matching UTF-8.
I have no insight into Azure WebJobs, but this looks more like a log of an import Job than the import file itself, doesn't it?
The misconversion can still happen in several places to look at, eg still a mismatch in the encoding the header of the Azure http responses contain with the html meta tag of this page can cause the questionmark. <meta charset=UTF-8> is just one of the two encoding for a html response, it's in the html <head> part, and depending on the HTML being HTML5 of 4 this could also be:
What you show is fine with HTML5 pages, so what is the doctype?
I also wrote about headers of the HTTP protocol even before the <html> tag, like the headers of mails many mail clients only display when you ask for it. HTTP headers are shown by Firefox, if you right click and choose "View Page Info". Headers of the protocol specifying encoding, length, authentication and other protocol specifics. For encoding this header is:
This is part of the HTTP protocol and does not show up in the view source html your browser shows you.
It surely would be a good idea to look into this WebJob and extend it for some analysis, eg see how it connects to MySQL, what it reads from your CSV and whether that already shows ? or if they only show up by querying the imported data. CSV surely is no format containing encoding aka charset informations.
That is a good point - that it might be Azure's own page structure which is displaying the '?' and not the mis-encryption of the characters by the export code.
I have investigated further, as you suggested, and am working it down by process of elimination. So far, I have determined that the following are true:
Using VB.NET to write a value from a field in Table A in the SQL Server Db directly to Table B in the same SQL Server Db works.
Copying the from the SQL Db and pasting it into the corresponding field manually in Wordpress, works.
The MySQL Db and Wordpress can display greek characters (when copied and pasted).
So based on that, can you suggest what area I should look at next in trying to find a fix for this?
Thank you
Tim
Olaf Doschke
Well, in regard to whether Azure's own page structure is displaying the '?' I'd look into the HTML source for doctype (Is it <!DOCTYPE html>? That would fit the meta charset tag) and into the HTTP headers (Firefox offers the display of them via already mentioned "View Page Info").
The same three things (doctype, html meta charset tag, and HTTP Content-Type header) can be checked on your Wordpress site.
And the other thing to check would be about the WEebJob importing data, how does it interpret the CSV, how does it make the MySQL connection and does it turn the charset to utf8 there, too. Also see http://dev.mysql.com/doc/refman/5.7/en/charset-connection.html
Besides that, what is the MySQL server/database/table and field collation and charset?
You didn't answer all questions, so you gave no chance for further help. If I would have told in advance, this would have been too much information and advice to handle. I don't think you need additional coding. If the script still sees it as greek characters, you have a problem in the collation of the MySQL connection. You didn't talked about that at all, you only checked all server, html, and database charsets, but if it's on the way from that code and it is still seeing greek character, it has to be the database connection.To begin with, I poiuted out all these points of failure to you, including connection and it's collation and charset setting, especially mentioned this twice.
Anyway, I don't mind, if you want to stay with going for html entities, as it solves your problem.
Additional coding was required to identify and edit the offending fields. The coding is included in the solution to assist others with the same problem.
Olaf Doschke
It would needed, if all transfer of text would not have any irreversible conversion. At some point the text must go from UTF-8 to another (eg Ansi) character set not containing greek characters and reconversion to UTF-8 yields ? instead.
Going for html entities is only solving a partial problem, this is of no help for future assists. You cope with the inability of any transfer encoding to not contain greek chars by using the html entitiy codes, which only use chars even available in ASCII.
I know you doubled checked any text file and database and table charsets are UTF-8, but that's still not sufficient. The text only gets from A to B via transfer, and on of these transfers obviously causes that irreversible conversion. If everything including connection charset is correct, you don't get such questionmark chars.
Besides that, there are much more characters in UTF-8, for which no HTML entity encoding exists and thus using them also only helps with a subset of characters.
You got over your problem and that's a fine situation taking the pressure out, but it can be solved better simply with right encoding/charset also during any transfers. Let alone look into how many different utf8 charsets there are? For example use utf8mb4 from MySQL 5.5.3 upwards, as it supports chars beyond xFFFD.
If this is your only solution I can only deduct you didn't read up on everything in enough detail to rue out any conversion errors.
Besides database collation and codepage you also have a collation setting in making the connection to the db, there is a PHP default charset, then you have the web servers encoding default, which must match what you specify in htmls head section, too. And last not least you can also specify the input encoding via accept-charset attribute.
So I count 6 places where you might have a mismatch that leads to such conversion errors. A ? simple denotes a character not convertable. If you have all these things at UTF-8 or Latin-1, anything holding all charsyou need, the parts of the puzzle snap into each other. I'd go for UTF-8, as it is a norm in the web.
Unfortunately there's no one size fits all solution to such problems, so you better look at all these places and fix where there is a wrong setting. If you already have data, that's another part of the problem, because simply setting another charset doesn't change data to it, just like changing the label on a jar doesn't change content.
On top of that there isn't only a database server default charset, each database can have its own setting and each table and each field. In regard of data you might need to live with a data loss, to try and make the transition correctly, first step must be a backup of the data in it's current state, no matter how wrong it might be, if you can see greek in PHPMyAdmin or Workbench, there's hope for the data you have.
Some things to look at:
PHP.ini: default_charset = "utf-8";
Some PHP functions have parameters for a specific encoding: htmlspecialchars($str, ENT_NOQUOTES, "UTF-8")
http headers sent: header('Content-Type: text/html; charset=utf-8');
HTML head: <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
HTML forms: <form ... accept-charset="utf-8">
MySQL Connection: mysql_set_charset()
and the MySQL database itself, start with http://archive.oreilly.com/pub/post/turning_mysql_data_in_latin1_t.html
And last not least, this topic is a moving target, knowledge about it has to be kept up to date, this is changing too often and when Javascript and ajax is in play there's another point of failure, especially with third party APIs, which can output JSON or XML in any other encoding.
Bye, Olaf.