Solved

Greek Characters convert to '?' s on export from SQL using VB.net

Posted on 2016-08-21
9
31 Views
Last Modified: 2016-09-04
We have developed a web app hosted on Azure, which copies the records in a SQL View to a MySQL Db for a Wordpress Website.
All is working fine, but it is a scientific website and some of the records contain greek letters. These characters are not being carried over correctly, and display as question marks on the website.

On investigation, I have read that others had success with a similar issue by forcing conversion of the field containing the greek characters to UTF-8 with a SQL command on the My SQL destination Db, others found converting it to Latin1 resolved their issue. However, I had success with neither of these.  I also discovered that the greek characters appear to be converted to question marks before being sent to the MYSQL: When outputting the field values with console.writeline() in the vb code, the values expressed display '?'s instead of greek characters before they are exported to the MySQL Db.

So I'm wondering if there is any known process which I can put those fields through to ensure the greek characters are kept and remain untranslated to '?'s in the export functionality.

Incase it helps, following is a sample of the code used to write the values from the SQL view to the MySQL table:

        Dim DBAAW As New MasterDBEntities '(this is the SQL Db)
        'first find the Antibody Record by its HCConstructNumber
        Dim WARecord = (From wq In DBAAW.WP_Whole_Antibodies_on_Sale
                        Where wq.HCConstructNumber = HCC
                        Select wq).FirstOrDefault()
        If (WARecord IsNot Nothing) Then
            With WARecord
			SyncHelp.AddNewMeta(postID, "wpcf-specificitystatement", .SpecificityStatement)
                        Console.WriteLine("Synonyms: " & .Synonyms)
                        SyncHelp.AddNewMeta(postID, "wpcf-synonyms", .Synonyms)
	     End With
	End If

Open in new window

           
            'And this is what SyncHelp.AddNewMeta does to update the post's record in MySQL:
		
Public Shared Sub AddNewMeta(p1d As Long, metakey As String, metaval As String)
        Dim PMMta As New absolute_fix1Entities '(the MySQL Db)
        Dim xdel As Integer
        Dim meta = (From s In PMMta.antibody_postmeta
                              Where s.post_id = p1d And s.meta_key = metakey
                              Select s).FirstOrDefault()
        If meta IsNot Nothing Then
		            meta.meta_value = Mval
        End If
End Sub

Open in new window


So you see I am simply writing the value stored in the SQL View's [SpecificityStatement] and [Synonym] field, to the corresponding [meta_value] field in the MySQL's post_meta table for Wordpress.

This works fine for all string data types, but not for the greek characters in them.

Is there something else I should  be doing here to keep Greek characters as they are (and not change them to '?'s)  ?

Thank you
Tim
0
Comment
Question by:Tim Brocklehurst
  • 5
  • 4
9 Comments
 
LVL 29

Expert Comment

by:Olaf Doschke
ID: 41764281
Codepages and encodings are a whole book of a topic.

Besides database collation and codepage you also have a collation setting in making the connection to the db, there is a PHP default charset, then you have the web servers encoding default, which must match what you specify in htmls head section, too. And last not least you can also specify the input encoding via accept-charset attribute.

So I count 6 places where you might have a mismatch that leads to such conversion errors. A ? simple denotes a character not convertable. If you have all these things at UTF-8 or Latin-1, anything holding all charsyou need, the parts of the puzzle snap into each other. I'd go for UTF-8, as it is a norm in the web.

Unfortunately there's no one size fits all solution to such problems, so you better look at all these places and fix where there is a wrong setting. If you already have data, that's another part of the problem, because simply setting another charset doesn't change data to it, just like changing the label on a jar doesn't change content.

On top of that there isn't only a database server default charset, each database can have its own setting and each table and each field. In regard of data you might need to live with a data loss, to try and make the transition correctly, first step must be a backup of the data in it's current state, no matter how wrong it might be, if you can see greek in PHPMyAdmin or Workbench, there's hope for the data you have.

Some things to look at:
PHP.ini: default_charset = "utf-8";
Some PHP functions have parameters for a specific encoding: htmlspecialchars($str, ENT_NOQUOTES, "UTF-8")
http headers sent: header('Content-Type: text/html; charset=utf-8');
HTML head: <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
HTML forms: <form ... accept-charset="utf-8">
MySQL Connection: mysql_set_charset()
and the MySQL database itself, start with http://archive.oreilly.com/pub/post/turning_mysql_data_in_latin1_t.html

And last not least, this topic is a moving target, knowledge about it has to be kept up to date, this is changing too often and when Javascript and ajax is in play there's another point of failure, especially with third party APIs, which can output JSON or XML in any other encoding.

Bye, Olaf.
1
 
LVL 1

Author Comment

by:Tim Brocklehurst
ID: 41764933
Hi Olaf

Thank you for your kind explanation of this area. It is clearly something for which there might be numerous explanations at various different points in the process. However, I believe we have narrowed down the source of the problem, so we can at least focus on the area we need to, to fix it.

Previously, before we automated the export/import of the data from SQL to MySQL, it was done by exporting comma delimited text files, and importing them to Wordpress using an importer plugin. The Greek Characters exported and imported fine then, and appeared as they should do in Wordpress.

Furthermore, the headers of the Wordpress pages contain the
<meta charset=UTF-8>

Open in new window

line so I am in little doubt that the destination is able to display greek characters as it should.

The characters obviously display OK in the source SQL database. Here is how the Synonyms field looks as a record in SQL Server - (the 'γ' of 'Fcγ III/II Receptor' is a greek character):

Displaying greek char in source SQL Db
However, when we export the value of that field using VB.NET - and expose the synonyms field to the console in MS Azure, it displays like this in the log, while the code is running and before it is output to MySQL. Note the 'γ' of 'Fcγ III/II Receptor is converted to '?' .

Azure Console Log
So this surely suggests that whatever is converting the greek characters to question marks, is doing it at the time when the field value is grabbed from SQL db by the VB code... isn't that correct?

In which case, can anyone give a solution in which the VB code can be modified to take account of this change, and keep the greek characters in place. It was possible previously when exporting them to .txt files using MS Access, so surely it must be possible using VB too, but just using a different method than we are currently, to write the value direct to the MySQL Db.

Is it possible to emulate the process of exporting comma delimited txt files for those fields in which greek characters appear? - Or is there some other method we can use at this stage which might keep the greek characters greek?

Thank you
0
 
LVL 29

Expert Comment

by:Olaf Doschke
ID: 41765120
What's between the initial SQL Server data and the display in Azure surely is more than VB, you export to a file, I assume? Did you look at it before this uploads to Azure? Also this display of the ? might be due to Azures web interface, when its encoding is not matching UTF-8.

I have no insight into Azure WebJobs, but this looks more like a log of an import Job than the import file itself, doesn't it?

The misconversion can still happen in several places to look at, eg still a mismatch in the encoding the header of the Azure http responses contain with the html meta tag of this page can cause the questionmark. <meta charset=UTF-8> is just one of the two encoding for a html response, it's in the html <head> part, and depending on the HTML being HTML5 of 4 this could also be:
<meta http-equiv="Content-Type" content="text/html; charset="UTF-8">

Open in new window


What you show is fine with HTML5 pages, so what is the doctype?

I also wrote about headers of the HTTP protocol even before the <html> tag, like the headers of mails many mail clients only display when you ask for it. HTTP headers are shown by Firefox, if you right click and choose "View Page Info". Headers of the protocol specifying encoding, length, authentication and other protocol specifics. For encoding this header is:
Content-Type: text/html; charset=utf-8

Open in new window

This is part of the HTTP protocol and does not show up in the view source html your browser shows you.

It surely would be a good idea to look into this WebJob and extend it for some analysis, eg see how it connects to MySQL, what it reads from your CSV and whether that already shows ? or if they only show up by querying the imported data. CSV surely is no format containing encoding aka charset informations.

Bye. Olaf.
0
 
LVL 1

Author Comment

by:Tim Brocklehurst
ID: 41766120
Thanks Olaf

That is a good point - that it might be Azure's own page structure which is displaying the '?' and not the mis-encryption of the characters by the export code.

I have investigated further, as you suggested, and am working it down by process of elimination. So far, I have determined that the following are true:

  1. Using VB.NET to write a value from a field in Table A in the SQL Server Db directly to Table B in the same SQL Server Db works.
  2. Copying the from the SQL Db and pasting it into the corresponding field manually in Wordpress, works.
  3. The MySQL Db and Wordpress can display greek characters (when copied and pasted).

So based on that, can you suggest what area I should look at next in trying to find a fix for this?

Thank you
Tim
0
Better Security Awareness With Threat Intelligence

See how one of the leading financial services organizations uses Recorded Future as part of a holistic threat intelligence program to promote security awareness and proactively and efficiently identify threats.

 
LVL 29

Expert Comment

by:Olaf Doschke
ID: 41766368
Well, in regard to whether Azure's own page structure is displaying the '?' I'd look into the HTML source for doctype (Is it <!DOCTYPE html>? That would fit the meta charset tag) and into the HTTP headers (Firefox offers the display of them via already mentioned "View Page Info").

The same three things (doctype, html meta charset tag, and HTTP Content-Type header) can be checked on your Wordpress site.

And the other thing to check would be about the WEebJob importing data, how does it interpret the CSV, how does it make the MySQL connection and does it turn the charset to utf8 there, too. Also see http://dev.mysql.com/doc/refman/5.7/en/charset-connection.html

Besides that, what is the MySQL server/database/table and field collation and charset?

Bye, Olaf.
0
 
LVL 1

Accepted Solution

by:
Tim Brocklehurst earned 0 total points
ID: 41776175
Thanks again for all your help Olaf

It has taken a while to research this further and to check the page structure of the target website and Azure etc.

Simply put, all relative sources (websites and databases) are configured with UTF-8 and yet the Greek characters are still being transported and displayed as question marks on the Wordpress page.

So in the end, I adapted the sync coding so it looks to see if any field value contains a greek letter:

    Public Shared Function ContainsGreek(g As String) As Boolean
        Dim r As Boolean = False
        If Not IsNothing(g) Then
            If g.Contains("α") OrElse g.Contains("β") OrElse g.Contains("δ") OrElse g.Contains("ε") OrElse g.Contains("ζ") OrElse g.Contains("η") OrElse g.Contains("γ") OrElse g.Contains("θ") OrElse g.Contains("γ") OrElse g.Contains("ι") OrElse g.Contains("κ") OrElse g.Contains("λ") OrElse g.Contains("μ") OrElse g.Contains("ξ") OrElse g.Contains("ο") OrElse g.Contains("π") OrElse g.Contains("ρ") OrElse g.Contains("σ") OrElse g.Contains("τ") OrElse g.Contains("υ") OrElse g.Contains("φ") OrElse g.Contains("χ") OrElse g.Contains("ψ") OrElse g.Contains("ω") Then
                r = True
            End If
        End If
        Return r
    End Function

Open in new window



 and then if it does... it replaces it with its corresponding ASCII code. Like so...:

Public Shared Function ReplGkAsc(originaltext As String) As String
        Dim l As String = ""
        l = Replace(originaltext, "α", "&alpha;")
        l = Replace(l, "β", "&beta;") ' now replaced the replaced to ensure all characters are covered...
        l = Replace(l, "γ", "&gamma;")
        l = Replace(l, "δ", "&delta;")
        l = Replace(l, "ε", "&epsilon;")
        l = Replace(l, "ζ", "&zeta;")
        l = Replace(l, "η", "&eta;")
        l = Replace(l, "θ", "&theta;")
        l = Replace(l, "ι", "&iota;")
        l = Replace(l, "κ", "&kappa;")
        l = Replace(l, "λ", "&lambda;")
        l = Replace(l, "μ", "&mu;")
        l = Replace(l, "ν", "&nu;")
        l = Replace(l, "ξ", "&xi;")
        l = Replace(l, "ο", "&omicron;")
        l = Replace(l, "π", "&pi;")
        l = Replace(l, "ρ", "&rho;")
        l = Replace(l, "σ", "&sigma;")
        l = Replace(l, "τ", "&tau;")
        l = Replace(l, "υ", "&upsilon;")
        l = Replace(l, "φ", "&phi;")
        l = Replace(l, "χ", "&chi;")
        l = Replace(l, "ψ", "&psi;")
        l = Replace(l, "ω", "&omega;")

        Return l
    End Function

Open in new window


This has worked and now the greek characters display fine in Wordpress.
Many thanks again
Tim
0
 
LVL 29

Expert Comment

by:Olaf Doschke
ID: 41776187
You didn't answer all questions, so you gave no chance for further help. If I would have told in advance, this would have been too much information and advice to handle. I don't think you need additional coding. If the script still sees it as greek characters, you have a problem in the collation of the MySQL connection.  You didn't talked about that at all, you only checked all server, html, and database charsets, but if it's on the way from that code and it is still seeing greek character, it has to be the database connection.To begin with, I poiuted out all these points of failure to you, including connection and it's collation and charset setting, especially mentioned this twice.

Anyway, I don't mind, if you want to stay with going for html entities, as it solves your problem.

Bye, Olaf.
0
 
LVL 1

Author Closing Comment

by:Tim Brocklehurst
ID: 41783527
Additional coding was required to identify and edit the offending fields. The coding is included in the solution to assist others with the same problem.
0
 
LVL 29

Expert Comment

by:Olaf Doschke
ID: 41783541
It would needed, if all transfer of text would not have any irreversible conversion. At some point the text must go from UTF-8 to another (eg Ansi) character set not containing greek characters and reconversion to UTF-8 yields ? instead.

Going for html entities is only solving a partial problem, this is of no help for future assists. You cope with the inability of any transfer encoding to not contain greek chars by using the html entitiy codes, which only use chars even available in ASCII.

I know you doubled checked any text file and database and table charsets are UTF-8, but that's still not sufficient. The text only gets from A to B via transfer, and on of these transfers obviously causes that irreversible conversion. If everything including connection charset is correct, you don't get such questionmark chars.

Besides that, there are much more characters in UTF-8, for which no HTML entity encoding exists and thus using them also only helps with a subset of characters.

You got over your problem and that's a fine situation taking the pressure out, but it can be solved better simply with right encoding/charset also during any transfers. Let alone look into how many different utf8 charsets there are? For example use utf8mb4 from MySQL 5.5.3 upwards, as it supports chars beyond xFFFD.

If this is your only solution I can only deduct you didn't read up on everything in enough detail to rue out any conversion errors.

Bye, Olaf.
0

Featured Post

Complete Microsoft Windows PC® & Mac Backup

Backup and recovery solutions to protect all your PCs & Mac– on-premises or in remote locations. Acronis backs up entire PC or Mac with patented reliable disk imaging technology and you will be able to restore workstations to a new, dissimilar hardware in minutes.

Join & Write a Comment

Parsing a CSV file is a task that we are confronted with regularly, and although there are a vast number of means to do this, as a newbie, the field can be confusing and the tools can seem complex. A simple solution to parsing a customized CSV fi…
It was really hard time for me to get the understanding of Delegates in C#. I went through many websites and articles but I found them very clumsy. After going through those sites, I noted down the points in a easy way so here I am sharing that unde…
In this tutorial you'll learn about bandwidth monitoring with flows and packet sniffing with our network monitoring solution PRTG Network Monitor (https://www.paessler.com/prtg). If you're interested in additional methods for monitoring bandwidt…
When you create an app prototype with Adobe XD, you can insert system screens -- sharing or Control Center, for example -- with just a few clicks. This video shows you how. You can take the full course on Experts Exchange at http://bit.ly/XDcourse.

746 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

16 Experts available now in Live!

Get 1:1 Help Now