Link to home
Start Free TrialLog in
Avatar of intellisource
intellisourceFlag for South Africa

asked on

resolving a utf8 issue in xml data...

hi there,
i was under the impression that this issue was resolved a long time ago.
why is the following function not working when printing utf-8 characters in xml?
// recursive function
function xmlentities($value) {
	// replace characters
	if (is_array($value)) {
		foreach ($value as $key => $val) {
			$value[$key] = xmlentities($val);
		}
	} else {
		//replace $value's invalid xml characters with entities and unicode characters
		// used with Hex NCR's (Hexadecimal Numeric Character Representations).
		$patterns = array(
				'&',			// ampersand symbol
				'<',			// less than symbol
				'>',			// greater than symbol
				'"',			// double quotation mark
				'©',			// (c) copyright symbols
				'ë',			// e with diaresis
				'è',			// e with grave
				'é',			// e with acute
				'"',			// left slanting double quotation mark
				'"'				// right slanting double quotation mark
			);
		$replacements = array(
				'&amp;',
				'&lt;',
				'&gt;',
				'&quot;',
				'&#x00A9;',
				'&#x00EB;',
				'&#x00E8;',
				'&#x00E9;',
				'&#x201C;',
				'&#x201D;'
			);
		$value = utf8_encode($value);	// convert string to unicode to replace characters
		$value = str_replace($patterns, $replacements, $value);
	}
	return $value;
}
// utf-8 characters encoded for xml printing
	$sql = "SELECT 
				DISTINCTROW 
				".$tcategory.", ".$table.".".$order." 
				FROM 
					25_serviceleads SL 
				LEFT JOIN 19_consumers C ON SL.bigint_ConsumerID = C.bigint_ConsumerID 
				LEFT JOIN 27_leadssent LS ON SL.bigint_LeadID = LS.bigint_LeadID 
				LEFT JOIN 2_servicescatalogue SC ON SL.bigint_ServiceID = SC.bigint_ServiceID 
				LEFT JOIN 5_suppliers S ON LS.bigint_SupplierID = S.bigint_SupplierID 
				LEFT JOIN 1_regions R ON C.bigint_ConsumerRegion = R.bigint_RegionID 
			".((count($wsql))?"WHERE ".implode(" AND ",$wsql):"")." 
			ORDER BY ".$table.".".$order." ASC;";
	$result = mysql_query_errors( $sql , $conn , __FILE__ , __LINE__ );
	if ($result) {
		while ($row = mysql_fetch_array($result)) {
			if ($row[$order] != "") {
				echo "	<order name=\"".$row[$order]."\">\n";
				
				$sql0 = "SELECT 
							SL.bigint_LeadID, 
							IFNULL(
								LS.smallint_LeadOrdinal, 
								0
							) AS smallint_LeadOrdinal, 
							CONCAT_WS(
								\";\", 
								C.text_ConsumerName, 
								C.`text_ConsumerE-Mail`, 
								C.text_ConsumerPhone, 
								C.text_ConsumerCity, 
								C.text_ConsumerCounty, 
								C.text_ConsumerState, 
								C.text_ConsumerCountry
							) AS text_Consumer, 
							LS.text_Duplicates, 
							SL.text_LeadAttributes, 
							SL.timestamp_ExpectedBy, 
							SL.text_LeadMessage, 
							SL.bigint_ServiceID, 
							LS.bigint_SupplierID, 
							C.bigint_ConsumerRegion AS bigint_RegionID, 
							LS.tinyint_LeadSent, 
							SL.timestamp_LeadCreated, 
							SL.bigint_ServiceID, 
							SC.text_ServiceDescription, 
							SC.bigint_CostPerLead, 
							S.text_SupplierName, 
							S.text_SupplierW3, 
							S.text_ContactFirstName, 
							S.text_ContactSurname, 
							S.text_ContactPosition, 
							S.`text_ContactE-mail`, 
							S.bigint_ContactTel, 
							S.bigint_CurrentBalance, 
							S.tinyint_VariablePricingEnabled, 
							R.bigint_RegionID, 
							R.text_RegionDescription, 
							FC.text_FormName, 
							(SELECT text_RegionDescription FROM 1_regions WHERE bigint_RegionID = S.bigint_CountryID) AS text_Country, 
							(SELECT CONCAT(R4.text_RegionDescription, \" (\", A4.text_RegionAbbrev, \")\") FROM 1_regions R4, 35_regionabbrevs A4 WHERE R4.bigint_RegionID = A4.bigint_RegionID AND A4.bigint_RegionID = S.bigint_StateID) AS text_State, 
							(SELECT CONCAT(text_RegionDescription, \", \", text_RegionAbbrev) FROM 1_regions R3, 35_regionabbrevs A3 WHERE R3.bigint_ParentRegionID = A3.bigint_RegionID AND R3.bigint_RegionID = S.bigint_CountyID) AS text_County, 
							(SELECT CONCAT(R1.text_RegionDescription, \", \", A2.text_RegionAbbrev, \" \", LPAD(R1.bigint_RegionID,5,\"0\")) FROM 1_regions R1 LEFT JOIN 1_regions R2 ON R1.bigint_ParentRegionID = R2.bigint_RegionID LEFT JOIN 35_regionabbrevs A2 ON R2.bigint_ParentRegionID = A2.bigint_RegionID WHERE R1.bigint_RegionID = S.bigint_ZipCodeID) AS text_City 
						FROM 
							25_serviceleads SL 
						LEFT JOIN 19_consumers C ON SL.bigint_ConsumerID = C.bigint_ConsumerID 
						LEFT JOIN 27_leadssent LS ON SL.bigint_LeadID = LS.bigint_LeadID 
						LEFT JOIN 2_servicescatalogue SC ON SL.bigint_ServiceID = SC.bigint_ServiceID 
						LEFT JOIN 5_suppliers S ON LS.bigint_SupplierID = S.bigint_SupplierID 
						LEFT JOIN 1_regions R ON C.bigint_ConsumerRegion = R.bigint_RegionID 
						LEFT JOIN 32_webformconfigs FC ON SL.bigint_FormID = FC.bigint_FormID 
						WHERE 
							".$tcategory." = ".$row[$tcat[1]].
							((count($wsql))?" AND ".implode(" AND ",$wsql):"")." 
						ORDER BY 
							SL.timestamp_LeadCreated 
						DESC;";
				$result0 = mysql_query_errors( $sql0 , $conn , __FILE__ , __LINE__ );
				if ($result0) {
					while ($row0 = mysql_fetch_assoc($result0)) {
						$consumer = xmlentities(explode(";",$row0["text_Consumer"]));
						...
						echo "			<consumer>".implode(";",$consumer)."</consumer>\n";

Open in new window

why would it not work all of a sudden? are there some utf8 incompatible functions being used here? attached is a screenshot of the page as it currently displays, as well as a screenshot of the data from the database.
Somebody please help!!!
- Greywacke.
intellisource.co.za-premium-lead.jpg
utf8-data-in-phpmyadmin.jpg
Avatar of skullnobrains
skullnobrains

you have accents in your database
most likely the fields in the database were filled with ISO-8859-x or a DOS-xx charset
(assuming a french charset given the sample)

your database will present this data as utf8 if you declare the column, table, or db encoding as such regardless to the encoding you actually fill the database with. you must always feed the database the declared encoding or twist your insert queries to change the charset on the fly
This article may be helpful:
http://www.joelonsoftware.com/articles/Unicode.html

UTF-8 collides with Western-European accented characters if the original Western-European document was created with Ascii encoding (ISO-8859-1, Windows-1252, etc.)
Avatar of intellisource

ASKER

the xml and the document displaying all, as well as the xml generator php - are utf-8 encoded. i just don't get why the functions that used to work - do not work any more :S
i cannot find any western-european or non-utf8 documents involved. same goes for the database, where it is correctly saved.
it seems the CONCAT_WS function does not support UTF-8 0o if so, then how do i modify the select to do so? 0o
@intellisource
you may want to double-check that assertion. the output you showed really looks like an encoding problem. like i said before, what is not utf-8 is most likely the data in your database

@ray
you usually add valuable info to the threads, but i'll have to correct you on this one
- ISO-8859-1, Windows-1252 are NOT ascii at all. a file that is actually ascii will be identical in ascii and iso-8859-1, but as long as one single special character like é is in the file, it is not ascii-compatible any more
- i did not read the article, as the introduction was enough to determine that although the author may be competent, he is also shouting after working code and assuming the libraries and interpreter he uses are supposed to deal with internationalisation problems in his place, which is foolish at best
skullnobrains:
please note that i did check that assertion before i stated it - and that i did say "seems" - as i was unable to locate any reference supporting this allegation - in which case i would have said "is". attached can be seen the proof that the database and all tables as well as default collation are all utf8_bin, along with the mysql server.

Ray_Paseur:
unfortunately i have to agree with skullnobrains... i have checked the server side documents and they are all saved as UTF-8 (javascript as well as the php and html). furthermore, my ftp program filezilla uses utf8 with filenames and binary (for non-ascii text documents) with uploads as result.

something somewhere is suddenly causing the utf-8 compatibility to falter, other than the areas that can be proven in the screenshots attached. even though the xml is encoded as utf-8 - special characters are illegal in xml 1.0 and thus have to be translated to basic html & numeric entities (what the xmlentities function does).

in addition to any previous screenshots attached to this thread, i have added the following screenshots:
1) mysql server collation default User generated image2) mysql table collations 1/2 User generated image3) mysql table collations 2/2 User generated image4) mysql connection & server defaults User generated image5) mysql column/field name encoding User generated image6) filezilla ftp settings (binary xfer & utf8 file names) User generated image7) html default character encoding User generated image8) xml generator encoding & function usage User generated image9) xml reader javascript document encoding User generated image10) global include encoding & function definition User generated image11) xml generator output encoding User generated image12) css document encoding User generated imageeven the css encoding as well as file encoding is utf-8 to discourage any browser switching of web page encoding.
i guess i am humbly asking for help in determining what exactly went awry (will obviously be added to my collective experience then - and hopefully those of people who will/have encounter(ed) the same issue in future/before! ;))...
the collation you use in mysql does not "proove" anything as it simply does not apply. it is mostly an informational setting.

if you insert widows-1252 in a table that is collated in UTF-8, you'll still end up with windows-1252 characters in the table

what you store is binary data. a wrong caracter encoding usually produces something different than what you want but valid in other encodings. i'm actually pretty sure that mysql does not bother to check, but anyway, anything encoded in the mentioned charsets will produce weird characters but valid utf-8

to make it clear, when you insert data through phpmyadmin on a french computer using internet explorer, you inject windows-1252 (EBDIC-something) characters, and when you use another os, or possibly browser, you sometimes end up inserting ISO-8859-x
thankyou for your valuable insight skullnobrains... how can i, on the form handler - ensure that these are converted to utf8 upon insertion into mysql... the utf8_encode function, right? or should i rather detect the submitted encoding and use mb_convert_encoding or iconv instead?
this should mean that the last two lines of code which does the conversion, should change and be applied before insertion to mysql (or after - when retrieved which is where it is currently used).
		$value = utf8_encode($value);	// convert string to unicode to replace characters
		$value = str_replace($patterns, $replacements, $value);

Open in new window

does everybody agree that this would be the logical course of action to take?
ASKER CERTIFIED SOLUTION
Avatar of skullnobrains
skullnobrains

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
this is not fool proof as users can change it in certain browsers. is there a way to ensure utf-8 conversion of the string in the selection below? or would it be better done after the string is retrieved... as previously suggested...
but yes - i agree it would be best to perform the detection and conversion to utf8 as the data comes in from the form. :) just researching this a bit ;) will post complete solution here - thanks :D