Special chars lost when pulling from Amazon

We are using the Amazon Product Advertising API to pull book information into our website, and all is going well, except that special chars in the text, such as the long dash (represented by Amazon as — in its text; see the attached image shortlisted.jpg), are being dropped, and other special chars show up on our web pages as garbage chars.

I know this is an age-old question, and we thought we had the solution: we went to the actual Amazon page for the book in question (the book is called "A Tale for the Time Being") and examined the page's information using Firefox.
By the way, the link to the Amazon page is http://www.amazon.com/Tale-Time-Being-Novel/dp/0143124870/ref=sr_1_1?ie=UTF8&qid=1412538139&sr=8-1&keywords=A+tale+for+the+time+being

When we select "Page Info" in Firefox while on the Amazon page, we get the following info (see the attached pageinfo.jpg).

We then tried to replicate those settings on our page, but our page still does not show the various special chars (you can see how our page displays by viewing the attached image ourpage.jpg).
You can see the HTML we set at the top of our page in the attached ourhtml.jpg.

Any thoughts on how we can get these special chars to display correctly?

thanks experts!
pageinfo.jpg
shortlisted.jpg
ourpage.jpg
ourhtml.jpg
Asked by: rascal
Ray Paseur Commented:
This is almost certainly a character set collision, perhaps in combination with htmlentities() which is used to prepare documents for browser display.  Here's a possible explanation and solutions.
http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/A_11880-Unicode-PHP-and-Character-Collisions.html

That said, I will now look at the data a little more and post back if I see anything important.

Ray Paseur Commented:
OK, in the "ourhtml" image, we have a problem.  The PHP header() command cannot send a header if any browser output has been sent, and since it's embedded inside HTML, it will fail.  You might want to rearrange the logic (or better yet, remove the unnecessary PHP).  One of these solutions should work.

<!DOCTYPE html>
<html dir="ltr" lang="en-US">
<head>
<meta charset="utf-8" />
<meta name="robots" content="noindex, nofollow" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>As Needed (UTF-8)</title>
</head>
<body>


<!DOCTYPE html>
<html dir="ltr" lang="en-US">
<head>
<meta charset="iso-8859-1" />
<meta name="robots" content="noindex, nofollow" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>As Needed (ISO-8859-1)</title>
</head>
<body>


Which one will work?  That depends on the encoding in your HTML document.
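
A minimal sketch of the header-first approach, assuming the text to display is already in a variable before any output starts. The helper name guess_charset_label() and the sample string are made up for illustration; the real calls are preg_match(), headers_sent() and header(). The /u check only tells you whether the bytes are valid UTF-8; treating everything else as Windows-1252 is an assumption, not a detection.

```php
<?php
// Hypothetical helper (not part of any library): pick a charset label for the page.
function guess_charset_label($text)
{
    // With the /u modifier, preg_match() returns 1 only if $text is valid UTF-8.
    // Anything else is ASSUMED to be Windows-1252 here.
    return preg_match('//u', $text) === 1 ? 'utf-8' : 'windows-1252';
}

// Stand-in for the API text; these bytes are the UTF-8 form of an em dash.
$description = "Ruth Ozeki\xE2\x80\x94shortlisted";
$charset     = guess_charset_label($description);

// header() only works before ANY output, so it comes first, outside the HTML.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=' . $charset);
}

echo "<!DOCTYPE html>\n";
echo "<html dir=\"ltr\" lang=\"en-US\">\n<head>\n";
echo "<meta charset=\"" . $charset . "\" />\n";
echo "<title>Charset test</title>\n</head>\n<body>\n";
echo "<p>" . $description . "</p>\n";
echo "</body>\n</html>\n";
```

The point of the design is that the HTTP header and the meta tag are both derived from the same variable, so they cannot disagree with each other.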

Ray Paseur Commented:
One other thought... Please show us the exact URL of the API that gives you the HTML you're rendering on your web page.  It may be that the htmlentities() encoding is happening twice.  PHP htmlentities() is used to prepare external input for "safe" display on a web page.  By safe, we mean that the external input cannot run JavaScript or inject unwanted HTML tags into the display page.   Here is an example that shows the double encoding at work.  It's a mung: the transformation is destructive and cannot be run more than once without data damage.  To use this script you must save it as a UTF-8 encoded file.
http://iconoun.com/demo/temp_rascal.php

<?php // demo/temp_rascal.php
error_reporting(E_ALL);

// CREATE VARIABLES FOR OUR HTML
$abc = "Ruth Ozeki—shortlisted";
$def = htmlentities($abc);
$ghi = htmlentities($def);

// CREATE OUR WEB PAGE IN HTML5 FORMAT
$htm = <<<HTML5
<!DOCTYPE html>
<html dir="ltr" lang="en-US">
<head>
<meta charset="utf-8" />
<meta name="robots" content="noindex, nofollow" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>E-E Q_28531658</title>
</head>
<body>

<noscript>Your browsing experience will be better with JavaScript enabled!</noscript>

<p>$abc</p>
<p>$def</p>
<p>$ghi</p>

</body>
</html>
HTML5;

// RENDER THE WEB PAGE
echo $htm;


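If double encoding does turn out to be the cause, one possible guard is the fourth argument of htmlentities(), which tells it to leave existing entities alone. This is only a sketch; the wrapper name encode_once() is made up for illustration, and the explicit 'UTF-8' argument assumes the input really is UTF-8.

```php
<?php
// Hypothetical wrapper: encode for display, but do not re-encode existing entities.
function encode_once($text)
{
    // 4th argument (double_encode) = false leaves things like &mdash; untouched.
    return htmlentities($text, ENT_QUOTES, 'UTF-8', false);
}

// UTF-8 bytes for "Ruth Ozeki" + em dash + "shortlisted"
$raw   = "Ruth Ozeki\xE2\x80\x94shortlisted";
$once  = encode_once($raw);   // em dash becomes &mdash;
$twice = encode_once($once);  // unchanged, because the entity is recognized

// For comparison: the default behavior encodes the ampersand a second time.
$damaged = htmlentities($once, ENT_QUOTES, 'UTF-8');

echo $once, "\n", $twice, "\n", $damaged, "\n";
```

Note that this only protects against encoding twice; it does nothing for data that arrives in the wrong character set.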

rascal (Author) Commented:
Thanks for the thoughtful replies Ray. We are not using htmlentities() at all on the page, and the Amazon library code that we invoke to fetch the book description doesn't use it either.

We tried setting the top of the page html with your sample code, but it still did not render what was supposed to be an mdash (it's actually &#151; when we view the source on the Amazon page), and some other chars don't render at all.

Here is the actual code we use:

<!DOCTYPE html>
<html dir="ltr" lang="en-US">
<head>
<meta charset="utf-8" />
<meta name="robots" content="noindex, nofollow" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>
<body>

<?php
try
{
    $amazonEcs = new AmazonECS(AWS_API_KEY, AWS_API_SECRET_KEY, 'com', AWS_ASSOCIATE_TAG);

    $response = $amazonEcs->category('Books')->responseGroup('Large')->search("9780143124870");
    
    if (gettype($response->Items->Item->EditorialReviews->EditorialReview)=='array')
    {
        $s_description=$response->Items->Item->EditorialReviews->EditorialReview[0]->Content;      
    }
    else
    {
        $s_description=$response->Items->Item->EditorialReviews->EditorialReview->Content;    
    }
    
    echo $s_description;
}
catch(Exception $e)
{
  echo $e->getMessage();
}
?>
</body>
</html>


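The array-or-object check in the code above could also be wrapped in a small helper that tolerates a missing review. This is a sketch only: first_review_content() is a made-up name, and the stdClass objects below are stand-ins for the shape of the response, not real API output.

```php
<?php
// Hypothetical helper: return the first editorial review text, or null if none.
function first_review_content($item)
{
    if (!isset($item->EditorialReviews->EditorialReview)) {
        return null; // no reviews at all
    }
    $review = $item->EditorialReviews->EditorialReview;
    if (is_array($review)) {
        $review = reset($review); // several reviews: take the first
    }
    return isset($review->Content) ? $review->Content : null;
}

// Stand-in data mimicking the two shapes (single object vs. array of objects).
$single = (object) array('EditorialReviews' => (object) array(
    'EditorialReview' => (object) array('Content' => 'One review'),
));
$many = (object) array('EditorialReviews' => (object) array(
    'EditorialReview' => array(
        (object) array('Content' => 'First review'),
        (object) array('Content' => 'Second review'),
    ),
));

echo first_review_content($single), "\n";
echo first_review_content($many), "\n";
```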

rascal (Author) Commented:
Also, here is the Amazon source code that performs the actual book data fetch:

<?php
/**
 * Amazon ECS Class
 * http://www.amazon.com
 * =====================
 *
 * This class fetches product information via the Product Advertising API by Amazon (formerly ECS).
 * It supports three basic operations: ItemSearch, ItemLookup and BrowseNodeLookup.
 * These operations could be expanded with extra parameters to specialize the query.
 *
 * Requirement is the PHP extension SOAP.
 *
 * @package      AmazonECS
 * @license      http://www.gnu.org/licenses/gpl.txt GPL
 * @version      1.3.3
 * @author       Exeu <exeu65@googlemail.com>
 * @contributor  Julien Chaumond <chaumond@gmail.com>
 * @link         http://github.com/Exeu/Amazon-ECS-PHP-Library/wiki Wiki
 * @link         http://github.com/Exeu/Amazon-ECS-PHP-Library Source
 */
class AmazonECS
{
  const RETURN_TYPE_ARRAY  = 1;
  const RETURN_TYPE_OBJECT = 2;

  /**
   * Baseconfigurationstorage
   *
   * @var array
   */
  private $requestConfig = array(
    'requestDelay' => false
  );

  /**
   * Responseconfigurationstorage
   *
   * @var array
   */
  private $responseConfig = array(
    'returnType'          => self::RETURN_TYPE_OBJECT,
    'responseGroup'       => 'Small',
    'optionalParameters'  => array()
  );

  /**
   * All possible locations
   *
   * @var array
   */
  private $possibleLocations = array('de', 'com', 'co.uk', 'ca', 'fr', 'co.jp', 'it', 'cn', 'es');

  /**
   * The WSDL File
   *
   * @var string
   */
  protected $webserviceWsdl = 'http://webservices.amazon.com/AWSECommerceService/AWSECommerceService.wsdl';

  /**
   * The SOAP Endpoint
   *
   * @var string
   */
  protected $webserviceEndpoint = 'https://webservices.amazon.%%COUNTRY%%/onca/soap?Service=AWSECommerceService';

  /**
   * @param string $accessKey
   * @param string $secretKey
   * @param string $country
   * @param string $associateTag
   */
  public function __construct($accessKey, $secretKey, $country, $associateTag)
  {
    if (empty($accessKey) || empty($secretKey))
    {
      throw new Exception('No Access Key or Secret Key has been set');
    }

    $this->requestConfig['accessKey']     = $accessKey;
    $this->requestConfig['secretKey']     = $secretKey;
    $this->associateTag($associateTag);
    $this->country($country);
  }

  /**
   * execute search
   *
   * @param string $pattern
   *
   * @return array|object return type depends on setting
   *
   * @see returnType()
   */
  public function search($pattern, $nodeId = null)
  {
    if (false === isset($this->requestConfig['category']))
    {
      throw new Exception('No Category given: Please set it up before');
    }

    $browseNode = array();
    if (null !== $nodeId && true === $this->validateNodeId($nodeId))
    {
      $browseNode = array('BrowseNode' => $nodeId);
    }

    $params = $this->buildRequestParams('ItemSearch', array_merge(
      array(
        'Keywords' => $pattern,
        'SearchIndex' => $this->requestConfig['category']
      ),
      $browseNode
    ));

    return $this->returnData(
      $this->performSoapRequest("ItemSearch", $params)
    );
  }

  /**
   * execute ItemLookup request
   *
   * @param string $asin
   *
   * @return array|object return type depends on setting
   *
   * @see returnType()
   */
  public function lookup($asin)
  {
    $params = $this->buildRequestParams('ItemLookup', array(
      'ItemId' => $asin,
    ));

    return $this->returnData(
      $this->performSoapRequest("ItemLookup", $params)
    );
  }

  /**
   * Implementation of BrowseNodeLookup
   * This allows to fetch information about nodes (children, ancestors, etc.)
   *
   * @param integer $nodeId
   */
  public function browseNodeLookup($nodeId)
  {
    $this->validateNodeId($nodeId);

    $params = $this->buildRequestParams('BrowseNodeLookup', array(
      'BrowseNodeId' => $nodeId
    ));

    return $this->returnData(
      $this->performSoapRequest("BrowseNodeLookup", $params)
    );
  }

  /**
   * Implementation of SimilarityLookup
   * This allows to fetch information about product related to the parameter product
   *
   * @param string $asin
   */
  public function similarityLookup($asin)
  {
    $params = $this->buildRequestParams('SimilarityLookup', array(
      'ItemId' => $asin
    ));

    return $this->returnData(
      $this->performSoapRequest("SimilarityLookup", $params)
    );
  }

  /**
   * Builds the request parameters
   *
   * @param string $function
   * @param array  $params
   *
   * @return array
   */
  protected function buildRequestParams($function, array $params)
  {
    $associateTag = array();

    if(false === empty($this->requestConfig['associateTag']))
    {
      $associateTag = array('AssociateTag' => $this->requestConfig['associateTag']);
    }

    return array_merge(
      $associateTag,
      array(
        'AWSAccessKeyId' => $this->requestConfig['accessKey'],
        'Request' => array_merge(
          array('Operation' => $function),
          $params,
          $this->responseConfig['optionalParameters'],
          array('ResponseGroup' => $this->prepareResponseGroup())
    )));
  }

  /**
   * Prepares the responsegroups and returns them as array
   *
   * @return array|prepared responsegroups
   */
  protected function prepareResponseGroup()
  {
    if (false === strstr($this->responseConfig['responseGroup'], ','))
      return $this->responseConfig['responseGroup'];

    return explode(',', $this->responseConfig['responseGroup']);
  }

  /**
   * @param string $function Name of the function which should be called
   * @param array $params Requestparameters 'ParameterName' => 'ParameterValue'
   *
   * @return array The response as an array with stdClass objects
   */
  protected function performSoapRequest($function, $params)
  {
    if (true ===  $this->requestConfig['requestDelay']) {
      sleep(1);
    }

    $soapClient = new SoapClient(
      $this->webserviceWsdl,
      array('exceptions' => 1)
    );

    $soapClient->__setLocation(str_replace(
      '%%COUNTRY%%',
      $this->responseConfig['country'],
      $this->webserviceEndpoint
    ));

    $soapClient->__setSoapHeaders($this->buildSoapHeader($function));

    return $soapClient->__soapCall($function, array($params));
  }

  /**
   * Provides some necessary soap headers
   *
   * @param string $function
   *
   * @return array Each element is a concrete SoapHeader object
   */
  protected function buildSoapHeader($function)
  {
    $timeStamp = $this->getTimestamp();
    $signature = $this->buildSignature($function . $timeStamp);

    return array(
      new SoapHeader(
        'http://security.amazonaws.com/doc/2007-01-01/',
        'AWSAccessKeyId',
        $this->requestConfig['accessKey']
      ),
      new SoapHeader(
        'http://security.amazonaws.com/doc/2007-01-01/',
        'Timestamp',
        $timeStamp
      ),
      new SoapHeader(
        'http://security.amazonaws.com/doc/2007-01-01/',
        'Signature',
        $signature
      )
    );
  }

  /**
   * provides current gm date
   *
   * primary needed for the signature
   *
   * @return string
   */
  final protected function getTimestamp()
  {
    return gmdate("Y-m-d\TH:i:s\Z");
  }

  /**
   * provides the signature
   *
   * @return string
   */
  final protected function buildSignature($request)
  {
    return base64_encode(hash_hmac("sha256", $request, $this->requestConfig['secretKey'], true));
  }

  /**
   * Basic validation of the nodeId
   *
   * @param integer $nodeId
   *
   * @return boolean
   */
  final protected function validateNodeId($nodeId)
  {
    if (false === is_numeric($nodeId) || $nodeId <= 0)
    {
      throw new InvalidArgumentException(sprintf('Node has to be a positive Integer.'));
    }

    return true;
  }

  /**
   * Returns the response either as Array or Array/Object
   *
   * @param object $object
   *
   * @return mixed
   */
  protected function returnData($object)
  {
    switch ($this->responseConfig['returnType'])
    {
      case self::RETURN_TYPE_OBJECT:
        return $object;
      break;

      case self::RETURN_TYPE_ARRAY:
        return $this->objectToArray($object);
      break;

      default:
        throw new InvalidArgumentException(sprintf(
          "Unknown return type %s", $this->responseConfig['returnType']
        ));
      break;
    }
  }

  /**
   * Transforms the responseobject to an array
   *
   * @param object $object
   *
   * @return array An arrayrepresentation of the given object
   */
  protected function objectToArray($object)
  {
    $out = array();
    foreach ($object as $key => $value)
    {
      switch (true)
      {
        case is_object($value):
          $out[$key] = $this->objectToArray($value);
        break;

        case is_array($value):
          $out[$key] = $this->objectToArray($value);
        break;

        default:
          $out[$key] = $value;
        break;
      }
    }

    return $out;
  }

  /**
   * set or get optional parameters
   *
   * if the argument params is null it will return the current parameters,
   * otherwise it will set the params and return itself.
   *
   * @param array $params the optional parameters
   *
   * @return array|AmazonECS depends on params argument
   */
  public function optionalParameters($params = null)
  {
    if (null === $params)
    {
      return $this->responseConfig['optionalParameters'];
    }

    if (false === is_array($params))
    {
      throw new InvalidArgumentException(sprintf(
        "%s is no valid parameter: Use an array with Key => Value Pairs", $params
      ));
    }

    $this->responseConfig['optionalParameters'] = $params;

    return $this;
  }

  /**
   * Set or get the country
   *
   * if the country argument is null it will return the current
   * country, otherwise it will set the country and return itself.
   *
   * @param string|null $country
   *
   * @return string|AmazonECS depends on country argument
   */
  public function country($country = null)
  {
    if (null === $country)
    {
      return $this->responseConfig['country'];
    }

    if (false === in_array(strtolower($country), $this->possibleLocations))
    {
      throw new InvalidArgumentException(sprintf(
        "Invalid Country-Code: %s! Possible Country-Codes: %s",
        $country,
        implode(', ', $this->possibleLocations)
      ));
    }

    $this->responseConfig['country'] = strtolower($country);

    return $this;
  }

  /**
   * Setting/Getting the amazon category
   *
   * @param string $category
   *
   * @return string|AmazonECS depends on category argument
   */
  public function category($category = null)
  {
    if (null === $category)
    {
      return isset($this->requestConfig['category']) ? $this->requestConfig['category'] : null;
    }

    $this->requestConfig['category'] = $category;

    return $this;
  }

  /**
   * Setting/Getting the responsegroup
   *
   * @param string $responseGroup Comma separated groups
   *
   * @return string|AmazonECS depends on responseGroup argument
   */
  public function responseGroup($responseGroup = null)
  {
    if (null === $responseGroup)
    {
      return $this->responseConfig['responseGroup'];
    }

    $this->responseConfig['responseGroup'] = $responseGroup;

    return $this;
  }

  /**
   * Setting/Getting the returntype
   * It can be an object or an array
   *
   * @param integer $type Use the constants RETURN_TYPE_ARRAY or RETURN_TYPE_OBJECT
   *
   * @return integer|AmazonECS depends on type argument
   */
  public function returnType($type = null)
  {
    if (null === $type)
    {
      return $this->responseConfig['returnType'];
    }

    $this->responseConfig['returnType'] = $type;

    return $this;
  }

  /**
   * Setter/Getter of the AssociateTag.
   * This could be used for late bindings of this attribute
   *
   * @param string $associateTag
   *
   * @return string|AmazonECS depends on associateTag argument
   */
  public function associateTag($associateTag = null)
  {
    if (null === $associateTag)
    {
      return $this->requestConfig['associateTag'];
    }

    $this->requestConfig['associateTag'] = $associateTag;

    return $this;
  }

  /**
   * @deprecated use returnType() instead
   */
  public function setReturnType($type)
  {
    return $this->returnType($type);
  }

  /**
   * Setting the resultpage to a specified value.
   * Allows to browse resultsets which have more than one page.
   *
   * @param integer $page
   *
   * @return AmazonECS
   */
  public function page($page)
  {
    if (false === is_numeric($page) || $page <= 0)
    {
      throw new InvalidArgumentException(sprintf(
        '%s is an invalid page value. It has to be numeric and positive',
        $page
      ));
    }

    $this->responseConfig['optionalParameters'] = array_merge(
      $this->responseConfig['optionalParameters'],
      array("ItemPage" => $page)
    );

    return $this;
  }

  /**
   * Enables or disables the request delay.
   * If it is enabled (true) every request is delayed one second to get rid of the api request limit.
   *
   * Reasons for this you can read on this site:
   * https://affiliate-program.amazon.com/gp/advertising/api/detail/faq.html
   *
   * By default the requestdelay is disabled
   *
   * @param boolean $enable true = enabled, false = disabled
   *
   * @return boolean|AmazonECS depends on enable argument
   */
  public function requestDelay($enable = null)
  {
    if (false === is_null($enable) && true === is_bool($enable))
    {
      $this->requestConfig['requestDelay'] = $enable;

      return $this;
    }

    return $this->requestConfig['requestDelay'];
  }
}



Ray Paseur Commented:
The AmazonECS class may be creating something that is programmatically "odd."  The information will be found in this variable that is created in the foreign web service:

echo $s_description;

You might try using var_dump($s_description) and looking at the "view source" output to discern what is in that variable.  The root causes of something like this require an understanding of HTML as well as character encoding, and the issues may be hidden below the surface. For now, I would try working with var_dump() and "view source" to see if it might be Amazon's problem instead of your problem.
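
Along the same lines as var_dump(), a byte-level view can make the encoding obvious. The sketch below is an assumption about how one might inspect the string; describe_bytes() is a made-up helper, and the sample strings are stand-ins for whatever the API actually returns.

```php
<?php
// Hypothetical helper: show every byte outside printable ASCII as [XX] in hex.
function describe_bytes($text)
{
    // No /u modifier, so the pattern works on single bytes, not characters.
    return preg_replace_callback('/[^\x20-\x7E]/', function ($m) {
        return '[' . strtoupper(bin2hex($m[0])) . ']';
    }, $text);
}

// Three ways the same dash could arrive (stand-in data):
echo describe_bytes("Ozeki\xE2\x80\x94shortlisted"), "\n"; // UTF-8 bytes
echo describe_bytes("Ozeki\x97shortlisted"), "\n";         // Windows-1252 byte
echo describe_bytes("Ozeki&#151;shortlisted"), "\n";       // numeric entity, plain ASCII
```

Reading the result: [E2][80][94] means the dash arrived as UTF-8, a lone [97] means Windows-1252, and a literal &#151; means the entity was already in the data.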

rascal (Author) Commented:
Sorry for any confusion on this - the $s_description is my own variable that I created to hold the:
$response->Items->Item->EditorialReviews->EditorialReview->Content

I just use $s_description from that response object as a shorter variable name than dealing with
$response->Items->Item->EditorialReviews->EditorialReview->Content.

The response "Content" is just a string.

Ray Paseur Commented:
The response "Content" is just a string.
Yes, that's correct.  But its point of origin is "somewhere" and we want to trace our steps back to find that datum.

rascal (Author) Commented:
Hi Ray,
The actual SOAP function call to fetch and return the data is contained in a function inside of the ECS class listed above:

protected function performSoapRequest($function, $params)
  {
    if (true ===  $this->requestConfig['requestDelay']) {
      sleep(1);
    }

    $soapClient = new SoapClient(
      $this->webserviceWsdl,
      array('exceptions' => 1)
    );

    $soapClient->__setLocation(str_replace(
      '%%COUNTRY%%',
      $this->responseConfig['country'],
      $this->webserviceEndpoint
    ));

    $soapClient->__setSoapHeaders($this->buildSoapHeader($function));

    return $soapClient->__soapCall($function, array($params));
  }



In that entire class source code I don't see any references to any HTML handling (no htmlentities(), htmlspecialchars(), utf8_encode()/utf8_decode(), etc.).

Ray Paseur Commented:
It may not be possible to find the true point of origin for the "entitized" m-dash character, but if you're not using something in your code that changes the m-dash into a numeric HTML entity, then the data string must be coming from the API with the entity inside it.  Here is what I'm thinking...

1.  If you echo the m-dash character you risk having a character-encoding collision.   It is a single byte (0x97) in Windows-1252, a three-byte sequence (0xE2 0x80 0x94) in UTF-8, and it does not exist in strict ISO-8859-1 at all.

2. There are two ways to solve this problem.  One of them is to enforce UTF-8 or ISO-8859-1 encoding.  But that may not be practical in designing an API because you do not know what display character set your clients might be using.  The other way...

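3. Is to return an entity that any browser can display correctly, no matter what character encoding is used.  That leads us to &mdash; or &#151; both of which render the correct character in the browser output.  (Strictly speaking, &#8212; is the standards-correct numeric form; browsers map &#151; to the same character for Windows-1252 compatibility.)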

4. If you look at the browser's "view source" output you will be able to see the entity code (instead of the rendered character).

5. If a programmer accidentally uses the numeric entity in a character string (perhaps this came from external input and got stored in the database that way), then uses the (correct, appropriate, respected, proper) escape sequence to render the output, you will get the effect you're seeing here.

Executive summary: I believe that the &#151; entity is coming from the API.  The way to track this down would be to examine the data you're getting from the API with a "view source" strategy.  If you want to post the API credentials so that I can duplicate your work, I'll be glad to explore it a little more.
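
If the data does contain Windows-1252 style numeric entities such as &#151;, one possible cleanup is to rewrite them to their Unicode code points before display. A minimal sketch, assuming only the common punctuation matters; fix_cp1252_entities() is a made-up name and the map is deliberately partial.

```php
<?php
// Hypothetical helper: rewrite numeric entities in the 128-159 range
// (Windows-1252 positions) to the equivalent Unicode numeric entities.
function fix_cp1252_entities($html)
{
    // Windows-1252 value => Unicode code point (partial map, common punctuation)
    $map = array(
        128 => 8364, // euro sign
        133 => 8230, // ellipsis
        145 => 8216, // left single quote
        146 => 8217, // right single quote
        147 => 8220, // left double quote
        148 => 8221, // right double quote
        150 => 8211, // en dash
        151 => 8212, // em dash
        153 => 8482, // trade mark sign
    );

    return preg_replace_callback('/&#(\d+);/', function ($m) use ($map) {
        $n = (int) $m[1];
        // Entities that are not in the map are passed through unchanged.
        return isset($map[$n]) ? '&#' . $map[$n] . ';' : $m[0];
    }, $html);
}

echo fix_cp1252_entities('Ruth Ozeki&#151;shortlisted'), "\n";
```

Because the output is still pure ASCII, it displays the same way whichever charset the page declares.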

rascal (Author) Commented:
Thanks Ray,
The API is Amazon's API for fetching books from the Amazon website, so the credentials I have are the client's live credentials which unfortunately I cannot share.

What's odd is that when I use different character encoding on my web page to display the results received from Amazon, curing one character breaks another. In other words, when I make a charset designation that cures the mdash, then other chars such as the MSWord double-quote no longer appear, or appear as garbage where they appeared normally before.

It seems like the data received from Amazon is a mix of Windows-1252, UTF-8, ISO-8859-1 and others, all in one data stream. (Possibly the person who posted the book on Amazon used MS Word and pulled their source from multiple locations and encodings?)

Looks like we might just have to pull down what we can from Amazon, then tell the client to review the content in the CMS we provide for them and edit where necessary to correct any formatting errors.
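
For what it's worth, here is a rough sketch of one programmatic repair for a stream that mixes UTF-8 and Windows-1252 bytes. Everything in it is an assumption: the function names are made up, the Windows-1252 map is partial, and the UTF-8 pattern is simplified. It keeps byte runs that look like UTF-8 and converts any leftover high byte as if it were Windows-1252. It is a one-way change and can guess wrong, since some Windows-1252 byte pairs are also valid UTF-8.

```php
<?php
// Encode a code point below U+10000 as UTF-8 bytes.
function bmp_to_utf8($cp)
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xE0 | ($cp >> 12))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical helper: normalize a mixed UTF-8 / Windows-1252 string to UTF-8.
function repair_mixed_bytes($text)
{
    // Windows-1252 byte => Unicode code point (partial map, common punctuation)
    $cp1252 = array(
        0x80 => 0x20AC, 0x85 => 0x2026, 0x91 => 0x2018, 0x92 => 0x2019,
        0x93 => 0x201C, 0x94 => 0x201D, 0x96 => 0x2013, 0x97 => 0x2014,
        0x99 => 0x2122,
    );
    // First three alternatives: byte runs shaped like UTF-8 (simplified check).
    // Last alternative: any single leftover high byte.
    $pattern = '/[\xC2-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}'
             . '|[\xF0-\xF4][\x80-\xBF]{3}|[\x80-\xFF]/';

    return preg_replace_callback($pattern, function ($m) use ($cp1252) {
        if (strlen($m[0]) > 1) {
            return $m[0]; // already looks like UTF-8, keep as is
        }
        $byte = ord($m[0]);
        if ($byte >= 0xA0) {
            return bmp_to_utf8($byte); // same code point as ISO-8859-1
        }
        // Unmapped bytes become U+FFFD (replacement character).
        return isset($cp1252[$byte]) ? bmp_to_utf8($cp1252[$byte]) : "\xEF\xBF\xBD";
    }, $text);
}

echo bin2hex(repair_mixed_bytes("caf\xE9 \x97 ok")), "\n";
```

Running it on a copy of the data and keeping the original untouched leaves room for the manual review described above.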

Ray Paseur Commented:
I think you're on firm ground in your understanding of the issue.  There are some programmatic ways of dealing with this, but all of them are a "mung" and once the data has been changed, it's difficult or impossible to change it back into anything that is faithful to the original.  Cut and paste from Word has always been a problem.  I think "review and correct" is about the best you can do.

rascal (Author) Commented:
Thanks Ray!

Ray Paseur Commented:
Glad to help - best of luck with the project, ~Ray