Solved

Special chars lost when pulling from Amazon

Posted on 2014-10-05
Last Modified: 2014-10-06
We are using the Amazon Product Advertising API to pull book information into our website, and all is going well except for special chars in the text. The long dash (represented by Amazon as &#151; in its text; see the attached image shortlisted.jpg) is being dropped, and other special chars show up on our web pages as garbage.

I know this is an age-old question. We thought we had the solution when we went to the actual Amazon page for the book in question (the book is called "A Tale for the Time Being") and examined the page's information using Firefox.
By the way, the link to the Amazon page is http://www.amazon.com/Tale-Time-Being-Novel/dp/0143124870/ref=sr_1_1?ie=UTF8&qid=1412538139&sr=8-1&keywords=A+tale+for+the+time+being

When we select "Page Info" in Firefox while on the Amazon page, we get the following info (see the attached pageinfo.jpg).

We went into our page and tried to replicate those settings, but when we view our page it still does not show the various special chars (you can see our page's display results in the attached image ourpage.jpg).
You can see the HTML we set at the top of our page by viewing ourhtml.jpg attached.

Any thoughts on how we can get these special chars to display correctly?

thanks experts!
pageinfo.jpg
shortlisted.jpg
ourpage.jpg
ourhtml.jpg
Question by:rascal
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 40362589
This is almost certainly a character set collision, perhaps in combination with htmlentities() which is used to prepare documents for browser display.  Here's a possible explanation and solutions.
http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/A_11880-Unicode-PHP-and-Character-Collisions.html

That said, I will now look at the data a little more and post back if I see anything important.
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 40362597
OK, in the "ourhtml" image, we have a problem.  The PHP header() command cannot send a header if any browser output has been sent, and since it's embedded inside HTML, it will fail.  You might want to rearrange the logic (or better yet, remove the unnecessary PHP).  One of these solutions should work.

<!DOCTYPE html>
<html dir="ltr" lang="en-US">
<head>
<meta charset="utf-8" />
<meta name="robots" content="noindex, nofollow" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>As Needed (UTF-8)</title>
</head>
<body>


<!DOCTYPE html>
<html dir="ltr" lang="en-US">
<head>
<meta charset="iso-8859-1" />
<meta name="robots" content="noindex, nofollow" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>As Needed (ISO-8859-1)</title>
</head>
<body>


Which one will work?  That depends on the encoding in your HTML document.
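If the page really does need to declare its charset from PHP instead of the meta tag, a minimal sketch might look like this. The helper name content_type_line() is made up for illustration; the point is only that header() has to run before any output at all, including whitespace ahead of the opening PHP tag.

```php
<?php
// Hypothetical helper (not from the original page): builds the header line.
function content_type_line(string $charset): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

// header() only works while no output has been sent yet,
// so this belongs at the very top of the script.
if (!headers_sent()) {
    header(content_type_line('utf-8'));
}
```

The charset named here should agree with the one in the meta tag and with the actual encoding of the bytes being sent.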
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 40362611
One other thought... Please show us the exact URL of the API that gives you the HTML you're rendering on your web page.  It may be that the htmlentities() encoding is happening twice.  PHP htmlentities() is used to prepare external input for "safe" display on a web page.  By safe, we mean that the external input cannot run JavaScript or inject unwanted HTML tags into the display page.  Here is an example that shows the double encoding at work.  The encoding is a mung: it cannot be run more than once without data damage.  To use this script you must save it as a UTF-8 encoded file.
http://iconoun.com/demo/temp_rascal.php

<?php // demo/temp_rascal.php
error_reporting(E_ALL);

// CREATE VARIABLES FOR OUR HTML
$abc = "Ruth Ozeki—shortlisted";
$def = htmlentities($abc);
$ghi = htmlentities($def);

// CREATE OUR WEB PAGE IN HTML5 FORMAT
$htm = <<<HTML5
<!DOCTYPE html>
<html dir="ltr" lang="en-US">
<head>
<meta charset="utf-8" />
<meta name="robots" content="noindex, nofollow" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>E-E Q_28531658</title>
</head>
<body>

<noscript>Your browsing experience will be better with JavaScript enabled!</noscript>

<p>$abc</p>
<p>$def</p>
<p>$ghi</p>

</body>
</html>
HTML5;

// RENDER THE WEB PAGE
echo $htm;
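If double encoding does turn out to be the culprit, one possible guard is the fourth argument of htmlentities(), double_encode, which tells PHP to leave existing entities alone. A small sketch, using byte escapes for the em dash so the result does not depend on how the file is saved (the variable names are illustrative only):

```php
<?php
// "\xE2\x80\x94" is the UTF-8 byte sequence for the em dash.
$raw   = "Ruth Ozeki\xE2\x80\x94shortlisted";

// First pass: the em dash becomes the named entity &mdash;
$once  = htmlentities($raw, ENT_QUOTES, 'UTF-8');

// Second pass with defaults: the & of &mdash; becomes &amp; (the damage).
$twice = htmlentities($once, ENT_QUOTES, 'UTF-8');

// Second pass with double_encode = false: existing entities are left alone.
$guard = htmlentities($once, ENT_QUOTES, 'UTF-8', false);
```

Here $twice ends up containing &amp;mdash;, which the browser shows as the literal text of the entity, while $guard is identical to $once.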

 
LVL 1

Author Comment

by:rascal
ID: 40362786
Thanks for the thoughtful replies Ray. We are not using htmlentities() at all on the page, and the Amazon library code that we invoke to fetch the book description doesn't use it either.

We tried setting the top of the page html with your sample code, but it still did not render what was supposed to be an mdash (it's actually &#151; when we view the source on the Amazon page), and some other chars don't render at all.

Here is the actual code we use:

<!DOCTYPE html>
<html dir="ltr" lang="en-US">
<head>
<meta charset="utf-8" />
<meta name="robots" content="noindex, nofollow" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>
<body>

<?php
try
{
    $amazonEcs = new AmazonECS(AWS_API_KEY, AWS_API_SECRET_KEY, 'com', AWS_ASSOCIATE_TAG);

    $response = $amazonEcs->category('Books')->responseGroup('Large')->search("9780143124870");
    
    if (gettype($response->Items->Item->EditorialReviews->EditorialReview)=='array')
    {
        $s_description=$response->Items->Item->EditorialReviews->EditorialReview[0]->Content;      
    }
    else
    {
        $s_description=$response->Items->Item->EditorialReviews->EditorialReview->Content;    
    }
    
    echo $s_description;
}
catch(Exception $e)
{
  echo $e->getMessage();
}
?>
</body>
</html>

 
LVL 1

Author Comment

by:rascal
ID: 40362789
Also, here is the Amazon source code that performs the actual book data fetch:

<?php
/**
 * Amazon ECS Class
 * http://www.amazon.com
 * =====================
 *
 * This class fetches product information via the Product Advertising API by Amazon (formerly ECS).
 * It supports three basic operations: ItemSearch, ItemLookup and BrowseNodeLookup.
 * These operations can be extended with extra parameters to specialize the query.
 *
 * Requirement is the PHP extension SOAP.
 *
 * @package      AmazonECS
 * @license      http://www.gnu.org/licenses/gpl.txt GPL
 * @version      1.3.3
 * @author       Exeu <exeu65@googlemail.com>
 * @contributor  Julien Chaumond <chaumond@gmail.com>
 * @link         http://github.com/Exeu/Amazon-ECS-PHP-Library/wiki Wiki
 * @link         http://github.com/Exeu/Amazon-ECS-PHP-Library Source
 */
class AmazonECS
{
  const RETURN_TYPE_ARRAY  = 1;
  const RETURN_TYPE_OBJECT = 2;

  /**
   * Baseconfigurationstorage
   *
   * @var array
   */
  private $requestConfig = array(
    'requestDelay' => false
  );

  /**
   * Responseconfigurationstorage
   *
   * @var array
   */
  private $responseConfig = array(
    'returnType'          => self::RETURN_TYPE_OBJECT,
    'responseGroup'       => 'Small',
    'optionalParameters'  => array()
  );

  /**
   * All possible locations
   *
   * @var array
   */
  private $possibleLocations = array('de', 'com', 'co.uk', 'ca', 'fr', 'co.jp', 'it', 'cn', 'es');

  /**
   * The WSDL File
   *
   * @var string
   */
  protected $webserviceWsdl = 'http://webservices.amazon.com/AWSECommerceService/AWSECommerceService.wsdl';

  /**
   * The SOAP Endpoint
   *
   * @var string
   */
  protected $webserviceEndpoint = 'https://webservices.amazon.%%COUNTRY%%/onca/soap?Service=AWSECommerceService';

  /**
   * @param string $accessKey
   * @param string $secretKey
   * @param string $country
   * @param string $associateTag
   */
  public function __construct($accessKey, $secretKey, $country, $associateTag)
  {
    if (empty($accessKey) || empty($secretKey))
    {
      throw new Exception('No Access Key or Secret Key has been set');
    }

    $this->requestConfig['accessKey']     = $accessKey;
    $this->requestConfig['secretKey']     = $secretKey;
    $this->associateTag($associateTag);
    $this->country($country);
  }

  /**
   * execute search
   *
   * @param string $pattern
   *
   * @return array|object return type depends on setting
   *
   * @see returnType()
   */
  public function search($pattern, $nodeId = null)
  {
    if (false === isset($this->requestConfig['category']))
    {
      throw new Exception('No Category given: Please set it up before');
    }

    $browseNode = array();
    if (null !== $nodeId && true === $this->validateNodeId($nodeId))
    {
      $browseNode = array('BrowseNode' => $nodeId);
    }

    $params = $this->buildRequestParams('ItemSearch', array_merge(
      array(
        'Keywords' => $pattern,
        'SearchIndex' => $this->requestConfig['category']
      ),
      $browseNode
    ));

    return $this->returnData(
      $this->performSoapRequest("ItemSearch", $params)
    );
  }

  /**
   * execute ItemLookup request
   *
   * @param string $asin
   *
   * @return array|object return type depends on setting
   *
   * @see returnType()
   */
  public function lookup($asin)
  {
    $params = $this->buildRequestParams('ItemLookup', array(
      'ItemId' => $asin,
    ));

    return $this->returnData(
      $this->performSoapRequest("ItemLookup", $params)
    );
  }

  /**
   * Implementation of BrowseNodeLookup
   * This allows fetching information about nodes (children, ancestors, etc.)
   *
   * @param integer $nodeId
   */
  public function browseNodeLookup($nodeId)
  {
    $this->validateNodeId($nodeId);

    $params = $this->buildRequestParams('BrowseNodeLookup', array(
      'BrowseNodeId' => $nodeId
    ));

    return $this->returnData(
      $this->performSoapRequest("BrowseNodeLookup", $params)
    );
  }

  /**
   * Implementation of SimilarityLookup
   * This allows to fetch information about product related to the parameter product
   *
   * @param string $asin
   */
  public function similarityLookup($asin)
  {
    $params = $this->buildRequestParams('SimilarityLookup', array(
      'ItemId' => $asin
    ));

    return $this->returnData(
      $this->performSoapRequest("SimilarityLookup", $params)
    );
  }

  /**
   * Builds the request parameters
   *
   * @param string $function
   * @param array  $params
   *
   * @return array
   */
  protected function buildRequestParams($function, array $params)
  {
    $associateTag = array();

    if(false === empty($this->requestConfig['associateTag']))
    {
      $associateTag = array('AssociateTag' => $this->requestConfig['associateTag']);
    }

    return array_merge(
      $associateTag,
      array(
        'AWSAccessKeyId' => $this->requestConfig['accessKey'],
        'Request' => array_merge(
          array('Operation' => $function),
          $params,
          $this->responseConfig['optionalParameters'],
          array('ResponseGroup' => $this->prepareResponseGroup())
    )));
  }

  /**
   * Prepares the responsegroups and returns them as array
   *
   * @return array|prepared responsegroups
   */
  protected function prepareResponseGroup()
  {
    if (false === strstr($this->responseConfig['responseGroup'], ','))
      return $this->responseConfig['responseGroup'];

    return explode(',', $this->responseConfig['responseGroup']);
  }

  /**
   * @param string $function Name of the function which should be called
   * @param array $params Requestparameters 'ParameterName' => 'ParameterValue'
   *
   * @return array The response as an array with stdClass objects
   */
  protected function performSoapRequest($function, $params)
  {
    if (true ===  $this->requestConfig['requestDelay']) {
      sleep(1);
    }

    $soapClient = new SoapClient(
      $this->webserviceWsdl,
      array('exceptions' => 1)
    );

    $soapClient->__setLocation(str_replace(
      '%%COUNTRY%%',
      $this->responseConfig['country'],
      $this->webserviceEndpoint
    ));

    $soapClient->__setSoapHeaders($this->buildSoapHeader($function));

    return $soapClient->__soapCall($function, array($params));
  }

  /**
   * Provides some necessary soap headers
   *
   * @param string $function
   *
   * @return array Each element is a concrete SoapHeader object
   */
  protected function buildSoapHeader($function)
  {
    $timeStamp = $this->getTimestamp();
    $signature = $this->buildSignature($function . $timeStamp);

    return array(
      new SoapHeader(
        'http://security.amazonaws.com/doc/2007-01-01/',
        'AWSAccessKeyId',
        $this->requestConfig['accessKey']
      ),
      new SoapHeader(
        'http://security.amazonaws.com/doc/2007-01-01/',
        'Timestamp',
        $timeStamp
      ),
      new SoapHeader(
        'http://security.amazonaws.com/doc/2007-01-01/',
        'Signature',
        $signature
      )
    );
  }

  /**
   * provides current gm date
   *
   * primary needed for the signature
   *
   * @return string
   */
  final protected function getTimestamp()
  {
    return gmdate("Y-m-d\TH:i:s\Z");
  }

  /**
   * provides the signature
   *
   * @return string
   */
  final protected function buildSignature($request)
  {
    return base64_encode(hash_hmac("sha256", $request, $this->requestConfig['secretKey'], true));
  }

  /**
   * Basic validation of the nodeId
   *
   * @param integer $nodeId
   *
   * @return boolean
   */
  final protected function validateNodeId($nodeId)
  {
    if (false === is_numeric($nodeId) || $nodeId <= 0)
    {
      throw new InvalidArgumentException(sprintf('Node has to be a positive Integer.'));
    }

    return true;
  }

  /**
   * Returns the response either as Array or Array/Object
   *
   * @param object $object
   *
   * @return mixed
   */
  protected function returnData($object)
  {
    switch ($this->responseConfig['returnType'])
    {
      case self::RETURN_TYPE_OBJECT:
        return $object;
      break;

      case self::RETURN_TYPE_ARRAY:
        return $this->objectToArray($object);
      break;

      default:
        throw new InvalidArgumentException(sprintf(
          "Unknown return type %s", $this->responseConfig['returnType']
        ));
      break;
    }
  }

  /**
   * Transforms the responseobject to an array
   *
   * @param object $object
   *
   * @return array An arrayrepresentation of the given object
   */
  protected function objectToArray($object)
  {
    $out = array();
    foreach ($object as $key => $value)
    {
      switch (true)
      {
        case is_object($value):
          $out[$key] = $this->objectToArray($value);
        break;

        case is_array($value):
          $out[$key] = $this->objectToArray($value);
        break;

        default:
          $out[$key] = $value;
        break;
      }
    }

    return $out;
  }

  /**
   * set or get optional parameters
   *
   * if the argument params is null it will return the current parameters,
   * otherwise it will set the params and return itself.
   *
   * @param array $params the optional parameters
   *
   * @return array|AmazonECS depends on params argument
   */
  public function optionalParameters($params = null)
  {
    if (null === $params)
    {
      return $this->responseConfig['optionalParameters'];
    }

    if (false === is_array($params))
    {
      throw new InvalidArgumentException(sprintf(
        "%s is no valid parameter: Use an array with Key => Value Pairs", $params
      ));
    }

    $this->responseConfig['optionalParameters'] = $params;

    return $this;
  }

  /**
   * Set or get the country
   *
   * if the country argument is null it will return the current
   * country, otherwise it will set the country and return itself.
   *
   * @param string|null $country
   *
   * @return string|AmazonECS depends on country argument
   */
  public function country($country = null)
  {
    if (null === $country)
    {
      return $this->responseConfig['country'];
    }

    if (false === in_array(strtolower($country), $this->possibleLocations))
    {
      throw new InvalidArgumentException(sprintf(
        "Invalid Country-Code: %s! Possible Country-Codes: %s",
        $country,
        implode(', ', $this->possibleLocations)
      ));
    }

    $this->responseConfig['country'] = strtolower($country);

    return $this;
  }

  /**
   * Setting/Getting the amazon category
   *
   * @param string $category
   *
   * @return string|AmazonECS depends on category argument
   */
  public function category($category = null)
  {
    if (null === $category)
    {
      return isset($this->requestConfig['category']) ? $this->requestConfig['category'] : null;
    }

    $this->requestConfig['category'] = $category;

    return $this;
  }

  /**
   * Setting/Getting the responsegroup
   *
   * @param string $responseGroup Comma separated groups
   *
   * @return string|AmazonECS depends on responseGroup argument
   */
  public function responseGroup($responseGroup = null)
  {
    if (null === $responseGroup)
    {
      return $this->responseConfig['responseGroup'];
    }

    $this->responseConfig['responseGroup'] = $responseGroup;

    return $this;
  }

  /**
   * Setting/Getting the returntype
   * It can be an object or an array
   *
   * @param integer $type Use the constants RETURN_TYPE_ARRAY or RETURN_TYPE_OBJECT
   *
   * @return integer|AmazonECS depends on type argument
   */
  public function returnType($type = null)
  {
    if (null === $type)
    {
      return $this->responseConfig['returnType'];
    }

    $this->responseConfig['returnType'] = $type;

    return $this;
  }

  /**
   * Setter/Getter of the AssociateTag.
   * This could be used for late bindings of this attribute
   *
   * @param string $associateTag
   *
   * @return string|AmazonECS depends on associateTag argument
   */
  public function associateTag($associateTag = null)
  {
    if (null === $associateTag)
    {
      return $this->requestConfig['associateTag'];
    }

    $this->requestConfig['associateTag'] = $associateTag;

    return $this;
  }

  /**
   * @deprecated use returnType() instead
   */
  public function setReturnType($type)
  {
    return $this->returnType($type);
  }

  /**
   * Setting the resultpage to a specified value.
   * Allows to browse resultsets which have more than one page.
   *
   * @param integer $page
   *
   * @return AmazonECS
   */
  public function page($page)
  {
    if (false === is_numeric($page) || $page <= 0)
    {
      throw new InvalidArgumentException(sprintf(
        '%s is an invalid page value. It has to be numeric and positive',
        $page
      ));
    }

    $this->responseConfig['optionalParameters'] = array_merge(
      $this->responseConfig['optionalParameters'],
      array("ItemPage" => $page)
    );

    return $this;
  }

  /**
   * Enables or disables the request delay.
   * If it is enabled (true) every request is delayed one second to get rid of the api request limit.
   *
   * Reasons for this you can read on this site:
   * https://affiliate-program.amazon.com/gp/advertising/api/detail/faq.html
   *
   * By default the requestdelay is disabled
   *
   * @param boolean $enable true = enabled, false = disabled
   *
   * @return boolean|AmazonECS depends on enable argument
   */
  public function requestDelay($enable = null)
  {
    if (false === is_null($enable) && true === is_bool($enable))
    {
      $this->requestConfig['requestDelay'] = $enable;

      return $this;
    }

    return $this->requestConfig['requestDelay'];
  }
}

 
LVL 108

Expert Comment

by:Ray Paseur
ID: 40362822
The AmazonECS class may be creating something that is programmatically "odd."  The information will be found in this variable that is created in the foreign web service:

echo $s_description;

You might try using var_dump($s_description) and looking at the "view source" output to discern what is in that variable.  The root causes of something like this require an understanding of HTML as well as character encoding, and the issues may be hidden below the surface. For now, I would try working with var_dump() and "view source" to see if it might be Amazon's problem instead of your problem.
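To make the "view source" check less ambiguous, it can help to look at the raw bytes as well. This is only a sketch; dump_bytes() is a made-up helper name and is not part of the AmazonECS class.

```php
<?php
// Hypothetical helper: returns the bytes of a string as spaced hex pairs.
function dump_bytes(string $text): string
{
    // bin2hex() exposes the raw bytes; chunk_split() puts a space after
    // every two hex digits, and trim() drops the trailing space.
    return trim(chunk_split(bin2hex($text), 2, ' '));
}
```

Calling echo dump_bytes($s_description); on the page would then show whether the dash arrives as 97 (Windows-1252), as e2 80 94 (UTF-8), or as the literal characters of an entity such as &#151;.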
 
LVL 1

Author Comment

by:rascal
ID: 40362905
Sorry for any confusion on this - the $s_description is my own variable that I created to hold the:
$response->Items->Item->EditorialReviews->EditorialReview->Content

I just use $s_description from that response object as a shorter variable name than dealing with
$response->Items->Item->EditorialReviews->EditorialReview->Content.

The response "Content" is just a string.

 
LVL 108

Expert Comment

by:Ray Paseur
ID: 40363527
The response "Content" is just a string.
Yes, that's correct.  But its point of origin is "somewhere" and we want to trace our steps back to find that datum.
 
LVL 1

Author Comment

by:rascal
ID: 40363889
Hi Ray,
The actual SOAP function call to fetch and return the data is contained in a function inside of the ECS class listed above:

protected function performSoapRequest($function, $params)
  {
    if (true ===  $this->requestConfig['requestDelay']) {
      sleep(1);
    }

    $soapClient = new SoapClient(
      $this->webserviceWsdl,
      array('exceptions' => 1)
    );

    $soapClient->__setLocation(str_replace(
      '%%COUNTRY%%',
      $this->responseConfig['country'],
      $this->webserviceEndpoint
    ));

    $soapClient->__setSoapHeaders($this->buildSoapHeader($function));

    return $soapClient->__soapCall($function, array($params));
  }



In that entire class source code I don't see any references to any HTML handling (no htmlentities(), htmlspecialchars(), utf8_encode()/utf8_decode(), etc.).
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 40364011
It may not be possible to find the true point of origin for the "entitized" m-dash character, but if you're not using something in your code that changes the m-dash into a numeric HTML entity, then the data string must be coming from the API with the entity inside it.  Here is what I'm thinking...

1.  If you echo the raw m-dash character you risk having a character-encoding collision.   It is the single byte 0x97 in Windows-1252 (which is how browsers treat a page labeled ISO-8859-1) and the three bytes E2 80 94 in UTF-8.

2. There are two ways to solve this problem.  One of them is to enforce UTF-8 or ISO-8859-1 encoding.  But that may not be practical in designing an API because you do not know what display character set your clients might be using.  The other way...

3. Is to return an entity that any browser can display correctly, no matter what character encoding is used.  That leads us to &mdash; or &#151; (strictly, &#8212;), both of which render the correct character in the browser output.

4. If you look at the browser's "view source" output you will be able to see the entity code (instead of the rendered character).

5. If a programmer accidentally uses the numeric entity in a character string (perhaps this came from external input and got stored in the database that way), then uses the (correct, appropriate, respected, proper) escape sequence to render the output, you will get the effect you're seeing here.

Executive summary: I believe that the &#151; entity is coming from the API.  The way to track this down would be to examine the data you're getting from the API with a "view source" strategy.  If you want to post the API credentials so that I can duplicate your work, I'll be glad to explore it a little more.
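If the numeric entities do arrive from the API as Windows-1252 positions, one possible cleanup is to rewrite them as the Unicode code points they are meant to represent. The sketch below is an assumption about what the data looks like, not something confirmed in this thread; normalize_cp1252_entities() is a made-up name and the map covers only the common "Word" punctuation.

```php
<?php
// Hypothetical helper: rewrites e.g. &#151; as &#8212; and leaves
// every other entity untouched.
function normalize_cp1252_entities(string $html): string
{
    // Windows-1252 byte positions mapped to Unicode code points.
    $map = [
        133 => 8230, // horizontal ellipsis
        145 => 8216, // left single quote
        146 => 8217, // right single quote
        147 => 8220, // left double quote
        148 => 8221, // right double quote
        150 => 8211, // en dash
        151 => 8212, // em dash
    ];

    return preg_replace_callback(
        '/&#(\d{3});/',
        function (array $m) use ($map): string {
            $code = (int) $m[1];
            // Only rewrite codes we know; return the original match otherwise.
            return isset($map[$code]) ? '&#' . $map[$code] . ';' : $m[0];
        },
        $html
    );
}
```

Because the output is still plain ASCII entities, it should display the same way whichever charset the page declares.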
 
LVL 1

Author Comment

by:rascal
ID: 40364044
Thanks Ray,
The API is Amazon's API for fetching books from the Amazon website, so the credentials I have are the client's live credentials which unfortunately I cannot share.

What's odd is that when I use different character encoding on my web page to display the results received from Amazon, curing one character breaks another. In other words, when I make a charset designation that cures the mdash, then other chars such as the MSWord double-quote no longer appear, or appear as garbage where they appeared normally before.

It seems like the data received from Amazon is a mix of Windows-1252, UTF-8, ISO-8859-1 and others, all in one data stream. (Possibly the person who posted the book on Amazon used MS Word and pulled their source from multiple locations or encodings?)

Looks like we might just have to pull down what we can from Amazon, then tell the client to review the content in the CMS we provide for them and edit where necessary to correct any formatting errors.
 
LVL 108

Accepted Solution

by:
Ray Paseur earned 500 total points
ID: 40364071
I think you're on firm ground in your understanding of the issue.  There are some programmatic ways of dealing with this, but all of them are a "mung" and once the data has been changed, it's difficult or impossible to change it back into anything that is faithful to the original.  Cut and paste from Word has always been a problem.  I think "review and correct" is about the best you can do.
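For illustration, one such programmatic approach might be sketched as follows. It assumes the mbstring extension is available and that anything which is not valid UTF-8 was written as Windows-1252; both are assumptions, and to_utf8() is a made-up helper name.

```php
<?php
// Hypothetical helper: returns UTF-8 text, guessing Windows-1252 for
// any input that is not already valid UTF-8.
function to_utf8(string $text): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8, leave untouched
    }

    // Not valid UTF-8: reinterpret every byte as Windows-1252.
    return mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
}
```

This is exactly the kind of mung described above: a string that mixes both encodings fails the check as a whole and gets converted wholesale, which damages the parts that were already valid UTF-8. It is safer on small pieces of text than on a whole description, and the result still needs a human review.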
 
LVL 1

Author Closing Comment

by:rascal
ID: 40364123
Thanks Ray!
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 40364166
Glad to help - best of luck with the project, ~Ray