Solved

Find unicode character in webpage?

Posted on 2014-03-25
Last Modified: 2014-03-25
I am trying to search for the Unicode character &#8652 (double arrow) in a web page.

The following code validates the character and searches using PHP preg_match.
It does not find the required character. How can I fix this?

echo mb_convert_encoding('&#8652', 'UTF-8', 'HTML-ENTITIES');

$var=file_get_contents("http://mywebsite.com")    ;
$var1=utf8_encode($var)  ;

$result = preg_match($arrow, $var1, $matches)     ;

Question by:code4
8 Comments
 
LVL 35

Expert Comment

by:Dan Craciun
ID: 39954408
Try
$result = preg_match("\x{21CC}", $var1, $matches);


HTH,
Dan
 
LVL 110

Expert Comment

by:Ray Paseur
ID: 39954426
To understand what is happening here, please read this article:
http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/A_11880-Unicode-PHP-and-Character-Collisions.html

You may also want to learn about this function:
http://us1.php.net/manual/en/function.mb-ereg-match.php

If you want to give us a small sample of the data, I can show you how to find and fix the issues.  But I need the actual test data, not a description of the data.  Thanks, ~Ray
 

Author Comment

by:code4
ID: 39954453
Thanks.
The code produces the following error on my system:

Warning: preg_match() [<a href='function.preg-match'>function.preg-match</a>]: Delimiter must not be alphanumeric or backslash

Here is data for testing:
$arrow= mb_convert_encoding('&#8652', 'UTF-8', 'HTML-ENTITIES');

$var=file_get_contents("http://web.centre.edu/shiba/Chemistry%20Symbols%20in%20Word1.htm");
$var1=utf8_encode($var)  ;

$result = preg_match($arrow, $var1, $matches)     ;

LVL 35

Expert Comment

by:Dan Craciun
ID: 39954465
You (and me too) forgot the delimiters:

$result = preg_match('/&#8652;/', $var1, $matches);
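
For comparison, here is a minimal sketch of the two ways the search could be written: matching the literal entity text as it appears in the page source, and matching the decoded character itself with the `u` modifier. The sample string `$html` is made up for illustration and is not taken from the actual page; the `"\u{21CC}"` escape assumes PHP 7 or later.

```php
<?php
// Hypothetical sample: the entity text and the real UTF-8 character side by side.
$html = "A &#8652; B and A \u{21CC} B";

// 1) Literal entity text. The slashes are the delimiters preg_match() requires;
//    ";?" makes the trailing semicolon optional.
$foundEntity = preg_match('/&#8652;?/', $html, $m1);

// 2) The character itself (U+21CC). The /u modifier tells PCRE to treat the
//    pattern and the subject as UTF-8, so \x{21CC} is read as a code point.
$foundChar = preg_match('/\x{21CC}/u', $html, $m2);

var_dump($foundEntity, $foundChar);
```

Which form applies depends on what the page source actually contains: the entity text or the encoded character. The warning quoted earlier comes from a pattern that starts with a backslash, which PHP does not accept as a delimiter.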

 
LVL 110

Expert Comment

by:Ray Paseur
ID: 39954473
 
LVL 35

Expert Comment

by:Dan Craciun
ID: 39954480
I think so, Ray. It's a Word document saved as HTML (ugh), and the character sequence the OP is looking for is in plain text, so there is no need for any encoding.
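
A small sketch of why the encoding step should not matter for this particular search: the entity text is plain ASCII, and ASCII bytes are unchanged by an ISO-8859-1 to UTF-8 conversion. The input string is a made-up example, and `mb_convert_encoding()` is used here instead of `utf8_encode()` (an assumption that the mbstring extension is available).

```php
<?php
// Hypothetical ISO-8859-1 input: "\xB0" is the degree sign in Latin-1,
// followed by the entity text, which is plain ASCII.
$latin1 = "25\xB0C &#8652; equilibrium";

// For ISO-8859-1 input this does the same job as utf8_encode().
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

// The needle is pure ASCII, so it is found both before and after conversion.
// Only the byte offset shifts, because the degree sign grows to two bytes.
$before = strpos($latin1, '&#8652');
$after  = strpos($utf8, '&#8652');

var_dump($before, $after);
```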
 
LVL 110

Accepted Solution

by:
Ray Paseur earned 500 total points
ID: 39954500
See if this makes sense.  Look at the bottom of the page to see the location of the string in the rendered document.
http://www.iconoun.com/demo/temp_code4.php

<?php // demo/temp_code4.php
error_reporting(E_ALL);


// SEE http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/Q_28397143.html
// REF http://php.net/manual/en/function.utf8-encode.php
// REF http://www.asciitable.com/
// REF http://en.wikipedia.org/wiki/UTF-8


// THE TEST DATA MAY CONTAIN ISO CHARACTERS THAT NEED TO BE CONVERTED TO UTF-8 CHARACTERS
$url = "http://web.centre.edu/shiba/Chemistry%20Symbols%20in%20Word1.htm";
$htm = file_get_contents($url);

// FIRST CHARSET PREVAILS
echo '<meta charset="utf8" />';        // GARBLES NON-UTF-8

// CONVERT THE DATA SET AND DISPLAY THE PAGE
$new = utf8_encode($htm);
echo $new;

// LOCATE A CHARACTER STRING
$sig = '&#8652';
$pos = strpos($new, $sig);
echo PHP_EOL . htmlentities($sig) . "  LOCATED AT $pos";
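
One caveat with the snippet above: `strpos()` returns `false` when nothing is found, and position 0 is a valid hit, so the result is best checked with a strict comparison before it is printed. A minimal sketch, where `locate()` is a hypothetical helper and the input strings are made up:

```php
<?php
// locate() is a hypothetical helper, not part of the original demo script.
function locate(string $haystack, string $needle): string
{
    $pos = strpos($haystack, $needle);
    // Strict comparison: 0 is a valid position, false means "not found".
    if ($pos === false) {
        return htmlentities($needle) . ' NOT FOUND';
    }
    return htmlentities($needle) . " LOCATED AT $pos";
}

echo locate('&#8652; at the start', '&#8652'), PHP_EOL;
echo locate('no arrow here', '&#8652'), PHP_EOL;
```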

 
LVL 110

Expert Comment

by:Ray Paseur
ID: 39954529
There may be a little more "odd" here than just a Word-driven HTML page.  My recommendation to the college would be to get an agency that is familiar with web development to help build a new web site!
http://validator.w3.org/check?uri=http%3A%2F%2Fwww.centre.edu%2F&charset=%28detect+automatically%29&doctype=Inline&group=0

Question has a verified solution.