Solved

Find unicode character in webpage?

Posted on 2014-03-25
Last Modified: 2014-03-25
I am trying to search for the Unicode character &#8652 (double arrow) in a web page.

The following code validates the character and searches using PHP preg_match.
It does not find the required character. How can I fix this?

echo mb_convert_encoding('&#8652', 'UTF-8', 'HTML-ENTITIES');

$var=file_get_contents("http://mywebsite.com")    ;
$var1=utf8_encode($var)  ;

$result = preg_match($arrow, $var1, $matches)     ;

Question by:code4
8 Comments
 
LVL 34

Expert Comment

by:Dan Craciun
ID: 39954408
Try
$result = preg_match("\x{21CC}", $var1, $matches);


HTH,
Dan
0
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 39954426
To understand what is happening here, please read this article:
http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/A_11880-Unicode-PHP-and-Character-Collisions.html

You may also want to learn about this function:
http://us1.php.net/manual/en/function.mb-ereg-match.php

If you want to give us a small sample of the data, I can show you how to find and fix the issues.  But I need the actual test data, not a description of the data.  Thanks, ~Ray
0
 

Author Comment

by:code4
ID: 39954453
Thanks.
The code produces the following error on my system:

Warning: preg_match() [<a href='function.preg-match'>function.preg-match</a>]: Delimiter must not be alphanumeric or backslash

Here is data for testing:
$arrow= mb_convert_encoding('&#8652', 'UTF-8', 'HTML-ENTITIES');

$var=file_get_contents("http://web.centre.edu/shiba/Chemistry%20Symbols%20in%20Word1.htm");
$var1=utf8_encode($var)  ;

$result = preg_match($arrow, $var1, $matches)     ;


0
 
LVL 34

Expert Comment

by:Dan Craciun
ID: 39954465
You (and me too) forgot the delimiters:

$result = preg_match('/&#8652;/', $var1, $matches);
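
For anyone comparing the two approaches in this thread, here is a minimal sketch, not a definitive fix. It assumes a hypothetical sample string in place of the real page. The first search looks for the entity text as it appears in the HTML source; the second decodes entities and looks for the actual code point U+21CC, which needs the /u modifier.

```php
<?php
// Hypothetical sample input; the real page content is not reproduced here.
$html = '<p>A + B &#8652; C + D</p>';

// 1) Search for the literal entity text. The slashes are the PCRE delimiters
//    that the warning above was complaining about. The ';' is optional here
//    because the question writes the entity without it.
$foundEntity = preg_match('/&#8652;?/', $html, $m1);

// 2) Decode entities to UTF-8, then search for the code point U+21CC.
//    The /u modifier makes PCRE treat pattern and subject as UTF-8.
$decoded   = html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');
$foundChar = preg_match('/\x{21CC}/u', $decoded, $m2);

var_dump($foundEntity, $foundChar);
```

Note the single quotes around the second pattern, so that PHP passes \x{21CC} through to PCRE unchanged.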


0

 
LVL 108

Expert Comment

by:Ray Paseur
ID: 39954473
0
 
LVL 34

Expert Comment

by:Dan Craciun
ID: 39954480
I think so, Ray. It's a Word document saved as HTML (ugh), and the character sequence the OP is looking for is in plain text, so there is no need for any encoding.
0
 
LVL 108

Accepted Solution

by:
Ray Paseur earned 500 total points
ID: 39954500
See if this makes sense.  Look at the bottom of the page to see the location of the string in the rendered document.
http://www.iconoun.com/demo/temp_code4.php

<?php // demo/temp_code4.php
error_reporting(E_ALL);


// SEE http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/Q_28397143.html
// REF http://php.net/manual/en/function.utf8-encode.php
// REF http://www.asciitable.com/
// REF http://en.wikipedia.org/wiki/UTF-8


// THE TEST DATA MAY CONTAIN ISO CHARACTERS THAT NEED TO BE CONVERTED TO UTF-8 CHARACTERS
$url = "http://web.centre.edu/shiba/Chemistry%20Symbols%20in%20Word1.htm";
$htm = file_get_contents($url);

// FIRST CHARSET PREVAILS
echo '<meta charset="utf8" />';        // GARBLES NON-UTF-8

// CONVERT THE DATA SET AND DISPLAY THE PAGE
$new = utf8_encode($htm);
echo $new;

// LOCATE A CHARACTER STRING
$sig = '&#8652';
$pos = strpos($new, $sig);
echo PHP_EOL . htmlentities($sig) . "  LOCATED AT $pos";
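
A possible variation on the script above, offered as a sketch under assumptions rather than as the accepted method: strpos() returns false when nothing is found, so the result is checked with !== false. The function name find_symbol and the sample string are hypothetical. mb_convert_encoding() from ISO-8859-1 is used in place of utf8_encode(), which is deprecated as of PHP 8.2; this assumes the mbstring extension is loaded and that the page really is ISO-8859-1.

```php
<?php
// Hypothetical helper: returns the byte offset of the first needle that
// matches (entity text is tried first, then the raw character), or null.
function find_symbol(string $html, string $sourceEncoding = 'ISO-8859-1'): ?int
{
    // Convert the input to UTF-8; assumes $sourceEncoding is correct.
    $utf8 = mb_convert_encoding($html, 'UTF-8', $sourceEncoding);

    // Try the entity as written in the source, then U+21CC itself.
    foreach (['&#8652', "\u{21CC}"] as $needle) {
        $pos = strpos($utf8, $needle);
        if ($pos !== false) {   // offset 0 is a valid hit, so compare strictly
            return $pos;
        }
    }
    return null;
}

// Hypothetical sample data standing in for the downloaded page.
$pos = find_symbol('abc &#8652; def');
echo $pos === null ? 'NOT FOUND' : "LOCATED AT $pos";
```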


0
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 39954529
There may be a little more "odd" here than just a Word-driven HTML page.  My recommendation to the college would be to get an agency that is familiar with web development to help build a new web site!
http://validator.w3.org/check?uri=http%3A%2F%2Fwww.centre.edu%2F&charset=%28detect+automatically%29&doctype=Inline&group=0
0
