Chris Jones
asked on
An elegant solution to changing values in an array using preg_replace.
Hi All,
Background:
I have written a script to grab the most common words in a page (with tags stripped, etc.). It mostly works; however, the following occasionally happens: helloThisIsAnExampleOfTheAnomaly.
This occurs while grabbing certain HTML via a cURL based function, stripping tags and counting word frequency. It mostly appears to occur in menus and widgets.
What I'm looking for is an elegant/efficient solution to pop/push/unset values in the array with the values split.
To expand:
preg_replace('/(?<! )(?<!^)[A-Z]/',' $0', $words)
I'm using the above regular expression to split the values wherever an uppercase letter occurs mid-string (that is, in the middle of an array element).
To summarise:
$array is currently something like this: ("This", "is", "okay", "this", "IsNotOkay")
What I want:
$array is going to look something like this ("This", "is", "okay", "this", "Is", "Not", "Okay")
Don't worry too much about the repeat values as I am utilising a "stop words" array to rid the ones I would not like to keep.
I've not got it working nicely yet so thought I'd turn to you for your expert input.
Thanks in advance.
Chris
@Julian: In the OP's question, "helloThisIsAnExampleOfTheAnomaly" was one of the examples. A regex that matches only Pascal case (every word capitalized) misses the word "hello". That's why I recommended the lookahead assertion: it handles both Pascal case and camel case. And I'm expecting the OP to eventually have a follow-on question about snake case, kebab-case, screaming snake case, etc.
String processing is harder than many people think. The devil is in the details, and with something like scraping other websites there are a lot of details, most of them outside your control.
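The member-only answers are not visible here, so the following is only a minimal sketch of the lookahead idea described above, not the accepted solution. The function name splitCamelWords() is hypothetical. It splits each array element at the position before every upper-case letter and flattens the pieces into one list, which covers both the Pascal case and camel case examples from the question:

```php
<?php
// Hypothetical helper (not from the thread): split each element before an
// upper-case letter and flatten the pieces into a single list.
function splitCamelWords(array $words): array
{
    $final = [];
    foreach ($words as $word) {
        // (?=[A-Z]) is a zero-width lookahead: it matches the position just
        // before a capital letter without consuming the letter itself.
        $parts = preg_split('/(?=[A-Z])/', $word, -1, PREG_SPLIT_NO_EMPTY);
        $final = array_merge($final, $parts);
    }
    return $final;
}

print_r(splitCamelWords(["This", "is", "okay", "this", "IsNotOkay"]));
print_r(splitCamelWords(["helloThisIsAnExample"]));
```

Note that this simple pattern also breaks acronyms apart ("USA" becomes "U", "S", "A"), which is exactly the kind of edge case being discussed here.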
@gr8gonzo: Well said. That's just the problem with string processing -- the rules are usually based on human language and are difficult to implement consistently in software, because usage and context are necessary to tell us whether McDonald should be considered one word or two. If it's in the list of one-word examples, you might walk the array looking for "Mc" at position N, and "Donald" at position N+1, merging the array elements when that juxtaposition occurs.
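As a rough illustration of the "Mc" at position N and "Donald" at position N+1 idea, here is a minimal sketch. The function name mergeKnownPrefixes() and the default prefix list are assumptions made for this example only:

```php
<?php
// Hypothetical helper: merge a known prefix with the element that follows it.
// The $prefixes default is an assumption for illustration, not a complete list.
function mergeKnownPrefixes(array $words, array $prefixes = ["Mc"]): array
{
    $words  = array_values($words); // make sure the keys are 0..n-1
    $count  = count($words);
    $merged = [];
    for ($i = 0; $i < $count; $i++) {
        $current = $words[$i];
        // Join the prefix at position N with the word at position N+1
        if (in_array($current, $prefixes, true) && $i + 1 < $count) {
            $current .= $words[$i + 1];
            $i++; // skip the element that was just merged
        }
        $merged[] = $current;
    }
    return $merged;
}

print_r(mergeKnownPrefixes(["Ronald", "Mc", "Donald", "was", "here"]));
```

A prefix that happens to be the last element is left alone, since there is nothing to merge it with.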
Capitalization has similar issues. I've usually found that you first write the code to do what you think you need, then look at the results, and then start writing the edge cases, such as the known upper-case and lower-case issues shown here.
<?php // demo/capitalize_names.php
/**
 * https://www.experts-exchange.com/questions/25080372/Format-First-Last-Name.html
 *
 * Apply a set of rules for capitalization of names
 */
error_reporting(E_ALL);
echo '<pre>';
$names =
[ "RONALD MCDONALD"
, "RONALD MCDONALD-o'brien"
, "My NaMe"
, "Van de Graaff GeneratoR"
]
;
// TEST EACH CASE
foreach ($names as $name)
{
    echo PHP_EOL . $name . ': ';
    echo fixname($name);
}
// FUNCTION TO HANDLE NAMES
function fixname($name)
{
    // SPECIAL CASES FOR UPPER OR LOWER CASE DISPOSITION
    $uc = [ " Mc", "'", "-" ];
    $lc = [ "Van De " ];
    // REMOVE UNNECESSARY BLANKS
    $name = preg_replace('/\s\s+/', ' ', $name);
    // START WITH LOWER CASE
    $name = strtolower($name);
    // SET EACH FIRST LETTER TO UPPER
    $name = ucwords($name);
    // CHECK FOR KNOWN SPECIAL UPPER-CASES
    foreach ($uc as $dlm)
    {
        // FIX THE Mcdonald EXAMPLE, ETC
        $namex = explode($dlm, $name);
        foreach ($namex as $k => $v)
        {
            $namex[$k] = ucwords($v);
        }
        $name = implode($dlm, $namex);
    }
    // CHECK FOR KNOWN SPECIAL LOWER-CASES
    foreach ($lc as $dlm)
    {
        // FIX THE van de Graaff EXAMPLE
        $name = str_replace($dlm, strtolower($dlm), $name);
    }
    // RETURN THE REPAIRED STRING
    return $name;
}
ASKER
Wow thanks guys, I wasn't expecting so many attempts, brilliant.
I am actually using DOM to process the HTML. What I have to assume for this project is that there will occasionally be some badly written HTML, and that is where I suspect this anomaly is occurring.
"And I'm expecting the OP to eventually have a follow-on question about snake case, kebab-case, screaming snake case, etc."
In this case I'm not actually too bothered about that; it's a fairly straightforward function. The worst I can see happening is "McAfee"-type words, which I will probably deal with using a little extra regexp.
I'll go try some of your suggestions now and let you know how it goes.
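One possible shape for that "little extra regexp" is a negative lookbehind that refuses to split directly after "Mc". This is only a sketch under the assumption that "Mc" is the sole prefix to protect; the function name splitKeepingMc() is hypothetical:

```php
<?php
// Hypothetical helper: split before upper-case letters, except when the
// capital is immediately preceded by "Mc" (assumed to be the only prefix
// worth protecting in this sketch).
function splitKeepingMc(string $word): array
{
    // (?<!Mc) is a negative lookbehind; (?=[A-Z]) is the lookahead used
    // for the plain camel-case split.
    return preg_split('/(?<!Mc)(?=[A-Z])/', $word, -1, PREG_SPLIT_NO_EMPTY);
}

print_r(splitKeepingMc("McAfee"));
print_r(splitKeepingMc("helloMcAfeeRocks"));
```

Other prefixes would need their own lookbehind, and any word that merely ends in "Mc" before a capital would be protected as well, so a list of known names may be more reliable than a pattern.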
Here's my recommendation using SimpleXML for parsing:
<?php
// Simulate the result from a cURL-fetched page
$htmlDoc = '<html><head><title>My Webpage</title></head><body><h1>Welcome to <span>my</span> O\'Reilly Webpage for <span>testing</span> 99 McDonalds</h1><img src="foobar.jpg" alt="alt tag">Choose one:<p><select><option value="optionA">Option A</option><option value="optionB">Option B</option></select></p></body></html>';
// Extract words and show the results
print_r(WordExtractor::Extract($htmlDoc));
// A class to nicely organize the extraction and sorting activity
class WordExtractor
{
    private static $dom;
    private static $words;
    public static function Extract($html)
    {
        // Setup
        $tmp = new DOMDocument();
        $tmp->loadHTML($html);
        self::$dom = simplexml_import_dom($tmp);
        self::$words = array();
        // Run
        self::_extractWords(self::$dom);
        // Process the results (pass false as a parameter if you don't want case-insensitive results)
        self::_countAndSort();
        // Cleanup
        $words = self::$words;
        self::$dom = null;
        self::$words = null;
        return $words;
    }
    /*
     * Recursive function that loops through the contents of a
     * SimpleXML element and uses regular expressions to extract
     * words and numbers.
     */
    private static function _extractWords($element, $depth = 0)
    {
        foreach($element as $tagName => $tag)
        {
            // If you want to see the structure
            echo str_repeat(" ", $depth) . $tagName . ": " . $tag . "\n";
            // Extract the words
            if(preg_match_all("/\b((?:O')?[A-Za-z0-9]+)\b/",$tag,$matches))
            {
                self::$words[] = $matches[1];
            }
            // Recurse into child tags
            self::_extractWords($tag, $depth + 1);
        }
    }
    /*
     * Function to process the raw results of the words array into a
     * single array of keywords, in descending order of frequency.
     */
    private static function _countAndSort($forceLower = true)
    {
        // Combine all the different arrays of words and count up usage
        $tmp = array();
        foreach(self::$words as $arrWords)
        {
            foreach($arrWords as $word)
            {
                // Use lower-casing to count "Title" and "title" and "TITLE" as 3 instances of "title"
                if($forceLower)
                {
                    $word = strtolower($word);
                }
                if(!isset($tmp[$word]))
                {
                    // Initial instance of the word
                    $tmp[$word] = 1;
                }
                else
                {
                    // Subsequent instances of the word
                    $tmp[$word]++;
                }
            }
        }
        // Sort by descending frequency and return
        arsort($tmp);
        self::$words = $tmp;
    }
}
Results of the above should look like:
head:
 title: My Webpage
body: Choose one:
 h1: Welcome to O'Reilly Webpage for 99 McDonalds
  span: my
  span: testing
 img:
 p:
  select:
   option: Option A
   option: Option B
Array
(
    [my] => 2
    [webpage] => 2
    [option] => 2
    [choose] => 1
    [one] => 1
    [welcome] => 1
    [to] => 1
    [o'reilly] => 1
    [for] => 1
    [99] => 1
    [mcdonalds] => 1
    [testing] => 1
    [a] => 1
    [b] => 1
)
A couple notes on my example:
1. The first section of the output is simply there for easier visualization of how the code is traversing the DOM. You can exclude that portion by removing or commenting out the echo statement at the top of the foreach loop in _extractWords() (line 45 in the original listing).
2. You can also use standard SimpleXML calls to look inside tag attributes for content if you want. I was trying to keep the sample fairly simple, but you could optionally adjust it to look at the contents of alt attributes, for example.
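To illustrate note 2, here is a minimal sketch of pulling words out of alt attributes with SimpleXML. The sample markup and the variable names are assumptions made for this example, and it is not wired into the WordExtractor class above:

```php
<?php
// Simulated markup; the alt values are assumptions for illustration only.
$html = '<html><body><h1>Title</h1><img src="foobar.jpg" alt="alt tag">'
      . '<p>Text <img src="x.png" alt="Second Image"></p></body></html>';

libxml_use_internal_errors(true); // keep warnings about sloppy HTML quiet
$doc = new DOMDocument();
$doc->loadHTML($html);
$xml = simplexml_import_dom($doc);

$altWords = [];
// The XPath query returns every <img> element that carries an alt attribute
foreach ($xml->xpath('//img[@alt]') as $img) {
    // Attributes are read with array syntax and cast to string
    $alt = (string) $img['alt'];
    if (preg_match_all('/\b[A-Za-z0-9]+\b/', $alt, $matches)) {
        $altWords = array_merge($altWords, $matches[0]);
    }
}
print_r($altWords);
```

The resulting words could then be fed into the same counting step as the element text.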
ASKER
Just another thank you to everyone for all your time, much appreciated.
This is what I had originally come up with (more or less), plus help from everyone's posts here:
$regex = '/(?=[A-Z])/';
$final = [];
foreach ($wordCount as $key => $word) {
    // Split before each upper-case letter; -1 means no limit on the number of parts
    $parts = preg_split($regex, $word, -1, PREG_SPLIT_NO_EMPTY);
    $final = array_merge($final, $parts);
    if (preg_match($regex, $key)) {
        unset($wordCount[$key]);
    }
}
Before I mark the thread as solved, does anyone have any issue with the above code? Also, why should I use array_merge over array_push in this example (if it's not too cheeky to slip a second question in)?
I will definitely be using this thread again as reference as I feel some of the other code in it is going to be really beneficial.
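On the array_merge versus array_push question, a short sketch of the difference may help. The variable names are illustrative only: array_push() appends its argument as a single element, so pushing an array of parts nests it, while array_merge() appends each part individually.

```php
<?php
$parts = ["Is", "Not", "Okay"];

// array_push() adds $parts as ONE element, producing a nested array
$pushed = ["this"];
array_push($pushed, $parts);

// array_merge() appends each value of $parts to the list
$merged = array_merge(["this"], $parts);

// Unpacking the parts makes array_push() append them one by one as well
$spread = ["this"];
array_push($spread, ...$parts);

var_dump($pushed === ["this", ["Is", "Not", "Okay"]]);
var_dump($merged === ["this", "Is", "Not", "Okay"]);
var_dump($spread === $merged);
```

For a flat list of words, array_merge() (or array_push() with unpacking) is the form that matches the desired output in the question.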
ASKER
Scrap that last bit, I see I was being a bit of an idiot ;) I think I'm nearly there now.
ASKER
Thanks all, I've tried to distribute the points accordingly based on the final decision.
I used almost exactly Ray's second post, plus a bit of my own string handling using things like the ctype_ functions, and some bits and pieces from Julian's and gr8gonzo's posts. (DOM is definitely the direction I've been taking recently; this was a bit of a weird occurrence where I was trying to catch certain things, mainly in CSS menus and widgets, that my DOM helper methods didn't seem to catch so nicely.)
All very good posts!
See: https://iconoun.com/demo/temp_chris_jones.php