An elegant solution to changing values in an array using preg_replace.

Hi All,

Background:
I have written a script to grab the most common words in a page (with tags stripped, etc.). It mostly works; however, words occasionally come out run together, like this: helloThisIsAnExampleOfTheAnomoly.

This occurs while grabbing certain HTML via a cURL-based function, stripping tags and counting word frequency. It mostly appears to occur in menus and widgets.
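One plausible cause, and this is only an assumption since the fetched markup isn't shown here, is that strip_tags() removes tags without leaving any whitespace behind, so text from adjacent elements such as menu items gets glued together. A minimal sketch with made-up markup, padding each tag with a space before stripping:

```php
<?php
// Hypothetical menu markup; the real pages in question are not shown in this thread.
$html = '<ul><li>hello</li><li>This</li><li>Is</li></ul>';

// strip_tags() removes the tags and leaves nothing in their place,
// so the three list items end up as one run-together string.
$glued = strip_tags($html);

// Padding every "<" with a space first keeps neighbouring text apart.
$spaced = strip_tags(str_replace('<', ' <', $html));

// Split on runs of whitespace and discard the empty pieces.
$words = preg_split('/\s+/', $spaced, -1, PREG_SPLIT_NO_EMPTY);

echo $glued . PHP_EOL;
print_r($words);
```

This only helps if the words really are separated by tags in the source; it would also pad any literal "<" in the text, so treat it as a starting point rather than a fix.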

What I'm looking for is an elegant/efficient solution to pop/push/unset values in the array with the values split.

To expand:

preg_replace('/(?<! )(?<!^)[A-Z]/',' $0', $words)

I'm using the above regular expression to essentially split the values wherever an uppercase letter occurs in the middle of a string/array element.

To summarise:
$array is currently something like this: ("This", "is", "okay", "this", "IsNotOkay")
What I want:
$array is going to look something like this ("This", "is", "okay", "this", "Is", "Not", "Okay")
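A minimal sketch of one way to get from the first array to the second, reusing the preg_replace pattern from above. The function name splitRunTogetherWords is made up for illustration; it is not from any library:

```php
<?php
// Hypothetical helper: returns a flat array with run-together words split apart.
function splitRunTogetherWords(array $words): array
{
    $result = [];
    foreach ($words as $word) {
        // Put a space before any capital that is neither at the start
        // of the string nor already preceded by a space.
        $spaced = preg_replace('/(?<! )(?<!^)[A-Z]/', ' $0', $word);

        // Break on the spaces and append each non-empty piece.
        foreach (explode(' ', $spaced) as $part) {
            if ($part !== '') {
                $result[] = $part;
            }
        }
    }
    return $result;
}

$array = ["This", "is", "okay", "this", "IsNotOkay"];
print_r(splitRunTogetherWords($array));
```

Building a new array avoids having to unset and re-insert elements in the original while looping over it.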

Don't worry too much about the repeat values as I am utilising a "stop words" array to rid the ones I would not like to keep.
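For context, the stop-word filtering and counting step might look roughly like this. The variable names and the contents of $stopWords are placeholders, not the real list:

```php
<?php
// Placeholder data: $words stands in for the split word list,
// $stopWords for the real stop-word array.
$words     = ['This', 'is', 'okay', 'this', 'Is', 'Not', 'Okay'];
$stopWords = ['is', 'not'];

// Lower-case everything so "This" and "this" count as the same word.
$lower = array_map('strtolower', $words);

// Drop the stop words, then count what is left.
$kept   = array_diff($lower, $stopWords);
$counts = array_count_values($kept);

// Most frequent first.
arsort($counts);
print_r($counts);
```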

I've not got it working nicely yet so thought I'd turn to you for your expert input.

Thanks in advance.
Chris

Ray Paseur

I think you have two parts to the application. The concordance or word frequency part is pretty straightforward, and it sounds like you've got that. The second part can be gotten with a "lookahead assertion" in the regular expression.
See: https://iconoun.com/demo/temp_chris_jones.php

<?php // demo/temp_chris_jones.php
/**
 * https://www.experts-exchange.com/questions/29038760/An-elegant-solution-to-changing-values-in-an-array-using-preg-replace.html
 */
error_reporting(E_ALL);

// TEST DATA FROM THE POST AT E-E
$alpha = array("This", "is", "okay", "this", "IsNotOkay");
$omega = array("This", "is", "okay", "this", "Is", "Not", "Okay");

// LOOKAHEAD ASSERTION TO ANY UPPERCASE LETTER
$regex = '/(?=[A-Z])/';

$final = [];
foreach ($alpha as $key => $word)
{
    $parts = preg_split($regex, $word);
    if (empty($parts[0])) unset($parts[0]); // EMPTY ELEMENT IF THE FIRST LETTER IS CAPITALIZED
    $final = array_merge($final, $parts);
}

if ($final == $omega) echo PHP_EOL . "Success!";

Outputs:

Success!

ASKER CERTIFIED SOLUTION

Ray Paseur

This solution is only available to members.

SOLUTION

Julian Hansen

This solution is only available to members.

Ray Paseur

@Julian: In the OP's question, "helloThisIsAnExampleOfTheAnomoly" was one of the examples. A regex that matches only things in Pascal case (all words capitalized) misses the word "hello." That's why I recommended the lookahead assertion: it handles both Pascal case and camel case. And I'm expecting the OP to eventually have a follow-on question about snake case, kebab-case, screaming snake case, etc.

String processing is harder than many people think. The devil is in the details, and with something like scraping other web sites, there are a lot of details (most of them uncontrollable).
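To illustrate the Pascal case versus camel case point, here is a small comparison. The capitalized-words-only pattern below is a stand-in chosen for the example, not the pattern from the earlier post:

```php
<?php
$sample = 'helloThisIsAnExampleOfTheAnomoly';

// Stand-in pattern that only picks up capitalized words.
preg_match_all('/[A-Z][a-z]*/', $sample, $matches);
$capitalizedOnly = $matches[0];

// Lookahead split: break before every uppercase letter, keep everything.
$lookahead = preg_split('/(?=[A-Z])/', $sample, -1, PREG_SPLIT_NO_EMPTY);

echo implode(' ', $capitalizedOnly) . PHP_EOL; // no "hello"
echo implode(' ', $lookahead) . PHP_EOL;       // starts with "hello"
```

PREG_SPLIT_NO_EMPTY takes care of the empty first element that appears when the string starts with a capital.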

SOLUTION

gr8gonzo

This solution is only available to members.

Ray Paseur

@gr8gonzo: Well said. That's just the problem with string processing -- the rules are usually based on human language and are difficult to implement consistently in software, because usage and context are necessary to tell us whether McDonald should be considered one word or two. If it's in the list of one-word examples, you might walk the array looking for "Mc" at position N, and "Donald" at position N+1, merging the array elements when that juxtaposition occurs.
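A rough sketch of that merge step, assuming a hand-maintained list of one-word names. The list contents and the function name mergeKnownWords are made up for the example:

```php
<?php
// Hypothetical list of names that should be treated as a single word.
$oneWord = ['McDonald', 'McAfee'];

// Walk the array; when elements N and N+1 join into a known name, merge them.
function mergeKnownWords(array $words, array $oneWord): array
{
    $out   = [];
    $count = count($words);
    for ($i = 0; $i < $count; $i++) {
        $joined = ($i + 1 < $count) ? $words[$i] . $words[$i + 1] : null;
        if ($joined !== null && in_array($joined, $oneWord, true)) {
            $out[] = $joined;
            $i++; // skip the element that was just merged
        } else {
            $out[] = $words[$i];
        }
    }
    return $out;
}

print_r(mergeKnownWords(['Ronald', 'Mc', 'Donald', 'Is', 'Here'], $oneWord));
```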

Capitalization has similar issues. I've usually found that you first write the code to do what you think you need, then look at the results, and then start writing the edge cases, such as the known upper-case and lower-case issues shown here.

<?php // demo/capitalize_names.php
/**
 * https://www.experts-exchange.com/questions/25080372/Format-First-Last-Name.html
 *
 * Apply a set of rules for capitalization of names
 */
error_reporting(E_ALL);
echo '<pre>';


$names =
[ "RONALD    MCDONALD"
, "RONALD    MCDONALD-o'brien"
, "My NaMe"
, "Van de Graaff GeneratoR"
]
;

// TEST EACH CASE
foreach ($names as $name)
{
    echo PHP_EOL . $name . ': ';
    echo fixname($name);
}


// FUNCTION TO HANDLE NAMES
function fixname($name)
{
    // SPECIAL CASES FOR UPPER OR LOWER CASE DISPOSITION
    $uc = [ " Mc", "'", "-" ];
    $lc = [ "Van De " ];

    // REMOVE UNNECESSARY BLANKS
    $name = preg_replace('/\s\s+/', ' ', $name);

    // START WITH LOWER CASE
    $name = strtolower($name);

    // SET EACH FIRST LETTER TO UPPER
    $name = ucwords($name);

    // CHECK FOR KNOWN SPECIAL UPPER-CASES
    foreach ($uc as $dlm)
    {
        // FIX THE Mcdonald EXAMPLE, ETC
        $namex = explode($dlm, $name);
        foreach ($namex as $k => $v)
        {
            $namex[$k] = ucwords($v);
        }
        $name = implode($dlm, $namex);
    }

    // CHECK FOR KNOWN SPECIAL LOWER-CASES
    foreach ($lc as $dlm)
    {
        // FIX THE van de Graaff EXAMPLE
        $name = str_replace($dlm, strtolower($dlm), $name);
    }

    // RETURN THE REPAIRED STRING
    return $name;
}

Chris Jones

ASKER

Wow thanks guys, I wasn't expecting so many attempts, brilliant.

I am actually using DOM to process the HTML. What I have to assume for this project is that there will occasionally be some badly written HTML, and that is where I suspect this anomaly is occurring.

"And I'm expecting the OP to eventually have a follow-on question about snake case, kebab-case, screaming snake case, etc."

In this case I'm not actually too bothered about that; it's a fairly straightforward function. The worst I can see happening is "McAfee" type words, which I will probably deal with using a little extra regexp.
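Something along these lines is what I have in mind for the "McAfee" case: a negative lookbehind added to the lookahead so there is no split straight after "Mc". This is only a sketch and assumes "Mc" is the only prefix that needs protecting:

```php
<?php
// Split before an uppercase letter, unless it directly follows "Mc".
$regex = '/(?<!Mc)(?=[A-Z])/';

$samples = ['McAfee', 'helloMcAfeeRocks', 'IsNotOkay'];
$results = [];
foreach ($samples as $sample) {
    $results[$sample] = preg_split($regex, $sample, -1, PREG_SPLIT_NO_EMPTY);
    echo $sample . ' => ' . implode(' | ', $results[$sample]) . PHP_EOL;
}
```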

I'll go try some of your suggestions now and let you know how it goes.

gr8gonzo

Here's my recommendation using SimpleXML for parsing:

<?php
// Simulate the result from a cURL-fetched page
$htmlDoc = '<html><head><title>My Webpage</title></head><body><h1>Welcome to <span>my</span> O\'Reilly Webpage for <span>testing</span> 99 McDonalds</h1><img src="foobar.jpg" alt="alt tag">Choose one:<p><select><option value="optionA">Option A</option><option value="optionB">Option B</option></select></p></body></html>';

// Extract words and show the results
print_r(WordExtractor::Extract($htmlDoc));

// A class to nicely organize the extraction and sorting activity
class WordExtractor
{
  private static $dom;
  private static $words;
  
  public static function Extract($html)
  {
    // Setup
    $tmp = new DOMDocument();
    $tmp->loadHTML($html);
    self::$dom = simplexml_import_dom($tmp);
    self::$words = array();
    
    // Run
    self::_extractWords(self::$dom);
    
    // Process the results (pass false as a parameter if you don't want case-insensitive results)
    self::_countAndSort();
    
    // Cleanup
    $words = self::$words;
    self::$dom = null;
    self::$words = null;
    return $words;
  }
  
  /*
   * Recursive function that loops through the contents of a
   * SimpleXML element and uses regular expressions to extract
   * words and numbers.
   */
  private static function _extractWords($element, $depth = 0)
  {
    foreach($element as $tagName => $tag)
    {
      // If you want to see the structure
      echo str_repeat("  ", $depth) . $tagName . ": " . $tag . "\n";
      
      // Extract the words
      if(preg_match_all("/\b((?:O')?[A-Za-z0-9]+)\b/",$tag,$matches))
      {
        self::$words[] = $matches[1];
      }
      
      // Recurse into child tags
      self::_extractWords($tag, $depth + 1);
    }
  }
  
  /*
   * Function to process the raw results of the words array into a
   * single array of keywords, in descending order of frequency.
   */
  private static function _countAndSort($forceLower = true)
  {
    // Combine all the different arrays of words and count up usage
    $tmp = array();
    foreach(self::$words as $arrWords)
    {
      foreach($arrWords as $word)
      {
        // Use lower-casing to count "Title" and "title" and "TITLE" as 3 instances of "title"
        if($forceLower)
        {
          $word = strtolower($word);
        }
        
        if(!isset($tmp[$word]))
        {
          // Initial instance of the word
          $tmp[$word] = 1;
        }
        else
        {
          // Subsequent instances of the word
          $tmp[$word]++;
        }
      }
    }
    
    // Sort by descending frequency and return
    arsort($tmp);    
    self::$words = $tmp;
  }
}

Results of the above should look like:

head:
  title: My Webpage
body: Choose one:
  h1: Welcome to  O'Reilly Webpage for  99 McDonalds
    span: my
    span: testing
  img:
  p:
    select:
      option: Option A
      option: Option B

Array
(
    [my] => 2
    [webpage] => 2
    [option] => 2
    [choose] => 1
    [one] => 1
    [welcome] => 1
    [to] => 1
    [o'reilly] => 1
    [for] => 1
    [99] => 1
    [mcdonalds] => 1
    [testing] => 1
    [a] => 1
    [b] => 1
)

gr8gonzo

A couple notes on my example:

1. The first section of the output is simply there for easier visualization of how the code is traversing the DOM. You can exclude that portion by removing or commenting out the echo statement at the top of the foreach loop in _extractWords() (line 45 of the listing).

2. You can use standard SimpleXML calls to also look inside tag attributes for content if you want. I was trying to keep the sample fairly simple, but you could optionally adjust this to look at the contents of alt attributes, for example.
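As a rough illustration of note 2, here is a separate, stripped-down sketch that collects alt attribute text using the same recursive traversal. The function name collectAltText and the sample markup are made up for the example:

```php
<?php
// Sample markup for illustration only.
$html = '<html><body><img src="foobar.jpg" alt="alt tag"><p>Text</p></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$root = simplexml_import_dom($doc);

// Recursively visit child elements and record any alt attribute found.
function collectAltText(SimpleXMLElement $element, array &$found): void
{
    foreach ($element as $tag) {
        // Attributes are available with array-style access in SimpleXML.
        if (isset($tag['alt'])) {
            $found[] = (string) $tag['alt'];
        }
        collectAltText($tag, $found);
    }
}

$found = [];
collectAltText($root, $found);
print_r($found);
```

The collected strings could then be run through the same word-matching regex as the tag text.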

Chris Jones

ASKER

Just another thank you to everyone for all your time, much appreciated.

This is what I had originally come up with (more or less) plus help from everyone's posts here:

$regex = '/(?=[A-Z])/';
$final = [];
foreach ($wordCount as $key => $word) {
    $parts = preg_split($regex, $word, NULL, PREG_SPLIT_NO_EMPTY);
    $final = array_merge($final, $parts);
    if (preg_match($regex, $key))
        unset($wordCount[$key]);
}

Before I mark the thread as solved, does anyone have any issue with the above code? Also, why should I use array_merge over array_push in this example (if it's not too cheeky to slip a second question in)?

I will definitely be using this thread again as reference as I feel some of the other code in it is going to be really beneficial.
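On the array_merge versus array_push question, a quick comparison with placeholder values shows the difference in what ends up in the array:

```php
<?php
$parts = ['Is', 'Not', 'Okay'];

// array_merge appends each element of $parts individually.
$merged = array_merge(['this'], $parts);

// array_push with an array argument appends the whole array as ONE nested element.
$pushed = ['this'];
array_push($pushed, $parts);

// Unpacking the argument makes array_push behave like the merge.
$spread = ['this'];
array_push($spread, ...$parts);

print_r($merged);
print_r($pushed);
print_r($spread);
```

So array_push only gives the flat result here if the parts are unpacked with the ... operator.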

Chris Jones

ASKER

Scrap that last bit, I see I was being a bit of an idiot ;) I think I'm nearly there now.

Chris Jones

ASKER

Thanks all, I've tried to distribute the points accordingly based on the final decision.

I used almost exactly Ray's second post, plus a bit of my own string handling using things like the ctype_ functions, and some bits and pieces from Julian's and gr8gonzo's posts. DOM is definitely the direction I've been taking recently; this was a bit of a weird occurrence, mainly with CSS menus and widgets, that my DOM helper methods didn't seem to catch so nicely.

All very good posts!