• Status: Solved

How to create an array of uniques from two arrays, that also addresses transposed strings


I have two arrays with the following values:

Array 1:
Blue ball
Red ball
Small green ball
Big orange ball

Array 2:
Blue ball
Red ball
Small green ball
Ball red
Ball blue
Big orange ball
Blue ball
Purple ball
Pink Ball

I need to output an Array3 that includes only uniques (Purple ball, Pink Ball). This excludes dupes (e.g. blue ball) AND transposed dupes (e.g. ball blue).

many thanks,

GJ
Asked by: ggjones

Aaron Tomosky (Technology Consultant) Commented:
array_merge will combine, array_unique will remove duplicates. However, this won't deal with the transposed duplicates. To do this I think you could then explode each value on spaces, then I dunno. But that gets you pretty close. Maybe I'll think of the rest.
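
For reference, a minimal sketch of the suggestion above, using the sample data from the question. The variable names are illustrative only. It shows that exact duplicates collapse while the transposed ones survive.

```php
<?php
// Sample data from the question.
$array1 = array('Blue ball', 'Red ball', 'Small green ball', 'Big orange ball');
$array2 = array('Blue ball', 'Red ball', 'Small green ball', 'Ball red', 'Ball blue',
                'Big orange ball', 'Blue ball', 'Purple ball', 'Pink Ball');

// array_merge() appends $array2 to $array1; array_unique() drops exact repeats;
// array_values() simply renumbers the keys.
$merged = array_values(array_unique(array_merge($array1, $array2)));

print_r($merged);
// 'Ball red' and 'Ball blue' are still present: array_unique() compares whole
// strings, so a transposed duplicate looks like a new value.
```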

ggjones (Author) Commented:

thanks for replying Aaron.

The "transposed issues" are addressed here:

http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/Q_27398480.html

The challenge, essentially, is to modify this solution to include 2 arrays. array_merge makes sense... but I think something is missing....

GJ

Ray Paseur Commented:
Are you interested in case-sensitivity?

ggjones (Author) Commented:
Hi Ray... no, in most cases I'm insensitive, heh, heh.

Case issues are handled down-stream. No need to go there for this.

regards,

GJ

Ray Paseur Commented:
Thanks.  How about permutations:

Small green ball
Small ball green
ball Small green
ball green Small

What are the rules you want to apply here?

Ray Paseur Commented:
Also, on the issue of case-sensitivity consider this..

Blue ball (vs) ball Blue -- these are simply rearranged.  But the capitalization is questionable in the context of natural language.
Blue ball (vs) Ball blue -- these are rearranged and the capitalization is sensible in an English-language sort of way.  But if case truly does not matter a better test would come from this, where everything is normalized to one case.

BLUE BALL (vs) BALL BLUE

Most of PHP's string and array functions are case-sensitive, so I think it is important to be clear on the rules about the case of the strings.

Thanks, ~Ray
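
A small sketch of the case-sensitivity point, assuming nothing beyond standard PHP string functions. The variable names are made up for the illustration.

```php
<?php
// Case-sensitive by default: these two strings are different values to PHP.
$strictEqual = ('Blue ball' === 'BLUE BALL');
$found       = in_array('BLUE BALL', array('Blue ball'));

// Normalising both sides to one case makes the comparison case-insensitive.
$upperEqual  = (strtoupper('Blue ball') === strtoupper('BLUE BALL'));

// strcasecmp() returns 0 when two strings are equal ignoring case.
$ciEqual     = (strcasecmp('Ball blue', 'BALL BLUE') === 0);

var_dump($strictEqual, $found, $upperEqual, $ciEqual);
// bool(false) bool(false) bool(true) bool(true)
```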

ggjones (Author) Commented:
... now we get into an area of linguistic complexity that is several degrees beyond "ball blue", and that I would dearly love to address - thank you for teasing out the problem.

Here are three examples that immediately jump out:

1) Contractors - Plumbers and Plumbing, Plumbers Plumbing Contractors, Plumbing Contractors
2) Flowers Plants and Trees Artificial, Flowers Plants Trees Artificial
3) Tanning Salons, Tanning Salon

Rules to apply? well, clearly these pairs are duplicates-in-meaning. But how to extrapolate pattern-recognition to the realm of meaning is a real challenge, isn't it...

ggjones (Author) Commented:
... thanks Ray.

regarding case, my data is of random case, so I apply this prior to output:

ucwords(strtolower($theString)).

GJ

Ray Paseur Commented:
OK, good.  With that transformation we can make some progress.  Plurals are a bit more complex.  But case-sensitivity can be neutralized.  I might skip the ucwords() and go with strtoupper() since MySQL queries are by default case-insensitive (but PHP is case-sensitive).  In PHP RAY is not the same as Ray, but in MySQL Ray and RaY and rAY are the same unless you use the BINARY attribute.
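
One way to picture the transformation being discussed is a comparison key that ignores both case and word order. The helper name normalise_key() is hypothetical, and this is only a sketch under that assumption.

```php
<?php
// Hypothetical helper: build a comparison key that ignores case and word order.
function normalise_key($string)
{
    // Upper-case the string, then split it on runs of whitespace.
    $words = preg_split('/\s+/', strtoupper(trim($string)));
    sort($words);                  // word order no longer matters
    return implode(' ', $words);
}

echo normalise_key('Blue ball'), "\n";          // BALL BLUE
echo normalise_key('ball Blue'), "\n";          // BALL BLUE
echo normalise_key('Small ball green'), "\n";   // BALL GREEN SMALL
```

Two strings are then treated as duplicates when their keys are identical.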

Beverley Portlock Commented:
Well, we could convert the code from the other question into a class like so

<?php




class CleanUp {

     protected $result;
     protected $originals;
     
     

     function __construct() {
          $this->result   = array();
          $this->originals = array();
     }



     function add( $arr ) {

          // Process the array
          //
          foreach( $arr as $index => $string ) {

               // Convert the string to lower case and split it
               //
               $exp = explode(" ", strtolower( $string ) );

               // Sort the resulting array and convert it back to a string
               //
               asort( $exp );
               $normalisedString = implode( " ", $exp );

               // Check if the new string has already been seen before. If not record it and its index
               //
               if ( array_search( $normalisedString, $this->result ) === false ) {
                    $this->result [$index] = $normalisedString;
                    $this->originals [$index] = $string;
               }
          }

     }




     function cleanArray() {


          // Array processed and all duplicates removed. Build the new results array
          //
          $newArr = array();
          foreach( $this->result as $index => $string )
               $newArr [] = $this->originals [$index];


          return $newArr;

     }

}






$arr1 = array(
               'Blue ball',
               'Red ball',
               'Small green ball',
               'Purple ball',
               'Ball red',
               'Ball blue',
               'Big orange ball',
               'Blue ball'
          );


$arr2 = array( 
               'Blue ball',
               'Red ball',
               'Small green ball',
               'Ball red',
               'Ball blue',
               'Big orange ball',
               'Blue ball',
               'Pink Ball'
              );


// Instantiate the class
//
$clean = new CleanUp();

$clean->add( $arr1 );
$clean->add( $arr2 );

echo "<pre>";
print_r( $clean->cleanArray() );
echo "</pre>";




To deal with fuzzy comparisons, the key code in the above is this bit

               // Sort the resulting array and convert it back to a string
               //
               asort( $exp );
               $normalisedString = implode( " ", $exp );

               // Check if the new string has already been seen before. If not record it and its index
               //
               if ( array_search( $normalisedString, $this->result ) === false ) {
                    $this->result [$index] = $normalisedString;
                    $this->originals [$index] = $string;
               }
  



Basically you need to apply your transformation rules to produce 'normalisedString' because this is what is stored for the comparison, so you might apply preg_replace to remove trailing 's's from the words in the string. You can run a series of preg_replaces starting with the most specific and working up to the most general case and convert plurals to singulars if you wish. Some example code is below

$normalisedString = "balls ball address addresses hillbillies" . ' ';    // NOTE THE TRAILING SPACE

$normalisedString = preg_replace('#([a-z]+)ies\s#is', '$1y ', $normalisedString );
$normalisedString = preg_replace('#([a-z]+)es\s#is', '$1 ', $normalisedString );
$normalisedString = preg_replace('#([a-z]+[^s])s\s#is', '$1 ', $normalisedString );

echo $normalisedString;



which produces the following string.

ball ball address address hillbilly


Note that 'hillbillies' has been converted to 'hillbilly' and addresses is dealt with but address is not affected because the final preg ignores 'ss'. Also, for my own convenience I added a trailing blank to the string before commencing the conversions to make the pattern simpler to code.
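
A possible variant of the same three rules, sketched with the \b word-boundary assertion so that no trailing space is needed. The function name singularise() is hypothetical. These rules are crude: a word such as 'plates' comes out as 'plat', so the rule set would need tuning against real data.

```php
<?php
// Hypothetical helper: crude plural stripping, most specific rule first.
function singularise($text)
{
    $text = preg_replace('/([a-z]+)ies\b/i', '$1y', $text);        // hillbillies -> hillbilly
    $text = preg_replace('/([a-z]+)es\b/i', '$1', $text);          // addresses   -> address
    $text = preg_replace('/([a-z]*[a-rt-z])s\b/i', '$1', $text);   // balls -> ball; 'address' is left alone
    return $text;
}

echo singularise('balls ball address addresses hillbillies'), "\n";
// ball ball address address hillbilly
```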


Ray Paseur Commented:
@Brian: I like your solution and had I stayed up last night I might have come up with something like that.

A variant on the plurals: You can add the letter s to the end of each of the words then use metaphone() to generate a short code.  I find this works most of the time.  I didn't try that here but I've used it for search matching before with pretty good success.
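
A rough sketch of the metaphone() idea described above, assuming the key is built per word and then sorted. The helper name phonetic_key() is hypothetical, and phonetic codes can also merge words that merely sound alike, so treat this as an experiment rather than a finished matcher.

```php
<?php
// Hypothetical helper: append "s" to every word, then reduce each word to its
// metaphone() code, so a singular and its regular plural share one key.
function phonetic_key($string)
{
    $codes = array();
    foreach (preg_split('/\s+/', strtolower(trim($string))) as $word) {
        $codes[] = metaphone($word . 's');
    }
    sort($codes);                  // ignore word order as well
    return implode(' ', $codes);
}

// Compare a plural phrase with its transposed singular form.
var_dump(phonetic_key('Tanning Salons') === phonetic_key('Salon tanning'));
```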

http://www.laprbass.com/RAY_temp_ggjones.php
Outputs something like:
Array
(
    [0] => Purple ball
    [1] => Pink Ball
)

Best to all, ~Ray
<?php // RAY_temp_ggjones.php
error_reporting(E_ALL);
echo "<pre>"; // READABILITY


/* FROM THE POST AT EE
Array 1:
Blue ball
Red ball
Small green ball
Big orange ball

Array 2:
Blue ball
Red ball
Small green ball
Ball red
Ball blue
Big orange ball
Blue ball
Purple ball
Pink Ball
*/  // END OF COPIED DATA


// CONSTRUCT THE ARRAYS
$array1 = array
( 'Blue ball'
, 'Red ball'
, 'Small green ball'
, 'Big orange ball'
)
;

$array2 = array
( 'Blue ball'
, 'Red ball'
, 'Small green ball'
, 'Ball red'
, 'Ball blue'
, 'Big orange ball'
, 'Blue ball'
, 'Purple ball'
, 'Pink Ball'
)
;

// SHOW THE TEST DATA IN THE ARRAYS
print_r($array1);
print_r($array2);


// NORMALIZE THE ARRAYS
$work_array1 = array();
foreach ($array1 as $str)
{
    $str = strtoupper($str);
    $arr = explode(' ', $str);
    sort($arr);
    $key = implode(' ', $arr);
    $work_array1[$key] = $str;
}

$work_array2 = array();
foreach ($array2 as $str)
{
    $str = strtoupper($str);
    $arr = explode(' ', $str);
    sort($arr);
    $key = implode(' ', $arr);
    $work_array2[$key] = $str;
}


// MAKE AN ARRAY OF ALL THE UNIQUE VALUES
$uniq_array = $work_array1 + $work_array2;

// REMOVE ANY NON-UNIQUE VALUES (APPEAR IN BOTH ARRAYS)
foreach ($uniq_array as $key => $str)
{
    if (array_key_exists($key, $work_array1))
    {
        if (array_key_exists($key, $work_array2))
        {
            unset($uniq_array[$key]);
        }
    }
}


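// MAKE A NEW ARRAY FROM THE ORIGINALS
// NOTE: + IS THE ARRAY UNION OPERATOR; FOR MATCHING NUMERIC KEYS THE $array1 ENTRY IS KEPT
$new = $array1 + $array2;
$out = array();
foreach ($uniq_array as $str)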
{
    $rgx = '#' . "($str)" . '#' . 'i';
    foreach ($new as $txt)
    {
        if (preg_match($rgx, $txt, $mat))
        {
            $out[] = $mat[0];
        }
    }
}
print_r($out);

Sandeep Kothari (Project Lead) Commented:
I think array_diff followed by array_merge is sufficient to get the uniques...

so first do array_diff with array1, array2 as input [sequence is important]
then array_diff with array2, array1 as input [sequence is important]
then array_merge array1, array2

Checkout these links before implementing it ...
http://php.net/manual/en/function.array-diff.php
http://in2.php.net/manual/en/function.array-merge.php
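
A sketch of one reading of these steps, on the sample data from the question: take the difference in each direction, then merge the two results. That reading is an assumption. As written, it works on the raw strings, so it does not yet handle the transposed values.

```php
<?php
$array1 = array('Blue ball', 'Red ball', 'Small green ball', 'Big orange ball');
$array2 = array('Blue ball', 'Red ball', 'Small green ball', 'Ball red', 'Ball blue',
                'Big orange ball', 'Blue ball', 'Purple ball', 'Pink Ball');

// Values of $array1 that are not in $array2, then the reverse.
$onlyIn1 = array_diff($array1, $array2);
$onlyIn2 = array_diff($array2, $array1);

// array_merge() joins the two one-sided differences and renumbers the keys.
$result = array_merge($onlyIn1, $onlyIn2);
print_r($result);
// Ball red, Ball blue, Purple ball, Pink Ball
// 'Ball red' and 'Ball blue' are reported because array_diff() compares the
// raw strings; the values would have to be normalised first.
```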

ggjones (Author) Commented:
... thank you very much gentlemen; I will sift through this today, and figure out how best to apply the logic to my code.

Brian, the preg_replace statements you proposed to manage plurals - very elegant, by the way, in terms of coverage - should these be inserted immediately after line 33 -  " $normalisedString = implode( " ", $exp );" ?

Ray, the introduction to me of metaphone() opens up all sorts of possibilities for other applications as well. In terms of managing plurals for this case, could you elaborate a bit more please? I assume the metaphone($string) call would be inserted in each of the initial for-loops, and then the returned value stored ?? ... or would the initial string simply be replaced, and then converted back for the new array of uniques?

regards,

GJ

Beverley Portlock Commented:
"
Brian, the preg_replace statements you proposed to manage plurals - very elegant, by the way, in terms of coverage - should these be inserted immediately after line 33 -  " $normalisedString = implode( " ", $exp );" ?"


Yes. They go between the creation of the normalisedString and its insertion in the array.

Best of luck

BP

Beverley Portlock Commented:
Ray said "@Brian: I like your solution and had I stayed up last night I might have come up with something like that."

We all got to sleep sometime Ray...

:-D

ggjones (Author) Commented:

Hi Ray... I'm finding some anomalous behavior.

The output array does not include uniques from array2, if their indices are less than or equal to the highest Array1 index.

I can't figure out the reason though...

GJ
Array1
(
    [0] => Blue ball
    [1] => Small green ball
    [2] => Purple ball
    [3] => Ball red
    [4] => Ball blue
    [5] => Big orange ball
    [6] => Blue ball
    [7] => Brown ball
)
Array2
(
    [0] => Blue ball
    [1] => Red ball
    [2] => Pink Ball
    [3] => white Ball
    [4] => ball Small green
    [5] => Small green ball
    [6] => Ball red
    [7] => Ball black
    [8] => Big orange ball
    [9] => Ball blue
    [10] => Small ball green
    [11] => Blue ball
)
ArrayOut_Actual
(
    [0] => Purple ball
    [1] => Brown ball
)

ArrayOut_Should_be
(
    [0] => Purple ball
    [1] => Brown ball
    [] => Ball black
    [] => Pink Ball
    [] => white Ball
    [] => Ball black
)
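
One plausible cause, judging from the script as posted, is the line `$new = $array1 + $array2;`. The + operator on arrays is a union by key, not an append. The cut-down arrays below are illustrative, not the thread's data.

```php
<?php
$array1 = array('Blue ball', 'Purple ball');                // keys 0, 1
$array2 = array('Pink Ball', 'white Ball', 'Ball black');   // keys 0, 1, 2

// Union operator: for keys present in both, the left-hand value wins,
// so $array2's entries at keys 0 and 1 are silently dropped.
$union = $array1 + $array2;
print_r($union);    // Blue ball, Purple ball, Ball black

// array_merge() renumbers numeric keys and keeps every value.
$merged = array_merge($array1, $array2);
print_r($merged);   // Blue ball, Purple ball, Pink Ball, white Ball, Ball black
```

That would match the symptom described: entries of the second array are lost whenever their index also exists in the first array.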

ggjones (Author) Commented:

Hi Brian...

I'm getting an odd result. I'm unclear what it represents; it certainly is not the uniques though!
Any ideas why this should be??

GJ

$myarray1 = array(
               'Blue ball',
               'Small green ball',
               'Purple ball',
               'Ball red',
               'Ball blue',
               'Big orange ball',
               'Blue ball',
               'Brown ball'
          );


$myarray2 = array( 
               'Blue ball',
               'Red ball',
               'Pink Ball',
               'white Ball',
               'ball Small green',
               'Small green ball',
               'Ball red',
               'Ball black',
               'Big orange ball',
               'Ball blue',
               'Small ball green',
               'Blue ball'
              );

ArrayOut_Actual
(
    [0] => Blue ball
    [1] => Small green ball
    [2] => Pink Ball
    [3] => white Ball
    [4] => Big orange ball
    [5] => Ball black
    [6] => Ball red
)


ArrayOut_Should_be
(
    [0] => Purple ball
    [1] => Brown ball
    [] => Ball black
    [] => Pink Ball
    [] => white Ball
    [] => Ball black
)
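
A possible explanation, based on the first version of the class: add() stores each new value under the input array's own $index, so a second add() call can overwrite entries saved by the first (here index 2, 'Purple ball', would be replaced by 'Pink Ball'). The reduced sketch below reproduces the effect and shows appending with [] as one way around it. The function collect() and its flag are hypothetical.

```php
<?php
// Hypothetical reduction of the add() logic. With $reuseIndex = true the input
// index is reused as the storage key; otherwise PHP picks the next free key.
function collect(array $arrays, $reuseIndex)
{
    $result    = array();   // normalised strings
    $originals = array();   // original strings, same keys as $result
    foreach ($arrays as $arr) {
        foreach ($arr as $index => $string) {
            $words = explode(' ', strtolower($string));
            sort($words);
            $normalised = implode(' ', $words);
            if (array_search($normalised, $result, true) === false) {
                if ($reuseIndex) {
                    $result[$index]    = $normalised;   // may overwrite an earlier entry
                    $originals[$index] = $string;
                } else {
                    $result[]    = $normalised;         // always a fresh key
                    $originals[] = $string;
                }
            }
        }
    }
    return array_values($originals);
}

$a = array('Blue ball', 'Purple ball');
$b = array('Ball blue', 'Pink Ball');

print_r(collect(array($a, $b), true));    // Blue ball, Pink Ball ('Purple ball' was overwritten)
print_r(collect(array($a, $b), false));   // Blue ball, Purple ball, Pink Ball
```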

Sandeep Kothari (Project Lead) Commented:
have you tried this ...

first do array_diff with array1, array2 as input [sequence is important]
then array_diff with array2, array1 as input [sequence is important]
then array_merge array1, array2

Checkout these links before implementing it ...
http://php.net/manual/en/function.array-diff.php
http://in2.php.net/manual/en/function.array-merge.php

ggjones (Author) Commented:
Hi Ray... Brian...

If you could spare a moment... this anomalous behavior has me stumped!

cheers,

GJ

Beverley Portlock Commented:
I cannot see why you are expecting this result

ArrayOut_Should_be
(
    [0] => Purple ball
    [1] => Brown ball
    [] => Ball black
    [] => Pink Ball
    [] => white Ball
    [] => Ball black
)

Why should the blue, orange and green balls be omitted?

Beverley Portlock Commented:
Also you have 'Ball black' twice...

ggjones (Author) Commented:


Brian... you are of course correct; I have much in common with a bag of hammers.

But that is not all, oh no, that is not all.

I have also failed to articulate the problem correctly.

The third Array is supposed to include the values of Array1 that are NOT in Array2. I think Ray got it correct after all; sorry Ray.

ArrayOut_Should_be
(
    [0] => Purple ball
    [1] => Brown ball
)
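
The restated requirement (values of Array1 that are NOT in Array2, ignoring case and word order) can be sketched without a class by indexing each array on a normalised key and using array_diff_key(). The helper index_by_key() is hypothetical, and plurals are not handled here.

```php
<?php
// Hypothetical helper: index an array by a case- and order-insensitive key.
function index_by_key(array $strings)
{
    $indexed = array();
    foreach ($strings as $string) {
        $words = explode(' ', strtolower($string));
        sort($words);
        $key = implode(' ', $words);
        if (!isset($indexed[$key])) {
            $indexed[$key] = $string;   // keep the first spelling seen
        }
    }
    return $indexed;
}

$myarray1 = array('Blue ball', 'Small green ball', 'Purple ball', 'Ball red',
                  'Ball blue', 'Big orange ball', 'Blue ball', 'Brown ball');
$myarray2 = array('Blue ball', 'Red ball', 'Pink Ball', 'white Ball', 'ball Small green',
                  'Small green ball', 'Ball red', 'Ball black', 'Big orange ball',
                  'Ball blue', 'Small ball green', 'Blue ball');

// array_diff_key() keeps entries of the first array whose keys are absent from the second.
$onlyIn1 = array_values(array_diff_key(index_by_key($myarray1), index_by_key($myarray2)));
print_r($onlyIn1);   // Purple ball, Brown ball
```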

Talk about cognitive dissonance. I'm not even sure what I was thinking. A momentary lapse? Heh, probably insight is what is momentary!

Thanks for correcting me Brian, and for all of your effort.

cheers,

GJ

Beverley Portlock Commented:
So where are we here? I am confused as to what (if anything) I need to be doing....

:-O

ggjones (Author) Commented:
Brian....

Your approach and Ray's appear to be  quite different. I'm curious as to efficiency of the methods with respect to speed/performance.

In my testing, I'm looping through 100 records at a time.... so, 300 arrays each with 5 to 10 values.

I would be curious to try each of your respective approaches to see if there is a discernible performance difference.

Would you be able to tweak your output so that :

The third Array  includes the values of Array1 that are NOT in Array2.

ArrayOut_Should_be
(
    [0] => Purple ball
    [1] => Brown ball
)

regards,

GJ
$myarray1 = array(
               'Blue ball',
               'Small green ball',
               'Purple ball',
               'Ball red',
               'Ball blue',
               'Big orange ball',
               'Blue ball',
               'Brown ball'
          );


$myarray2 = array(
               'Blue ball',
               'Red ball',
               'Pink Ball',
               'white Ball',
               'ball Small green',
               'Small green ball',
               'Ball red',
               'Ball black',
               'Big orange ball',
               'Ball blue',
               'Small ball green',
               'Blue ball'
              );

ArrayOut_Should_be
(
    [0] => Purple ball
    [1] => Brown ball
)

Beverley Portlock Commented:
OK, modified code below. However, if you read my EE profile you will see that efficiency is the least of my concerns unless it makes itself a problem.

<?php




class CleanUp {

     protected $result;
     protected $originals;



     function __construct() {
          $this->result    = array();
          $this->originals = array();
     }



     function add( $arr ) {

          // Process the array
          //
          foreach( $arr as $string ) {

               // Convert the string to lower case and split it
               //
               $exp = explode(" ", strtolower( $string ) );

               // Sort the resulting array and convert it back to a string
               //
               asort( $exp );
               $normalisedString = implode( " ", $exp );

               // Check if the new string has already been seen before. If not, append it
               // and its original under the same new key (reusing the input index could
               // overwrite entries stored by an earlier add() call)
               //
               $i = array_search( $normalisedString, $this->result );
               if (  $i === false ) {
                    $this->result []    = $normalisedString;
                    $this->originals [] = $string;
               }
          }

     }



     function remove( $arr ) {

          // Process the array
          //
          foreach( $arr as $string ) {

               // Convert the string to lower case and split it
               //
               $exp = explode(" ", strtolower( $string ) );

               // Sort the resulting array and convert it back to a string
               //
               asort( $exp );
               $normalisedString = implode( " ", $exp );

               // Check if the normalised string is currently stored. If so, remove it
               // and its original
               //
               $i = array_search( $normalisedString, $this->result );
               if (  $i !== false )
                    unset( $this->result[$i], $this->originals[$i] );
               
          }
     }

     

     function cleanArray() {

          // Array processed and all duplicates removed. Build the new results array
          //
          $newArr = array();
            foreach($this->result as $index => $string )
                    $newArr [] = $this->originals [$index];


          return $newArr;

     }

}


$myarray1 = array(
               'Blue ball',
               'Small green ball',
               'Purple ball',
               'Ball red',
               'Ball blue',
               'Big orange ball',
               'Blue ball',
               'Brown ball'
          );


$myarray2 = array(
               'Blue ball',
               'Red ball',
               'Pink Ball',
               'white Ball',
               'ball Small green',
               'Small green ball',
               'Ball red',
               'Ball black',
               'Big orange ball',
               'Ball blue',
               'Small ball green',
               'Blue ball'
              );



// Instantiate the class
//
$clean = new CleanUp();

$clean->add( $myarray1 );
$clean->remove( $myarray2 );

echo "<pre>";
print_r( $clean->cleanArray() );
echo "</pre>";



Output

Array
(
    [0] => Purple ball
    [1] => Brown ball
)

Beverley Portlock Commented:
Instantiating a class is a performance heavy operation, so I have modified the class so it only needs instantiating once and can then be reset with an initialisation method. Try the original (above) and then this (below).

I ran a little test of my own over 10,000 iterations and got this

10,000 instances - 1.93467 seconds
1 instance, 10,000 initialisations - 1.85216 seconds

so my second method saved 0.08251 seconds over 10,000 iterations, or about 8 microseconds per iteration.

This is why I do not worry much about efficiency.
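
For anyone wanting to repeat a comparison like this, a rough timing sketch follows. The helper time_iterations() is hypothetical, the workload is a stand-in (one string normalisation), and the numbers it prints will differ from the figures quoted above.

```php
<?php
// Hypothetical timing helper: run $callback $iterations times and
// return the elapsed wall-clock time in seconds.
function time_iterations($callback, $iterations)
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        call_user_func($callback);
    }
    return microtime(true) - $start;
}

// Stand-in workload: normalise one string per iteration.
$elapsed = time_iterations(function () {
    $words = explode(' ', strtolower('Small green ball'));
    sort($words);
    return implode(' ', $words);
}, 10000);

printf("10,000 iterations: %.5f seconds (%.2f microseconds each)\n",
       $elapsed, $elapsed / 10000 * 1e6);
```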

Beverley Portlock Commented:
Oops - the code (forgot to apply it)

<?php




class CleanUp {

     protected $result;
     protected $originals;



     function __construct() {
          $this->initialise();
     }


     function initialise() {
          $this->result    = array();
          $this->originals = array();
     }
     

     function add( $arr ) {

          // Process the array
          //
          foreach( $arr as $index => $string ) {

               // Convert the string to lower case and split it
               //
               $exp = explode(" ", strtolower( $string ) );

               // Sort the resulting array and convert it back to a string
               //
               asort( $exp );
               $normalisedString = implode( " ", $exp );

               // Check if the new string has already been seen before. If not, append it
               // and its original under the same new key (reusing the input index could
               // overwrite entries stored by an earlier add() call)
               //
               $i = array_search( $normalisedString, $this->result );
               if (  $i === false ) {
                    $this->result []    = $normalisedString;
                    $this->originals [] = $string;
               }
          }

     }



     function remove( $arr ) {

          // Process the array
          //
          foreach( $arr as $string ) {

               // Convert the string to lower case and split it
               //
               $exp = explode(" ", strtolower( $string ) );

               // Sort the resulting array and convert it back to a string
               //
               asort( $exp );
               $normalisedString = implode( " ", $exp );

               // Check if the normalised string is currently stored. If so, remove it
               // and its original
               //
               $i = array_search( $normalisedString, $this->result );
               if (  $i !== false )
                    unset( $this->result[$i], $this->originals[$i] );
               
          }
     }

     

     function cleanArray() {

          // Array processed and all duplicates removed. Build the new results array
          //
          $newArr = array();
            foreach($this->result as $index => $string )
                    $newArr [] = $this->originals [$index];


          return $newArr;

     }

}


$myarray1 = array(
               'Blue ball',
               'Small green ball',
               'Purple ball',
               'Ball red',
               'Ball blue',
               'Big orange ball',
               'Blue ball',
               'Brown ball'
          );


$myarray2 = array(
               'Blue ball',
               'Red ball',
               'Pink Ball',
               'white Ball',
               'ball Small green',
               'Small green ball',
               'Ball red',
               'Ball black',
               'Big orange ball',
               'Ball blue',
               'Small ball green',
               'Blue ball'
              );


              

// Instantiate the class
//
$clean = new CleanUp();

for( $i=0; $i < 100; $i++) {
     $clean->initialise();
     $clean->add( $myarray1 );
     $clean->remove( $myarray2 );

     echo "<pre>";
     print_r( $clean->cleanArray() );
     echo "</pre>";
}

ggjones (Author) Commented:

Thanks Brian. Having tested it across a wide variety of scenarios, it has held up well. As you say, performance is not really an issue. The only things I have added so far are the code to handle plurals that you also contributed, and a check to ensure that none of the input arrays are null.

I did some comparisons with Ray's method, and his method choked... on pizza :-)

$array1 = array(
               'pizza',
               'pizza restaurants',
               'restaurants pizza'
          );


$array2 = array(
               'Restaurants',
               'Pizza Restaurant'
              );

outputted:

Array
(
    [0] => pizza
    [1] => pizza
    [2] => pizza
    [3] => restaurants pizza
    [4] => restaurants
    [5] => restaurants
    [6] => pizza restaurant
)
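
The repeated 'pizza' entries look consistent with the unanchored pattern built in the last loop of that script: a pattern such as #(PIZZA)#i matches anywhere inside a longer string. A short sketch of the difference, using only standard preg functions (the variable names are illustrative):

```php
<?php
// An unanchored pattern matches anywhere inside the subject...
$loose1 = preg_match('#(PIZZA)#i', 'pizza restaurants');
$loose2 = preg_match('#(RESTAURANTS)#i', 'restaurants pizza');

// ...whereas anchoring with ^ and $ (and escaping the text with preg_quote())
// only accepts the whole string.
$rgx     = '#^' . preg_quote('PIZZA', '#') . '$#i';
$strict1 = preg_match($rgx, 'pizza restaurants');
$strict2 = preg_match($rgx, 'pizza');

var_dump($loose1, $loose2, $strict1, $strict2);
// int(1) int(1) int(0) int(1)
```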


many thanks to both of you for helping me out on this task.

regards,

GJ

Beverley Portlock Commented:
Glad you are sorted. I wonder if Ray likes Pizza???

:-)
