Avatar of ggjones
ggjones

asked on

How to create an array of uniques from two arrays, that also addresses transposed strings


I have two arrays with the following values:

Array 1:
Blue ball
Red ball
Small green ball
Big orange ball

Array 2:
Blue ball
Red ball
Small green ball
Ball red
Ball blue
Big orange ball
Blue ball
Purple ball
Pink Ball

I need to output an Array3 that includes only uniques (Purple ball, Pink Ball). This excludes dupes (e.g. Blue ball) AND transposed dupes (e.g. Ball blue).

many thanks,

GJ
Avatar of Aaron Tomosky
Aaron Tomosky

array_merge() will combine the arrays, and array_unique() will remove exact duplicates. However, this won't deal with the transposed strings. To do that, I think you could then explode each value on spaces, then I dunno. But that gets you pretty close. Maybe I'll think of the rest.
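To make the gap concrete, here is a minimal sketch (not code from this thread; the sample data is shortened and the helper name wordKey() is made up for illustration). It shows that merging and de-duplicating leaves the transposed strings in place, and one way the "explode on spaces" idea might be finished: sort the words so that transposed strings produce the same key.

```php
<?php
$array1 = ['Blue ball', 'Red ball'];
$array2 = ['Blue ball', 'Ball red', 'Purple ball'];

// array_merge() + array_unique() removes exact duplicates only.
$merged = array_values(array_unique(array_merge($array1, $array2)));
// $merged still holds both 'Red ball' and 'Ball red'.

// Hypothetical helper: lower-case, split on spaces, sort the words, re-join.
function wordKey(string $s): string
{
    $words = explode(' ', strtolower(trim($s)));
    sort($words);
    return implode(' ', $words);
}

// Keep the first spelling seen for each key.
$byKey = [];
foreach ($merged as $value) {
    $key = wordKey($value);
    if (!isset($byKey[$key])) {
        $byKey[$key] = $value;
    }
}
$deduped = array_values($byKey);

print_r($deduped); // Blue ball, Red ball, Purple ball
```

Keying the lookup array by the normalised string means each check is an isset() rather than a scan of the whole array.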
Avatar of ggjones

ASKER


thanks for replying Aaron.

The "transposed issues" are addressed here:

https://www.experts-exchange.com/questions/27398480/How-to-use-'array-unique-on-transposed-strings.html

The challenge, essentially, is to modify this solution to include 2 arrays. array_merge makes sense... but I think something is missing....

GJ

.
Are you interested in case-sensitivity?
Avatar of ggjones

ASKER

Hi Ray... no, in most cases I'm insensitive, heh, heh.

Case issues are handled down-stream. No need to go there for this.

regards,

GJ
Thanks.  How about permutations:

Small green ball
Small ball green
ball Small green
ball green Small

What are the rules you want to apply here?
Also, on the issue of case-sensitivity consider this..

Blue ball (vs) ball Blue -- these are simply rearranged.  But the capitalization is questionable in the context of natural language.
Blue ball (vs) Ball blue -- these are rearranged and the capitalization is sensible in an English-language sort of way.  But if case truly does not matter a better test would come from this, where everything is normalized to one case.

BLUE BALL (vs) BALL BLUE

Most of PHP's string and array functions are case-sensitive, so I think it is important to be clear on the rules about the case of the strings.
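A small sketch of the point about case (the sample values are only illustrative): in_array() treats differently cased strings as different, while normalising both sides to one case, or using strcasecmp(), neutralises the difference.

```php
<?php
$haystack = ['Blue ball', 'Red ball'];

// in_array() compares strings case-sensitively.
$exact = in_array('blue ball', $haystack, true);               // false

// Normalising both sides to one case makes the test case-insensitive.
$lowered = array_map('strtolower', $haystack);
$folded  = in_array(strtolower('BLUE BALL'), $lowered, true);  // true

// strcasecmp() returns 0 when two strings are equal ignoring case.
$same = (strcasecmp('BLUE BALL', 'blue ball') === 0);          // true

var_dump($exact, $folded, $same);
```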

Thanks, ~Ray
Avatar of ggjones

ASKER

... now we get into an area of linguistic complexity that is several degrees beyond "ball blue" , and that I would dearly love to address - thank you for teasing out the problem.

Here are three examples that immediately jump out:

1) Contractors - Plumbers and Plumbing, Plumbers Plumbing Contractors, Plumbing Contractors
2) Flowers Plants and Trees Artificial, Flowers Plants Trees Artificial
3) Tanning Salons, Tanning Salon

Rules to apply? well, clearly these pairs are duplicates-in-meaning. But how to extrapolate pattern-recognition to the realm of meaning is a real challenge, isn't it...

.
Avatar of ggjones

ASKER

... thanks Ray.

regarding case, my data is of random case, so I apply this prior to output:

ucwords(strtolower($theString)).

GJ
OK, good.  With that transformation we can make some progress.  Plurals are a bit more complex.  But case-sensitivity can be neutralized.  I might skip the ucwords() and go with strtoupper(), since MySQL string comparisons are case-insensitive by default (but PHP is case-sensitive).  In PHP RAY is not the same as Ray, but in MySQL Ray and RaY and rAY are the same unless you use the BINARY attribute.
SOLUTION
Avatar of Beverley Portlock
Beverley Portlock
SOLUTION
I think array_diff followed by array_merge is sufficient to get the uniques...

so first do array_diff with array1, array2 as input [sequence is important]
then array_diff with array2, array1 as input [sequence is important]
then array_merge the two results

Checkout these links before implementing it ...
http://php.net/manual/en/function.array-diff.php
http://in2.php.net/manual/en/function.array-merge.php
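As a rough sketch of the steps above (the sample data is made up), the two array_diff() calls and the merge look like this. Note that array_diff() compares values as exact strings, so a transposed pair such as 'Red ball' / 'Ball red' is not treated as a match and both values survive.

```php
<?php
$array1 = ['Blue ball', 'Red ball', 'Brown ball'];
$array2 = ['Blue ball', 'Ball red', 'Purple ball'];

// Values of the first argument that do not appear in the second.
$onlyIn1 = array_diff($array1, $array2);
$onlyIn2 = array_diff($array2, $array1);

// array_merge() renumbers the numeric keys left behind by array_diff().
$uniques = array_merge($onlyIn1, $onlyIn2);

print_r($uniques); // Red ball, Brown ball, Ball red, Purple ball
```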
Avatar of ggjones

ASKER

... thank you very much gentlemen; I will sift through this today, and figure out how best to apply the logic to my code.

Brian, the preg_replace statements you proposed to manage plurals - very elegant, by the way, in terms of coverage - should these be inserted immediately after line 33 -  " $normalisedString = implode( " ", $exp );" ?

Ray, the introduction to me of metaphone() opens up all sorts of possibilities for other applications as well. In terms of managing plurals for this case, could you elaborate a bit more please? I assume the metaphone($string) call would be inserted in each of the initial for-loops, and then the returned value stored ?? ... or would the initial string simply be replaced, and then converted back for the new array of uniques?

regards,

GJ
"Brian, the preg_replace statements you proposed to manage plurals - very elegant, by the way, in terms of coverage - should these be inserted immediately after line 33 -  " $normalisedString = implode( " ", $exp );" ?"


Yes. They go between the creation of the normalisedString and its insertion in the array.
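For readers who cannot see the preg_replace statements being discussed, here is a purely illustrative sketch of the general idea. It is not the code from the solution: the helper name singularKey() and the plural rule are assumptions, and it applies the rule to each word before sorting rather than to the joined string, so that stripping an "s" cannot disturb the sort order.

```php
<?php
// Hypothetical helper, not the preg_replace statements proposed in the thread.
function singularKey(string $s): string
{
    $words = explode(' ', strtolower(trim($s)));
    foreach ($words as $i => $word) {
        // Naive rule (an assumption): drop one trailing "s" from words of four
        // or more characters, leaving words that end in "ss" alone.
        $words[$i] = preg_replace('/^(\w{2,}[^s\W])s$/', '$1', $word);
    }
    sort($words);
    return implode(' ', $words);
}

var_dump(singularKey('Tanning Salons') === singularKey('Tanning Salon'));      // true
var_dump(singularKey('restaurants pizza') === singularKey('Pizza Restaurant')); // true
```

A rule this simple does not cover plurals in "es" or "ies", and it does nothing for pairs such as "Plumbers and Plumbing" versus "Plumbing Contractors".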

Best of luck

BP
Ray said "@Brian: I like your solution and had I stayed up last night I might have come up with something like that."

We all got to sleep sometime Ray...

:-D

Avatar of ggjones

ASKER


Hi Ray... I'm finding some anomalous behavior.

The output array does not include uniques from array2, if their indices are less than or equal to the highest Array1 index.

I can't figure out the reason though...

GJ
Array1
(
    [0] => Blue ball
    [1] => Small green ball
    [2] => Purple ball
    [3] => Ball red
    [4] => Ball blue
    [5] => Big orange ball
    [6] => Blue ball
    [7] => Brown ball
)
Array2
(
    [0] => Blue ball
    [1] => Red ball
    [2] => Pink Ball
    [3] => white Ball
    [4] => ball Small green
    [5] => Small green ball
    [6] => Ball red
    [7] => Ball black
    [8] => Big orange ball
    [9] => Ball blue
    [10] => Small ball green
    [11] => Blue ball
)
ArrayOut_Actual
(
    [0] => Purple ball
    [1] => Brown ball
)

ArrayOut_Should_be
(
    [0] => Purple ball
    [1] => Brown ball
    [] => Ball black
    [] => Pink Ball
    [] => white Ball
    [] => Ball black
)


Avatar of ggjones

ASKER


Hi Brian...

I'm getting an odd result. I'm unclear what it represents; it certainly is not the uniques though!
Any ideas why this should be??

GJ

$myarray1 = array(
               'Blue ball',
               'Small green ball',
               'Purple ball',
               'Ball red',
               'Ball blue',
               'Big orange ball',
               'Blue ball',
               'Brown ball'
          );


$myarray2 = array( 
               'Blue ball',
               'Red ball',
               'Pink Ball',
               'white Ball',
               'ball Small green',
               'Small green ball',
               'Ball red',
               'Ball black',
               'Big orange ball',
               'Ball blue',
               'Small ball green',
               'Blue ball'
              );

ArrayOut_Actual
(
    [0] => Blue ball
    [1] => Small green ball
    [2] => Pink Ball
    [3] => white Ball
    [4] => Big orange ball
    [5] => Ball black
    [6] => Ball red
)


ArrayOut_Should_be
(
    [0] => Purple ball
    [1] => Brown ball
    [] => Ball black
    [] => Pink Ball
    [] => white Ball
    [] => Ball black
)


have you tried this ...

first do array_diff with array1, array2 as input [sequence is important]
then array_diff with array2, array1 as input [sequence is important]
then array_merge the two results

Checkout these links before implementing it ...
http://php.net/manual/en/function.array-diff.php
http://in2.php.net/manual/en/function.array-merge.php
Avatar of ggjones

ASKER

Hi Ray... Brian...

If you could spare a moment... this anomalous behavior has me stumped!

cheers,

GJ
I cannot see why you are expecting this result

ArrayOut_Should_be
(
    [0] => Purple ball
    [1] => Brown ball
    [] => Ball black
    [] => Pink Ball
    [] => white Ball
    [] => Ball black
)

Why should the blue, orange and green balls be omitted?
Also you have 'Ball black' twice...
Avatar of ggjones

ASKER



Brian... you are of course correct; I have much in common with a bag of hammers.

But that is not all, oh no, that is not all.

I have also failed to articulate the problem correctly.

The third Array is supposed to include the values of Array1 that are NOT in Array2. I think Ray got it correct after all; sorry Ray.

ArrayOut_Should_be
(
    [0] => Purple ball
    [1] => Brown ball
)
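The requirement as restated here (values of Array1 that are not in Array2, ignoring case and word order) can be sketched with two plain functions. This is only an illustration under those assumptions; the names wordKey() and transposedDiff() are made up and are not taken from either expert's solution.

```php
<?php
// Hypothetical helper: case-folded, word-sorted comparison key.
function wordKey(string $s): string
{
    $words = explode(' ', strtolower(trim($s)));
    sort($words);
    return implode(' ', $words);
}

// Values of $a whose key does not occur in $b. The first spelling of each
// key is kept; later exact or transposed repeats within $a are dropped.
function transposedDiff(array $a, array $b): array
{
    $exclude = [];
    foreach ($b as $value) {
        $exclude[wordKey($value)] = true;
    }

    $out = [];
    foreach ($a as $value) {
        $key = wordKey($value);
        if (!isset($exclude[$key]) && !isset($out[$key])) {
            $out[$key] = $value;
        }
    }
    return array_values($out);
}

$array1 = ['Blue ball', 'Small green ball', 'Purple ball', 'Ball red',
           'Ball blue', 'Big orange ball', 'Blue ball', 'Brown ball'];
$array2 = ['Blue ball', 'Red ball', 'Pink Ball', 'white Ball',
           'ball Small green', 'Small green ball', 'Ball red', 'Ball black',
           'Big orange ball', 'Ball blue', 'Small ball green', 'Blue ball'];

$result = transposedDiff($array1, $array2);
print_r($result); // Purple ball, Brown ball
```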

Talk about cognitive dissonance. I'm not even sure what I was thinking. A momentary lapse? Heh, probably insight is what is momentary!

Thanks for correcting me Brian, and for all of your effort.

cheers,

GJ
So where are we here? I am confused as to what (if anything) I need to be doing....

:-O

Avatar of ggjones

ASKER

Brian....

Your approach and Ray's appear to be  quite different. I'm curious as to efficiency of the methods with respect to speed/performance.

In my testing, I'm looping through 100 records at a time.... so, 300 arrays each with 5 to 10 values.

I would be curious to try each of your respective approaches to see if there is a discernible performance difference.

Would you be able to tweak your output so that :

The third Array  includes the values of Array1 that are NOT in Array2.

ArrayOut_Should_be
(
    [0] => Purple ball
    [1] => Brown ball
)

regards,

GJ
$myarray1 = array(
               'Blue ball',
               'Small green ball',
               'Purple ball',
               'Ball red',
               'Ball blue',
               'Big orange ball',
               'Blue ball',
               'Brown ball'
          );


$myarray2 = array(
               'Blue ball',
               'Red ball',
               'Pink Ball',
               'white Ball',
               'ball Small green',
               'Small green ball',
               'Ball red',
               'Ball black',
               'Big orange ball',
               'Ball blue',
               'Small ball green',
               'Blue ball'
              );

ArrayOut_Should_be
(
    [0] => Purple ball
    [1] => Brown ball
)


OK, modified code below. However, if you read my EE profile you will see that efficiency is the least of my concerns unless it makes itself a problem.

<?php




class CleanUp {

     protected $result;
     protected $originals;



     function __construct() {
          $this->result    = array();
          $this->originals = array();
     }



     function add( $arr ) {

          // Process the array
          //
          foreach( $arr as $index => $string ) {

               // Convert the string to lower case and split it
               //
               $exp = explode(" ", strtolower( $string ) );

               // Sort the resulting array and convert it back to a string
               //
               asort( $exp );
               $normalisedString = implode( " ", $exp );

               // Check if the new string has already been seen before. If not record it and its index
               //
               $i = array_search( $normalisedString, $this->result );
               if (  $i === false ) {
                    $this->result [$index] = $normalisedString;
                    $this->originals [$index] = $string;
               }
          }

     }



     function remove( $arr ) {

          // Process the array
          //
          foreach( $arr as $string ) {

               // Convert the string to lower case and split it
               //
               $exp = explode(" ", strtolower( $string ) );

               // Sort the resulting array and convert it back to a string
               //
               asort( $exp );
               $normalisedString = implode( " ", $exp );

               // Check if the normalised string is in the results. If so, remove it
               //
               $i = array_search( $normalisedString, $this->result );
               if (  $i !== false )
                    unset( $this->result[$i] );
               
          }
     }

     

     function cleanArray() {

          // Array processed and all duplicates removed. Build the new results array
          //
          $newArr = array();
          foreach( $this->result as $index => $string )
               $newArr[] = $this->originals[$index];


          return $newArr;

     }

}


$myarray1 = array(
               'Blue ball',
               'Small green ball',
               'Purple ball',
               'Ball red',
               'Ball blue',
               'Big orange ball',
               'Blue ball',
               'Brown ball'
          );


$myarray2 = array(
               'Blue ball',
               'Red ball',
               'Pink Ball',
               'white Ball',
               'ball Small green',
               'Small green ball',
               'Ball red',
               'Ball black',
               'Big orange ball',
               'Ball blue',
               'Small ball green',
               'Blue ball'
              );



// Instantiate the class
//
$clean = new CleanUp();

$clean->add( $myarray1 );
$clean->remove( $myarray2 );

echo "<pre>";
print_r( $clean->cleanArray() );
echo "</pre>";



Output

Array
(
    [0] => Purple ball
    [1] => Brown ball
)
Instantiating a class is a performance heavy operation, so I have modified the class so it only needs instantiating once and can then be reset with an initialisation method. Try the original (above) and then this (below).

I ran a little test of my own over 10,000 iterations and got this

10,000 instances - 1.93467 seconds
1 instance, 10,000 initialisations - 1.85216 seconds

so my second method saved 0.08251 seconds over 10,000 iterations or 8 microseconds per iteration.

This is why I do not worry much about efficiency.
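For anyone wanting to repeat this kind of measurement, a minimal timing sketch using microtime() follows. It is not the test harness used for the figures above; the normalise() function is a hypothetical stand-in for whatever work is being timed, and the numbers it prints will vary from machine to machine.

```php
<?php
// Hypothetical stand-in for the work being measured.
function normalise(string $s): string
{
    $words = explode(' ', strtolower($s));
    sort($words);
    return implode(' ', $words);
}

$iterations = 10000;

$start = microtime(true); // current time in seconds, as a float
for ($i = 0; $i < $iterations; $i++) {
    normalise('Small green ball');
}
$elapsed = microtime(true) - $start;

// Convert the total into microseconds per iteration.
$perIterationMicroseconds = $elapsed / $iterations * 1e6;

printf(
    "%d iterations: %.5f seconds (%.2f microseconds each)\n",
    $iterations,
    $elapsed,
    $perIterationMicroseconds
);
```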

ASKER CERTIFIED SOLUTION
Avatar of ggjones

ASKER


Thanks Brian. Having tested it across a wide variety of scenarios, it has held up well. As you say, performance is not really an issue. The only things I have added so far are the code to handle plurals that you also contributed, and a check to ensure that none of the input arrays are null.

I did some comparisons with Ray's method, and his method choked... on pizza :-)

$array1 = array(
               'pizza',
               'pizza restaurants',
               'restaurants pizza'
          );


$array2 = array(
               'Restaurants',
               'Pizza Restaurant'
              );

outputted:

Array
(
    [0] => pizza
    [1] => pizza
    [2] => pizza
    [3] => restaurants pizza
    [4] => restaurants
    [5] => restaurants
    [6] => pizza restaurant
)


many thanks to both of you for helping me out on this task.

regards,

GJ
Glad you are sorted. I wonder if Ray likes Pizza???

:-)