Link to home
Create AccountLog in
Avatar of egoselfaxis
egoselfaxis

asked on

Reporting the number of duplicate email addresses in 2 comma delimited CSV files using PHP

I need to write a PHP script that reads the contents of 2 separate CSV files --- (which each contain a single column in that that lists only email addresses) --- and that reports on the number of duplicate email addresses that are found between them.  

I've already coded a simple web form that successfully uploads the 2 separate CSV files to the server:

<!doctype html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Pilot Workshop :: Dupe Finder</title>
</head>

<body>

	<br /><br />	

	<form name="dupeFinderForm" id="dupeFinderForm" method="post" enctype="multipart/form-data" action="" style="width:300px;margin:0 auto;padding:25px;text-align:center;font-family:Arial,Helvetica;border:5px double #333;">
	
		<?php if (isset($_POST['UPLOAD'])) { 
		
			$csv1filename = $_FILES['csv1']['tmp_name'];
			
			$csv1 = '/home/expert/public_html/dupefinder/csv1.csv';
			
			move_uploaded_file($csv1filename, $csv1);			
			
			$csv2filename = $_FILES['csv2']['tmp_name'];
			
			$csv2 = '/home/expert/public_html/dupefinder/csv2.csv';
			
			move_uploaded_file($csv2filename, $csv2);			
		
		
		} else { ?>



			<h1>Dupe Finder</h1><br /><br />

			<label style="font-weight:bold;">CSV File #1</label><br />
			<input type="file" name="csv1" id="csv1" size="30" style="border:1px solid #333;" /><br /><br /><br /><br />

			<label style="font-weight:bold;">CSV File #2</label><br />
			<input type="file" name="csv2" id="csv2" size="30" style="border:1px solid #333;" /><br /><br /><br /><br />

			<input type="submit" name="UPLOAD" id="UPLOAD" value="UPLOAD" style="border:1px solid #333;background-color:#93BF00;color:#333;font-weight:bold;padding:7px;" />
		
		<?php } ?>

	</form>

</body>
</html>

Open in new window


I now just need to read the contents of each of them and report on the number of dupes found.  I don't need to display the actual duplicate email addresses.  I just need to display the number of dupes that were found.

Can anybody here help me finish my script?  Or does anyone have a PHP based code example that I could use for reference which generates this type of report?  

Thanks!
- Yvan
Avatar of Ray Paseur
Ray Paseur
Flag of United States of America image

If you can put the email addresses into arrays, and normalize them to all-upper-case, this user function will be able to find the duplicates.  Perhaps with a little innovation you will be able to count the number of replications among all the lists.
http://www.php.net/manual/en/function.array-unique.php#81513
Avatar of egoselfaxis
egoselfaxis

ASKER

The example you've provided a link for seems to suggest that all of the email addresses from both data files should be placed inside the same single array.  This might work -- though I'm not exactly sure how I would integrate this with what I have already.

If I remember correctly -- I can append data to an array using the array_push() function -- and I can then count the number of elements in that array using the count() function.  

Could you perhaps give me an example as to how I would stitch all of this together?

Thanks,
- Yvan
Try combining the arrays like this:

$new = $arr1 + $arr2;

This code example seems to be close to something you could modify for your own needs.

<?php // RAY_array_not_unique.php
error_reporting(E_ALL);
echo "<pre>";


// A FUNCTION TO FIND REPLICATED VALUES IN AN ARRAY
function array_not_unique($raw)
{
    // MAN PAGE: http://php.net/manual/en/function.array-count-values.php
    $new = array_count_values($raw);
    foreach ($new as $key => $val)
    {
       if ($val < 2) unset($new[$key]);
    }
    return $new;
}


// SOME TEST DATA
$raw_array   = array();
$raw_array[] = 'abc@xyz.com';
$raw_array[] = 'def@xyz.com';
$raw_array[] = 'ghi@xyz.com';
$raw_array[] = 'jkl@xyz.com';
$raw_array[] = 'mno@xyz.com';
$raw_array[] = 'pqr@xyz.com';
$raw_array[] = 'stu@xyz.com';

// SOME DUPLICATES
$raw_array[] = 'abc@xyz.com';
$raw_array[] = 'jkl@xyz.com';
$raw_array[] = 'abc@xyz.com';
$raw_array[] = 'def@xyz.com';


// SHOW THE FUNCTION AT WORK
$common = array_not_unique($raw_array);
foreach ($common as $x => $n)
{
    echo PHP_EOL . "THE VALUE $x APPEARED $n TIMES";
}

Open in new window

If you want to post the test data, I'll be glad to try to give you a code example.

Best regards, ~Ray
Here's what I've got so far:

<!doctype html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Pilot Workshop :: Dupe Finder</title>
</head>

<body>

	<br /><br />	

	<form name="dupeFinderForm" id="dupeFinderForm" method="post" enctype="multipart/form-data" action="" style="width:300px;margin:0 auto;padding:25px;text-align:center;font-family:Arial,Helvetica;border:5px double #333;">
	
		<?php if (isset($_POST['UPLOAD'])) { 
		
			//error_reporting(E_ALL);		
		
			$csv1filename = $_FILES['csv1']['tmp_name'];
			
			$csv1 = '/home/expert/public_html/dupefinder/csv1.csv';
			
			move_uploaded_file($csv1filename, $csv1);			
			
			$csv2filename = $_FILES['csv2']['tmp_name'];
			
			$csv2 = '/home/expert/public_html/dupefinder/csv2.csv';
			
			move_uploaded_file($csv2filename, $csv2);				

			// CREATE THE EMPTY ARRAY
			
			$raw_array = array();	
			
			// LOOP THROUGH THE 1ST CSV FILE	

			if (($handle = fopen("$csv1", "r")) !== FALSE) {

				$key = 0;

				while (($data = fgetcsv($handle, 0, ",")) !== FALSE) {

				   $c = count($data);

				   for ($x=0;$x<$c;$x++) {
					$raw_array[$key][$x] = $data[$x];
				   }
				   $key++;
				}

				fclose($handle);

			} 		

			// LOOP THROUGH THE 2ND CSV FILE
			
			if (($handle = fopen("$csv2", "r")) !== FALSE) {

				while (($data = fgetcsv($handle, 0, ",")) !== FALSE) {

				   $c = count($data);

				   for ($x=0;$x<$c;$x++) {
					$raw_array[$key][$x] = $data[$x];
				   }
				   $key++;
				}

				fclose($handle);

			} 	

			// DISPLAY CONTENTS OF THE ARRAY 	
			
			echo "<pre >";			
			print_r($raw_array);			
			echo "</pre>";	
			
			// FUNCTION TO COUNT DUPLICATE EMAILS IN THE ARRAY
			
			function array_not_unique($raw) {
				$new = array_count_values($raw);				
				foreach ($new as $key => $val) {
				   if ($val < 2) unset($new[$key]);
				}				
				return $new;				
			}		

			$common = array_not_unique($raw_array);
			
			// DISPLAY THE NUMBER OF DUPES FOUND IN THE ARRAY

			foreach ($common as $x => $n) {
			
				echo PHP_EOL . "THE VALUE $x APPEARED $n TIMES";
				
			}				
		
		} else { ?>

			<h1>Dupe Finder</h1><br /><br />

			<label style="font-weight:bold;">CSV File #1</label><br />
			<input type="file" name="csv1" id="csv1" size="30" style="border:1px solid #333;" /><br /><br /><br /><br />

			<label style="font-weight:bold;">CSV File #2</label><br />
			<input type="file" name="csv2" id="csv2" size="30" style="border:1px solid #333;" /><br /><br /><br /><br />

			<input type="submit" name="UPLOAD" id="UPLOAD" value="UPLOAD" style="border:1px solid #333;background-color:#93BF00;color:#333;font-weight:bold;padding:7px;" />
		
		<?php } ?>

	</form>

</body>
</html>

Open in new window



It's partially working -- but for some reason, I'm getting an error message, -- and the function isn't returning a value.

Warning: array_count_values() [function.array-count-values]: Can only count STRING and INTEGER values!

Open in new window


The 2 sample CSV files that I'm using for testing are attached.

Any ideas?
- Yvan
csv1.csv
csv2.csv
Your strategy here would be to upload the CSV files, move them to a safe place, then read them with file() and follow the general design shown in the code.

Please see http://www.laprbass.com/RAY_temp_egoselfaxis.php

<?php // RAY_temp_egoselfaxis.php
error_reporting(E_ALL);
echo '<pre>';


// A FUNCTION TO FIND REPLICATED VALUES IN AN ARRAY
function array_not_unique($raw)
{
    // MAN PAGE: http://php.net/manual/en/function.array-count-values.php
    $new = array_count_values($raw);
    foreach ($new as $key => $val)
    {
       if ($val < 2) unset($new[$key]);
    }
    return $new;
}


// THE COLLECTION OF URLS
$urls = array
( 'http://filedb.experts-exchange.com/incoming/2013/07_w27/664096/csv1.csv'
, 'http://filedb.experts-exchange.com/incoming/2013/07_w27/664097/csv2.csv'
)
;

// READ THE DATA INTO A COMBINED ARRAY
$arr = array();
foreach ($urls as $url)
{
    $csv = file($url, FILE_IGNORE_NEW_LINES + FILE_SKIP_EMPTY_LINES);
    $arr = array_merge($arr, $csv);
}

// SHOW THE FUNCTION AT WORK
$common = array_not_unique($arr);
foreach ($common as $x => $n)
{
    echo PHP_EOL . "THE VALUE $x APPEARED $n TIMES";
}

Open in new window

Thanks Ray. I don't doubt that your re-written code is likely more efficient than mine.. but I'm already doing what you've suggested.. uploading the 2 CSV files successfully, ... moving them to a safe place, .. and then reading them and appending their contents to an array.

Can you at least point out where or why my code is failing?  I ask because I'm ending up with a proper array that contains the 20 email addresses that I expected to see in it -- but the function that's supposed to return the number of dupes in the array just spits out an error message .. instead of the number of dupes.  What is it that I've done wrong?

- Yvan
Sorry - I don't know what is wrong. It's not about efficiency, it's about functionality.  It looks to me that your script may be creating a multi-dimensional array, and I do not understand why that would be needed.  That is why I used array_merge() in my test case.

What is the output of this statement:

print_r($raw_array);
Here is the output of the array:

Array
(
    [0] => Array
        (
            [0] => yvan@egoselfaxis.com
        )

    [1] => Array
        (
            [0] => yvan@ideasdesigninc.com
        )

    [2] => Array
        (
            [0] => yvanator@gmail.com
        )

    [3] => Array
        (
            [0] => ivan@egoselfaxis.com
        )

    [4] => Array
        (
            [0] => ivan@ideasdesigninc.com
        )

    [5] => Array
        (
            [0] => ivanator@gmail.com
        )

    [6] => Array
        (
            [0] => yvon@egoselfaxis.com
        )

    [7] => Array
        (
            [0] => yvon@ideasdesigninc.com
        )

    [8] => Array
        (
            [0] => yvonator@gmail.com
        )

    [9] => Array
        (
            [0] => test@test.com
        )

    [10] => Array
        (
            [0] => yvan@egoselfaxis.com
        )

    [11] => Array
        (
            [0] => yvan@ideasdesigninc.com
        )

    [12] => Array
        (
            [0] => yvanator@gmail.com
        )

    [13] => Array
        (
            [0] => ivan2@egoselfaxis.com
        )

    [14] => Array
        (
            [0] => ivan2@ideasdesigninc.com
        )

    [15] => Array
        (
            [0] => ivanator2@gmail.com
        )

    [16] => Array
        (
            [0] => yvon2@egoselfaxis.com
        )

    [17] => Array
        (
            [0] => yvon2@ideasdesigninc.com
        )

    [18] => Array
        (
            [0] => yvonator2@gmail.com
        )

    [19] => Array
        (
            [0] => test2@test.com
        )

)

Open in new window

ASKER CERTIFIED SOLUTION
Avatar of Ray Paseur
Ray Paseur
Flag of United States of America image

Link to home
membership
Create a free account to see this answer
Signing up is free and takes 30 seconds. No credit card required.
See answer
Sweet --  that worked!

$raw_array = array();							
$csv1 = file($csv1, FILE_IGNORE_NEW_LINES + FILE_SKIP_EMPTY_LINES);			
$csv2 = file($csv2, FILE_IGNORE_NEW_LINES + FILE_SKIP_EMPTY_LINES);			
$raw_array = array_merge((array)$csv1, (array)$csv2);	

Open in new window


Thanks Ray!
- Yvan
Thanks for the points.  You might want to change line 4 from this:

$raw_array = array_merge((array)$csv1, (array)$csv2);      

to this

$raw_array = array_merge($csv1, $csv2);      

because file() returns an array and there is no need to re-cast the array.

Best regards, ~Ray
Thanks Ray!  I took you up on your suggestion.

- Yvan