egoselfaxis
asked on
Reporting the number of duplicate email addresses in 2 comma delimited CSV files using PHP
I need to write a PHP script that reads the contents of 2 separate CSV files --- (which each contain a single column in that that lists only email addresses) --- and that reports on the number of duplicate email addresses that are found between them.
I've already coded a simple web form that successfully uploads the 2 separate CSV files to the server:
I now just need to read the contents of each of them and report on the number of dupes found. I don't need to display the actual duplicate email addresses. I just need to display the number of dupes that were found.
Can anybody here help me finish my script? Or does anyone have a PHP based code example that I could use for reference which generates this type of report?
Thanks!
- Yvan
I've already coded a simple web form that successfully uploads the 2 separate CSV files to the server:
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Pilot Workshop :: Dupe Finder</title>
</head>
<body>
<br /><br />
<form name="dupeFinderForm" id="dupeFinderForm" method="post" enctype="multipart/form-data" action="" style="width:300px;margin:0 auto;padding:25px;text-align:center;font-family:Arial,Helvetica;border:5px double #333;">
<?php if (isset($_POST['UPLOAD'])) {
$csv1filename = $_FILES['csv1']['tmp_name'];
$csv1 = '/home/expert/public_html/dupefinder/csv1.csv';
move_uploaded_file($csv1filename, $csv1);
$csv2filename = $_FILES['csv2']['tmp_name'];
$csv2 = '/home/expert/public_html/dupefinder/csv2.csv';
move_uploaded_file($csv2filename, $csv2);
} else { ?>
<h1>Dupe Finder</h1><br /><br />
<label style="font-weight:bold;">CSV File #1</label><br />
<input type="file" name="csv1" id="csv1" size="30" style="border:1px solid #333;" /><br /><br /><br /><br />
<label style="font-weight:bold;">CSV File #2</label><br />
<input type="file" name="csv2" id="csv2" size="30" style="border:1px solid #333;" /><br /><br /><br /><br />
<input type="submit" name="UPLOAD" id="UPLOAD" value="UPLOAD" style="border:1px solid #333;background-color:#93BF00;color:#333;font-weight:bold;padding:7px;" />
<?php } ?>
</form>
</body>
</html>
I now just need to read the contents of each of them and report on the number of dupes found. I don't need to display the actual duplicate email addresses. I just need to display the number of dupes that were found.
Can anybody here help me finish my script? Or does anyone have a PHP based code example that I could use for reference which generates this type of report?
Thanks!
- Yvan
ASKER
The example you've provided a link for seems to suggest that all of the email addresses from both data files should be placed inside the same single array. This might work -- though I'm not exactly sure how I would integrate this with what I have already.
If I remember correctly -- I can append data to an array using the array_push() function -- and I can then count the number of elements in that array using the count() function.
Could you perhaps give me an example as to how I would stitch all of this together?
Thanks,
- Yvan
If I remember correctly -- I can append data to an array using the array_push() function -- and I can then count the number of elements in that array using the count() function.
Could you perhaps give me an example as to how I would stitch all of this together?
Thanks,
- Yvan
Try combining the arrays like this:
$new = $arr1 + $arr2;
This code example seems to be close to something you could modify for your own needs.
Best regards, ~Ray
$new = $arr1 + $arr2;
This code example seems to be close to something you could modify for your own needs.
<?php // RAY_array_not_unique.php
error_reporting(E_ALL);
echo "<pre>";
// A FUNCTION TO FIND REPLICATED VALUES IN AN ARRAY
function array_not_unique($raw)
{
// MAN PAGE: http://php.net/manual/en/function.array-count-values.php
$new = array_count_values($raw);
foreach ($new as $key => $val)
{
if ($val < 2) unset($new[$key]);
}
return $new;
}
// SOME TEST DATA
$raw_array = array();
$raw_array[] = 'abc@xyz.com';
$raw_array[] = 'def@xyz.com';
$raw_array[] = 'ghi@xyz.com';
$raw_array[] = 'jkl@xyz.com';
$raw_array[] = 'mno@xyz.com';
$raw_array[] = 'pqr@xyz.com';
$raw_array[] = 'stu@xyz.com';
// SOME DUPLICATES
$raw_array[] = 'abc@xyz.com';
$raw_array[] = 'jkl@xyz.com';
$raw_array[] = 'abc@xyz.com';
$raw_array[] = 'def@xyz.com';
// SHOW THE FUNCTION AT WORK
$common = array_not_unique($raw_array);
foreach ($common as $x => $n)
{
echo PHP_EOL . "THE VALUE $x APPEARED $n TIMES";
}
If you want to post the test data, I'll be glad to try to give you a code example.Best regards, ~Ray
ASKER
Here's what I've got so far:
It's partially working -- but for some reason, I'm getting an error message, -- and the function isn't returning a value.
The 2 sample CSV files that I'm using for testing are attached.
Any ideas?
- Yvan
csv1.csv
csv2.csv
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Pilot Workshop :: Dupe Finder</title>
</head>
<body>
<br /><br />
<form name="dupeFinderForm" id="dupeFinderForm" method="post" enctype="multipart/form-data" action="" style="width:300px;margin:0 auto;padding:25px;text-align:center;font-family:Arial,Helvetica;border:5px double #333;">
<?php if (isset($_POST['UPLOAD'])) {
//error_reporting(E_ALL);
$csv1filename = $_FILES['csv1']['tmp_name'];
$csv1 = '/home/expert/public_html/dupefinder/csv1.csv';
move_uploaded_file($csv1filename, $csv1);
$csv2filename = $_FILES['csv2']['tmp_name'];
$csv2 = '/home/expert/public_html/dupefinder/csv2.csv';
move_uploaded_file($csv2filename, $csv2);
// CREATE THE EMPTY ARRAY
$raw_array = array();
// LOOP THROUGH THE 1ST CSV FILE
if (($handle = fopen("$csv1", "r")) !== FALSE) {
$key = 0;
while (($data = fgetcsv($handle, 0, ",")) !== FALSE) {
$c = count($data);
for ($x=0;$x<$c;$x++) {
$raw_array[$key][$x] = $data[$x];
}
$key++;
}
fclose($handle);
}
// LOOP THROUGH THE 2ND CSV FILE
if (($handle = fopen("$csv2", "r")) !== FALSE) {
while (($data = fgetcsv($handle, 0, ",")) !== FALSE) {
$c = count($data);
for ($x=0;$x<$c;$x++) {
$raw_array[$key][$x] = $data[$x];
}
$key++;
}
fclose($handle);
}
// DISPLAY CONTENTS OF THE ARRAY
echo "<pre >";
print_r($raw_array);
echo "</pre>";
// FUNCTION TO COUNT DUPLICATE EMAILS IN THE ARRAY
function array_not_unique($raw) {
$new = array_count_values($raw);
foreach ($new as $key => $val) {
if ($val < 2) unset($new[$key]);
}
return $new;
}
$common = array_not_unique($raw_array);
// DISPLAY THE NUMBER OF DUPES FOUND IN THE ARRAY
foreach ($common as $x => $n) {
echo PHP_EOL . "THE VALUE $x APPEARED $n TIMES";
}
} else { ?>
<h1>Dupe Finder</h1><br /><br />
<label style="font-weight:bold;">CSV File #1</label><br />
<input type="file" name="csv1" id="csv1" size="30" style="border:1px solid #333;" /><br /><br /><br /><br />
<label style="font-weight:bold;">CSV File #2</label><br />
<input type="file" name="csv2" id="csv2" size="30" style="border:1px solid #333;" /><br /><br /><br /><br />
<input type="submit" name="UPLOAD" id="UPLOAD" value="UPLOAD" style="border:1px solid #333;background-color:#93BF00;color:#333;font-weight:bold;padding:7px;" />
<?php } ?>
</form>
</body>
</html>
It's partially working -- but for some reason, I'm getting an error message, -- and the function isn't returning a value.
Warning: array_count_values() [function.array-count-values]: Can only count STRING and INTEGER values!
The 2 sample CSV files that I'm using for testing are attached.
Any ideas?
- Yvan
csv1.csv
csv2.csv
Your strategy here would be to upload the CSV files, move them to a safe place, then read them with file() and follow the general design shown in the code.
Please see http://www.laprbass.com/RAY_temp_egoselfaxis.php
Please see http://www.laprbass.com/RAY_temp_egoselfaxis.php
<?php // RAY_temp_egoselfaxis.php
error_reporting(E_ALL);
echo '<pre>';
// A FUNCTION TO FIND REPLICATED VALUES IN AN ARRAY
function array_not_unique($raw)
{
// MAN PAGE: http://php.net/manual/en/function.array-count-values.php
$new = array_count_values($raw);
foreach ($new as $key => $val)
{
if ($val < 2) unset($new[$key]);
}
return $new;
}
// THE COLLECTION OF URLS
$urls = array
( 'http://filedb.experts-exchange.com/incoming/2013/07_w27/664096/csv1.csv'
, 'http://filedb.experts-exchange.com/incoming/2013/07_w27/664097/csv2.csv'
)
;
// READ THE DATA INTO A COMBINED ARRAY
$arr = array();
foreach ($urls as $url)
{
$csv = file($url, FILE_IGNORE_NEW_LINES + FILE_SKIP_EMPTY_LINES);
$arr = array_merge($arr, $csv);
}
// SHOW THE FUNCTION AT WORK
$common = array_not_unique($arr);
foreach ($common as $x => $n)
{
echo PHP_EOL . "THE VALUE $x APPEARED $n TIMES";
}
ASKER
Thanks Ray. I don't doubt that your re-written code is likely more efficient than mine.. but I'm already doing what you've suggested.. uploading the 2 CSV files successfully, ... moving them to a safe place, .. and then reading them and appending their contents to an array.
Can you at least point out where or why my code is failing? I ask because I'm ending up with a proper array that contains the 20 email addresses that I expected to see in it -- but the function that's supposed to return the number of dupes in the array just spits out an error message .. instead of the number of dupes. What is it that I've done wrong?
- Yvan
Can you at least point out where or why my code is failing? I ask because I'm ending up with a proper array that contains the 20 email addresses that I expected to see in it -- but the function that's supposed to return the number of dupes in the array just spits out an error message .. instead of the number of dupes. What is it that I've done wrong?
- Yvan
Sorry - I don't know what is wrong. It's not about efficiency, it's about functionality. It looks to me that your script may be creating a multi-dimensional array, and I do not understand why that would be needed. That is why I used array_merge() in my test case.
What is the output of this statement:
print_r($raw_array);
What is the output of this statement:
print_r($raw_array);
ASKER
Here is the output of the array:
Array
(
[0] => Array
(
[0] => yvan@egoselfaxis.com
)
[1] => Array
(
[0] => yvan@ideasdesigninc.com
)
[2] => Array
(
[0] => yvanator@gmail.com
)
[3] => Array
(
[0] => ivan@egoselfaxis.com
)
[4] => Array
(
[0] => ivan@ideasdesigninc.com
)
[5] => Array
(
[0] => ivanator@gmail.com
)
[6] => Array
(
[0] => yvon@egoselfaxis.com
)
[7] => Array
(
[0] => yvon@ideasdesigninc.com
)
[8] => Array
(
[0] => yvonator@gmail.com
)
[9] => Array
(
[0] => test@test.com
)
[10] => Array
(
[0] => yvan@egoselfaxis.com
)
[11] => Array
(
[0] => yvan@ideasdesigninc.com
)
[12] => Array
(
[0] => yvanator@gmail.com
)
[13] => Array
(
[0] => ivan2@egoselfaxis.com
)
[14] => Array
(
[0] => ivan2@ideasdesigninc.com
)
[15] => Array
(
[0] => ivanator2@gmail.com
)
[16] => Array
(
[0] => yvon2@egoselfaxis.com
)
[17] => Array
(
[0] => yvon2@ideasdesigninc.com
)
[18] => Array
(
[0] => yvonator2@gmail.com
)
[19] => Array
(
[0] => test2@test.com
)
)
ASKER CERTIFIED SOLUTION
membership
Create a free account to see this answer
Signing up is free and takes 30 seconds. No credit card required.
ASKER
Sweet -- that worked!
Thanks Ray!
- Yvan
$raw_array = array();
$csv1 = file($csv1, FILE_IGNORE_NEW_LINES + FILE_SKIP_EMPTY_LINES);
$csv2 = file($csv2, FILE_IGNORE_NEW_LINES + FILE_SKIP_EMPTY_LINES);
$raw_array = array_merge((array)$csv1, (array)$csv2);
Thanks Ray!
- Yvan
Thanks for the points. You might want to change line 4 from this:
$raw_array = array_merge((array)$csv1, (array)$csv2);
to this
$raw_array = array_merge($csv1, $csv2);
because file() returns an array and there is no need to re-cast the array.
Best regards, ~Ray
$raw_array = array_merge((array)$csv1, (array)$csv2);
to this
$raw_array = array_merge($csv1, $csv2);
because file() returns an array and there is no need to re-cast the array.
Best regards, ~Ray
ASKER
Thanks Ray! I took you up on your suggestion.
- Yvan
- Yvan
http://www.php.net/manual/en/function.array-unique.php#81513