egoselfaxis
asked on
Need help updating custom PHP script for counting duplicate emails in 2 uploaded CSV files
I wrote a custom PHP script a while back which allowed the user to upload 2 different CSV files, and it would then compare the 2 files and display a list of all of the duplicate email addresses that it found. I've recently been asked to apply some modifications to this script, and was wondering if someone here might be able to help.
Currently, when the script runs, it displays the number of duplicates at the bottom of the page, --- but it counts multiple emails in that total. So if the same email address is on both lists 5 times (and that can happen) the total is increased by 5.
What I need to do is find a way to calculate and display a total that counts the unique number of duplicated emails . In other words -- if the same email address is on both lists 5 times .. it would be counted as 1 duplicate email .. NOT 5.
How would I accomplish this? My code is below:
Thanks,
- Yvan
Currently, when the script runs, it displays the number of duplicates at the bottom of the page, --- but it counts multiple emails in that total. So if the same email address is on both lists 5 times (and that can happen) the total is increased by 5.
What I need to do is find a way to calculate and display a total that counts the unique number of duplicated emails . In other words -- if the same email address is on both lists 5 times .. it would be counted as 1 duplicate email .. NOT 5.
How would I accomplish this? My code is below:
<?php
if (isset($_POST['UPLOAD'])) {
$csv1filename = $_FILES['csv1']['tmp_name'];
$ext1 = strtoupper(pathinfo($_FILES['csv1']['name'], PATHINFO_EXTENSION));
$csv2filename = $_FILES['csv2']['tmp_name'];
$ext2 = strtoupper(pathinfo($_FILES['csv2']['name'], PATHINFO_EXTENSION));
if ($ext1 == 'CSV' && $ext2 == 'CSV') {
$csv1 = '/home/expert/public_html/dupefinder/csv1.csv';
move_uploaded_file($csv1filename, $csv1);
$csv2 = '/home/expert/public_html/dupefinder/csv2.csv';
move_uploaded_file($csv2filename, $csv2);
// CREATE THE EMPTY ARRAY
$raw_array = array();
$csv1 = file($csv1, FILE_IGNORE_NEW_LINES + FILE_SKIP_EMPTY_LINES);
$csv2 = file($csv2, FILE_IGNORE_NEW_LINES + FILE_SKIP_EMPTY_LINES);
$raw_array = array_merge($csv1, $csv2);
// FUNCTION TO COUNT DUPLICATE EMAILS IN THE ARRAY
function array_not_unique($raw) {
$new = array_count_values($raw);
foreach ($new as $key => $val) {
if ($val < 2) unset($new[$key]);
}
return $new;
}
$common = array_not_unique($raw_array);
// DISPLAY THE NUMBER OF DUPES FOUND IN THE ARRAY
$total = 0;
echo "<pre style='text-align:left;line-height:45px;'>";
foreach ($common as $x => $n) {
echo PHP_EOL . "* THE VALUE <strong style='color:red;'>$x</strong> APPEARED <span style='background-color:#FBCB45;padding:3px;'>$n</span> TIMES";
$total = $total + $n;
}
echo "</pre>";
echo "<br />A Total of <strong style='text-align:center;background-color:#FBCB45;padding:3px;'>$total</strong> duplicate email addresses were found.<br /><br /><br />";
echo '<a href="/dupefinder/" style="font-size:18px;text-decoration:none;font-weight:bold;color:blue;"><< Back</a>';
unlink('/home/expert/public_html/dupefinder/csv1.csv');
unlink('/home/expert/public_html/dupefinder/csv2.csv');
}
}
?>
Thanks,
- Yvan
Once you load CSV files into SQL(ite) database...
Change line 55 to read:
$total++;
foreach ($common as $x => $n) {
echo PHP_EOL . "* THE VALUE <strong style='color:red;'>$x</str ong> APPEARED <span style='background-color:#F BCB45;padd ing:3px;'> $n</span> TIMES";
$total++;
}
echo PHP_EOL . "* THE VALUE <strong style='color:red;'>$x</str
$total++;
}
Please show us some test data and we can give you a complete, tested and working example.
ASKER
Proof of fix. Note that I edited the output to not display the email addresses.
<?php
// -------------------------------------------
// Contents of ee-test.php
// I edited the file to:
// - enable running as command line, not a web request
// - provide static filenames (no POST data to use)
// - remove empty blank lines
// - prevent deletion of the source files
// - implement the fix I described in #40463892
// -------------------------------------------
/*if (isset($_POST['UPLOAD'])) {
$csv1filename = $_FILES['csv1']['tmp_name'];
$ext1 = strtoupper(pathinfo($_FILES['csv1']['name'], PATHINFO_EXTENSION));
$csv2filename = $_FILES['csv2']['tmp_name'];
$ext2 = strtoupper(pathinfo($_FILES['csv2']['name'], PATHINFO_EXTENSION));
if ($ext1 == 'CSV' && $ext2 == 'CSV') {
$csv1 = '/home/expert/public_html/dupefinder/csv1.csv';
move_uploaded_file($csv1filename, $csv1);
$csv2 = '/home/expert/public_html/dupefinder/csv2.csv';
move_uploaded_file($csv2filename, $csv2);*/
$csv1 = 'c:\bb\csv1.csv';
$csv2 = 'c:\bb\csv2.csv';
// CREATE THE EMPTY ARRAY
$raw_array = array();
$csv1 = file($csv1, FILE_IGNORE_NEW_LINES + FILE_SKIP_EMPTY_LINES);
$csv2 = file($csv2, FILE_IGNORE_NEW_LINES + FILE_SKIP_EMPTY_LINES);
$raw_array = array_merge($csv1, $csv2); // FUNCTION TO COUNT DUPLICATE EMAILS IN THE ARRAY
function array_not_unique($raw) {
$new = array_count_values($raw);
foreach ($new as $key => $val) {
if ($val < 2) unset($new[$key]);
}
return $new;
}
$common = array_not_unique($raw_array);
// DISPLAY THE NUMBER OF DUPES FOUND IN THE ARRAY
$total = 0;
echo "<pre style='text-align:left;line-height:45px;'>";
foreach ($common as $x => $n) {
echo PHP_EOL . "* THE VALUE <strong style='color:red;'>$x</strong> APPEARED <span style='background-color:#FBCB45;padding:3px;'>$n</span> TIMES";
/* $total = $total + $n;*/
$total++;
}
echo "</pre>";
echo "<br />A Total of <strong style='text-align:center;background-color:#FBCB45;padding:3px;'>$total</strong> duplicate email addresses were found.<br /><br /><br />";
echo '<a href="/dupefinder/" style="font-size:18px;text-decoration:none;font-weight:bold;color:blue;"><< Back</a>';
/* unlink('/home/expert/public_html/dupefinder/csv1.csv');
unlink('/home/expert/public_html/dupefinder/csv2.csv');
}
}*/
?>
C:\bb\www>php ee-test.php
<pre style='text-align:left;line-height:45px;'>* THE VALUE <strong style='color:red;'>email1@example.com</strong> APPEARED <span style='background-color:#FBCB45;padding:3px;'>2</span> TIMES* THE VALUE <strong style='color:red;'>email2@example.com</strong> APPEARED <span style='background-color:#FBCB45;padding:3px;'>2</span> TIMES* THE VALUE <strong style='color:red;'>email3@example.com</strong> APPEARED <span style='background-color:#FBCB45;padding:3px;'>2</span> TIMES</pre><br />A Total of <strong style='text-align:center;background-color:#FBCB45;padding:3px;'>3</strong> duplicate email addresses were found.<br /><br /><br /><a href="/dupefinder/" style="font-size:18px;text-decoration:none;font-weight:bold;color:blue;"><< Back</a>
C:\bb\www>
ASKER
Steve -- will this modified PHP script of yours ONLY run via the command line?
Or can I still run it as a web-based app?
- Yvan
Or can I still run it as a web-based app?
- Yvan
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start your array learning adventures here:
http://php.net/manual/en/ref.array.php
This script shows a few ways of looking at arrays:
http://iconoun.com/demo/temp_egoselfaxis.php
http://php.net/manual/en/ref.array.php
This script shows a few ways of looking at arrays:
http://iconoun.com/demo/temp_egoselfaxis.php
<?php // demo/temp_egoselfaxis.php
error_reporting(E_ALL);
echo '<pre>';
// SEE http://www.experts-exchange.com/Programming/Languages/Scripting/PHP/Q_28568673.html
// REF http://php.net/manual/en/function.array-unique.php
// REF http://php.net/manual/en/function.array-unique.php#81513
// REF http://php.net/manual/en/function.array-intersect.php
// TEST DATA FROM THE POST AT EE
$url1 = 'http://filedb.experts-exchange.com/incoming/2014/11_w48/884897/csv1.csv';
$csv1 = file_get_contents($url1);
$url2 = 'http://filedb.experts-exchange.com/incoming/2014/11_w48/884898/csv2.csv';
$csv2 = file_get_contents($url2);
// CHECK THE TEST DATA
echo PHP_EOL . htmlentities($csv1);
echo PHP_EOL;
echo PHP_EOL . htmlentities($csv2);
echo PHP_EOL;
// TRANSFORM THE CSV FILES INTO NORMALIZED ARRAYS (THERE ARE LOTS OF WAYS TO DO THIS)
$arr1 = explode(PHP_EOL, $csv1);
foreach ($arr1 as $key => $str) { $arr1[$key] = trim(strtolower($str)); }
$arr2 = explode(PHP_EOL, $csv2);
foreach ($arr2 as $key => $str) { $arr2[$key] = trim(strtolower($str)); }
// INSPECT THE NORMALIZED ARRAYS
var_dump($arr1, $arr2);
// FIND ALL OF THE EMAILS IN ARRAY 1 THAT ARE IN ARRAY 2
$new1 = array_intersect($arr1, $arr2);
echo PHP_EOL . "THESE EMAILS IN ARRAY 1 ARE ALSO IN ARRAY 2: ";
print_r($new1);
ASKER
Thanks Steve!
- Yvan
- Yvan