Solved

Need help updating custom PHP script for counting duplicate emails in 2 uploaded CSV files

Posted on 2014-11-24
10
155 Views
Last Modified: 2014-11-28
I wrote a custom PHP script a while back which allowed the user to upload 2 different CSV files, and it would then compare the 2 files and display a list of all of the duplicate email addresses that it found.  I've recently been asked to apply some modifications to this script, and was wondering if someone here might be able to help.

Currently, when the script runs, it displays the number of duplicates at the bottom of the page, --- but it counts multiple emails in that total.  So if the same email address is on both lists 5 times (and that can happen) the total is increased by 5.  

What I need to do is find a way to calculate and display a total that counts the unique number of duplicated emails .  In other words -- if the same email address is on both lists 5 times .. it would be counted as 1 duplicate email .. NOT 5.

How would I accomplish this?  My code is below:

<?php 
	
	if (isset($_POST['UPLOAD'])) { 
			
		$csv1filename = $_FILES['csv1']['tmp_name'];

		$ext1 = strtoupper(pathinfo($_FILES['csv1']['name'], PATHINFO_EXTENSION));								
		
		$csv2filename = $_FILES['csv2']['tmp_name'];
		
		$ext2 = strtoupper(pathinfo($_FILES['csv2']['name'], PATHINFO_EXTENSION));				

		if ($ext1 == 'CSV' && $ext2 == 'CSV')	{	

			$csv1 = '/home/expert/public_html/dupefinder/csv1.csv';					

			move_uploaded_file($csv1filename, $csv1);

			$csv2 = '/home/expert/public_html/dupefinder/csv2.csv';	

			move_uploaded_file($csv2filename, $csv2);					

			// CREATE THE EMPTY ARRAY
			
			$raw_array = array();				
			
			$csv1 = file($csv1, FILE_IGNORE_NEW_LINES + FILE_SKIP_EMPTY_LINES);
			
			$csv2 = file($csv2, FILE_IGNORE_NEW_LINES + FILE_SKIP_EMPTY_LINES);

			$raw_array = array_merge($csv1, $csv2); 			
			
			// FUNCTION TO COUNT DUPLICATE EMAILS IN THE ARRAY
			
			function array_not_unique($raw) {
				$new = array_count_values($raw);				
				foreach ($new as $key => $val) {
				   if ($val < 2) unset($new[$key]);
				}				
				return $new;				
			}		

			$common = array_not_unique($raw_array);
			
			// DISPLAY THE NUMBER OF DUPES FOUND IN THE ARRAY
			
			$total = 0;
			
			echo "<pre style='text-align:left;line-height:45px;'>";	

			foreach ($common as $x => $n) {
			
				echo PHP_EOL . "* THE VALUE <strong style='color:red;'>$x</strong> APPEARED <span style='background-color:#FBCB45;padding:3px;'>$n</span> TIMES";
				
				$total = $total + $n;
				
			}	
			
			echo "</pre>";
			
			echo "<br />A Total of <strong style='text-align:center;background-color:#FBCB45;padding:3px;'>$total</strong> duplicate email addresses were found.<br /><br /><br />";
			
			echo '<a href="/dupefinder/" style="font-size:18px;text-decoration:none;font-weight:bold;color:blue;">&lt;&lt; Back</a>';
			
			unlink('/home/expert/public_html/dupefinder/csv1.csv');
			
			unlink('/home/expert/public_html/dupefinder/csv2.csv');							
			
		}	
	
	} 

?>

Open in new window


Thanks,
- Yvan
0
Comment
Question by:egoselfaxis
  • 3
  • 3
  • 2
  • +2
10 Comments
 
LVL 61

Expert Comment

by:gheist
ID: 40463884
Once you load CSV files into SQL(ite) database...
0
 
LVL 50

Expert Comment

by:Steve Bink
ID: 40463892
Change line 55 to read:
$total++;

Open in new window

0
 
LVL 17

Expert Comment

by:shinuq
ID: 40464067
foreach ($common as $x => $n) {
                  
                        echo PHP_EOL . "* THE VALUE <strong style='color:red;'>$x</strong> APPEARED <span style='background-color:#FBCB45;padding:3px;'>$n</span> TIMES";
                        
                        $total++;
                        
                  }
0
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 40464241
Please show us some test data and we can give you a complete, tested and working example.
0
 

Author Comment

by:egoselfaxis
ID: 40464535
Some sample data / CSV files are attached -- the contain email addresses only.

- Yvan
csv1.csv
csv2.csv
0
Highfive Gives IT Their Time Back

Highfive is so simple that setting up every meeting room takes just minutes and every employee will be able to start or join a call from any room with ease. Never be called into a meeting just to get it started again. This is how video conferencing should work!

 
LVL 50

Expert Comment

by:Steve Bink
ID: 40464684
Proof of fix.  Note that I edited the output to not display the email addresses.
<?php
// -------------------------------------------
// Contents of ee-test.php
// I edited the file to:
//   - enable running as command line, not a web request
//   - provide static filenames (no POST data to use)
//   - remove empty blank lines
//   - prevent deletion of the source files 
//   - implement the fix I described in #40463892
// -------------------------------------------
/*if (isset($_POST['UPLOAD'])) {
  $csv1filename = $_FILES['csv1']['tmp_name'];
  $ext1 = strtoupper(pathinfo($_FILES['csv1']['name'], PATHINFO_EXTENSION));
  $csv2filename = $_FILES['csv2']['tmp_name'];
  $ext2 = strtoupper(pathinfo($_FILES['csv2']['name'], PATHINFO_EXTENSION));
  if ($ext1 == 'CSV' && $ext2 == 'CSV')	{
    $csv1 = '/home/expert/public_html/dupefinder/csv1.csv';
    move_uploaded_file($csv1filename, $csv1);

    $csv2 = '/home/expert/public_html/dupefinder/csv2.csv';
    move_uploaded_file($csv2filename, $csv2);*/
    $csv1 = 'c:\bb\csv1.csv';
    $csv2 = 'c:\bb\csv2.csv';
    // CREATE THE EMPTY ARRAY
    $raw_array = array();
    $csv1 = file($csv1, FILE_IGNORE_NEW_LINES + FILE_SKIP_EMPTY_LINES);
    $csv2 = file($csv2, FILE_IGNORE_NEW_LINES + FILE_SKIP_EMPTY_LINES);

    $raw_array = array_merge($csv1, $csv2); 							// FUNCTION TO COUNT DUPLICATE EMAILS IN THE ARRAY
    function array_not_unique($raw) {
      $new = array_count_values($raw);
      foreach ($new as $key => $val) {
        if ($val < 2) unset($new[$key]);
      }
      return $new;
    }
    $common = array_not_unique($raw_array);
    // DISPLAY THE NUMBER OF DUPES FOUND IN THE ARRAY
    $total = 0;
    echo "<pre style='text-align:left;line-height:45px;'>";
    foreach ($common as $x => $n) {
      echo PHP_EOL . "* THE VALUE <strong style='color:red;'>$x</strong> APPEARED <span style='background-color:#FBCB45;padding:3px;'>$n</span> TIMES";
      /* $total = $total + $n;*/
      $total++;
    }
    echo "</pre>";
    echo "<br />A Total of <strong style='text-align:center;background-color:#FBCB45;padding:3px;'>$total</strong> duplicate email addresses were found.<br /><br /><br />";
    echo '<a href="/dupefinder/" style="font-size:18px;text-decoration:none;font-weight:bold;color:blue;">&lt;&lt; Back</a>';
/*    unlink('/home/expert/public_html/dupefinder/csv1.csv');
    unlink('/home/expert/public_html/dupefinder/csv2.csv');
  }
}*/
?>

C:\bb\www>php ee-test.php
<pre style='text-align:left;line-height:45px;'>* THE VALUE <strong style='color:red;'>email1@example.com</strong> APPEARED <span style='background-color:#FBCB45;padding:3px;'>2</span> TIMES* THE VALUE <strong style='color:red;'>email2@example.com</strong> APPEARED <span style='background-color:#FBCB45;padding:3px;'>2</span> TIMES* THE VALUE <strong style='color:red;'>email3@example.com</strong> APPEARED <span style='background-color:#FBCB45;padding:3px;'>2</span> TIMES</pre><br />A Total of <strong style='text-align:center;background-color:#FBCB45;padding:3px;'>3</strong> duplicate email addresses were found.<br /><br /><br /><a href="/dupefinder/" style="font-size:18px;text-decoration:none;font-weight:bold;color:blue;">&lt;&lt; Back</a>

C:\bb\www>

Open in new window

0
 

Author Comment

by:egoselfaxis
ID: 40464695
Steve -- will this modified PHP script of yours ONLY run via the command line?
Or can I still run it as a web-based app?

- Yvan
0
 
LVL 50

Accepted Solution

by:
Steve Bink earned 500 total points
ID: 40465086
If you implement the single change I suggested in #40463892, the script will continue to work exactly as it does now, only with the count correction you requested.  The changes I made for the purpose of my demonstration were solely to work around the lack of a web environment.  The only change that really affected the core functionality is the "total++" modification (line 55 in your original code, lines 43-44 in my demonstration code).
0
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 40465965
Start your array learning adventures here:
http://php.net/manual/en/ref.array.php

This script shows a few ways of looking at arrays:
http://iconoun.com/demo/temp_egoselfaxis.php

<?php // demo/temp_egoselfaxis.php
error_reporting(E_ALL);
echo '<pre>';

// SEE http://www.experts-exchange.com/Programming/Languages/Scripting/PHP/Q_28568673.html
// REF http://php.net/manual/en/function.array-unique.php
// REF http://php.net/manual/en/function.array-unique.php#81513
// REF http://php.net/manual/en/function.array-intersect.php

// TEST DATA FROM THE POST AT EE
$url1 = 'http://filedb.experts-exchange.com/incoming/2014/11_w48/884897/csv1.csv';
$csv1 = file_get_contents($url1);
$url2 = 'http://filedb.experts-exchange.com/incoming/2014/11_w48/884898/csv2.csv';
$csv2 = file_get_contents($url2);

// CHECK THE TEST DATA
echo PHP_EOL . htmlentities($csv1);
echo PHP_EOL;
echo PHP_EOL . htmlentities($csv2);
echo PHP_EOL;

// TRANSFORM THE CSV FILES INTO NORMALIZED ARRAYS (THERE ARE LOTS OF WAYS TO DO THIS)
$arr1 = explode(PHP_EOL, $csv1);
foreach ($arr1 as $key => $str) { $arr1[$key] = trim(strtolower($str)); }
$arr2 = explode(PHP_EOL, $csv2);
foreach ($arr2 as $key => $str) { $arr2[$key] = trim(strtolower($str)); }

// INSPECT THE NORMALIZED ARRAYS
var_dump($arr1, $arr2);

// FIND ALL OF THE EMAILS IN ARRAY 1 THAT ARE IN ARRAY 2
$new1 = array_intersect($arr1, $arr2);
echo PHP_EOL . "THESE EMAILS IN ARRAY 1 ARE ALSO IN ARRAY 2: ";
print_r($new1);

Open in new window

0
 

Author Closing Comment

by:egoselfaxis
ID: 40470712
Thanks Steve!  

- Yvan
0

Featured Post

Do You Know the 4 Main Threat Actor Types?

Do you know the main threat actor types? Most attackers fall into one of four categories, each with their own favored tactics, techniques, and procedures.

Join & Write a Comment

Suggested Solutions

These days socially coordinated efforts have turned into a critical requirement for enterprises.
Boost your ability to deliver ambitious and competitive web apps by choosing the right JavaScript framework to best suit your project’s needs.
The viewer will learn how to count occurrences of each item in an array.
The viewer will learn how to create and use a small PHP class to apply a watermark to an image. This video shows the viewer the setup for the PHP watermark as well as important coding language. Continue to Part 2 to learn the core code used in creat…

743 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

11 Experts available now in Live!

Get 1:1 Help Now