Solved

Do I make this process faster by creating a csv file right off the bat?

Posted on 2014-10-07 · 134 Views · Last Modified: 2014-10-10
Here it comes...a bunch of data that I want to collect and then put into a csv file that my user can download...

$don="SELECT * FROM verizon_test WHERE (geo_coords_0>=$latitude_1 and geo_coords_0<=$latitude_2) and (geo_coords_1>=$longitude_1 and geo_coords_1<=$longitude_2) and
(posted_day>='$start_date' and posted_day<='$end_date') LIMIT 400000";
$don_query=mysqli_query($cxn, $don);
	if(!$don_query)
	{
	$nuts=mysqli_errno($cxn).': '.mysqli_error($cxn);
	die($nuts);
	}
$don_count=mysqli_num_rows($don_query);
	if($don_count==0)
	{
	header("Location:no_results.php");
	exit;
	}
	else
	{
	$crystal = "delete from twitter_csv";
	$crystal_query=mysqli_query($cxn, $crystal);
	}
$row_number=1;


After I've established that there's some data, I loop through the rows, displaying each one in an HTML table and inserting it into the twitter_csv table:

<?php	
		while($don_row=mysqli_fetch_assoc($don_query))
		{
		extract($don_row);
		?>
	<tr>
		<td class="twitter_data"><?php echo $row_number;?></td>
		<td class="twitter_data"><?php echo date("m/d/Y", strtotime($don_row['posted_day'])); ?></td>
		<td class="twitter_data"><?php echo $don_row['actor_id']; ?></td>
		<td class="twitter_data"><?php echo stripslashes($don_row['actor_display_name']); ?></td>
		<td class="twitter_data"><?php echo $don_row['geo_coords_0']; ?></td>
		<td class="twitter_data"><?php echo $don_row['geo_coords_1']; ?></td>
		<td class="twitter_data"><?php echo stripslashes($don_row['location_name']); ?></td>
	</tr>
	<?php
		$row_number=$row_number+1;
		//since you've got results, we're going to add these rows to a table and from there, we'll make our csv page
		$the_actor_id=$don_row['actor_id'];
		$the_actor_display_name= $don_row['actor_display_name'];
		$the_posted_time=$don_row['posted_time'];
		$the_display_name=$don_row['display_name'];
		$the_geo_coords_0=$don_row['geo_coords_0'];
		$the_geo_coords_1=$don_row['geo_coords_1'];
		$the_location_name=$don_row['location_name'];
		$the_posted_day=$don_row['posted_day'];
		$insert = "insert into twitter_csv (actor_id, actor_display_name, posted_time, display_name, geo_coords_0, geo_coords_1, location_name, posted_day) 
		values ('$the_actor_id', '$the_actor_display_name', '$the_posted_time', '$the_display_name', '$the_geo_coords_0', '$the_geo_coords_1', '$the_location_name', '$the_posted_day')";
			$insertexe = mysqli_query($cxn, $insert);
			if(!$insertexe) {
			$error = mysqli_errno($cxn).': '.mysqli_error($cxn);
			die($error);
			}
		}
	?>
</table><br><br>
<div style="text-align:center;">click <a href="twitter_csv.php">here</a> to download your csv file</div><br><br>
</body>


From there, I direct the user out to another page where they can download the csv file, which gets all of its data from the twitter_csv table.
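For reference, a page like twitter_csv.php usually just sends CSV headers and streams the table to the browser. Here's a minimal sketch of that idea (hypothetical, since the real twitter_csv.php isn't shown in this thread; it assumes the same $cxn connection and the twitter_csv columns from the INSERT above):

<?php // hypothetical sketch of twitter_csv.php; the real file isn't part of this thread
// assumes $cxn is the same mysqli connection used elsewhere on these pages
header('Content-Type: text/csv');
header('Content-Disposition: attachment; filename="twitter.csv"');

$sql = "SELECT actor_id, actor_display_name, posted_time, display_name,
        geo_coords_0, geo_coords_1, location_name, posted_day FROM twitter_csv";
$result = mysqli_query($cxn, $sql);
if (!$result) die(mysqli_errno($cxn).': '.mysqli_error($cxn));

$out = fopen('php://output', 'w'); // stream straight to the response body
while ($row = mysqli_fetch_assoc($result)) {
    fputcsv($out, $row);           // one CSV line per table row
}
fclose($out);

Streaming to php://output this way means the CSV never has to sit on disk or in memory as one big string.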

My question is: Would I be making the process quicker by forgoing inserting the records into the twitter_csv table and instead sticking them all inside a csv file that my user can download?

I was playing with this:

$don="SELECT actor_id, actor_display_name, posted_time, actor_display_name, geo_coords_lat, geo_coords_lon, location_name, posted_day FROM verizon_test WHERE (geo_coords_lat>='$latitude_1' and geo_coords_lat<='$latitude_2') and (geo_coords_lon>='$longitude_1' and geo_coords_lon<='$longitude_2') and
(posted_day>='$start_date' and posted_day<='$end_date') LIMIT 400000";
$don_query=mysqli_query($cxn, $don);
	if(!$don_query)
	{
	$nuts=mysqli_errno($cxn).': '.mysqli_error($cxn);
	die($nuts);
	}
$don_count=mysqli_num_rows($don_query);
	if($don_count==0)
	{
	header("Location:no_results.php");
	exit;
	}
	else
	{
	$fp = fopen('file.csv', 'w');

		while ($row = mysqli_fetch_array($don_query, MYSQLI_ASSOC)) {
			fputcsv($fp, $row);
		}

		fclose($fp);
	}


I couldn't get it to work, but I would like to explore this option, only because I've got a csv file being built with this process that could be over 50MB (400,000 rows). So, I want to be wise in the way I do this.

How do I do it?

I've got some sample data attached in case that proves helpful.
verizon-data.csv

Question by: brucegust
 
Expert Comment by: Ray Paseur (LVL 108) · ID: 40367240
While I see the LIMIT 400000 clause, I don't know how many records are really going to appear in the result set. If it's a small number, performance will not be an issue. If it's a larger number, the latter method (the one that goes from database to CSV directly) is going to be the best way to get reasonable speed.

Expert Comment by: Ray Paseur (LVL 108) · ID: 40367254
Could you please post the CREATE TABLE statement for "verizon_test"? Thanks.

Assisted Solution by: gr8gonzo (LVL 34, earned 150 total points) · ID: 40367281
Your bigger problem with the 400,000 scenario is more likely going to lie with memory. When you query 400k records, all those records are pulled into PHP's memory space, so they're sitting there as you iterate through the record set. It might be 50 megs of raw data, but with all the memory structures, you're likely looking at a much higher amount of memory necessary JUST to hold the MySQL result in memory.

If you're trying to keep peak memory usage down, use smaller groups, like 10,000 records at a time. That way, you're only holding 10k records in memory at any given time. You may need to make more queries in order to fetch all the data, but the increase in overhead between 1 query and 40 queries is fairly minimal with the amount of data you're processing.
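A minimal sketch of that batching idea (an illustration rather than gr8gonzo's code: it assumes the $cxn connection, abbreviates the WHERE clause down to the date test, and orders by the auto-increment id so batches don't overlap):

<?php // fetch 10,000 rows at a time instead of 400,000 at once
$batch_size = 10000;
$offset     = 0;
do {
    $sql = "SELECT * FROM verizon_test
            WHERE posted_day >= '$start_date' AND posted_day <= '$end_date'
            ORDER BY id ASC
            LIMIT $batch_size OFFSET $offset";
    $result = mysqli_query($cxn, $sql);
    if (!$result) die(mysqli_errno($cxn).': '.mysqli_error($cxn));

    $fetched = mysqli_num_rows($result);
    while ($row = mysqli_fetch_assoc($result)) {
        // write each row out here (e.g. fputcsv) instead of accumulating them
    }
    mysqli_free_result($result); // release this batch before fetching the next
    $offset += $batch_size;
} while ($fetched == $batch_size); // a short batch means we've reached the end

The id-based variation discussed later in this thread avoids the cost of scanning past large OFFSET values.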

Expert Comment by: Ray Paseur (LVL 108) · ID: 40367308
@gr8gonzo: Good point. 1,000 CSV records == 152 KB, so it follows that 400,000 records == 59 MB. But... (and this is a big but) PHP array and object structures can take as much as 20X the storage required by the string data, which would imply a memory requirement of about 1.2 GB. Don't use PDO!
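To measure rather than estimate, PHP can report the actual peak cost of holding a buffered result; a quick sketch (assuming the $cxn connection and the $don query string from the question):

<?php // rough memory check for the buffered, hold-everything approach
$res  = mysqli_query($cxn, $don);   // the big SELECT from the question
$rows = array();
while ($row = mysqli_fetch_assoc($res)) {
    $rows[] = $row;                 // keep every row, as the original page effectively does
}
printf("%d rows, peak memory: %.1f MB\n",
    count($rows), memory_get_peak_usage(true) / 1048576);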

@brucegust: Since it looks like there may be a geospatial component to the data set, you might want to read some background information on using geocodes in PHP and MySQL.  These articles are not directly applicable to your work, but they have some insights that may be helpful as you go forward.
http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/A_4276-What-is-near-me-Proximity-calculations-using-PHP-and-MySQL.html
http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/A_9854-Using-GeoCodes-to-Find-an-Average-Location.html
http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/A_3350-Using-the-Google-Maps-API-in-PHP.html

Expert Comment by: Ray Paseur (LVL 108) · ID: 40368358
I have a partial test case set up using the 1,000 rows in the attachment.  In order to proceed with a test, I'll need to know what values you want to use in the query.  What should these be?

latitude_1
latitude_2
longitude_1
longitude_2
start_date
end_date

Author Comment by: brucegust · ID: 40368432
Morning, boys!

Ray, here's my CREATE TABLE statement:

CREATE TABLE `verizon_data` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `twitter_id` varchar(150) NOT NULL,
  `actor_id` varchar(150) NOT NULL,
  `actor_display_name` varchar(150) NOT NULL,
  `posted_time` varchar(150) NOT NULL,
  `display_name` varchar(150) NOT NULL,
  `geo_coords_lat` varchar(150) NOT NULL,
  `geo_coords_lon` varchar(150) NOT NULL,
  `location_name` varchar(150) NOT NULL,
  `posted_day` date NOT NULL,
  PRIMARY KEY (`id`),
  KEY `posted_day` (`posted_day`),
  KEY `geo_coords_1` (`geo_coords_lon`),
  KEY `geo_coords_0` (`geo_coords_lat`)
) ENGINE=InnoDB AUTO_INCREMENT=500001 DEFAULT CHARSET=latin1
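One schema detail worth flagging: geo_coords_lat and geo_coords_lon are VARCHAR columns, so the numeric range tests in the WHERE clause force a string-to-number conversion on every row, and the geo_coords indexes generally can't be used for the range scan. A possible fix, offered as a suggestion rather than something proposed in the thread (the DECIMAL precision mirrors the test table in the accepted solution below):

ALTER TABLE `verizon_data`
  MODIFY `geo_coords_lat` DECIMAL(16,9) NOT NULL DEFAULT 0.0,
  MODIFY `geo_coords_lon` DECIMAL(16,9) NOT NULL DEFAULT 0.0;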


As far as the insight the two of you have offered thus far, let me explain it back to you so I can be certain I'm understanding WHY things need to be done a certain way and not just THAT it has to be done that way.

However properly a table has been indexed, and while good indexing may dramatically improve the speed of a query, you still have to take into consideration the amount of memory you're using as the number of rows you're iterating through increases.

With my situation, it seems like what I need to do to best serve my user is determine how many rows I'm looking at. Grab the count first and then, based on the number of rows, route the process in a way where I'm either exporting the recordset directly into a csv file, or doing a series of queries that insert records into a temporary table 10K records at a time.

If that resonates with y'all, assuming I've got a large recordset - over 300K - how do I grab those records 10K at a time without grabbing the same ones over and over again?

So, if my query is, hypothetically "select * from Verizon_test where ... LIMIT 10000" - I do that once knowing there's 300K results. How do I structure my next query so that I'm grabbing the next 10K?

What do you think?

Author Comment by: brucegust · ID: 40368452
Hey, Ray!

I just saw your post. Here's some test data:

Based on the data that I gave you,

latitude_1: 25
latitude_2: 36
longitude_1: -100
longitude_2: -86
start_date: 2013-07-04
end_date: 2013-08-06

Based on the data that I'm working with:

latitude_1: 26
latitude_2: 26
longitude_1: -80
longitude_2: -79
start_date: 2013-08-06
end_date: 2013-08-10

That's a sample query from one of my users.
CREATE TABLE `verizon_data` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `twitter_id` varchar(150) NOT NULL,
  `actor_id` varchar(150) NOT NULL,
  `actor_display_name` varchar(150) NOT NULL,
  `posted_time` varchar(150) NOT NULL,
  `display_name` varchar(150) NOT NULL,
  `geo_coords_lat` varchar(150) NOT NULL,
  `geo_coords_lon` varchar(150) NOT NULL,
  `location_name` varchar(150) NOT NULL,
  `posted_day` date NOT NULL,
  PRIMARY KEY (`id`),
  KEY `posted_day` (`posted_day`),
  KEY `geo_coords_1` (`geo_coords_lon`),
  KEY `geo_coords_0` (`geo_coords_lat`)
) ENGINE=InnoDB AUTO_INCREMENT=500001 DEFAULT CHARSET=latin1

 
Expert Comment by: Ray Paseur (LVL 108) · ID: 40368454
You would use a WHERE clause and an ORDER BY clause.  You might be able to ORDER BY the AUTO_INCREMENT key.  The first query might say something like WHERE id BETWEEN 1 AND 10000; the second query might say WHERE id BETWEEN 10001 AND 20000, etc.
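Spelled out against the verizon_test table (with the other WHERE conditions omitted for brevity), consecutive chunks would look like:

SELECT * FROM verizon_test WHERE id BETWEEN 1 AND 10000;
SELECT * FROM verizon_test WHERE id BETWEEN 10001 AND 20000;
-- ...and so on, 10,000 ids at a time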

Author Comment by: brucegust · ID: 40368456
Don't know why that CREATE TABLE snippet showed up again...

Author Comment by: brucegust · ID: 40368474
Ray, your last suggestion makes total sense, but assuming that the records in my recordset don't start with id #1 and proceed sequentially, then what I'm looking at would look something like this, correct?

query_1 = "select * from Verizon_test where mystuff order by id ASC limit 10000" (grab the id of the last record)

query_2 = "select * from Verizon_test where mystuff and id > [last id from query_1] order by id ASC limit 10000"

So on and so forth, correct?
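A minimal sketch of that id-based chunking (illustrative only: "mystuff" stands in for the real WHERE conditions, as in the question above, and it assumes the $cxn connection):

<?php // grab 10k rows at a time, resuming from the last id seen
$last_id = 0;
$fp = fopen('file.csv', 'w');
do {
    $sql = "SELECT * FROM verizon_test
            WHERE mystuff AND id > $last_id
            ORDER BY id ASC
            LIMIT 10000";
    $result = mysqli_query($cxn, $sql);
    if (!$result) die(mysqli_errno($cxn).': '.mysqli_error($cxn));

    $fetched = 0;
    while ($row = mysqli_fetch_assoc($result)) {
        $last_id = $row['id'];   // remember where this chunk ended
        fputcsv($fp, $row);
        $fetched++;
    }
    mysqli_free_result($result);
} while ($fetched == 10000);     // a short chunk means no rows remain
fclose($fp);

Unlike an OFFSET, this stays fast however deep into the result set you are, and it never re-reads a row even when the ids are sparse.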

Accepted Solution by: Ray Paseur (LVL 108, earned 350 total points) · ID: 40368499
Here's a code sample. The moving parts start at line 112; the rest is just setup for the test.
http://iconoun.com/demo/temp_brucegust.php

<?php // demo_temp_brucegust.php
error_reporting(E_ALL);

// SEE http://www.experts-exchange.com/Programming/Languages/Scripting/PHP/Q_28533249.html

// DATABASE CONNECTION AND SELECTION VARIABLES - GET THESE FROM YOUR HOSTING COMPANY
$db_host = "localhost"; // PROBABLY THIS IS OK
$db_name = "??";
$db_user = "??";
$db_word = "??";

// OPEN A CONNECTION TO THE DATA BASE SERVER AND SELECT THE DB
$mysqli = new mysqli($db_host, $db_user, $db_word, $db_name);

// DID THE CONNECT/SELECT WORK OR FAIL?
if ($mysqli->connect_errno)
{
    $err
    = "CONNECT FAIL: "
    . $mysqli->connect_errno
    . ' '
    . $mysqli->connect_error
    ;
    trigger_error($err, E_USER_ERROR);
}

// THE DATA SET IS UTF-8
$mysqli->set_charset("utf8mb4");
mb_internal_encoding("UTF-8");

// CREATING A TABLE FOR OUR TEST DATA
$sql
=
"
CREATE TEMPORARY TABLE `verizon`
( `id`                 INT           NOT NULL AUTO_INCREMENT PRIMARY KEY
, `twitter_id`         VARCHAR(150)  NOT NULL DEFAULT ''
, `actor_id`           VARCHAR(150)  NOT NULL DEFAULT ''
, `actor_display_name` VARCHAR(150)  NOT NULL DEFAULT ''
, `posted_time`        DATETIME      NOT NULL DEFAULT '0000-00-00 00:00:00'
, `display_name`       VARCHAR(150)  NOT NULL DEFAULT ''
, `geo_coords_lat`     DECIMAL(16,9) NOT NULL DEFAULT 0.0
, `geo_coords_lon`     DECIMAL(16,9) NOT NULL DEFAULT 0.0
, `location_name`      VARCHAR(150)  NOT NULL DEFAULT ''
, `posted_day`         DATE          NOT NULL DEFAULT '0000-00-00'
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
"
;
// IF mysqli::query() RETURNS FALSE, LOG AND SHOW THE ERROR
if (!$res = $mysqli->query($sql))
{
    $err
    = 'QUERY FAILURE:'
    . ' ERRNO: '
    . $mysqli->errno
    . ' ERROR: '
    . $mysqli->error
    . ' QUERY: '
    . $sql
    ;
    trigger_error($err, E_USER_ERROR);
}

// MAP OF COLUMN NAMES // AND SAMPLE VALUES
$map = array
( 'id'                  //  "2"                      <-- AUTO_INCREMENT KEY
, 'twitter_id'          //  "364536294640451585"
, 'actor_id'            //  "322321678"
, 'actor_display_name'  //  "Mutt"
, 'posted_time'         //  "2013-08-06 00:00:00"
, 'display_name'        //  "Twitter for Android"
, 'geo_coords_lat'      //  "32.867479200"
, 'geo_coords_lon'      //  "-96.641121600"
, 'location_name'       //  "Garland"
, 'posted_day'          //  "2013-08-06"
)
;

// READ THE TEST DATA
$url = 'http://filedb.experts-exchange.com/incoming/2014/10_w41/876263/verizon-data.csv';
$fpr = fopen($url, 'rb');
if (!$fpr) trigger_error("Unable to open $url", E_USER_ERROR);

// LOAD THE DB TABLE
while ($csv = fgetcsv($fpr))
{
    $csv = array_combine($map, $csv);
    unset($csv['id']);
    foreach ($csv as $key => $val)
    {
        $csv[$key] = $mysqli->real_escape_string($val);
    }
    $cols = implode(',', array_keys($csv));
    $data = "'" . implode("','", $csv) . "'";
    $sql = "INSERT INTO verizon ($cols) VALUES ($data)";
    if (!$res = $mysqli->query($sql))
    {
        $err
        = 'QUERY FAILURE:'
        . ' ERRNO: '
        . $mysqli->errno
        . ' ERROR: '
        . $mysqli->error
        . ' QUERY: '
        . $sql
        ;
        trigger_error($err, E_USER_ERROR);
    }
}


// ASSIGN VARIABLES FOR OUR QUERY
$latitude_1 = 25;
$latitude_2 = 36;
$longitude_1 = -100;
$longitude_2 = -86;
$start_date  = '2013-07-04';
$end_date    = '2013-08-06';

// CREATE A SELECT QUERY
$sql
=
"
SELECT
  actor_id
, actor_display_name
, posted_time
, display_name
, geo_coords_lat
, geo_coords_lon
, location_name
, posted_day
FROM verizon
WHERE
( geo_coords_lat >= '$latitude_1' AND geo_coords_lat <= '$latitude_2')
AND
( geo_coords_lon >= '$longitude_1' AND geo_coords_lon <= '$longitude_2')
AND
( posted_day >= '$start_date' AND posted_day <= '$end_date')
LIMIT 400000
"
;

// RUN THE QUERY OR FAIL ON ERROR
if (!$res = $mysqli->query($sql))
{
    $err
    = 'QUERY FAILURE:'
    . ' ERRNO: '
    . $mysqli->errno
    . ' ERROR: '
    . $mysqli->error
    . ' QUERY: '
    . $sql
    ;
    trigger_error($err, E_USER_ERROR);
}

// HOW MANY ROWS WERE RETURNED?
$cnt = $res->num_rows;
if($cnt == 0)
{
    echo PHP_EOL . "NO ROWS MATCHED QUERY: $sql";
}
else
{
    // WRITE A CSV FILE HERE
    echo PHP_EOL . "$cnt ROWS MATCHED QUERY: $sql<br>";
    $out = 'storage/brucegust.csv';
    $fpw = fopen($out, 'wb');
    if (!$fpw) trigger_error("Unable to Open $out", E_USER_ERROR);
    while ($row = $res->fetch_assoc())
    {
        fputcsv($fpw, $row);
    }
    fclose($fpw);

    echo PHP_EOL . 'CSV: <a target="_blank" href="' . $out . '">' . $out . '</a>';
}

 

Author Comment by: brucegust · ID: 40373520
OK...here we go!