Solved

A while within a while

Posted on 2014-10-10
Last Modified: 2014-10-14
I've done a select statement and determined the number of rows

$count = 250,000

Now, I want to loop through that recordset and insert those rows 10,000 at a time. So, I'm thinking:

while ($myrow = mysqli_fetch_assoc($myrecordset))
{
    extract($myrow);

    $myinsert = "insert into mytable (my values) values (my stuff)"; ...

but see, here's where I get "confused"...

I've got to insert 10,000 at a time and when I go back to repeat that insert statement for the next 10K, I've got to pick up where I left off.

It's a while within a while, a riddle within a riddle.

Looking for a mind greater than my own to show me how to pull this off.

Happy Friday!
Question by:brucegust
11 Comments
 
LVL 51

Expert Comment

by:Julian Hansen
ID: 40372821
The first question is: is it not possible to do this with a single SQL query?

INSERT INTO sometable (field1, field2 ...) SELECT field1, field2, .... FROM someothertable

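For illustration, a minimal sketch of that single-query approach from PHP (the connection details, table names, and columns below are placeholders, not the real schema):

<?php
// Placeholder connection details - adjust to your own environment.
$cxn = mysqli_connect('localhost', 'user', 'password', 'mydb');

// One statement copies every matching row entirely on the server;
// PHP never has to fetch the 250,000 rows, so there is no
// "while within a while" at all.
$sql = "INSERT INTO newtable (field1, field2)
        SELECT field1, field2
        FROM oldtable";

if (!mysqli_query($cxn, $sql)) {
    die(mysqli_errno($cxn) . ': ' . mysqli_error($cxn));
}

echo mysqli_affected_rows($cxn) . " rows copied";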


Failing that, there is no issue with looping a while within a while, but the question is: why do you need a while within a while?

Can you explain a bit more about the data you are working with and why you need to do it this way?

Usually, if you can't solve this with a query, you would get one record and then either insert that record somewhere else or perform a loop based on something in the returned result to do the second insert.

Where exactly are you getting confused?
 
LVL 45

Expert Comment

by:Vitor Montalvão
ID: 40372839
What's the RDBMS that you are working on?
 
LVL 34

Assisted Solution

by:Beverley Portlock
Beverley Portlock earned 333 total points
ID: 40372993
"I've got to insert 10,000 at a time and when I go back to repeat that insert statement for the next 10K, I've got to pick up where I left off."

There are a number of options depending on the data involved.

1. The work file: Create a work file which references a unique key for each record to be transferred. Each time you go round your loop, read the next record to process, copy it from the original to the new file, then delete it from the work file. This way only records which remain in the work file can be transferred, no matter how many times you run it.

2. The LEFT JOIN: Do a LEFT JOIN on a key between the master file and the new file. Select all records (or 10,000 using a LIMIT clause) where the fields in the new file are NULL. These will be records that exist in the original but not in the copy. You can then use this list to select which records go into the new file (see the sketch after this list).

3. The status field: If your original file has a spare field, then every time you process a record and copy it to the new file, update the spare field with a status that indicates it has been transferred. That way, when deciding what to transfer, you can select all records whose status indicates that they have not yet been processed.

4. The destructive method: My least favourite, as it destroys the original. Every time you transfer a record to the new file, delete it from the old one.
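A minimal sketch of option 2, assuming placeholder table names (originaltable, newtable) that share a unique key column id:

<?php
// Placeholder connection - adjust to your own environment.
$cxn = mysqli_connect('localhost', 'user', 'password', 'mydb');

// Copy the next batch of up to 10,000 rows that exist in the original
// table but are not yet in the copy. Re-running the script simply picks
// up whatever is still missing, so no bookmark has to be carried around.
$sql = "INSERT INTO newtable (id, field1, field2)
        SELECT o.id, o.field1, o.field2
        FROM originaltable o
        LEFT JOIN newtable n ON n.id = o.id
        WHERE n.id IS NULL
        ORDER BY o.id
        LIMIT 10000";

if (!mysqli_query($cxn, $sql)) {
    die(mysqli_errno($cxn) . ': ' . mysqli_error($cxn));
}

echo mysqli_affected_rows($cxn) . " rows copied this pass";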
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 40373347
You want to insert those rows into what, exactly?  It seems like this process is creating a proliferation of data, but it may not be adding a lot of organization.  You have some different ways to approach this.  You might, for example, SELECT with a LIMIT and ORDER clause.  You could SELECT 10,000 rows at a time, and INSERT them as they are SELECTed.  Or you could use some kind of counter in a loop.
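As a rough sketch of the counter-in-a-loop idea (the table and column names are placeholders; the batch size and total come from the question):

<?php
// Placeholder connection - adjust to your own environment.
$cxn = mysqli_connect('localhost', 'user', 'password', 'mydb');

$batchSize = 10000;
$total     = 250000;   // the row count already determined by the earlier SELECT

// Copy one 10,000-row slice per pass; ORDER BY a unique key keeps the
// slices stable from one pass to the next.
for ($offset = 0; $offset < $total; $offset += $batchSize) {
    $sql = "INSERT INTO newtable (field1, field2)
            SELECT field1, field2
            FROM oldtable
            ORDER BY id
            LIMIT $batchSize OFFSET $offset";

    if (!mysqli_query($cxn, $sql)) {
        die(mysqli_errno($cxn) . ': ' . mysqli_error($cxn));
    }
}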
 
LVL 108

Accepted Solution

by:Ray Paseur
Ray Paseur earned 167 total points
ID: 40373412
Now that I've put a bit of code in front of me, I want to revise my previous comment.  Once you've done a SELECT and created a 250,000-record result set, you've got all 250,000 rows in one place.  There is no logical way to work with only 10,000 of those rows at a time -- they're all there in the result set.  Your SELECT query is the place to choose the 10,000 rows you want.

Taken together, this and your other recent questions leave me scratching my head.  We have lots of tiny pieces of a puzzle, but we don't have the big picture.  And if you're not willing to divulge the big picture for business reasons, I certainly understand.  But I'm left wondering about a lot of questions that haven't come up yet, and probably should come up now, or as soon as possible, if this project is to have any chance of success.

We started with the question of a 250,000,000 row table.  We got some data from Twitter and loaded this table.  It gave us a static snapshot of a dynamic system (Twitter does not stand still).  Why did we do that?    What is the cycle for refreshing this data, or is the static snapshot always going to be the same?

What parts of the data are important for your project?  In asking this question I'm trying to lead you to tell us about the difference between propellant and payload.  Payload is the stuff you want to SELECT that does not need to be in WHERE, GROUP, JOIN, HAVING or ORDER clauses.  Propellant is the stuff that powers these selective clauses.  Why these columns and not some other columns?  What parts of the data can be used to shrink the mega-table into something more manageable?  For example, could you partition the table into smaller tables, dividing the rows on geocode boundaries?  Or dividing the rows by dates?
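MySQL's native range partitioning is one concrete form of that date split (a sketch only; the table, columns, and boundaries below are assumptions, not the real schema):

-- A sketch only: hypothetical table, assuming a posted_day DATE column.
-- Each partition holds one month, so a query constrained on posted_day
-- only has to touch the relevant partition(s).
CREATE TABLE tweets_by_month (
    id             INT UNSIGNED NOT NULL,
    actor_id       BIGINT,
    posted_day     DATE NOT NULL,
    geo_coords_lat DECIMAL(10,6),
    geo_coords_lon DECIMAL(10,6),
    PRIMARY KEY (id, posted_day)
)
PARTITION BY RANGE (TO_DAYS(posted_day)) (
    PARTITION p201308 VALUES LESS THAN (TO_DAYS('2013-09-01')),
    PARTITION p201309 VALUES LESS THAN (TO_DAYS('2013-10-01')),
    PARTITION pmax    VALUES LESS THAN MAXVALUE
);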

I'm concerned that if we don't get a better understanding of these ideas we will be advising you about things that may lead you to build an unworkable system.  Just like the situation we hit when we loaded a 250,000,000 row table, then tried to add an index - data collections of this size are not the stuff of everyday life in the PHP world.  Even the test processes take a long time when you're dealing with this much data, and I don't want you to waste your time testing the wrong approach to the issues.  That's why I recommended that you consult an information architect and database administrator.  You have an "edge case" in the world of PHP applications, and only a specialist is likely to have a depth of experience with tables this size.

Don't get me wrong, I still think you can get good answers here, but only if we have the background information to understand the big picture.

Author Comment

by:brucegust
ID: 40373740
OK, Ray, here's the "big picture:"

The Twitter data that I have represents a snapshot of a year's activity in a given area. The idea is to take that info, parse it out, and store it in a database that can then be queried according to geographical location and time of year.

The nature of the information is static. It's not something that's going to be updated as much as it's going to be analyzed.

My challenge was to first create code that would automate the process of decompressing the files and parsing them into a database. Thanks to ninjas like yourself, that was done over the course of a weekend.

The next challenge was to institute indexes that would allow the aforementioned queries to be executed within a reasonable timeframe. Come to find out, adding an index was far more difficult than expected because of the way the file was being scanned into a tmp directory that apparently wasn't large enough. As I researched how to change the size of said directory, I resolved to consider other options, including:

- making a duplicate of the table, adding the indexes, and then moving the data over incrementally
- trying to streamline the queries themselves by doing things in batches of 10K as opposed to a situation where I was confronted with a  recordset of hundreds of thousands of rows and trying to process that all at once

I've been running a script for the last several days that is moving files incrementally but it's very slow and I want to put this project to bed. Hence my questions pertaining to changing the my.cnf and other such things.
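For reference, the temporary-file directory MySQL writes to during large index builds is typically governed by the tmpdir setting in my.cnf; a minimal sketch, with a made-up path:

[mysqld]
# Point MySQL's temporary files at a volume with enough free space
# for large index builds.
tmpdir = /path/to/bigger/tmp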

Does that help?
 
LVL 45

Expert Comment

by:Vitor Montalvão
ID: 40373996
Did you think about replication?
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 40374053
OK, here's an idea.  You've got 365 days of data - consider loading it into 365 different tables.  I'm sure it's not an even distribution, but if it were, you would have about 700,000 rows per table - a manageable number.  Then when an inquiry comes in for data from January 1-14, you would query 14 tables.  If the query were for September 15, you would only have to query one table.
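A sketch of what a multi-day query against per-day tables could look like (the per-day table names are made up; the columns are the ones quoted elsewhere in the thread):

-- A sketch only: hypothetical per-day table names.
-- A request for January 1-3 touches just those three tables.
SELECT actor_id, posted_time, location_name FROM tweets_2013_01_01
UNION ALL
SELECT actor_id, posted_time, location_name FROM tweets_2013_01_02
UNION ALL
SELECT actor_id, posted_time, location_name FROM tweets_2013_01_03;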

Working with a few hundred thousand rows, or even a couple of million, is likely to be more successful than working with hundreds of millions of rows.  Does that make sense, given what you know about the application?
 
LVL 34

Assisted Solution

by:Beverley Portlock
Beverley Portlock earned 333 total points
ID: 40374131
Looking back over related questions I came across some code that let me infer the table structure as

actor_id, actor_display_name, posted_time, display_name, geo_coords_0, geo_coords_1, location_name, posted_day



and I presume that the copies being made are being done into a similar table. It struck me that this table was highly un-normalised and some basic normalisation might reduce the overall data space requirements and allow speedier access by using integers rather than strings. For instance, splitting the data like so

actor_id, actor_display_name

location_id, location_name

display_id, display_name

id, actor_id, location_id, display_id, posted_time, geo_coords_0, geo_coords_1, posted_day



Would eliminate a lot of repeated data for "actor names" and "location names", etc., and replace these strings with a simple integer. Integer comparison is quicker than string comparison, and over a large data set the small difference will build up.

Carrying this logic further, a date integer would occupy slightly less space than a date string, but even that small saving is worth having:

actor_id, actor_display_name

location_id, location_name

display_id, display_name

day_id, posted_day

id, actor_id, location_id, display_id, day_id, posted_time, geo_coords_0, geo_coords_1



Thus we have shortened the main table considerably and produced much smaller index tables which could be used to speed up searching and updates in the main table.
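A rough DDL sketch of that final layout (the types, key choices, and table names below are assumptions, not the real schema):

-- A sketch only: names and types are guesses, not the real schema.
CREATE TABLE actors    (actor_id    BIGINT UNSIGNED PRIMARY KEY, actor_display_name VARCHAR(255));
CREATE TABLE locations (location_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY, location_name VARCHAR(255));
CREATE TABLE displays  (display_id  INT UNSIGNED AUTO_INCREMENT PRIMARY KEY, display_name  VARCHAR(255));
CREATE TABLE days      (day_id      INT UNSIGNED AUTO_INCREMENT PRIMARY KEY, posted_day    DATE);

-- The main table now carries only integer keys plus the per-row values,
-- which keeps each row short and the secondary indexes small.
CREATE TABLE tweets (
    id           INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    actor_id     BIGINT UNSIGNED NOT NULL,
    location_id  INT UNSIGNED NOT NULL,
    display_id   INT UNSIGNED NOT NULL,
    day_id       INT UNSIGNED NOT NULL,
    posted_time  TIME,
    geo_coords_0 DECIMAL(10,6),
    geo_coords_1 DECIMAL(10,6),
    KEY (actor_id), KEY (location_id), KEY (display_id), KEY (day_id)
);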
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 40374252
@bportlock: That is good, solid computer science.  It may be the right approach, and I would be interested in benchmarking it against the current, unnormalized table.  In my experience at scale, normalization is not always helpful to performance.  Often, just "throwing hardware at the problem" is where these sorts of large-scale applications get their best results.

It's probably worth counting the number of rows that would appear in each of those tables.  We can know some of that already - for example, there will be 365 rows in the posted_day table, and there will still be 250,000,000 in the main table.  IIRC the display_name table will be small.
 

Author Comment

by:brucegust
ID: 40380604
Hey, guys!

I got it to work, although we're still not out of the woods: the query is taking far too long, and I've got a question sitting out at http://www.experts-exchange.com/Programming/Languages/Scripting/PHP/Q_28537414.html

As far as the "while within a while," here's what I came up with:

if(!isset($_GET['the_id']))
{
$the_id=0;

$bruce = "select count(id) as record_count from verizon_stuntdouble where posted_day BETWEEN '2013-08-01' and '2013-08-02'";
$bruce_query=mysqli_query($cxn, $bruce)
or die("Couldn't make Bruce happen.");
$bruce_row=mysqli_fetch_assoc($bruce_query);
extract($bruce_row);

$bruce_count=$record_count/10000;
$the_count=round($bruce_count);
//if you've got more than 400000 rows, you'll probably need to do a redirect and let the user know that they're going to wind up with a csv file that isn't practical

$michelle = "delete from twitter_csv";
$michelle_query=mysqli_query($cxn, $michelle);
$crystal="SELECT id, actor_id, actor_display_name, posted_time, display_name, geo_coords_lat, geo_coords_lon, location_name, posted_day  FROM verizon_stuntdouble where posted_day BETWEEN '2013-08-01' and '2013-08-02' order by id ASC LIMIT 10000";
}
else
{
$the_count=(int)$_GET['new_count'];
$the_id=(int)$_GET['the_id'];
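// Keyset pagination: resume just after the last id copied in the previous batch (id > $the_id).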
$crystal="SELECT id, actor_id, actor_display_name, posted_time, display_name, geo_coords_lat, geo_coords_lon, location_name, posted_day  FROM verizon_stuntdouble where posted_day BETWEEN '2013-08-01' and '2013-08-02' and id>$the_id order by id ASC LIMIT 10000";
}

$crystal_query=mysqli_query($cxn, $crystal)
or die("Crystal didn't happen.");
$crystal_count=mysqli_num_rows($crystal_query);
if($crystal_count>0)
{
	while($crystal_row=mysqli_fetch_assoc($crystal_query))
	{
	extract($crystal_row);
	$verizon_id = mysqli_real_escape_string($cxn, $crystal_row['id']);
	$the_actor_id= mysqli_real_escape_string($cxn,$crystal_row['actor_id']);
	$the_actor_display_name= mysqli_real_escape_string($cxn,$crystal_row['actor_display_name']);
	$the_posted_time= mysqli_real_escape_string($cxn,$crystal_row['posted_time']);
	$the_geo_coords_lat= mysqli_real_escape_string($cxn,$crystal_row['geo_coords_lat']);
	$the_geo_coords_lon= mysqli_real_escape_string($cxn,$crystal_row['geo_coords_lon']);
	$the_location_name= mysqli_real_escape_string($cxn,$crystal_row['location_name']);
	$the_posted_day=$crystal_row['posted_day'];
	$insert = "insert into twitter_csv (verizon_id, actor_id, actor_display_name, posted_time, geo_coords_lat, geo_coords_lon, location_name, posted_day) 
	values ('$verizon_id', '$the_actor_id', '$the_actor_display_name', '$the_posted_time', '$the_geo_coords_lat', '$the_geo_coords_lon', '$the_location_name', '$the_posted_day')";
		$insertexe = mysqli_query($cxn, $insert);
		if(!$insertexe) {
		$error = mysqli_errno($cxn).': '.mysqli_error($cxn);
		die($error);
		}
	}
	//grab from the twitter_csv table the id of the last record added
	$angel = "select verizon_id from twitter_csv order by verizon_id DESC LIMIT 1";
	$angel_query=mysqli_query($cxn, $angel)
	or die("Angel didn't happen.");
	$angel_row=mysqli_fetch_assoc($angel_query);
	extract($angel_row);
	$brand_new_id=$angel_row['verizon_id']; 
	$new_count=$the_count-1;
	header("Location: breather_experiment.php?the_id=$brand_new_id&new_count=$new_count");
	exit();
}
else
{
//the csv file is ready
echo "now you can download your csv file";
}



The "breather.php" looked like this:

<?php
$the_id=$_GET['the_id'];
$new_count=$_GET['new_count'];
?>
<!DOCTYPE html>
<html lang="en">
<head>
<title>JSON / CSV Conversion</title>
<link href="style.css" rel="stylesheet" type="text/css" />
</head>
<script type="text/javascript">
<!--
function Redirect()
{
    window.location="experiment.php?the_id=<?php echo $the_id; ?>&new_count=<?php echo $new_count; ?>";
}

//document.write("You just finished processing a file. You are getting ready to jump back in again!");
setTimeout('Redirect()', 5000);
//-->
</script>



So the user's watching things "cook," with a little text saying where we are in the overall process.

Thanks for weighing in!