Why does this query take so long?

Bruce Gust
Here's my query:

EXPLAIN SELECT id, actor_id, actor_display_name, posted_time, display_name, geo_coords_lat, geo_coords_lon, location_name, posted_day FROM verizon WHERE (posted_day BETWEEN '2014-03-05' and '2014-03-18') and (geo_coords_lat BETWEEN '26' and '26.25') and (geo_coords_lon BETWEEN '-80.25' and '-80') order by id ASC LIMIT 10000

When I run the "explain" function, I get this:

id  select_type  table    type   possible_keys                              key      key_len  ref   rows
1   SIMPLE       verizon  index  posted_day,geo_coords_lat,geo_coords_lon   PRIMARY  4        NULL  223953

The last column was "Extra" which read "Using where."

At first glance, I'm stoked, because it looks like my indexes are doing exactly what they're supposed to do: taking the 250,000,000 rows and reducing them to a very manageable collection.

But the process, which I have below, is taking anywhere from 20 to 25 minutes, which makes no sense: 223,953 rows should sing.

What am I doing that's clogging the pipes? Theoretically, everything looks great. Practically, we need some major improvement.

Thoughts?

$crystal="SELECT id, actor_id, actor_display_name, posted_time, display_name, geo_coords_lat, geo_coords_lon, location_name, posted_day FROM verizon WHERE (posted_day BETWEEN '$start_date' and '$end_date') and (geo_coords_lat BETWEEN '$latitude_1' and '$latitude_2') and (geo_coords_lon BETWEEN '$longitude_1' and '$longitude_2') order by id ASC LIMIT 10000";
$crystal_query=mysqli_query($cxn, $crystal)
or die("Crystal didn't happen.");
	while($crystal_row=mysqli_fetch_assoc($crystal_query))
	{
	extract($crystal_row);
	$verizon_id=mysqli_real_escape_string($cxn, $crystal_row['id']);
	$the_actor_id= mysqli_real_escape_string($cxn,$crystal_row['actor_id']);
	$the_actor_display_name= mysqli_real_escape_string($cxn,$crystal_row['actor_display_name']);
	$the_posted_time= mysqli_real_escape_string($cxn,$crystal_row['posted_time']);
	$the_geo_coords_lat= mysqli_real_escape_string($cxn,$crystal_row['geo_coords_lat']);
	$the_geo_coords_lon= mysqli_real_escape_string($cxn,$crystal_row['geo_coords_lon']);
	$the_location_name= mysqli_real_escape_string($cxn,$crystal_row['location_name']);
	$the_posted_day=$crystal_row['posted_day'];
	$insert = "insert into twitter_csv (verizon_id, actor_id, actor_display_name, posted_time, geo_coords_lat, geo_coords_lon, location_name, posted_day) 
	values ('$verizon_id', '$the_actor_id', '$the_actor_display_name', '$the_posted_time', '$the_geo_coords_lat', '$the_geo_coords_lon', '$the_location_name', '$the_posted_day')";
		$insertexe = mysqli_query($cxn, $insert);
		if(!$insertexe) {
		$error = mysqli_errno($cxn).': '.mysqli_error($cxn);
		die($error);
		}
	}



PS: Don't be distracted by the LIMIT 10000. I did that thinking that by breaking things down into bite sized chunks, I was streamlining the process. Maybe, maybe not. But the problem is in the amount of time the initial query is taking. Once I saw the EXPLAIN, I was certain that I'm missing something.
I would make sure that you have properly indexed those tables for optimal performance in your database. Indexing the tables properly will severely reduce the time :)
Dave Baldwin, Fixer of Problems
Most Valuable Expert 2014
Commented:
The key to the slowness is the use of BETWEEN in three different WHERE conditions.  I believe that MySQL has to create three different sorted indexes and cross-reference them to find the rows that match all three conditions.  I don't think that creating indexes will help much either, because the BETWEEN clauses force MySQL to go through the whole table each time.  Try a limit of 10 and I think you will see very little change in the amount of time that it takes.
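(Editor's note: for illustration of why separate single-column indexes don't combine well here, a composite index is the usual thing to try. This is a sketch, not something from the thread, and the index name `idx_day_lat_lon` is made up; MySQL can only range-scan the leading column, with the lat/lon BETWEENs filtered afterwards, from the index itself on 5.6+ via Index Condition Pushdown.)

```sql
-- Hypothetical composite index. The leading range column (posted_day)
-- drives the range scan; the geo_coords_lat/geo_coords_lon conditions
-- are then applied as filters on the matching index entries.
ALTER TABLE verizon
  ADD INDEX idx_day_lat_lon (posted_day, geo_coords_lat, geo_coords_lon);
```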
David -
I only saw 2 BETWEENs the first time I looked :(
good eye

*Bows*
Bruce Gust, PHP Developer (Author)

Commented:
Dave, I've been reading while I've been waiting for some feedback and your counsel resonates with what I've discovered thus far.

I know what you're saying is correct because, having played with the database directly, I can see how things absolutely fly when I use a specific equality as opposed to a BETWEEN.

Can you think of a creative way in which I can break things up so I can serve my user (who's going to be using a range of geo_coords as well as dates) so I can get them their answer without having to clog the pipes?
Dave Baldwin, Fixer of Problems
Most Valuable Expert 2014

Commented:
Nope.  You have created a pipe-clogging scenario.  What you are currently trying to do will never be quick.  Too much data combined with a slow method.

Commented:
Looks like it should be a bit faster in MySQL 5.6 or newer: https://dev.mysql.com/doc/refman/5.6/en/index-condition-pushdown-optimization.html

With the new "Index Condition Pushdown" optimization, it should limit the full table scans.
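(Editor's note: as a quick check, the ICP flag is visible in `optimizer_switch` on 5.6+; a sketch. When ICP actually applies, EXPLAIN's Extra column shows "Using index condition".)

```sql
-- Returns 1 if Index Condition Pushdown is enabled (MySQL 5.6+)
SELECT @@optimizer_switch LIKE '%index_condition_pushdown=on%' AS icp_on;
```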

HTH,
Dan
Most Valuable Expert 2011
Top Expert 2016

Commented:
You might want to do this in two queries.  Whether this is a good idea or not may depend on the number of rows you expect to get in the results set.  A sensible design might go something like this (pidgin code):

CREATE TEMPORARY TABLE x ENGINE=MEMORY
SELECT * FROM verizon
WHERE posted_day BETWEEN '$start_date' AND '$end_date'

Now you would have a smaller table.  Not sure how much smaller, but...

See the proximity calculator in this article for a way to down-select into a temporary table.
http://www.experts-exchange.com/Programming/Languages/Scripting/PHP/A_4276-What-is-near-me-Proximity-calculations-using-PHP-and-MySQL.html
Bruce Gust, PHP Developer (Author)

Commented:
Ray, I was thinking about doing that and I was playing with the idea in phpMyAdmin and I got an error that indicated my innodb_buffer_pool_size was too small.

Does creating a temp table via php eliminate that problem?
Most Valuable Expert 2011
Top Expert 2016

Commented:
I don't know; it's a data-dependent problem and only you have the data.  It's something to test, but like I wrote in earlier questions, you're working with data at a very large scale.  You might want to break this one giant table up into tables by day, so you would be working with 365 tables, each of a more manageable size.  I'm guessing you haven't tried that yet?

You might also want to consider making up a test data set and posting it for us here.  I would recommend selecting every 1,000th row out of your big table.  That would create a test data set that contained about 250,000 rows, presumably with a more-or-less representative and well-distributed subset of the big collection.  Once you have the data uploaded, you can make reference to the uploaded file in this and future questions.  You can use the "Attach File" link below the comment box.

If I have that small test data set, I can show you tested examples of the logic for things like a down-select into a memory table or a design that uses tables per day or per month, etc.
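(Editor's note: the every-1,000th-row sampling mentioned above can be done with a modulus on the id; a sketch, assuming `id` is a dense integer key without large gaps.)

```sql
-- Roughly every 1,000th row, assuming sequential integer ids
SELECT * FROM verizon WHERE id % 1000 = 0;
```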
Expert of the Year 2014
Top Expert 2014

Commented:
Are the lat and lng columns indexed? If not, that could certainly slow you down. Which is why you should be using a geospatial index.
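(Editor's note: a sketch of the geospatial route, assuming a MySQL version with SPATIAL index support on the storage engine in use — MyISAM at the time, InnoDB as of 5.7. The column `geo_point` and index `idx_geo` are made-up names, and lon/lat are used as x/y by convention.)

```sql
-- Add a POINT column, backfill it from the lat/lon columns, then index it
ALTER TABLE verizon ADD COLUMN geo_point POINT NULL;
UPDATE verizon SET geo_point = POINT(geo_coords_lon, geo_coords_lat);
ALTER TABLE verizon MODIFY geo_point POINT NOT NULL;
ALTER TABLE verizon ADD SPATIAL INDEX idx_geo (geo_point);

-- Bounding-box query: MBRContains against a closed rectangle (lon lat order)
SELECT id, posted_day
FROM verizon
WHERE MBRContains(
        ST_GeomFromText('POLYGON((-80.25 26, -80 26, -80 26.25,
                                  -80.25 26.25, -80.25 26))'),
        geo_point)
  AND posted_day BETWEEN '2014-03-05' AND '2014-03-18';
```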
Brian Tao, Senior Business Solutions Consultant
Top Expert 2014
Commented:
I think the bottleneck would be in the while loop with the insert statement.  You were trying to insert 223,953 rows one by one, meaning the DB server has to process that many separate INSERT statements.
Have you tried commenting out the insert part and see how long it takes?
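(Editor's note: even if the loop stays, batching rows into one multi-row INSERT cuts the round trips dramatically. A sketch; the values below are made up for illustration.)

```sql
-- One statement, many rows: the server parses and commits once
INSERT INTO twitter_csv
  (verizon_id, actor_id, actor_display_name, posted_time,
   geo_coords_lat, geo_coords_lon, location_name, posted_day)
VALUES
  ('1', '101', 'alice', '2014-03-05 10:00:00', '26.10', '-80.10', 'Fort Lauderdale', '2014-03-05'),
  ('2', '102', 'bob',   '2014-03-05 10:05:00', '26.20', '-80.05', 'Pompano Beach',   '2014-03-05');
```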
Most Valuable Expert 2011
Top Expert 2016
Commented:
Agree with taoyipai: there are just too many moving parts to this application, compounded by millions of rows of data that seem to get copied over and over.  Check this idea and see if it can help you consolidate some things:
http://dev.mysql.com/doc/refman/5.0/en/insert-select.html

The query string would look something like this (untested, awaiting test data) -- not sure about display_name column.
$insert 
= 
"
INSERT INTO twitter_csv 
( verizon_id
, actor_id
, actor_display_name
, posted_time
, display_name
, geo_coords_lat
, geo_coords_lon
, location_name
, posted_day
) 
SELECT
  id
, actor_id
, actor_display_name
, posted_time
, display_name
, geo_coords_lat
, geo_coords_lon
, location_name
, posted_day 
FROM verizon 
WHERE (posted_day     BETWEEN '$start_date'  AND '$end_date') 
AND   (geo_coords_lat BETWEEN '$latitude_1'  AND '$latitude_2') 
AND   (geo_coords_lon BETWEEN '$longitude_1' AND '$longitude_2') 
ORDER BY id ASC 
LIMIT 10000
"
;



I'd also hope you come to understand the danger in this line of code.  Don't ever write the extract() function again or, for that matter, compact().  These functions blur the line between code and data in ways that can cause your scripts to fail without warning when a variable name collision occurs; they constitute a code smell.  You do not want that on your resume!  The function is uncalled for in this context and it causes a proliferation of variables in your symbol table.  More variables means more potential failure points, so it's best to just leave it out.
// extract($crystal_row); OMIT THIS

