Solved

The correct use of the sleep() function in PHP

Posted on 2014-09-30
176 Views
Last Modified: 2014-10-01
I've written a script that successfully decompresses a JSON file and then parses it. I have 365 files, each one ranging from 15 - 30 MB.

It takes a little over 10 minutes per file, and I want to be able to start it up, leave for the night, and come back to find all the files I've loaded into the directory sitting as apples of gold on trays of silver - everything decompressed and neatly parsed.

Voila!

I want to be wise and strategic in the way I craft my script, giving it a chance to "breathe" in between files so it doesn't time out.

I've got my max_execution_time set to 14400 (four hours). If my loop is set up like this, can I incorporate sleep() so I give my code a chance to catch its breath in between files? And when I do that, is the page still operating within the four-hour window I've set, or does going to "sleep" mean that when the script "wakes up," it's functioning as though it were just getting started?

Here's what I'm thinking:

foreach ($arr as $file)
{
    // decompressing code
    // parsing code
    sleep(10);
}

What do you think?
Question by:brucegust
14 Comments
 
LVL 108

Accepted Solution

by:
Ray Paseur earned 250 total points
This won't work.  10 minutes x 365 files = 61 hours.

Instead, set it up so that a script handles one file.  Use the database to keep a record of the files that have been processed.  As each file is processed, insert the file name into the database table.  Let the script find all of the 365 files and run through the table to choose the first file that has not been processed yet.  Process that file, then make a POST-method request to restart the script.  Maybe sleep(1) before each restart.  I don't know what else has to run simultaneously on the server, but you might want to think about that.

There are many things that can go wrong overnight, so make sure that the script is restartable at any time.
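A minimal sketch of what that one-file-per-run design might look like, reusing the raw_files table from this thread. The carter.inc include, process_one_file(), and restart_self() are placeholders for illustration, not code anyone has posted; restart_self() stands in for the POST-method request that kicks off the next run.

<?php
// Sketch only: process exactly one unprocessed .gz file per request, record it
// in raw_files, then trigger the next run and exit. Hypothetical helpers:
// process_one_file() = the existing decompress/parse code, restart_self() = the
// POST-method re-trigger.
error_reporting(E_ALL);
include 'carter.inc';
$cxn = mysqli_connect($host, $user, $password, $database) or die('could not connect');

$dir = 'JSON/';
foreach (scandir($dir) as $file) {
    if (pathinfo($file, PATHINFO_EXTENSION) !== 'gz') continue;

    // Skip anything that already has a row in the tracking table.
    $safe = mysqli_real_escape_string($cxn, $file);
    $res  = mysqli_query($cxn, "SELECT id FROM raw_files WHERE file_name='$safe'");
    if (mysqli_num_rows($res) > 0) continue;

    // Claim the file, process it, and mark it finished -- one file per request.
    mysqli_query($cxn, "INSERT INTO raw_files (file_name, start_time) VALUES ('$safe', NOW())");
    $id = mysqli_insert_id($cxn);

    process_one_file($dir . $file, $cxn);

    mysqli_query($cxn, "UPDATE raw_files SET end_time = NOW() WHERE id = $id LIMIT 1");

    sleep(1);          // brief pause, per the suggestion above
    restart_self();    // POST back to this script so the next run picks up the next file
    exit;
}

echo 'All files have been processed!';

Because each request touches only one file, a crash or timeout overnight costs at most that one file, and the script can simply be started again at any time.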
 
LVL 34

Assisted Solution

by:gr8gonzo
gr8gonzo earned 250 total points
1. Do you -regularly- have 365 files that you have to parse every night? In your previous question, your filenames had the word "day" in them and now you have 365 files... it sounds like you're dealing with daily data across the span of a year. I could understand if you're processing 365 upfront as an initial load, but 365 every night seems a little odd.

2. I would agree with Ray that some kind of management process might be the right play here. It sounded like your previous question was doing something like that (inserting records of what you're processing into the database), so you might be able to build off of that. Just add a status field (e.g. a tinyint field where 0 = ready for processing, 1 = processing, 2 = successful, 3 = failed). Then have your script pull the first X records where status = 0, and process those records.

I would typically have the script process a few records each time, and then run a few instances of that script simultaneously (offset their start times a little bit). That way, if each script processes 5 records at a time (~50 minutes runtime), you can run four instances of the same script at the same time and be processing about 20 records every 50 minutes. I'd also use a cron job to schedule the jobs so you can use midnight and lighter-traffic times to run more instances of the script. Don't worry about the script restarting the next loop - let cron take care of that part.
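A rough sketch of that claim-a-batch worker, meant to be run from cron in several overlapping instances. The table and column names (zipped_files, status, worker_pid) and process_one_file() are illustrative assumptions, not code from this thread.

<?php
// Sketch only: claim up to $batch files whose status is 0 (ready), process them,
// and mark each one 2 (successful) or 3 (failed). Each cron-started instance
// logs to its own file, named after its PID.
include 'carter.inc';
$cxn = mysqli_connect($host, $user, $password, $database) or die('no db');

$batch = 5;
$pid   = getmypid();
$log   = fopen("import_$pid.log", 'a');

// Claim a batch: 0 = ready, 1 = processing (worker_pid is an assumed column so
// each instance can find the rows it just claimed).
mysqli_query($cxn, "UPDATE zipped_files SET status = 1, worker_pid = $pid WHERE status = 0 LIMIT $batch");

$res = mysqli_query($cxn, "SELECT id, file_name FROM zipped_files WHERE status = 1 AND worker_pid = $pid");
while ($row = mysqli_fetch_assoc($res)) {
    fwrite($log, date('c') . " start {$row['file_name']}\n");
    $ok  = process_one_file($row['file_name'], $cxn);   // decompress + parse (existing code)
    $new = $ok ? 2 : 3;
    mysqli_query($cxn, "UPDATE zipped_files SET status = $new WHERE id = {$row['id']} LIMIT 1");
    fwrite($log, date('c') . " done {$row['file_name']} status=$new\n");
}
fclose($log);

A crontab entry along the lines of */15 0-6 * * * php /path/to/worker.php (hypothetical path) would start a fresh instance every 15 minutes during the overnight hours, so several batches run in parallel without any one PHP request having to live for hours.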

Also, make sure you log everything (use a filename that has the PID in it so you don't have multiple script instances writing to the same log file at the same time) so that if something fails, you know what to do next.

3. It seems a little strange for it to take 10 minutes to parse a 15-30 meg file, unless you have a really slow server or a really lengthy and complex parsing routine. I write data import and parsing scripts ALL the time (JSON, CSV, XML, etc.) that deal with enterprise data (hundreds of megs of stuff), so that speed just seems a little off to me. You might greatly benefit from asking experts to review the parsing part of the code.

Also, it might be easier to simply gunzip your files in a separate process before PHP gets to them. Running the normal gunzip binary on those files will be faster and more efficient than using PHP's zlib extension to do the same job. Worst case, just use PHP's shell_exec to run the binary if you can't gunzip them before PHP runs.
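For what it's worth, the shell_exec route might look roughly like this; the file name is hypothetical and gunzip is assumed to be on the server's PATH.

<?php
// Sketch only: decompress one .gz file by shelling out to the system gunzip
// instead of looping over gzread() in PHP. gunzip -c writes the decompressed
// stream to stdout, so the original .gz file is left untouched.
$gz  = 'JSON/2013-10-15-day.json.gz';               // hypothetical file name
$out = preg_replace('/\.gz$/', '', $gz);

$cmd = 'gunzip -c ' . escapeshellarg($gz) . ' > ' . escapeshellarg($out);
shell_exec($cmd);

if (!is_file($out) || filesize($out) === 0) {
    trigger_error("gunzip appears to have failed for $gz", E_USER_WARNING);
}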
 

Author Comment

by:brucegust
Gonz...

Thanks for taking the time to weigh in. Let me respond by giving you the "back-story" so you can see where all this is going.

The 365 files are a one-time "data dump." The files need to be decompressed and parsed, and then I'm storing all the info in a single table that's been indexed to facilitate efficient queries.

The end user will be entering a date as well as some latitude and longitude values, resulting in a recordset that they can then export as a CSV file.

Everything that you've seen me struggle with these last few days has as its target that user interface. I rarely work with this much data, so I'm a sponge, trying to soak up all the info I can in order to put together a process that takes all this info and stores it in a way that can be used.

Ray's suggestion resonates as a solid solution and I'm looking forward to popping the hood on that approach and making it work.

First off, however, here's my script, top to bottom. It works, but should you see any room for improvement, I'm all ears. Also, while I understand the logic of Ray's suggestion, I'm Googling, even as we speak, looking for a tutorial that will walk me through that process.

Bring it!

<?php
error_reporting(E_ALL);

$dir = 'JSON/';
$arr = scandir($dir);
unset($arr[0]);
unset($arr[1]);

foreach($arr as $file)
{
    $info = new SplFileInfo($file);
    if($info->getExtension()!= "gz") continue;
    //echo PHP_EOL . "PROCESSING $file";

    $daniel = "SELECT file_name FROM raw_files WHERE file_name='$file'";
    $daniel_query=mysqli_query($cxn, $daniel);
    if(!$daniel_query)
    {
        var_dump($daniel);
        trigger_error(mysqli_errno($cxn).': '.mysqli_error($cxn), E_USER_ERROR);
    }
    $daniel_count=mysqli_num_rows($daniel_query);
	//good to go up to here
   // echo PHP_EOL . "FOUND $daniel_count DATABASE ROWS FOR $file";

    if ($daniel_count==0)
    {
        $now = date('c');
        $nelson="INSERT INTO raw_files (file_name, start_time) VALUES ('$file', '$now')";
        $nelson_query=mysqli_query($cxn, $nelson);
        if(!$nelson_query)
        {
            var_dump($nelson);
            trigger_error(mysqli_errno($cxn).': '.mysqli_error($cxn), E_USER_ERROR);
        }
        $new_id = $cxn->insert_id;
        //echo PHP_EOL . "INSERTED ID=$new_id INTO raw_files TABLE";
     
    $buffer_size = 4096; // read 4kb at a time
    $out_file_name = str_replace('.gz', '',$file);

    $inp_handle = gzopen('JSON/' . $file, 'rb');
    if (!$inp_handle) trigger_error("UNABLE TO GZOPEN $file", E_USER_ERROR);

    $out_handle = fopen('JSON/' . $out_file_name, 'wb');
    if (!$out_handle) trigger_error("UNABLE TO FOPEN $out_file_name", E_USER_ERROR);

		while(!gzeof($inp_handle))
		{
			$data = gzread($inp_handle, $buffer_size);
			fwrite($out_handle, $data);
		}
		fclose($out_handle);
		gzclose($inp_handle);

		//at this point, you've decompressed your file, now you do your parsing
		$the_new_file=str_replace('.gz',"",$file);
		$chunk_size=4096;
		$url="JSON/";
		$url .=$the_new_file;
		$handle=@fopen($url,'r');
			if(!$handle) 
			{
				echo "failed to open JSON file";
			}
		while (!feof($handle)) 
		{
		$buffer = fgets($handle, $chunk_size);
			if(trim($buffer)!=='')
			{
			$obj=json_decode(($buffer), true);
			
			include('clean_up.php');	
			
			$insert = "insert into verizon (actor_id, actor_display_name, posted_time, display_name, geo_coords_0, geo_coords_1, location_name, posted_day) 
			values ('$actor_id', '$actor_display_name', '$posted_time', '$display_name', '$geo_coords_0', '$geo_coords_1', '$location_name', '$posted_day')";
				$insertexe = mysqli_query($cxn, $insert);
				if(!$insertexe) {
				$error = mysqli_errno($cxn).': '.mysqli_error($cxn);
				die($error);
				}
				//echo $row_count.' | '. $obj['actor']['id'].' | '.$obj['actor']['displayName'].' | '.$obj['postedTime'].' | '.$obj['generator']['displayName'].' | '.$obj['geo']['coordinates']['0'].' | '.$obj['geo']['coordinates']['1'].' | '.$obj['location']['name'].' '.$trigger.'<br>';
			}
		}
		fclose($handle);
		//you're done parsing and decompressing. Now we update the raw_files table with the time we completed the processing
		$now = date('c');
		$brice="UPDATE raw_files SET end_time = '$now' WHERE id=$new_id LIMIT 1";
		$brice_query=mysqli_query($cxn, $brice);
			if(!$brice_query)
			{
				var_dump($brice);
				trigger_error(mysqli_errno($cxn).': '.mysqli_error($cxn), E_USER_ERROR);
			}
	}
	else
    {
	//you've already processed this file
    //trigger_error("NO DATA INSERTED FOR $file", E_USER_ERROR);
	continue;
    }
}

echo "done!";
?>


 

Author Comment

by:brucegust
Ray, I'm looking and I'm not finding anything that breaks your suggestion down into academically bite-sized pieces for this hard-charger to grasp (https://www.google.com/search?q=php+post+method+request).

It seems like the code I currently have falls in line with what you're suggesting, right up to line 27:

<?php
error_reporting(E_ALL);

$dir = 'JSON/';
$arr = scandir($dir);
unset($arr[0]);
unset($arr[1]);

foreach($arr as $file)
{
    $info = new SplFileInfo($file);
    if($info->getExtension()!= "gz") continue;
    //echo PHP_EOL . "PROCESSING $file";

    $daniel = "SELECT file_name FROM raw_files WHERE file_name='$file'";
    $daniel_query=mysqli_query($cxn, $daniel);
    if(!$daniel_query)
    {
        var_dump($daniel);
        trigger_error(mysqli_errno($cxn).': '.mysqli_error($cxn), E_USER_ERROR);
    }
    $daniel_count=mysqli_num_rows($daniel_query);
	//good to go up to here
   // echo PHP_EOL . "FOUND $daniel_count DATABASE ROWS FOR $file";

    if ($daniel_count==0)
    {



Yes?

I'm going through the files as they exist in the directory, looking in my "raw_files" table to see if that particular file has been processed, and at line 27 I'm doing my decompressing and parsing.

What I hear you saying is that once that file has been processed, instead of continuing with the for loop, I'm going to either do a redirect to a page where I just initiate the whole thing all over again, or do something along the lines of a POST-method request that involves things I don't pretend to understand at this point.

Poised on the threshold of greatness. What is that POST-method request and how do I implement it here?
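One common way to make a POST-method request from PHP is the cURL extension. A minimal sketch, assuming the processing script lives at a URL like http://localhost/process_one.php (a placeholder, not anything Ray has specified):

<?php
// Sketch only: fire a POST request back at the processing script so the next
// run starts, then stop waiting after a few seconds. The URL and the 'restart'
// field are placeholders.
function restart_self()
{
    $ch = curl_init('http://localhost/process_one.php');   // hypothetical URL
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, array('restart' => 1));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);   // don't wait ten minutes for the next file to finish
    curl_exec($ch);
    curl_close($ch);
}

// At the end of a run, after one file has been processed:
sleep(1);
restart_self();

For this to work, the receiving script should call ignore_user_abort(true) near the top so it keeps running after the caller disconnects at the five-second timeout.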

 
LVL 34

Expert Comment

by:gr8gonzo
Just to clarify, I was agreeing with Ray's general suggestion about using a management kind of process. I was simply expanding on it with some additional suggestions.

Basically, Ray was suggesting a serial/sequential, 1-by-1 loop through all the files, processing one at a time (whichever one has not been processed yet) and restarting the script at the end. There's nothing wrong with it, but you can typically be more efficient than that by processing more than one file at a time. A serial loop (one after another) is fine when you're looking at small quantities of things or when one item depends on another completing first, but in data import / processing cases you're often better off doing some parallel processing. Otherwise, you -will- be forced into a 61-hour run at minimum, when you could be doing all the files in a fraction of that time.
 
LVL 108

Expert Comment

by:Ray Paseur
@brucegust:  Please post some test data.  I'll show you what I'm talking about with a code example.  It does not have to be serial, one-at-a-time for a 61-hour process, but writing an explanation is going to take longer than just showing you the code.  FWIW this is a fairly advanced topic in application design, and also very useful.
 

Author Comment

by:brucegust
Morning, guys!

Ray, here's some sample data:

{"id":"tag:search.twitter.com,2005:389903668427763712","objectType":"activity","actor":{"objectType":"person","id":"id:twitter.com:91239297","link":"http://www.twitter.com/OGkush103","displayName":"WalkingLick74","postedTime":"2009-11-20T01:21:39.000Z","image":"https://si0.twimg.com/profile_images/378800000593715086/755411d8bdc495472c2d7ed50e319582_normal.jpeg","summary":"Self-Made, Self Paid..... I always had the mind to get it like a man, head first bout my younging Ean! #YOLO","links":[{"href":null,"rel":"me"}],"friendsCount":468,"followersCount":677,"listedCount":0,"statusesCount":25504,"twitterTimeZone":"Alaska","verified":false,"utcOffset":"-28800","preferredUsername":"OGkush103","languages":["en"],"location":{"objectType":"place","displayName":"Boston George Crib"},"favoritesCount":26},"verb":"post","postedTime":"2013-10-15T00:00:53.000Z","generator":{"displayName":"Twitter for iPhone","link":"http://twitter.com/download/iphone"},"provider":{"objectType":"service","displayName":"Twitter","link":"http://www.twitter.com"},"link":"http://twitter.com/OGkush103/statuses/389903668427763712","body":"You a killer you on twitter, You'n do NO talking","object":{"objectType":"note","id":"object:search.twitter.com,2005:389903668427763712","summary":"You a killer you on twitter, You'n do NO talking","link":"http://twitter.com/OGkush103/statuses/389903668427763712","postedTime":"2013-10-15T00:00:53.000Z"},"favoritesCount":0,"location":{"objectType":"place","displayName":"Mississippi, US","name":"Mississippi","country_code":"United States","twitter_country_code":"US","link":"https://api.twitter.com/1.1/geo/id/43d2418301bf1a49.json","geo":{"type":"Polygon","coordinates":[[[-91.65500899999999,30.146096],[-91.65500899999999,34.996099],[-88.097888,34.996099],[-88.097888,30.146096]]]}},"geo":{"type":"Point","coordinates":[31.99686058,-88.72688823]},"twitter_entities":{"hashtags":[],"symbols":[],"urls":[],"user_mentions":[]},"twitter_filter_level":"medium","twitter_lang":"en","retweetCount":0,"gnip":{"matching_rules":[{"tag":null}],"language":{"value":"en"}}}

The data that I'm grabbing from the above is documented in the attached clean_up.php file, although I think I'm going to add the "twitter id" field as well to ensure I'm not duplicating records (tag:search.twitter.com,2005:389903668427763712).

While I've no doubt that you can readily discern what my fields are, this url provides a clean "view" of what's there and what I'm grabbing: http://konklone.io/json/
clean-up.php
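The clean_up.php attachment isn't reproduced here, but judging from the columns in the verizon INSERT and the commented-out echo in the parsing loop, it presumably pulls the fields out of $obj roughly like this (a sketch, not the actual attachment; $cxn is the thread's mysqli connection):

<?php
// Sketch of the field extraction clean_up.php presumably performs; $obj is the
// result of json_decode($buffer, true) in the parsing loop.
$actor_id           = isset($obj['actor']['id'])              ? $obj['actor']['id']              : '';
$actor_display_name = isset($obj['actor']['displayName'])     ? $obj['actor']['displayName']     : '';
$posted_time        = isset($obj['postedTime'])               ? $obj['postedTime']               : '';
$display_name       = isset($obj['generator']['displayName']) ? $obj['generator']['displayName'] : '';
$geo_coords_0       = isset($obj['geo']['coordinates'][0])    ? $obj['geo']['coordinates'][0]    : '';
$geo_coords_1       = isset($obj['geo']['coordinates'][1])    ? $obj['geo']['coordinates'][1]    : '';
$location_name      = isset($obj['location']['name'])         ? $obj['location']['name']         : '';
$posted_day         = substr($posted_time, 0, 10);            // e.g. "2013-10-15" (a guess at how posted_day is derived)

// Escape everything before it is interpolated into the INSERT statement.
foreach (array('actor_id', 'actor_display_name', 'posted_time', 'display_name',
               'geo_coords_0', 'geo_coords_1', 'location_name', 'posted_day') as $v) {
    $$v = mysqli_real_escape_string($cxn, (string) $$v);
}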
 

Author Comment

by:brucegust
As an aside, I've attached my "process" thus far. Ray, after testing the process on a couple of files, I commented the trigger_error messages out for fear that the process would quit overnight. In hindsight, that may not have been a good move in light of what I saw when I came in this morning.

Bottom line: no errors, but I had 27,596,100 rows of parsed data from only two days' worth of JSON files. Upon closer inspection, the table that I'm using to "manage" the process - a list of the files that need to be parsed, with "start" and "end" times along with a "completed" column - had nothing in the "completed" column that represented a finished process. I'm thinking that means I looped through the same JSON file over and over again, since, having tested this, one JSON file is about 450K rows. Two days should be around 1,000,000 rows, not 27,000,000.

I'm going to save what I've got, but here are my marching orders today:

add the twitter id field to my table and check to make sure I'm not getting ready to duplicate a record when I go to insert the parsed data (see the sketch after this list)
figure out why, in the parse.php file, my "zipped_files" table wasn't properly updated with a "2" in the "completed" column. Given that my first select statement looks for rows that have not been completed, there's a chance I just kept doing the same file over and over again because of that flaw in the update statement.
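One possible way to handle the duplicate-record concern in the first item (again, a sketch, not the author's code): store the tweet id in its own column with a UNIQUE index and let MySQL reject repeats. The column name twitter_id is hypothetical.

<?php
// One-time schema change (run once, e.g. from phpMyAdmin or a setup script):
//   ALTER TABLE verizon ADD COLUMN twitter_id VARCHAR(80) NOT NULL,
//         ADD UNIQUE KEY uniq_twitter_id (twitter_id);

// Inside the parsing loop, after clean_up.php has run:
$twitter_id = mysqli_real_escape_string($cxn, $obj['id']);   // "tag:search.twitter.com,2005:..."

// INSERT IGNORE silently skips rows whose twitter_id already exists.
$insert = "INSERT IGNORE INTO verizon
           (twitter_id, actor_id, actor_display_name, posted_time, display_name,
            geo_coords_0, geo_coords_1, location_name, posted_day)
           VALUES ('$twitter_id', '$actor_id', '$actor_display_name', '$posted_time',
                   '$display_name', '$geo_coords_0', '$geo_coords_1', '$location_name', '$posted_day')";
mysqli_query($cxn, $insert);
if (mysqli_affected_rows($cxn) === 0) {
    // Row was skipped as a duplicate (or nothing was inserted) -- worth logging.
}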

Here's my first file that sees the .gz files and decompresses them:

<?php
ini_set('max_execution_time', 144000); // 144000 seconds = 40 hours
include("carter.inc");
$cxn = mysqli_connect($host,$user,$password,$database)
or die ("couldn't connect to server");

error_reporting(E_ALL);

$dir = 'E:/verizon/all_files/';
$arr = scandir($dir);
unset($arr[0]);
unset($arr[1]);
$daniel_check=0;

foreach($arr as $file)
{
    $info = new SplFileInfo($file);
    if($info->getExtension()!= "gz") continue;
    //echo PHP_EOL . "PROCESSING $file";

    $daniel = "SELECT file_name, completed, id FROM zipped_files WHERE file_name='$file' and completed=1";
    $daniel_query=mysqli_query($cxn, $daniel);
    $daniel_count=mysqli_num_rows($daniel_query);
	//good to go up to here
   // echo PHP_EOL . "FOUND $daniel_count DATABASE ROWS FOR $file";
		
    if ($daniel_count==1)
    {
	$daniel_row=mysqli_fetch_assoc($daniel_query);
	extract($daniel_row);
        $now = date('c');
        $nelson="update zipped_files set start_time ='$now' where id ='$daniel_row[id]'";
        $nelson_query=mysqli_query($cxn, $nelson);
        if(!$nelson_query)
        {
            //var_dump($nelson);
           // trigger_error(mysqli_errno($cxn).': '.mysqli_error($cxn), E_USER_ERROR);
		   //if something goes south here, you can't proceed because you won't have a $new_id value
        }
        $new_id = $daniel_row['id'];
        //echo PHP_EOL . "INSERTED ID=$new_id INTO raw_files TABLE";
     
    $buffer_size = 4096; // read 4kb at a time
    $out_file_name = str_replace('.gz', '',$file);

    $inp_handle = gzopen('E:/verizon/all_files/' . $file, 'rb');
    if (!$inp_handle)// trigger_error("UNABLE TO GZOPEN $file", E_USER_ERROR);
	continue;

    $out_handle = fopen('E:/verizon/all_files/' . $out_file_name, 'wb');
    if (!$out_handle) //trigger_error("UNABLE TO FOPEN $out_file_name", E_USER_ERROR);
	continue;
		while(!gzeof($inp_handle))
		{
			$data = gzread($inp_handle, $buffer_size);
			fwrite($out_handle, $data);
		}
		fclose($out_handle);
		gzclose($inp_handle);
		
	header("Location:parse.php?id=$new_id");	
	exit();
	}
	else
    {
	//you've already processed this file
    //trigger_error("NO DATA INSERTED FOR $file", E_USER_ERROR);
	continue;
    }
}

$message="All files have been processed!";
?>

<!DOCTYPE html>
<html lang="en">
<head>
<title>Twitter Usage Search Page</title>
<link href="style.css" rel="stylesheet" type="text/css" />
</head>

<body>

&nbsp;<span style="font-size:18pt; font-weight:strong;">Twitter JSON Processing Page</span>
<br><br>
This script handles the decompression and parsing of the Twitter JSON Files.
<br><br>
<div id="title">&nbsp;Twitter JSON Parsing Machine<div style="float:right;">click <a href="search.php" style="color:#ffffff;">here</a> to return to the search page&nbsp;</div></div>	<br><br>
<?php echo $message; ?>

</body>

</html>



After it finishes, on line 61, I do a redirect to parse.php. Here's that page:



It's at line 46 that the update statement should've updated the "completed" column to a "2," and it didn't. I've got to figure out why that happened.
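Since the trigger_error calls were commented out, one quick way to see why that UPDATE isn't landing is to log its error text and affected-row count for just that statement. A sketch using the same variable names as parse.php; the log file name is arbitrary.

<?php
// Debugging sketch for the UPDATE near the end of parse.php: record the SQL,
// any MySQL error, and how many rows actually changed, instead of silencing it.
$now   = date('c');
$brice = "UPDATE zipped_files SET end_time = '$now', completed = 2 WHERE id = $new_id LIMIT 1";

$brice_query = mysqli_query($cxn, $brice);
$line = date('c') . " | $brice | error=" . mysqli_error($cxn)
      . " | affected=" . mysqli_affected_rows($cxn) . "\n";
file_put_contents('parse_debug.log', $line, FILE_APPEND);

// affected=0 with no error text usually means no row matched the WHERE clause,
// i.e. $new_id was empty or wrong -- which would also explain the same file
// being picked up again on the next pass.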

Mind you, I'm completely open to suggestions, especially those that allow for a more efficient process. I would love to be able to have this thing wrapped up by tomorrow morning.
 

Author Comment

by:brucegust
For some reason the "parse.php" page didn't copy over. I just now saw that. Here's the parse page:

<?php
ini_set('max_execution_time', 144000); // 144000 seconds = 40 hours
include("carter.inc");
$cxn = mysqli_connect($host,$user,$password,$database)
or die ("couldn't connect to server");
$vivian="select file_name, id from zipped_files where id='$_GET[id]'";
$vivian_query=mysqli_query($cxn, $vivian)
or die("Couldn't execute query.");
$vivian_row=mysqli_fetch_assoc($vivian_query);
extract($vivian_row);
$file=$vivian_row['file_name'];
$new_id=$vivian_row['id'];

//at this point, you've decompressed your file, now you do your parsing
$the_new_file=str_replace('.gz',"",$file);
$chunk_size=4096;
$url="E:/verizon/all_files/";
$url .=$the_new_file;
$handle=@fopen($url,'r');
	if(!$handle) 
	{
		echo "failed to open JSON file";
	}
while (!feof($handle)) 
{
$buffer = fgets($handle, $chunk_size);
	if(trim($buffer)!=='')
	{
	$obj=json_decode(($buffer), true);
	
	include('clean_up.php');	
	
	$insert = "insert into verizon (actor_id, actor_display_name, posted_time, display_name, geo_coords_0, geo_coords_1, location_name, posted_day) 
	values ('$actor_id', '$actor_display_name', '$posted_time', '$display_name', '$geo_coords_0', '$geo_coords_1', '$location_name', '$posted_day')";
		$insertexe = mysqli_query($cxn, $insert);
		if(!$insertexe) {
		$error = mysqli_errno($cxn).': '.mysqli_error($cxn);
		die($error);
		}
		//echo $row_count.' | '. $obj['actor']['id'].' | '.$obj['actor']['displayName'].' | '.$obj['postedTime'].' | '.$obj['generator']['displayName'].' | '.$obj['geo']['coordinates']['0'].' | '.$obj['geo']['coordinates']['1'].' | '.$obj['location']['name'].' '.$trigger.'<br>';
	}
}
fclose($handle);
//you're done parsing and decompressing. Now we update the zipped_files table with the time we completed the processing
$now = date('c');
$brice="UPDATE zipped_files SET end_time = '$now',
completed=2 WHERE id=$new_id LIMIT 1";
$brice_query=mysqli_query($cxn, $brice);
	if(!$brice_query)
	{
		var_dump($brice);
		trigger_error(mysqli_errno($cxn).': '.mysqli_error($cxn), E_USER_ERROR);
	}
header("Location:breather.php");
exit();
?>


 
LVL 108

Expert Comment

by:Ray Paseur
I'm getting a sense that we're lacking consolidation of thought on this problem, and it's turning from a question into an application development project.  For that sort of thing you might want to consider hiring a professional application developer.

Let me try to summarize what I believe to be true and ask you for a few other pieces of information so that I have a chance to get some part of this working in my own test environment.

1. You have a data source that enables you to get GZ files.  These files, once uncompressed, contain JSON strings that have some sort of Twitter data.
Where can I get the same GZ files, in the same format and quantity?

2. You have a database table zipped_files.
Please post the CREATE TABLE statement.

3. You have a database table verizon.
Please post the CREATE TABLE statement.

4. You have this: include('clean_up.php')
Please post the source for that script.

5. In the most recent postings you've got some extract() statements.
Have you verified that these are necessary and not overwriting any important variables?

Are there any other moving parts or pieces of the puzzle that I'm missing?
 

Author Comment

by:brucegust
Ray, I apologize. I can see your point that this is no longer a question and the scope of my inquiry requires more than just a brief word of wisdom.

What I've got is working, although it's slow. We'll keep at it and we'll go from there.
 
LVL 108

Expert Comment

by:Ray Paseur
No apology solicited or needed at all.  I can't see any reason why it should be so slow.  If I could get to the test data, I could probably show you a design that would be faster.  But with only one of the JSON strings, no access to the GZ files, and no information about the database tables, I'm kind of flying blind.  If you want to show us those things, please post a new question.  Thanks.
 

Author Comment

by:brucegust
I'll open up another question and get you the "stuff" you asked for.

Thanks!
 
LVL 108

Expert Comment

by:Ray Paseur
10-4.  I'll try to help in any way I can.
