Bruce Gust asked:

Why does this process the first file, then stop?

I've got a directory with two JSON files in it. The code below looks at the directory and then decompresses them one by one while simultaneously updating a database that keeps track of what files have been done, when they started being processed and when they finished.

It works!

But it does the first file, then just quits. When you look at the database that's keeping tabs on what's being done, I have this:

File Name                             | Start Time          | End Time
00_8ptcd6jgjn201311070000_day.json.gz | 2014-09-29 21:00:51 | 0000-00-00 00:00:00
00_8ptcd6jgjn201311060000_day.json.gz | 2014-09-29 21:01:39 | 2014-09-29 21:01:51

At 21:00:51, the first file started and nothing happened. Then the second file started and I can see the JSON file in the directory just as it's supposed to be. Why did the first file not decompress? What am I missing?

Here's my code:

<?php
$dir_name = 'JSON/';
if ($dh = opendir("$dir_name"))
{
  while (($file = readdir($dh)) !== false)
  {
    //omitting the system default of listing "." and ".."
		if ($file!="."&&$file!="..")
		{
			//make sure we're only reading files with a .gz extension
			$info = new SplFileInfo($file);
			if($info->getExtension()=="gz")
			{
				//at this point, look to see if the name of that file is in the database and needs to be processed
				$daniel = "select file_name from raw_files where file_name='$file'";
				$daniel_query=mysqli_query($cxn, $daniel);
					if(!$daniel_query)
					{
					$rats=mysqli_errno($cxn).': '.mysqli_error($cxn);
					die($rats);
					}
				$daniel_count=mysqli_num_rows($daniel_query);
					if(!$daniel_count>0)
					{
					//insert current date and time into your raw_files table
					$now= date('Y-m-d H:i:s');
					$nelson="insert into raw_files (file_name, start_time) value('$file', '$now')";
					$nelson_query=mysqli_query($cxn, $nelson);
						if(!$nelson_query)
						{
						$nuts=mysqli_errno($cxn).': '.mysqli_error($cxn);
						die($nuts);
						}
					$novie_id = $cxn->insert_id;
					//here's your decompression code
					$file_name = $file;
					// Raising this value may increase performance
					$buffer_size = 4096; // read 4kb at a time
					$out_file_name = str_replace('.gz', '', $file_name); 
					// Open our files (in binary mode)
					$the_file = gzopen($file_name, 'rb');
					$out_file = fopen('JSON/'.$out_file_name, 'wb'); 
					// Keep repeating until the end of the input file
						while(!gzeof($the_file)) 
						{
						// Read buffer-size bytes
						// Both fwrite and gzread are binary-safe
						  fwrite($out_file, gzread($the_file, $buffer_size));
						}  
					// Files are done, close files
					fclose($out_file);
					gzclose($the_file);
					//here's where you update the raw_files database with a time it was completed
					$right_now= date('Y-m-d H:i:s');
					$brice="update raw_files set end_time = '$right_now' where id=$novie_id";
					$brice_query=mysqli_query($cxn, $brice)
					or die("Brice didn't happen.");
				}
			//here's where you're doing your parsing and putting things into the verizon table
			//$the_new_file=str_replace('.gz',"",$file);
			//echo $the_new_file;
			//start
			//sleep(10);
			}
		}
	}
}
closedir($dh);
echo "done!";

?> 


gr8gonzo:

Maybe the first file isn't GZipped, even though the extension has .gz?

Try adding:
echo __LINE__;



...to various parts AFTER the insert queries and then run it. You should be able to see where the line #s stop (first file) and restart (second file) - that might give some insight.

As a general rule it's wise to test the return values from PHP functions. It looks like the script does not test the return value from $the_file = gzopen($file_name, 'rb');. You might also want to add error_reporting(E_ALL) to the top of the script. If you have these gz files on a public-facing server where we can test, we would welcome the URL of the directory, and we could test the script with some breakpoints and diagnostics.

Some interesting user-contributed notes on this page:
http://php.net/manual/en/function.gzread.php

You might also consider using scandir() since it will let you get the files in a predictable order.
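
The two suggestions above (test return values, and use scandir() for a predictable order) might look something like the sketch below. This is only an illustration, assuming PHP 7 or later with the zlib extension loaded; the helper names list_gz_files() and open_gz_or_fail() are made up for this example and are not part of the original script.

```php
<?php
error_reporting(E_ALL);

/**
 * Return the names of the .gz files in $dir, in alphabetical order.
 * Hypothetical helper, for illustration only.
 */
function list_gz_files(string $dir): array
{
    $names = scandir($dir);          // scandir() sorts ascending by default
    if ($names === false) {
        throw new RuntimeException("Unable to read directory $dir");
    }
    $gz = [];
    foreach ($names as $name) {
        // Skips ".", ".." and anything that is not a regular file ending in .gz
        if (is_file("$dir/$name") && pathinfo($name, PATHINFO_EXTENSION) === 'gz') {
            $gz[] = $name;
        }
    }
    return $gz;
}

/**
 * Open a gzip file for reading and fail loudly instead of returning false.
 * Hypothetical helper, for illustration only.
 */
function open_gz_or_fail(string $path)
{
    $handle = @gzopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Unable to gzopen $path");
    }
    return $handle;
}
```

With something like this in place, a file that cannot be opened stops the script with a message naming the path, rather than letting the read loop run against an invalid handle.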

Bruce Gust (ASKER):

Yo, Gonzo!

I'm not sure I'm following you. I added the thing you suggested and I got something like 404040done.

I was able to identify something though, tell me if this doesn't help better determine where things are breaking down.

I commented out some things and renamed some variables in an effort to figure out what was going on. Here's the code as it looks now:

<?php
$dir_name = 'JSON/';
if ($dh = opendir("$dir_name"))
{
  while (($file = readdir($dh)) !== false)
  {
    //omitting the system default of listing "." and ".."
		if ($file!="."&&$file!="..")
		{
			//make sure we're only reading files with a .gz extension
			$info = new SplFileInfo($file);
			if($info->getExtension()=="gz")
			{
				//at this point, look to see if the name of that file is in the database and needs to be processed
				$daniel = "select file_name from raw_files where file_name='$file'";
				$daniel_query=mysqli_query($cxn, $daniel);
					if(!$daniel_query)
					{
					$rats=mysqli_errno($cxn).': '.mysqli_error($cxn);
					die($rats);
					}
				$daniel_count=mysqli_num_rows($daniel_query);
					if(!$daniel_count>0)
					{
					//insert current date and time into your raw_files table
					/*$now= date('Y-m-d H:i:s');
					$nelson="insert into raw_files (file_name, start_time) value('$file', '$now')";
					$nelson_query=mysqli_query($cxn, $nelson);
						if(!$nelson_query)
						{
						$nuts=mysqli_errno($cxn).': '.mysqli_error($cxn);
						die($nuts);
						}
					$novie_id = $cxn->insert_id;*/
					//here's your decompression code
					// Raising this value may increase performance
					$buffer_size = 4096; // read 4kb at a time
					$out_file_name = str_replace('.gz', '',$file); 
					// Open our files (in binary mode)
					$the_file = gzopen($out_file_name, 'rb');
					$out_file = fopen('JSON/'.$out_file_name, 'wb'); 
					// Keep repeating until the end of the input file
						while(!gzeof($file)) 
						{
						// Read buffer-size bytes
						// Both fwrite and gzread are binary-safe
						  fwrite($out_file, gzread($file, $buffer_size));
						}  
					// Files are done, close files
					fclose($out_file);
					gzclose($the_file);
					//here's where you update the raw_files database with a time it was completed
					/*$right_now= date('Y-m-d H:i:s');
					$brice="update raw_files set end_time = '$right_now' where id=$novie_id";
					$brice_query=mysqli_query($cxn, $brice)
					or die("Brice didn't happen.");*/
				}
			//here's where you're doing your parsing and putting things into the verizon table
			//$the_new_file=str_replace('.gz',"",$file);
			//echo $the_new_file;
			//start
			//sleep(10);
			}
		}
	}
}
closedir($dh);
echo "done!";

?> 


When I do "echo $file" I get "00_8ptcd6jgjn201309050000_day.json.gz"

Perfect!

But when I run the code, I get "Warning: gzopen(00_8ptcd6jgjn201309050000_day.json): failed to open stream: No such file or directory in C:\wamp\www\json\decompress.php on line 66" which is this part of the code:

                              $out_file_name = str_replace('.gz', '',$file);
                              // Open our files (in binary mode)
                              $the_file = gzopen($out_file_name, 'rb');

Specifically, "$the_file"

When I go out to the directory, I see 00_8ptcd6jgjn201309050000_day.json, so the file is there, yet the page says that it doesn't exist.

What do you think?

SOLUTION (gr8gonzo): available to Experts Exchange members only.

You might try it something like this.  You will be able to see the variables at certain points in the process, and the script should stop with an error message if something is completely out of whack.

<?php // demo/temp_brucegust.php
error_reporting(E_ALL);

$dir = 'JSON/';
$arr = scandir($dir);
unset($arr[0]);
unset($arr[1]);

foreach($arr as $file)
{
    $info = new SplFileInfo($file);
    if($info->getExtension()!= "gz") continue;
    echo PHP_EOL . "PROCESSING $file";

    $daniel = "SELECT file_name FROM raw_files WHERE file_name='$file'";
    $daniel_query=mysqli_query($cxn, $daniel);
    if(!$daniel_query)
    {
        var_dump($daniel);
        trigger_error(mysqli_errno($cxn).': '.mysqli_error($cxn), E_USER_ERROR);
    }
    $daniel_count=mysqli_num_rows($daniel_query);
    echo PHP_EOL . "FOUND $daniel_count DATABASE ROWS FOR $file";

    if ($daniel_count)
    {
        $now = date('c');
        $nelson="INSERT INTO raw_files (file_name, start_time) VALUES ('$file', '$now')";
        $nelson_query=mysqli_query($cxn, $nelson);
        if(!$nelson_query)
        {
            var_dump($nelson);
            trigger_error(mysqli_errno($cxn).': '.mysqli_error($cxn), E_USER_ERROR);
        }
        $new_id = $cxn->insert_id;
        echo PHP_EOL . "INSERTED ID=$new_id INTO raw_files TABLE";
    }
    else
    {
        trigger_error("NO DATA INSERTED FOR $file", E_USER_ERROR);
    }

    $buffer_size = 4096; // read 4kb at a time
    $out_file_name = str_replace('.gz', '',$file);

    $inp_handle = gzopen($file, 'rb');
    if (!$inp_handle) trigger_error("UNABLE TO GZOPEN $file", E_USER_ERROR);

    $out_handle = fopen('JSON/'.$out_file_name, 'wb');
    if (!$out_handle) trigger_error("UNABLE TO FOPEN $out_file_name", E_USER_ERROR);

    while(!gzeof($inp_handle))
    {
        $data = gzread($inp_handle);
        fwrite($out_handle, $data);
    }
    fclose($out_handle);
    gzclose($inp_handle);

    $now = date('c');
    $brice="UPDATE raw_files SET end_time = '$now' WHERE id=$new_id LIMIT 1";
    $brice_query=mysqli_query($cxn, $brice);
    if(!$brice_query)
    {
        var_dump($brice);
        trigger_error(mysqli_errno($cxn).': '.mysqli_error($cxn), E_USER_ERROR);
    }
}

echo "done!";


Not a typo, just trying to figure out why the code doesn't "see" the JSON file that was supposedly just "opened."

The error I'm getting is at line 40. That's where I get: Warning: gzopen(00_8ptcd6jgjn201309050000_day.json): failed to open stream: No such file or directory in C:\wamp\www\json\decompress.php on line 66

Why doesn't it see the file when I can see it in the directory?
Your fopen BELOW the gzopen is what creates the output file. That's why you see it but gzopen doesn't. That said, you shouldn't be gzopen-ing the output file...
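
To illustrate the point about fopen, here is a small sketch showing that opening a path in 'wb' mode creates an empty file right away, so seeing the .json file in the directory does not by itself mean anything was decompressed into it. The scratch directory and file name are made up for the demo.

```php
<?php
// Hypothetical scratch directory and file name, used only for this demo.
$dir = sys_get_temp_dir() . '/fopen_demo_' . uniqid();
mkdir($dir);
$out_path = "$dir/example_day.json";

$existed_before = file_exists($out_path);   // false: nothing has created it yet

$out_handle = fopen($out_path, 'wb');        // 'wb' creates the file if it is missing
fclose($out_handle);

clearstatcache();                            // drop cached stat results before re-checking
$exists_after = file_exists($out_path);      // true: fopen() created the file
$size_after   = filesize($out_path);         // 0: nothing was ever written to it

var_dump($existed_before, $exists_after, $size_after);
```

So the input handle should come from the .gz file, and the .json path should only ever be passed to fopen.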
Ray!

Here's the portion of code that you wrote that I experimented with:

<?php // demo/temp_brucegust.php
error_reporting(E_ALL);

$dir = 'JSON/';
$arr = scandir($dir);
unset($arr[0]);
unset($arr[1]);

foreach($arr as $file)
{
    $info = new SplFileInfo($file);
    if($info->getExtension()!= "gz") continue;
    echo PHP_EOL . "PROCESSING $file";



    $buffer_size = 4096; // read 4kb at a time
    $out_file_name = str_replace('.gz', '',$file);

    $inp_handle = gzopen($file, 'rb');
    if (!$inp_handle) trigger_error("UNABLE TO GZOPEN $file", E_USER_ERROR);

    $out_handle = fopen('JSON/'.$out_file_name, 'wb');
    if (!$out_handle) trigger_error("UNABLE TO FOPEN $out_file_name", E_USER_ERROR);

    while(!gzeof($inp_handle))
    {
        $data = gzread($inp_handle);
        fwrite($out_handle, $data);
    }
    fclose($out_handle);
    gzclose($inp_handle);


}

echo "done!";
?>



I figured I've got a ninja writing the decompressing code - that's the thing that's killing me right now. So, using that, this is the error I got:
[Screenshot: ray.png]
What do you think? Where am I blowing it?

Change:
$inp_handle = gzopen($file, 'rb');

To:
$inp_handle = gzopen('JSON/'.$file, 'rb');
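
The reason for the change: scandir() and readdir() return bare file names with no directory part, so the directory has to be put back on before the name is handed to gzopen() or fopen(). A rough sketch of building both paths in one place, assuming PHP 7 or later; gz_paths() is a made-up helper name, not something from the script above.

```php
<?php
/**
 * Build the input and output paths for a .gz file name returned by scandir().
 * Hypothetical helper, for illustration only.
 */
function gz_paths(string $dir, string $file): array
{
    $dir = rtrim($dir, '/\\');
    // Strip only a trailing ".gz"; str_replace() would remove every occurrence
    $out_name = preg_replace('/\.gz$/', '', $file);
    return [
        'in'  => $dir . '/' . $file,
        'out' => $dir . '/' . $out_name,
    ];
}

$paths = gz_paths('JSON/', '00_8ptcd6jgjn201309050000_day.json.gz');
echo $paths['in'], PHP_EOL;   // JSON/00_8ptcd6jgjn201309050000_day.json.gz
echo $paths['out'], PHP_EOL;  // JSON/00_8ptcd6jgjn201309050000_day.json
```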
Gonzo!

After implementing your suggestion I get:

line 52 expects two parameters...

The line in question is $data=gzread($inp_handle).

Here's the code with your recommendations...

<?php // demo/temp_brucegust.php
error_reporting(E_ALL);

$dir = 'JSON/';
$arr = scandir($dir);
unset($arr[0]);
unset($arr[1]);

foreach($arr as $file)
{
    $info = new SplFileInfo($file);
    if($info->getExtension()!= "gz") continue;
    echo PHP_EOL . "PROCESSING $file";

    $buffer_size = 4096; // read 4kb at a time
    $out_file_name = str_replace('.gz', '',$file);

    $inp_handle = gzopen('JSON/'.$file, 'rb'); 
    if (!$inp_handle) trigger_error("UNABLE TO GZOPEN $file", E_USER_ERROR);

    $out_handle = fopen('JSON/'.$out_file_name, 'wb');
    if (!$out_handle) trigger_error("UNABLE TO FOPEN $out_file_name", E_USER_ERROR);

    while(!gzeof($inp_handle))
    {
        $data = gzread($inp_handle);
        fwrite($out_handle, $data);
    }
    fclose($out_handle);
    gzclose($inp_handle);

    $now = date('c');
    $brice="UPDATE raw_files SET end_time = '$now' WHERE id=$new_id LIMIT 1";
    $brice_query=mysqli_query($cxn, $brice);
    if(!$brice_query)
    {
        var_dump($brice);
        trigger_error(mysqli_errno($cxn).': '.mysqli_error($cxn), E_USER_ERROR);
    }
}

echo "done!";

?> 


How can you get an error on line 52 in a script that has only 44 lines?  Are you sure you're testing the right script?

ASKER CERTIFIED SOLUTION: available to Experts Exchange members only.

That'll do it, Ray!

What was I not doing right?
Thanks for the points.  I don't really know what might have been wrong - when the error says line 52 but the script only has 44 lines, I don't read the code at all - I just try to produce something that I think might work.  

As a general rule, more data visualization is better when you're trying to debug some code, so you'll often see a lot of echo and var_dump() statements in my programming.  

As another general rule, the if() statement without the else{} control structure is often a path to confusion.  It's like saying "If something happens do this, but ignore the facts if something didn't happen."  That kind of selective way of thinking about facts leads to assumptions that often fail in unit tests.   If you like geek jokes, you'll appreciate this one:

The QA Engineer walks into a bar. Orders a beer. Orders 0 beers. Orders 999999999 beers. Orders a lizard. Orders -1 beers. Orders a sfdeljknesv.
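You didn't include the buffer size in your gzread call. gzread() takes two arguments, the handle and the number of bytes to read, which is why PHP complained that it "expects two parameters."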
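
For reference, a read loop with the length argument in place could look roughly like this. It is a sketch rather than the accepted solution (which is not visible here), assuming PHP 7 or later with zlib; gunzip_to_file() is a made-up name.

```php
<?php
/**
 * Decompress the gzip file $in_path into $out_path, $buffer_size bytes at a time.
 * Hypothetical helper, for illustration only. Returns the number of bytes written.
 */
function gunzip_to_file(string $in_path, string $out_path, int $buffer_size = 4096): int
{
    $in = gzopen($in_path, 'rb');
    if ($in === false) {
        throw new RuntimeException("Unable to gzopen $in_path");
    }
    $out = fopen($out_path, 'wb');
    if ($out === false) {
        gzclose($in);
        throw new RuntimeException("Unable to fopen $out_path");
    }

    $written = 0;
    while (!gzeof($in)) {
        // The second argument is required: the maximum number of
        // uncompressed bytes to return from this call.
        $chunk = gzread($in, $buffer_size);
        if ($chunk === false || $chunk === '') {
            break;  // stop rather than spin if nothing more can be read
        }
        $bytes = fwrite($out, $chunk);
        if ($bytes === false) {
            fclose($out);
            gzclose($in);
            throw new RuntimeException("Unable to write to $out_path");
        }
        $written += $bytes;
    }

    fclose($out);
    gzclose($in);
    return $written;
}
```

Called as gunzip_to_file('JSON/' . $file, 'JSON/' . $out_file_name), it would slot in where the while loop sits in the scripts above.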