Solved

How can I grease these parsing wheels?

Posted on 2014-10-01
10
94 Views
Last Modified: 2014-10-06
Here are my table structures:

zipped files:

CREATE TABLE `zipped_files` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `file_name` varchar(150) NOT NULL,
  `completed` tinyint(11) NOT NULL,
  `page_number` int(11) NOT NULL,
  `start_time` datetime NOT NULL,
  `end_time` datetime NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=474 DEFAULT CHARSET=latin1

Verizon:

CREATE TABLE `verizon` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `twitter_id` varchar(150) NOT NULL,
  `actor_id` varchar(150) NOT NULL,
  `actor_display_name` varchar(150) NOT NULL,
  `posted_time` varchar(150) NOT NULL,
  `display_name` varchar(150) NOT NULL,
  `geo_coords_0` varchar(150) NOT NULL,
  `geo_coords_1` varchar(150) NOT NULL,
  `location_name` varchar(150) NOT NULL,
  `posted_day` date NOT NULL,
  PRIMARY KEY (`id`),
  KEY `id_day` (`posted_day`),
  KEY `id_geo_coords_0` (`geo_coords_0`),
  KEY `posted_day` (`posted_day`),
  KEY `location_name` (`location_name`),
  KEY `geo_coords_1` (`geo_coords_1`),
  KEY `geo_coords_0` (`geo_coords_0`)
) ENGINE=InnoDB AUTO_INCREMENT=28665 DEFAULT CHARSET=latin1

BTW: I'm looking at the last create table statement and I see the indexes I attempted to place on geo_coords_0, geo_coords_1 and posted day. It looks like I've got a "key" on every column, which wasn't my intention. We'll look at that later, but that's my table.

clean_up.php is attached...

The process, as it exists now is this: index.php  - lists all of the gz files currently in the directory) | decompress.php - uses a foreach statement that goes through each file and first looks to see if it's been completed by referring to the "zipped_files" table. If not, it proceeds to decompress it. After that process is completed, there's a redirect with an id embedded in the url that takes the process to.... | parse.php - here the decompressed file is parsed into a database. It's here that "clean_up.php" is used to make sure things don't start to cough and sputter if there aren't any values and once this process is completed...| breather.php - nothing more than a page designed to give the system a "breather" before being redirected to "decompress.php" where the process starts all over again.

I've got all the files attached along with a  sample gz file.

The code works, but it's slow. I'm looking to see where there's room for improvement so I can expedite the progress of decompressing and parsing 365 files, each one being 2-3 MG
clean-up.php
index.php
decompress.php
parse.php
0
Comment
Question by:brucegust
  • 5
  • 5
10 Comments
 

Author Comment

by:brucegust
Comment Utility
0
 
LVL 108

Expert Comment

by:Ray Paseur
Comment Utility
OK, I can read the GZ file, so that is a good start.   159MB more or less.  That seems to be a far cry from "2-3 MG."  There are 450,981 lines in the GZ file.  Does that sound right to you?

Please look this over carefully and let me know if it makes sense for a table structure.  I don't know what you want to do with the twitter id, so please tell me if that is supposed to be part of the retained data.  Also, we don't keep the day and time in separate columns - we keep only a single DATETIME type column.  The geocode info is typically kept as a Lat,Lon signed DECIMAL(16,8), and that part's fine.

Please look closely at line 120+ and tell me if that will capture the data you want to retain from each JSON object.

$jso = <<<EOJ
{
  "id": "tag:search.twitter.com,2005:389903668427763712",
  "objectType": "activity",
  "actor": {
    "objectType": "person",
    "id": "id:twitter.com:91239297",
    "link": "http://www.twitter.com/OGkush103",
    "displayName": "WalkingLick74",
    "postedTime": "2009-11-20T01:21:39.000Z",
    "image": "https://si0.twimg.com/profile_images/378800000593715086/755411d8bdc495472c2d7ed50e319582_normal.jpeg",
    "summary": "Self-Made, Self Paid..... I always had the mind to get it like a man, head first bout my younging Ean! #YOLO",
    "links": [
      {
        "href": null,
        "rel": "me"
      }
    ],
    "friendsCount": 468,
    "followersCount": 677,
    "listedCount": 0,
    "statusesCount": 25504,
    "twitterTimeZone": "Alaska",
    "verified": false,
    "utcOffset": "-28800",
    "preferredUsername": "OGkush103",
    "languages": [
      "en"
    ],
    "location": {
      "objectType": "place",
      "displayName": "Boston George Crib"
    },
    "favoritesCount": 26
  },
  "verb": "post",
  "postedTime": "2013-10-15T00:00:53.000Z",
  "generator": {
    "displayName": "Twitter for iPhone",
    "link": "http://twitter.com/download/iphone"
  },
  "provider": {
    "objectType": "service",
    "displayName": "Twitter",
    "link": "http://www.twitter.com"
  },
  "link": "http://twitter.com/OGkush103/statuses/389903668427763712",
  "body": "You a killer you on twitter, You'n do NO talking",
  "object": {
    "objectType": "note",
    "id": "object:search.twitter.com,2005:389903668427763712",
    "summary": "You a killer you on twitter, You'n do NO talking",
    "link": "http://twitter.com/OGkush103/statuses/389903668427763712",
    "postedTime": "2013-10-15T00:00:53.000Z"
  },
  "favoritesCount": 0,
  "location": {
    "objectType": "place",
    "displayName": "Mississippi, US",
    "name": "Mississippi",
    "country_code": "United States",
    "twitter_country_code": "US",
    "link": "https://api.twitter.com/1.1/geo/id/43d2418301bf1a49.json",
    "geo": {
      "type": "Polygon",
      "coordinates": [
        [
          [
            -91.65500899999999,
            30.146096
          ],
          [
            -91.65500899999999,
            34.996099
          ],
          [
            -88.097888,
            34.996099
          ],
          [
            -88.097888,
            30.146096
          ]
        ]
      ]
    }
  },
  "geo": {
    "type": "Point",
    "coordinates": [
      31.99686058,
      -88.72688823
    ]
  },
  "twitter_entities": {
    "hashtags": [],
    "symbols": [],
    "urls": [],
    "user_mentions": []
  },
  "twitter_filter_level": "medium",
  "twitter_lang": "en",
  "retweetCount": 0,
  "gnip": {
    "matching_rules": [
      {
        "tag": null
      }
    ],
    "language": {
      "value": "en"
    }
  }
}

$ins 
= 
"
INSERT INTO verizon 
( actor_id              /* JSON: actor->id (data past colon)   "91239297"
, actor_display_name    /* JSON: actor->displayName            "WalkingLick74"
, posted_time           /* JSON: postedTime:                   "2013-10-15T00:00:53.000Z",
, display_name          /* JSON: generator->displayName:       "Twitter for iPhone"
, geo_coords_lat        /* JSON: geo->coordinates[0]:          31.99686058
, geo_coords_lon        /* JSON: geo->coordinates[1]:          -88.72688823
, location_name         /* JSON: location->name:               "Mississippi"
) VALUES 
( '$actor_id'
, '$actor_display_name'
, '$posted_time'
, '$display_name'
, '$geo_coords_0'
, '$geo_coords_1'
, '$location_name'
)
"
;

Open in new window

0
 
LVL 108

Expert Comment

by:Ray Paseur
Comment Utility
Update: Nevermind - I get the Twitter Id thing now.
0
 

Author Comment

by:brucegust
Comment Utility
That's it!

The only thing I need to add is the twitter id, so I'm thinking 120+ looks like this:

"
INSERT INTO verizon 
( twitter_id             /* JSON: "id: tag:search.twitter.com,2005:389903668427763712",
, actor_id              /* JSON: actor->id (data past colon)   "91239297"
, actor_display_name    /* JSON: actor->displayName            "WalkingLick74"
, posted_time           /* JSON: postedTime:                   "2013-10-15T00:00:53.000Z",
, display_name          /* JSON: generator->displayName:       "Twitter for iPhone"
, geo_coords_lat        /* JSON: geo->coordinates[0]:          31.99686058
, geo_coords_lon        /* JSON: geo->coordinates[1]:          -88.72688823
, location_name         /* JSON: location->name:               "Mississippi"
) VALUES 
( '$actor_id'
, '$actor_display_name'
, '$posted_time'
, '$display_name'
, '$geo_coords_0'
, '$geo_coords_1'
, '$location_name'
)
"

Open in new window


As an aside, in an effort to streamline things so I could catch a clear vision of what the bottom line was, I manually decompressed a gz file and ran it through the following:

 
<?php
include("carter.inc");
$cxn = mysqli_connect($host,$user,$password,$database)
or die ("couldn't connect to server");
$chunk_size=4096;
$url="E:/verizon/page_19/00_8ptcd6jgjn201308010000_day.json";
//$url .=$the_new_file;
$handle=@fopen($url,'r');
	if(!$handle) 
	{
		echo "failed to open JSON file";
	}
while (!feof($handle)) 
{
$buffer = fgets($handle, $chunk_size);
	if(trim($buffer)!=='')
	{
	$obj=json_decode(($buffer), true);
	
	include('clean_up.php');	
	
	$check="select twitter_id from verizon where twitter_id='$twitter_id'";
	$check_query=mysqli_query($cxn, $check)
	or die("Check didn't happen.");
	$check_count=mysqli_num_rows($check_query);
		if($check_count==0)
		{
		$insert = "insert into verizon (twitter_id, actor_id, actor_display_name, posted_time, display_name, geo_coords_0, geo_coords_1, location_name, posted_day) 
		values ('$twitter_id', '$actor_id', '$actor_display_name', '$posted_time', '$display_name', '$geo_coords_0', '$geo_coords_1', '$location_name', '$posted_day')";
		$insertexe = mysqli_query($cxn, $insert);
			if(!$insertexe) {
			$error = mysqli_errno($cxn).': '.mysqli_error($cxn);
			die($error);
			}
			//echo $row_count.' | '. $obj['actor']['id'].' | '.$obj['actor']['displayName'].' | '.$obj['postedTime'].' | '.$obj['generator']['displayName'].' | '.$obj['geo']['coordinates']['0'].' | '.$obj['geo']['coordinates']['1'].' | '.$obj['location']['name'].' '.$trigger.'<br>';
		}
		else
		{
		continue;
		}
	}
}
fclose($handle);
echo "done"!;
//end parsing
?>

Open in new window


It's just the bare bones elements of the "parsing.php" page I had before without the decompression piece. The JSON file is a 1 GB (1,379,179 KB [sorry about the misquote earlier]). Maybe an hour should be expected, but that's how long this thing has been churning on that one file.

Thanks!
0
 
LVL 108

Expert Comment

by:Ray Paseur
Comment Utility
Please clarify on this one:

( twitter_id             /* JSON: "id: tag:search.twitter.com,2005:389903668427763712",

I understand that you want to keep the JSON "id" valie.  Do you want to keep the entire string or just this quoted part, after the colon?

( twitter_id             /* JSON: id: "389903668427763712",
0
Highfive + Dolby Voice = No More Audio Complaints!

Poor audio quality is one of the top reasons people don’t use video conferencing. Get the crispest, clearest audio powered by Dolby Voice in every meeting. Highfive and Dolby Voice deliver the best video conferencing and audio experience for every meeting and every room.

 

Author Comment

by:brucegust
Comment Utility
Just the quoted part.
0
 
LVL 108

Accepted Solution

by:
Ray Paseur earned 500 total points
Comment Utility
Bruce: I attached a timer to this script so we can see where it's spending its time.  Overall it looks like the process runs about 4~5 minutes +/- a bit, and it loads 450K records, suggesting a rate of about 1,500 INSERT queries per second.  If you have 365 such files, you're looking at loading 164MM rows over about a 30-hour period.

Opening the GZ file takes 48 seconds.  I assume that's because the open process has to read all the way to the end to get the counts and sizes.  After the load, retrieving all the records took about 3 seconds.

The process could be made more efficient by ganging up several INSERT rows into each query.  I expect that would speed things up a lot.

I did not add indexes at the moment of INSERT.  It's faster to INSERT without indexes, and you can always add them later.

Sample output:

450981 RECORDS READ
1349 BLANK RECORDS SKIPPED
0 BOGUS RECORDS SKIPPED
449632 VALID RECORDS LOADED
gzopen STOPPED  48,926.497 ms
load STOPPED  267,547.669 ms
retrieve RUNNING  2,838.136 ms

<?php // demo/temp_brucegust.php
error_reporting(E_ALL);
echo '<meta charset="utf8" />';
echo '<pre>';

// DATABASE CONNECTION AND SELECTION VARIABLES - GET THESE FROM YOUR HOSTING COMPANY
$db_host = "localhost"; // PROBABLY THIS IS OK
$db_name = "??";
$db_user = "??";
$db_word = "??";

// OPEN A CONNECTION TO THE DATA BASE SERVER AND SELECT THE DB
$mysqli = new mysqli($db_host, $db_user, $db_word, $db_name);

// DID THE CONNECT/SELECT WORK OR FAIL?
if ($mysqli->connect_errno)
{
    $err
    = "CONNECT FAIL: "
    . $mysqli->connect_errno
    . ' '
    . $mysqli->connect_error
    ;
    trigger_error($err, E_USER_ERROR);
}

// ACTIVATE THIS TO SHOW WHAT THE DB CONNECTION OBJECT LOOKS LIKE
// var_dump($mysqli);

$sql
=
"
CREATE TEMPORARY TABLE `verizon`
( `id`                 INT           NOT NULL AUTO_INCREMENT PRIMARY KEY
, `twitter_id`         VARCHAR(150)  NOT NULL DEFAULT ''
, `actor_id`           VARCHAR(150)  NOT NULL DEFAULT ''
, `actor_display_name` VARCHAR(150)  NOT NULL DEFAULT ''
, `posted_time`        DATETIME      NOT NULL DEFAULT '0000-00-00 00:00:00'
, `display_name`       VARCHAR(150)  NOT NULL DEFAULT ''
, `geo_coords_lat`     DECIMAL(16,9) NOT NULL DEFAULT 0.0
, `geo_coords_lon`     DECIMAL(16,9) NOT NULL DEFAULT 0.0
, `location_name`      VARCHAR(150)  NOT NULL DEFAULT ''
)
ENGINE=InnoDB
DEFAULT CHARSET=utf8mb4
"
;

// IF mysqli::query() RETURNS FALSE, LOG AND SHOW THE ERROR
if (!$res = $mysqli->query($sql))
{
    $err
    = 'QUERY FAILURE:'
    . ' ERRNO: '
    . $mysqli->errno
    . ' ERROR: '
    . $mysqli->error
    . ' QUERY: '
    . $sql
    ;
    trigger_error($err, E_USER_ERROR);
}

// LOAD THE DATA INTO THE TABLE
$tmr = new StopWatch;
$tmr->start('gzopen');
$url = 'http://www.brucegust.com/downloads/00_8ptcd6jgjn201311010000_day.json.gz';
$gzf = gzopen($url, 'r');
$tmr->stop('gzopen');

// COUNT THE LINES, SKIPPED, DUDS, AND LOADED
$cnt = $skp = $dud = $ldd = 0;

$tmr->start('load');
while (!gzeof($gzf))
{
    set_time_limit(2);
    $cnt++;
    $doc = gzgets($gzf, 12000);
    $doc = trim($doc);
    if (empty($doc))
    {
        $skp++;
        echo ' ';
        continue;
    }
    $sql = makeQuery($doc);
    if ($sql)
    {
        if (!$res = $mysqli->query($sql))
    	{
    	    $err
    	    = 'QUERY FAILURE:'
    	    . ' ERRNO: '
    	    . $mysqli->errno
    	    . ' ERROR: '
    	    . $mysqli->error
    	    . PHP_EOL
    	    . 'QUERY: '
    	    . $sql
    	    . PHP_EOL
    	    . 'LINE: '
    	    . $cnt
    	    ;
    	    trigger_error($err, E_USER_WARNING);
    	}
    	else
    	{
    	    $ldd++;
    	}
	}
	else
	{
	    echo "LINE $cnt: DUD: $doc";
	    $dud++;
	}
}
$tmr->stop('load');

// RETRIEVE THE ROWS TO TIME THE RETRIEVAL
$tmr->start('retrieve');
$sql = "SELECT * FROM verizon";
$res = $mysqli->query($sql);
while ($row = $res->fetch_object()) { }

echo PHP_EOL . "$cnt RECORDS READ";
echo PHP_EOL . "$skp BLANK RECORDS SKIPPED";
echo PHP_EOL . "$dud BOGUS RECORDS SKIPPED";
echo PHP_EOL . "$ldd VALID RECORDS LOADED";

echo PHP_EOL . $tmr->readout();
$tmr->reset();
unset($tmr);

function makeQuery($doc)
{
    global $mysqli;
    $jso = json_decode($doc);
    if (!$jso) return FALSE;

    $twitter_id          = !empty($jso->id)        ? explode(':',$jso->id)        : array(NULL, NULL, NULL);
    $twitter_id          = $mysqli->real_escape_string($twitter_id[2]);

    $actor_id            = !empty($jso->actor->id) ? explode(':',$jso->actor->id) : array(NULL, NULL, NULL);
    $actor_id            = $mysqli->real_escape_string($actor_id[2]);

    $actor_display_name  = !empty($jso->actor->displayName)     ? $mysqli->real_escape_string($jso->actor->displayName)     : NULL;
    $posted_time         = !empty($jso->postedTime)             ? $mysqli->real_escape_string($jso->postedTime)             : NULL;
    $display_name        = !empty($jso->generator->displayName) ? $mysqli->real_escape_string($jso->generator->displayName) : NULL;
    $geo_coords_lat      = !empty($jso->geo->coordinates[0])    ? $mysqli->real_escape_string($jso->geo->coordinates[0])    : NULL;
    $geo_coords_lon      = !empty($jso->geo->coordinates[1])    ? $mysqli->real_escape_string($jso->geo->coordinates[1])    : NULL;
    $location_name       = !empty($jso->location->name)         ? $mysqli->real_escape_string($jso->location->name)         : NULL;

    $qry
    =
	"
	INSERT INTO verizon
	( twitter_id
	, actor_id
	, actor_display_name
	, posted_time
	, display_name
	, geo_coords_lat
	, geo_coords_lon
	, location_name
	) VALUES
	( '$twitter_id'
	, '$actor_id'
	, '$actor_display_name'
	, '$posted_time'
	, '$display_name'
	, '$geo_coords_lat'
	, '$geo_coords_lon'
	, '$location_name'
	)
	"
	;
    return $qry;
}

// A SCRIPT TIMER FOR ALL OR PART OF A SCRIPT PHP 5+
// MAN PAGE http://php.net/manual/en/function.microtime.php
class StopWatch
{
    protected $a; // START TIME
    protected $s; // STATUS - IF RUNNING
    protected $z; // STOP TIME

    public function __construct()
    {
        $this->a = array();
        $this->s = array();
        $this->z = array();
    }

    // A METHOD TO PROVIDE A FINAL READOUT, IF NEEDED
    public function __destruct()
    {
        $ret = $this->readout();
        if (!$ret) return FALSE;
        echo
          __CLASS__
        . '::'
        . __FUNCTION__
        . '() '
        ;
        echo "<b>$ret</b>";
        echo PHP_EOL;
    }

    // A METHOD TO REMOVE A TIMER
    public function reset($name='TIMER')
    {
        // RESET ALL TIMERS
        if ($name == 'TIMER')
        {
            $this->__construct();
        }
        else
        {
            unset($this->a[$name]);
            unset($this->s[$name]);
            unset($this->z[$name]);
        }
    }

    // A METHOD TO CAPTURE THE START TIME
    public function start($name='TIMER')
    {
        $this->a[$name] = microtime(TRUE);
        $this->z[$name] = $this->a[$name];
        $this->s[$name] = 'RUNNING';
    }

    // A METHOD TO CAPTURE THE END TIME
    public function stop($name='TIMER')
    {
        $ret = NULL;

        // STOP ALL THE TIMERS
        if ($name == 'TIMER')
        {
            foreach ($this->a as $name => $start_time)
            {
                // IF THIS TIMER IS STILL RUNNING, STOP IT
                if ($this->s[$name])
                {
                    $this->s[$name] = FALSE;
                    $this->z[$name] = microtime(TRUE);
                }
            }
        }

        // STOP ONLY ONE OF THE TIMERS
        else
        {
            if ($this->s[$name])
            {
                $this->s[$name] = FALSE;
                $this->z[$name] = microtime(TRUE);
            }
            else
            {
                $ret .= "ERROR: CALL TO STOP() METHOD: '$name' IS NOT RUNNING";
            }
        }

        // RETURN AN ERROR MESSAGE, IF ANY
        return $ret;
    }

    // A METHOD TO READ OUT THE TIMER(S)
    public function readout($name='TIMER', $dec=3, $m=1000, $t = 'ms', $eol=PHP_EOL)
    {
        $str = NULL;

        // GET READOUTS FOR ALL THE TIMERS
        if ($name == 'TIMER')
        {
            foreach ($this->a as $name => $start_time)
            {
                $str .= $name;

                // IF THIS TIMER IS STILL RUNNING UPDATE THE END TIME
                if ($this->s[$name])
                {
                    $this->z[$name] = microtime(TRUE);
                    $str .= " RUNNING ";
                }
                else
                {
                    $str .= " STOPPED ";
                }

                // RETURN A DISPLAY STRING
                $lapse_time = $this->z[$name] - $start_time;
                $lapse_msec = $lapse_time * $m;
                $lapse_echo = number_format($lapse_msec, $dec);
                $str .= " $lapse_echo $t";
                $str .= $eol;
            }
            return $str;
        }

        // GET A READOUT FOR ONLY ONE TIMER
        else
        {
            $str .= $name;

            // IF THIS TIME IS STILL RUNNING, UPDATE THE END TIME
            if ($this->s[$name])
            {
                $this->z[$name] = microtime(TRUE);
                $str .= " RUNNING ";
            }
            else
            {
                $str .= " STOPPED ";
            }

            // RETURN A DISPLAY STRING
            $lapse_time = $this->z[$name] - $this->a[$name];
            $lapse_msec = $lapse_time * $m;
            $lapse_echo = number_format($lapse_msec, $dec);
            $str .= " $lapse_echo $t";
            $str .= $eol;
            return $str;
        }
    }
}

Open in new window

0
 

Author Comment

by:brucegust
Comment Utility
Ray, I wanted you to see the attached screen shot.

What you're looking is a portion of the 365 files that were processed using your suggestions. Not only did I use the solution you graciously provided, but I also borrowed from some of your previous thoughts and added the "management" dynamic so you could see what was happening as things were being processed.

To keep things from timing out, I did a redirect to a different page where a JavaScript had it wait 10 seconds before it jumped in again.

It took a few days to finish, but the weekend turned out to be a convenient timeframe so when I came in this morning I have close to 300 million rows! I did some basic testing on the integrity of the data and while there a couple hundred duds, it was a smashing success.

I didn't respond right away because I wanted to be able to come back with a five star report and there it is!

Thank you, sir!
sweetness.jpg
0
 

Author Closing Comment

by:brucegust
Comment Utility
Out of the park!
0
 
LVL 108

Expert Comment

by:Ray Paseur
Comment Utility
Cool -- glad it's headed in the right direction!
0

Featured Post

IT, Stop Being Called Into Every Meeting

Highfive is so simple that setting up every meeting room takes just minutes and every employee will be able to start or join a call from any room with ease. Never be called into a meeting just to get it started again. This is how video conferencing should work!

Join & Write a Comment

Deprecated and Headed for the Dustbin By now, you have probably heard that some PHP features, while convenient, can also cause PHP security problems.  This article discusses one of those, called register_globals.  It is a thing you do not want.  …
Author Note: Since this E-E article was originally written, years ago, formal testing has come into common use in the world of PHP.  PHPUnit (http://en.wikipedia.org/wiki/PHPUnit) and similar technologies have enjoyed wide adoption, making it possib…
The viewer will learn how to count occurrences of each item in an array.
The viewer will learn how to create and use a small PHP class to apply a watermark to an image. This video shows the viewer the setup for the PHP watermark as well as important coding language. Continue to Part 2 to learn the core code used in creat…

763 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

11 Experts available now in Live!

Get 1:1 Help Now