Go Premium for a chance to win a PS4. Enter to Win

x
?
Solved

How can I grease these parsing wheels?

Posted on 2014-10-01
10
Medium Priority
?
104 Views
Last Modified: 2014-10-06
Here are my table structures:

zipped files:

CREATE TABLE `zipped_files` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `file_name` varchar(150) NOT NULL,
  `completed` tinyint(11) NOT NULL,
  `page_number` int(11) NOT NULL,
  `start_time` datetime NOT NULL,
  `end_time` datetime NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=474 DEFAULT CHARSET=latin1

Verizon:

CREATE TABLE `verizon` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `twitter_id` varchar(150) NOT NULL,
  `actor_id` varchar(150) NOT NULL,
  `actor_display_name` varchar(150) NOT NULL,
  `posted_time` varchar(150) NOT NULL,
  `display_name` varchar(150) NOT NULL,
  `geo_coords_0` varchar(150) NOT NULL,
  `geo_coords_1` varchar(150) NOT NULL,
  `location_name` varchar(150) NOT NULL,
  `posted_day` date NOT NULL,
  PRIMARY KEY (`id`),
  KEY `id_day` (`posted_day`),
  KEY `id_geo_coords_0` (`geo_coords_0`),
  KEY `posted_day` (`posted_day`),
  KEY `location_name` (`location_name`),
  KEY `geo_coords_1` (`geo_coords_1`),
  KEY `geo_coords_0` (`geo_coords_0`)
) ENGINE=InnoDB AUTO_INCREMENT=28665 DEFAULT CHARSET=latin1

BTW: I'm looking at the last create table statement and I see the indexes I attempted to place on geo_coords_0, geo_coords_1 and posted day. It looks like I've got a "key" on every column, which wasn't my intention. We'll look at that later, but that's my table.

clean_up.php is attached...

The process, as it exists now is this: index.php  - lists all of the gz files currently in the directory) | decompress.php - uses a foreach statement that goes through each file and first looks to see if it's been completed by referring to the "zipped_files" table. If not, it proceeds to decompress it. After that process is completed, there's a redirect with an id embedded in the url that takes the process to.... | parse.php - here the decompressed file is parsed into a database. It's here that "clean_up.php" is used to make sure things don't start to cough and sputter if there aren't any values and once this process is completed...| breather.php - nothing more than a page designed to give the system a "breather" before being redirected to "decompress.php" where the process starts all over again.

I've got all the files attached along with a  sample gz file.

The code works, but it's slow. I'm looking to see where there's room for improvement so I can expedite the progress of decompressing and parsing 365 files, each one being 2-3 MG
clean-up.php
index.php
decompress.php
parse.php
0
Comment
Question by:brucegust
  • 5
  • 5
10 Comments
 

Author Comment

by:brucegust
ID: 40354921
0
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 40355109
OK, I can read the GZ file, so that is a good start.   159MB more or less.  That seems to be a far cry from "2-3 MG."  There are 450,981 lines in the GZ file.  Does that sound right to you?

Please look this over carefully and let me know if it makes sense for a table structure.  I don't know what you want to do with the twitter id, so please tell me if that is supposed to be part of the retained data.  Also, we don't keep the day and time in separate columns - we keep only a single DATETIME type column.  The geocode info is typically kept as a Lat,Lon signed DECIMAL(16,8), and that part's fine.

Please look closely at line 120+ and tell me if that will capture the data you want to retain from each JSON object.

$jso = <<<EOJ
{
  "id": "tag:search.twitter.com,2005:389903668427763712",
  "objectType": "activity",
  "actor": {
    "objectType": "person",
    "id": "id:twitter.com:91239297",
    "link": "http://www.twitter.com/OGkush103",
    "displayName": "WalkingLick74",
    "postedTime": "2009-11-20T01:21:39.000Z",
    "image": "https://si0.twimg.com/profile_images/378800000593715086/755411d8bdc495472c2d7ed50e319582_normal.jpeg",
    "summary": "Self-Made, Self Paid..... I always had the mind to get it like a man, head first bout my younging Ean! #YOLO",
    "links": [
      {
        "href": null,
        "rel": "me"
      }
    ],
    "friendsCount": 468,
    "followersCount": 677,
    "listedCount": 0,
    "statusesCount": 25504,
    "twitterTimeZone": "Alaska",
    "verified": false,
    "utcOffset": "-28800",
    "preferredUsername": "OGkush103",
    "languages": [
      "en"
    ],
    "location": {
      "objectType": "place",
      "displayName": "Boston George Crib"
    },
    "favoritesCount": 26
  },
  "verb": "post",
  "postedTime": "2013-10-15T00:00:53.000Z",
  "generator": {
    "displayName": "Twitter for iPhone",
    "link": "http://twitter.com/download/iphone"
  },
  "provider": {
    "objectType": "service",
    "displayName": "Twitter",
    "link": "http://www.twitter.com"
  },
  "link": "http://twitter.com/OGkush103/statuses/389903668427763712",
  "body": "You a killer you on twitter, You'n do NO talking",
  "object": {
    "objectType": "note",
    "id": "object:search.twitter.com,2005:389903668427763712",
    "summary": "You a killer you on twitter, You'n do NO talking",
    "link": "http://twitter.com/OGkush103/statuses/389903668427763712",
    "postedTime": "2013-10-15T00:00:53.000Z"
  },
  "favoritesCount": 0,
  "location": {
    "objectType": "place",
    "displayName": "Mississippi, US",
    "name": "Mississippi",
    "country_code": "United States",
    "twitter_country_code": "US",
    "link": "https://api.twitter.com/1.1/geo/id/43d2418301bf1a49.json",
    "geo": {
      "type": "Polygon",
      "coordinates": [
        [
          [
            -91.65500899999999,
            30.146096
          ],
          [
            -91.65500899999999,
            34.996099
          ],
          [
            -88.097888,
            34.996099
          ],
          [
            -88.097888,
            30.146096
          ]
        ]
      ]
    }
  },
  "geo": {
    "type": "Point",
    "coordinates": [
      31.99686058,
      -88.72688823
    ]
  },
  "twitter_entities": {
    "hashtags": [],
    "symbols": [],
    "urls": [],
    "user_mentions": []
  },
  "twitter_filter_level": "medium",
  "twitter_lang": "en",
  "retweetCount": 0,
  "gnip": {
    "matching_rules": [
      {
        "tag": null
      }
    ],
    "language": {
      "value": "en"
    }
  }
}

$ins 
= 
"
INSERT INTO verizon 
( actor_id              /* JSON: actor->id (data past colon)   "91239297"
, actor_display_name    /* JSON: actor->displayName            "WalkingLick74"
, posted_time           /* JSON: postedTime:                   "2013-10-15T00:00:53.000Z",
, display_name          /* JSON: generator->displayName:       "Twitter for iPhone"
, geo_coords_lat        /* JSON: geo->coordinates[0]:          31.99686058
, geo_coords_lon        /* JSON: geo->coordinates[1]:          -88.72688823
, location_name         /* JSON: location->name:               "Mississippi"
) VALUES 
( '$actor_id'
, '$actor_display_name'
, '$posted_time'
, '$display_name'
, '$geo_coords_0'
, '$geo_coords_1'
, '$location_name'
)
"
;

Open in new window

0
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 40355170
Update: Nevermind - I get the Twitter Id thing now.
0
Concerto Cloud for Software Providers & ISVs

Can Concerto Cloud Services help you focus on evolving your application offerings, while delivering the best cloud experience to your customers? From DevOps to revenue models and customer support, the answer is yes!

Learn how Concerto can help you.

 

Author Comment

by:brucegust
ID: 40355194
That's it!

The only thing I need to add is the twitter id, so I'm thinking 120+ looks like this:

"
INSERT INTO verizon 
( twitter_id             /* JSON: "id: tag:search.twitter.com,2005:389903668427763712",
, actor_id              /* JSON: actor->id (data past colon)   "91239297"
, actor_display_name    /* JSON: actor->displayName            "WalkingLick74"
, posted_time           /* JSON: postedTime:                   "2013-10-15T00:00:53.000Z",
, display_name          /* JSON: generator->displayName:       "Twitter for iPhone"
, geo_coords_lat        /* JSON: geo->coordinates[0]:          31.99686058
, geo_coords_lon        /* JSON: geo->coordinates[1]:          -88.72688823
, location_name         /* JSON: location->name:               "Mississippi"
) VALUES 
( '$actor_id'
, '$actor_display_name'
, '$posted_time'
, '$display_name'
, '$geo_coords_0'
, '$geo_coords_1'
, '$location_name'
)
"

Open in new window


As an aside, in an effort to streamline things so I could catch a clear vision of what the bottom line was, I manually decompressed a gz file and ran it through the following:

 
<?php
include("carter.inc");
$cxn = mysqli_connect($host,$user,$password,$database)
or die ("couldn't connect to server");
$chunk_size=4096;
$url="E:/verizon/page_19/00_8ptcd6jgjn201308010000_day.json";
//$url .=$the_new_file;
$handle=@fopen($url,'r');
	if(!$handle) 
	{
		echo "failed to open JSON file";
	}
while (!feof($handle)) 
{
$buffer = fgets($handle, $chunk_size);
	if(trim($buffer)!=='')
	{
	$obj=json_decode(($buffer), true);
	
	include('clean_up.php');	
	
	$check="select twitter_id from verizon where twitter_id='$twitter_id'";
	$check_query=mysqli_query($cxn, $check)
	or die("Check didn't happen.");
	$check_count=mysqli_num_rows($check_query);
		if($check_count==0)
		{
		$insert = "insert into verizon (twitter_id, actor_id, actor_display_name, posted_time, display_name, geo_coords_0, geo_coords_1, location_name, posted_day) 
		values ('$twitter_id', '$actor_id', '$actor_display_name', '$posted_time', '$display_name', '$geo_coords_0', '$geo_coords_1', '$location_name', '$posted_day')";
		$insertexe = mysqli_query($cxn, $insert);
			if(!$insertexe) {
			$error = mysqli_errno($cxn).': '.mysqli_error($cxn);
			die($error);
			}
			//echo $row_count.' | '. $obj['actor']['id'].' | '.$obj['actor']['displayName'].' | '.$obj['postedTime'].' | '.$obj['generator']['displayName'].' | '.$obj['geo']['coordinates']['0'].' | '.$obj['geo']['coordinates']['1'].' | '.$obj['location']['name'].' '.$trigger.'<br>';
		}
		else
		{
		continue;
		}
	}
}
fclose($handle);
echo "done"!;
//end parsing
?>

Open in new window


It's just the bare bones elements of the "parsing.php" page I had before without the decompression piece. The JSON file is a 1 GB (1,379,179 KB [sorry about the misquote earlier]). Maybe an hour should be expected, but that's how long this thing has been churning on that one file.

Thanks!
0
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 40355213
Please clarify on this one:

( twitter_id             /* JSON: "id: tag:search.twitter.com,2005:389903668427763712",

I understand that you want to keep the JSON "id" valie.  Do you want to keep the entire string or just this quoted part, after the colon?

( twitter_id             /* JSON: id: "389903668427763712",
0
 

Author Comment

by:brucegust
ID: 40355225
Just the quoted part.
0
 
LVL 111

Accepted Solution

by:
Ray Paseur earned 2000 total points
ID: 40355724
Bruce: I attached a timer to this script so we can see where it's spending its time.  Overall it looks like the process runs about 4~5 minutes +/- a bit, and it loads 450K records, suggesting a rate of about 1,500 INSERT queries per second.  If you have 365 such files, you're looking at loading 164MM rows over about a 30-hour period.

Opening the GZ file takes 48 seconds.  I assume that's because the open process has to read all the way to the end to get the counts and sizes.  After the load, retrieving all the records took about 3 seconds.

The process could be made more efficient by ganging up several INSERT rows into each query.  I expect that would speed things up a lot.

I did not add indexes at the moment of INSERT.  It's faster to INSERT without indexes, and you can always add them later.

Sample output:

450981 RECORDS READ
1349 BLANK RECORDS SKIPPED
0 BOGUS RECORDS SKIPPED
449632 VALID RECORDS LOADED
gzopen STOPPED  48,926.497 ms
load STOPPED  267,547.669 ms
retrieve RUNNING  2,838.136 ms

<?php // demo/temp_brucegust.php
error_reporting(E_ALL);
echo '<meta charset="utf8" />';
echo '<pre>';

// DATABASE CONNECTION AND SELECTION VARIABLES - GET THESE FROM YOUR HOSTING COMPANY
$db_host = "localhost"; // PROBABLY THIS IS OK
$db_name = "??";
$db_user = "??";
$db_word = "??";

// OPEN A CONNECTION TO THE DATA BASE SERVER AND SELECT THE DB
$mysqli = new mysqli($db_host, $db_user, $db_word, $db_name);

// DID THE CONNECT/SELECT WORK OR FAIL?
if ($mysqli->connect_errno)
{
    $err
    = "CONNECT FAIL: "
    . $mysqli->connect_errno
    . ' '
    . $mysqli->connect_error
    ;
    trigger_error($err, E_USER_ERROR);
}

// ACTIVATE THIS TO SHOW WHAT THE DB CONNECTION OBJECT LOOKS LIKE
// var_dump($mysqli);

$sql
=
"
CREATE TEMPORARY TABLE `verizon`
( `id`                 INT           NOT NULL AUTO_INCREMENT PRIMARY KEY
, `twitter_id`         VARCHAR(150)  NOT NULL DEFAULT ''
, `actor_id`           VARCHAR(150)  NOT NULL DEFAULT ''
, `actor_display_name` VARCHAR(150)  NOT NULL DEFAULT ''
, `posted_time`        DATETIME      NOT NULL DEFAULT '0000-00-00 00:00:00'
, `display_name`       VARCHAR(150)  NOT NULL DEFAULT ''
, `geo_coords_lat`     DECIMAL(16,9) NOT NULL DEFAULT 0.0
, `geo_coords_lon`     DECIMAL(16,9) NOT NULL DEFAULT 0.0
, `location_name`      VARCHAR(150)  NOT NULL DEFAULT ''
)
ENGINE=InnoDB
DEFAULT CHARSET=utf8mb4
"
;

// IF mysqli::query() RETURNS FALSE, LOG AND SHOW THE ERROR
if (!$res = $mysqli->query($sql))
{
    $err
    = 'QUERY FAILURE:'
    . ' ERRNO: '
    . $mysqli->errno
    . ' ERROR: '
    . $mysqli->error
    . ' QUERY: '
    . $sql
    ;
    trigger_error($err, E_USER_ERROR);
}

// LOAD THE DATA INTO THE TABLE
$tmr = new StopWatch;
$tmr->start('gzopen');
$url = 'http://www.brucegust.com/downloads/00_8ptcd6jgjn201311010000_day.json.gz';
$gzf = gzopen($url, 'r');
$tmr->stop('gzopen');

// COUNT THE LINES, SKIPPED, DUDS, AND LOADED
$cnt = $skp = $dud = $ldd = 0;

$tmr->start('load');
while (!gzeof($gzf))
{
    set_time_limit(2);
    $cnt++;
    $doc = gzgets($gzf, 12000);
    $doc = trim($doc);
    if (empty($doc))
    {
        $skp++;
        echo ' ';
        continue;
    }
    $sql = makeQuery($doc);
    if ($sql)
    {
        if (!$res = $mysqli->query($sql))
    	{
    	    $err
    	    = 'QUERY FAILURE:'
    	    . ' ERRNO: '
    	    . $mysqli->errno
    	    . ' ERROR: '
    	    . $mysqli->error
    	    . PHP_EOL
    	    . 'QUERY: '
    	    . $sql
    	    . PHP_EOL
    	    . 'LINE: '
    	    . $cnt
    	    ;
    	    trigger_error($err, E_USER_WARNING);
    	}
    	else
    	{
    	    $ldd++;
    	}
	}
	else
	{
	    echo "LINE $cnt: DUD: $doc";
	    $dud++;
	}
}
$tmr->stop('load');

// RETRIEVE THE ROWS TO TIME THE RETRIEVAL
$tmr->start('retrieve');
$sql = "SELECT * FROM verizon";
$res = $mysqli->query($sql);
while ($row = $res->fetch_object()) { }

echo PHP_EOL . "$cnt RECORDS READ";
echo PHP_EOL . "$skp BLANK RECORDS SKIPPED";
echo PHP_EOL . "$dud BOGUS RECORDS SKIPPED";
echo PHP_EOL . "$ldd VALID RECORDS LOADED";

echo PHP_EOL . $tmr->readout();
$tmr->reset();
unset($tmr);

function makeQuery($doc)
{
    global $mysqli;
    $jso = json_decode($doc);
    if (!$jso) return FALSE;

    $twitter_id          = !empty($jso->id)        ? explode(':',$jso->id)        : array(NULL, NULL, NULL);
    $twitter_id          = $mysqli->real_escape_string($twitter_id[2]);

    $actor_id            = !empty($jso->actor->id) ? explode(':',$jso->actor->id) : array(NULL, NULL, NULL);
    $actor_id            = $mysqli->real_escape_string($actor_id[2]);

    $actor_display_name  = !empty($jso->actor->displayName)     ? $mysqli->real_escape_string($jso->actor->displayName)     : NULL;
    $posted_time         = !empty($jso->postedTime)             ? $mysqli->real_escape_string($jso->postedTime)             : NULL;
    $display_name        = !empty($jso->generator->displayName) ? $mysqli->real_escape_string($jso->generator->displayName) : NULL;
    $geo_coords_lat      = !empty($jso->geo->coordinates[0])    ? $mysqli->real_escape_string($jso->geo->coordinates[0])    : NULL;
    $geo_coords_lon      = !empty($jso->geo->coordinates[1])    ? $mysqli->real_escape_string($jso->geo->coordinates[1])    : NULL;
    $location_name       = !empty($jso->location->name)         ? $mysqli->real_escape_string($jso->location->name)         : NULL;

    $qry
    =
	"
	INSERT INTO verizon
	( twitter_id
	, actor_id
	, actor_display_name
	, posted_time
	, display_name
	, geo_coords_lat
	, geo_coords_lon
	, location_name
	) VALUES
	( '$twitter_id'
	, '$actor_id'
	, '$actor_display_name'
	, '$posted_time'
	, '$display_name'
	, '$geo_coords_lat'
	, '$geo_coords_lon'
	, '$location_name'
	)
	"
	;
    return $qry;
}

// A SCRIPT TIMER FOR ALL OR PART OF A SCRIPT PHP 5+
// MAN PAGE http://php.net/manual/en/function.microtime.php
class StopWatch
{
    protected $a; // START TIME
    protected $s; // STATUS - IF RUNNING
    protected $z; // STOP TIME

    public function __construct()
    {
        $this->a = array();
        $this->s = array();
        $this->z = array();
    }

    // A METHOD TO PROVIDE A FINAL READOUT, IF NEEDED
    public function __destruct()
    {
        $ret = $this->readout();
        if (!$ret) return FALSE;
        echo
          __CLASS__
        . '::'
        . __FUNCTION__
        . '() '
        ;
        echo "<b>$ret</b>";
        echo PHP_EOL;
    }

    // A METHOD TO REMOVE A TIMER
    public function reset($name='TIMER')
    {
        // RESET ALL TIMERS
        if ($name == 'TIMER')
        {
            $this->__construct();
        }
        else
        {
            unset($this->a[$name]);
            unset($this->s[$name]);
            unset($this->z[$name]);
        }
    }

    // A METHOD TO CAPTURE THE START TIME
    public function start($name='TIMER')
    {
        $this->a[$name] = microtime(TRUE);
        $this->z[$name] = $this->a[$name];
        $this->s[$name] = 'RUNNING';
    }

    // A METHOD TO CAPTURE THE END TIME
    public function stop($name='TIMER')
    {
        $ret = NULL;

        // STOP ALL THE TIMERS
        if ($name == 'TIMER')
        {
            foreach ($this->a as $name => $start_time)
            {
                // IF THIS TIMER IS STILL RUNNING, STOP IT
                if ($this->s[$name])
                {
                    $this->s[$name] = FALSE;
                    $this->z[$name] = microtime(TRUE);
                }
            }
        }

        // STOP ONLY ONE OF THE TIMERS
        else
        {
            if ($this->s[$name])
            {
                $this->s[$name] = FALSE;
                $this->z[$name] = microtime(TRUE);
            }
            else
            {
                $ret .= "ERROR: CALL TO STOP() METHOD: '$name' IS NOT RUNNING";
            }
        }

        // RETURN AN ERROR MESSAGE, IF ANY
        return $ret;
    }

    // A METHOD TO READ OUT THE TIMER(S)
    public function readout($name='TIMER', $dec=3, $m=1000, $t = 'ms', $eol=PHP_EOL)
    {
        $str = NULL;

        // GET READOUTS FOR ALL THE TIMERS
        if ($name == 'TIMER')
        {
            foreach ($this->a as $name => $start_time)
            {
                $str .= $name;

                // IF THIS TIMER IS STILL RUNNING UPDATE THE END TIME
                if ($this->s[$name])
                {
                    $this->z[$name] = microtime(TRUE);
                    $str .= " RUNNING ";
                }
                else
                {
                    $str .= " STOPPED ";
                }

                // RETURN A DISPLAY STRING
                $lapse_time = $this->z[$name] - $start_time;
                $lapse_msec = $lapse_time * $m;
                $lapse_echo = number_format($lapse_msec, $dec);
                $str .= " $lapse_echo $t";
                $str .= $eol;
            }
            return $str;
        }

        // GET A READOUT FOR ONLY ONE TIMER
        else
        {
            $str .= $name;

            // IF THIS TIME IS STILL RUNNING, UPDATE THE END TIME
            if ($this->s[$name])
            {
                $this->z[$name] = microtime(TRUE);
                $str .= " RUNNING ";
            }
            else
            {
                $str .= " STOPPED ";
            }

            // RETURN A DISPLAY STRING
            $lapse_time = $this->z[$name] - $this->a[$name];
            $lapse_msec = $lapse_time * $m;
            $lapse_echo = number_format($lapse_msec, $dec);
            $str .= " $lapse_echo $t";
            $str .= $eol;
            return $str;
        }
    }
}

Open in new window

0
 

Author Comment

by:brucegust
ID: 40363767
Ray, I wanted you to see the attached screen shot.

What you're looking is a portion of the 365 files that were processed using your suggestions. Not only did I use the solution you graciously provided, but I also borrowed from some of your previous thoughts and added the "management" dynamic so you could see what was happening as things were being processed.

To keep things from timing out, I did a redirect to a different page where a JavaScript had it wait 10 seconds before it jumped in again.

It took a few days to finish, but the weekend turned out to be a convenient timeframe so when I came in this morning I have close to 300 million rows! I did some basic testing on the integrity of the data and while there a couple hundred duds, it was a smashing success.

I didn't respond right away because I wanted to be able to come back with a five star report and there it is!

Thank you, sir!
sweetness.jpg
0
 

Author Closing Comment

by:brucegust
ID: 40363773
Out of the park!
0
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 40363947
Cool -- glad it's headed in the right direction!
0

Featured Post

Important Lessons on Recovering from Petya

In their most recent webinar, Skyport Systems explores ways to isolate and protect critical databases to keep the core of your company safe from harm.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Part of the Global Positioning System A geocode (https://developers.google.com/maps/documentation/geocoding/) is the major subset of a GPS coordinate (http://en.wikipedia.org/wiki/Global_Positioning_System), the other parts being the altitude and t…
Introduction This article is intended for those who are new to PHP error handling (https://www.experts-exchange.com/articles/11769/And-by-the-way-I-am-New-to-PHP.html).  It addresses one of the most common problems that plague beginning PHP develop…
Learn how to match and substitute tagged data using PHP regular expressions. Demonstrated on Windows 7, but also applies to other operating systems. Demonstrated technique applies to PHP (all versions) and Firefox, but very similar techniques will w…
The viewer will learn how to count occurrences of each item in an array.
Suggested Courses

886 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question