?
Solved

Placing 1.8 GB of data in database without hogging resources

Posted on 2008-01-29
15
Medium Priority
?
290 Views
Last Modified: 2013-12-12
Hello,

The entire dmoz directory in RDF format can be downloaded here:

http://rdf.dmoz.org/
http://rdf.dmoz.org/rdf/content.rdf.u8.gz

I downloaded the entire thing.  It is compressed to about 300mb but when I extract it it is actually about 1.8 gigabytes.

I want a PHP script to place every single URL in  a mySQL database like this:
          $query = "INSERT INTO URL VALUES ('http://www.example.com')";

The problem is that I am worried that this could kill my poor server by hogging all its resources.

I would like your ideas on how this could be done in such a way as to preserve system resources.  I don't care if it takes two weeks to completely place all URLs in the database.

Also, please tell me if you think another language such as PERL or a linux sh script would be a better choice for this than PHP.
0
Comment
Question by:hankknight
  • 10
  • 5
15 Comments
 
LVL 10

Expert Comment

by:adrpo
ID: 20770322

You could use perl.
See the script below.

You call it like this:
gzcat content.rdf.u8.gz | script.pl

This way you don't even need to unpack the file.

Cheers,
za-k/
#!/usr/bin/perl
use DBI;
 
my $conn = DBI->connect("dbi:MySQL:dbname=YOURDB", "password", "");
 
# read one line at a time
while (($URL=chomp($line=<STDIN>)) ne '') 
{
  my $query = $conn->prepare("insert into URL values('" . $URL . "');");
  $query->execute();
  sleep 5; # sleep for 5 seconds
}

Open in new window

0
 
LVL 10

Expert Comment

by:adrpo
ID: 20770655

Ok. I disregarded the RDF format.
What do you want from there?
Here is a small example:
http://rdf.dmoz.org/rdf/content.example.txt
You want all the
<link r:resource="http://www.geocities.com/Hollywood/Boulevard/8928/" />
To be inserted in the DB? Or you want also these links:
<ExternalPage about="http://us.imdb.com/title/tt0118531/">

Let me know so I can change the script.

Cheers,
za-k/
0
 
LVL 16

Author Comment

by:hankknight
ID: 20771226
Thanks, adrpo for your response and your question.

1. I do NOT want "link r:resource" to be included
   <link r:resource="http://www.example.com/" />

2. I ONLY want "ExternalPage about" to be included
   <ExternalPage about="http://www.example.com/">

3. I have a long list of catagories that should be EXCLUDED including:

          Top/Adult/
          Top/Business/Arts_and_Entertainment/Gaming/
          Top/Games/Gambling/
          Top/Society/Sexuality/Activities_and_Practices/Swinging/
          Top/World/

Any URL in one of those catagories should NOT be included in the database.

For example, this website should NOT be included:

          <ExternalPage about="http://www.xxxxxxxx.xxx/">
            <d:Title>XXX XXX</d:Title>
            <d:Description>XXX XXX XXX</d:Description>
            <topic>Top/Adult/Image_Galleries/Fetishes/xxx/</topic>
          </ExternalPage>
0
Keep up with what's happening at Experts Exchange!

Sign up to receive Decoded, a new monthly digest with product updates, feature release info, continuing education opportunities, and more.

 
LVL 10

Expert Comment

by:adrpo
ID: 20772687

Ok, here we go, are you ready? :))
Just copy the snipped or take the attached file and rename
ParseAndInsert-DMOZ.pl.txt
to
ParseAndInsert-DMOZ.pl
Edit you MySQL particulars in the connect statement.
Edit your $sleep to match your compuer power.

Enjoy!

Cheers,
za-k/

Check result.txt to see how the output looks like.

#!/usr/bin/perl
use DBI;
use XML::Parser;
 
# how much I sleep before I attempt a new insert
my $sleep = 5; 
 
# I HAD ONLY PostgreSQL, so you need to test this!
# my $conn = DBI->connect("dbi:Pg:dbname=assets;host=localhost;port=5432", "postgres", "YOUR_PASS");
my $conn = DBI->connect("dbi:mysql:dbname=YOUR_DATABASE;host=localhost", "YOUR_USER", "YOUR_PASS");
 
print "Deleting all that is in the url table\n";
my $queryDel = $conn->prepare("delete from url;");
$queryDel->execute(); 
 
my $parser = new XML::Parser(ErrorContext => 2, Style => 'Stream' );
 
$parser->setHandlers(End => \&handle_end,
		     Start=>\&handle_start,
		     Char=>\&handle_char);
 
$parser->parse(STDIN);
 
my $SavedLink = '';
my $WeAreInTopic = 'YES';
 
 
sub isBanned
{
    my $topic = shift(@_);
    my @banned = ("Top/Adult", "Top/Business/Arts_and_Entertainment/Gaming", "Top/Games/Gambling", "Top/Society/Sexuality/Activities_and_Practices/Swinging", "Top/World");
    foreach $b (@banned)
    {
       if (($b =~ m/$topic/) == 1) # if topic start
       {
	   return 1;
       }  
    }
    return 0;
}
 
sub handle_char
{
    if ($WeAreInTopic eq 'YES')
    {
	my ($p, $topic) = @_;
	chomp($topic); # kill the \n
	if ($SavedLink ne '' && $topic ne '') # only when we already have a link
	{
	    if(isBanned($topic))
	    {
		print "link: $SavedLink --BANNED-- | topic: $topic.\n";
	    }
	    else
	    {
		$strQuery = "insert into url values(\'" . $SavedLink . "\');";
		my $query = $conn->prepare($strQuery);
		$query->execute();
		print "link: $SavedLink -INSERTED- | topic: $topic. Now sleep $sleep seconds\n";
		# sleep for some seconds
		sleep $sleep; 
	    }     
	}
    }  
}# End char_handler
 
 
sub handle_start {
    my( $expat, $element, %attrs ) = @_;
    if ($element eq 'topic')
    {
	$WeAreInTopic = 'YES';
    }
    else
    {
	$WeAreInTopic = 'NO';
    }
    if ($element eq 'ExternalPage')
    {
      # ask the expat object about our position
      my $line = $expat->current_line;
      if( %attrs ) {
          while( my( $key, $value ) = each( %attrs )) 
          {
              if(($key eq 'about') && ($value ne '')) 
	    { 
		# found our ABOUT, and the value is not null 
		# remember it for later because we need to see 
		# what is in the <topic>....</topic>
		$SavedLink = $value;
              }
          }
      }
    }
}
 
 
# process an end-of-element event
#
sub handle_end {
# just ignore the end of an element.
}

Open in new window

ParseAndInsert-DMOZ.pl.txt
result.txt
0
 
LVL 10

Expert Comment

by:adrpo
ID: 20772707

Hi again,

Just small details.
Make the script executable:
$ chmod +x ParseAndInsert-DMOZ.pl

Then after fixing your MySQL stuff and $sleep, just call the script like this:
$ gzcat content.rdf.u8.gz | ./ParseAndInsert-DMOZ.pl

I hope it works. For me it does very well :)
I tested it with this input:
http://rdf.dmoz.org/rdf/content.example.txt

Cheers,
za-k/
0
 
LVL 10

Expert Comment

by:adrpo
ID: 20772743

One more thing.
Just add your banned topics to the "@banned" array in
the "isBanned" subroutine and if the there exists any
@banned[i] that starts with the string contained in the
current topic, then the link will NOT be inserted.

Cheers,
za-k/
0
 
LVL 10

Expert Comment

by:adrpo
ID: 20774075

Small error:
if (($b =~ m/$topic/) == 1) # if topic is part of the banned
should be:
# if topic is part of the banned or banned is part of the topic
if (($b =~ m/$topic/) == 1 || ($topic =~ m/$b/) == 1)

This way:
if $b = "Top/Adult"; and $topic = "Top/Adult/MoreWeirdosHere/Etc"
the inclusion is actually $b is part of the $topic, not the other way around.

Cheers,
za-k/

0
 
LVL 16

Author Comment

by:hankknight
ID: 20774208
Thanks,  adrpo, you rock!

It works perfectly with one exception: the database is filling up with adult content!

Cheers,
Hank
sub isBanned
{
    my $topic = shift(@_);
    my @banned = ("Top/Adult/", "Top/Arts/Bodyart/", "Top/Business/Arts_and_Entertainment/Gaming/", "Top/Religion_and_Spirituality/", "Top/Arts/Performing_Arts/Hypnotism/", "Top/Arts/Performing_Arts/Magic/", "Top/Games/Dice/", "Top/Games/Gambling/", "Top/Games/Roleplaying/", "Top/Recreation/Nudism/", "Top/Recreation/Tobacco/", "Top/Health/Men%27s_Health/Sexuality/", "Top/Shopping/Tobacco/", "Top/Society/Gay,_Lesbian,_and_Bisexual/", "Top/Society/Issues/Abortion/", "Top/Society/Sexuality/", "Top/Society/Transgendered/", "Top/Society/Relationships/Alternative_Lifestyles/", "Top/World/");
    foreach $b (@banned)
    {
       if (($b =~ m/$topic/) == 1) # if topic start
       {
	   return 1;
       }  
    }
    return 0;
}

Open in new window

sample.rdf.txt
0
 
LVL 10

Expert Comment

by:adrpo
ID: 20774254

Hehe, sorry :)
Not intentional. I'll review the script with the sample.rdf.txt you gave me.
I'll post it here asap.

Cheers,
za-k/
0
 
LVL 10

Accepted Solution

by:
adrpo earned 2000 total points
ID: 20774306

I did also the change i talked about:
if (($b =~ m/$topic/) == 1 || ($topic =~ m/$b/) == 1)
This should fix the problem.

All the links are displayed but the adult ones are marked --BANNED--
and the one inserted are marked --INSERTED--

So here it is.

Cheers,
za-k/
#!/usr/bin/perl
use DBI;
use XML::Parser;
 
# how much I sleep before I attempt a new insert
my $sleep = 5; 
 
# I HAD ONLY PostgreSQL, so you need to test this!
#my $conn = DBI->connect("dbi:Pg:dbname=assets;host=localhost;port=5432", "postgres", "PASS");
my $conn = DBI->connect("dbi:mysql:dbname=YOUR_DATABASE;host=localhost", "YOUR_USER", "YOUR_PASS");
 
print "Deleting all that is in the url table\n";
my $queryDel = $conn->prepare("delete from url;");
$queryDel->execute(); 
 
my $parser = new XML::Parser(ErrorContext => 2, Style => 'Stream' );
 
$parser->setHandlers(End => \&handle_end,
		     Start=>\&handle_start,
		     Char=>\&handle_char);
 
$parser->parse(STDIN);
 
my $SavedLink = '';
my $WeAreInTopic = 'YES';
 
 
sub isBanned
{
    my $topic = shift(@_);
    my @banned = ("Top/Adult", "Top/Business/Arts_and_Entertainment/Gaming", "Top/Games/Gambling", "Top/Society/Sexuality/Activities_and_Practices/Swinging", "Top/World");
    foreach $b (@banned)
    {
       # if topic starts with banned or the other way around
       if (($b =~ m/$topic/) == 1 || ($topic =~ m/$b/))
       {
	   return 1;
       }  
    }
    return 0;
}
 
sub handle_char
{
    if ($WeAreInTopic eq 'YES')
    {
	my ($p, $topic) = @_;
	chomp($topic); # kill the \n
	if ($SavedLink ne '' && $topic ne '') # only when we already have a link
	{
	    if(isBanned($topic))
	    {
		print "link: $SavedLink --BANNED-- | topic: $topic.\n";
	    }
	    else
	    {
		$strQuery = "insert into url values(\'" . $SavedLink . "\');";
		my $query = $conn->prepare($strQuery);
		$query->execute();
		print "link: $SavedLink -INSERTED- | topic: $topic. Now sleep $sleep seconds\n";
		# sleep for some seconds
		sleep $sleep; 
	    }     
	}
    }  
}# End char_handler
 
 
sub handle_start {
    my( $expat, $element, %attrs ) = @_;
    if ($element eq 'topic')
    {
	$WeAreInTopic = 'YES';
    }
    else
    {
	$WeAreInTopic = 'NO';
    }
    if ($element eq 'ExternalPage')
    {
      # ask the expat object about our position
      my $line = $expat->current_line;
      if( %attrs ) {
          while( my( $key, $value ) = each( %attrs )) 
          {
              if(($key eq 'about') && ($value ne '')) 
	    { 
		# found our ABOUT, and the value is not null 
		# remember it for later because we need to see 
		# what is in the <topic>....</topic>
		$SavedLink = $value;
              }
          }
      }
    }
}
 
 
# process an end-of-element event
#
sub handle_end {
# just ignore the end of an element.
}

Open in new window

ParseAndInsert-DMOZ.pl.txt
trace-execution.txt
0
 
LVL 10

Expert Comment

by:adrpo
ID: 20774314

Look at the end in trace-execution.txt
In the db are only the Parental Filtered Ones :)

adrpo@Kafka ~/dmoz
$ psql -U postgres -h localhost assets
Password:
Welcome to psql 8.0.7, the PostgreSQL interactive terminal.

Type:  \copyright for distribution terms
       \h for help with SQL commands
       \? for help with psql commands
       \g or terminate with semicolon to execute query
       \q to quit

assets=# select * from url;
                 link                  
---------------------------------------
 http://www.shamrockcottages.co.uk/
 http://www.holidayrentalsireland.com/
 http://www.dreamireland.com/
 http://www.groundwork.ie
 http://homepage.tinet.ie/~knp/duchas/
(5 rows)

assets=#

Cheers,
za-k/
0
 
LVL 16

Author Comment

by:hankknight
ID: 20780517
0
 
LVL 10

Expert Comment

by:adrpo
ID: 20781906

The link you gave is pointing to exactly this question :)

Cheers,
za-k/
0
 
LVL 16

Author Comment

by:hankknight
ID: 20781936
0
 
LVL 16

Author Comment

by:hankknight
ID: 20782030
0

Featured Post

Hire Technology Freelancers with Gigs

Work with freelancers specializing in everything from database administration to programming, who have proven themselves as experts in their field. Hire the best, collaborate easily, pay securely, and get projects done right.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

The title says it all. Writing any type of PHP Application or API code that provides high throughput, while under a heavy load, seems to be an arcane art form (Black Magic). This article aims to provide some general guidelines for producing this typ…
Article by: Tammy
MySQLTuner is a script written in Perl that allows you to review a MySQL installation quickly and make adjustments to increase performance and stability. The current configuration variables and status data is retrieved and presented in a brief forma…
In this video, Percona Solution Engineer Rick Golba discuss how (and why) you implement high availability in a database environment. To discuss how Percona Consulting can help with your design and architecture needs for your database and infrastr…
In this video, Percona Solutions Engineer Barrett Chambers discusses some of the basic syntax differences between MySQL and MongoDB. To learn more check out our webinar on MongoDB administration for MySQL DBA: https://www.percona.com/resources/we…
Suggested Courses

598 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question