doughoman

asked on

Using PHP to import 600,000 records from CSV into MySQL - timing out

I have a 212MB text file containing about 600,000 records that I am trying to import into MySQL using PHP.  There is a lot of manipulation that must be done, so I cannot do a straight import into MySQL.  The import works great and handles about 8,000 records per minute.  My problem is that I am running the PHP script by calling it through a web browser.  My issues are:

1) When it gets to about the 93,000th record, it stops.  I have set set_time_limit() to 0, so I'm not sure why this is happening.

2) How can I monitor the progress of the script?  I am currently sending output to the web browser with the count of the record number, but I'm thinking there has to be a better way.  Is there a way to run/monitor the script locally on the server?  If so, can you pause the execution somehow?

alain34

What you need to do is a script that performs one insert at a time.
The output of the script should be a form carrying the next record to be processed, and this form should be submitted automatically when the page has finished loading, like:

<body onload="document.myForm.submit()">
<form name="myForm" action="myScript.php" method="post">
  <input type="hidden" name="id" value="<?php echo $id; ?>">
  <?php echo "Processing next record: " . $id; ?>
</form>
</body>

This way you only process one record at a time, so you are not limited by any timeout or size restriction!
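To make that concrete, here is a rough sketch of what the processing script itself (myScript.php) might look like under this approach; the file path, the record manipulation and the "done" handling are all placeholders:

<?php
set_time_limit(0);

$id = isset($_POST['id']) ? (int)$_POST['id'] : 0;   // 0-based index of the record to process

$fp = fopen('/home/doug/import.txt', 'r') or die("can't open file");

// Skip the records that earlier requests have already handled
for ($i = 0; $i < $id; $i++) {
    fgetcsv($fp, 1024);
}

$row = fgetcsv($fp, 1024);
fclose($fp);

if ($row === false) {
    die("Done: all records processed.");
}

// ... manipulate $row and INSERT it into MySQL here ...

$next = $id + 1;
?>
<body onload="document.myForm.submit()">
<form name="myForm" action="myScript.php" method="post">
  <input type="hidden" name="id" value="<?php echo $next; ?>">
  Processed record <?php echo $id; ?>, next is <?php echo $next; ?>
</form>
</body>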
doughoman

ASKER

Wouldn't this require the script to open the text file with every call and then read through all the previous records?  I am currently reading the text file like this:
$fp = fopen('/home/doug/import.txt','r') or die("can't open file");
while($csv_line = fgetcsv($fp,1024)) {
....
}
ASKER CERTIFIED SOLUTION
Marcus Bointon

Yes, you will need to read the file every time. However, I would suggest working with a copy of the file and, every time you do an insert, trimming the processed record from that copy. This way the beginning of your file is always the next record to process, so as your program processes more and more records, the working file becomes smaller and smaller.
I also tend to agree with Squinky: one record at a time is maybe overkill, but nothing stops you from using the same logic with groups of 10, 100, or any number of records. You are just storing the key of the next record to process, so you can do 100 inserts, store the key of the last record processed, and have the script start again by calling itself.
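One way to implement the "store the key and continue" idea is to use the byte position in the file (via ftell/fseek) as that key, so each request jumps straight to where the previous one stopped instead of re-reading from the start. A rough sketch, with the batch size, paths and the insert itself as placeholders:

<?php
set_time_limit(0);

$batchSize = 100;   // records per request; tune as needed
$offset = isset($_POST['offset']) ? (int)$_POST['offset'] : 0;

$fp = fopen('/home/doug/import.txt', 'r') or die("can't open file");
fseek($fp, $offset);   // jump to the first unprocessed record

$done = 0;
while ($done < $batchSize && ($csv_line = fgetcsv($fp, 1024)) !== false) {
    // ... manipulate $csv_line and INSERT it into MySQL here ...
    $done++;
}

$offset = ftell($fp);   // byte position of the next unprocessed record
$finished = feof($fp);
fclose($fp);

if ($finished) {
    echo "Import complete.";
    exit;
}
?>
<body onload="document.myForm.submit()">
<form name="myForm" action="myScript.php" method="post">
  <input type="hidden" name="offset" value="<?php echo $offset; ?>">
  Processed up to byte <?php echo $offset; ?>
</form>
</body>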
There is a set_time_limit(int seconds) function in PHP. You can check it out; if you set it to 0, there is no time limit.

You can also check the value of max_execution_time in /etc/php.ini. See http://in2.php.net/function.set-time-limit
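A minimal illustration of checking and lifting the limit from inside the script (this assumes the host lets you override it; see the next comment about safe_mode):

<?php
set_time_limit(0);   // 0 = no execution time limit for this run
echo "max_execution_time is now: " . ini_get('max_execution_time') . "\n";
?>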

A few things:
- set_time_limit won't work if your server is running safe_mode.
- Depending on the server, you won't be able to override the default without modifying php.ini directly (as gtkfreak noted).
- I would consider doing this as a step-by-step process:
1. A script that starts off by taking the input file and splitting it into a bunch of output files in a particular subdir, say 25-50K records per output file for safety (see the sketch after this list). It then launches script 2.
2. A script with a form that shows the last processed step's output (as a POST or GET param) and a GO button to do the next set.  When GO is clicked, the script finds the next 'split' file in the subdir, processes it into the DB, then DELETES it, then posts its state back to itself for the next run.
3. If the script runs out of files in the subdir to process, it prints that it's done and runs a quick SELECT COUNT(*) against the database to prove how many records were inserted (though you could do that on every step when the form reloads...).
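Here is a rough sketch of step 1, the splitting script; the paths, chunk size and file naming are just placeholders, and it assumes one CSV record per line (no embedded newlines in fields):

<?php
set_time_limit(0);

$chunkSize = 40000;   // records per split file
$in = fopen('/home/doug/import.txt', 'r') or die("can't open input");
$n = 0;
$out = null;

while (($line = fgets($in)) !== false) {
    if ($n % $chunkSize == 0) {
        if ($out) {
            fclose($out);
        }
        // The chunks/ subdir must already exist
        $out = fopen(sprintf('/home/doug/chunks/part_%04d.txt', $n / $chunkSize), 'w');
    }
    fwrite($out, $line);
    $n++;
}
if ($out) {
    fclose($out);
}
fclose($in);
echo "Wrote " . ceil($n / $chunkSize) . " chunk files ($n records).\n";
?>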

At 8K records/minute, if you split into 40K-record files, you'd get 'feedback' by completion of a step and the form reloading about 5 minutes after clicking GO.  Click GO 15 times, and you should be all set for 600K records.

-d
So far, I like Squinky's suggestion of running the script from the command line using 'php myscript.php'.  That was the answer I was looking for.  How do I get the program to display output in the command-line window?  My echo and print commands are not doing anything.  I am able to see that the script is updating the database by going into phpMyAdmin and watching the tables grow.

The links provided for controlling execution of the script using shared memory and signals were beyond my abilities.  Is there an easier way to control this, or could a sample be provided showing how I would set either of these up?

Thanks,
Doug
Maybe MySQL is timing out. In /etc/my.cnf you can tweak MySQL's settings; see the MySQL manual for details.

By default, MySQL can only take about 16MB at once. I think it is 16MB; I could be off a bit.

You could also just manually break it up into about 10 different files.


my my.cnf:



[mysqld]
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
set-variable = innodb_buffer_pool_size=500M
set-variable = innodb_additional_mem_pool_size=100M
set-variable = innodb_file_io_threads=4
set-variable = innodb_lock_wait_timeout=50
key_buffer=256M
max_allowed_packet=16M
table_cache=256
sort_buffer_size=4M
read_buffer_size=4M
read_rnd_buffer_size=4M
record_buffer=4M
myisam_sort_buffer_size=128M
thread_cache=128
query_cache_limit=32M
query_cache_type=1
query_cache_size=32M
key_buffer=16M
join_buffer=4M
table_cache=1024
interactive_timeout=100
wait_timeout=100
connect_timeout=10
max_connections=200
max_user_connections=100
max_connect_errors=10
skip-locking
skip-networking

[mysql.server]
user=mysql
basedir=/var/lib

[safe_mysqld]
err-log=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid
open_files_limit=8192

[mysqldump]
quick
max_allowed_packet=16M

[mysql]
no-auto-rehash
# Remove the next comment character if you are not familiar with SQL
#safe-updates

[isamchk]
key_buffer=128M
sort_buffer_size=128M
read_buffer=16M
write_buffer=16M

[myisamchk]
key_buffer=128M
sort_buffer_size=128M
read_buffer=16M
write_buffer=16M

[mysqlhotcopy]
interactive-timeout
For checking status, you can simply write to a file on disk as you progress through the process, or just write a web page that refreshes every few seconds with a query of the number of records in the database.
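For instance, something along these lines inside the main import loop, where $count is the running record counter (the path and interval are arbitrary):

// Inside the import loop: append progress every 1000 records
if ($count % 1000 == 0) {
    $log = fopen('/tmp/import_progress.txt', 'a');
    fwrite($log, "$count records imported\n");
    fclose($log);
}

You can then watch it from another shell with 'tail -f /tmp/import_progress.txt' or point a browser at it.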

echo/print should output to stdout, which should be the console.  Hmmm, I wonder if you have some odd setting in your php.ini.

If you are running under Linux, there should be a man page.  Otherwise, check out:
http://us4.php.net/features.commandline
http://www.asleep.net/hacking/php/intro_to_php_cli.php

-d
Running the script from the command line worked perfectly and did not time out, so that is not an issue anymore.

The only additional thing I would like to do is be able to see the progress and hopefully be able to pause execution.  Is there any kind of debugger program where you can step through each statement of your PHP script?  I am used to programming in VBA and being able to step through each line of code and pause execution.
You can't do that directly in plain PHP, but you can in some PHP IDEs (which also offer breakpoints, call tracing, etc.), such as:

http://zend.com/store/products/zend-studio/
http://www.waterproof.fr/
http://www.nusphere.com/

It's also nice to work with xdebug in regular PHP as it gives much better error messages and call stacks. http://xdebug.org/

I've had trouble in the past with some kind of output buffering for stdout, where nothing appears until the script is finished, but for the most part, if you run from a command line and just use print or echo, it will appear on the console. One thing I find is that some shells effectively hide single lines of output as they get overwritten by the prompt, so it's important to add a line break at the end of anything you print. Try this as a test:

<?php echo "Hello\n"; ?>
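If output still only shows up when the script ends, explicitly flushing after each progress line is worth a try; a trivial sketch:

<?php
for ($i = 1; $i <= 5; $i++) {
    echo "Processed batch $i\n";
    flush();   // push the line out now instead of when the script exits
    sleep(1);
}
?>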
BTW, the MySQL packet size limits don't really have anything to do with the data file size; it's only the size of an individual query that matters. I've had no trouble importing multi-gigabyte SQL dumps through the mysql command-line client. In this case, it sounds like it's a CSV file that's being turned into SQL with PHP and submitted one record at a time.
Actually, I meant to ask a similar question.

Why can't you pre-process the CSV file into the 'format' it needs to be in, and then use any of the mysql tools to do the insert?  I mean, this question has 'evolved' a lot since you originally asked it... ;)
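For what it's worth, a sketch of that approach: one PHP pass that writes a cleaned-up CSV, which MySQL can then bulk-load. The paths, table name and column handling are made up for illustration:

<?php
set_time_limit(0);

$in  = fopen('/home/doug/import.txt', 'r') or die("can't open input");
$out = fopen('/home/doug/clean.csv', 'w') or die("can't open output");

while (($row = fgetcsv($in, 1024)) !== false) {
    // ... do the manipulation here, e.g. reorder or derive fields ...
    fputcsv($out, $row);
}
fclose($in);
fclose($out);

// Then bulk-load the cleaned file, e.g. from the mysql client:
//   LOAD DATA LOCAL INFILE '/home/doug/clean.csv'
//   INTO TABLE mytable
//   FIELDS TERMINATED BY ',' ENCLOSED BY '"'
//   LINES TERMINATED BY '\n';
?>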

Is the objective a tool that can be used again by you, by other people?  Is it a one time tool, but you are trying to use PHP and MySQL to 'learn on the job'?

I mean, it seems like there are a number of fine solutions, but the way you originally asked the question, I wouldn't normally assume someone processing massive datasets could run command-line PHP, or a PHP debugger at that point.

Anyway, sounds like you are close to finished here.

-d