finnegandev

asked on

MySQL LOAD DATA INFILE question

We have an extremely large database that another company gives to us. Every day they drop around 20 different CSV files with new records and updates. I'd like to use LOAD DATA INFILE because it's so much faster. Is there a way to have it overwrite the records that already have IDs as well as add the ones that don't?

If I need to do this with PHP, is there a "universal" script I can write that will automate it without having to enter thousands of different fields? I wanted to use a full dump of each table every night, but the company won't play nice, so all I get are the updates to the table.

Enough rambling - what's the best way to get these "updates" into MySQL without writing a script to traverse 200k records every day, one field at a time?
Julian Hansen

I would do it like this

Use the LOAD DATA INFILE to load the data into a temporary table.

Then run an INSERT with ON DUPLICATE KEY UPDATE to update the records that exist, or insert them if they don't - something like this:
DROP TABLE IF EXISTS `_temp_import`;

-- CREATE THE TEMP TABLE
CREATE TABLE `_temp_import` (
   `client_id` int NOT NULL,
   `firstname` varchar(100),
   `surname` varchar(100),
   PRIMARY KEY (`client_id`)
);

-- LOAD THE DATA
LOAD DATA LOCAL INFILE 'path_to_file/input_file.csv'
INTO TABLE `_temp_import`
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES
(client_id, firstname, surname);

-- MERGE INTO THE REAL TABLE: UPDATE EXISTING IDS, INSERT NEW ONES
INSERT INTO `ACTUALTABLE` (client_id, firstname, surname)
SELECT client_id, firstname, surname FROM `_temp_import`
ON DUPLICATE KEY UPDATE
  firstname=VALUES(firstname),
  surname=VALUES(surname);

DROP TABLE `_temp_import`;


Use "REPLACE INTO":
LOAD DATA LOCAL INFILE 'your_path/your_file.csv' REPLACE INTO TABLE your_table ......... blah blah.
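
For illustration, a fuller sketch of that one-step form, reusing the placeholder file path and column names from the script above (not the asker's real schema):

-- One-step load: rows whose key already exists are deleted and re-inserted
LOAD DATA LOCAL INFILE 'path_to_file/input_file.csv'
REPLACE INTO TABLE `ACTUALTABLE`
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES
(client_id, firstname, surname);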
Why you should not use REPLACE INTO: it is effectively a delete and insert. ON DUPLICATE KEY UPDATE lets you do the same thing without the overhead (and side effects) of a delete and insert. For example, if you have an autonumber (AUTO_INCREMENT) column on your table, a REPLACE INTO will cause those values to change, because each matching record is removed and re-inserted - this may not be desirable. There is a small example after the quote below.

From the docs

REPLACE is a MySQL extension to the SQL standard. It either inserts, or deletes and inserts. For another MySQL extension to standard SQL—that either inserts or updates—see Section 13.2.5.3, “INSERT ... ON DUPLICATE KEY UPDATE Syntax”.
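
To make that side effect concrete, here is a small hypothetical example (the table and data are made up, not taken from the question): with an AUTO_INCREMENT surrogate key plus a unique business key, REPLACE hands the row a new id, while ON DUPLICATE KEY UPDATE leaves it alone.

-- Hypothetical demo table: AUTO_INCREMENT surrogate key plus a unique business key
CREATE TABLE `clients_demo` (
  `id`        int NOT NULL AUTO_INCREMENT,
  `client_id` int NOT NULL,
  `firstname` varchar(100),
  PRIMARY KEY (`id`),
  UNIQUE KEY (`client_id`)
);

INSERT INTO `clients_demo` (client_id, firstname) VALUES (1001, 'Anne');  -- row gets id = 1

-- REPLACE deletes the conflicting row and inserts a fresh one, so the row gets a new id (2)
REPLACE INTO `clients_demo` (client_id, firstname) VALUES (1001, 'Anna');

-- ON DUPLICATE KEY UPDATE modifies the existing row in place, so its id is left alone
INSERT INTO `clients_demo` (client_id, firstname) VALUES (1001, 'Annie')
ON DUPLICATE KEY UPDATE firstname = VALUES(firstname);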


You're 100% correct, Julian.
But, well, the author had these 2 phrases in his question:
- have it overwrite the records that already have ID's (very likely there's no autonumber)
- without having to enter thousands of different fields (yours requires entering those fields in the update part)

Some experts are more like educators and want to guide people into becoming better programmers. I respect them very much, but I, having been in technical support for too many years, tend to solve problems directly, one by one.
ASKER CERTIFIED SOLUTION
Julian Hansen

finnegandev

ASKER

Would REPLACE INTO delete all the data that's not duplicated?
Replace into does a delete and a re-insert of the data so it will remove a record if it exists and then add the record back again from the source.
But will it write the new data to the table also?  

Would there be a problem if I did a regular load data infile and then did the replace after?  I would assume that the regular load data would error on the duplicate key?
Yes - however, I would recommend the process described above. The script posted is fully functional and generic - you provide it the table names and import files and it should do the rest.
The only problem with that is that I'm writing hundreds of thousands of records (900k+) into several of the tables, and I'm worried about the script execution time and bogging down the server while the script runs. Some of the CSV files I'm reading from are 30+ MB. I see it's using LOAD DATA INFILE, but how is MySQL at working between tables with large data sources?
This is what I currently have, but if everyone's in agreement that the other method isn't too taxing then I'd be happy to use that way.

$mysqli = mysqli_init();
// Enable LOCAL INFILE support before connecting
mysqli_options($mysqli, MYSQLI_OPT_LOCAL_INFILE, true);
mysqli_real_connect($mysqli, $host, $user, $password, $database);

// $thedate is the date-named folder the day's CSV files are dropped into
$query = SetupReplace($thedate, "SO.csv", "bt_so");
$result = $mysqli->query($query);
if (!$result) {
    echo "Failed to load " . $mysqli->error . "<br/>";
}

// Build a LOAD DATA ... REPLACE statement for the given folder, file and table
function SetupReplace($folder, $file, $table)
{
    $myquery = <<< QUERY
    LOAD DATA LOCAL INFILE '/var/csv/extracted/{$folder}/{$file}' REPLACE INTO TABLE {$table} FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\\r\\n' IGNORE 1 LINES
QUERY;

    return $myquery;
}



On a side note, that script posted is beautiful.
"bogging down the server while the script runs"
Irrespective of which method you use, you will have to use server processing time, so I'm not sure why this is any different.
LOAD DATA INFILE is the fastest way to bring the data in - however, you are going to have to experiment with the process to find what works best for you.
Given the size of your data, temporary tables might not work, as there is a memory limit imposed on them. You might need to use a standard table to import to first and then import from that to the real table - a sketch of that variation follows below.
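
A minimal sketch of that variation, assuming the same placeholder file path and client_id/firstname/surname columns used earlier (not the asker's real schema): create a regular, disk-based staging table from the target's definition, load into it, merge, then drop it.

-- Hypothetical staging run using a regular table instead of a TEMPORARY one
DROP TABLE IF EXISTS `_staging_import`;
CREATE TABLE `_staging_import` LIKE `ACTUALTABLE`;  -- same columns and keys as the target

LOAD DATA LOCAL INFILE 'path_to_file/input_file.csv'
INTO TABLE `_staging_import`
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES
(client_id, firstname, surname);

-- Merge into the real table: update existing IDs, insert new ones
INSERT INTO `ACTUALTABLE` (client_id, firstname, surname)
SELECT client_id, firstname, surname FROM `_staging_import`
ON DUPLICATE KEY UPDATE
  firstname = VALUES(firstname),
  surname = VALUES(surname);

DROP TABLE `_staging_import`;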
Thanks for all the help.
You are welcome.