
Update-otherwise-insert (a.k.a. upsert) vs. truncating and reloading data: which operation will be more efficient?

I am writing a nightly process that will query all inventory in our current CRM and dump all the data to a PostgreSQL 9.2 schema. The goal is to create a single system snapshot containing the latest records for reporting every night. The source data for some of the records may change during the course of the day. Since this is a complete data dump, it will be fairly large, and I am deliberating over which approach will be more efficient. Please also note that I am using the Npgsql data provider from Visual Studio 2013.

Do I:

A.) Truncate all tables in the schema and reload every record each night?

B.) Use the technique described at http://www.the-art-of-web.com/sql/upsert/ to perform an upsert? (A sketch of that technique follows below.)

Any other suggestions are also welcome...
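
For reference, the linked technique emulates an upsert with an UPDATE followed by a conditional INSERT, since PostgreSQL 9.2 has no native upsert (INSERT ... ON CONFLICT only arrived in 9.5). A minimal sketch, assuming a hypothetical inventory table keyed on item_id:

    -- Emulated upsert per the linked article; the table and columns are
    -- hypothetical. Not safe against concurrent writers without extra
    -- locking, but fine for a single nightly batch job.
    UPDATE inventory
    SET    qty = 42
    WHERE  item_id = 123;

    INSERT INTO inventory (item_id, qty)
    SELECT 123, 42
    WHERE  NOT EXISTS (SELECT 1 FROM inventory WHERE item_id = 123);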
 
chaauCommented:
Upserting only makes sense when you are adding a set of records to an existing set. Imagine this situation: you have a library of, say, 10,000 books. Another library nearby is closing for renovation and bringing its books to your library. You can't just truncate your tables, because you would already have some of the incoming titles in your library and you would not want to create duplicate entries. For the existing titles you would just update the quantity, while books from the new collection that do not exist in your database would need to be inserted as new records. An upsert works well for this situation.

In your case, however, the new collection is what should ultimately become the new database. Not only would an upsert be a complete waste of time, it could also leave incorrect data behind. Imagine that during the day some rows were deleted from the source. An upsert will not catch this situation: it is designed for insert/update only, so when you run the statement, the rows in the target that no longer exist in the source are not affected and stay as they are. MS SQL Server's MERGE has built-in support for this (you can specify WHEN NOT MATCHED BY SOURCE THEN DELETE), but the technique described in the article you linked does not cover this scenario.
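
For illustration only, that SQL Server statement looks roughly like this (MERGE is not available in PostgreSQL 9.2; the table and column names are made up):

    -- T-SQL, shown for contrast; note the final clause that removes rows
    -- missing from the source, which the plain upsert technique lacks.
    MERGE inventory AS target
    USING source_inventory AS source
       ON target.item_id = source.item_id
    WHEN MATCHED THEN
        UPDATE SET qty = source.qty
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (item_id, qty) VALUES (source.item_id, source.qty)
    WHEN NOT MATCHED BY SOURCE THEN
        DELETE;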

Bottom line: TRUNCATE-then-INSERT is not only faster here, it is also the better fit when the incoming data set is meant to replace the table wholesale.
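
A minimal sketch of that nightly reload, again with a hypothetical table name. TRUNCATE is transactional in PostgreSQL and takes an exclusive lock, so concurrent readers simply wait for the commit and never see a half-loaded table:

    BEGIN;
    TRUNCATE inventory;
    -- reload the full snapshot here, e.g. INSERT ... SELECT or a bulk COPY
    COMMIT;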
 
MRSAuthor Commented:
Excellent example, thank you! I failed to mention that records will never be deleted from the source. Assume we will be inserting 5 million records, but only 2 million of them may have changed in some way. Is it still better to TRUNCATE?
 
chaauCommented:
I think that if the new set represents the whole table, truncating is better. Do not forget that PostgreSQL has the COPY command for bulk inserting, which is very effective. Also, when you say "changed in some way", you will have to detect that change, e.g. a customer's phone number has changed. Your upsert would have to check, for each matching customer ID, whether any of the columns differ, and update only when they do; otherwise it would update all 5 million rows. My point is that an upsert is only more effective if it manages to correctly identify those 2 million changed records and update only them, plus insert any new records. If you have a LastUpdated column and it is reliable, you can use it; otherwise you will have to compare the columns that may have changed, or blindly update every row, which is quite ineffective.
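
To make the change-detection point concrete, here is a sketch of an update that skips unchanged rows, assuming a hypothetical staging table holding the fresh snapshot (IS DISTINCT FROM also compares NULLs sensibly):

    -- Update only the rows whose values actually differ from the snapshot.
    UPDATE inventory AS t
    SET    qty   = s.qty,
           phone = s.phone
    FROM   staging_inventory AS s
    WHERE  t.item_id = s.item_id
    AND    (t.qty, t.phone) IS DISTINCT FROM (s.qty, s.phone);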
 
MRSAuthor Commented:
I understand... that makes sense. Do you happen to know if Npgsql supports the COPY command?
 
chaauCommented:
COPY is an SQL command, not an external utility, so I think it can be used from Npgsql. Please note that when COPY reads from a file, the file must be accessible by the server, not by the client. It is best to put the source file on the server itself or on a network share, and to specify the location as a UNC path.
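
For example (hypothetical table and path): a server-side COPY reads the file with the PostgreSQL server's own permissions, and in 9.2 it also requires superuser rights, so the path must be reachable from the server machine:

    COPY inventory (item_id, qty)
    FROM '\\fileserver\exports\inventory.csv'
    WITH (FORMAT csv, HEADER true);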
 
gheistCommented:
Why not establish PostgreSQL replication, hold it back during working hours, and let it kick in at night?
 
MRSAuthor Commented:
gheist - The source data comes from a non-ODBC data source. Otherwise that's a good suggestion!

chaau - I did get this working with the COPY command. Apparently COPY can read from a direct stream: I was able to query my source data and feed it straight to the COPY command as a stream (see the sketch below).

Thanks for all the good suggestions...
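
For reference, the server-side statement behind that streaming approach is COPY ... FROM STDIN, where the client pushes the rows over the existing connection instead of the server opening a file, so no superuser rights or shared path are needed. Npgsql exposes an API for this (the exact class or method depends on the Npgsql version). A sketch with the same hypothetical table:

    -- The client opens this statement, streams the CSV rows over the
    -- connection, and then ends the stream.
    COPY inventory (item_id, qty) FROM STDIN WITH (FORMAT csv);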
 
gheistCommented:
It has nothing to do with ODBC.