satmisha (India) asked:

What is the fastest bulk insert technology for a live database?

Hi Experts,

What is the fastest way to bulk insert into a live DB?

Currently I am using SQL Server 2008 R2 and I am not satisfied with the performance, which is why I am looking beyond it for any technology/mechanism to speed up bulk inserts into the live DB.

We need to insert/update 40 million records in a live DB that contains a clustered index and non-clustered indexes. Since it is a live DB we do not have an outage window as such, but we do have a soft window when load is lower. What direction/technology should I look into, taking this requirement into account?

I understand that I am asking quite an open-ended question with, I guess, very minimal information, but I would appreciate any help you can give.
SOLUTION by Guy Hengel [angelIII / a3] (Luxembourg): content available to Experts Exchange members only.
satmisha (ASKER):
Thanks, Guy Hengel, for your prompt reply.

I was expecting some technology beyond MS SQL Server that provides fast bulk insert/update, such as parallel execution of multiple activity types. I appreciate your table partitioning suggestion, which I am looking into.

In the meantime, is there anything beyond SQL that could give an exponential gain in performance?
Parallel query execution is always tricky, as it requires concurrent access...
External tools will just help with loading the data; they will not directly improve performance.
Thanks, Guy Hengel. Is there anything beyond MS SQL that could help with faster updates/inserts in a live database?
The only way to get data into MSSQL tables is via INSERT/UPDATE/MERGE and, of course, BULK INSERT.
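
As a rough illustration of the plain BULK INSERT route (the file path, delimiters and the staging table name below are placeholders, not taken from the actual system):

BULK INSERT dbo.StagingTable
FROM 'D:\feeds\vendor_file.csv'
WITH (
    FIELDTERMINATOR = ',',      -- column delimiter in the flat file
    ROWTERMINATOR   = '\n',     -- row delimiter
    BATCHSIZE       = 50000,    -- commit in batches instead of one huge transaction
    TABLOCK                     -- table lock; can enable minimal logging under the right recovery model
);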
Partitioning is a good idea. If you have that much live data it is actually a must-have; why aren't you partitioning those rows already?
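
A sketch of what partitioning could look like here (the month boundaries, the LoadDate column and the table names are assumptions for illustration only):

-- partition the data by month on a hypothetical LoadDate column
CREATE PARTITION FUNCTION pfByMonth (DATE)
AS RANGE RIGHT FOR VALUES ('2016-01-01', '2016-02-01', '2016-03-01');

CREATE PARTITION SCHEME psByMonth
AS PARTITION pfByMonth ALL TO ([PRIMARY]);

-- the live table and its indexes would then be created ON psByMonth(LoadDate);
-- a fully loaded staging table with identical structure, on the same filegroup,
-- with a check constraint matching the partition range, can then be switched
-- into an empty partition almost instantly:
ALTER TABLE dbo.StagingTable SWITCH TO dbo.LiveTable PARTITION 2;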

To me this also sounds like you don't yet have a good understanding of the whole MSSQL Server "universe" and the possibilities of SAN storage, Availability Groups, and replication.

What data do you actually stage? Is this affiliate data?

Bye, Olaf.
Thanks, Olaf. I do have affiliate data that we stage on a staging server and from there bulk insert into the live DB in smaller chunks.

Yes, you are right that I do not have a good understanding of the DB "universe"; that is why I am struggling to figure out which direction I should dig into.
You are reprogramming what SQL Server already offers you with replication, then. A staging process would only be needed if you also have a transformation process or you merge in data you collect from external resources.

You seem to have distributed data via identical local installations of your software at affiliates worldwide, and yet you don't make use of replication? That's odd. It's quite a wonder you grew that large.

The problem can't be new, or are you just starting to collect data from affiliates into one central master database?

Bye, Olaf.
Thanks, Olaf.

We do some processing on the staged data, like checking a few validations and changing data formats (e.g. dates), before the insert/update on the live server.

Since the staged data is quite high in volume, we push (insert/update) that data into the live DB in smaller chunks.
If this staging data is coming from affiliates running the same software with the same database structure, the need for validation and reformatting points to misconfiguration. Even if the local databases are set to different locales with regard to number and date formats etc., you can do the bulk export in a standard US format, for example. All exchange data can conform to one set of data formats.

Also, replication works very differently anyway and transports data at the binary level; with transaction log shipping, for example, only the log changes are shipped.

Unless you have a heterogeneous situation (different software versions or licenses), the replication process would be cleaner, though not necessarily easy to set up in the way needed. You normally won't simply replicate any and all changes of every table: you define master and slave databases at the roughest granularity level, and you can, and maybe should, make decisions per table and even per column, and even allow some rows in all databases to differ. For example, a row containing area/locale info has to differ in each location and isn't replicated or synced in any way.

It's not a simple thing you can switch to in five minutes, but replication then does all the needed steps to export and import data in the way it is configured to. To make it work, you will most probably have to make some structural changes to the database(s), if only for easier configuration of what to replicate in which way.

The gist of all this is simply that I see replication as a way out of your seemingly self-developed data syncing via export / staging import / data cleansing / merge, but it is not easily set up, and someone taking on that task would need a full understanding of your database structures and of the table and column meanings before replication can be set up at all. This is not merely something a person versed in replication can set up as an outsourceable job or as a forum answer; it is a full-blown job. A simple trial-and-error attempt would very probably just lead to a situation where you need a recovery.

Part of the problem is that you surely won't want every affiliate database to have all the data of all other affiliates. You want your central database to collect affiliate information, but keep each affiliate limited to the data it needs locally. This is not replication in the sense of propagating every change everywhere so that every database has all info available.

Bye, Olaf.
Thank you so much, Olaf, for such a detailed explanation.

My staging data is coming from different files from different vendors rather than from the same software.

These files contain raw data which requires further processing based on the business rules set by the business.

Once the data is processed we pick smaller chunks one by one and insert them into the live DB during the soft window when there is less load on the server, but we still often face locking issues.
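
For reference, a chunked load like that could look roughly as follows, with batches kept below the lock-escalation threshold (all table, column and batch-size values here are placeholders, not from the actual system):

DECLARE @batch INT = 4000,   -- stay below ~5000 rows so SQL Server is less likely to escalate to a table lock
        @rows  INT = 1;

DECLARE @keys TABLE (BusinessKey INT PRIMARY KEY);

WHILE @rows > 0
BEGIN
    DELETE FROM @keys;

    BEGIN TRAN;

    -- mark one small batch in staging and remember its keys
    UPDATE TOP (@batch) s
    SET    s.Processed = 1
    OUTPUT inserted.BusinessKey INTO @keys (BusinessKey)
    FROM   dbo.StagingTable AS s
    WHERE  s.Processed = 0;

    SET @rows = @@ROWCOUNT;

    -- push just that batch into the live table
    INSERT INTO dbo.LiveTable (BusinessKey, Col1, Col2)
    SELECT s.BusinessKey, s.Col1, s.Col2
    FROM   dbo.StagingTable AS s
    JOIN   @keys            AS k ON k.BusinessKey = s.BusinessKey;

    COMMIT;

    WAITFOR DELAY '00:00:02';  -- give other sessions room between batches
END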
In that case staging really makes sense. It's rather the case I mentioned as "merging in data you collect from external resources".

What could be made more ideal is the structure of the live data: the preparation of the staging data should go as far as only needing to UPDATE some or all columns of matching rows; no transformations or computations should be needed at that step. Ditto for the INSERTs of new rows, if there are any.

When updating the live tables you don't have the choices you have with bulk operations, as you don't just want to insert data but also update it (judging from your previous posts and the code provided there).
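
If the staging rows are already fully prepared, a sketch of that step as one MERGE per chunk could look like this (table and column names are again just placeholders):

MERGE dbo.LiveTable AS tgt
USING dbo.StagingTable AS src
      ON tgt.BusinessKey = src.BusinessKey
WHEN MATCHED THEN
    UPDATE SET tgt.Col1 = src.Col1,
               tgt.Col2 = src.Col2
WHEN NOT MATCHED BY TARGET THEN
    INSERT (BusinessKey, Col1, Col2)
    VALUES (src.BusinessKey, src.Col1, src.Col2);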

You might introduce steps that remove rows from staging which don't actually hold new or changed data, for example by computing a hash for all rows and then checking whether the hash has changed at all. That would only put a read load on your live database, which doesn't cause a locking problem, especially if done with NOLOCK. You also reduce the amount of data to merge in.
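
A minimal sketch of such a hash check, assuming a RowHash column exists (or is added) on both the staging and live tables, and that Col1/Col2 stand in for the real columns:

-- persisted computed hash over the relevant columns; Col1/Col2 assumed to be
-- (n)varchar here, other data types would need deterministic CONVERTs, and
-- HASHBYTES input is limited to 8000 bytes on SQL Server 2008 R2
ALTER TABLE dbo.StagingTable ADD RowHash AS
    HASHBYTES('SHA1', ISNULL(Col1, N'') + N'|' + ISNULL(Col2, N'')) PERSISTED;

-- rows in staging that are new or actually changed; NOLOCK keeps this read
-- from blocking the live workload (at the price of reading uncommitted data)
SELECT s.BusinessKey
FROM   dbo.StagingTable AS s
LEFT JOIN dbo.LiveTable AS l WITH (NOLOCK)
       ON l.BusinessKey = s.BusinessKey
WHERE  l.BusinessKey IS NULL
   OR  l.RowHash <> s.RowHash;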

Bye, Olaf.
ASKER CERTIFIED SOLUTION: content available to Experts Exchange members only.
Thanks a lot, Olaf. I got something out of this discussion to start with. In the meantime I am also collecting a few more relevant details to discuss. For now I am closing this thread, and I thank you from the bottom of my heart for your detailed explanations.

Thank you so much.
Hi Olaf,

Just as a little hint: you don't need to add a hash value as a computed column. SQL Server has the timestamp (rowversion) datatype for this purpose, which is a database-wide binary counter maintained across all tables that use such a column. Another possibility is to use a GUID column ("uniqueidentifier") that is changed to a new value on every update, so it differs from the value in the master table when you compare.
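
For completeness, a tiny sketch of that idea (column and variable names are invented, and as discussed further down this only tracks changes made inside the live database itself):

-- rowversion (the current name for timestamp) is bumped automatically
-- on every insert or update of the row
ALTER TABLE dbo.LiveTable ADD RowVer ROWVERSION;

-- rows changed since the last synchronisation run; @LastSync would hold the
-- highest RowVer value saved at the end of the previous run (value illustrative)
DECLARE @LastSync BINARY(8) = 0x00000000000003E8;
SELECT BusinessKey
FROM   dbo.LiveTable
WHERE  RowVer > @LastSync;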

Hi satmisha,

I would also try to reduce the amount of data first. If such a big number of records must be imported from external sources, I would try to reduce those sources or at least create a direct interface between the source database and yours ("linked servers"). It is often a lot faster to work with the external database directly instead of trying to import e.g. CSV data.
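
As an illustration of the linked-server idea (the server, database and table names are made up, and the login mapping via sp_addlinkedsrvlogin is omitted):

-- register the vendor's SQL Server once
EXEC sp_addlinkedserver
     @server     = N'VENDORSRV',
     @srvproduct = N'',
     @provider   = N'SQLNCLI',
     @datasrc    = N'vendor-host\instance';

-- after that the source tables can be read directly via four-part names
-- instead of exporting and re-importing flat files
INSERT INTO dbo.StagingTable (BusinessKey, Col1, Col2)
SELECT BusinessKey, Col1, Col2
FROM   VENDORSRV.VendorDb.dbo.SourceTable;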

Cheers,

Christian
Bitsqueezer,
If the data comes from external sources you can't rely on timestamp.

A timestamp would only hint at INTERNAL changes of the live database; you can't compare it with anything inside a CSV or other flat file coming from an external source into the staging DB.

The nature of the data coming from external vendors makes a checksum value pretty much the only simple change detection. You can of course also go as far as comparing all values of the records, but that will just take even more time than you potentially save.

Bye, Olaf.
Hi Olaf,

Yep, you're right, I didn't think of that.

Cheers,

Christian
Thanks Experts.

I am closing this as I got a few directions from it. I have also realized that my problem lies in the stored procedure responsible for the update/insert taking too much time, so I broke that SP down to find out the time taken by the various queries. I am going to open a new thread.
Thanks experts.

I have opened the new thread and am closing this one. I would appreciate your comments there:

https://www.experts-exchange.com/questions/28947982/Dynamic-Sql-Query-in-Sp-optimization-stage-to-live-update-insert.html