SQL Query Needed

Hi Experts!

Hope somebody could possibly help here? Im no SQL Expert and pulling my hair out trying to find an answer!

We have a SQL Database which has a Link Database to Google BigQuery. On the SQL Server I can run the following query which returns all the data.

SELECT * FROM OPENQUERY([BQ], 'SELECT * FROM dataviews.salesdata')

What I would like to be able to do is import all this data into a table in our SQL Database called SalesData.

I only want to insert the fields that have changed though (or don't exist).

Could somebody possibly help me with a SQL Statement

Thanks in advanced for any Help offered!
Simon GreenSenior Infrastructure and Security ManagerAsked:
Who is Participating?

[Product update] Infrastructure Analysis Tool is now available with Business Accounts.Learn More

I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

Mark WillsTopic AdvisorCommented:
Welcome to EE - At least, to asking a SQL question...  :)

If you are happy with the results from SELECT * FROM OPENQUERY([BQ], 'SELECT * FROM dataviews.salesdata')

The I would suggest the first step is to secure that information into a 'space' which you can then manipulate / compare / update your target table in the  SalesData DB. So, lets create a uniquely named table and store the dataviews.salesdata into that by using SELECT ... INTO ... FROM

The only "gotcha" is the table you are selecting into cannot already exist. We sometimes use temporary tables (prefix of #) for such a purpose, but a uniquely identifiable table will give you more flexibility.... For this purpose, I will use the name 'tmp_load_salesdata' which becomes more like a staging table.

USE [salesdata]  -- set your DB location.

IF object_id('tmp_load_salesdata','U') is not null drop table tmp_load_salesdata

-- now get our data

INTO tmp_load_salesdata
FROM OPENQUERY([BQ], 'SELECT * FROM dataviews.salesdata')

-- now check that we have our data

SELECT * FROM tmp_load_salesdata

Open in new window

Once we are happy with that, we can then use that data to
a) find differences to the salesdata table
b) add new entries to salesdata table

So, do you already have the salesdata table ?
Do the column names match the new tmp_load_salesdata ?

Can we get a list of column names you want to compare / insert from both tables ?

Mark Wills
Echo Mark's suggestion to bring the data on board where direct subsequent queries can be run.

Your question does not include enough detail dealing with logic, business process on whose basis the data you want versus the data you don't.

You further seem to be looking at comparing data  ........
An update using join to compare, update.

A sample illustration of data from one and sat from the other describing what the resulting data would look like, might help clarify your need and what the possible solution might be.
Scott PletcherSenior DBACommented:
You might first check to see if Google BigQuery has a way to query only rows that have changed since a certain datetime.  That would make your task vastly easier if it's available and can be implemented.  For example, perhaps something like using EXCEPT and FOR SYSTEM TIME AS OF to limit rows to get only added/changed rows since that time (NOTE: I'm certainly not proficient in Google BQ, that is just a possible idea to maybe toggle some ideas from your BQ folks on how to do it).

Assuming it can't, you'll need to consider:
1) how large the Google data is; and
2) how often you'll need to copy/sync it.

For the actual load, you'll want to go into a db that has enough log space pre-allocated to handle the full load.
If you want the table structure to be dynamic, you should create the table structure only first using something like:
USE <db_name>;
IF OBJECT_ID('dbo.<table_name>') IS NOT NULL
    DROP TABLE dbo.<table_name>
SELECT * INTO dbo.<table_name> FROM OPENQUERY([BQ], 'SELECT * FROM dataviews.salesdata LIMIT 0')
--create clustered index here if Google data is already in a known sequence, which will avoid a sort later--
Then load the full data using:
INSERT INTO dbo.<table_name> WITH (TABLOCK) /*the TABLOCK hint is critical here*/
SELECT * FROM OPENQUERY([BQ], 'SELECT * FROM dataviews.salesdata')

Loading the full table with SELECT ... INTO <table_name> will cause internal locks on system objects to be retained until the end of the INSERT (actually, I think until the INSERT is committed, which could be even longer).  Those locks can cause other tasks to be blocked, or even for SQL to consider it deadlocked in very rare cases.

Sorry, don't have time for any other suggestions right now, but hopefully this gives you something to work with.
Simon GreenSenior Infrastructure and Security ManagerAuthor Commented:
Hi All!

Many thanks for everyones response.

To answer some of the questions ...

Mark Wills - Yes, SalesData Table already exists and is like for like on what will come from the BQ Query.
                      Table has the following Feilds
                      Data_1 is always Unique                    

Scot Pletcher - BigQuery seems to be a Royal PITA. I don't think at present (Maybe wrong) you can check for changes in the last x minutes.

Basically anything in the DB within BigQuery could change except Coloum_1 (Data_1)

The aim is to have onsite SQL with the data in. People will then query the SQL Box onsite. We have some site which have extremely low bandwidth and pulling the data live is just becoming un-workable.

Mark WillsTopic AdvisorCommented:
when you say Data_1 is always unique, is that from day to day, or is it like a primary key that you can search on to compare against yesterday's data.

Or is it a replacement file (table) ?

What is the frequency of download ?

What kind of volumes are we talking about ?

Last time I played with BigData, we could set up schemas and row definitions.... And there are some API's....

I would still download to a staging table so we can then put away to the SalesData table. The salesdata table will grow after each addition / download of (daily) data.

After download to staging, need to put an index on whatever fields / columns are used to identify unique data (in the example / demo below it would be data_1 and data_2)

Depending on frequency this could all be an overnight job to minimise impacts on daily working hours performance.

Then run your updates - for demo purposes, lets say Data_1 and Data_2 is enough to identify existing rows in the salesdata table.

You would run a two stage update.
1) update existing rows with new data
2) insert new rows

For the purpose of example, I will assume data_1 and data_2 are used to uniquely identify a row, and data_3 and data_4 are a couple of columns we want to update (or insert) you would need to expand to account for all data_1 through to data_8.

-- create indexes

create index idx_load_salesdata on tmp_load_salesdata (data_1,data_2)   -- could be unique clustered index

-- update stage - a couple of different ways to execute the update. Field by field updates, or all fields in one update
begin transaction
update s set s.data_3 = t.data_3
from salesdata s
inner join tmp_load_salesdata t on s.data_1 = t.data_1 and s.data_2 = t.data_2
where s.data_3 <> t.data_3

begin transaction
update s set s.data_4 = t.data_4
from salesdata s
inner join tmp_load_salesdata t on s.data_1 = t.data_1 and s.data_2 = t.data_2
where s.data_4 <> t.data_4

-- or

begin transaction
update s set s.data_3 = iif(s.data_3 <> t.data_3,t.data_3, s.data_3)
            ,s.data_4 = iif(s.data_4 <> t.data_4,t.data_4, s.data_4)
from salesdata s
inner join tmp_load_salesdata t on s.data_1 = t.data_1 and s.data_2 = t.data_2

-- now insert new rows

begin transaction
Insert salesdata (data_1,data_2,data_3,data_4)
select t.data_1, t.data_2, t.data_3, t.data_4
from tmp_load_salesdata t
left outer join salesdata s on s.data_1 = t.data_1 and s.data_2 = t.data_2
where s.data_1 is NULL

Open in new window

Of course, depending on a few assumptions I have made, it might need to be massaged a bit, but should give the general idea...

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
Query Syntax

From novice to tech pro — start learning today.