i_viorel2010

asked on

Sybase IQ 12.7 incremental loading

hello!

How can I load data into Sybase IQ 12.7 so that only the modified/deleted/added records are loaded? I can't use Sybase ETL; I want to use only SQL.
Can I do this efficiently using Java support from Sybase?

Thanks in advance.
Joe Woodhouse

This is a rather broad question - it usually takes a few months in a project to get this stuff working as a custom in-house solution.

Broadly speaking you need:

1. to be able to identify changed rows in your source system(s)

2. some way to deal with multiple operations on a single row - if the same row gets updated twice you need to consolidate this into one update; if a row gets inserted, updated and then deleted, what do you want to have happen in your IQ?

3. to then apply a batched insert, batched delete and multiple update statements in IQ

Note this will be very slow. IQ is not designed to deal well with single-row updates. Whenever I hear someone talking about wanting to do updates and deletes in IQ I want to ask them if they're sure they're doing the right thing. What is your business requirement for this? It sounds to me like you are making the mistake of thinking of IQ as a transactional RDBMS - which of course it isn't.
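
For concreteness, once the deltas have been consolidated to one row per key in a staging table, step 3 is roughly three set-based statements like the sketch below (all table and column names here are made up for illustration, and the exact UPDATE/DELETE syntax may need adjusting for your setup):

-- Sketch only: assumes a staging table holding ONE consolidated row per key,
-- flagged 'A' (insert), 'U' (update) or 'D' (delete).

-- batched insert of new rows (the only part IQ is really built for)
insert into fact_table (ID, Sales)
select s.ID, s.Sales
from   stage_delta s
where  s.Op = 'A';

-- batched update of changed rows
update fact_table
set    fact_table.Sales = s.Sales
from   fact_table inner join stage_delta s on s.ID = fact_table.ID
where  s.Op = 'U';

-- batched delete of removed rows
delete from fact_table
where  ID in (select ID from stage_delta where Op = 'D');

commit;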

Many sites have instead chosen to only ever insert, with some kind of "DateLoadedAt" column to track when any version of a row was inserted. You can then define views that show you only currently live rows, plus you also get the ability to go back in time and do "as at" queries for a specific point in time. Not only is this a lot more powerful for the users, it will be much more efficient in loading because you're only ever inserting, which is IQ's strength. Yes it means more data but do we care about that in IQ?
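
A minimal sketch of that insert-only pattern (all names here, sales_history, DateLoadedAt, sales_current and stage_delta, are made up for illustration):

-- Sketch only: every load is a pure append, which is what IQ does best.
create table sales_history (
    ID           int,
    Sales        bigint,
    DateLoadedAt datetime
);

-- each incoming batch is simply inserted with a load timestamp
insert into sales_history (ID, Sales, DateLoadedAt)
select ID, Sales, current timestamp
from   stage_delta;                 -- hypothetical staging table

-- "currently live" rows = the latest version of each ID
create view sales_current as
select h.ID, h.Sales, h.DateLoadedAt
from   sales_history h
where  h.DateLoadedAt = ( select max(h2.DateLoadedAt)
                          from   sales_history h2
                          where  h2.ID = h.ID );

-- "as at" queries just move the cut-off, e.g.
--   ... where h2.DateLoadedAt <= '2010-06-30 23:59:59'

Deletes can be handled in the same spirit, e.g. by inserting a new version carrying some kind of status flag rather than physically removing rows.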

So the quick answer to your question is yes you can do this, yes you can build your own custom solution, but it will take you a while, it will never be very fast, and it is probably the wrong solution anyway.
i_viorel2010 (ASKER)

Thanks for the response.
My idea is:
- create a stored procedure
- use LOAD TABLE-based SQL syntax to load the data in buffers
- for every buffer:
  - add the new records from the buffer to the system data
  - update the records
  - delete from the system data the records marked as deleted in the buffer
Suppose you have the following table: create table myData(ID int, Sales bigint);
and the following data in the file sal.csv:

3;1000;A
4;1200;A
3;1500;U
5;1600;A
4;1300;U
3;1500;D

A is for a new record, U for an updated record and D for a deleted record.
Let tmp_data be the following temporary table:

create table tmp_data(ID int, Sales int, Op char);

load table tmp_data(ID, Sales, Op '\n')
from 'd:\\sal.csv'
QUOTES OFF
ESCAPES OFF
FORMAT ascii
delimited by ';'    
skip 1;

How do I modify LOAD TABLE to load the data in buffers (batches)?
 
I'm not sure what you're asking...
Ok. I want to implement the following algorithm:

declare skip_records int;
declare block_size int;
declare ok int;
set skip_records=0;
set block_size=1000;   -- we load data in batches of 1000 records
set ok=1;
create table tmp_data(ID int, Sales int, Op char);

while ok>0 loop
  delete from tmp_data;

  load table tmp_data(ID, Sales, Op '\n')
  from 'd:\\sal.csv'
  QUOTES OFF
  ESCAPES OFF
  BLOCKSIZE block_size
  SKIP skip_records
  FORMAT ascii
  delimited by ';';

  insert into myData(ID, Sales)
  select ID, Sales
  from tmp_data
  where Op='A';

  update myData
  set myData.Sales=tmp_data.Sales
  from tmp_data inner join myData on
  (tmp_data.ID=myData.ID) and (tmp_data.Op='U');

  delete from myData
  from myData inner join tmp_data
  on (tmp_data.ID=myData.ID) and (tmp_data.Op='D');

  set skip_records=skip_records+block_size;

  select count(*) into ok
  from tmp_data;
end loop;

Can it be optimized? Is it a good approach?
Um.

Firstly the BLOCKSIZE parameter in the IQ LOAD TABLE statement has nothing to do with how many rows are in your batch. That is a specific measure for how memory should be allocated during loads and should not be changed unless you have very specific technical reasons for doing so. You haven't said anything to suggest you're in any of those circumstances.

Secondly a batch size of 1000 is absurdly small for IQ. I suspect you are in the habit of tuning batch jobs in ASE where indeed a smaller batch size helps things like transaction logging. This is not a factor in IQ and you kinda need to reverse all of your standard RDBMS habits. :) Load in as large a batch as you can.

Do you have a specific business requirement for only working on 1000 changes at a time?

Before we even get into how to optimise and implement your pseudocode, you need to go back to the drawing board. How will you deal with multiple operations on the same primary key? What happens if the same row is updated three times? Your current approach will fail because it will appear in your temp table three times and thus no longer be unique.

Don't even think about commands, syntax & tuning just yet - you still have quite a way to go before you're ready to look at physical implementation. You need to figure out many logical design issues like how to deal with (or prevent) multiple operations to the same primary key, how to detect and deal with errors at any stage of the process, whether it even makes sense to load the deltas directly into IQ (where is the data coming from? It might make more sense to have IQ connect directly to that source).

You also need to have a good think about whether it makes sense to do any updates or deletes at all. Not everyone agrees with this but I believe that if someone is talking about deletes in a data warehouse there is a problem with their business analysis and design. I suspect you are still thinking in relational transactional database terms. A data warehouse is not a database, and what makes sense for one does not make sense for the other. Why would we willingly destroy information? Hint: what reasons are there for deleting data in your source systems? Most of those reasons will not apply to a data warehouse.

There's a reason I say this sort of thing takes months to do right. You are not going to get the answer in just a couple of hours from a web site, this is a huge job. ETL often takes 50% of a data warehouse project!
Thanks for the response. It's about BLOCK FACTOR; that was my mistake. The batch size can be huge. But for 3-4 TB of data, would a batch size of 1 GB (in number of records, approximately) be good?

What happens if the same row is updated three times? Your current approach will fail because it will appear in your temp table three times and thus no longer be unique.

If the ID is unique and a change to a field other than the primary key is recorded in the Op column, for example with data in the following form:

3;1500;U
5;1600;A
3;1300;U

can I assume the updates are applied sequentially, in file order, by the update statement?

I will remove the delete operation.

The code becomes:


declare skip_records int;
declare block_size int;
declare ok int;
set skip_records=0;
set block_size=1000000000;   -- we load data in batches of 1 billion records
set ok=1;

create table tmp_data(ID int, Sales int, Op char);

while ok>0 loop
  delete from tmp_data;

  load table tmp_data(ID, Sales, Op '\n')
  from 'd:\\sal.csv'
  QUOTES OFF
  ESCAPES OFF
  BLOCK FACTOR block_size
  SKIP skip_records
  FORMAT ascii
  delimited by ';';

  insert into myData(ID, Sales)
  select ID, Sales
  from tmp_data
  where Op='A';

  update myData
  set myData.Sales=tmp_data.Sales
  from tmp_data inner join myData on
  (tmp_data.ID=myData.ID) and (tmp_data.Op='U');

  set skip_records=skip_records+block_size;

  select count(*) into ok
  from tmp_data;
end loop;


I think you're being distracted by the technology. :) I do that too.

It's way too soon to be writing a single line of code. Get the design right. The coding is the easy part but it has to be done last or it all goes to hell.

Re. multiple updates: ordering sequentially won't work, because you are proposing to update the permanent rows based on a join to the temp table - and your temp table will have multiple rows in it. You have several approaches - either collapse the updates into one update per row (difficult - you have to detect which column(s), figure out what the final after-image should look like), or you can enforce a rule that only one change to a row is applied in any one load (difficult - what if there are dependencies on this row? You'll have to identify those and push them into the next load also).
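
A rough sketch of the first option, assuming each staged row is a full after-image of the row (as in your example) and assuming the staging table carries a load sequence number (the seq column below is an assumption, say a line number taken from the extract file; it is not something LOAD TABLE gives you for free). You can then keep only the final image per key before applying anything:

-- Sketch only: tmp_data is assumed to have a seq column that increases
-- in file order. Keep only the last change for each ID.
delete from tmp_data
where exists ( select 1
               from   tmp_data later
               where  later.ID  = tmp_data.ID
               and    later.seq > tmp_data.seq );

Even then you still have to decide what the surviving row means if it's a delete of something inserted earlier in the same batch.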

And even then, how do you want to handle a row being inserted, then updated, then deleted?

Also if you're thinking in terms of loops your code isn't ready for a database yet. :) Do as much as possible in set-based operations. If you're batching up your loads - what do you want to happen if the second batch fails? Do you still want to do the third? What if the third contains an update to a row that was inserted in the second (failed) batch and now therefore won't work?
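
And if you do keep separate batches, each one at least needs to be atomic, something along the lines of this sketch (Watcom-SQL style exception handling; same illustrative names as before, and the details will vary in your environment):

-- Sketch only: apply one consolidated batch as a single unit of work.
begin
    insert into fact_table (ID, Sales)
    select ID, Sales from stage_delta where Op = 'A';

    update fact_table
    set    fact_table.Sales = s.Sales
    from   fact_table inner join stage_delta s on s.ID = fact_table.ID
    where  s.Op = 'U';

    commit;
exception
    when others then
        rollback;
        -- record the failure and decide whether later batches may still run,
        -- since they may depend on rows from this one
end;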

It's way too soon to be writing code. It'll be too soon to write code next week, too. :) Like I keep saying, this takes months to get right. If you're only doing this for one table then perhaps it will only take weeks. :)
Thanks for the advice.

The idea is that the primary key doesn't change.
The problem of selecting the last update can be solved with an inner join and a rowid.


And even then, how do you want to handle a row being inserted, then updated, then deleted?

An insertion assumes a new ID.

The size of the batch can be reduced if I first delete all records that have both an add and a delete state.
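
Something like this sketch could drop all staged rows for an ID that is both added and deleted in the same batch (using the tmp_data table from above):

-- remove every staged row for IDs that appear with both 'A' and 'D',
-- so neither operation gets applied
delete from tmp_data
where ID in ( select ID from tmp_data where Op = 'A' )
  and ID in ( select ID from tmp_data where Op = 'D' );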

I can use a transaction for the batch insert.

But the main question is: can I use the approach with the WHILE loop?
There are many problems, and I suppose that a batched load with transactions can help me.
Or is it a better idea to load all the data into a table and make the changes at the end?
Sybase IQ 12.7 doesn't have incremental loading, so I'm trying to implement this facility myself.
ASKER CERTIFIED SOLUTION
Joe Woodhouse

This solution is only available to members.
Thanks for the advice. I will give 500 points.
Hope that helped! Sorry there isn't a quick & easy answer here; this is actually the most difficult part of any DW implementation...