Solved

sybase iq 12.7 incremental loading

Posted on 2011-02-14
Last Modified: 2012-05-11
hello!

How can I do data loading in Sybase IQ 12.7 so that only the modified/deleted/added records are loaded? I can't use Sybase ETL; I want to use only SQL.
Can I do this efficiently using Java support from Sybase?

Thanks in advance.
Question by:i_viorel2010
11 Comments
 
LVL 24

Expert Comment

by:Joe Woodhouse
ID: 34893325
This is a rather broad question - it usually takes a few months in a project to get this stuff working as a custom in-house solution.

Broadly speaking you need:

1. to be able to identify changed rows in your source system(s);

2. some way to deal with multiple operations on a single row - if the same row gets updated twice you need to consolidate this into one update; if a row gets inserted, updated and then deleted, what do you want to have happen in your IQ?

3. to then apply a batched insert, batched delete and multiple update statements in IQ.

Note this will be very slow. IQ is not designed to deal well with single-row updates. Whenever I hear someone talking about wanting to do updates and deletes in IQ, I want to ask them if they're sure they're doing the right thing. What is your business requirement for this? It sounds to me like you are making the mistake of thinking of IQ as a transactional RDBMS - which of course it isn't.

Many sites have instead chosen to only ever insert, with some kind of "DateLoadedAt" column to track when any version of a row was inserted. You can then define views that show you only currently live rows, plus you also get the ability to go back in time and do "as at" queries for a specific point in time. Not only is this a lot more powerful for the users, it will be much more efficient in loading because you're only ever inserting, which is IQ's strength. Yes it means more data but do we care about that in IQ?
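To illustrate, here is a minimal sketch of that insert-only pattern (the table, column and view names are hypothetical, not from the question):

create table SalesHistory (
    ID           int,
    Sales        bigint,
    DateLoadedAt datetime
);

-- "Currently live" rows = the latest version of each ID.
create view SalesCurrent as
select h.ID, h.Sales, h.DateLoadedAt
from SalesHistory h
    join (select ID, max(DateLoadedAt) as LastLoad
          from SalesHistory
          group by ID) m
      on h.ID = m.ID
     and h.DateLoadedAt = m.LastLoad;

-- "As at" queries then need no extra machinery, e.g.:
-- select * from SalesHistory where DateLoadedAt <= '2011-01-31'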

So the quick answer to your question is yes you can do this, yes you can build your own custom solution, but it will take you a while, it will never be very fast, and it is probably the wrong solution anyway.
 

Author Comment

by:i_viorel2010
ID: 34894345
Thanks for the response.
My idea is:
- create a stored procedure
- use LOAD TABLE based SQL syntax to load data in buffers
- for every buffer:
  - add new records from the buffer to the system data
  - update the records
  - delete from the system data the records marked as deleted in the buffer
Suppose you have the following table: create table myData(ID int, Sales bigint);
and the following data in file sal.csv:

3;1000;A
4;1200;A
3;1500;U
5;1600;A
4;1300;U
3;1500;D

A is for a new record, U is for an updated record and D is for a deleted record.
Let tmp_data be the following temporary table:

create table tmp_data(ID int, Sales int, Op char(1));

load table tmp_data (ID, Sales, Op '\n')
from 'd:\\sal.csv'
QUOTES OFF
ESCAPES OFF
FORMAT ascii
DELIMITED BY ';'
SKIP 1;

How do I modify LOAD TABLE to load the data in buffers?
 
 
LVL 24

Expert Comment

by:Joe Woodhouse
ID: 34894373
I'm not sure what you're asking...
 

Author Comment

by:i_viorel2010
ID: 34894816
Ok. I want to implement the following algorithm:

declare skip_records int;
declare block_size int;
declare ok int;
set skip_records = 0;
set block_size = 1000;   -- we load data in batches of 1000 records
set ok = 1;
create table tmp_data(ID int, Sales int, Op char(1));

while ok > 0 loop
    delete from tmp_data;

    load table tmp_data (ID, Sales, Op '\n')
    from 'd:\\sal.csv'
    QUOTES OFF
    ESCAPES OFF
    BLOCKSIZE block_size
    SKIP skip_records
    FORMAT ascii
    DELIMITED BY ';';

    insert into myData
    select ID, Sales
    from tmp_data
    where Op = 'A';

    update myData
    set myData.Sales = tmp_data.Sales
    from tmp_data inner join myData
      on (tmp_data.ID = myData.ID) and (tmp_data.Op = 'U');

    delete from myData
    where ID in (select ID from tmp_data where Op = 'D');

    set skip_records = skip_records + block_size;
    select count(*) into ok from tmp_data;
end loop;

Can it be optimized? Is it a good approach?
 
LVL 24

Expert Comment

by:Joe Woodhouse
ID: 34894890
Um.

Firstly, the BLOCKSIZE parameter in the IQ LOAD TABLE statement has nothing to do with how many rows are in your batch. It is a specific measure for how memory should be allocated during loads and should not be changed unless you have very specific technical reasons for doing so. You haven't said anything to suggest you're in any of those circumstances.

Secondly, a batch size of 1000 is absurdly small for IQ. I suspect you are in the habit of tuning batch jobs in ASE, where indeed a smaller batch size helps things like transaction logging. This is not a factor in IQ, and you kinda need to reverse all of your standard RDBMS habits. :) Load in as large a batch as you can.

Do you have a specific business requirement for only working on 1000 changes at a time?

Before we even get into how to optimise and implement your pseudocode, you need to go back to the drawing board. How will you deal with multiple operations on the same primary key? What happens if the same row is updated three times? Your current approach will fail because it will appear in your temp table three times and thus no longer be unique.

Don't even think about commands, syntax & tuning just yet - you still have quite a way to go before you're ready to look at physical implementation. You need to figure out many logical design issues like how to deal with (or prevent) multiple operations to the same primary key, how to detect and deal with errors at any stage of the process, whether it even makes sense to load the deltas directly into IQ (where is the data coming from? It might make more sense to have IQ connect directly to that source).

You also need to have a good think about whether it makes sense to do any updates or deletes at all. Not everyone agrees with this, but I believe that if someone is talking about deletes in a data warehouse there is a problem with their business analysis and design. I suspect you are still thinking in relational transactional database terms. A data warehouse is not a transactional database, and what makes sense for one does not make sense for the other. Why would we willingly destroy information? Hint: what reasons are there for deleting data in your source systems? Most of those reasons will not apply to a data warehouse.

There's a reason I say this sort of thing takes months to do right. You are not going to get the answer in just a couple of hours from a web site, this is a huge job. ETL often takes 50% of a data warehouse project!
 

Author Comment

by:i_viorel2010
ID: 34895153
Thanks for the response. I meant BLOCK FACTOR - my mistake. The batch size can be huge, but for 3-4 TB of data, would a batch size of about 1 GB (in number of records, approximately) be good?

"What happens if the same row is updated three times? Your current approach will fail because it will appear in your temp table three times and thus no longer be unique."

If the ID is unique and a change to any field other than the primary key is recorded in the Op column - for example, if the data is in the following form:

3;1500;U
5;1600;A
3;1300;U

can I assume sequential (file) order for the update statements?

I will remove the delete operation.

The code becomes:


declare skip_records int;
declare block_size int;
declare ok int;
set skip_records = 0;
set block_size = 1000000000;   -- we load data in batches of 1 billion records
set ok = 1;

create table tmp_data(ID int, Sales int, Op char(1));

while ok > 0 loop
    delete from tmp_data;

    load table tmp_data (ID, Sales, Op '\n')
    from 'd:\\sal.csv'
    QUOTES OFF
    ESCAPES OFF
    BLOCK FACTOR block_size
    SKIP skip_records
    FORMAT ascii
    DELIMITED BY ';';

    insert into myData
    select ID, Sales
    from tmp_data
    where Op = 'A';

    update myData
    set myData.Sales = tmp_data.Sales
    from tmp_data inner join myData
      on (tmp_data.ID = myData.ID) and (tmp_data.Op = 'U');

    set skip_records = skip_records + block_size;
    select count(*) into ok from tmp_data;
end loop;

 
LVL 24

Expert Comment

by:Joe Woodhouse
ID: 34895203
I think you're being distracted by the technology. :) I do that too.

It's way too soon to be writing a single line of code. Get the design right. The coding is the easy part but it has to be done last or it all goes to hell.

Re. multiple updates: ordering sequentially won't work, because you are proposing to update the permanent rows based on a join to the temp table - and your temp table will have multiple rows in it. You have several approaches - either collapse the updates into one update per row (difficult - you have to detect which column(s), figure out what the final after-image should look like), or you can enforce a rule that only one change to a row is applied in any one load (difficult - what if there are dependencies on this row? You'll have to identify those and push them into the next load also).
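For what it's worth, a hedged sketch of the "collapse" option, assuming a hypothetical seq column recording file order has been added to tmp_data (LOAD TABLE won't give you that for free - you'd have to populate it yourself):

-- Apply only the latest 'U' row per ID; seq is the assumed file-order column.
update myData
set myData.Sales = t.Sales
from myData
    inner join tmp_data t
      on t.ID = myData.ID
    inner join (select ID, max(seq) as last_seq
                from tmp_data
                where Op = 'U'
                group by ID) latest
      on t.ID = latest.ID and t.seq = latest.last_seq
where t.Op = 'U';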

And even then, how do you want to handle a row being inserted, then updated, then deleted?

Also if you're thinking in terms of loops your code isn't ready for a database yet. :) Do as much as possible in set-based operations. If you're batching up your loads - what do you want to happen if the second batch fails? Do you still want to do the third? What if the third contains an update to a row that was inserted in the second (failed) batch and now therefore won't work?

It's way too soon to be writing code. It'll be too soon to write code next week, too. :) Like I keep saying, this takes months to get right. If you're only doing this for one table then perhaps it will only take weeks. :)
 

Author Comment

by:i_viorel2010
ID: 34895357
Thanks for the advice.

The idea is that the primary key doesn't change.
The problem of selecting the last update can be solved with an inner join and rowid.

"And even then, how do you want to handle a row being inserted, then updated, then deleted?"

An insertion supposes a new ID.

The size of the batch can be reduced if I delete all records that have both the add and delete states, as in the sketch below.
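For example (a rough sketch only, using the tmp_data table from above - a row added and then deleted within the same file never needs to touch myData):

delete from tmp_data
where ID in (select ID from tmp_data where Op = 'A')
  and ID in (select ID from tmp_data where Op = 'D');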

I can use a transaction for the batch insert.

But the main question is: can I use the approach with WHILE?
There are many problems, and I suppose that batch loading with transactions can help me.
Or is it a better idea to load all the data into a table and make the changes at the end?
Sybase IQ 12.7 doesn't have incremental loading, and I am trying to implement this facility.
 
LVL 24

Accepted Solution

by:
Joe Woodhouse earned 500 total points
ID: 34895419
Right, the PK won't change, but what if you have multiple changes to the same row?

There's no need for any WHILE loop here.

IQ will be very fast for all reads and inserts, slow for deletes and very very slow for updates. You want to minimise those.
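Purely as an illustration (reusing the thread's table names, and assuming at most one operation per ID per file - the multiple-change problem still has to be solved upstream), the whole thing can be done set-based, with no loop:

-- One big load: IQ's strength.
create table tmp_data (ID int, Sales bigint, Op char(1));

load table tmp_data (ID, Sales, Op '\n')
from 'd:\\sal.csv'
QUOTES OFF
ESCAPES OFF
FORMAT ascii
DELIMITED BY ';';

-- Inserts: fast in IQ.
insert into myData (ID, Sales)
select ID, Sales from tmp_data where Op = 'A';

-- Updates: slow in IQ - keep them to a minimum.
update myData
set myData.Sales = t.Sales
from myData inner join tmp_data t on t.ID = myData.ID
where t.Op = 'U';

-- Deletes: slow too, and often better avoided entirely
-- (see the insert-only pattern discussed earlier).
delete from myData
where ID in (select ID from tmp_data where Op = 'D');

commit;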
 

Author Comment

by:i_viorel2010
ID: 34903491
Thanks for the advice. I will give 500 points.
 
LVL 24

Expert Comment

by:Joe Woodhouse
ID: 34903538
Hope that helped! Sorry there isn't a quick & easy answer here, this is actually the most difficult part of any DW implementation...