How does Sybase IQ store data vertically?

Hi,

I have a question regarding Sybase IQ, a data warehousing product. I have read that it stores data on pages vertically rather than horizontally, as in normal databases. However, I'm failing to understand how this is done.

What is the real definition of a table scan? Is it when an entire page is brought into memory and scanned? Or is it when all the pages associated with a table are brought into memory? The reason I ask is that I read in the manuals that Sybase IQ never performs a table scan, only a column scan at most.

Can someone please explain how the data is physically stored on an IQ page? It's really bugging me because the manuals just say the data is stored vertically, which increases the speed of queries, but don't explain or give examples of how the data is stored!

THANKS.
Asked by SybaseUk
grant300Commented:
IQ is what is called a "column store" database.  Instead of storing a record at a time, stacking records into pages and pages into page strings, it stores the values of a single column on each page and chains those pages into page strings.

When data is loaded, each column is treated separately.  IQ will light up a thread for each column of the table being loaded.  Each thread handles just one column.

In the process of loading a column, the data is compressed and indexed.  How this is done depends on the data type, the cardinality of the domain, and the index type(s) explicitly defined for that column.  For instance, very low cardinality fields such as GENDER (Male/Female) will be defined with a "Low Fast" index.  This creates a bit map rather than storing the actual character string, and then applies a compression algorithm to the bit map.  As a result, instead of taking an average of 5 bytes/40 bits (plus overhead) per column value for GENDER, it takes less than 1 bit given the compression, and you get an index on the column for free.
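
If it helps to see the idea, here is a rough Python sketch of my own (purely illustrative, not IQ's actual on-disk format) of splitting records into per-column pages and bitmap-encoding a low-cardinality column:

    # Illustrative sketch only -- not IQ's real page format.
    rows = [
        (1, "Ann",   "FEMALE"),
        (2, "Bob",   "MALE"),
        (3, "Carol", "FEMALE"),
        (4, "Dave",  "MALE"),
    ]

    # Row store: whole records stacked together on a page.
    row_page = list(rows)

    # Column store: one page (chain) per column; position = internal row id.
    id_page     = [r[0] for r in rows]      # [1, 2, 3, 4]
    name_page   = [r[1] for r in rows]      # ['Ann', 'Bob', 'Carol', 'Dave']
    gender_page = [r[2] for r in rows]      # ['FEMALE', 'MALE', ...]

    # "Low Fast"-style idea: swap each string for a single bit via a small
    # lookup table, so the column data doubles as an index.
    lookup = {"MALE": 0, "FEMALE": 1}
    gender_bits = [lookup[g] for g in gender_page]
    print(gender_bits)                      # [1, 0, 1, 0]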

IQ pages are not really fixed size, either.  Pages start out at some predetermined size set at database creation time, typically 512KB.  The page holds the raw data for the column and, of course, the number of column entries it contains depends on the size of the raw data.  The page as it is stored on disk is the size after compression and indexing.  In this way, the actual I/O bandwidth and disk required can be smaller than the raw data size and is always much smaller (usually a quarter to an eighth) than the overall space required for a traditional RDBMS after loading and indexing.

Row reconstruction is done using the ROWID.  Column values for a particular row can be reconstituted into a traditional row because the ROWIDs are tracked for each page.

The speed of IQ comes in part because you need only search the data (columns) that appear in the WHERE clause of a query, in part because every column is (or can be) indexed, and in part because the data is spread out over a lot less disk space.
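
As a toy illustration of that first point (again my own sketch, not what IQ actually executes), a predicate on one column only ever touches that column's pages; the matching row ids then pull values from just the columns being returned:

    # Made-up data; the point is which columns get touched.
    gender   = ["F", "M", "F", "M", "F"]
    au_lname = ["Jones", "Smith", "Brown", "White", "Green"]

    # WHERE gender = 'F': scans only the gender column.
    matches = [rid for rid, g in enumerate(gender) if g == "F"]

    # SELECT au_lname: fetches only the requested column by row id.
    result = [au_lname[rid] for rid in matches]
    print(result)   # ['Jones', 'Brown', 'Green']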

The outline above is a greatly simplified view of how IQ functions internally, but it gives you a rough idea of what is going on.  For a more detailed look, you can see my article, "An Introduction to Sybase IQ", in one of the ISUG Tech Journals from 2007.

BTW, the nominal cost of membership not only gets you a subscription to the Tech Journal, it also gets you access to a lot of Sybase software for free or at greatly reduced prices.  For instance, you can get a fully functional, non-expiring Sybase IQ Developer's Edition for FREE.

Regards,
Bill
SybaseUkAuthor Commented:
Thanks Bill, I shall be awarding you the points.

I just want to understand how a typical table with say 5 columns would be stored internally.

For example if I have a table called 'authors' with the columns:

au_id
au_fname
au_lname
gender
city

Could you explain how this data would be stored in IQ please?

Thanks!

p.s. could you mail the article to me please? I'd greatly appreciate it!

sybaseveteran at gmail dot com
grant300Commented:
I am not sure how better to explain it than I already did.  Each column is stored individually on its own pages and page chains.

You can think of your 5-column table as being stored as five separate, one-column tables with coordinated internal row ids.  When a record gets loaded, it is split into its columns and each one is loaded into the appropriate one-column "table".  When you query it and get back a particular row, say rowid=593, the database goes out to the one-column "tables", gets the values for rowid 593, and reconstructs the entire row.
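
To make that concrete with your 'authors' table, here is a little Python sketch (the data and row ids are made up, and real IQ pages are compressed and indexed, but the mechanics are the same idea):

    # One structure per column; the list index stands in for the row id.
    au_id    = ["A001",  "A002",  "A003"]
    au_fname = ["Sue",   "Tom",   "Pat"]
    au_lname = ["Jones", "Smith", "Brown"]
    gender   = ["F",     "M",     "F"]
    city     = ["Leeds", "York",  "Bath"]

    columns = [au_id, au_fname, au_lname, gender, city]

    def fetch_row(rowid):
        """Reconstruct a full row by pulling position `rowid` from every column."""
        return tuple(col[rowid] for col in columns)

    print(fetch_row(1))   # ('A002', 'Tom', 'Smith', 'M', 'York')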

Of course, the indexing and compression going on in IQ makes this technique much more efficient than implementing it in a standard relational database would be.

I am not allowed to distribute the articles and columns I write for the ISUG Tech Journal without permission.  Sorry about that.  It is copyrighted (by ISUG) and, quite frankly, the Tech Journal is one of the big draws of ISUG membership.  Should you opt to become a member, all of the past editions are online in the archive.

Regards,
Bill
SybaseUkAuthor Commented:
Thanks Bill once again.

That's what I wanted to know.  So basically, each column has its own page exclusively?  So 'Gender' would be on a page by itself and no other columns would exist on the same page?

This is starting to make sense, and I can see the key difference in Sybase IQ now... If each column has its own page, wouldn't there be more I/O, i.e. more pages being read/written?
grant300Commented:
Correct, 'Gender' would have its own page and page chain exclusively.

As far as I/Os are concerned, it depends on the operation.
 - For individual inserts, updates, and deletes, you pay a huge penalty.
 - For operations extracting most or all of the columns across a bunch of rows, you come out either ahead of the game or behind, depending on the relative location of the rows being returned.  For a bunch of contiguous rows, you come out ahead; for a few widely dispersed random rows, you come out behind.
 - For analytic queries, e.g. what percentage of red-headed kids go out for team versus individual sports, or what is the distribution of missed installation appointments by county and miles traveled, you come out way, way ahead on I/O.  These kinds of queries typically use a tiny fraction of the columns and scan large sections of the ones that are accessed.  This is where being able to deal with just the data columns required, in a compressed and indexed form (remember, just about every column has at least one index), greatly reduces the amount of I/O required.  It also plays into the trend in hardware by shifting more of the work to the CPU and away from the disk subsystem.  We have 3 GHz quad-core chips but, after 20 years, disk drives still only do about 100 I/O operations per second.  A rough sketch of the arithmetic is below.
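
Here is a back-of-the-envelope comparison for that last case (every number here is an assumption I picked for illustration, not a measurement):

    # Hypothetical analytic query touching 2 of 50 columns.
    rows        = 10_000_000
    cols        = 50
    avg_col_len = 10           # assumed bytes per column value
    page_size   = 512 * 1024   # bytes

    # Row store must drag whole rows through memory.
    row_store_bytes = rows * cols * avg_col_len

    # Column store reads 2 columns, at an assumed ~8:1 compression.
    col_store_bytes = rows * 2 * avg_col_len / 8

    print(row_store_bytes // page_size, "pages vs",
          int(col_store_bytes) // page_size, "pages")
    # -> roughly 9536 pages vs 47 pages under these assumptions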

Regards,
Bill
SybaseUkAuthor Commented:
Thanks!

Lastly on this subject, can you explain a little about bit-wise indexing?  What does this mean, and how does it reduce the storage space required as opposed to B-tree indexing?

Very Much appreciated!
grant300Commented:
In its simplest form, bit-wise indexing can be explained as follows:
For a column that takes on one of two values, say MALE and FEMALE, all that is required to represent those values is 1 bit.  There is a column lookup table kept internally that keeps track of what the bit means.  As I said before, 1 bit beats the average of 40 bits (5 bytes * 8 bits per byte), so you get 40-to-1 compression versus a VARCHAR(6) holding the values MALE and FEMALE.

Because the page only contains values for one column, you simply have a string of hundreds or thousands of bits that represent the values for the same number of records.  For example, if you have 4,800 GENDER column values on a page, the page shrinks from roughly 24,000 bytes to just 600 bytes.  If you then apply even simple-minded compression, say run-length encoding, you probably shrink it by half again.  In terms of disk I/O, you went from needing 47 disk sectors to only one.
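
If you want to see why even naive run-length encoding helps, here is a minimal sketch (the bit pattern is assumed and deliberately chosen to compress well):

    # 4,800 one-bit GENDER values = 600 bytes uncompressed.
    bits = [1] * 2400 + [0] * 2400

    def rle(seq):
        """Encode a sequence as (value, run_length) pairs."""
        out, run = [], 1
        for prev, cur in zip(seq, seq[1:]):
            if cur == prev:
                run += 1
            else:
                out.append((prev, run))
                run = 1
        out.append((seq[-1], run))
        return out

    print(rle(bits))   # [(1, 2400), (0, 2400)] -- two pairs instead of 600 bytes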

This can be expanded to higher-cardinality domains: 2 bits store 4 values, 3 bits store 8, and so on.
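
The general rule, as a quick Python illustration:

    from math import ceil, log2

    def bits_needed(cardinality):
        """Bits required to encode a domain of n distinct values."""
        return max(1, ceil(log2(cardinality)))

    for n in (2, 4, 8, 50, 1000):
        print(n, "values ->", bits_needed(n), "bits")
    # 2 -> 1, 4 -> 2, 8 -> 3, 50 -> 6, 1000 -> 10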

Regards,
Bill