Solved

SQL pagination optimization in Java?

Posted on 2011-03-03
1,580 Views
Last Modified: 2012-05-11
Hello to all experts and laymen alike!

My question deals with optimization of pagination using Java.

We are trying to write a Java database comparison tool that compares very large query results in two databases for regression testing purposes. Query results are often 20 million records large!

Right now, we are working with MySQL with migration to Oracle in mind.
We have tried the LIMIT/OFFSET approach built into MySQL in its simplest syntactic form, but we have run into performance issues when fetching the pages beyond the first million records.

What we have done instead is implement a chunked approach, using specific WHERE clauses that bound the actual query results based on an auto-incremented column that we know is there.
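
Concretely, the chunked query we run looks roughly like this (a sketch only: the table and column names are placeholders for our real schema, and conn is an open JDBC connection):

import java.sql.*;

// Keyset ("seek") pagination sketch: page by bounding the auto-incremented
// column instead of using OFFSET, so each page is a cheap index range scan.
public static void comparePages(Connection conn) throws SQLException {
    PreparedStatement ps = conn.prepareStatement(
            "SELECT * FROM my_table WHERE id > ? ORDER BY id LIMIT ?");
    long lastId = 0;                     // highest id seen so far
    final int pageSize = 10000;
    while (true) {
        ps.setLong(1, lastId);
        ps.setInt(2, pageSize);
        ResultSet rs = ps.executeQuery();
        int rows = 0;
        while (rs.next()) {
            lastId = rs.getLong("id");   // remember position for the next page
            rows++;
            // ... compare this row against the other database here ...
        }
        rs.close();
        if (rows < pageSize) break;      // a short page means we are done
    }
    ps.close();
}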

The problem is that now we need to compare generic databases and queries without being certain of their structure or contents.
We have also looked into JPA and Hibernate and would love to hear whether that could serve us here.

THANK YOU SO MUCH!
Question by:eranhazout
22 Comments
 
LVL 3

Expert Comment

by:mwiercin
ID: 35027115
I can't see a simple solution here: unless you can retrieve records in a guaranteed order, you won't be able to make this perform well. Your last resort is to parse the table structure to determine sortable fields, indexes, and auto-increment columns (in MySQL, SHOW INDEXES will help; watch the cardinality with InnoDB, though, as it is not a reliable value but a dynamic estimate) and try to determine the best way to page through the table. It won't cover every scenario, though. Keep in mind that SELECT * FROM table LIMIT xxxx does not give you *any* guarantee about which rows you will get (and that is true in most databases; you may get the same results most of the time, but it is not guaranteed).

If you cannot discover the database structure, or cannot use it for your task, then you are facing the need to retrieve everything, with memory as the bottleneck.
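
For the generic case, standard JDBC metadata will at least reveal the primary key and the indexed columns without any vendor-specific SQL. A rough sketch, assuming the driver populates its metadata (conn is an open connection and the table name is a placeholder):

import java.sql.*;

// Sketch: discover primary-key and indexed columns through standard JDBC
// metadata, so the lookup is not tied to MySQL-specific statements.
public static void describeTable(Connection conn, String table) throws SQLException {
    DatabaseMetaData meta = conn.getMetaData();

    // Primary-key columns, one result row per column
    ResultSet pk = meta.getPrimaryKeys(null, null, table);
    while (pk.next()) {
        System.out.println("PK column: " + pk.getString("COLUMN_NAME"));
    }
    pk.close();

    // Index details; uniqueness and cardinality help pick a driving column
    ResultSet idx = meta.getIndexInfo(null, null, table, false, true);
    while (idx.next()) {
        String col = idx.getString("COLUMN_NAME");
        if (col == null) continue;       // table-statistics rows carry no column
        System.out.println("Indexed: " + col
                + ", unique=" + !idx.getBoolean("NON_UNIQUE")
                + ", cardinality=" + idx.getLong("CARDINALITY"));
    }
    idx.close();
}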
 
LVL 47

Expert Comment

by:for_yan
ID: 35029807
Isn't it possible to use some meaningful parameter with a reasonable spread: create an index on it, first run some analysis to determine intervals that give you reasonably sized chunks, and then fetch the records ordered by that parameter? You could use this parameter for splitting, and then perhaps a combination of fields for ordering inside the chunks. Just thinking out loud; I'm probably not quite understanding your problem.
 

Author Comment

by:eranhazout
ID: 35043747
@mwiercin: Hi, thank you for your answer. I have tested the order of the results received and have not yet encountered a different ordering between runs. I do realize that, since query results are conceptually sets, there might be cases of a mismatch in order, but it has not happened yet. I would love to hear more on this.

@yan: To find and/or create a meaningful parameter on a specified query result, wouldn't I first have to perform that query and then, in order to run the analysis, assign the results to a ResultSet? If so (and at the moment I don't see any other way...), I am still left with the memory problem.

Thanks guys, I would love to hear more, and would really appreciate it if a Java expert would join in on the fun...
 
LVL 47

Expert Comment

by:for_yan
ID: 35043779
No, you don't need to retrieve the whole data set to perform the analysis - you can do it without pulling the data out, using the database's own facilities: finding max and min values, and even more sophisticated statistics. Of course it depends on the kind of database you are using, but most of them support many such operations internally, and they are of course much more efficient than reading in the whole ResultSet.
 
LVL 3

Expert Comment

by:mwiercin
ID: 35043788
@eranhazout

In 99.99% of cases you will get repeatable results, but ordering can easily change during maintenance operations, e.g. when data is realigned on disk.

Are you allowed to alter the structure of the data, to assign a single sequential id to the records for the duration of the comparison?


 

Author Comment

by:eranhazout
ID: 35043957
@mwiercin
If I understand correctly, first I need to create a temporary table from the query and then add the ID column?
This would speed things up, since I will not be reading so many results into memory, and I can use a WHERE clause on the added ID to dramatically increase performance...
How would I add this ID - with a simple ALTER statement?
Please excuse my DB newbiness...

@for yan
Max and min values of which column? I have no knowledge of the DB structure. Don't I first have to perform the query, insert it into a temp table like I wrote before, and then analyse it? Can I find the primary key or an auto-incremented field? How would you suggest I do this? Please bear with my unfamiliarity with this field...
 
LVL 3

Expert Comment

by:mwiercin
ID: 35043994
In MySQL you can easily do:

alter table mytable add column id int not null auto_increment unique key



You can have only one auto_increment column per table (shouldn't make a difference in your case, as you only need one).
 
LVL 47

Expert Comment

by:for_yan
ID: 35044035
Do you have any business knowledge of the data in these tables? I mean, if you know you are dealing with, say, data about people, you may guess that for certain applications you can separate them by date of birth, and each year will hold much less data - those become your chunks to compare. Then within the chunks you would order by last name, first name, zip code, and so on. These kinds of natural chunks would be something to consider. If you have no idea what the data is about, then I agree it will probably be more difficult. Still, you can look at the max and min of columns and analyze their distributions, all with in-database SQL queries, even before you get to programming in Java.
 
LVL 47

Expert Comment

by:for_yan
ID: 35044091
On the other hand, by the way, you should not think that when you select everything from some table in a Java program, all that data descends on your program and you immediately have a memory issue. Most reasonable database drivers do not operate this way, so if you design a good ordering of your rows, perhaps based on multiple columns, you may do the comparison directly - it does not mean you will have to keep all the rows of both tables in memory. In general, if you plan to transition to Oracle and could wait until you have it in Oracle, it would probably be easier to accomplish, as Oracle and its drivers are designed to deal with millions of rows.
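
For example, the fetch size is the knob that controls this. The sketch below uses the MySQL Connector/J convention for row-by-row streaming (a forward-only, read-only statement with a fetch size of Integer.MIN_VALUE); the Oracle thin driver instead honors an ordinary positive fetch size. The table name is a placeholder:

import java.sql.*;

// Sketch: let the driver stream rows instead of buffering the whole result.
public static void streamAll(Connection conn) throws SQLException {
    Statement stmt = conn.createStatement(
            ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
    stmt.setFetchSize(Integer.MIN_VALUE);   // MySQL Connector/J streaming hint
    // stmt.setFetchSize(1000);             // typical chunked fetch, e.g. Oracle

    ResultSet rs = stmt.executeQuery("SELECT * FROM my_table ORDER BY id");
    while (rs.next()) {
        // process one row at a time; only the current chunk sits in memory
    }
    rs.close();
    stmt.close();
}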
 

Author Comment

by:eranhazout
ID: 35046095
@mwiercin
What if there is already an auto-incremented field? How can I find it using MySQL syntax?

@for yan
1. We are trying to make it general, in the sense that we do not know the structure of the tables.
I'd love to find out how I can ask questions in SQL to get information like: is there an auto-incremented column? Which one is it? What is the key? And so on.
2. I did not quite understand what you meant in your second reply. Do you mean the database drivers take care of pagination for you? That would mean that when you use resultSet = statement.executeQuery("some query"), the results are fetched in chunks or pages in a way that is transparent to the developer?
Further explanations would be very appreciated!
 
LVL 3

Expert Comment

by:mwiercin
ID: 35046128
If you already have an auto_increment field, then in my opinion you can use it directly for your task; it is guaranteed to be NOT NULL. You can get the table definition by running:

SHOW CREATE TABLE mytable; 



or:

DESC mytable; 



The first is easier to read, and the latter is easier to use from an application, as it breaks the fields down into separate structured records.

If you don't have an AUTO_INCREMENT field, you can still search for NOT NULL fields that have a key on them (ideally a UNIQUE one). Run:

SHOW INDEXES FROM mytable; 



The Cardinality field describes how selective an index is; higher is better. A cardinality of 1 means the index is almost useless for you. I reckon you can write something that reads the table structure and tries to find the most selective fields to use; AUTO_INCREMENT columns, or NOT NULL integers with a UNIQUE key on them, would be perfect. I would look in this order (a rough sketch of such a lookup follows at the end of this comment):

1. Primary key
2. NOT NULL field with a UNIQUE index on it; if you have several, choose the one with the higher cardinality
3. NOT NULL field with a regular index on it, same as above

If none of these exist, you can try looking for composite keys, analyzing the cardinality of non-indexed fields yourself (you may end up with a quick win by adding a single index), and so on.

One other point - if you're using an InnoDB table and it doesn't have a defined primary key, simply add one. It won't cost you anything if you make it an integer. That's because if there is none, InnoDB adds a hidden one itself (it needs it for ordering the data in the clustered index), which you cannot access yourself.

Btw, if the table is InnoDB, the cardinality metric is not an exact science; it's just a rough estimate (run it a few times and you will see it change on almost every execution), but it's pretty accurate.
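
To make the heuristic concrete, here is a rough sketch of such a lookup from Java, parsing SHOW INDEXES output (a starting point only: composite indexes are reduced to their first column, and the scoring is deliberately crude):

import java.sql.*;

// Sketch: pick a driving column from SHOW INDEXES output, preferring the
// primary key, then unique indexes, then the highest cardinality.
public static String pickDrivingColumn(Connection conn, String table) throws SQLException {
    Statement stmt = conn.createStatement();
    ResultSet rs = stmt.executeQuery("SHOW INDEXES FROM " + table);
    String best = null;
    long bestScore = -1;
    while (rs.next()) {
        if (rs.getInt("Seq_in_index") > 1) continue;   // skip composite-index tails
        boolean primary = "PRIMARY".equals(rs.getString("Key_name"));
        boolean unique  = rs.getInt("Non_unique") == 0;
        long score = (primary ? 2000000000L : 0)       // crude preference ranking
                   + (unique  ? 1000000000L : 0)
                   + rs.getLong("Cardinality");
        if (score > bestScore) {
            bestScore = score;
            best = rs.getString("Column_name");
        }
    }
    rs.close();
    stmt.close();
    return best;   // null if the table has no indexes at all
}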



 
LVL 47

Expert Comment

by:for_yan
ID: 35046166
Well, I can't claim this about all databases and all drivers. My experience is mostly with the Oracle database and the thin JDBC driver, which I have been using for more than 10 years. I have read millions of rows in one query and never even needed to increase the memory heap with the java command. I have always understood that the driver takes care of pulling the data in chunks as you traverse your result set. That's why these traversal methods are normally one-way: once you have processed a particular row, you cannot go back. I guess this is the most common way of dealing with databases. Have you ever used, say, TOAD, which is a great tool for working with SQL (again, I use it on top of Oracle)? No matter how big the table, you can always say SELECT * FROM sometable; I have never seen it complain afterwards that it does not have enough memory, and it immediately shows you some of the top rows. I'm sure it cannot be retrieving the whole table in one great load - most real tables would overwhelm any computer's memory that way.
 
LVL 47

Expert Comment

by:for_yan
ID: 35046208
This is the SQL query you can use to determine the primary key in Oracle:
http://www.techonthenet.com/oracle/questions/find_pkeys.php
 
LVL 3

Expert Comment

by:mwiercin
ID: 35048270
@for_yan

I agree the database driver can take care of retrieving the data as you process it. But the fact is that some data/metadata about the result set has to be stored *somewhere*, whether on the client or the server. The result has to be consistent: at the time you receive the last bit of it, it needs to be the same as it was when the query started. Imagine that the last row of a 20 GB table is deleted after you run the query but before your processing script gets to the end - that row still has to be delivered to the client. This doesn't mean a copy of the result set has to be held in full in memory; it is taken care of by the MVCC engine of the database, which implies an overhead at that layer.

The author's point was to retrieve these records efficiently, which LIMIT/OFFSET certainly is not: if you do LIMIT 100 with an offset of a million, the server still processes a million one hundred rows, so the total work across all pages grows quadratically with the table size. Whether he processes it in one go with help from the driver, or in repeated reads, reading a 20 GB table in full is not a cheap operation, and that needs to be highlighted. There will be a cost of doing it; the goal is to minimize it :)
 
LVL 47

Expert Comment

by:for_yan
ID: 35048358
@mwiercin:

I agree; I was not advocating comparing two tables of 20 million records in one shot. I was just explaining that it doesn't all go into memory at once.

Actually, what I think is reasonable is to devise an effective ordering of the records and then keep track of the last record compared (say, after every chunk of records), either in a separate table or in a file, so that if the process stops for some unexpected reason, you can restart the comparison by adding that position as the starting point in the next query. With such a precaution you can probably indeed embark on a big comparison operation without splitting the data.



 
 

Author Comment

by:eranhazout
ID: 35067083
Thanks for the detailed comment. Now pardon my newb questions:
@mwiercin

1. How do I add the primary key efficiently? I can use DESC to see whether an auto-increment key exists, but if not, adding one is what I need to do so that the ordering and pagination will work.

2. Is Oracle InnoDB?


@for_yan
I understand your idea of keeping track of the last successful comparison.
So you are saying I should still use pagination, despite the fact that Oracle deals with it for you?
Please explain: did you mean I should perform the query and then use next() and getObject(), letting Oracle/MySQL deal with memory, or should I implement pagination using WHERE on an auto-incremented primary key?

Do you guys think I should order the records myself, or leave it up to the user defining the query for comparison?
 
LVL 47

Assisted Solution

by:for_yan
for_yan earned 333 total points
ID: 35067222
I would use some natural ordering, based on the most discriminating combination of fields available, and use a WHERE clause requiring that combination to be higher than some value. At the beginning of the comparison I would use the lowest possible value of this (possibly composite) key in the WHERE clause. Then I would start the job, which aims to compare the whole tables, but every time it finishes a chunk of comparisons I would store the current value of the ordering field externally (relative to the program), so that in case of an unexpected interruption you can start the next run from the latest key value reached in the previous one. Maybe you'll be lucky and compare the whole tables in one shot, but if not, the previous job will not be a total waste of time.
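
A minimal sketch of that checkpointed loop, assuming for simplicity a single numeric ordering column named id and a plain file as the checkpoint store (both are placeholders):

import java.io.*;
import java.sql.*;

// Sketch: resumable comparison. The last compared key value is persisted to a
// file, so an interrupted run restarts where it stopped instead of at row one.
public static void compareFrom(Connection dbA, Connection dbB, File checkpoint) throws Exception {
    long lastKey = Long.MIN_VALUE;
    if (checkpoint.exists()) {
        BufferedReader in = new BufferedReader(new FileReader(checkpoint));
        lastKey = Long.parseLong(in.readLine().trim());
        in.close();
    }

    String sql = "SELECT * FROM my_table WHERE id > ? ORDER BY id";
    PreparedStatement psA = dbA.prepareStatement(sql);
    PreparedStatement psB = dbB.prepareStatement(sql);
    psA.setLong(1, lastKey);
    psB.setLong(1, lastKey);
    ResultSet rsA = psA.executeQuery();
    ResultSet rsB = psB.executeQuery();

    long rows = 0;
    while (rsA.next() & rsB.next()) {        // non-short-circuit: advance both sides
        // ... compare the two rows column by column, report mismatches ...
        if (++rows % 10000 == 0) {           // checkpoint every 10,000 rows
            PrintWriter out = new PrintWriter(new FileWriter(checkpoint));
            out.println(rsA.getLong("id"));
            out.close();
        }
    }
    // a leftover row on either side is itself a difference worth reporting
}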
 

Author Comment

by:eranhazout
ID: 35068026
@for_yan
OK, that would make my comparison program much more resilient. I am using log4j to write a log file, so I can store the row number or id there.
But maybe there is something I don't yet understand... If you would start a job that compares the whole table, then what is the use of the WHERE clause? My intention was to use this clause on a possibly composite key in order to fetch pages quickly and separately (e.g. WHERE id > 0 AND id <= 10000, or something like that) and then compare each chunk. If memory is not an issue, then all I have to do is use DESCRIBE, find the key or auto-incremented id, and then use ORDER BY in my query to make sure the ordering of the records will be the same.
Is there anything in your answer that I missed?
Again, thank you guys for staying with me here...
 
LVL 3

Accepted Solution

by:mwiercin
mwiercin earned 167 total points
ID: 35086361
1.
alter table my_table add column id int not null primary key auto_increment


2. InnoDB is a MySQL storage engine, so no (both InnoDB and MySQL are owned by Oracle, though ;)).
 
LVL 47

Assisted Solution

by:for_yan
for_yan earned 333 total points
ID: 35086536
You are right: at the very beginning the WHERE clause may not be necessary - it is there for generality. If the whole task works in one shot, you can get away without a WHERE clause at all, but if you stop somewhere in the middle, the WHERE clause becomes necessary so that you don't start over from the very beginning after each interruption. I suggest keeping the WHERE clause in the format of the query, but in the first run just using a value lower than anything else. If there is a chance the whole thing will work in one shot, you don't need to bother with the WHERE clause at all.
 

Author Comment

by:eranhazout
ID: 35088766
Thank you very much guys, I will use your notes in my program.

If you don't mind, there is one thing in MySQL i'd like to ask:

I changed a key in a person table to an auto-incremented key, and saved the old key in a new column called old_key. Now I want to go to all the tables that contain this old key and change it to the new one, using the old key for reference.
I tried the following, and it is still running (very slowly); the example is for a person_hobbies table:
UPDATE person, person_hobbies
SET person_hobbies.key = person.key
WHERE person_hobbies.key = person.old_key



Again, this is taking a while...
I am using MySQL 5.1 at the moment, and we will move to Oracle soon.
Is this the right syntax? Is there a better way?
Thanks!!
 
LVL 3

Expert Comment

by:mwiercin
ID: 35111676
You should have indexes on both person_hobbies.key and person.old_key.

So:

alter table person_hobbies add key (`key`);
alter table person add key (`old_key`);


