
MySQL JDBC: Optimizing a query which returns a large amount of data (70k rows)

Irune asked
Last Modified: 2007-11-27
I am using JDBC to query a database for a list of rows, which can be as many as 80k or more in a single ResultSet.

Currently I create a statement, execute the query (I cannot use prepared statements, as there are IN clauses with a different number of parameters in each of the 6 iterations), and then manage the data.

The thing is the query executes very slowly, much slower than in PHP for example, so I know it's not a MySQL problem. We're talking orders of magnitude slower. The Java version I am running it on now is 1.4.2 for Linux, but I can change it and will probably end up running on 5.0, if it's worth running the app at all.

Any way to optimize this?

Mick Barry, Java Developer
CERTIFIED EXPERT
Top Expert 2010

Commented:
> (I cannot use prepared statements, as there are IN clauses with a different number of parameters in each of the 6 iterations)

That doesn't mean you cannot; it just won't save you anything.
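
For reference, a minimal sketch of how a PreparedStatement can be built for an IN list whose size changes on every iteration (the table and column names are the placeholders used elsewhere in this question, and the ids collection is assumed to hold the ID strings):

// One "?" placeholder per value in the set.
StringBuffer sql = new StringBuffer("SELECT id_to FROM table WHERE id_from IN (");
for (int i = 0; i < ids.size(); i++) {
    sql.append(i == 0 ? "?" : ", ?");
}
sql.append(")");
PreparedStatement ps = conn.prepareStatement(sql.toString());
int idx = 1;
for (Iterator it = ids.iterator(); it.hasNext();) {
    ps.setString(idx++, (String) it.next());   // bind each ID
}
ResultSet rs = ps.executeQuery();

Because the number of placeholders changes on every iteration, the driver cannot reuse the compiled statement, which is why this buys no speed here.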

> Any way to optimize this?

try increasing the fetch size

Author

Commented:
When do I use this? Before executing the query? Afterwards?

Author

Commented:
And agreed about prepared statements; that remark was to stop anybody from suggesting them :)
Mick Barry, Java Developer
CERTIFIED EXPERT
Top Expert 2010

Commented:
> When do I use this? Before executing the query? Afterwards?

after creating the statement
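
For illustration, a sketch of the placement (query is the SQL string from the question; 1000 is just an arbitrary starting value to experiment with):

Statement stmt = conn.createStatement();
stmt.setFetchSize(1000);   // after creating the statement, before executing
ResultSet rs = stmt.executeQuery(query);

One MySQL-specific note worth adding: the Connector/J driver of this era buffers the entire result set in memory regardless of this hint; the only fetch size it honours for true row-by-row streaming is Integer.MIN_VALUE on a forward-only, read-only statement:

Statement streaming = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY,
                                           ResultSet.CONCUR_READ_ONLY);
streaming.setFetchSize(Integer.MIN_VALUE);   // Connector/J streaming mode
ResultSet rs2 = streaming.executeQuery(query);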

> And agreed about prepared statements; that remark was to stop anybody from suggesting them :)

then you probably shouldn't have accepted the comment saying you couldn't in the previous question.
EE is not only a place for people to ask questions, but also a knowledge base. Users reading that question may well be misled into thinking they cannot use a PreparedStatement with an IN clause.

Author

Commented:
Ok, tried it and it's going even slower now :(

It's 6 iterations of the same query, and the number of results increases exponentially, so the 5th and 6th iterations are the worst and can return over 70k rows. Each row is composed of a single ID, an integer, so it's small. Right now, using a fetchSize of 50k has slowed the first 4 iterations and I am not seeing any improvement on the 5th and 6th. I am probably doing something wrong here :( Can you give more details on your solution, or do you need more information from my side?

Thanks!

Author

Commented:
Well, the conclusion was that in my specific case I couldn't use a PreparedStatement for performance, which was the point of the original question. If you have a problem with my accepted answers then you should bring it up with a moderator :)
Mick Barry, Java Developer
CERTIFIED EXPERT
Top Expert 2010

Commented:
what database/driver?

Author

Commented:
MySQL using JDBC, as said in the title. MySQL is a standard 4.*, while the JDBC driver is the latest one; I downloaded it recently.


Author

Commented:
>>btw. how many different values are in your IN clause? reason I ask is to consider using multiple queries, one for each value. Just a thought.
One for each ID returned in the previous iteration, so it's not an option I think; 70,000 queries is a bit too much.


Following your example it doesn't show any improvement; it even slows the app with a fetch size of 50,000. What kind of fetch size should I be using to see an improvement? I am not familiar enough with this to decide on a correct fetch size, so that might be the problem?

Mick Barry, Java Developer
CERTIFIED EXPERT
Top Expert 2010

Commented:
I'd start at around 1-2K and experiment from there.
How different is the time from PHP?

keyurkarnik

Commented:
What do you plan to do with the 70K records that are returned?
You cannot display them all at once.
You could write them to a file, but that too does not require having all the records at once.

So try getting the data in smaller batches of about 50-100.

You can do this by using the LIMIT clause (along with its optional OFFSET).

That way you will get your data in batches and you can work happily on that!!

There is another thing that you can do and that is create a temporary view of the result that you need. Then access only that (again using LIMIT and OFFSET), so that your searches will be even faster.


select * from x limit 100;


LIMIT Clause

In MySQL the LIMIT clause takes one or two arguments:

LIMIT [offset,] row_count
LIMIT row_count OFFSET offset

row_count specifies the maximum number of rows to return, while offset specifies the number of rows to skip before starting to return rows.
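
To make that concrete, a sketch of the batching loop on the Java side (using MySQL's LIMIT offset, count form and the placeholder table/column names from this question):

int pageSize = 100;
for (int offset = 0; ; offset += pageSize) {
    Statement st = conn.createStatement();
    ResultSet rs = st.executeQuery(
            "SELECT id_to FROM table LIMIT " + offset + ", " + pageSize);
    int rows = 0;
    while (rs.next()) {
        rows++;
        // process rs.getInt("id_to") here, one small batch at a time
    }
    rs.close();
    st.close();
    if (rows < pageSize) break;   // a short page means no more rows
}

One caveat: MySQL still has to generate and skip the offset rows for every page, so later pages get progressively slower on a big result.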

Author

Commented:
>>how much different is the time from php?
From a few seconds to about half an hour or more, before I just lose patience and kill the program.

>>i'd start at around 1-2K and experiment from there.
What does the fetch size influence? Whether it's bigger or smaller? Like, if you increase it, does it improve memory usage and decrease CPU, or that kind of thing? I know I can test from 1-2k onwards, but I'd hate for a good value to be discarded by chance because other factors slowed the machine during the test :) Besides not having the time :(
keyurkarnik

Commented:
This solution is based on the principle that even if you receive all the data (70K rows), there is not much you can do with all of it at once.
It will also bloat your heap.
So it's better to treat it in smaller chunks; the whole operation will be much faster than before. You can also implement multiple threads and connection pooling for this (see the sketch below), so that you can prefetch the next batch and don't have to wait for the DB read again!!
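
A sketch of what that prefetching could look like (Java 5 concurrency utilities; needs java.sql.*, java.util.* and java.util.concurrent.*; fetchPage, process and the dedicated connection conn2 are hypothetical names, and the paging query reuses the LIMIT approach above):

static Callable fetchPage(final Connection conn, final int offset, final int limit) {
    return new Callable() {
        public Object call() throws Exception {
            List out = new ArrayList();
            Statement st = conn.createStatement();
            ResultSet rs = st.executeQuery(
                    "SELECT id_to FROM table LIMIT " + offset + ", " + limit);
            while (rs.next()) out.add(rs.getString("id_to"));
            rs.close();
            st.close();
            return out;
        }
    };
}

final int PAGE = 100;
ExecutorService pool = Executors.newSingleThreadExecutor();
Future next = pool.submit(fetchPage(conn2, 0, PAGE));          // start page 0 now
for (int offset = 0; ; offset += PAGE) {
    List page = (List) next.get();                             // wait for the prefetched page
    if (page.isEmpty()) break;
    next = pool.submit(fetchPage(conn2, offset + PAGE, PAGE)); // fetch ahead in the background
    process(page);                                             // processing overlaps the next DB read
}
pool.shutdown();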

Author

Commented:
@keyurkarnik
I am using the results all at once to do more stuff with them. I need them all at once (among other things, the results from one query are the input for the next iteration) and I cannot make several queries to retrieve them, as the queries are quite costly. I know running the query isn't an issue for MySQL, because PHP does it in a few seconds; I need to improve how JDBC handles the query.
Mick Barry, Java Developer
CERTIFIED EXPERT
Top Expert 2010

Commented:
It controls the number of rows the driver will grab from the database each time.
Are you currently just reading the rs, or are you also doing some processing which could be slowing it down?

keyurkarnik

Commented:
Read my second post... you can use connection pooling and prefetching to do this. I have suggested it because I have used it in multiple applications and it works just fine.

Author

Commented:
@keyurkarnik Sorry, I cannot use that option :) It's not in my hands in this case, and for our requirements the number of queries launched must be 6 per call to the application, which is the bare minimum.

@objects Just reading it and inserting it into a HashSet. But the time it takes to just do:
rs = stmt.executeQuery(query);
is the problem. The equivalent statement in PHP takes a couple of seconds; in Java it takes a long time (over 20 minutes). Once I get to loop through the rs it goes quite fast. Maybe the fact that we might be working with 0.5MB of data returned from the query has something to do with creating the ResultSet?

Author

Commented:
@keyurkarnik
>>read my second post... you can use connection pooling and prefetching to do this. I have suggested this because I have used this in multiple applications and it works just fine.

Ok, I'll try to explain the problem more accurately. The query goes:

SELECT ids FROM table
WHERE ids_a IN  (a list of variable IDs that come from the query in the loop before)
AND
ids_b NOT IN (a list of all the distinct IDs we have found so far in all the queries)

The lists of IDs are stored in two HashSets and made into strings; that's OK and already done. The application will be receiving lots of requests at the same time (it's on the internet), so it'll have plenty of threads working simultaneously already. Also, I cannot go to the next loop of the query (which is quite costly due to the INs and NOT INs and the 600,000 rows in the table) before having all the results from the current loop. There are 6 loops.

Now, how do I use connection pooling and pre-fetching for this in a way I can convince my boss to let me try? Right now we're trying to make the Java side do more work to remove some load from the database; if I make more queries that by design must look at each row of the table (NOT IN clauses), then I just load the database more, I think.

Also, in PHP it works fine, and the query takes at most 2s to execute against the database, so I think the problem is more in how Java builds the ResultSet than in the number of queries done, but I can obviously be wrong, as otherwise I wouldn't be asking for help :)



elfe69

Commented:
What about the Java heap size? Can you see how much memory your Java process consumes?

If the max Java heap size is close to being reached, the GC can be the bottleneck. In that case, you can try a larger Java heap size using the -Xmx and -Xms java command-line options.
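
For example (MyApp stands in for the real main class; the sizes are illustrative):

java -Xms256m -Xmx512m -verbose:gc MyApp

-verbose:gc prints each collection as it happens, which shows quickly whether GC really is the bottleneck.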

keyurkarnik

Commented:
You need to run the query only once...
This is what you can do:

Create a view holding the result of:
SELECT ids FROM table
WHERE ids_a IN  (a list of variable IDs that come from the query in the loop before)
AND
ids_b NOT IN (a list of all the distinct IDs we have found so far in all the queries)


Then run a SELECT ... LIMIT query on the view...
CEHJ
CERTIFIED EXPERT
Top Expert 2016

Commented:
Is that table being written to in the course of your processing?

elfe69

Commented:
I think posting your query and your table description would help

Author

Commented:
@elfe69 Java heap size is no problem; it uses about 20MB of memory and I have a minimum of 256 allocated and a max of 512.

@keyurkarnik Views don't work in 4.*, so that's not an option. As we're having some trouble with stored procedures in MySQL 5, I don't think we'll be switching to it.

@elfe69 The query:
SELECT ids FROM table
WHERE ids_a IN  (a list of variable IDs that come from the query in the loop before)
AND
ids_b NOT IN (a list of all the distinct IDs we have found so far in all the queries)

That's all the relevant info for the query, all the fields are ints.

Here is the relevant problematic code:
ids and idsNot are HashSets.
makeString is a function of mine that turns a HashSet into a comma-separated string.

ids.add(idParam);
idsNot.add(idParam);
for (int i = 0; i < 6; i++)
{
      String query = "SELECT id_to FROM table" +
                     " WHERE id_from IN (" + makeString(ids) + ")" +
                     " AND id_to NOT IN (" + makeString(idsNot) + ")";
      ids.clear();
      stmt = conn.createStatement();
      // Suggested by objects, but it just slowed things down more
      // (set on the statement, before executing, rather than on the rs):
      // stmt.setFetchSize(50000);
      rs = stmt.executeQuery(query);
      // HERE IS THE BOTTLENECK, BEFORE STARTING THE LOOP TO READ THE RS
      while (rs.next())
      {
            ids.add(rs.getString("id_to"));
      }
      idsNot.addAll(ids);
      rs.close();
      stmt.close();
}

It's about traversing a graph.

elfe69

Commented:
OK, I'll try your code and see.

I understand you have about 600,000 records in your table. Any idea of the number of initial elements in ids and idsNot? And of the size of the result set for each loop?

Mick Barry, Java Developer
CERTIFIED EXPERT
Top Expert 2010

Commented:
> //HERE IS THE BOTTLENECK, BEFORE STARTING THE LOOP TO READ THE RS

that suggests the problem is in processing the query.
Try it without reading the result set at all, to see where the time is spent.

> //                        rs.setFetchSize(50000);

that's a big fetch size; try a smaller value.
CEHJ
CERTIFIED EXPERT
Top Expert 2016

Commented:
It would help to post your table definition
Mick Barry, Java Developer
CERTIFIED EXPERT
Top Expert 2010

Commented:
I'm assuming you have an index on those two columns, though from what you have said it sounds like there is one.

Author

Commented:
@CEHJ
Table:
id_to INT
id_from INT

there's an index on both columns I think.

@elfe69
Initial number of ids: 1
It's looking for all the related nodes in a graph up to a 6th level from a starting node.

@objects
I already tried. The bottleneck is before starting to read the results. That's checked :) It's right after executing the query statement. Before reading it at all.
Will try smaller values then and see if it improves. How much smaller? 2000? 20000? I'd love to know exactly how it influences performance, besides trial and error on my side, because I'll be working against this database for a while :)

The query is all right because it's executed in PHP and works perfectly; the same algorithm in PHP takes just a couple of seconds. In fact, the query executed directly against the database takes less than 1s, MySQL is not stressed at all and answers the query fast, so the indexes and table definition are as right as they can be :) I need your help to figure out a way to improve the executeQuery performance so it takes less than 30 minutes. It's not a MySQL or SQL problem; it's a Java JDBC usage issue I need help with :)


CEHJ
CERTIFIED EXPERT
Top Expert 2016

Commented:
>>
@CEHJ
Table:
id_to INT
id_from INT

there's an index on both columns I think.
>>

Do the data in there form one continuous graph or do they form several?
Mick Barry, Java Developer
CERTIFIED EXPERT
Top Expert 2010

Commented:
> I already tried. The bottleneck is before starting to read the results. That's checked :) It's right after executing the query statement. Before reading it at all.

then it sounds like it's in the processing of the query, not the retrieving of the results

> I'd love to know exactly how it influences performance, besides trial and error on my side

It controls how many rows are requested at a time when retrieving the result set, so you can't avoid a bit of trial and error.
But if the time is taken executing the query then it may not help you.

I'd suggest turning on debugging on your connection and seeing if you can get some clues from that, e.g. the autoGenerateTestcaseScript, explainSlowQueries and logSlowQueries properties.
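
Those are Connector/J connection properties and can be switched on through the JDBC URL. A sketch (the database name is hypothetical, and slowQueryThresholdMillis is an extra property added here for illustration):

String url = "jdbc:mysql://localhost/mydb"
           + "?logSlowQueries=true"               // log queries over the threshold
           + "&explainSlowQueries=true"           // run EXPLAIN on them automatically
           + "&slowQueryThresholdMillis=2000"     // what counts as slow (example value)
           + "&autoGenerateTestcaseScript=true";  // dump a reproducible test case
Connection conn = DriverManager.getConnection(url, user, password);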

Author

Commented:
>>But if the time is taken executing the query then it may not help you.
The time isn't taken executing the query. The time is taken while Java executes the query and puts all the results in the ResultSet, or at most while Java and MySQL are communicating, but MySQL has already finished the query (once the rows are all in the ResultSet it goes pretty fast, and if I execute the query in any other language or directly against MySQL it goes much faster). Will try to find the number that speeds things up, per objects' suggestion.

Thanks all, more later :)

>>Do the data in there form one continuous graph or do they form several?
There are nodes connected to nodes; maybe one node has no neighbours and another node has 500 neighbours. That isn't causing the problem, because the query works perfectly in PHP and directly against MySQL. Unless JDBC has a special option to retrieve this kind of data specifically, of course.
Mick Barry, Java Developer
CERTIFIED EXPERT
Top Expert 2010

Commented:
sounds like it's a driver issue; hopefully the debugging will give you some clues.

Author

Commented:
It's a driver issue :)
The query is done and MySQL has already gone for breakfast minutes before JDBC finishes "obtaining" the data into the ResultSet. It's also not a matter of the "net", because both Java and MySQL are on the same machine, I think.

So far, testing with different setFetchSize values from 1000 to 35000 and 50000 hasn't shown any significant improvement in the case of IDs with a lot of nodes in their neighbourhoods. The best results might be at around 20k, but even that doesn't improve performance to something acceptable.

So, any other idea? I know it has been done before, because at a certain point it worked in Java :)

Mick Barry, Java Developer
CERTIFIED EXPERT
Top Expert 2010

Commented:
didn't you say earlier that it was slow before you even started looping through the result set?
Remove the reading of the result set completely, set the fetch size to 1, and compare the performance as a test.
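
A sketch of that measurement, separating the executeQuery time from the pure fetch time:

long t0 = System.currentTimeMillis();
ResultSet rs = stmt.executeQuery(query);
long t1 = System.currentTimeMillis();   // time spent inside executeQuery alone
int n = 0;
while (rs.next()) n++;                  // drain the rows without any processing
long t2 = System.currentTimeMillis();
System.out.println("executeQuery: " + (t1 - t0) + " ms, fetching " + n
        + " rows: " + (t2 - t1) + " ms");

With a driver that buffers the whole result, nearly all the time shows up in the first interval; with a streaming driver it moves into the fetch loop.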

Author

Commented:
When Java does a query through JDBC, it first sends the query to MySQL and then puts the returned results into a ResultSet. All of this is done by the driver, and I have no idea how it does it. This part goes slow. The MySQL part (purely the query) isn't slow, because when I do it using PHP it goes orders of magnitude faster (2 seconds vs 20 minutes), and I can check that MySQL has finished its work minutes before the statement that returns the ResultSet finishes.

Afterwards I read the ResultSet (or not), which goes pretty fast, especially compared to the 10 minutes it took to do the first part.
If I end the program here:
rs = stmt.executeQuery(query);
it goes exactly as slow as when I read the loop afterwards, because I have been measuring the time it takes to do rs = stmt.executeQuery(query); all along (and ignoring how long it takes to loop through the ResultSet, as that didn't give me performance problems and wasn't the issue).

I need options (configuration options like this setFetchSize thing) that I could use to make the creation of the ResultSet faster, so I could retrieve from 1 to 70,000 records quickly. Maybe options when creating the statement, or setters like setFetchSize? I know the query is all right and I know the database is all right.

Hope this clears it up; I am probably not explaining myself correctly :(


Author

Commented:
@elfe69
You are my hero!!! It was that; I never thought about it being the problem, as I was so pissed at the JDBC ;)
Thanks to everybody else who helped, and sorry for misleading you into thinking it was JDBC :( I am applying objects' solution as well for the bit of extra performance, so the points get split too :)
Mick Barry, Java Developer
CERTIFIED EXPERT
Top Expert 2010

Commented:
I suggested using that in your previous question ;)