Solved

Memory doubt in java

Posted on 2012-03-18
50
192 Views
Last Modified: 2012-03-30
Hi Experts...

I have a table with around 67 columns. I retrieve records from this table based on a certain condition, and around 5 million records are returned.

Of these 5 million records, only 15 columns currently have data in them.

I have to read each and every record and, based on several computations and conditions involving the 15 populated columns, fill in the remaining column values for that record.

I am loading data for several students, and each student has several thousand data records. My doubt is: do I fetch all 5 million rows at once, creating that many objects (collections, say) and update the remaining columns for each of these rows in one stretch, or do I fetch the rows for each student and process them the same way one student at a time?

If I choose to do all 5 million rows at a stretch, how do I ensure there is no memory shortage?
Can anyone suggest how this should be done and which is the better method?
Please help...
Question by:gaugeta
50 Comments
 

Expert Comment

by:AllThingsJava
ID: 37735004
If you are using JPA, you can fetch a subset of results at a time and page through until you are done:

javax.persistence.Query#setFirstResult(0)
javax.persistence.Query#setMaxResults(10)
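A rough sketch of how that paging loop might be wired up. The entity name Record, the page size, and the EntityManager em are assumptions for illustration; the JPA calls are left as comments because they need a persistence context, while the paging arithmetic is plain Java:

```java
// Sketch: page through a large JPA result set. Entity/field names here
// are hypothetical; only the paging arithmetic is live code.
public class PagedFetch {

    // Number of pages needed to cover `total` rows at `pageSize` rows per page.
    static int pageCount(long total, int pageSize) {
        return (int) ((total + pageSize - 1) / pageSize);
    }

    // The JPA loop itself would look roughly like this (needs a real
    // EntityManager, so it is left as a comment):
    //
    //   Query q = em.createQuery("SELECT r FROM Record r WHERE ...");
    //   int pageSize = 10_000;
    //   for (int page = 0; page < pageCount(5_000_000, pageSize); page++) {
    //       q.setFirstResult(page * pageSize);
    //       q.setMaxResults(pageSize);
    //       List<?> chunk = q.getResultList();
    //       // process chunk; let it go out of scope so the GC can reclaim it
    //   }
}
```

Each page is processed and discarded before the next is fetched, so only one page's worth of entities is live at a time.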
0
 
LVL 47

Expert Comment

by:for_yan
ID: 37735082
Normally JDBC drivers are written in such a way that they don't fetch the whole ResultSet into memory. When you traverse the ResultSet with the usual while(rs.next()){...} loop, you go through the result set sequentially and the JDBC driver will not load your entire table into memory.
If all your operations are row-specific (you don't need information about another row while you are processing some particular row), then if you traverse your result set and process your rows one by one inside the traversal loop, the JDBC driver will take care of your memory issues - you don't need to do anything special.
I have had experience with Oracle processing really big tables sequentially in one query
and did not have any memory issues.
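To illustrate, here is a minimal JDBC sketch of that sequential traversal. The table and column names are made up, and setFetchSize is only a hint whose effect varies by driver:

```java
import java.sql.*;

// Sketch: traverse a huge ResultSet row by row without buffering it all.
// Table/column names are hypothetical.
public class StreamingRead {

    // Report progress every `every` rows while walking millions of rows.
    static boolean shouldLog(long rowsDone, long every) {
        return rowsDone > 0 && rowsDone % every == 0;
    }

    static void traverse(Connection conn) throws SQLException {
        try (Statement stmt = conn.createStatement()) {
            // Hint the driver to stream rows in chunks rather than load
            // the whole result client-side (behaviour is driver-specific).
            stmt.setFetchSize(1000);
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT mykey, field1 FROM my_table")) {
                long rows = 0;
                while (rs.next()) {
                    int key = rs.getInt(1);      // process one row at a time;
                    String f1 = rs.getString(2); // per-row objects are freed by GC
                    rows++;
                    if (shouldLog(rows, 100_000)) {
                        System.out.println(rows + " rows processed");
                    }
                }
            }
        }
    }
}
```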
0
 
LVL 47

Expert Comment

by:for_yan
ID: 37735523
So what I would do: I would first try it in the most straightforward way.
I would create up front a PreparedStatement which will be used to update the row, like

PreparedStatement pst = connection.prepareStatement("update my_table set field101=?, field102=?, ... where mykey=?");

then I'd have a big select statement:

Statement stmt = connection.createStatement();

ResultSet rs = stmt.executeQuery("select mykey, field1, field2, ... from my_table");

while (rs.next()) {

int key = rs.getInt(1);
String field1 = rs.getString(2);
...

// then here you do all your calculations - create all the objects you want
// in here; for the next row they will of course reuse the same memory space,
// so they will not multiply

pst.setString(1, field101);
...

pst.setInt([some_number], key);


pst.execute();

}

In terms of memory, JDBC should handle this - it should not pull your whole table into memory.
That's why a ResultSet is a one-way street - you cannot go back to the previous row - an inconvenience which actually serves the purpose of limiting memory use when you deal with huge tables.

You can then think about batch update options and do the updates in chunks of records and experiment with it (again, the chunk size should not be comparable with the size of the table) - it may turn out to be faster with batches (though probably still no guarantee that it would be), but if I were you, I'd start with the very straightforward way I showed above.
First see if it works and how slow it is in this very straightforward way.
Then make a decision based on that experience.
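For the batch variant mentioned at the end, a hedged sketch (same hypothetical table and column names as above; the batch size of 100 is just a starting point to experiment with):

```java
import java.sql.*;

// Sketch: read rows, compute, and queue updates in batches.
public class BatchUpdate {
    static final int BATCH_SIZE = 100;

    // True once enough statements are queued to be worth a round trip.
    static boolean shouldFlush(int queued) {
        return queued >= BATCH_SIZE;
    }

    // Placeholder for the real per-row calculation.
    static String compute(String field1) {
        return field1;
    }

    static void updateAll(Connection conn) throws SQLException {
        PreparedStatement pst = conn.prepareStatement(
                "UPDATE my_table SET field101 = ? WHERE mykey = ?");
        Statement sel = conn.createStatement();
        ResultSet rs = sel.executeQuery("SELECT mykey, field1 FROM my_table");
        int queued = 0;
        while (rs.next()) {
            pst.setString(1, compute(rs.getString(2)));
            pst.setInt(2, rs.getInt(1));
            pst.addBatch();
            if (shouldFlush(++queued)) {
                pst.executeBatch();
                queued = 0;
            }
        }
        if (queued > 0) pst.executeBatch(); // flush the remainder
    }
}
```

Batching trades per-statement round trips for fewer, larger ones; whether it helps here still has to be measured.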
0
 
LVL 35

Expert Comment

by:mccarl
ID: 37735789
@gaugeta,

The method proposed by for_yan above is valid, and the way to go if you HAVE to do these computations in Java. However, I would first investigate the possibility of doing this inside the database. For example, the following example SQL...

update my_table set column_43 = column_1 + column_13, column_55 = column_7 - column_8 where column_15 = 5

would be orders of magnitude faster than selecting out those columns, doing the calculation in Java and then running the update back to the DB. It all depends on exactly what those calculations and conditions are (and possibly what DB you are using).
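If the calculations can be expressed that way, the whole thing can be issued from Java as a single statement. Column names are taken from the example above; the filter value is a hypothetical bind parameter:

```java
import java.sql.*;

// Sketch: one set-based UPDATE replaces millions of per-row round trips.
public class SetBasedUpdate {

    // The single UPDATE, with the filter value left as a bind parameter.
    static String sql() {
        return "UPDATE my_table SET column_43 = column_1 + column_13, "
             + "column_55 = column_7 - column_8 WHERE column_15 = ?";
    }

    static int run(Connection conn, int filterValue) throws SQLException {
        try (PreparedStatement pst = conn.prepareStatement(sql())) {
            pst.setInt(1, filterValue);
            return pst.executeUpdate(); // one round trip updates all matching rows
        }
    }
}
```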
0
 
LVL 47

Expert Comment

by:for_yan
ID: 37735801
I guess we already went through that part with gaugeta several times and tried to convince them to do as much as possible on the back end, either with plain update queries or, if necessary, with something more sophisticated like a database stored procedure, say PL/SQL, as those languages are usually pretty powerful. But obviously they still prefer to do it through Java.
0
 
LVL 35

Expert Comment

by:mccarl
ID: 37735813
Ahh, so this question has a history! Well, @gaugeta, you can count another vote from someone who recommends you do this on the DB side!!
0
 

Author Comment

by:gaugeta
ID: 37736079
@for_yan & @mccarl: Thanks a lot for your replies.
I will try your suggestions and try to do the processing on the DB side if possible.
I will get back if I face any issues.
0
 

Author Comment

by:gaugeta
ID: 37769291
@for_yan: I tried the loading procedure in the following ways...
1) I tried to update the columns one by one as you had suggested, updating the table then and there after every record is read, and found that I could update about two records in one second.

2) I tried the batch update, where I can only batch 100 records at a time (the column count being 52). Updating these 100 records takes about 23-25 seconds. Considering I have to update 5 million records, this is taking a very long time.

Here I read the relevant records one by one and add each to the batch statement, and as soon as there are a hundred batched updates I execute the batch.

In what way can this scenario be improved?
I need at least a few hundred records updated every second to meet the maximum time.
Please help...
0
 
LVL 35

Expert Comment

by:mccarl
ID: 37769301
> In what way can this scenario be improved?

Do it in the update statement on the DB side!!!! What calculations are you doing that you don't think can be done on the DB side?
0
 
LVL 47

Expert Comment

by:for_yan
ID: 37769303
Make sure that your tables are not indexed when you update them - that may slow down your update.
0
 
LVL 47

Expert Comment

by:for_yan
ID: 37769309
I have written to you many times about doing everything on the DB side - that is the only thing which can potentially give you really significant savings in time - several times faster, or sometimes even an order of magnitude. No threads or anything else will give you the kind of saving you want.
Still, make sure you drop all indexes before you do the updates.
0
 
LVL 47

Expert Comment

by:for_yan
ID: 37769325
If you are not comfortable enough with your DB server-side language to formulate your calculations on the server side, then post a question on EE with the appropriate information, specifying in general terms what you need - there is a chance some folks could give you good advice. I know that the Oracle PL/SQL zone is very lively, and I got a response to a question of this kind in a matter of minutes. If timing is so critical, this is the way to go.
0
 
LVL 47

Expert Comment

by:for_yan
ID: 37769345
In general these are not exactly Java programming language questions - this is a database-specific question - there are usually some options for bulk updates which are really faster than doing everything through JDBC - you need to investigate that area.
Are you using Sybase?
0
 
LVL 47

Expert Comment

by:for_yan
ID: 37769355
0
 

Author Comment

by:gaugeta
ID: 37769361
@for_yan & @mccarl: Thanks for the reply.
Yes, I have a stored procedure where I batch up to a hundred records.
The SQL is not embedded in the Java code.
The only thing is that I compute the values required for the update in the Java code and pass them to the stored procedure.

@for_yan
>> Make sure that your tables are not indexed when you update them - that may slow down your update.
I did not quite understand this. Can you explain how indexes affect the speed in this case, and how dropping them will gain speed?

@mccarl >> What calculations are you doing that you don't think can be done on the DB side?
Currently there are 3 operations done on the Java side before passing to the stored procedure:
1) I gather 2 maps, each about 52 in size, take one of them, divide all of its values by 1000, and save them for later use.
2) I then do another computation which calculates the percentages for all the 52 pre-computed values.
3) Then there are some ordinary mathematical operations to be done before the update.
How is the support for all these operations on the DB side in Sybase? What sort of data structures can be used if this is done on the DB side?

Please help...
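For reference, steps 1 and 2 above amount to something like the following in Java (map keys and sizes are placeholders):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the per-student calculations described above.
public class RowCalc {

    // Step 1: divide every value in the map by 1000.
    static Map<String, Double> scale(Map<String, Double> raw) {
        Map<String, Double> out = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : raw.entrySet()) {
            out.put(e.getKey(), e.getValue() / 1000.0);
        }
        return out;
    }

    // Step 2: each value as a percentage of the map's total.
    static Map<String, Double> percentages(Map<String, Double> vals) {
        double total = 0;
        for (double v : vals.values()) total += v;
        Map<String, Double> out = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : vals.entrySet()) {
            out.put(e.getKey(), 100.0 * e.getValue() / total);
        }
        return out;
    }
}
```

Both steps are just division and a running total, which is one reason the experts in this thread keep pointing to the DB side: the same arithmetic can be expressed with plain SQL expressions and a SUM aggregate.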
0
 

Author Comment

by:gaugeta
ID: 37769366
@for_yan:Yes I'm using sybase.
0
 
LVL 47

Expert Comment

by:for_yan
ID: 37769369
I don't know about Sybase, but I assume it is the same as in Oracle.
If the table is indexed, then all insert operations will also have to update the indexes - and that will slow down the process. If you are updating rows and the fields you are changing are not indexed, then it probably won't have that big an effect.
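If experimenting with this, the drop/recreate cycle could be driven from Java roughly like so. The index and column names are hypothetical, and the DROP INDEX syntax shown is Sybase-style - check your server's dialect:

```java
import java.sql.*;

// Sketch: drop an index before a mass update, recreate it afterwards.
// Index/table names and the exact DDL syntax are assumptions.
public class IndexToggle {

    static String[] plan() {
        return new String[] {
            "DROP INDEX my_table.ix_my_table_field1",
            "-- ... run the bulk UPDATE here ...",
            "CREATE INDEX ix_my_table_field1 ON my_table (field1)"
        };
    }

    static void run(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement()) {
            for (String step : plan()) {
                if (!step.startsWith("--")) st.executeUpdate(step);
            }
        }
    }
}
```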
0
 
LVL 47

Expert Comment

by:for_yan
ID: 37769372
Are all these updates happening within the row - are the calculations dependent only on values within the same row, or are you doing something across rows?
0
 
LVL 47

Expert Comment

by:for_yan
ID: 37769376
Do you know about the bcp utility?

Maybe it would indeed be faster to dump your table to a file, simultaneously calculating the required new columns, and then rebuild the table from the file using bcp.
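One way this could look from the Java side: write each selected row plus its calculated columns as a delimited line, then load the file with bcp in character mode. The delimiter and column order are assumptions that would have to match the bcp invocation:

```java
import java.io.IOException;
import java.io.Writer;

// Sketch: produce a bcp-loadable dump while computing derived columns.
public class BcpDump {

    // One tab-separated line per row (delimiter is an assumption).
    static String line(int key, double raw, double derived) {
        return key + "\t" + raw + "\t" + derived;
    }

    static void dumpRow(Writer out, int key, double raw) throws IOException {
        double derived = raw / 1000.0; // example calculation from this thread
        out.write(line(key, raw, derived) + "\n");
    }
}
```

The resulting file could then be loaded with something like `bcp my_db..my_table_new in dump.txt -c` (exact flags depend on the Sybase setup), after which the new table replaces the old one.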
0
 

Author Comment

by:gaugeta
ID: 37769381
@for_yan: All these updates are based on the values of a single row.
BTW, I'm not working with a table, it's a view. Does this make a difference?
0
 
LVL 47

Expert Comment

by:for_yan
ID: 37769394
All these questions are database-specific - you need to ask a Sybase specialist.
Yes, I normally don't update views - I believe it is slower on Oracle.
And it kind of makes no sense logically. I always update the underlying tables.
0
 

Author Comment

by:gaugeta
ID: 37769401
@for_yan: Would it help if I tried to transfer all my calculations to the DB side?
0
 
LVL 47

Expert Comment

by:for_yan
ID: 37769402
As a rule, a bulk insert into a new table is probably the operation which can be made the fastest. So if you are concerned with the time, I'd rather select everything from your view, do the calculations, and prepare input for a bulk insert into the underlying tables based on what I selected and calculated. Then create a new table and issue the bulk insert command, then
drop the old table and rename the new table to the old name. I think that would be the fastest.
0
 
LVL 47

Expert Comment

by:for_yan
ID: 37769406
> @for_yan: Would it help if I tried to transfer all my calculations to the DB side?
Yes, in comparison with JDBC retrieval to Java and then JDBC updates, it should be faster.
The longest part in JDBC is the network transfer and hand-off of the data between the DB server and your client - you want to avoid that as much as you can.


The scenario above, based on the bulk insert executed on the DB server itself, can probably be even faster, as an insert into a fresh new table may be faster than an update.
But it is hard to say without experimenting.
0
 
LVL 35

Assisted Solution

by:mccarl
mccarl earned 250 total points
ID: 37769407
> Would it help if i tried to transfer all my calculations to the db side.

Yes, this would be the easier path to try first anyway; see if you get your required performance. The fact that you are using a view may make a difference, but it will still be faster than doing it in Java and updating the view that way. There should be no problem with support for basic mathematical operations in Sybase (I just had a quick look and it has a range of advanced mathematical operations, so the basic stuff should be fine).
0

 
LVL 47

Expert Comment

by:for_yan
ID: 37769409
and you don't need to update the view on the DB side - you can update the underlying tables
0
 
LVL 35

Expert Comment

by:mccarl
ID: 37769425
> and you don't need to update the view on the DB side - you can update the underlying tables

It depends. Maybe the 'new' values in the update are based on view columns that originate in different tables. So if you do change it to update the tables, you might still need a multi-table update.

All I am saying is to do the simplest thing first, which is to move the calculations to the DB side, using the same view that you are already using. If you get the required performance, then you are done. If it is still not quite there, then you could look at optimizing further by investigating separate table-only (non-view) updates, or if that's still not enough, bulk-insert type operations. But follow the easiest path first, to save possibly doing unnecessary work!
0
 
LVL 47

Expert Comment

by:for_yan
ID: 37769433
I also agree: follow the easiest change first, and if it works acceptably you don't need to go on to further complications.
Changing to server-side DB operations is a reasonable move - and maybe after that you'll be happy - try it.
0
 
LVL 47

Expert Comment

by:for_yan
ID: 37769440
and you can use Java to initiate the DB procedure - that is usually quite convenient
0
 

Author Comment

by:gaugeta
ID: 37772134
@for_yan & @mccarl: Thanks for the reply.
I have created a new stored procedure which takes in only raw values, about 45 in number, and it performs all the calculations previously done by the Java code and also updates the columns.
Unfortunately I'm not getting any increase in speed, and it feels just as it did when executing through the Java code.
I'm almost out of options.
What I do here is parse the result set of a query, which happens to be the source, one row at a time, add the relevant fields to the batch statement, and when there are about 100 of them, execute the batch.
I have 2 sets of values involved in the update.
One set remains constant for an entity, say a student, and the other set is different for each row. And it has to be gathered manually row by row until the count reaches 100.

Can you suggest anything regarding this?
0
 
LVL 47

Expert Comment

by:for_yan
ID: 37772211
It is strange that the database-side operation didn't give you an increase.

Can you make a test and compare whether inserting a certain number of rows into a fresh table is much faster than the update?

I think it should be.

If so, you can probably write the selected rows plus calculated values to a file and then do a bulk insert, and see whether that as a whole is faster than the update.

Are you executing everything locally - on one machine - with the database on the same machine?
0
 

Author Comment

by:gaugeta
ID: 37772299
@for_yan: I am working with multiple databases, one present locally and the other on another system.
And the set of values which vary for every row is retrieved very slowly.
If I just let the program print, it prints one row every second.
Is this normal or slow behaviour?
0
 
LVL 47

Expert Comment

by:for_yan
ID: 37772308
No, that is extremely slow.
You mean just the actual select works for you at one row per second?
Then it is very, very slow.
0
 

Author Comment

by:gaugeta
ID: 37772346
@for_yan: Yes, when I just try to print the whole result set of the query, I get the result at one row per second.
What can be done in this scenario?
0
 
LVL 47

Expert Comment

by:for_yan
ID: 37772370
I think you should talk to your DBA,
and you probably want to organize your database better, in one place, etc.
After all, 5 million rows, especially if you don't have some long binary fields in there,
is not that dramatic nowadays, but selecting one row per second is not acceptable.

Do you have direct access to the database - to the SQL prompt, not through Java or anything?
0
 

Author Comment

by:gaugeta
ID: 37772381
@for_yan: But the result of the query would be only a single row, and new values need to be fed to the query to get the next record.
Is it still slow when the result itself is only one row?
I am trying to get all the values one by one by providing new input values at the beginning of the loop.
Given this scenario, is it still too slow, or was it supposed to return values more quickly?
0
 
LVL 47

Expert Comment

by:for_yan
ID: 37772484
Yes, if the select is so slow, then something is wrong.
Of course an update is usually much slower than a select, but a select should never be one row per second.
0
 
LVL 47

Expert Comment

by:for_yan
ID: 37772531
Those are all questions for database folks.
You probably have something wrong with your database setup.
It should not be that slow.
0
 
LVL 35

Expert Comment

by:mccarl
ID: 37774310
> But the result of the query would be only a single row, and new values need to be fed to the query to get the next record

So are you saying that you do a select query, and get 1 row as the result, then send a DIFFERENT select query, and get 1 row as that result, and then send a 3rd DIFFERENT select query and get 1 row?? If that is what you were explaining then yes, it could be that slow.


> I have created a new stored procedure which takes in only raw values ... and it performs all the calculation previously done by the java code and also updates the columns.

Are you saying that you call this stored procedure once for each of the 5 million rows that you want to update? If so, that wasn't what we were saying to do.


> I am working with multiple databases one present locally and the other present in another system.

Ok, so maybe this is that golden nugget of info that we needed to know better what is happening. Does this mean that you are taking some data from one database, doing calculations on it and then updating rows in the OTHER database? If so, then yes, you will have performance issues regardless, because information therefore NEEDS to be sent across your network at some point. You may possibly need to look at doing a 'bulk insert' type operation as for_yan has mentioned.



If the above is not the case, then you really do need to explain more fully and in more detail, what you are trying to do. Otherwise, we will still just be guessing at things, that will possibly turn out to be wrong because we don't have all the info about your problem!
0
 
LVL 47

Expert Comment

by:for_yan
ID: 37774337
Did you really mean that you created a stored procedure which you call for every row?
No, of course it should be one procedure, called only once and doing the whole task in one run for the whole table.

But anyway, if you have even a select running at the rate of 1 second per row,
then something is wrong with your database setup. It should never be so slow.
0
 

Author Comment

by:gaugeta
ID: 37774738
@for_yan & @mccarl: Thanks for the replies. I am answering all your questions.

@mccarl:
>> So are you saying that you do a select query, and get 1 row as the result, then send a DIFFERENT select query, and get 1 row as that result, and then send a 3rd DIFFERENT select query and get 1 row??
No, it is the same select query, to which I send different data at the beginning of the loop so that I get new data to update into the table. And this select query gives only a single row of output. Is it justified that this query gives one row/sec, or should it have been faster?

>> Are you saying that you call this stored procedure once, for each of the 5 million rows that you want to update? If so, that wasn't what we were saying to do.
Practically I have to call this stored procedure for every row, but I do a batch update where I gather up to a hundred statements and then commit the batch. Is this what you were suggesting I do?

>> Ok, so maybe this is that golden nugget of info that we needed to know better what is happening. Does this mean that you are taking some data from one database, doing calculations on it and then updating rows in the OTHER database?
I am taking data from one local database and one remote database and updating the rows in the local database.

>> You may possibly need to look at doing a 'bulk insert' type operation as for_yan has mentioned.
I am worried that if the input is not fast enough, it might create the same problem, with trying to create the file taking a long time.

@for_yan:
>> Did you really mean that you created a stored procedure which you call for every row?
As I told mccarl while clarifying, I create batch updates for every 100 rows and then call the stored procedure once I have 100 statements to update.

Please help...
0
 
LVL 47

Expert Comment

by:for_yan
ID: 37774754
> Practically I have to call this stored procedure for every row, but I do a batch update where I gather up to a hundred statements and then commit the batch. Is this what you were suggesting I do?
No, the whole point was to do everything on the database side - not to hand data over to Java and back from Java to the database; that is where you lose most of the time, even if you are doing it on the same machine.
That makes no sense, and no batch update will help you. It should all be done in one run using PL/SQL (or whatever it is called in Sybase).
0
 
LVL 47

Expert Comment

by:for_yan
ID: 37774778
I think you need to redesign your system: copy the table which is on the remote database to the local database, and then run your process on one server as a completely database-based operation.
And you need to talk to database people who are familiar with your specific databases and your specific operation. All the advice that we can give is only very general. In cases like this, when you are facing serious timing issues, you need a database specialist who knows the details of your database operations,
and has an idea about your business in general, who knows about indexing, transaction handling, etc. I'm sure such a person, looking at your specific system, will have plenty of ideas on how to improve it.
0
 

Author Comment

by:gaugeta
ID: 37774825
@for_yan: Thanks for the reply.
>> No, the whole point was to do everything on the database side - not to hand data over to Java and back from Java to the database; that is where you lose most of the time, even if you are doing it on the same machine.
That makes no sense, and no batch update will help you. It should all be done in one run using PL/SQL (or whatever it is called in Sybase).

What am I doing wrong here? I thought that instead of having the SQL code embedded in the Java code, you were suggesting I have it in the form of a stored procedure.
What is wrong with batch updates here?
I have to at least provide data to the stored procedure using batch statements from Java for every update, right?
Or how differently were you suggesting I do it?
Please help...
0
 
LVL 47

Expert Comment

by:for_yan
ID: 37774836
No, ideally you don't want Java to have anything to do with this. If everything is on one database, you do everything with a database script - in a language which executes within the database - and the data never leaves the database.
In Oracle such a language is called PL/SQL; in
Sybase it is called Transact-SQL - so it should all be written in Transact-SQL, as one big procedure:

http://manuals.sybase.com/onlinebooks/group-as/asg1250e/sqlug/@Generic__BookView

Everything should happen on the database, and the data does not need to leave the database.
Your operation is 100% server-side - you don't need to present or display any data - so there is no need to hand the data over to Java at all.
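Once such a Transact-SQL procedure exists, the only Java involvement would be a single call. The procedure name and its parameter are hypothetical:

```java
import java.sql.*;

// Sketch: kick off one server-side procedure that does all the work.
public class ServerSideRun {

    // JDBC escape syntax for a stored procedure call.
    static String call() {
        return "{call fill_derived_columns(?)}";
    }

    static void run(Connection conn, int runId) throws SQLException {
        try (CallableStatement cs = conn.prepareCall(call())) {
            cs.setInt(1, runId);
            cs.execute(); // all rows are updated inside the server; no row
                          // data crosses the network
        }
    }
}
```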
0
 
LVL 47

Accepted Solution

by:
for_yan earned 250 total points
ID: 37774841
If you have some part of the data on a remote database, then for such cases Oracle has database links, which are very effective - a special communication channel between databases - again with no Java intermediary or anything.
I don't know if Sybase has such links; still, given such huge data operations, I think it would be more effective to copy that remote table temporarily to the local server and copy it back after you rewrite it. It would be most effective if all the tables are on the same database - then your Transact-SQL script would be most effective and most straightforward.
0
 

Author Comment

by:gaugeta
ID: 37774850
@for_yan: Thanks for the reply.
I will try to get the remote table onto the local server and then try the same procedure.
I think at this stage changing the entire thing to Transact-SQL would be a big chore, and I am not familiar with Transact-SQL.
Do you think getting the remote table onto the local server will yield better results with my current Java code?
0
 
LVL 47

Expert Comment

by:for_yan
ID: 37774853
It may get better results, but I'm afraid not that dramatically.

Are you also writing back to that remote table, or only reading from it?
0
 

Author Comment

by:gaugeta
ID: 37774859
@for_yan:Thanks for the reply.
I am only reading from the remote table not writing to it.
0
 
LVL 47

Expert Comment

by:for_yan
ID: 37774889
then the gain would obviously be less, if you are not writing there. Still if you move it to one database it wiill bring you closer to do it eventually on the database.

I can tell you that I once had similar task -  I had a table of original data - anout few hunder thousand rows and then andother table - about a million - I needed to go through the whole first table - find new data corresoponding in the secodn table and populate the third table form bothe the fisrt and the secodn. Firts I wrote everything through JDBC - and it was tamking about half a day and it was growing woth growing tables and I saw that I would need soon moee than a day to do that. Then I rewrote everything on PL/SQL - and it took about an hour and tables were  growing suince then - and now almost ten years later it still takes one and a half hours. So that change can be really dramatic.
0
