Solved

Best way to perform massive # of inserts from java...

Posted on 2006-10-23
660 Views
Last Modified: 2008-02-01


Hi experts. I am working on a batch process that is quite lengthy. It tries to find an association between one thing and another based on about 51 conditions. Each condition, if matched, adds a point, so to speak. For each match that has received a certain number of points or more, I add that object (a business entity object) to an ArrayList. The last thing the batch does before completing is call the data access method of this object and pass the ArrayList of matches to it. The following code is how I attempt to perform the insert. Keep in mind that we are using a prepared statement here.

                  INSERT_NEW_MATCHES +=
                        " INSERT INTO Mytable " +
                        " ( F1, F2, F3, F4, F5 ) ";

                  System.out.println("in insert method - before for loop. SCMLS size: " + scmls.size());
                  for (int i = 0; i < scmls.size(); i++)
                  {
                        INSERT_NEW_MATCHES +=
                              " SELECT " +
                              "   ?, " +
                              "   ?, " +
                              "   ?, " +
                              "   SYSDATE, " +
                              "   ? " +
                              " FROM DUAL " +
                              " UNION ";
                  }

Followed By:

                  if( INSERT_NEW_MATCHES.indexOf("UNION") > -1 )
                        INSERT_NEW_MATCHES = INSERT_NEW_MATCHES.substring(0, INSERT_NEW_MATCHES.lastIndexOf("UNION"));
                  
                  System.out.println("FINAL INSERT QUERY: " + INSERT_NEW_MATCHES );
                  
                  pstmt = con.prepareStatement(INSERT_NEW_MATCHES );
                  
                  for (int a = 0; a < scmls.size(); a++)
                  {
                        iterationIndex = a;
                        scml = (StudentCampusMatchList) scmls.get(a);
                        pstmt.setInt(++count, scml.F1());
                        pstmt.setInt(++count, scml.F2());
                        pstmt.setDouble(++count, scml.F3());
                        pstmt.setString(++count, scml.F4());
                  }

                  System.out.println("in insert method - before executeUpdate ");
                  int rc = pstmt.executeUpdate();      

As you can imagine, this is pretty heavy on memory. I did not anticipate that this process would have to insert over 1 million records, but it does. I realize that I will likely run into a memory issue with the string. My question is: how can I implement this in a better way? Whatever the suggestion is, it must be able to utilize an ArrayList of business objects.

Thanks ahead of time guys!            
Question by:jbaisden
 
LVL 11

Accepted Solution

by:
Manish earned 250 total points
ID: 17788395
A few things I can suggest:
Use a batch statement.

               for (int a = 0; a < scmls.size(); a++)
               {
                    iterationIndex = a;
                    scml = (StudentCampusMatchList) scmls.get(a);
                    // parameter indexes restart at 1 for each row
                    pstmt.setInt(1, scml.F1());
                    pstmt.setInt(2, scml.F2());
                    pstmt.setDouble(3, scml.F3());
                    pstmt.setString(4, scml.F4());
                    // add to the batch
                    pstmt.addBatch();
               }
        // execute the batch
        pstmt.executeBatch();


I don't follow what you are doing in the first portion.
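Put together, the batched approach might look like the sketch below. The bean stub, the getter names, and the 1000-row flush size are assumptions for illustration, not code from the thread; a plain parameterized INSERT replaces the SELECT ... FROM DUAL UNION chain so the SQL string stays constant regardless of row count.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class MatchInserter {

    // Minimal stand-in for the asker's business entity (field and getter
    // names here are assumptions, not from the thread).
    public static class StudentCampusMatchList {
        int f1, f2; double f3; String f4;
        public int getF1() { return f1; }
        public int getF2() { return f2; }
        public double getF3() { return f3; }
        public String getF4() { return f4; }
    }

    static final int BATCH_SIZE = 1000; // flush size is a tunable assumption

    // One parameterized row; no UNION chain, no giant concatenated string.
    static final String SQL =
        "INSERT INTO Mytable (F1, F2, F3, F4, F5) VALUES (?, ?, ?, SYSDATE, ?)";

    public static void insertMatches(Connection con, List<StudentCampusMatchList> scmls)
            throws SQLException {
        PreparedStatement pstmt = con.prepareStatement(SQL);
        try {
            int pending = 0;
            for (StudentCampusMatchList scml : scmls) {
                pstmt.setInt(1, scml.getF1());    // indexes restart at 1 for every row
                pstmt.setInt(2, scml.getF2());
                pstmt.setDouble(3, scml.getF3());
                pstmt.setString(4, scml.getF4());
                pstmt.addBatch();
                if (++pending == BATCH_SIZE) {    // flush in lots to cap driver memory
                    pstmt.executeBatch();
                    pending = 0;
                }
            }
            if (pending > 0) pstmt.executeBatch(); // flush the remainder
        } finally {
            pstmt.close();
        }
    }

    // How many executeBatch() calls a given list size produces.
    static int flushCount(int rows, int batchSize) {
        return rows / batchSize + (rows % batchSize > 0 ? 1 : 0);
    }
}
```

With a million rows this keeps at most 1000 pending rows in the driver at a time instead of building a multi-megabyte SQL string in memory.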
 

Author Comment

by:jbaisden
ID: 17788876


The first portion of the code is pretty much concatenating a string. The way the insert works is something like

/* PSEUDO CODE */
Insert into <table name>
( <field1>, <field2>, <field3> )
Select val1, val2, val3 from dual
UNION
Select val1B, val2B, val3B from dual
UNION
Select val1C, val2C, val3C from dual
UNION
Select val1D, val2D, val3D from dual
...N

The basic idea is that we know all of the values being inserted, so we can just use one object from our ArrayList per insert. We use various fields from that object to set the fields in the prepared statement accordingly.

Does that clear things up at all? BTW Thanks for the batch suggestion. I will try that.
 
LVL 5

Assisted Solution

by:DizzyDiz
DizzyDiz earned 250 total points
ID: 17791360
(SoC) Separation of Concerns!
Model your logic in Java, then apply your data access logic separately. Your business logic is stated thus:

>It tries to find an association between one thing and another based on about 51 conditions. Each condition, if matched, adds a point so to speak. For each match that has received a certain number of points or more, I add that object (it's a business entity object) to an array list.

Reworded in SoC terms, it means:
For a given item, iterate all business entities. Score associations on each using lots of (51) conditions and, if the score is more than the magic number, save the entity.

Notice I didn't use ArrayList, table, prepared statement or anything like that in the reworded mission statement. That's important because it allows you to model your solution without tripping over the details of specific technologies. When you see the actual problem for what it is, you realize that it's not a memory-intensive operation. (Though you certainly can make it into one.) On the highest level you have:
1. iteration (looping over each business object)
2. filtering (making a decision based on some condition or conditions)
3. persistence (saving a business entity in some storage)

I like to use a pipelined approach to problems like this because it greatly reduces the memory footprint and maximizes reuse across all operations. That is, the gathering of all entities is not specific to the decision, which is not specific to the persistence. To be even clearer, your recording of data does not have to operate under the constraints (using specific data structures) of your filtering. And your filtering does not have to be restricted to operating on a data structure that just happened to be the first idea that fit the iteration. The approach buys you the powerful ability to interchange different ideas.

Let's get specific. Look at each numbered step above.
1. Can you write some easy code that iterates all of the business entity objects and gives each to an "interface" asking a yes/no question?
(Using an interface is very important in this step.)
2. Can you write a yes/no object that operates on a single entity, pushing it through all (51) conditions?
3. Can you write an object that persists a single business entity without regard for where it came from?

After answering yes to all three questions, you'll see that you can plug the three different ideas into one another to solve your original problem. That means the iterator can hand each object in turn to the filtering decision maker, who can answer the yes/no question. The iterator can then give the entity to the persistence object contingent on the result from the yes/no object. Any and all of those steps are high level and can be broken down further if necessary. For example, iterating the objects may not be simple, as it may mean complicated processing involving multiple DBMSes. Also, the yes/no object in step 2 is a perfect candidate for further disassembly.
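The three questions above can be sketched as three small interfaces plus a few lines of glue. The interface and method names below are my own illustration, not code from the thread:

```java
// Hypothetical interfaces -- one per concern from the three questions above.
public class Pipeline {
    public interface EntitySource { Iterable<Object> entities(); }   // 1: iteration
    public interface EntityFilter { boolean accept(Object entity); } // 2: yes/no (the ~51 conditions live behind this)
    public interface EntityPersister { void save(Object entity); }   // 3: persistence

    // Glue code: hand each entity to the filter, persist the ones it accepts.
    public static int run(EntitySource src, EntityFilter filter, EntityPersister sink) {
        int saved = 0;
        for (Object e : src.entities()) {
            if (filter.accept(e)) {
                sink.save(e);
                saved++;
            }
        }
        return saved;
    }
}
```

Each implementation can be swapped independently: the source can stream entities lazily, the filter can hide the scoring, and the persister can batch writes, all without the others knowing.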

What about performance and excessive network connections? A detail-oriented eye would immediately pick up on a potential problem in my above rant. The pipelined approach can lead to lots of opening and closing of database connections. However, it's also very important not to predict that such behaviour would be a bottleneck. I would only address the problem as needed (as I saw a major slowdown). In this case the issue becomes a workflow-related issue, one that can be addressed without regard to the 51 conditions involved in decision making. There are high-level ideas that you could apply to lessen the impact, such as block fetching and writing. You could add block reading complemented by a block-writing method on the persistence object that saves a collection. That looks like it brings us back to the original problem, but it would be subtly different if you completely decomposed each high-level idea and subdivided your way down to the details. The block size could be managed externally and adjusted to give the optimal memory/connection-count ratio.

In one example, your iterator would read blocks of 100 or so entities at a time and narrow the list using the single-entity yes/no object. The narrowed list would be passed to a complementary batch write method on the persistence object.

public void persistBusinessObjects() {
   Collection<BusinessObject> block = getPersistenceManager().batchReadBusinessObjects(100);
   ArrayList<BusinessObject> narrowedList = new ArrayList<BusinessObject>(100);
   for (BusinessObject obj : block)
      if (getListFilter().includeObject(obj))
         narrowedList.add(obj);
   getPersistenceManager().save(narrowedList);
}

As a matter of fact, you could get even better performance by embedding the filtering logic into your actual block read. This would involve code that was a little more savvy and separated the JDBC logic from the logic that mapped values to the business entity. On a high level it would look like this:

public void persistBusinessObjects() {
   PersistenceManager mgr = getPersistenceManager();
   mgr.setReadFilter(getListFilter());
   Collection<BusinessObject> block = mgr.batchReadBusinessObjects(100);
   mgr.save(block);
}

On a lower level you would need to use callback patterns with a JDBC framework like Spring or even design your own poor man's version. Looking back on what I wrote I think I'll quit while I'm ahead. I'm afraid some of what I'm describing may be a little too much for what you're used to. If anything you really should separate your problems.

Cliff
 

Author Comment

by:jbaisden
ID: 17798264

Wow Dizzy (Cliff)! My head is spinning. I'm struck with the feeling that I just read something immensely profound but don't have a clue of what it means. I will pick at it over the next few days. On the other hand my issue is more immediate. I've never been one to accept a solution without understanding it, but what you said above is just too much for me to grasp at the moment.

My basic problem (in tech jargon) is this:

I need a way to insert LOTS of data into the database. I am using an ArrayList to hold the data that needs to be inserted. On the most recent run of this process it was about 1 million records. My current approach is this:

1) Create my connection
2) Determine number of loop iterations needed to insert records in lots of 1000.
3) Loop over code that sets my prepared statement and query string.
4) Execute the prepared statement

I'm having issues with the process mysteriously stopping. That's right. Stopping. It doesn't throw an error in the try/catch or anything. It simply just stops. Last night when it ran it inserted 100,000+ records and then stopped. As previously said, there were no errors thrown or caught by the log.

I am at a loss. I have a very pressing deadline and have no idea how to finish this. Furthermore, I am about to switch jobs and don't want to leave this unfinished, especially when it's so close.

 
LVL 5

Expert Comment

by:DizzyDiz
ID: 17798362
My apologies. I have a tendency to run with a particular idea. In simple terms, I am only trying to point out that you should break your original problem down into big chunks, then break those big chunks into smaller chunks, and repeat until you get to something that can be coded in 3-4 lines of code. You have multiple ideas mixed together in your code.
 

Author Comment

by:jbaisden
ID: 17798554

Please don't misunderstand me, Dizzy. I am immensely grateful for all the effort you put into that post and will actually print it out to keep on file. It is an excellent description of how the thought processes of programmers and software engineers can be melded. While one may think they should be one and the same, this often is not the case.

One thing that did confuse me is the idea that I should make an interface for each condition that needs to be tested. I don't see the benefit of doing that; furthermore, I would end up with about 51+ files for one process. Did you mean that I ought to put the conditional testing into a single object or interface whereby I can pass my object into it and have it provide me with the numeric result from the testing of that object?

 
LVL 5

Expert Comment

by:DizzyDiz
ID: 17798638
No, you don't make an interface for each condition. You make one interface that takes the BusinessObject and tests it against each condition. You don't need a numeric result from the interface either. The interface would look like this:

public interface BusinessObjectRanker { // can't think of a better name
   void setAllowedScore(int score);
   boolean ranksAboveScore(BusinessObject obj);
}
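One possible implementation of that contract is sketched below. Representing each of the ~51 conditions as a `java.util.function.Predicate` is my own assumption, as are the class and method names; the ranker simply counts matching conditions and compares against the threshold.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch: each condition is one Predicate; one point per match.
public class ScoringRanker<T> {
    private final List<Predicate<T>> conditions;
    private int allowedScore;

    public ScoringRanker(List<Predicate<T>> conditions) {
        this.conditions = conditions;
    }

    public void setAllowedScore(int score) {
        this.allowedScore = score;
    }

    public boolean ranksAboveScore(T obj) {
        int score = 0;
        for (Predicate<T> c : conditions) {
            if (c.test(obj)) score++;  // one point per matched condition
        }
        return score >= allowedScore;
    }
}
```

The caller never sees the 51 conditions; it only asks the yes/no question, which keeps the scoring logic in one place and independently testable.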
 
LVL 5

Expert Comment

by:DizzyDiz
ID: 17799037
Regarding your current approach:

1) Create my connection
2) Determine number of loop iterations needed to insert records in lots of 1000.
3) Loop over code that sets my prepared statement and query string.
4) Execute the prepared statement

All the steps involve details. None of them explains or focuses on your overall goal. Focusing on the details (connections, number of iterations, query strings and prepared statements) misses the point. The point is:

>For a given item, iterate all business entities. Score associations on each using lots of (51) conditions and, if the score is more than the magic number, save the entity.

I understand your timeline, but you should understand that taking the wrong approach takes more time than taking the right approach. I'll do you a favor and stub out all of the high-level ideas. From there you can play "fill in the blank", and then it becomes a matter of "connect the dots". Here's an example that took me just 15 minutes to stub out:

public interface Mapper {
    Object mapCurrentResult(ResultSet current);
}

public interface ResultFilter {
    void accept(Object obj);
}

public interface QueryService {

    //each of the following methods returns "this" after setting internal fields with the given parameters
    QueryService useQuery(String query);

    QueryService useMapper(Mapper mapper);

    QueryService useResultFilter(ResultFilter resultFilter);

    //This method
    // uses the preset query,
    // executes it,
    // iterates the results,
    // and calls the Mapper first then the ResultFilter on each iteration,
    Collection execute();
}

public class BusinessObjectRepository implements Mapper
{
    Collection findAllBusinessObjects()
    {
        return getQueryService()
                .useQuery(getQuery())
                .useMapper(this)
                .useResultFilter(new ResultFilter()
                    {
                        public void accept(Object obj)
                        {
                        }
                    }
                )
                .execute();
    }

    public Object mapCurrentResult(ResultSet current)
    {
        BusinessObject obj = new BusinessObject();
        //pull values from current resultset record and set them on obj
        return obj;
    }

    /**
     * You'll want to complement this with a setter
     * @return
     */
    private QueryService getQueryService()
    { return null; }

    /**
     * This method could actually be made abstract if you wanted to be really fancy..
     * @return query necessary to pull data from tables. The data pulled is used to create business objects
     */
    protected String getQuery()
    {
        return "select * from businessobjtbl where whatever";
    }
}

From this (poor man's SpringFramework) you can start putting pieces in their proper place. The JDBC logic goes into a custom query service. That object contains only the logic to run arbitrary JDBC, iterate records and call methods on the given interfaces. I made the setter methods "self-returning", as it's a pattern I follow often. (See my blog entry on the topic: http://codeforfun.wordpress.com/2006/06/23/avoid-a-void/) They're also cleverly named so it looks like you're building the object up incrementally. 90% of the code for the query service (everything except for the result filter stuff) can be found in the SpringFramework if you're not up to writing it yourself. The QueryService is really just a framework for running SQL and turning it into business objects.

My example leaves a lot of things out. That's the fill-in-the-blank part. You "fill in" the tables involved with constructing the business objects using a SQL string. You "fill in" the logic that maps the columns in the resultset to the business object. You "fill in" implementations of all the interfaces, each of which is focused on ONE thing.

Then there's the connect-the-dots part. You write a little bit of glue code to put the pieces together. This code creates the business object repository and plugs in the filtering logic, the query service and the other interface implementations, and then the repository is ready for you to ask for the final collection. All logic is pipelined so that the result is the trimmed-down list containing only what you wanted.
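As an illustration of the glue-code step, here is a small in-memory stand-in of my own (not the real stubs above): a List of rows replaces the ResultSet, and I give ResultFilter a boolean return so it can actually drop records. A real implementation would run JDBC inside execute() instead.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical in-memory simplification of the QueryService idea.
public class Glue {
    interface Mapper { Object mapCurrentResult(Object currentRow); }
    interface ResultFilter { boolean accept(Object obj); }

    static class QueryService {
        private Mapper mapper;
        private ResultFilter filter;
        private List<Object> rows; // stand-in for a JDBC ResultSet

        QueryService useRows(List<Object> rows) { this.rows = rows; return this; }
        QueryService useMapper(Mapper m) { this.mapper = m; return this; }
        QueryService useResultFilter(ResultFilter f) { this.filter = f; return this; }

        // Map each row to a business object, keep only the accepted ones.
        Collection<Object> execute() {
            List<Object> out = new ArrayList<>();
            for (Object row : rows) {
                Object obj = mapper.mapCurrentResult(row);
                if (filter.accept(obj)) out.add(obj);
            }
            return out;
        }
    }
}
```

The caller wires the pieces together fluently, e.g. `new Glue.QueryService().useRows(rows).useMapper(...).useResultFilter(...).execute()`, which is the "connect the dots" moment: each plugged-in piece knows nothing about the others.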
 

Author Comment

by:jbaisden
ID: 17948500

I feel that I ought to add a comment to explain my point distribution. Ultimately, it was karanw's approach that solved this issue. Despite this, I cannot let all your effort go without reward, Dizzy, especially in light of the fact that karanw made only one post.

I am very appreciative of the feedback from both of you. This is a question I will come back to and reread to see everything that was stated.

Thank you both again for your time.