Solved

Best way to perform massive # of inserts from java...

Posted on 2006-10-23
660 Views
Last Modified: 2008-02-01


Hi experts. I am working on a batch process that is quite lengthy. It tries to find an association between one thing and another based on about 51 conditions. Each condition, if matched, adds a point, so to speak. For each match that has received a certain number of points or more, I add that object (a business entity object) to an ArrayList. The last thing the batch does before completing is call the data access method of this object and pass the ArrayList of matches to it. The following code is how I attempt to perform the insert. Keep in mind that we are using a prepared statement here.

                  INSERT_NEW_MATCHES +=
                        " INSERT INTO Mytable " +
                        " ( F1, F2, F3, F4, F5 ) ";

                  System.out.println("in insert method - before for loop. SCMLS size: " + scmls.size());
                  for (int i = 0; i < scmls.size(); i++)
                  {
                        INSERT_NEW_MATCHES +=
                              " SELECT " +
                              "   ?, " +
                              "   ?, " +
                              "   ?, " +
                              "   SYSDATE, " +
                              "   ? " +
                              " FROM DUAL " +
                              " UNION ";
                  }

Followed By:

                  if( INSERT_NEW_MATCHES.indexOf("UNION") > -1 )
                        INSERT_NEW_MATCHES = INSERT_NEW_MATCHES.substring(0, INSERT_NEW_MATCHES.lastIndexOf("UNION"));
                  
                  System.out.println("FINAL INSERT QUERY: " + INSERT_NEW_MATCHES );
                  
                  pstmt = con.prepareStatement(INSERT_NEW_MATCHES );
                  
                  for (int a = 0; a < scmls.size(); a++)
                  {
                        iterationIndex = a;
                        scml = (StudentCampusMatchList) scmls.get(a);
                        pstmt.setInt(++count, scml.F1());
                        pstmt.setInt(++count, scml.F2());
                        pstmt.setDouble(++count, scml.F3());
                        pstmt.setString(++count, scml.F4());
                  }

                  System.out.println("in insert method - before executeUpdate ");
                  int rc = pstmt.executeUpdate();      

As you can imagine, this is pretty heavy on memory. I did not anticipate that this process would have to insert over 1 million records, but it does. I realize that I will likely run into a memory issue with the string. My question is: how can I implement this in a better way? Whatever the suggestion is, it must be able to utilize an ArrayList of business objects.

Thanks ahead of time guys!            
Question by:jbaisden
 
LVL 11

Accepted Solution

by:
Manish earned 250 total points
ID: 17788395
A few things I can suggest:
Use a batch statement.

               for (int a = 0; a < scmls.size(); a++)
               {
                    iterationIndex = a;
                    scml = (StudentCampusMatchList) scmls.get(a);
                    // parameter indexes restart at 1 for each row
                    pstmt.setInt(1, scml.F1());
                    pstmt.setInt(2, scml.F2());
                    pstmt.setDouble(3, scml.F3());
                    pstmt.setString(4, scml.F4());
                    // add to the batch
                    pstmt.addBatch();
               }
        // execute the batch
        pstmt.executeBatch();


I don't follow what you are doing in the first portion.
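Put together, the batched approach might look like the sketch below. The bean stub, the getter names, and the 1000-row flush size are assumptions for illustration, not code from the thread; a plain parameterized INSERT replaces the SELECT ... FROM DUAL UNION chain so the SQL string stays constant regardless of row count.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class MatchInserter {

    // Minimal stand-in for the asker's business entity (field and getter
    // names here are assumptions, not from the thread).
    public static class StudentCampusMatchList {
        int f1, f2; double f3; String f4;
        public int getF1() { return f1; }
        public int getF2() { return f2; }
        public double getF3() { return f3; }
        public String getF4() { return f4; }
    }

    static final int BATCH_SIZE = 1000; // flush size is a tunable assumption

    // One parameterized row; no UNION chain, no giant concatenated string.
    static final String SQL =
        "INSERT INTO Mytable (F1, F2, F3, F4, F5) VALUES (?, ?, ?, SYSDATE, ?)";

    public static void insertMatches(Connection con, List<StudentCampusMatchList> scmls)
            throws SQLException {
        PreparedStatement pstmt = con.prepareStatement(SQL);
        try {
            int pending = 0;
            for (StudentCampusMatchList scml : scmls) {
                pstmt.setInt(1, scml.getF1());    // indexes restart at 1 for every row
                pstmt.setInt(2, scml.getF2());
                pstmt.setDouble(3, scml.getF3());
                pstmt.setString(4, scml.getF4());
                pstmt.addBatch();
                if (++pending == BATCH_SIZE) {    // flush in lots to cap driver memory
                    pstmt.executeBatch();
                    pending = 0;
                }
            }
            if (pending > 0) pstmt.executeBatch(); // flush the remainder
        } finally {
            pstmt.close();
        }
    }

    // How many executeBatch() calls a given list size produces.
    static int flushCount(int rows, int batchSize) {
        return rows / batchSize + (rows % batchSize > 0 ? 1 : 0);
    }
}
```

With a million rows this keeps at most 1000 pending rows in the driver at a time instead of building a multi-megabyte SQL string in memory.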
 

Author Comment

by:jbaisden
ID: 17788876


The first portion of the code is pretty much concatenating a string. The way the insert works is something like

/* PSEUDO CODE */
Insert into <table name>
( <field1>, <field2>, <field3> )
Select val1, val2, val3 from dual
UNION
Select val1B, val2B, val3B from dual
UNION
Select val1C, val2C, val3C from dual
UNION
Select val1D, val2D, val3D from dual
...N

The basic idea is that we know all of the values being inserted, so we can just use one object from our ArrayList per insert. We use various fields from that object to set the fields in the prepared statement accordingly.

Does that clear things up at all? BTW Thanks for the batch suggestion. I will try that.
 
LVL 5

Assisted Solution

by:DizzyDiz
DizzyDiz earned 250 total points
ID: 17791360
(SoC) Separation of Concerns!
Model your logic in Java, then apply your data access logic separately. Your business logic is stated thus:

>It tries to find an association between one thing and another based on about 51 conditions. Each condition, if matched, adds a point so to speak. For each match that has received a certain number of points or more, I add that object (it's a business entity object) to an array list.

Reworded in SoC terms, it means:
For a given item, iterate all business entities. Score associations on each using lots of (51) conditions and, if the score is more than the magic number, save the entity.

Notice I didn't use ArrayList, table, prepared statement or anything like that in the reworded mission statement. That's important because it allows you to model your solution without tripping over the details of specific technologies. When you see the actual problem for what it is, you realize that it's not a memory-intensive operation. (Though you certainly can make it into one.) On the highest level you have:
1. iteration (looping over each business object)
2. filtering (making a decision based on some condition or conditions)
3. persistence (saving a business entity in some storage)

I like to use a pipelined approach to problems like this because it greatly reduces the memory footprint and maximizes reuse across all operations. That is, the gathering of all entities is not specific to the decision, which is not specific to the persistence. To be even clearer, your recording of data does not have to operate under the constraints (using specific data structures) of your filtering. And your filtering does not have to be restricted to operating on a data structure that just happened to be the first idea that fit the iteration. The approach buys you the powerful ability to interchange different ideas.

Let's get specific. Look at each numbered step above.
1. Can you write some easy code that iterates all of the business entity objects and gives each to an "interface" asking a yes/no question?
(Using an interface is very important in this step.)
2. Can you write a yes/no object that operates on a single entity, pushing it through all (51) conditions?
3. Can you write an object that persists a single business entity without regard for where it came from?

After answering yes to all three questions, you'll see that you can plug the three different ideas into one another to solve your original problem. That means the iterator can hand each object in turn to the filtering decision maker, who can answer the yes/no question. The iterator can then give the entity to the persistence object contingent on the result from the yes/no object. Any and all of those steps are high level and can be broken down further if necessary. For example, iterating the objects may not be simple, as it may mean complicated processing involving multiple DBMSes. Also, the yes/no object in step 2 is a perfect candidate for further disassembly.
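The three questions above can be sketched as three small interfaces plus a few lines of glue. The interface and method names below are my own illustration, not code from the thread:

```java
// Hypothetical interfaces -- one per concern from the three questions above.
public class Pipeline {
    public interface EntitySource { Iterable<Object> entities(); }   // 1: iteration
    public interface EntityFilter { boolean accept(Object entity); } // 2: yes/no (the ~51 conditions live behind this)
    public interface EntityPersister { void save(Object entity); }   // 3: persistence

    // Glue code: hand each entity to the filter, persist the ones it accepts.
    public static int run(EntitySource src, EntityFilter filter, EntityPersister sink) {
        int saved = 0;
        for (Object e : src.entities()) {
            if (filter.accept(e)) {
                sink.save(e);
                saved++;
            }
        }
        return saved;
    }
}
```

Each implementation can be swapped independently: the source can stream entities lazily, the filter can hide the scoring, and the persister can batch writes, all without the others knowing.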

What about performance and excessive network connections? A detail-oriented eye would immediately pick up on a potential problem in my above rant. The pipelined approach can lead to lots of opening and closing of database connections. However, it's also very important not to predict that such behaviour would be a bottleneck. I would only address the problem as needed (as I saw a major slowdown). In this case the issue becomes a workflow-related issue, one that can be addressed without regard to the 51 conditions involved in decision making. There are high-level ideas that you could apply to lessen the impact, such as block fetching and writing. You could add block reading complemented by a block-writing method on the persistence object that saves a collection. That looks like it brings us back to the original problem, but it would be subtly different if you completely decomposed each high-level idea and subdivided your way down to the details. The block size could be managed externally and adjusted to give the optimal memory/connection-count ratio.

In one example, your iterator would read blocks of 100 or so entities at a time and narrow the list using the single-entity yes/no object. The narrowed list would be passed to a complementary batch write method on the persistence object.

public void persistBusinessObjects() {
   Collection<BusinessObject> block = getPersistenceManager().batchReadBusinessObjects(100);
   ArrayList<BusinessObject> narrowedList = new ArrayList<BusinessObject>(100);
   for (BusinessObject obj : block)
      if (getListFilter().includeObject(obj))
         narrowedList.add(obj);
   getPersistenceManager().save(narrowedList);
}

As a matter of fact, you could get even better performance by embedding the filtering logic into your actual block read. This would involve code that was a little more savvy and separated the JDBC logic from the logic that mapped values to the business entity. On a high level it would look like this:

public void persistBusinessObjects() {
   PersistenceManager mgr = getPersistenceManager();
   mgr.setReadFilter(getListFilter());
   Collection<BusinessObject> block = mgr.batchReadBusinessObjects(100);
   mgr.save(block);
}

On a lower level you would need to use callback patterns with a JDBC framework like Spring or even design your own poor man's version. Looking back on what I wrote I think I'll quit while I'm ahead. I'm afraid some of what I'm describing may be a little too much for what you're used to. If anything you really should separate your problems.

Cliff
 

Author Comment

by:jbaisden
ID: 17798264

Wow Dizzy (Cliff)! My head is spinning. I'm struck with the feeling that I just read something immensely profound but don't have a clue of what it means. I will pick at it over the next few days. On the other hand my issue is more immediate. I've never been one to accept a solution without understanding it, but what you said above is just too much for me to grasp at the moment.

My basic problem (in tech jargon) is this:

I need a way to insert LOTS of data into the database. I am using an ArrayList to hold the data that needs to be inserted. On the most recent run of this process it was about 1 million records. My current approach is this:

1) Create my connection
2) Determine number of loop iterations needed to insert records in lots of 1000.
3) Loop over code that sets my prepared statement and query string.
4) Execute the prepared statement

I'm having issues with the process mysteriously stopping. That's right. Stopping. It doesn't throw an error in the try/catch or anything. It simply just stops. Last night when it ran it inserted 100,000+ records and then stopped. As previously said, there were no errors thrown or caught by the log.

I am at a loss. I have a very pressing deadline and have no idea how to finish this. Furthermore, I am about to switch jobs and don't want to leave this unfinished, especially when it's so close.

 
LVL 5

Expert Comment

by:DizzyDiz
ID: 17798362
My apologies. I have a tendency to run with a particular idea. In simple terms, I am only trying to point out that you should break your original problem down into big chunks, then break those big chunks into smaller chunks, and repeat until you get to something that can be coded in 3-4 lines of code. You have multiple ideas mixed together in your code.
 

Author Comment

by:jbaisden
ID: 17798554

Please don't misunderstand me, Dizzy. I am immensely grateful for all the effort you put into that post and will actually print it out to keep on file. It is an excellent description of how the thought processes of programmers and software engineers can be melded. While one may think they should be one and the same, this often is not the case.

One thing that did confuse me is the idea that I should make an interface for each condition that needs to be tested. I don't see the benefit of doing that; furthermore, I would end up with about 51+ files for one process. Did you mean that I ought to put the conditional testing into a single object or interface whereby I can pass my object into it and have it provide me with the numeric result from the testing of that object?

 
LVL 5

Expert Comment

by:DizzyDiz
ID: 17798638
No, you don't make an interface for each condition. You make one interface that takes the BusinessObject and tests it against each condition. You don't need a numeric result from the interface either. The interface would look like this:

public interface BusinessObjectRanker { // can't think of a better name
   void setAllowedScore(int score);
   boolean ranksAboveScore(BusinessObject obj);
}
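One possible implementation of that contract is sketched below. Representing each of the ~51 conditions as a `java.util.function.Predicate` is my own assumption, as are the class and method names; the ranker simply counts matching conditions and compares against the threshold.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch: each condition is one Predicate; one point per match.
public class ScoringRanker<T> {
    private final List<Predicate<T>> conditions;
    private int allowedScore;

    public ScoringRanker(List<Predicate<T>> conditions) {
        this.conditions = conditions;
    }

    public void setAllowedScore(int score) {
        this.allowedScore = score;
    }

    public boolean ranksAboveScore(T obj) {
        int score = 0;
        for (Predicate<T> c : conditions) {
            if (c.test(obj)) score++;  // one point per matched condition
        }
        return score >= allowedScore;
    }
}
```

The caller never sees the 51 conditions; it only asks the yes/no question, which keeps the scoring logic in one place and independently testable.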
 
LVL 5

Expert Comment

by:DizzyDiz
ID: 17799037
Regarding your current approach:

1) Create my connection
2) Determine number of loop iterations needed to insert records in lots of 1000.
3) Loop over code that sets my prepared statement and query string.
4) Execute the prepared statement

All the steps involve details. None of them explains or focuses on your overall goal. Focusing on the details (connections, number of iterations, query strings and prepared statements) misses the point. The point is:

>For a given item, iterate all business entities. Score associations on each using lots of (51) conditions and, if the score is more than the magic number, save the entity.

I understand your timeline, but you should understand that taking the wrong approach takes more time than taking the right approach. I'll do you a favor and stub out all of the high-level ideas. From there you can play "fill in the blank", and then it becomes a matter of "connect the dots". Here's an example that took me just 15 minutes to stub out:

public interface Mapper {
    Object mapCurrentResult(ResultSet current);
}

public interface ResultFilter {
    void accept(Object obj);
}

public interface QueryService {

    //each of the following methods returns "this" after setting internal fields with the given parameters
    QueryService useQuery(String query);

    QueryService useMapper(Mapper mapper);

    QueryService useResultFilter(ResultFilter resultFilter);

    //This method
    // uses the preset query,
    // executes it,
    // iterates the results,
    // and calls the Mapper first then the ResultFilter on each iteration,
    Collection execute();
}

public class BusinessObjectRepository implements Mapper
{
    Collection findAllBusinessObjects()
    {
        return getQueryService()
                .useQuery(getQuery())
                .useMapper(this)
                .useResultFilter(new ResultFilter()
                    {
                        public void accept(Object obj)
                        {
                        }
                    }
                )
                .execute();
    }

    public Object mapCurrentResult(ResultSet current)
    {
        BusinessObject obj = new BusinessObject();
        //pull values from current resultset record and set them on obj
        return obj;
    }

    /**
     * You'll want to complement this with a setter
     * @return
     */
    private QueryService getQueryService()
    { return null; }

    /**
     * This method could actually be made abstract if you wanted to be really fancy..
     * @return query necessary to pull data from tables. The data pulled is used to create business objects
     */
    protected String getQuery()
    {
        return "select * from businessobjtbl where whatever";
    }
}

From this (poor man's SpringFramework) you can start putting pieces in their proper place. The JDBC logic goes into a custom query service. That object contains only the logic to run arbitrary JDBC, iterate records and call methods on the given interfaces. I made the setter methods "self-returning", as it's a pattern I follow often. (See my blog entry on the topic: http://codeforfun.wordpress.com/2006/06/23/avoid-a-void/) They're also cleverly named so it looks like you're building the object up incrementally. 90% of the code for the query service (everything except for the result filter stuff) can be found in the SpringFramework if you're not up to writing it yourself. The QueryService is really just a framework for running SQL and turning it into business objects.

My example leaves a lot of things out. That's the fill-in-the-blank part. You "fill in" the tables involved with constructing the business objects using a SQL string. You "fill in" the logic that maps the columns in the resultset to the business object. You "fill in" implementations of all the interfaces, each of which is focused on ONE thing.

Then there's the connect-the-dots part. You write a little bit of glue code to put the pieces together. This code creates the business object repository and plugs in the filtering logic, the query service and the other interface implementations, and then the repository is ready for you to ask for the final collection. All logic is pipelined so that the result is the trimmed-down list containing only what you wanted.
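As an illustration of the glue-code step, here is a small in-memory stand-in of my own (not the real stubs above): a List of rows replaces the ResultSet, and I give ResultFilter a boolean return so it can actually drop records. A real implementation would run JDBC inside execute() instead.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical in-memory simplification of the QueryService idea.
public class Glue {
    interface Mapper { Object mapCurrentResult(Object currentRow); }
    interface ResultFilter { boolean accept(Object obj); }

    static class QueryService {
        private Mapper mapper;
        private ResultFilter filter;
        private List<Object> rows; // stand-in for a JDBC ResultSet

        QueryService useRows(List<Object> rows) { this.rows = rows; return this; }
        QueryService useMapper(Mapper m) { this.mapper = m; return this; }
        QueryService useResultFilter(ResultFilter f) { this.filter = f; return this; }

        // Map each row to a business object, keep only the accepted ones.
        Collection<Object> execute() {
            List<Object> out = new ArrayList<>();
            for (Object row : rows) {
                Object obj = mapper.mapCurrentResult(row);
                if (filter.accept(obj)) out.add(obj);
            }
            return out;
        }
    }
}
```

The caller wires the pieces together fluently, e.g. `new Glue.QueryService().useRows(rows).useMapper(...).useResultFilter(...).execute()`, which is the "connect the dots" moment: each plugged-in piece knows nothing about the others.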
 

Author Comment

by:jbaisden
ID: 17948500

I feel that I ought to add a comment to explain my point distribution. Ultimately, it was karanw's approach that solved this issue. Despite this, I cannot let all your effort go without reward, Dizzy, especially in light of the fact that karanw made only one post.

I am very appreciative of the feedback from both of you. This is a question I will come back to and reread to see everything that was stated.

Thank you both again for your time.