Solved

How to efficiently manage the processing of thousands of db records in as little time as possible?

Posted on 2012-04-07
12
574 Views
Last Modified: 2014-11-12
Currently I have thousands of records in a mysql database. Each record takes a few minutes to process but even 3 minutes multiplied by 10,000 records could take almost a month to process (i.e. 3 minutes x 10,000 records = 30,000 minutes / 60 minutes = 500 hours / 24 hours = ~21 days) and more records are coming.

In an Amazon EC2 LAMP environment, how can I accomplish this task at a much faster rate?

I could throw more instances at it but that seems inefficient and would drive up cost. I could have more scripts (or browser tabs?) processing the database but with a single database as the source I'm sure to overlap processes? Even if I split the data across databases, I'd still eventually run in to overlap the processes on that database? So maybe I need software to queue up database for each script? Anyway, what do you experts think?
0
Comment
Question by:aristanoble
  • 3
  • 2
  • 2
  • +3
12 Comments
 
LVL 17

Expert Comment

by:ramrom
ID: 37820183
Please describe the processing. Perhaps there is a way to make if more efficient.
0
 
LVL 82

Expert Comment

by:Dave Baldwin
ID: 37820202
I've never had a process on a row take even a second.  What kind of processing are you doing?
0
 
LVL 1

Author Comment

by:aristanoble
ID: 37820206
Well the length of the process is because of the API call related to that record. The call is for a large amount of date that when returned gets processed also and then on to the next record.
0
 
LVL 82

Expert Comment

by:Dave Baldwin
ID: 37820261
Details are important in analyzing a problem.  But you're not giving us any so i don't know how we can help you.  We're usually pretty good with code but terrible at guessing.
0
 
LVL 7

Expert Comment

by:designatedinitializer
ID: 37820280
You have to analyse the process.
You have to break it down in steps and assess the duration of each step, and its dependencies.
Only then you will be able to know what you can do about it.
There will be things you can speed up, others you can do simultaneously, and others you can't.

If a woman takes 9 months to give birth to 1 child, that doesn't mean 2 women can do it in 4.5 months...
0
Control application downtime with dependency maps

Visualize the interdependencies between application components better with Applications Manager's automated application discovery and dependency mapping feature. Resolve performance issues faster by quickly isolating problematic components.

 
LVL 1

Author Comment

by:aristanoble
ID: 37820281
Sorry I'm just holding back from being too talkative but here's how it works: each user currently reside in the users table.  As each record is selected, the user id is used in the API call and returns hundreds of records. Each record is then saved in a separate table and then on to the next record.
0
 
LVL 7

Expert Comment

by:designatedinitializer
ID: 37820287
API call to what?
How much time does it take, that API call?
How is the data from the API call formated, and how are you parsing it?
0
 
LVL 1

Author Comment

by:aristanoble
ID: 37820445
I got a chance to work on all of your suggestions and I was reminded of some things I knew all along; as goes the case at times. I was inserting each record rather than inserting batches of records. Things are speeding up now but I need more.

To answer your questions, I'm pulling date from the Facebook API; by iteration we obtain about 5,000 records in a php array per user and I "was" iterating over the data inserting each record but now I iterate over the data concatenating it into a variable and then I'd insert the data in chunks.
0
 
LVL 32

Accepted Solution

by:
shalomc earned 300 total points
ID: 37820737
To truly scale your operation and to be limited by budget (and not by technology) you should change your computing method to that of parallel/grid computing.

I see that you are already familiar with AWS, take a look at EMR, which is Amazon's implementation of Hadoop.

Look here http://aws.amazon.com/elasticmapreduce/  and return with questions :)
0
 
LVL 108

Assisted Solution

by:Ray Paseur
Ray Paseur earned 200 total points
ID: 37821419
It would sure help us help you if you could post the CREATE TABLE statements for all of the DB tables in question, post the code that interacts with the API and the data base, and post some of the data that your scripts fetch from the API.  Armed with some specific information we may be able to provide some immediately useful suggestions.

One suggestion may be to make the fetch-and-update into an asynchronous process.  I have used this technique when I need to provide rapid response, but also need to get information from an API.  To do this, your script will need to make a POST method request (using CURL or FSockOpen) to the asynchronous script.  You can start the process and disconnect without waiting for it to complete.  If the async tasks takes three minutes, who cares?  You can just start up a lot of them and let them run in the background.
0
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 38186533
Going forward, please consider the EE grading guidelines.  If you're not going to interact with the experts who are trying to help you, please don't give a marked-down grade.  We need some dialog in order to be able to understand what you're dealing with -- this back-and-forth dialog requires interaction between the participants.  

Thanks, ~Ray

http://www.experts-exchange.com/help/viewHelpPage.jsp?helpPageID=26
0

Featured Post

Netscaler Common Configuration How To guides

If you use NetScaler you will want to see these guides. The NetScaler How To Guides show administrators how to get NetScaler up and configured by providing instructions for common scenarios and some not so common ones.

Join & Write a Comment

Shadow IT is coming out of the shadows as more businesses are choosing cloud-based applications. It is now a multi-cloud world for most organizations. Simultaneously, most businesses have yet to consolidate with one cloud provider or define an offic…
For cloud, the “train has left the station” and in the Microsoft ERP & CRM world, that means the next generation of enterprise software from Microsoft is here: Dynamics 365 is Microsoft’s new integrated business solution that unifies CRM and ERP fun…
Explain concepts important to validation of email addresses with regular expressions. Applies to most languages/tools that uses regular expressions. Consider email address RFCs: Look at HTML5 form input element (with type=email) regex pattern: T…
Connecting to an Amazon Linux EC2 Instance from Windows Using PuTTY.

762 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

19 Experts available now in Live!

Get 1:1 Help Now