Solved

How to efficiently manage the processing of thousands of db records in as little time as possible?

Posted on 2012-04-07
12
576 Views
Last Modified: 2014-11-12
Currently I have thousands of records in a mysql database. Each record takes a few minutes to process but even 3 minutes multiplied by 10,000 records could take almost a month to process (i.e. 3 minutes x 10,000 records = 30,000 minutes / 60 minutes = 500 hours / 24 hours = ~21 days) and more records are coming.

In an Amazon EC2 LAMP environment, how can I accomplish this task at a much faster rate?

I could throw more instances at it but that seems inefficient and would drive up cost. I could have more scripts (or browser tabs?) processing the database but with a single database as the source I'm sure to overlap processes? Even if I split the data across databases, I'd still eventually run in to overlap the processes on that database? So maybe I need software to queue up database for each script? Anyway, what do you experts think?
0
Comment
Question by:aristanoble
  • 3
  • 2
  • 2
  • +3
12 Comments
 
LVL 17

Expert Comment

by:ramrom
ID: 37820183
Please describe the processing. Perhaps there is a way to make if more efficient.
0
 
LVL 83

Expert Comment

by:Dave Baldwin
ID: 37820202
I've never had a process on a row take even a second.  What kind of processing are you doing?
0
 
LVL 1

Author Comment

by:aristanoble
ID: 37820206
Well the length of the process is because of the API call related to that record. The call is for a large amount of date that when returned gets processed also and then on to the next record.
0
 
LVL 83

Expert Comment

by:Dave Baldwin
ID: 37820261
Details are important in analyzing a problem.  But you're not giving us any so i don't know how we can help you.  We're usually pretty good with code but terrible at guessing.
0
 
LVL 7

Expert Comment

by:designatedinitializer
ID: 37820280
You have to analyse the process.
You have to break it down in steps and assess the duration of each step, and its dependencies.
Only then you will be able to know what you can do about it.
There will be things you can speed up, others you can do simultaneously, and others you can't.

If a woman takes 9 months to give birth to 1 child, that doesn't mean 2 women can do it in 4.5 months...
0
Give your grad a cloud of their own!

With up to 8TB of storage, give your favorite graduate their own personal cloud to centralize all their photos, videos and music in one safe place. They can save, sync and share all their stuff, and automatic photo backup helps free up space on their smartphone and tablet.

 
LVL 1

Author Comment

by:aristanoble
ID: 37820281
Sorry I'm just holding back from being too talkative but here's how it works: each user currently reside in the users table.  As each record is selected, the user id is used in the API call and returns hundreds of records. Each record is then saved in a separate table and then on to the next record.
0
 
LVL 7

Expert Comment

by:designatedinitializer
ID: 37820287
API call to what?
How much time does it take, that API call?
How is the data from the API call formated, and how are you parsing it?
0
 
LVL 1

Author Comment

by:aristanoble
ID: 37820445
I got a chance to work on all of your suggestions and I was reminded of some things I knew all along; as goes the case at times. I was inserting each record rather than inserting batches of records. Things are speeding up now but I need more.

To answer your questions, I'm pulling date from the Facebook API; by iteration we obtain about 5,000 records in a php array per user and I "was" iterating over the data inserting each record but now I iterate over the data concatenating it into a variable and then I'd insert the data in chunks.
0
 
LVL 33

Accepted Solution

by:
shalomc earned 300 total points
ID: 37820737
To truly scale your operation and to be limited by budget (and not by technology) you should change your computing method to that of parallel/grid computing.

I see that you are already familiar with AWS, take a look at EMR, which is Amazon's implementation of Hadoop.

Look here http://aws.amazon.com/elasticmapreduce/  and return with questions :)
0
 
LVL 108

Assisted Solution

by:Ray Paseur
Ray Paseur earned 200 total points
ID: 37821419
It would sure help us help you if you could post the CREATE TABLE statements for all of the DB tables in question, post the code that interacts with the API and the data base, and post some of the data that your scripts fetch from the API.  Armed with some specific information we may be able to provide some immediately useful suggestions.

One suggestion may be to make the fetch-and-update into an asynchronous process.  I have used this technique when I need to provide rapid response, but also need to get information from an API.  To do this, your script will need to make a POST method request (using CURL or FSockOpen) to the asynchronous script.  You can start the process and disconnect without waiting for it to complete.  If the async tasks takes three minutes, who cares?  You can just start up a lot of them and let them run in the background.
0
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 38186533
Going forward, please consider the EE grading guidelines.  If you're not going to interact with the experts who are trying to help you, please don't give a marked-down grade.  We need some dialog in order to be able to understand what you're dealing with -- this back-and-forth dialog requires interaction between the participants.  

Thanks, ~Ray

http://www.experts-exchange.com/help/viewHelpPage.jsp?helpPageID=26
0

Featured Post

PRTG Network Monitor: Intuitive Network Monitoring

Network Monitoring is essential to ensure that computer systems and network devices are running. Use PRTG to monitor LANs, servers, websites, applications and devices, bandwidth, virtual environments, remote systems, IoT, and many more. PRTG is easy to set up & use.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Monitoring systems evolution, cloud technology benefits and cloud cost calculators business utility.
Cloud-based technologies and services will continue to grow in popularity in 2017 thanks to the simple, scalable and cost-effective solutions they deliver. Here are three areas where cloud adoption is poised to really take off.
Steps to create a PostgreSQL RDS instance in the Amazon cloud. We will cover some of the default settings and show how to connect to the instance once it is up and running.
Both in life and business – not all partnerships are created equal. Spend 30 short minutes with us to learn:   • Key questions to ask when considering a partnership to accelerate your business into the cloud • Pitfalls and mistakes other partners…

911 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

22 Experts available now in Live!

Get 1:1 Help Now