Getting page title from 20,000 pages in an hour

Last Modified: 2012-05-07
If I had a lot of URLs [15k-20k] that I wanted to get the page titles for (like google.com, experts-exchange.com, etc.), how fast could they be processed?

What I assume I'd do, and correct me if I'm wrong, is do an fgets (or something to the same effect) until I reach the title tags in the external document so I don't have to pull everything. Then I'd extract it from there and move on to the next one in queue.
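That partial-read idea can be sketched in PHP roughly like this (a minimal sketch, not anything from the thread; the helper names and the 8 KB cap are assumptions):

```php
<?php
// Read a page only until </title> appears, so the whole document never
// has to be downloaded. Requires allow_url_fopen for fopen() on URLs.

// Pure helper: pull the title out of whatever HTML is buffered so far.
function extract_title(string $html): ?string {
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        return trim(html_entity_decode($m[1]));
    }
    return null; // no complete <title>...</title> pair yet
}

// Read at most $maxBytes from the URL, bailing out early once the
// closing </title> tag shows up in the buffer.
function fetch_title(string $url, int $maxBytes = 8192): ?string {
    $fp = @fopen($url, 'r');
    if ($fp === false) {
        return null; // unreachable host, bad URL, etc.
    }
    $buf = '';
    while (!feof($fp) && strlen($buf) < $maxBytes) {
        $buf .= fgets($fp, 1024);
        if (stripos($buf, '</title>') !== false) {
            break; // got what we came for; skip the rest of the page
        }
    }
    fclose($fp);
    return extract_title($buf);
}
```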

If this takes 3 seconds on the average site, that means the most I could process with one script running for an hour would be about 1,200 (20 per minute). So if I needed to get more than that (10k, 15k, 20k), how many of these processes could I run simultaneously before I destroy my server or notice a huge lag in regular content being served?

I understand this all depends on how awesome your hardware and your connection are, etc., but for the sake of example, let's say this is being done on some sort of average 1and1, Rackspace, or MT dedicated business server. Nothing too fancy, but it's also not a dinosaur.

Also, is there any other way besides this that I'm not thinking about?


I would recommend that, in the database you are writing the titles into, you check only the rows that have not yet been updated. That way you do not have a problem if your script crashes or times out.
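One way to sketch that "skip already-updated rows" idea, assuming a MySQL table `urls(id, url, title, claimed_at)` (the table, columns, and batch sizes are all assumptions, not from the thread): a worker stamps `claimed_at` when it takes a batch, so a crashed or timed-out run simply leaves a stale stamp that the next run treats as unclaimed.

```php
<?php
// Pure helper so the query text can be inspected without a database.
// Integers are interpolated directly; real code should validate them.
function claim_sql(int $batch, int $staleMinutes): string {
    return "UPDATE urls SET claimed_at = NOW() " .
           "WHERE title IS NULL " .
           "AND (claimed_at IS NULL " .
           "OR claimed_at < NOW() - INTERVAL $staleMinutes MINUTE) " .
           "LIMIT $batch";
}

// Usage with PDO (MySQL): stamp a batch as claimed, then process the
// rows where title IS NULL and claimed_at was just set.
function claim_batch(PDO $db, int $batch = 200, int $staleMinutes = 15): void {
    $db->exec(claim_sql($batch, $staleMinutes));
}
```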

In my opinion you can easily test that: run the script for 200 URLs and check how long it takes, then run two at the same time and see if the performance drops.

It is VERY hard to say otherwise, since the requested pages might be slow, outdated, etc.

Make sure that the different scripts process the data in a different order, so the database requests do not hit a locked record at every URL.

Also be very careful about the number of queries you send to a given site or IP: if you send too many in a short time, it might trigger a defensive response from the target if interpreted as an attack (i.e., DDoS, spam, or other).
You might consider requesting, say, fewer than 10 pages per minute from any given site, preferably at random intervals so it is not too easily identified as a robot.
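That per-site throttle could be sketched like this (class and method names are assumptions; the 6-12 second default gap keeps it under ~10 pages/minute per host, with randomization so the pattern doesn't look robotic):

```php
<?php
class HostThrottle {
    private $lastHit = [];   // host => last-hit time in milliseconds
    private $minGapMs;
    private $maxGapMs;

    public function __construct(int $minGapMs = 6000, int $maxGapMs = 12000) {
        $this->minGapMs = $minGapMs;
        $this->maxGapMs = $maxGapMs;
    }

    // Milliseconds still to wait before touching $host again (0 = go).
    public function delayFor(string $host, float $nowMs): int {
        if (!isset($this->lastHit[$host])) {
            return 0; // never seen this host: no wait
        }
        $gap = rand($this->minGapMs, $this->maxGapMs); // randomized interval
        return max(0, (int)($this->lastHit[$host] + $gap - $nowMs));
    }

    public function record(string $host, float $nowMs): void {
        $this->lastHit[$host] = $nowMs;
    }
}
```

A fetch loop would call `delayFor()`, `usleep()` that long if nonzero, fetch the page, then call `record()`.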
I can confirm that issue. If you are accessing sites that are, without your knowledge, at the same hosting provider, you will be blocked from accessing the target IP for a certain time. Google does this, for example.

I wrote a script to retrieve Google PageRank for 1,500 websites and scaled it down to do only a handful (10 or so) every hour, always the ones that had gone longest without an update. That way I know I get the latest figure for every site every 2-3 days.
HonorGod (Software Engineer)

Along these lines, you might want to consider doing a DNS lookup for the IP address of the hostname.
There are times when disparate hostnames are actually provided by the same server.
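Grouping URLs by resolved IP, so rate limits apply per server rather than per hostname, might look like this (a sketch; the function name is an assumption):

```php
<?php
// Resolve each hostname once and bucket URLs by IP, since virtual hosts
// with different names often live on the same machine.
function group_by_ip(array $urls): array {
    $groups = [];
    foreach ($urls as $url) {
        $host = parse_url($url, PHP_URL_HOST);
        if ($host === null || $host === false) {
            continue; // unparseable URL or no host part; skip it
        }
        // gethostbyname() returns the input unchanged on lookup failure,
        // so failed lookups just group under the bare hostname.
        $ip = gethostbyname($host);
        $groups[$ip][] = $url;
    }
    return $groups;
}
```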

You also might want to consider using something like cURL to perform the actual request.

You probably want to run this as a cron task... you could also look at threading.


It seems like everyone is missing the actual question here...

@tokyoahead Yes, I check them. I mark a batch for processing with a date; then other threads pull ones that aren't marked, to avoid overlapping. If the date is older than x minutes, it means the row didn't get processed, and it falls back into the queue. I DID do speed tests; that's why my question has numbers/times in it.

@fibo 15k-20k different URLs. I'm not crashing anyone. If it's in the database cache, it won't process it again.

@HonorGod Why would I do a DNS lookup? I'm using cURL to get the actual location, assuming some sites have frames, some redirect, etc. Then grabbing the final destination from there using a gradual request so I can get the title in 2k instead of loading the entire 100k from a site.
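That cURL approach (follow redirects to the final destination, then abort once the title has arrived so only a couple of KB get pulled) could be sketched like this — function names are assumptions:

```php
<?php
// Pure helper: extract the title from whatever HTML was buffered.
function title_from_html(string $html): ?string {
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        return trim(html_entity_decode($m[1]));
    }
    return null;
}

function curl_fetch_title(string $url, int $timeout = 5): ?string {
    $buf = '';
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_FOLLOWLOCATION => true,  // chase redirects to the real page
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_CONNECTTIMEOUT => $timeout,
        CURLOPT_TIMEOUT        => $timeout,
        CURLOPT_WRITEFUNCTION  => function ($ch, $chunk) use (&$buf) {
            $buf .= $chunk;
            // Returning fewer bytes than we were handed makes cURL abort
            // the transfer; we do that once the title tag is closed.
            return stripos($buf, '</title>') !== false ? 0 : strlen($chunk);
        },
    ]);
    curl_exec($ch); // a "write error" result here is expected on early abort
    curl_close($ch);
    return title_from_html($buf);
}
```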

@BrianMM I have it running as a cron job twice an hour. But I'm actually just going to turn it into a never-ending do...while statement and run multiple at once.

The question seems to have been missed. It was how many of these do...while threads can I run simultaneously until my machine gets unstable or noticeably slow?

<<But I'm actually just gonna turn it in a never ending do...while statement and run multiple at once.>>
I'm not sure the system manager won't kill your job after, say, 1 hour of continuous processing. It should, if you are on a shared server, just in case your program were looping infinitely (which, by the way B-), it is precisely doing!).

So the (multiple) cron solution might be a better bet.
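A middle ground between the two: rather than a truly endless do...while, each cron-launched worker loops only until a wall-clock deadline and then exits cleanly, so a shared host never sees a runaway process. A minimal sketch (names and the 25-minute window are assumptions):

```php
<?php
// Loop over work batches until $seconds of wall-clock time have passed,
// then return how many batches ran.
function run_until(int $seconds, callable $work): int {
    set_time_limit(0);               // rely on our own deadline, not PHP's
    $deadline = time() + $seconds;
    $batches = 0;
    do {
        $work();                     // e.g. claim and process one batch of URLs
        $batches++;
    } while (time() < $deadline);
    return $batches;
}

// A twice-hourly cron entry could then call run_until(25 * 60, $work),
// leaving a few minutes of slack before the next launch.
```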
Another issue that you did not specifically mention: launching multiple parallel cURL queries.

Some tests should help you find a safe value for the number of simultaneous URLs.

Since you will run several scripts in parallel (e.g., overlapping cron scripts), you might need to define a strategy so they inspect different sites when you launch your first searches. Assuming I had 5 spiders running in parallel, I would get a sorted list of unvisited sites; spider 1 would take its targets from the first 20%... spider 5 from the last 20%.

For the next passes, I would use a similar strategy, but instead of drawing from non-visited sites I would draw from the 1,000 oldest, e.g., spider 1 running over the 200 oldest sites, spider 2 over sites 201-400, etc.
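The slicing in both schemes is the same arithmetic: spider k of n takes the k-th equal slice of the sorted work list, so five spiders each get 20% and never overlap. A sketch (the function name is an assumption):

```php
<?php
// Return the slice of $sorted assigned to 1-based $spider out of $spiders.
function slice_for(array $sorted, int $spider, int $spiders): array {
    $per = (int)ceil(count($sorted) / $spiders);
    return array_slice($sorted, ($spider - 1) * $per, $per);
}
```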

I would also probably install some stats system from the start, so that after a few days you can adjust your tuning.



cURL is a non-issue. The other stuff has already been said:

"let's say this is being done on some sort of average 1and1, Rackspace, or MT dedicated"

"Yes, I check them out. I mark x amount for processing with a date, then other threads pull ones that aren't marked to avoid overlapping"
> The question seems to have been missed. It was how many of these do...while threads can I run simultaneously until my machine gets unstable or noticeably slow?

No one will be able to tell you. This ONLY depends on your system and what else runs on it.

Try it. You will NOT get an answer here such as "20 and not more".



"let's say this is being done on some sort of average 1and1, Rackspace, or MT dedicated business server" #average #modern #dedicated #business-server

Besides, we're talking ballpark here. I'm not saying, "Are you sure I can't squeeze one more process?" I want to know how intensive this sort of thing is. Like, is 10 easy? Is 100 normal or is 100 way too much? These are simple questions...

You're telling me that when someone asks their IT guy, "Hey, can we run 200,000,000 simultaneous scripts without slowing everything down?" He'll say, "There's no way for me to answer such a question." ?

This is posted in servers so people *with experience* can respond saying, "In my experience, I've tried to do something similar but after 100 processes, I noticed a drag in performance."

Just because you, personally, don't know the answer does not make it unanswerable.
Sure, I agree with you. But the chances that you'll find someone here with that experience and a benchmark figure to lay on you are slim. Waiting for that will take longer than the 30 minutes it takes to test it yourself.

Well Matt, you seem to already have your ballpark figure of 100 processes.

Some parameters that might impact this: the amount of RAM used (you probably don't need much).

You probably have some shared host on which you can place your script... what happens when you are running 100 queries?


@fibo Well, I guess that was my question: how expensive are these sorts of processes, in regard not just to RAM but also to the connection? I was justifying it to myself like this: if you're in an office and a lot of people are streaming music at the same time, visiting multiple webpages, or even watching movies, it's still fairly fast. But then again, an office connection doesn't have incoming requests saying "serve me this page." 100 is a total guess on my part; I was hoping someone would have some knowledge as to what is more reasonable. If a PHP script that loops to request external websites only takes 0.05% of memory (again, on a standard dedicated business machine [not shared]), then I suppose I could run many. That's what I was aiming to find out.

@tokyoahead I don't think the chances of finding that figure SHOULD be slim. This is Experts-Exchange, where even the most difficult questions get answered. If it's supposedly filled with experts, I imagine 1, 2, or 50 of them have experienced something similar in their years of working on this. While, yes, I could've set up a test in the time it took to get to this conclusion (which I haven't even reached), the point of posting on a forum was to run into someone who knows, instead of debating whether or not I should be able to find the answer here.

Hmmm... so you will be running your script from your own station or server. I would probably write the script with tuning in mind, i.e., making sure that every number is easy to change (the number of sites to test in this session, the number of sites in the multi-cURL), and that times and successes are recorded.
Then run with dichotomy: 100, 200, 400... until it fails, e.g., at 400; then test at (200 [last successful] + 400) / 2.
I would repeat that until I get the interval down to 10%. This should give you a rough figure, the significance of which is in fact unknown, since there are too many outside factors (your bandwidth, other users' activity, the time of day). Taking a safety margin of 30-50% should then give you a 100% success rate (though maybe 30% would give you a fair-enough 90% success rate).
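That doubling-then-bisecting search can be sketched as follows, with the actual load test abstracted behind a callback so the search logic stands alone (all names are assumptions; `$probe` stands in for "run the script at this process count and report success"):

```php
<?php
function find_limit(callable $probe, int $start = 100, float $tolerance = 0.10): int {
    // Phase 1: double (100, 200, 400, ...) until the first failure.
    $lo = 0;          // highest known-good process count
    $hi = $start;     // current candidate / first known-bad count
    while ($probe($hi)) {
        $lo = $hi;
        $hi *= 2;
    }
    // Phase 2: bisect until the interval shrinks below the tolerance.
    while (($hi - $lo) > $tolerance * $hi) {
        $mid = intdiv($lo + $hi, 2);
        if ($probe($mid)) {
            $lo = $mid;
        } else {
            $hi = $mid;
        }
    }
    return $lo; // rough figure; apply the 30-50% safety margin on top
}
```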