Solved

Getting page title from 20,000 pages in an hour

Posted on 2009-06-29
15
265 Views
Last Modified: 2012-05-07
If I had a lot of URLs [15k-20k] that I wanted to get the page title for... like google.com, experts-exchange.com, etc... How fast can they be processed?

What I assume I'd do, and correct me if I'm wrong, is do an fgets (or something to the same effect) until I reach the title tags in the external document so I don't have to pull everything. Then I'd extract it from there and move on to the next one in queue.
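
Something like this rough, untested sketch is what I'm picturing (the function name, chunk size, and cutoff are just placeholders):

<?php
// Read the page in small chunks and stop as soon as </title> shows up,
// so we never download the whole document.
function fetch_title($url)
{
    $fp = @fopen($url, 'r');              // needs allow_url_fopen = On
    if (!$fp) {
        return false;
    }
    $buffer = '';
    while (!feof($fp)) {
        $buffer .= fgets($fp, 2048);      // pull ~2k at a time
        if (stripos($buffer, '</title>') !== false) {
            break;                        // got what we need, stop here
        }
        if (strlen($buffer) > 65536) {    // no title in the first 64k? give up
            break;
        }
    }
    fclose($fp);
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $buffer, $m)) {
        return trim(html_entity_decode($m[1]));
    }
    return false;
}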

If this takes 3 seconds on the average site, that means the most I could process with one script running for an hour would be about 1,200 (20 per minute). So if I needed to get more than that (10k, 15k, 20k), how many of these processes could I run simultaneously before I destroy my server or notice a huge lag in regular content being served?

I understand this all depends on how awesome your hardware and your connection are, etc., but for the sake of example, let's say this is being done on some sort of average 1and1, Rackspace, or MT dedicated business server. Nothing too fancy, but it's also not a dinosaur.

Also, is there any other way besides this that I'm not thinking about?

Thanks!
Question by:MattKenefick
15 Comments
 
LVL 4

Expert Comment

by:tokyoahead
ID: 24742822
I would recommend checking, in the database you are writing the titles into, which entries have not yet been updated and processing only those. That way you do not have a problem if your script crashes or times out.

In my opinion you can easily test this: run the script for 200 URLs and check how long it takes, then run two instances at the same time and see if the performance drops.

It is VERY hard to say otherwise, since the requested pages might be slow, outdated, etc.

Make sure that the different scripts process the data in a different order, so the database requests do not hit a locked record at every URL.
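
A minimal sketch of what I mean, assuming a hypothetical urls table with id, url and title columns (all names are made up):

<?php
// Only pick rows whose title has not been stored yet, so a crash or a
// timeout simply leaves the remaining rows for the next run.
// ORDER BY RAND() makes each parallel script walk the list in a different
// order, so two scripts are less likely to collide on the same locked row.
$pdo  = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'pass');
$rows = $pdo->query(
    "SELECT id, url FROM urls WHERE title IS NULL ORDER BY RAND() LIMIT 200"
)->fetchAll(PDO::FETCH_ASSOC);

$update = $pdo->prepare("UPDATE urls SET title = ? WHERE id = ?");
foreach ($rows as $row) {
    // Crude title grab just to keep the sketch self-contained; plug in
    // whatever partial-read routine you end up using instead.
    $html = @file_get_contents($row['url'], false, null, 0, 65536);
    if ($html !== false && preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        $update->execute(array(trim($m[1]), $row['id']));
    }
}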
 
LVL 29

Expert Comment

by:fibo
ID: 24743437
Also be very careful about the number of queries you send to a given site or IP: if you send too many in a short time, it might trigger a defensive response from the target if it is interpreted as an attack (i.e., DDoS, spam, or similar).
You might consider requesting, say, fewer than 10 pages per minute from any given site, and preferably at random intervals so it is not too easily identified as a robot.
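
A rough sketch of that kind of per-host throttling (the numbers are just the ones above, and the function is mine, not something standard):

<?php
// Wait long enough between two requests to the same host that we stay
// under ~10 pages per minute, with a random gap so it looks less robotic.
function polite_delay($url)
{
    static $lastHit = array();            // host => unix time of last request
    $host = parse_url($url, PHP_URL_HOST);
    if (!$host) {
        return;
    }
    $minGap = 6 + mt_rand(0, 6);          // 6-12 seconds between hits
    if (isset($lastHit[$host])) {
        $wait = $lastHit[$host] + $minGap - time();
        if ($wait > 0) {
            sleep($wait);
        }
    }
    $lastHit[$host] = time();
}
// Usage: call polite_delay($url) right before each request.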
 
LVL 4

Expert Comment

by:tokyoahead
ID: 24743464
I can confirm that issue. If you are accessing sites that, without your knowledge, are hosted by the same provider, you will be blocked from accessing the target IP for a certain time. Google does this, for example.

I wrote a script to retrieve Google PageRank for 1,500 websites and scaled it down to do only a handful (10 or so) every hour, always taking the ones that had gone the longest without an update. That way I know I get a fresh figure for every site every 2-3 days.
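
Roughly speaking, that "oldest first" selection boils down to something like this (table and column names are made up):

<?php
// Each hourly run takes only the handful of entries that have waited the
// longest since their last update.
$pdo   = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'pass');
$batch = $pdo->query(
    "SELECT id, url FROM sites ORDER BY last_checked ASC LIMIT 10"
)->fetchAll(PDO::FETCH_ASSOC);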
 
LVL 41

Expert Comment

by:HonorGod
ID: 24744517
Along these lines, you might want to consider doing a DNS lookup for the IP address of the hostname.
There are times when disparate hostnames are actually provided by the same server.

You also might want to consider using something like cURL to perform the actual request.
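
A small sketch of the DNS idea, with made-up function names: resolve each hostname up front so URLs that turn out to share a server/IP can be grouped and spaced out rather than hit back to back.

<?php
// Group URLs by the IP their hostname resolves to.
function group_by_ip(array $urls)
{
    $groups = array();
    foreach ($urls as $url) {
        $host = parse_url($url, PHP_URL_HOST);
        // gethostbyname() returns the hostname unchanged when it cannot resolve.
        $ip = $host ? gethostbyname($host) : 'unresolved';
        $groups[$ip][] = $url;
    }
    return $groups;
}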
 
LVL 11

Expert Comment

by:BrianMM
ID: 24745967
You probably want to run this as a cron task... you could also look at threading.
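
For example, a crontab entry along these lines (the path is just a placeholder) would launch a PHP script every 30 minutes:

*/30 * * * * /usr/bin/php /path/to/fetch_titles.php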
 
LVL 4

Author Comment

by:MattKenefick
ID: 24746194
It seems like everyone is missing the actual question here...

@tokyoahead Yes, I check them out. I mark x amount for processing with a date, then other threads pull ones that aren't marked to avoid overlapping. If the date is beyond x minutes, it means it didn't get processed and falls back into the loop. I DID do speed tests, that's why my question has numbers/times in it.

@fibo 15k-20k different URLs. I'm not crashing anyone. If it's in the database cache, it won't process it again.

@HonorGod Why would I do a DNS lookup? I'm using cURL to get the actual location, assuming some sites have frames, some redirect, etc. Then grabbing the final destination from there using a gradual request so I can get the title in 2k instead of loading the entire 100k from a site.
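
Something like this simplified, untested sketch is close to what I mean; cURL chases the redirects and the transfer is aborted as soon as </title> shows up, so only a couple of KB come down instead of the whole page:

<?php
function curl_title($url)
{
    $buffer = '';
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow redirects to the final page
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_setopt($ch, CURLOPT_WRITEFUNCTION, function ($ch, $chunk) use (&$buffer) {
        $buffer .= $chunk;
        if (stripos($buffer, '</title>') !== false) {
            return 0;   // returning a short count aborts the transfer early
        }
        return strlen($chunk);
    });
    curl_exec($ch);     // reports a write error when we abort on purpose; that's fine
    curl_close($ch);
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $buffer, $m)) {
        return trim($m[1]);
    }
    return false;
}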

@BrianMM I have it running as a cron twice an hour. But I'm actually just gonna turn it into a never-ending do...while statement and run multiple at once.

The question seems to have been missed. It was how many of these do...while threads can I run simultaneously until my machine gets unstable or noticeably slow?
 
LVL 29

Expert Comment

by:fibo
ID: 24746619
<<But I'm actually just gonna turn it into a never-ending do...while statement and run multiple at once.>>
I'm not sure the system manager won't kill your job after, say, 1 hour of continuous processing. It should if you are on a shared server, just in case your program were looping infinitely (which, by the way B-) is precisely what it is doing!).

So the (multiple) cron solution might be a better bet.
Another issue that you did not specifically mention: launching multiple parallel cURL queries:
http://www.php.net/manual/en/function.curl-multi-init.php
http://www.php.net/manual/en/function.curl-multi-exec.php

Some tests should help you find a safe value for the number of simultaneous URLs.

Since you will run several scripts in parallel (e.g., overlapping cron scripts), you might need to define a strategy so they inspect different sites when you launch your first searches. Assuming I had 5 spiders running in parallel, I would take a sorted list of unvisited sites: spider 1 would take its targets from the first 20%... spider 5 from the last 20%.

For the next passes, I would use a similar strategy, but instead of working from unvisited sites I would do it on the 1,000 oldest entries, e.g., spider 1 running over the 200 oldest sites, spider 2 on sites 201 to 400, etc.

I would also probably install some kind of stats system from the start, so that after a few days you can adjust your tuning.
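
A minimal sketch built on those two functions, just to show the shape of it (the options are arbitrary and the title extraction is left out for brevity):

<?php
// Fetch a batch of URLs in parallel with the curl_multi API.
function fetch_batch(array $urls)
{
    $mh = curl_multi_init();
    $handles = array();
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until none are still active.
    $active = null;
    do {
        curl_multi_exec($mh, $active);
        curl_multi_select($mh, 1.0);   // avoid busy-waiting
    } while ($active > 0);

    $bodies = array();
    foreach ($handles as $url => $ch) {
        $bodies[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $bodies;
}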
 
LVL 4

Author Comment

by:MattKenefick
ID: 24746784
@fibo

cURL is a non-issue. The other stuff has already been said:

"let's say this is being done on some sort of average 1and1, Rackspace, or MT dedicated"

"Yes, I check them out. I mark x amount for processing with a date, then other threads pull ones that aren't marked to avoid overlapping"
 
LVL 4

Expert Comment

by:tokyoahead
ID: 24746821
> The question seems to have been missed. It was how many of these do...while threads can I run simultaneously until my machine gets unstable or noticeably slow?

No one will be able to tell you. This ONLY depends on your system and what else runs on it.

Try it. You will NOT get an answer here such as "20 and not more".
 
LVL 4

Author Comment

by:MattKenefick
ID: 24746893
@tokyoahead

"let's say this is being done on some sort of average 1and1, Rackspace, or MT dedicated business server" #average #modern #dedicated #business-server

Besides, we're talking ballpark here. I'm not saying, "Are you sure I can't squeeze one more process?" I want to know how intensive this sort of thing is. Like, is 10 easy? Is 100 normal or is 100 way too much? These are simple questions...

You're telling me that when someone asks their IT guy, "Hey, can we run 200,000,000 simultaneous scripts without slowing everything down?" he'll say, "There's no way for me to answer such a question"?

This is posted in servers so people *with experience* can respond saying, "In my experience, I've tried to do something similar but after 100 processes, I noticed a drag in performance."

Just because you, personally, don't know the answer does not make it unanswerable.
 
LVL 4

Expert Comment

by:tokyoahead
ID: 24747089
Sure, I agree with you. But the chances that you will find someone who has that experience, with a benchmark figure to lay down for you, are slim. Waiting for that will take you longer than the 30 minutes it takes to test it yourself.
 
LVL 29

Expert Comment

by:fibo
ID: 24747090
Well Matt, you already seem to have your ballpark figure of 100 processes.

Some parameters that might impact this: the amount of RAM used (you probably don't need much).

You probably have some shared host on which you can place your script... what happens when you run 100 queries there?
 
LVL 4

Author Comment

by:MattKenefick
ID: 24759993
@fibo Well, I guess that was my question: how expensive are these sorts of processes, in terms of not just RAM but also the connection? I was justifying it to myself like this: if you're in an office and a lot of people are streaming music at the same time, visiting multiple webpages, or even watching movies, it's still fairly fast. But then again, an office connection doesn't have incoming requests for "Serve me this page." 100 is a total guess from my point of view; I was hoping someone would have some knowledge of what is more reasonable. If a PHP script that loops to request external websites only takes 0.05% of memory (again, on a standard dedicated business machine [not shared]), then I suppose I could run many. That's what I was aiming to find out.

@tokyoahead I don't think the chances of finding that figure SHOULD be slim. This is Experts-Exchange, where even the most difficult questions get answered. If it's supposedly filled with experts, I imagine 1, 2, or 50 of them have experienced something similar in their years of working on this. While, yes, I could've set up a test in the time it took to get to this conclusion (which I still haven't reached)... the point of posting on a forum was to run into someone who knows, instead of debating whether or not I should be able to find it here.
 
LVL 29

Expert Comment

by:fibo
ID: 24761492
Hmmm... so you will be running your script from your own station or server... I would write the script with tuning in mind, i.e., making sure that every number is easy to change (number of sites to test in this session, number of sites in the multi-cURL batch), and that times and successes are recorded.
Then run by dichotomy: 100, 200, 400... until it fails, e.g., at 400; then test at (200 (last successful) + 400) / 2.
I would repeat that until the interval is within 10%. This gives you a raw figure, the significance of which is in fact unknown since there are too many outside factors (your bandwidth, other users' activity, time of day). Taking a safety margin of 30-50% should then give you a 100% success rate (though maybe a 30% margin would give you a fair-enough 90% success rate).
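
A quick sketch of that tuning loop, just to make the procedure concrete. run_batch() is a hypothetical stand-in for "launch $n parallel fetches and report whether they all completed acceptably"; its dummy body is only there so the sketch runs as written.

<?php
function run_batch($n)
{
    return $n <= 300;   // placeholder threshold; replace with a real test
}

// Phase 1: double until it fails (100, 200, 400, ...).
$low = 0;
$n = 100;
while (run_batch($n)) {
    $low = $n;
    $n *= 2;
}
$high = $n;

// Phase 2: bisect between the last success and the first failure
// until the interval shrinks to about 10%.
while ($high - $low > 0.10 * $high) {
    $mid = (int) (($low + $high) / 2);
    if (run_batch($mid)) {
        $low = $mid;
    } else {
        $high = $mid;
    }
}

// Apply the 30-50% safety margin to the raw figure.
echo "raw maximum ~ $low, safe setting ~ " . (int) ($low * 0.6) . "\n";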
 
LVL 4

Accepted Solution

by:
MattKenefick earned 0 total points
ID: 24765951
I got the answer from someone else in person.

This is closed.
