Solved

Creating a script like cutestat

Posted on 2015-01-29
Last Modified: 2016-05-27
Hi,

I am looking for advice on developing a script similar to cutestat. How should I proceed, and how should I crawl pages and websites? Should I fetch pages with PHP cURL and store the HTML output (or part of it) in a database, or is there a better practice?

Later on I will do custom digging and run regex patterns against the database, but initially I just want to start with 1 million domains. What is the fastest way to get the HTML of all those million domains/sites?

Is PHP efficient enough, or do I have to use another crawler?

Regards
Question by:fahadalam
8 Comments
 
LVL 13

Expert Comment

by:Andrew Derse
ID: 40578438
Take a look at this page.

http://stackoverflow.com/questions/8316818/login-to-website-using-python/8316989#8316989

My question for you is, do you have the million domain names yet?

Once you have the information, you can most definitely store it in a database; that is the easiest way to get stats from it later.
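As a rough illustration of the "store it in a database" idea, here is a minimal sketch in Python using the standard-library sqlite3 module. The table name `pages`, its columns, and the helper names `open_store`, `save_page`, and `pages_matching` are assumptions made up for this example, not anything from cutestat; a real crawl of a million domains would more likely use a server database, but the shape is the same: one row per domain, with the raw HTML kept for later pattern matching.

```python
import sqlite3
from datetime import datetime, timezone


def open_store(path=":memory:"):
    """Open (or create) a SQLite database with one row per crawled domain."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS pages (
            domain     TEXT PRIMARY KEY,
            status     INTEGER,
            html       TEXT,
            fetched_at TEXT
        )
        """
    )
    return conn


def save_page(conn, domain, status, html):
    """Insert a page, or overwrite the previous copy for the same domain."""
    conn.execute(
        "INSERT OR REPLACE INTO pages (domain, status, html, fetched_at) "
        "VALUES (?, ?, ?, ?)",
        (domain, status, html, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()


def pages_matching(conn, needle):
    """Return the domains whose stored HTML contains the given substring."""
    rows = conn.execute(
        "SELECT domain FROM pages WHERE instr(html, ?) > 0 ORDER BY domain",
        (needle,),
    )
    return [row[0] for row in rows]
```

Keeping the raw HTML means the later "custom digging" can be re-run with new patterns without fetching every site again.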
 
LVL 13

Expert Comment

by:Andrew Derse
ID: 40578442
One more site that might be helpful.

http://www.crawl-anywhere.com/
 
LVL 109

Expert Comment

by:Ray Paseur
ID: 40579650
Is PHP efficient enough?  Yes.  PHP powers Facebook, so unless your processing requirements exceed Facebook's, you will be OK with PHP.  You can use cURL to read the HTML.  Hopefully your site will be more accurate than cutestat, because that thing is wa-a-ay off base!

Here is where you're going to run into a problem.  Most sites of any importance are not HTML any more.  They use a little HTML for a document framework, but the content is loaded dynamically by JavaScript and AJAX calls to web APIs.  To see what you're up against, make a Google search for anything, then use your browser's "view source."  The source document that can be read with cURL is not what you're seeing on the browser screen.
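One way to get a feel for this problem on already-fetched pages is to compare how much visible text the raw HTML carries against how many script tags it contains. The sketch below is in Python, using only the standard-library html.parser module. The function names and the 200-character threshold are arbitrary assumptions for illustration; this is a crude heuristic and not a reliable detector.

```python
from html.parser import HTMLParser


class _TextVsScript(HTMLParser):
    """Counts <script> tags and the characters of text outside script/style."""

    def __init__(self):
        super().__init__()
        self.script_tags = 0
        self.visible_chars = 0
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only count text that a reader could plausibly see on the page.
        if self._skip_depth == 0:
            self.visible_chars += len(data.strip())


def summarize(html):
    """Return script-tag and visible-text counts for one HTML document."""
    parser = _TextVsScript()
    parser.feed(html)
    parser.close()
    return {"script_tags": parser.script_tags,
            "visible_chars": parser.visible_chars}


def looks_script_driven(html, min_visible_chars=200):
    """Hypothetical heuristic: at least one script and very little text."""
    stats = summarize(html)
    return stats["script_tags"] > 0 and stats["visible_chars"] < min_visible_chars
```

Pages flagged this way are candidates for the case described above, where the document a plain HTTP fetch returns is mostly a framework and the content arrives later via JavaScript.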

To understand the technologies we use for web development today, learn about AngularJS or just read this article and look for the demo/ajax_captcha_client.php scripts.
http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/A_9849-Making-CAPTCHA-Friendlier-with-PHP-Image-Manipulation.html

Author Comment

by:fahadalam
ID: 40591708
Ray, you misunderstood me. I was actually asking whether PHP/cURL is strong enough to build a crawler/bot. Facebook and others aren't using PHP for this purpose; almost everywhere the advice is to use a Python crawler and PHP for the rest.
 
LVL 109

Expert Comment

by:Ray Paseur
ID: 40592698
A Google search for "webcrawler" returns about 1,700,000 results.  One of those will probably meet at least some of your needs.  The issue is not whether you can read the HTML document - that's the easy part, and in 1999 that would be all you needed to do in order to scrape a site and capture the data.  Today, the data is not in the HTML, so whatever crawler you choose has to be able to behave like a web browser: accepting and returning cookies, running JavaScript, following HTTP requests and processing the data, etc.  To me, that says two things.  One, the tool you want is a web browser, not a scraper script.  Two, the publishers of web content are tired of having their sites scraped, so they are taking steps to prevent it.  Please read the terms of service carefully before you copy and store someone's data - you could find yourself at the wrong end of a legal claim!

Best of luck with your project, ~Ray
 

Author Comment

by:fahadalam
ID: 40592879
@Ray, I understand the need for either a crawler or a web browser, and yes, I am still looking for a crawler at this stage (I won't mind leaving those AJAX-based sites unindexed).

Now, would PHP handle this requirement efficiently with cURL running in multiple instances?
 
LVL 109

Accepted Solution

by:
Ray Paseur earned 500 total points
ID: 40593294
Yes
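For readers wondering what "multiple instances" looks like in practice, here is a minimal sketch of the concurrent-fetch pattern. It is written in Python with the standard library purely for illustration; in PHP the analogous approach is the curl_multi_* family (curl_multi_init, curl_multi_add_handle, curl_multi_exec). The function names, the User-Agent string, the worker count, and the 1 MB cap are all assumptions for this example, not tuned values.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import Request, urlopen


def fetch_html(domain, timeout=10):
    """Fetch the home page of one domain; returns (status, html)."""
    req = Request("http://" + domain + "/",
                  headers={"User-Agent": "example-crawler/0.1"})
    with urlopen(req, timeout=timeout) as resp:
        body = resp.read(1_000_000)  # read at most about 1 MB per page
        return resp.status, body.decode("utf-8", errors="replace")


def crawl(domains, fetch=fetch_html, workers=50):
    """Fetch many domains concurrently; one failure never stops the batch."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map each pending future back to the domain it is fetching.
        futures = {pool.submit(fetch, d): d for d in domains}
        for future in as_completed(futures):
            domain = futures[future]
            try:
                results[domain] = future.result()
            except Exception as exc:  # DNS failure, timeout, HTTP error, ...
                results[domain] = (None, repr(exc))
    return results
```

Calling `crawl(["example.com", "example.org"])` returns a dict keyed by domain. The `fetch` parameter is injectable so the batching logic can be tried with a stub before pointing it at real sites. Most of the time in a crawl like this goes to waiting on the network rather than to the language itself, which is why running many requests in parallel matters more than the choice of PHP or Python.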