Solved

Read, collect, and analyze data/information from any website

Posted on 2011-03-07
Medium Priority
211 Views
Last Modified: 2013-11-19
Hi

I'm looking for a way to read, collect, and analyze data/information from any website.

For example: from the Amazon website I need the list of books and authors for a specific subject; from a blog site, a list of all titles and dates; or from a pizza site, all the types of pizza and their prices.

I then want to extract it to an XML file, a DB table, or any other format.

Does anyone know of an API or Java code that does this?

In fact, I need to gather and analyze the data from websites and insert it into another database.

Thanks
Question by:nmokhayesh
17 Comments
 

Author Comment

by:nmokhayesh
ID: 35059361
Can I do it using webbots?
http://www.schrenk.com/nostarch/webbots/

Any example using Java?
 
LVL 86

Expert Comment

by:CEHJ
ID: 35059591
Use a high-level API that supports XML, such as HttpUnit.
 

Author Comment

by:nmokhayesh
ID: 35059917
Thanks CEHJ.
HttpUnit is used for testing; what I need is to collect some data from a web page and analyze it so it can be inserted into a DB.
Any suggestions please?

Thanks
 
LVL 86

Expert Comment

by:CEHJ
ID: 35059980
You can use HttpUnit for all sorts of purposes. Importantly in your case, it can produce an XHTML DOM which you can then query (possibly with XPath) to get the data you need.
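To illustrate the XPath half of that approach, here is a minimal sketch using only the JDK's built-in DOM and XPath support. In real use the `Document` would come from HttpUnit (its `WebResponse` can hand you a DOM); the hard-coded XHTML fragment and the `class='title'` markup below are made-up sample data so the example is self-contained.

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XPathExtract {

    /** Runs an XPath query over an XML/XHTML string and returns the matched text values. */
    public static List<String> extract(String xhtml, String expr) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(xhtml.getBytes("UTF-8")));

        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
            .evaluate(expr, doc, XPathConstants.NODESET);

        List<String> result = new ArrayList<String>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // In practice this markup would come from the fetched page's DOM;
        // a small hard-coded fragment keeps the example self-contained.
        String xhtml = "<html><body>"
            + "<div class='book'><span class='title'>Thinking in Java</span></div>"
            + "<div class='book'><span class='title'>Effective Java</span></div>"
            + "</body></html>";

        // The class names in the expression are of course site-specific.
        for (String title : extract(xhtml, "//span[@class='title']")) {
            System.out.println(title);
        }
    }
}
```

Note this only works once you have well-formed XHTML; real-world HTML usually needs tidying first, which is exactly what HttpUnit's DOM step (or a tool like Web-Harvest) does for you.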
 
LVL 92

Expert Comment

by:objects
ID: 35061707
Many sites, such as Amazon, provide an API so you don't need to scrape the pages:
http://aws.amazon.com/

For scraping, try Web-Harvest:
http://web-harvest.sourceforge.net/
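For a flavour of what Web-Harvest looks like, here is a minimal configuration sketch (the URL, XPath expression, and output path are all illustrative; check the Web-Harvest documentation for the full processor set):

```xml
<config charset="UTF-8">
    <!-- Fetch the page, tidy it into well-formed XML, query it, write the result. -->
    <file action="write" path="books.xml">
        <xpath expression="//div[@class='book']">
            <html-to-xml>
                <http url="http://example.com/books"/>
            </html-to-xml>
        </xpath>
    </file>
</config>
```

The pipeline reads inside-out: `http` downloads the page, `html-to-xml` cleans it up, `xpath` selects the nodes you care about, and `file` writes them out.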
 
LVL 92

Expert Comment

by:objects
ID: 35061724
You really don't want to be reinventing the wheel :)
 
LVL 86

Expert Comment

by:CEHJ
ID: 35062644
Crawlers are for downloading websites; you then need to analyse them.
 

Author Comment

by:nmokhayesh
ID: 35084458
Hi,

What I mean by analyzing the data is this:

For instance, if I select a specific university website, I need to grab the faculty data for that university only, such as faculty name, address, schools, and professor information, into a text file.

Based on your experience, will one of the previous approaches do the job without a lot of coding?

Thanks
 
LVL 86

Expert Comment

by:CEHJ
ID: 35084832
The least coding will be achieved by using the highest-level API, and you won't get much higher than HttpUnit in Java.
 

Author Comment

by:nmokhayesh
ID: 35085033
OK,
Can you give me some examples of how to do the task, because I don't know much about HttpUnit :(

Thanks
 
LVL 86

Expert Comment

by:CEHJ
ID: 35085258
 
LVL 92

Expert Comment

by:objects
ID: 35088156
> can you just give me some examples to do the task because I do not know much about HttpUnit :(

It would be a lot of work with HttpUnit; it's intended primarily for testing.
Did you try Web-Harvest? It will make the job far easier.
 

Author Comment

by:nmokhayesh
ID: 35095539
Dears,

I really appreciate your advice, but I'm confused now! I will rephrase my task as a very simple example, and then I need your suggestion for the best way to do it in a short time.

For example: I need to get the university information for a specific university (so this only needs to work with that one university, since every university website has its own design). The API should grab the list of faculties, under each faculty the list of schools/departments, and under each department the list of professors and their information (tel, fax, research interests, courses, and personal website).
The output is an XML or text file.

So, based on this example, which technique is easy and fast enough to do the job in a short time without learning new things: webbots, spiders, HttpUnit, or Web-Harvest?

I would appreciate a similar example.

Please advise.

Thanks
 
LVL 86

Expert Comment

by:CEHJ
ID: 35095593
>> so based on the example, which technique is easy and fast enough to do the job in a short time

My answer is as above: HttpUnit.
 
LVL 92

Accepted Solution

by:
objects earned 1500 total points
ID: 35099703
"Web-Harvest is Open Source Web Data Extraction tool written in Java. It offers a way to collect desired Web pages and extract useful data from them. In order to do that, it leverages well established techniques and technologies for text/xml manipulation such as XSLT, XQuery and Regular Expressions. Web-Harvest mainly focuses on HTML/XML based web sites which still make vast majority of the Web content. On the other hand, it could be easily supplemented by custom Java libraries in order to augment its extraction capabilities."

It does exactly what you want.
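Whichever tool does the scraping, the final "write the extracted values out as XML" step the question asks for can be sketched in plain Java with the JDK's DOM and Transformer APIs. The `faculty`/`professor` element names and the professor names are made-up sample data standing in for values scraped from the university site:

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class XmlOutput {

    /** Builds a small XML document from extracted values and returns it as a string. */
    public static String toXml(String[] professors) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().newDocument();

        Element root = doc.createElement("faculty");
        doc.appendChild(root);
        for (String name : professors) {
            Element p = doc.createElement("professor");
            p.setTextContent(name);
            root.appendChild(p);
        }

        // Serialize the DOM to a string; could just as well stream to a file or DB.
        StringWriter out = new StringWriter();
        TransformerFactory.newInstance().newTransformer()
            .transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Sample data standing in for scraped values.
        System.out.println(toXml(new String[] {"Jane Doe", "John Smith"}));
    }
}
```

The same `Document` could instead be handed to a JDBC insert loop if the target is a database rather than a file.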
 

Author Closing Comment

by:nmokhayesh
ID: 36710197
It helped.
