Web crawler in Java

gobicse
Hi all,
I created a web crawler which retrieves the links that contain user-defined keywords and saves those pages (not just the links) in a local directory.

So far, all I have done is retrieve the links, which are displayed as shown in the table (see the attachment).

I need three things done:
1. Download the web page behind each link and save it as a separate text file in a directory.

2. Integrate this application with the Lucene API for indexing and searching.

3. If I click any link in the table, it should open in a browser. (This is optional - see the sketch below the attachment.)

Can anyone please help me? This is my final year project and I am running out of time. Thanks in advance, guys.
Capture.JPG
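For item 3, one possible approach is the standard java.awt.Desktop API. A minimal sketch, assuming the application runs on a desktop that supports browsing (the URL is a placeholder):

import java.awt.Desktop;
import java.net.URI;

public class OpenLink {
    public static void main(String[] args) throws Exception {
        // Placeholder URL - in the crawler this would be the link clicked in the table
        String url = "http://example.com/";
        if (Desktop.isDesktopSupported()
                && Desktop.getDesktop().isSupported(Desktop.Action.BROWSE)) {
            Desktop.getDesktop().browse(new URI(url));
        }
    }
}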
Top Expert 2009
Commented:
Let's go one step at a time. Downloading a web page - http://stackoverflow.com/questions/238547/how-do-you-programmatically-download-a-webpage-in-java

Redirect System.out to a file.
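A minimal sketch along those lines, writing the page straight to a text file instead of redirecting System.out (the URL and file name are placeholders):

import java.io.BufferedReader;
import java.io.FileWriter;
import java.io.InputStreamReader;
import java.net.URL;

public class PageDownloader {
    public static void main(String[] args) throws Exception {
        // Placeholder URL and output file - substitute the links your crawler found
        URL url = new URL("http://example.com/");
        BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), "UTF-8"));
        FileWriter out = new FileWriter("page.txt");
        String line;
        while ((line = in.readLine()) != null) {
            out.write(line);
            out.write("\n");
        }
        out.close();
        in.close();
    }
}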
Top Expert 2016

Commented:
>>so far all I have done is retrieve the links, which are displayed as shown

That's a good start. Since you have done that successfully, what you need to do is apply what you have recursively, i.e. you have to retrieve the links from the pages that are listed in your picture. There are two main approaches possible here:

a. use what you have, with recursion
b. maintain a queue of links

a. implies one thread, which is simpler, but potentially problematic (if there's a problem with one single link, it will hold up the whole app).
b. is more flexible and lends itself easily to multi-threading (see the sketch after the link below).

Don't be afraid of looking at how other crawlers operate and imitating them:

http://java-source.net/open-source/crawlers
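A rough sketch of option b; extractLinks() and savePage() are placeholders for the extraction and saving logic you already have, and the seed URL is just an example:

import java.util.ArrayDeque;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

public class SimpleCrawler {

    // Placeholder: plug in the keyword/link extraction you already have
    static List<String> extractLinks(String pageUrl) {
        return Collections.emptyList();
    }

    // Placeholder: save the page content here (and later feed it to Lucene)
    static void savePage(String pageUrl) {
        System.out.println("saving " + pageUrl);
    }

    public static void main(String[] args) {
        Queue<String> queue = new ArrayDeque<String>(); // links still to visit
        Set<String> visited = new HashSet<String>();    // pages already processed
        int limit = 100;                                // stop after this many pages

        queue.add("http://example.com/");               // seed URL (placeholder)

        while (!queue.isEmpty() && visited.size() < limit) {
            String url = queue.poll();
            if (!visited.add(url)) {
                continue;                               // already seen this link
            }
            savePage(url);
            for (String link : extractLinks(url)) {
                if (!visited.contains(link)) {
                    queue.add(link);
                }
            }
        }
    }
}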

Author

Commented:
hi a_b & CEHJ

There is a slight improvement in my project: now I can download the whole web page in text format without any HTML tags. But I still couldn't remove the "&" symbol from the web page which I am saving as a text file. Any suggestions?

This is one part of the web page which I extracted:

eg: "Testimonials || HDMI Cable FAQs  || View Shopping Cart  || UPS Tracking  || USPS Tracking || Shipping Info || Payments || New Products || Clearance "
      

Top Expert 2016
Commented:
Given String 'p' representing your page:
p = p.replaceAll("&nbsp;", "");


Author

Commented:
hi there
I already have this replaceAll() method, but it accepts only one argument, if I am not wrong. Here is my code:


// Read the whole page into a single String
connection = pageUrl.openConnection();
Scanner scanner = new Scanner(connection.getInputStream());
scanner.useDelimiter("\\Z");
content = scanner.next();

// Strip the HTML tags with a regex, replacing each tag with a space
Pattern p = Pattern.compile("<[^>]*>");
Matcher m = p.matcher(content);

String text1 = m.replaceAll(" ");


a_b
Top Expert 2009
Commented:
m should be a String
Top Expert 2016
Commented:
You could make that
String text1 = m.replaceAll(" ").replaceAll("&nbsp;", "");


Author

Commented:
hi guys

As you said, I tried the replaceAll() method, and now I have another problem. I can write the web page as a .pdf file; I have attached the PDF, which is one of the outputs of my application. In that PDF, the first few pages are occupied by some HTML tags and by blocks starting and ending with "{ }", and I can't get rid of those. There are also lots of blank white pages which I can't get rid of.
write111.pdf
Top Expert 2016

Commented:
That's a lot of CSS before the content. You would be better off using a proper HTML parser - regex is not a good substitute.
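For example, the Jericho HTML parser (net.htmlparser.jericho) can do the text extraction for you. A minimal sketch, assuming the Jericho jar is on the classpath (the URL is a placeholder):

import java.net.URL;
import net.htmlparser.jericho.Source;

public class HtmlToText {
    public static void main(String[] args) throws Exception {
        // Placeholder URL - use the links your crawler retrieved
        Source source = new Source(new URL("http://example.com/"));
        // The text extractor strips the markup and decodes entities such as &nbsp;
        String text = source.getTextExtractor().toString();
        System.out.println(text);
    }
}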

Author

Commented:
Can you suggest some parsers? Thanks.
Top Expert 2016

Commented:

Author

Commented:
hi guys
I planned to take a screenshot of the whole web page instead of saving it as a text or PDF file. For that I used the following code (I have also attached the resulting screenshot). Even though the blue colour shows up in the screenshot, I can't get the whole web page; only a portion of it comes through. Can anyone please help me out?
package test;

import java.awt.*;
import java.awt.image.*;
import java.io.File;
import java.io.IOException;

import javax.imageio.ImageIO;
import javax.swing.*;
import javax.swing.text.*;
import javax.swing.text.html.*;

public abstract class WebImage {

    // Editor kit that loads the document synchronously, so the page is
    // fully available before it is painted
    static class Kit extends HTMLEditorKit {
        public Document createDefaultDocument() {
            HTMLDocument doc = (HTMLDocument) super.createDefaultDocument();
            doc.setTokenThreshold(Integer.MAX_VALUE);
            doc.setAsynchronousLoadPriority(-1);
            return doc;
        }
    }

    // Renders the page at 'src' into an offscreen image of the given size
    public static BufferedImage create(String src, int width, int height) {
        BufferedImage image = null;
        JEditorPane pane = new JEditorPane();
        Kit kit = new Kit();
        pane.setEditorKit(kit);
        pane.setEditable(false);
        pane.setMargin(new Insets(0, 0, 0, 0));
        try {
            pane.setPage(src);
            image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
            Graphics g = image.createGraphics();
            Container c = new Container();
            // Paint the editor pane into the image rather than onto the screen
            SwingUtilities.paintComponent(g, pane, c, 0, 0, width, height);
            g.dispose();
        } catch (Exception e) {
            System.out.println(e);
        }
        return image;
    }

    public static void main(String args[]) {
        BufferedImage ire = WebImage.create(
            "http://www.mail-archive.com/batik-users@xmlgraphics.apache.org/msg04780.html",
            1800, 1600);
        try {
            ImageIO.write(ire, "jpg", new File("C:/Users/Desktop/output/webimage.jpg"));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}


webimage.jpg
Top Expert 2016

Commented:
You can't take a screenshot of any more than you can see

Author

Commented:
Thanks CEHJ. Anyhow, I thought of adding multiple functionalities to my project, like saving the web page as text, PDF, and image files. Right now I am learning how to use the Jericho HTML parser instead of regex. I will get back to you if I have any problems.
thanks

Author

Commented:
hi there..

Can you check the link below and tell me whether, if I use this code, it is possible to convert the web page to an image? They mention JNI (Java Native Interface); will it work if I use it in my application?

 thanks
Top Expert 2016

Commented:
Personally I wouldn't bother with it. It's going to be Windows-only, which rather negates the point of writing it in Java in the first place. Why do you want web pages as images anyway?

Author

Commented:
Because some web pages contain diagrams or tables, etc. If I save them as text, I can't see those, so I want to save them as an image - and also to improve my grade for this project... lol

Author

Commented:
hi there..
I came across the term "LaTeX" for converting HTML to a PDF file (which would include all the figures and images). Is there any chance for me to use this in my application? If so, can you point me to some tutorials or sample source code? It would be really helpful. Thanks.

Author

Commented:
hi CEHJ
I used the Jericho parser and it works really, really well. Thank you so much.

Author

Commented:
hi CEHJ
Can you help with using LaTeX to convert HTML to PDF? I still have a lot to do in my project, and any help from you guys would be great.
Top Expert 2016
Commented:

Author

Commented:
hi..
Can anyone tell me how to save a web page in .doc format and in .html format (I mean saving the web page in a local directory as an HTML file)?
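For the .html part, a minimal sketch is simply to copy the raw bytes of the page to a local file (the URL and output path are placeholders); .doc is a binary format, so that would need a separate library such as Apache POI:

import java.io.FileOutputStream;
import java.io.InputStream;
import java.net.URL;

public class SavePageAsHtml {
    public static void main(String[] args) throws Exception {
        // Placeholder URL and output path
        URL url = new URL("http://example.com/");
        InputStream in = url.openStream();
        FileOutputStream out = new FileOutputStream("page.html");
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
        }
        out.close();
        in.close();
    }
}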
