This might be as much a general architecture question as a Java-specific one.
I'm taking a crack at writing a Java app that will run on a server and perform the following functions:
1) read in a list of keywords from a db table
2) perform Google searches on those keywords via their API
3) store the first 10,000 hits as links in another table
4) use a producer/consumer thread model to read these links from the table and retrieve each page
5) save the text from each page in a db table along with its source; this is the table that will be used in an "archive" search
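
To make steps 4 and 5 concrete, here is a rough sketch of the producer/consumer part, not a finished design. Everything named in it is an assumption for illustration: the class `LinkPipeline`, the `fetcher` function (standing in for whatever HTTP retrieval method ends up being used), and the in-memory map that stands in for the archive table. One producer feeds links into a bounded queue and several consumers take links off it, fetch the page text, and store it keyed by source URL.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

public class LinkPipeline {
    // Hypothetical sentinel value that tells a consumer there are no more links.
    private static final String END_OF_LINKS = "__END_OF_LINKS__";

    /**
     * links     : stands in for the rows read from the links table (assumption)
     * fetcher   : stands in for the real page retrieval code (assumption)
     * consumers : number of consumer threads fetching pages
     * Returns a map of source URL to page text, standing in for the archive table.
     */
    public static Map<String, String> run(List<String> links,
                                          Function<String, String> fetcher,
                                          int consumers) throws InterruptedException {
        // Bounded queue so the producer cannot run far ahead of the consumers.
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(100);
        Map<String, String> archive = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(consumers + 1);

        // Producer: enqueue every link, then one end marker per consumer.
        pool.submit(() -> {
            try {
                for (String link : links) {
                    queue.put(link);
                }
                for (int i = 0; i < consumers; i++) {
                    queue.put(END_OF_LINKS);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Consumers: take a link, fetch the page, store the text with its source.
        for (int i = 0; i < consumers; i++) {
            pool.submit(() -> {
                try {
                    while (true) {
                        String link = queue.take();
                        if (link.equals(END_OF_LINKS)) {
                            break;
                        }
                        try {
                            archive.put(link, fetcher.apply(link));
                        } catch (RuntimeException e) {
                            // One bad page should not stop the whole run.
                            System.err.println("Failed to fetch " + link + ": " + e);
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return archive;
    }

    public static void main(String[] args) throws InterruptedException {
        // Stub fetcher so the sketch runs without network access.
        Map<String, String> archive = run(
                List.of("http://a.example", "http://b.example"),
                url -> "text of " + url,
                2);
        System.out.println(archive.size() + " pages archived");
    }
}
```

In a real version the producer would read from the links table and the consumers would write to the archive table, so the queue size and the number of consumers would need tuning against the database and network limits.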
This is essentially my first thought on how to approach the problem. Before I ask for particular advice on which method you would recommend for retrieving the pages, is there something inherently bad with my process above? I think I would really rather save off the pages and images in a directory structure of some sort, but I'm not entirely sure how that would be searchable from a webpage in that case.
Any advice? Sorry it's a somewhat broad question, but I hope my aim is clear from the above. The exact purpose is to automate the retrieval of static information from the internet for review by staff members of the U.N.
Thanks,
David