jar file that contains manifest file

I am testing my WebCrawler program, which retrieves a web page and the images it contains to local storage, so that I can view the page from my local file system.
The program is a command-line application that takes two arguments:
The first argument is the download directory into which the web page will be downloaded. It can be an absolute directory or relative to the directory where the java command is executed. If the directory does not exist, the program throws a CrawlerException.
The second argument is the absolute URL of the web page to download. This URL will always end with .../<filename>.html. The original <filename>.html is used to save the HTML of the page in the download directory.

If I run it like this, everything is OK:
java -classpath c:\classes dkim18.crawler.WebCrawler c:\classes\ http://webdev.apl.jhu.edu/%%7Emed/summer03/homework/05LibrarySwing.html

However, if I make a jar file, my program doesn't create the subdirectory into which all the relative images are supposed to be downloaded.

This is my manifest file.
Manifest-Version: 1.0
Main-Class: dkim18.crawler.WebCrawler

jar -cvmf myManiFest dkim18.jar dkim18/

(All .class files are under c:\classes\dkim18\crawler\)

This is so that I can run it from any directory via: java -jar dkim18.jar <download dir> <url>

Any idea?

dkim18

ASKER

Hold on...let me try this...

SOLUTION

CEHJ

dkim18

ASKER

OK, here is the problem. When the program is compiled and run on Windows, it works, but it doesn't work on a Sun Solaris system. I changed \ to /, but then it doesn't write anything to the HTML file or download any images into the subdirectory. I know this is a lot to ask, but here is my code.

-----------
package dkim18.crawler;

import java.io.*;
import java.util.*;
import java.lang.*;
import java.net.*;
import java.util.regex.*;

/**
* WebCrawler class is used to retrieve a web page and the
* images it contains to local storage as project description
*
* @author: Daniel Kim
*/
public class WebCrawler {

private URL url; //url to be retrieved

/**
* Initializes url
*
* @param: url
*/
public void setURL(URL url) {
this.url = url;
}

/**
* returns url
*
* @return : url
*/
public URL getURL() {
return url;
}

/**
* Writes contents to the html file that was created
*
* @param : file, web contents, directory
*
*/
static public void writeContents(File aFile, String aContents, String dir) throws
FileNotFoundException, IOException {
if (aFile == null) {
throw new IllegalArgumentException("File should not be null.");
}
if (!aFile.exists()) {
throw new FileNotFoundException("File does not exist: " + aFile);
}
if (!aFile.isFile()) {
throw new IllegalArgumentException("Should not be a directory: " + aFile);
}
if (!aFile.canWrite()) {
throw new IllegalArgumentException("File cannot be written: " + aFile);
}

Writer output = null;
try {
output = new BufferedWriter(new FileWriter(aFile));
output.write(aContents);
}
finally {
if (output != null)
output.close();
}
}

/**
* Makes sub directory name for storing image files
*
* @param : file name
*/
public String makeSubDirName(String fileName) {

int splitIndex = fileName.indexOf(".");
String concatName = fileName.substring(0, splitIndex);
String fileN = (concatName + "_html_files");

return fileN;
}

/**
* Creates sub directory
*
* @param : sub dir from html file, sub directory name
*/
public String makeSubDirectory(String subDirName, String subDir) {

String newDir = subDir+subDirName;
boolean success = (new File(newDir)).mkdir();
if (!success) {
//System.out.println("Failed");
}
success = (new File(subDir)).mkdirs();
if (!success) {
//System.out.println("Failed");
}
return newDir;
}

/**
* Detects images source directory from html file and
* store in array of string
*
* @param : contents of html, sub directory name
* @return : array of string that contains images directory
*/
public String[] getImgSrcDir(String html, String subDirName ) {
final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL ;
final String IMG_PATTERN = "<img\\s+src\\s*=\\s*('|\")(.*?)('|\")";

String imgPatternArr[] = new String[50];
Pattern myPattern = Pattern.compile(IMG_PATTERN, FLAGS);
Matcher myMatcher = myPattern.matcher(html);
int counter = 0;
while (myMatcher.find()) {
String img = myMatcher.group(1);
String imagesTag = myMatcher.group();
int startTag = myMatcher.start();
int startImages = myMatcher.start(1);
int endImages = myMatcher.end(1);
int endTag = myMatcher.end();

imgPatternArr[counter] = imagesTag;
String REGEX_SPACE = "\\s*";
String REGEX_QUOTE = "('|\")";
String REGEX_TAG = "<(.*?)=";

//get rid of spaces if there is any
String REPLACE = "";
Pattern p = Pattern.compile(REGEX_SPACE);
Matcher m = p.matcher(imgPatternArr[counter]);
imgPatternArr[counter] = m.replaceAll(REPLACE);

p = Pattern.compile(REGEX_QUOTE);
m = p.matcher(imgPatternArr[counter]);
imgPatternArr[counter] = m.replaceAll(REPLACE);

p = Pattern.compile(REGEX_TAG);
m = p.matcher(imgPatternArr[counter]);
imgPatternArr[counter] = m.replaceAll(REPLACE);
counter++;
}
String imgArr[] = new String[counter];
for (int i = 0; i < counter; i++) {
imgArr[i] = imgPatternArr[i];
}
return imgArr;
}

/**
* Replaces all relative directories (<a href) in the html
*
* @param : htmlpage, url, html file
* @return : updated html
*/
public static String replaceLinkSrcDir(String htmlWebPage, String url, String htmlFile){
int splitIndex = 0;
String newURL = null;

splitIndex = url.indexOf(htmlFile);
newURL = url.substring(0, splitIndex);

final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL ;
final String FIND_PATTERN = "(<a href=\"*)(([/\\.]*)([^>\"]+))";
final String replace_str = "$1"+ newURL+ "$4";

Pattern myPattern = Pattern.compile(FIND_PATTERN, FLAGS);
Matcher myMatcher = myPattern.matcher(htmlWebPage);
StringBuffer buffy = new StringBuffer();

while (myMatcher.find()) {
try {
URL uri = new URL(myMatcher.group(4));
}catch(MalformedURLException e) {
myMatcher.appendReplacement(buffy, replace_str);
}
}

myMatcher.appendTail(buffy);
String newHtml=buffy.toString();
return newHtml;
}

/**
* Replaces images patterns in html file
*
* @param : html, sub dir name, number of images tag in the html file
* @return : updated html
*/
public static String patternReplace(String htmlWebPage, String subDirName, String[] counter){
final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL ;
final String REPLACE_PATTERN = "<img\\s+src\\s*=\\s*('|\")(.*?)/";
String replace_str = "<img src=\"" + subDirName + "/";
Pattern myPattern = Pattern.compile(REPLACE_PATTERN, FLAGS);
Matcher myMatcher = myPattern.matcher(htmlWebPage);

StringBuffer buffy = new StringBuffer();
for(int i = 0; i < counter.length ; i++){
if (myMatcher.find()) {
myMatcher.appendReplacement(buffy, replace_str);
}
}

myMatcher.appendTail(buffy);
String newHtml=buffy.toString();
return newHtml;
}

/**
* Add all path and name to make full images path
*
* @param : url, html file name, string array contains part of tag
* @return : string of array contain updated img tag
*/
public String[] makeFullImgPath(String urlPath, String htmlFileName, String[] imgDir){
int splitIndex = 0;
String path = null;

splitIndex = urlPath.indexOf(htmlFileName);
path = urlPath.substring(0, splitIndex);

for(int i =0; i < imgDir.length ; i++){
imgDir[i] = path+imgDir[i];
}
return imgDir;
}

/**
* Downloads images from relative directory in html
*
* @param : name of sub dir name, string array of full images path, images dir
*/
public void downloadImg(String subDirPathName, String[] fullImgPath, String[] imgDir) {
String targetDirAndFile = null;
String imgFileName = null;
for (int i = 0; i < fullImgPath.length; i++){
try {
URL url = new URL(fullImgPath[i]);
URLConnection urlConnection = url.openConnection();
InputStream is = urlConnection.getInputStream();
int splitIndex = 0;
splitIndex = fullImgPath[i].lastIndexOf("/");
imgFileName = fullImgPath[i].substring(++splitIndex);
targetDirAndFile = subDirPathName + "/" + imgFileName;
FileOutputStream file = new FileOutputStream(targetDirAndFile);

int len;
byte[] buf = new byte[256];
while ( (len = is.read(buf)) >= 0) {
file.write(buf, 0, len);
}
is.close();
file.close();

}
catch (EOFException e) {
System.out.println(e);
}
catch (Exception e) {
System.out.println(e.toString());
System.exit(1);
}
}
}

/**
* Checks usage and throws crawler exception
*
* @param : number of command line args
*/
private static void checkUsage(String[] args)throws CrawlerException{
if (args.length != 2){
throw new CrawlerException();
}

File dir = new File(args[0]);
if(!(dir.exists())){
throw new CrawlerException();
}

try{
URL url = new URL(args[1]);
}catch(MalformedURLException mue){
throw new CrawlerException();
}
}

/**
* Gets html page by line
*
* @return : html contents
*/
public String getPage() throws IOException{

final String LINE_SEPARATOR = System.getProperty("line.separator");
BufferedReader bin = null;
StringBuffer buffy = new StringBuffer("");
try{
bin = new BufferedReader(new InputStreamReader(url.openStream()));
String line = null;
while((line = bin.readLine()) != null){
buffy.append(line);
buffy.append(LINE_SEPARATOR);
}
}finally{
if (bin != null) {
bin.close();
}
}
return buffy.toString() ;
}

/**
* main that drives this program
*
* @param : command line args
*/
public static void main(String[] args) throws CrawlerException {

checkUsage(args);

try{
String webPage = null;
String htmlFileName = null;
WebCrawler myReader = new WebCrawler();
myReader.setURL(new URL(args[1]));
webPage = myReader.getPage();

File file = new File(args[1]);
//index.html
htmlFileName = file.getName();
//index_html_files
String subDirName = myReader.makeSubDirName(htmlFileName);
//make sub dir name
String subDirPathName = myReader.makeSubDirectory(subDirName, args[0]);
//get img dir
String imgDir[] = myReader.getImgSrcDir(webPage, subDirName);
//replace img patterns
String newWebPage = patternReplace(webPage, subDirName, imgDir );
//replace relative link
String newWebPageURL= myReader.replaceLinkSrcDir(newWebPage, args[1], htmlFileName);

String targetDir = args[0]; //target dir
URL url = new URL(args[1]);
String s = url.getFile();
if (s != null && s.length() > 0) {
s = s.substring(s.lastIndexOf("/"));
File f = new File(targetDir + s);
f.createNewFile();
writeContents(f, newWebPageURL, args[0]);
//store final img path
String fullImgPath[] = myReader.makeFullImgPath(args[1], htmlFileName, imgDir);
myReader.downloadImg( subDirPathName, fullImgPath, imgDir);
}
}
catch (IOException e) {
System.out.println(e);
}
}

}

dkim18

ASKER

If I leave targetDirAndFile = subDirPathName + "\\" + imgFileName; as it is on the Sun Solaris system, the HTML file does contain all the text, but there are still no downloaded images in the subdirectory.

public void downloadImg(String subDirPathName, String[] fullImgPath, String[] imgDir) {
String targetDirAndFile = null;
String imgFileName = null;
for (int i = 0; i < fullImgPath.length; i++){
try {
URL url = new URL(fullImgPath[i]);
URLConnection urlConnection = url.openConnection();
InputStream is = urlConnection.getInputStream();
int splitIndex = 0;
splitIndex = fullImgPath[i].lastIndexOf("/");
imgFileName = fullImgPath[i].substring(++splitIndex);
targetDirAndFile = subDirPathName + "\\" + imgFileName;
FileOutputStream file = new FileOutputStream(targetDirAndFile);

int len;
byte[] buf = new byte[256];
while ( (len = is.read(buf)) >= 0) {
file.write(buf, 0, len);
}
is.close();
file.close();

}
catch (EOFException e) {
System.out.println(e);
}
catch (Exception e) {
System.out.println(e.toString());
System.exit(1);
}
}
}
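
A note on the separator question: rather than switching between "\\" and "/" by hand, the target file can be built from the directory and the file name with the two-argument File constructor, which inserts the platform's separator itself. The sketch below is only an illustration; the class ImagePathSketch and the method targetFor are hypothetical names, not part of the posted WebCrawler.

```java
import java.io.File;

// Hypothetical helper for illustration; not part of the posted WebCrawler class.
class ImagePathSketch {

    // Builds the target file from a directory and a file name.
    // File(parent, child) joins the two with the platform separator,
    // so no "\\" or "/" needs to be hard-coded.
    static File targetFor(String subDirPathName, String imgFileName) {
        return new File(subDirPathName, imgFileName);
    }

    public static void main(String[] args) {
        File target = targetFor("classes/05LibrarySwing_html_files", "hw5.gif");
        System.out.println("target path: " + target.getPath());
        System.out.println("file name:   " + target.getName());
    }
}
```

In downloadImg this would mean passing targetFor(subDirPathName, imgFileName) to the FileOutputStream constructor instead of a concatenated string.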

CEHJ

What exact command are you giving in Solaris?

ASKER CERTIFIED SOLUTION

jimmack

SOLUTION

Mick Barry

dkim18

ASKER

java -classpath classes dkim18.crawler.WebCrawler classes/ http://webdev.apl.jhu.edu/%7Emed/summer03/homework/05LibrarySwing.html

It just doesn't download the images into the subdirectory (classes/05LibrarySwing_html_file).

CEHJ

I assume you still have write access to classes directory?

What is the code inside your class that manipulates that directory parameter?

dkim18

ASKER

I didn't go through all the code you posted, but I saw this line (in the middle of your last post):

targetDirAndFile = subDirPathName + "\\" + imgFileName;

This (and any other relevant lines) should be changed to:

>>Again, if I leave targetDirAndFile = subDirPathName + "\\" + imgFileName; as it is on the Sun Solaris system, the HTML file does contain all the text, but there are still no downloaded images in the subdirectory.

>>Again, this program works fine on Windows!!!

Mick Barry

Add some debug to determine whether the problem is with the download, or with the file writing.

dkim18

ASKER

I checked the HTML source and it contains this line:
<img src="05LibrarySwing_html_files/hw5.gif" alt="Hw5 snapshot">
which means it did change
<img src="hw5/hw5.gif" alt="Hw5 snapshot"> to
<img src="05LibrarySwing_html_files/hw5.gif" alt="Hw5 snapshot">

but it just doesn't download the images...

Mick Barry

Add some debug to print out the urls of things it is downloading and the file path it is saving it to.
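
As a rough sketch of that kind of debug output (assuming the copy loop is moved into a helper; DownloadDebugSketch and saveWithDebug are made-up names, not from the posted code), the following prints the source URL, the absolute target path, whether the parent directory exists, and the number of bytes written. That should be enough to tell a failed download apart from a failed write.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper for illustration; not part of the posted WebCrawler class.
class DownloadDebugSketch {

    // Copies the stream to the target file and reports each step on stderr.
    // Returns the number of bytes written.
    static long saveWithDebug(String sourceUrl, InputStream in, File target) throws IOException {
        File parent = target.getAbsoluteFile().getParentFile();
        System.err.println("downloading:       " + sourceUrl);
        System.err.println("saving to:         " + target.getAbsolutePath());
        System.err.println("parent dir exists: " + (parent != null && parent.isDirectory()));

        long total = 0;
        // try-with-resources closes both streams even if the copy fails
        try (InputStream src = in; OutputStream out = new FileOutputStream(target)) {
            byte[] buf = new byte[4096];
            int len;
            while ((len = src.read(buf)) >= 0) {
                out.write(buf, 0, len);
                total += len;
            }
        }
        System.err.println("bytes written:     " + total);
        return total;
    }
}
```

Inside downloadImg the call could look like saveWithDebug(fullImgPath[i], urlConnection.getInputStream(), new File(subDirPathName, imgFileName)). If "parent dir exists" prints false, the problem is in the directory creation rather than in the download.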

dkim18

ASKER

If I want to detect (<img...src="..."...>) in an HTML file,

is this "<img\\s+src=\"(.*?)\".*?>" correct?

CEHJ

Not quite - for one thing, sometimes URLs are not in quotes.

dkim18

ASKER

For this project, the pattern is always like this: <img...src="..."...>

Mick Barry

> Is this "<img\\s+src=\"(.*?)\".*?>" correct?

are you accessing the same url as you were when running windoze?

dkim18

ASKER

are you accessing the same url as you were when running windoze?
>>yes

Mick Barry

then pattern matching should be the same shouldn't it.
Have you added debug to determine exactly where the problem is occurring (I'm too lazy to wade thru all that code:) )

dkim18

ASKER

I am trying...but it is weird...now I get an empty HTML file when I run it.

CEHJ

This will match with or without quotes:

String re = "<img\\s+src=\"*([^>\"]+)\"*>";

Mick Barry

> File file = new File(args[1]);

args[1] is a URL, not a file.
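
For illustration, the file name can be taken from the URL itself rather than by wrapping the URL string in a File. This is only a sketch: UrlFileNameSketch and fileNameOf are hypothetical names, and it assumes the URL ends in .../<filename>.html as described in the question.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper for illustration; not part of the posted WebCrawler class.
class UrlFileNameSketch {

    // Returns the last segment of the URL's path, e.g. "index.html".
    // URL paths always use '/', regardless of the operating system.
    static String fileNameOf(String address) throws MalformedURLException {
        String path = new URL(address).getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) throws MalformedURLException {
        System.out.println(fileNameOf(
            "http://webdev.apl.jhu.edu/%7Emed/summer03/homework/05LibrarySwing.html"));
    }
}
```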

dkim18

ASKER

args[1] is a URL, not a file.
>>The next statement will get html file name and I am using this file name.
>>htmlFileName = file.getName();

String re = "<img\\s+src=\"*([^>\"]+)\"*>";
>>didn't work on Windows or Solaris
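
One possible reason the suggested pattern finds nothing on this page: it expects the closing > directly after the src value, while the tag quoted earlier in the thread has an alt attribute after src. The sketch below compares the suggested pattern (STRICT) with a relaxed variant that tolerates trailing attributes. RELAXED is an assumption made here for illustration, not something proposed in the thread, and it still requires src to be the first attribute.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for illustration; not part of the posted WebCrawler class.
class ImgPatternSketch {

    // The pattern suggested in the thread: src value with or without quotes,
    // but the tag must end immediately after it.
    static final Pattern STRICT = Pattern.compile(
        "<img\\s+src=\"*([^>\"]+)\"*>", Pattern.CASE_INSENSITIVE);

    // Assumed variant: optional quotes, then any further attributes before '>'.
    static final Pattern RELAXED = Pattern.compile(
        "<img\\s+src\\s*=\\s*\"?([^>\"\\s]+)\"?[^>]*>", Pattern.CASE_INSENSITIVE);

    // Returns the first src value found by the pattern, or null if none matches.
    static String firstSrc(Pattern p, String html) {
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String tag = "<img src=\"hw5/hw5.gif\" alt=\"Hw5 snapshot\">";
        System.out.println("strict:  " + firstSrc(STRICT, tag));
        System.out.println("relaxed: " + firstSrc(RELAXED, tag));
    }
}
```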

Mick Barry

print out all these values and post the results:

File file = new File(args[1]);
//index.html
htmlFileName = file.getName();
//index_html_files
String subDirName = myReader.makeSubDirName(htmlFileName);
//make sub dir name
String subDirPathName = myReader.makeSubDirectory(subDirName, args[0]);
//get img dir
String imgDir[] = myReader.getImgSrcDir(webPage, subDirName);
//replace img patterns
String newWebPage = patternReplace(webPage, subDirName, imgDir );
//replace relative link
String newWebPageURL= myReader.replaceLinkSrcDir(newWebPage, args[1], htmlFileName);

String targetDir = args[0]; //target dir
URL url = new URL(args[1]);
String s = url.getFile();
if (s != null && s.length() > 0) {
s = s.substring(s.lastIndexOf("/"));
File f = new File(targetDir + s);

SOLUTION

CEHJ

Mick Barry

isn't the RE working already in Windoze?

Mick Barry

:-)

http://www.objects.com.au