Solved

jar file that contains manifest file

Posted on 2003-12-10
Last Modified: 2013-11-23
I am testing my WebCrawler program, which retrieves a web page and the images it contains to local storage so that I can view the page from my local file system.
The program is a command-line application that takes two arguments:
The first argument is the download directory into which the web page will be downloaded. It can be relative to the directory where the java command is executed, or an absolute directory. If the directory does not exist, the program throws a CrawlerException.
The second argument is the absolute URL of the web page to download. This URL will always end with .../<filename>.html. The original <filename>.html is used to save the HTML of the page in the download directory.

Now, if I run it like this, everything is OK.
java -classpath c:\classes dkim18.crawler.WebCrawler c:\classes\ http://webdev.apl.jhu.edu/%%7Emed/summer03/homework/05LibrarySwing.html

However, if I make a jar file, then my program doesn't create the subdirectory into which all the relative images are supposed to be downloaded.

This is my manifest file.
Manifest-Version: 1.0
Main-Class: dkim18.crawler.WebCrawler

jar -cvmf myManiFest dkim18.jar dkim18/

(all .class files are under c:\classes\dkim18\crawler\)

That way, I should be able to run it from any directory via: java -jar dkim18.jar <download dir> <url>

Any ideas?



Question by:dkim18
27 Comments
 

Author Comment

by:dkim18
ID: 9914800
Hold on...let me try this...
0
 
LVL 86

Assisted Solution

by:CEHJ
CEHJ earned 150 total points
ID: 9914813
Try passing the directory as a system property:

java -Ddownload=c:/download -jar dkim18.jar <url>
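
For what it's worth, here is a minimal sketch of how the program could pick that value up. The `-D` option sets a Java system property, and it has to appear before `-jar`, as in the command above. The class name `DownloadDirProperty` and the method `resolveDownloadDir` are made up for illustration and are not part of the posted crawler.

```java
import java.io.File;

// Hypothetical helper: reads the download directory from the "download"
// system property, which is set with -Ddownload=... on the java command line.
class DownloadDirProperty {

    // Returns the directory named by the property, or the fallback if unset.
    static File resolveDownloadDir(String fallback) {
        String dir = System.getProperty("download", fallback);
        return new File(dir);
    }

    public static void main(String[] args) {
        // Falls back to the current working directory when -Ddownload is absent.
        File dir = resolveDownloadDir(".");
        System.out.println("Download directory: " + dir.getAbsolutePath());
    }
}
```

With this approach the URL would be the only remaining command-line argument.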
0
 

Author Comment

by:dkim18
ID: 9914974
O.K
Here is the problem. If this program is compiled and run on Windows, it works, but it doesn't work on a Sun Solaris system. I changed \ to /, but it doesn't write anything to the HTML file or download any images into the subdirectory. I know this is a lot to ask, but here is my code.

-----------
package dkim18.crawler;

import java.io.*;
import java.util.*;
import java.lang.*;
import java.net.*;
import java.util.regex.*;

/**
 * WebCrawler  class is used to retrieve a web page and the
 * images it contains to local storage as project description
 *
 * @author: Daniel Kim
 */
public class WebCrawler {

  private URL url;                             //url to be retrieved

  /**
   * Initializes url
   *
   * @param: url
   */
  public void setURL(URL url) {
    this.url = url;
  }

  /**
   * returns url
   *
   * @return : url
   */
  public URL getURL() {
    return url;
  }

  /**
   * Writes contents to the HTML file that was created
   *
   * @param : file, web contents, directory
   *
   */
  static public void writeContents(File aFile, String aContents, String dir) throws
      FileNotFoundException, IOException {
    if (aFile == null) {
      throw new IllegalArgumentException("File should not be null.");
    }
    if (!aFile.exists()) {
      throw new FileNotFoundException("File does not exist: " + aFile);
    }
    if (!aFile.isFile()) {
      throw new IllegalArgumentException("Should not be a directory: " + aFile);
    }
    if (!aFile.canWrite()) {
      throw new IllegalArgumentException("File cannot be written: " + aFile);
    }

    Writer output = null;
    try {
      output = new BufferedWriter(new FileWriter(aFile));
      output.write(aContents);
    }
    finally {
      if (output != null)
        output.close();
    }
  }

  /**
   * Makes the subdirectory name for storing image files
   *
   * @param : file name
   */
  public String makeSubDirName(String fileName) {

    int splitIndex = fileName.indexOf(".");
    String concatName = fileName.substring(0, splitIndex);
    String fileN = (concatName + "_html_files");

    return fileN;
  }

  /**
  * Creates sub directory
  *
  * @param : sub dir from html file, sub directory name
  */
  public String makeSubDirectory(String subDirName, String subDir) {

    String newDir = subDir+subDirName;
    boolean success = (new File(newDir)).mkdir();
    if (!success) {
       //System.out.println("Failed");
     }
     success = (new File(subDir)).mkdirs();
     if (!success) {
       //System.out.println("Failed");
     }
     return newDir;
  }

  /**
   * Detects images source directory from html file and
   * store in array of string
   *
   * @param : contents of html, sub directory name
   * @return : array of string that contains images directory
   */
  public String[] getImgSrcDir(String html, String subDirName ) {
    final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL ;
    final String IMG_PATTERN = "<img\\s+src\\s*=\\s*('|\")(.*?)('|\")";

    String imgPatternArr[] = new String[50];
    Pattern myPattern = Pattern.compile(IMG_PATTERN, FLAGS);
    Matcher myMatcher = myPattern.matcher(html);
    int counter = 0;
    while (myMatcher.find()) {
      String img = myMatcher.group(1);
      String imagesTag = myMatcher.group();
      int startTag = myMatcher.start();
      int startImages = myMatcher.start(1);
      int endImages = myMatcher.end(1);
      int endTag = myMatcher.end();

      imgPatternArr[counter] = imagesTag;
      String REGEX_SPACE = "\\s*";
      String REGEX_QUOTE = "('|\")";
      String REGEX_TAG = "<(.*?)=";

      //get rid of spaces if there is any
      String REPLACE = "";
      Pattern p = Pattern.compile(REGEX_SPACE);
      Matcher m = p.matcher(imgPatternArr[counter]);
      imgPatternArr[counter] = m.replaceAll(REPLACE);

      p = Pattern.compile(REGEX_QUOTE);
      m = p.matcher(imgPatternArr[counter]);
      imgPatternArr[counter] = m.replaceAll(REPLACE);

      p = Pattern.compile(REGEX_TAG);
      m = p.matcher(imgPatternArr[counter]);
      imgPatternArr[counter] = m.replaceAll(REPLACE);
      counter++;
    }
    String imgArr[] = new String[counter];
    for (int i = 0; i < counter; i++) {
      imgArr[i] = imgPatternArr[i];
    }
    return imgArr;
  }

  /**
   * Replaces all relative directories (<a href) in the html
   *
   * @param : htmlpage, url, html file
   * @return : updated html
   */
  public static String replaceLinkSrcDir(String htmlWebPage, String url, String htmlFile){
    int splitIndex = 0;
    String newURL = null;

    splitIndex = url.indexOf(htmlFile);
    newURL = url.substring(0, splitIndex);

    final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL ;
    final String FIND_PATTERN = "(<a href=\"*)(([/\\.]*)([^>\"]+))";
    final String replace_str = "$1"+ newURL+ "$4";

    Pattern myPattern = Pattern.compile(FIND_PATTERN, FLAGS);
    Matcher myMatcher = myPattern.matcher(htmlWebPage);
    StringBuffer buffy = new StringBuffer();

    while (myMatcher.find()) {
      try {
        URL uri = new URL(myMatcher.group(4));
     }catch(MalformedURLException e) {
       myMatcher.appendReplacement(buffy, replace_str);
     }
    }

    myMatcher.appendTail(buffy);
    String newHtml=buffy.toString();
    return newHtml;
}

  /**
   * Replaces images patterns in html file
   *
   * @param : html, sub dir name, number of images tag in the html file
   * @return : updated html
   */
  public static String patternReplace(String htmlWebPage, String subDirName, String[] counter){
    final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL ;
    final String REPLACE_PATTERN = "<img\\s+src\\s*=\\s*('|\")(.*?)/";
    String replace_str = "<img src=\"" + subDirName + "/";
    Pattern myPattern = Pattern.compile(REPLACE_PATTERN, FLAGS);
    Matcher myMatcher = myPattern.matcher(htmlWebPage);

    StringBuffer buffy = new StringBuffer();
    for(int i = 0; i < counter.length ; i++){
      if (myMatcher.find()) {
        myMatcher.appendReplacement(buffy, replace_str);
      }
    }

    myMatcher.appendTail(buffy);
    String newHtml=buffy.toString();
    return newHtml;
  }

  /**
   * Add all path and name to make full images path
   *
   * @param : url, html file name, string array contains part of tag
   * @return : string of array contain updated img tag
   */
  public String[] makeFullImgPath(String urlPath, String htmlFileName, String[] imgDir){
    int splitIndex = 0;
    String path = null;

    splitIndex = urlPath.indexOf(htmlFileName);
    path = urlPath.substring(0, splitIndex);

    for(int i =0; i < imgDir.length ; i++){
      imgDir[i] = path+imgDir[i];
    }
    return imgDir;
  }

  /**
   * Downloads images from the relative directories in the HTML
   *
   * @param : name of sub dir name, string array of full images path, images dir
   */
  public void downloadImg(String subDirPathName, String[] fullImgPath, String[] imgDir) {
    String targetDirAndFile = null;
    String imgFileName = null;
    for (int i = 0; i < fullImgPath.length; i++){
      try {
        URL url = new URL(fullImgPath[i]);
        URLConnection urlConnection = url.openConnection();
        InputStream is = urlConnection.getInputStream();
        int splitIndex = 0;
        splitIndex = fullImgPath[i].lastIndexOf("/");
        imgFileName = fullImgPath[i].substring(++splitIndex);
        targetDirAndFile = subDirPathName + "/" + imgFileName;
        FileOutputStream file = new FileOutputStream(targetDirAndFile);

        int len;
        byte[] buf = new byte[256];
        while ( (len = is.read(buf)) >= 0) {
          file.write(buf, 0, len);
        }
        is.close();
        file.close();

      }
      catch (EOFException e) {
        System.out.println(e);
      }
      catch (Exception e) {
        System.out.println(e.toString());
        System.exit(1);
      }
    }
  }

  /**
   * Checks usage and throws crawler exception
   *
   * @param : command-line args
   */
  private static void checkUsage(String[] args)throws CrawlerException{
    if (args.length != 2){
      throw new CrawlerException();
    }

    File dir = new  File(args[0]);
    if(!(dir.exists())){
       throw new CrawlerException();
    }

    try{
      URL url = new URL(args[1]);
    }catch(MalformedURLException mue){
      throw new CrawlerException();
    }
  }

  /**
   * Gets html page by line
   *
   * @return : html contents
   */
  public String getPage() throws IOException{

    final String LINE_SEPARATOR = System.getProperty("line.separator");
    BufferedReader bin = null;
    StringBuffer buffy = new StringBuffer("");
    try{
      bin = new BufferedReader(new InputStreamReader(url.openStream()));
      String line = null;
      while((line = bin.readLine()) != null){
        buffy.append(line);
        buffy.append(LINE_SEPARATOR);
      }
    }finally{
      if (bin != null) {
        bin.close();
      }
    }
    return buffy.toString() ;
  }

/**
 * main that drives this program
 *
 * @param : command line args
 */
  public static void main(String[] args) throws CrawlerException {

  checkUsage(args);

  try{
    String webPage = null;
    String htmlFileName = null;
    WebCrawler myReader = new WebCrawler();
    myReader.setURL(new URL(args[1]));
    webPage = myReader.getPage();

    File file = new File(args[1]);
    //index.html
    htmlFileName = file.getName();
    //index_html_files
    String subDirName = myReader.makeSubDirName(htmlFileName);
    //make sub dir name
    String subDirPathName = myReader.makeSubDirectory(subDirName, args[0]);
    //get img dir
    String imgDir[] = myReader.getImgSrcDir(webPage, subDirName);
    //replace img patterns
    String newWebPage = patternReplace(webPage, subDirName, imgDir );
    //replace relative link
    String  newWebPageURL= myReader.replaceLinkSrcDir(newWebPage, args[1], htmlFileName);

    String targetDir = args[0];                       //target dir
    URL url = new URL(args[1]);
    String s = url.getFile();
    if (s != null && s.length() > 0) {
      s = s.substring(s.lastIndexOf("/"));
      File f = new File(targetDir + s);
      f.createNewFile();
      writeContents(f, newWebPageURL, args[0]);
      //store final img path
      String fullImgPath[] = myReader.makeFullImgPath(args[1], htmlFileName, imgDir);
      myReader.downloadImg( subDirPathName, fullImgPath, imgDir);
    }
  }
  catch (IOException e) {
    System.out.println(e);
  }
 }

}






0
 

Author Comment

by:dkim18
ID: 9915009
If I leave targetDirAndFile = subDirPathName + "\\" + imgFileName; like this on the Sun Solaris system, the HTML file does contain all the text. But there are still no downloaded images in the subdirectory.


  public void downloadImg(String subDirPathName, String[] fullImgPath, String[] imgDir) {
    String targetDirAndFile = null;
    String imgFileName = null;
    for (int i = 0; i < fullImgPath.length; i++){
      try {
        URL url = new URL(fullImgPath[i]);
        URLConnection urlConnection = url.openConnection();
        InputStream is = urlConnection.getInputStream();
        int splitIndex = 0;
        splitIndex = fullImgPath[i].lastIndexOf("/");
        imgFileName = fullImgPath[i].substring(++splitIndex);
        targetDirAndFile = subDirPathName + "\\" + imgFileName;
        FileOutputStream file = new FileOutputStream(targetDirAndFile);

        int len;
        byte[] buf = new byte[256];
        while ( (len = is.read(buf)) >= 0) {
          file.write(buf, 0, len);
        }
        is.close();
        file.close();

      }
      catch (EOFException e) {
        System.out.println(e);
      }
      catch (Exception e) {
        System.out.println(e.toString());
        System.exit(1);
      }
    }
  }
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 9915046
What exact command are you giving in Solaris?
0
 
LVL 15

Accepted Solution

by:jimmack
jimmack earned 100 total points
ID: 9915129
I didn't go through all the code you posted, but I saw this line (in the middle of your last post):

>> targetDirAndFile = subDirPathName + "\\" + imgFileName;

This (and any other relevant lines) should be changed to:

targetDirAndFile = subDirPathName + File.separator + imgFileName;

if you want this code to work cross-platform.
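
As a rough sketch of the same idea: the two-argument `File` constructor inserts the platform separator for you and also copes with a directory string that already ends in a separator. The class name `PortablePaths` and the method `join` are illustrative only.

```java
import java.io.File;

// Hypothetical helper showing a platform-neutral way to build a path.
class PortablePaths {

    // Joins a directory and a file name using the platform's separator.
    // A trailing separator on the directory does not produce a doubled one.
    static String join(String dir, String name) {
        return new File(dir, name).getPath();
    }

    public static void main(String[] args) {
        // Same result whether or not the directory ends with a separator.
        System.out.println(join("classes", "hw5.gif"));
        System.out.println(join("classes" + File.separator, "hw5.gif"));
    }
}
```

This avoids hard-coding either "\\" or "/" anywhere in the path-building code.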
0
 
LVL 92

Assisted Solution

by:objects
objects earned 100 total points
ID: 9915157
Do you have access to create the directory and files?
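
A quick way to check this from Java, as a sketch only (the class `DirAccessCheck` and its `describe` method are hypothetical names, not part of the posted code):

```java
import java.io.File;

// Hypothetical diagnostic: reports why a directory might not be usable
// as a download target.
class DirAccessCheck {

    static String describe(File dir) {
        if (!dir.exists()) {
            return "missing";
        }
        if (!dir.isDirectory()) {
            return "not a directory";
        }
        if (!dir.canWrite()) {
            return "not writable";
        }
        return "ok";
    }

    public static void main(String[] args) {
        // Checks the directory given on the command line, or the current one.
        File dir = new File(args.length > 0 ? args[0] : ".");
        System.out.println(dir.getAbsolutePath() + ": " + describe(dir));
    }
}
```

Running it against both the download directory and the image subdirectory would show whether either is missing or read-only.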
0
 

Author Comment

by:dkim18
ID: 9915163
java -classpath classes dkim18.crawler.WebCrawler classes/ http://webdev.apl.jhu.edu/%7Emed/summer03/homework/05LibrarySwing.html

It just doesn't download the images into the subdirectory (classes/05LibrarySwing_html_file).
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 9915199
I assume you still have write access to classes directory?

What is the code inside your class that manipulates that directory parameter?
0
 

Author Comment

by:dkim18
ID: 9915221
I didn't go through all the code you posted, but I saw this line (in the middle of your last post):

 targetDirAndFile = subDirPathName + "\\" + imgFileName;

This (and any other relevant lines) should be changed to:

>>Again, if I leave targetDirAndFile = subDirPathName + "\\" + imgFileName; like this on the Sun Solaris system, the HTML file does contain all the text. But there are still no downloaded images in the subdirectory.

>>Again, this program works fine on Windows!
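
One possible explanation, offered as a guess rather than a diagnosis: on Unix-like systems the backslash is an ordinary file-name character, not a separator. A path built with "\\" therefore names a single file whose name contains a backslash, located next to the subdirectory rather than inside it. The small sketch below shows how `File` interprets such a path; the class name `BackslashName` is made up for illustration.

```java
import java.io.File;

// Hypothetical demo: how java.io.File treats a backslash on each platform.
class BackslashName {

    // Returns the last path component as the current platform sees it.
    static String lastComponent(String path) {
        return new File(path).getName();
    }

    public static void main(String[] args) {
        String path = "05LibrarySwing_html_files\\hw5.gif";
        // On Windows the backslash is a separator, so the name is the part after it.
        // On Solaris or Linux the whole string is treated as one file name.
        System.out.println("separator: " + File.separator);
        System.out.println("last component: " + lastComponent(path));
    }
}
```

If that is what is happening, listing the download directory on Solaris should show files with a backslash in their names.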
0
 
LVL 92

Expert Comment

by:objects
ID: 9915272
Add some debug to determine whether the problem is with the download, or with the file writing.
0
 

Author Comment

by:dkim18
ID: 9915394
I checked the HTML source and there is this line:
<img src="05LibrarySwing_html_files/hw5.gif" alt="Hw5 snapshot">
which means this was changed from
<img src="hw5/hw5.gif" alt="Hw5 snapshot"> to
<img src="05LibrarySwing_html_files/hw5.gif" alt="Hw5 snapshot">

but it just doesn't download the images...
0
 
LVL 92

Expert Comment

by:objects
ID: 9915584
Add some debug to print out the urls of things it is downloading and the file path it is saving it to.
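
Something along these lines would do, as a sketch. The class `DownloadDebug` and its methods are hypothetical; the file name is derived the same way the posted downloadImg does it (everything after the last "/"), and the image URL in `main` is simply the page's directory plus the `hw5/hw5.gif` path mentioned earlier in the thread.

```java
import java.io.File;

// Hypothetical debug helper: shows which URL is fetched and where it is saved.
class DownloadDebug {

    // Derives the local target file from the image URL: everything after
    // the last '/' becomes the file name inside the subdirectory.
    static File targetFor(String subDirPathName, String imageUrl) {
        String name = imageUrl.substring(imageUrl.lastIndexOf('/') + 1);
        return new File(subDirPathName, name);
    }

    // Builds one line of debug output for a single image.
    static String describe(String subDirPathName, String imageUrl) {
        File target = targetFor(subDirPathName, imageUrl);
        File parent = target.getAbsoluteFile().getParentFile();
        boolean parentOk = parent != null && parent.isDirectory();
        return "GET " + imageUrl + " -> " + target.getAbsolutePath()
                + " (target directory exists: " + parentOk + ")";
    }

    public static void main(String[] args) {
        System.err.println(describe("classes/05LibrarySwing_html_files",
                "http://webdev.apl.jhu.edu/%7Emed/summer03/homework/hw5/hw5.gif"));
    }
}
```

Calling `describe` at the top of the download loop should make it obvious whether the URL is wrong, the target path is wrong, or the target directory was never created.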
0

Author Comment

by:dkim18
ID: 9915791
If I want to detect (<img...src="..."...>) in an HTML file,

Is this "<img\\s+src=\"(.*?)\".*?>" correct?
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 9915947
Not quite. For one thing, URLs are sometimes not in quotes.
0
 

Author Comment

by:dkim18
ID: 9916045
For this project, the pattern is always like this: <img...src="..."...>
0
 
LVL 92

Expert Comment

by:objects
ID: 9916164
> Is this "<img\\s+src=\"(.*?)\".*?>" correct?

are you accessing the same url as you were when running windoze?
0
 

Author Comment

by:dkim18
ID: 9916271
are you accessing the same url as you were when running windoze?
>>yes
0
 
LVL 92

Expert Comment

by:objects
ID: 9916290
Then the pattern matching should be the same, shouldn't it?
Have you added debug to determine exactly where the problem is occurring? (I'm too lazy to wade through all that code :) )
0
 

Author Comment

by:dkim18
ID: 9916334
I am trying... but it is weird... now I get an empty HTML file when I run it.
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 9916515
This will match with or without quotes:

String re = "<img\\s+src=\"*([^>\"]+)\"*>";
0
 
LVL 92

Expert Comment

by:objects
ID: 9916554
> File file = new File(args[1]);

args[1] is a URL, not a file.
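
A sketch of one way to get the page's file name without going through `File`. This swaps in `java.net.URI` so that no checked exception is involved; the class `UrlFileName` and the method `fileNameOf` are illustrative names only.

```java
import java.net.URI;

// Hypothetical helper: extracts the last path segment of a URL.
class UrlFileName {

    // Uses the raw (still percent-encoded) path, so a query string is
    // ignored and escapes such as %7E are left untouched.
    static String fileNameOf(String address) {
        String path = URI.create(address).getRawPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        System.out.println(fileNameOf(
                "http://webdev.apl.jhu.edu/%7Emed/summer03/homework/05LibrarySwing.html"));
    }
}
```

For the URL used in this thread this gives 05LibrarySwing.html, and it does not depend on how the local file system interprets slashes.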

0
 

Author Comment

by:dkim18
ID: 9916728
args[1] is a URL, not a file.
>>The next statement will get html file name and I am using this file name.
>>htmlFileName = file.getName();

String re = "<img\\s+src=\"*([^>\"]+)\"*>";
>>didn't work on Windows or Solaris
0
 
LVL 92

Expert Comment

by:objects
ID: 9916819
print out all these values and post the results:

   File file = new File(args[1]);
    //index.html
    htmlFileName = file.getName();
    //index_html_files
    String subDirName = myReader.makeSubDirName(htmlFileName);
    //make sub dir name
    String subDirPathName = myReader.makeSubDirectory(subDirName, args[0]);
    //get img dir
    String imgDir[] = myReader.getImgSrcDir(webPage, subDirName);
    //replace img patterns
    String newWebPage = patternReplace(webPage, subDirName, imgDir );
    //replace relative link
    String  newWebPageURL= myReader.replaceLinkSrcDir(newWebPage, args[1], htmlFileName);

    String targetDir = args[0];                       //target dir
    URL url = new URL(args[1]);
    String s = url.getFile();
    if (s != null && s.length() > 0) {
      s = s.substring(s.lastIndexOf("/"));
      File f = new File(targetDir + s);
0
 
LVL 86

Assisted Solution

by:CEHJ
CEHJ earned 150 total points
ID: 9916836
Try

String re = "<img\\s+src=\"*([^>\\s\"]+)";
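
To try that pattern in isolation, a small test harness like the following may help. The class `ImgSrcExtractor` is a made-up name, and the case-insensitive flag is an addition of mine. Note the pattern only matches when src is the first attribute after img, and it does not strip single quotes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical harness for the pattern suggested above.
class ImgSrcExtractor {

    // Optional double quote, then everything up to whitespace, '>' or '"'.
    private static final Pattern IMG_SRC =
            Pattern.compile("<img\\s+src=\"*([^>\\s\"]+)", Pattern.CASE_INSENSITIVE);

    // Returns the src value of every matching img tag, in document order.
    static List<String> extract(String html) {
        List<String> sources = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            sources.add(m.group(1));
        }
        return sources;
    }

    public static void main(String[] args) {
        String html = "<img src=\"hw5/hw5.gif\" alt=\"Hw5 snapshot\">\n"
                + "<IMG SRC=logo.png>";
        System.out.println(extract(html)); // prints [hw5/hw5.gif, logo.png]
    }
}
```

The first sample tag is the one quoted earlier in the thread; the unquoted one is invented to exercise the no-quotes case.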
0
 
LVL 92

Expert Comment

by:objects
ID: 9916855
isn't the RE working already in Windoze?
0
 
LVL 92

Expert Comment

by:objects
ID: 9967073
0
