cneaton

Java 1.4 Regular Expression Problem

Hi,

I would like to look into the HTML source code of a web page and determine if certain patterns are present.  If you look at the code below, I'm trying to find the tag:
<SCRIPT LANGUAGE=JavaScript>
Although the output below signals that the tag is present, when I try to print out the matching region (with System.out.println(match);), it gives me a segment of text that begins at, say, <SCRIPT LANGUAGE=..., but doesn't end until the end of the file. I would like the match to end at the first ">". Can anyone help me with this? Thanks.

import java.util.regex.*;
import java.io.*;
import java.net.*;

public class BasicMatch {
    public static void main(String[] args) {
        CharSequence s = "";
        try {
            String webpage = "http://javascript.internet.com/";
            InputStreamReader rd = getReader(webpage);
            BufferedReader bfr = new BufferedReader(rd);
            StringBuffer sb = new StringBuffer("");
            // readLine() drops line terminators, so the whole page is joined into one long line
            while ((s = bfr.readLine()) != null) {
                sb.append(s);
            }
            s = sb.toString();
        } catch (Exception e) {
            System.out.println(e);
        }

        // Compile regular expression
        String patternStr = "<SCRIPT LANGUAGE=(.)*(J|j)avaScript(.)*\"(.)*>";
        Pattern pattern = Pattern.compile(patternStr);

        // Determine if pattern exists in input
        Matcher matcher = pattern.matcher(s);
        boolean matchFound = matcher.find();
        System.out.println(matchFound);

        if (matchFound) {
            // Get matching string
            String match = matcher.group();
            System.out.println(match);

            // Get indices of matching string
            int start = matcher.start();
            int end = matcher.end();   // index of the last matching character + 1
            System.out.println(start);
            System.out.println(end);

            // Find the next occurrence
            matchFound = matcher.find();
        }
    }

    static InputStreamReader getReader(String uri) throws IOException {
        if (uri.startsWith("http:")) {
            // Retrieve from Internet.
            URLConnection conn = new URL(uri).openConnection();
            return new InputStreamReader(conn.getInputStream());
        } else {
            // Retrieve from file.
            return new FileReader(uri);
        }
    }
}
Mick Barry

try:

String patternStr = "<SCRIPT LANGUAGE=(.)*(J|j)avaScript(.)*\"(.)*?>";
cneaton

ASKER

Hi Objects,

Thanks for your reply!  I tried that and it didn't work.  Any more ideas?

Thanks!
String patternStr = "<SCRIPT LANGUAGE=(.)*?(J|j)avaScript(.)*?\"(.)*?>";
No that won't work either.
one more try :)

String patternStr = "<SCRIPT LANGUAGE=(.)*?(J|j)avaScript(.)*?\"(.)*?>";
How about:

<SCRIPT [^>]+?>
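A rough sketch of why the negated character class behaves differently from the earlier suggestions. It assumes the tag on the page is unquoted, as written in the question; the sample page string, the class name ScriptTagMatch and the helper firstMatch are made up for illustration. With an unquoted tag, the patterns that require a literal \" have to keep scanning until they find a quote somewhere later in the page, whether the quantifiers are greedy or reluctant, while [^>] can never step over a ">".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ScriptTagMatch {
    // Pattern from the question: every (.)* is greedy.
    static final Pattern GREEDY =
        Pattern.compile("<SCRIPT LANGUAGE=(.)*(J|j)avaScript(.)*\"(.)*>");
    // Reluctant variant suggested earlier in the thread; it still requires a literal quote.
    static final Pattern RELUCTANT =
        Pattern.compile("<SCRIPT LANGUAGE=(.)*?(J|j)avaScript(.)*?\"(.)*?>");
    // Negated class: [^>] cannot consume '>', so the match stops at the tag's own '>'.
    static final Pattern BOUNDED = Pattern.compile("<SCRIPT [^>]+?>");

    // Hypothetical helper: first match of p in input, or null if there is none.
    static String firstMatch(Pattern p, CharSequence input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Made-up page, already joined into one line as in the question's read loop.
        String page =
            "<HTML><SCRIPT LANGUAGE=JavaScript>document.write(\"hi\");</SCRIPT></HTML>";
        System.out.println("greedy:    " + firstMatch(GREEDY, page));
        System.out.println("reluctant: " + firstMatch(RELUCTANT, page));
        System.out.println("bounded:   " + firstMatch(BOUNDED, page));
    }
}
```

On this sample the greedy pattern runs to the end of the page, the reluctant one still runs through the closing </SCRIPT> tag, and only the bounded one stops at the first ">". If the real page quotes the attribute value, the reluctant version may well stop where expected, which could explain why it tested fine for one poster and not for another.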
Here's one that uses the org.apache.oro.text.perl package (as I'm still using JDK 1.3). The pattern should work, though.

import org.apache.oro.text.perl.*;

public class ParserUsingPerl {

    static String[] lines = {
        "<script language=\"JavaScript\">",
        "<!--",
        "<script>",
        "<script >",
        "<script language=\"JavaScript\" type=\"text/javascript\">",
        "<script type=\"text/javascript\">",
        "<scriptlanguage=\"JavaScript\">",
        "< script language=\"JavaScript\">",
        "a dummy line",
        "<script language=\"Beanscript\">",
        "<script language= \n" +
            "Beanscript>",
        "<script LANGUAGE = \"JavaScript\" type=\"text/javascript\">",
    };

    static String pat = "m#\\s*<SCRIPT\\s+?LANGUAGE\\s*=?\\s*[\\\"|']?([A-Z|a-z]+.*?)[\\\"|']?.*?>#i";
    static Perl5Util perl5 = new Perl5Util();

    public static void main(String[] args) {
        try {
            for (int i=0; i<lines.length; i++) {
                String match = "";
                if (perl5.match(pat,lines[i])) {
                    System.out.print("MATCH: ");
                    match = perl5.group(1);
                } else {
                    System.out.print("     : ");
                }
                System.out.println(lines[i]);
                if (!match.equals("")) {
                    System.out.println("       found " + match);
                }
            }
        } catch (Exception e) {
            e.printStackTrace(System.err);
        }
    }
}
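For anyone on JDK 1.4, the same expression can probably be carried over to java.util.regex more or less unchanged. The sketch below is an assumption about how that port would look, not something posted in this thread: the m#...#i wrapper is dropped, the trailing "i" becomes Pattern.CASE_INSENSITIVE, and the class name OroPatternPort and helper languageOf are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OroPatternPort {
    // Same expression as the ORO example, without the Perl-style m#...#i delimiters.
    static final Pattern PAT = Pattern.compile(
        "\\s*<SCRIPT\\s+?LANGUAGE\\s*=?\\s*[\"|']?([A-Z|a-z]+.*?)[\"|']?.*?>",
        Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: capture group 1 (the language value), or null if the line does not match.
    static String languageOf(String line) {
        Matcher m = PAT.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // A few of the test lines from the ORO example above.
        String[] lines = {
            "<script language=\"JavaScript\" type=\"text/javascript\">",
            "<script type=\"text/javascript\">",
            "<script language=\"Beanscript\">",
        };
        for (int i = 0; i < lines.length; i++) {
            String lang = languageOf(lines[i]);
            System.out.println((lang != null ? "MATCH: " : "     : ") + lines[i]);
            if (lang != null) {
                System.out.println("       found " + lang);
            }
        }
    }
}
```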
ASKER CERTIFIED SOLUTION
Igor Bazarny
This solution is only available to members.
Tested the last one I posted and it worked ok.
"<SCRIPT LANGUAGE=(.)*?(J|j)avaScript(.)*?\"(.)*?>"
cneaton

ASKER

Thanks to all for your help! Bazarny, I substituted your code into my original program and it works really well! Thanks again to everyone!
What was wrong with the regexp I supplied?
It worked fine here?
Ugh, it's a second time this month another contributor is unhappy when points are awarded to me. Is there any way around this?

Regards,

Igor Bazarny,
Brainbench MVP for Java 1
www.brainbench.com
cneaton

ASKER

Hi Objects,

There wasn't anything wrong with the solution you provided, and I really appreciate all of your help, but Bazarny's solution had a little more information that has proved to be really helpful in the development of my code. For instance, the "Pattern.CASE_INSENSITIVE" flag that he included in his solution saves me from having to do things like (j|J)avascript and allows me to catch the word "javascript" in a much more flexible way. Again, I thank you for all of your help, and I am grateful for your suggestions because you were the first person to reply to me. However, I chose Bazarny's solution because I felt it was the most helpful.
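To illustrate the flag mentioned above, here is a minimal sketch. The accepted solution's actual pattern is not visible in this thread, so the expression below is an assumed stand-in, and the class name CaseInsensitiveScriptTag and helper firstTag are made up. It only shows that Pattern.CASE_INSENSITIVE removes the need for alternations such as (J|j).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaseInsensitiveScriptTag {
    // Assumed pattern (not the accepted answer's): a <script ...> tag whose
    // attributes mention "javascript" in any mix of upper and lower case.
    static final Pattern TAG =
        Pattern.compile("<script\\b[^>]*javascript[^>]*>", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: the first matching tag, or null if none is found.
    static String firstTag(CharSequence input) {
        Matcher m = TAG.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstTag("<SCRIPT LANGUAGE=JavaScript>var x;</SCRIPT>"));
        System.out.println(firstTag("<script language=\"JAVASCRIPT\">var x;</script>"));
        System.out.println(firstTag("<script language=\"Beanscript\">var x;</script>"));
    }
}
```

The first two calls should return just the opening tag in its original casing, and the third should return null because the tag never mentions "javascript".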
> it's a second time this month another contributor is
> unhappy when points are awarded to me.

I'm not unhappy that the points were awarded to you; I was just curious what was wrong with what I provided.