Java 1.4 Regular Expression Problem

cneaton
Hi,

I would like to look into the HTML source code of a web page and determine if certain patterns are present.  If you look at the code below, I'm trying to find the tag:
<SCRIPT LANGUAGE=JavaScript>
Although the code below reports that the tag is present, when I print the matching region (with System.out.println(match);) I get a segment of text that begins at, say, <SCRIPT LANGUAGE=..., but doesn't end until the end of the file.  I would like the match to end at the first ">".  Can anyone help me with this?  Thanks.

import java.util.regex.*;
import java.io.*;
import java.net.*;

public class BasicMatch {
    public static void main(String[] args) {
        CharSequence s = "";
        try {
            String webpage = "http://javascript.internet.com/";
            InputStreamReader rd = getReader(webpage);
            BufferedReader bfr = new BufferedReader(rd);
            StringBuffer sb = new StringBuffer("");
            // readLine() drops line terminators, so the page ends up as one long line
            while ((s = bfr.readLine()) != null) {
                sb.append(s);
            }
            s = (CharSequence) sb.toString();
        } catch (Exception e) {
            System.out.println(e);
        }
        // Compile regular expression
        String patternStr = "<SCRIPT LANGUAGE=(.)*(J|j)avaScript(.)*\"(.)*>";
        Pattern pattern = Pattern.compile(patternStr);
        // Determine if pattern exists in input
        Matcher matcher = pattern.matcher(s);
        boolean matchFound = matcher.find();
        System.out.println(matchFound);
        if (matchFound) {
            // Get matching string
            String match = matcher.group();
            System.out.println(match);
            // Get indices of matching string
            int start = matcher.start();
            int end = matcher.end();
            // the end is index of the last matching character + 1
            System.out.println(start);
            System.out.println(end);
            // Find the next occurrence
            matchFound = matcher.find();
        }
    }

    static InputStreamReader getReader(String uri) throws IOException {
        // Retrieve from Internet.
        if (uri.startsWith("http:")) {
            URLConnection conn = new URL(uri).openConnection();
            //System.out.println(uri);
            return new InputStreamReader(conn.getInputStream());
        }
        // Retrieve from file.
        else {
            return new FileReader(uri);
        }
    }
}
Mick Barry, Java Developer
Top Expert 2010

Commented:
try:

String patternStr = "<SCRIPT LANGUAGE=(.)*(J|j)avaScript(.)*\"(.)*?>";

Author

Commented:
Hi Objects,

Thanks for your reply!  I tried that and it didn't work.  Any more ideas?

Thanks!
Mick Barry, Java Developer

Commented:
String patternStr = "<SCRIPT LANGUAGE=(.)*?(J|j)avaScript(.)*?\"(.)*?>";
Mick Barry, Java Developer

Commented:
No that won't work either.
Mick Barry, Java Developer

Commented:
one more try :)

String patternStr = "<SCRIPT LANGUAGE=(.)*?(J|j)avaScript(.)*?\"(.)*?>";
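
For anyone comparing the suggestions, here is a minimal sketch of how the original greedy pattern and the fully reluctant one behave on the same input. The sample page string and the firstMatch helper are made up for illustration; the input is a single line because the original program appends readLine() results, which carry no line terminators.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctant {
    // Hypothetical sample page, already joined into one line.
    static final String PAGE =
        "<html><SCRIPT LANGUAGE=\"JavaScript\">var a = \"x\";</SCRIPT>"
        + "<p class=\"note\">text</p></html>";

    // Returns the first match of regex in input, or null when there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String greedy = "<SCRIPT LANGUAGE=(.)*(J|j)avaScript(.)*\"(.)*>";
        String reluctant = "<SCRIPT LANGUAGE=(.)*?(J|j)avaScript(.)*?\"(.)*?>";

        // Greedy: each (.)* grabs as much as it can, so the match runs from
        // the tag start to the last ">" on the line.
        System.out.println(firstMatch(greedy, PAGE));

        // Reluctant: each (.)*? grabs as little as it can, so the match
        // stops at the first ">" after the quoted value.
        System.out.println(firstMatch(reluctant, PAGE));
    }
}
```

On this sample the greedy pattern returns everything from <SCRIPT to the closing </html>, while the reluctant one returns only <SCRIPT LANGUAGE="JavaScript">. Note that the reluctant pattern still requires a double quote before the ">", so an unquoted LANGUAGE=JavaScript tag would run on to the next quote.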
Top Expert 2016

Commented:
How about:

<SCRIPT [^>]+?>
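
A rough sketch of that pattern with java.util.regex, assuming a current JDK; the class name, the findTags helper, and the sample HTML are hypothetical. Because [^>] can never consume a ">", the match cannot run past the end of the tag, and a negated class also matches line breaks, so a tag split over two lines is still found.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ScriptTagFinder {
    static final Pattern TAG = Pattern.compile("<SCRIPT [^>]+?>");

    // Hypothetical sample: one single-line tag and one tag split over two lines.
    static final String SAMPLE =
        "<SCRIPT LANGUAGE=JavaScript>if (a > b) go();</SCRIPT>\n"
        + "<SCRIPT LANGUAGE=\"JavaScript\"\n        SRC=\"menu.js\"></SCRIPT>";

    // Collects every opening SCRIPT tag that carries at least one attribute.
    static List<String> findTags(String html) {
        List<String> tags = new ArrayList<String>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        List<String> tags = findTags(SAMPLE);
        for (int i = 0; i < tags.size(); i++) {
            System.out.println("[" + tags.get(i) + "]");
        }
    }
}
```

As written, the pattern is case-sensitive and needs a space after SCRIPT, so a bare <SCRIPT> or a lowercase <script ...> is not reported.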
Here's one that uses the org.apache.oro.text.perl package (as I'm still using JDK 1.3).  The pattern should work though.

import org.apache.oro.text.perl.*;

public class ParserUsingPerl {

    static String[] lines = {
        "<script language=\"JavaScript\">",
        "<!--",
        "<script>",
        "<script >",
        "<script language=\"JavaScript\" type=\"text/javascript\">",
        "<script type=\"text/javascript\">",
        "<scriptlanguage=\"JavaScript\">",
        "< script language=\"JavaScript\">",
        "a dummy line",
        "<script language=\"Beanscript\">",
        "<script language= \n" +
            "Beanscript>",
        "<script LANGUAGE = \"JavaScript\" type=\"text/javascript\">",
    };

    static String pat = "m#\\s*<SCRIPT\\s+?LANGUAGE\\s*=?\\s*[\\\"|']?([A-Z|a-z]+.*?)[\\\"|']?.*?>#i";
    static Perl5Util perl5 = new Perl5Util();

    public static void main(String[] args) {
        try {
            for (int i=0; i<lines.length; i++) {
                String match = "";
                if (perl5.match(pat,lines[i])) {
                    System.out.print("MATCH: ");
                    match = perl5.group(1);
                } else {
                    System.out.print("     : ");
                }
                System.out.println(lines[i]);
                if (!match.equals("")) {
                    System.out.println("       found " + match);
                }
            }
        } catch (Exception e) {
            e.printStackTrace(System.err);
        }
    }
}
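
For readers on JDK 1.4 or later, the same idea can be sketched with java.util.regex instead of ORO. This is a simplified pattern, not a character-for-character port: it assumes the "=" is always present, drops the "|" from the character classes (inside brackets it only matches a literal pipe), and replaces the Perl-style m#...#i wrapper with the CASE_INSENSITIVE flag. The class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LanguageAttribute {
    // Group 1 captures the value of the LANGUAGE attribute, quoted or not.
    static final Pattern SCRIPT_LANGUAGE = Pattern.compile(
        "<SCRIPT\\s+LANGUAGE\\s*=\\s*[\"']?([A-Za-z]+)[\"']?[^>]*>",
        Pattern.CASE_INSENSITIVE);

    // Returns the LANGUAGE value of the first matching tag, or null if none.
    static String language(String line) {
        Matcher m = SCRIPT_LANGUAGE.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] lines = {
            "<script language=\"JavaScript\">",
            "<script type=\"text/javascript\">",
            "<scriptlanguage=\"JavaScript\">",
            "<script language= \nBeanscript>",
            "<script LANGUAGE = \"JavaScript\" type=\"text/javascript\">",
        };
        for (int i = 0; i < lines.length; i++) {
            System.out.println(language(lines[i]) + " <- " + lines[i]);
        }
    }
}
```

With these sample lines, the first, fourth, and fifth entries yield a language name and the other two yield null.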
This pattern seems to work:
<SCRIPT LANGUAGE=(.)*(J|j)avaScript[^>]*\"[^>]*>

Here is a full test with another variant:


import java.util.regex.*;
import java.io.*;

public class FindTest{
    public static void main(String[] args) throws Exception{
        Pattern p = Pattern.compile("<SCRIPT\\s*LANGUAGE=\\s*\"JavaScript\"[^>]*>", Pattern.CASE_INSENSITIVE);
        StringBuffer buffer = new StringBuffer();
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        for( int ch=in.read(); ch>=0; ch=in.read()){
            buffer.append((char)ch);
        }
        Matcher m = p.matcher(buffer);
        while( m.find() ){
            System.out.println("Start: "+m.start()+", end: "+m.end()+", group: ["+m.group()+"]");
        }
    }
}

And yet another one:
        Pattern p = Pattern.compile("<SCRIPT\\s*LANGUAGE=\\s*\"JavaScript\".*?>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

BTW, the last one seems to be the same as objects' first suggestion. In what way didn't it work?

Igor
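
The DOTALL flag in the last variant matters because FindTest reads the file character by character and so keeps the line breaks. By default "." does not match a line terminator, whereas a negated class such as [^>] does. A small sketch of the difference, with a made-up two-line tag and a hypothetical firstMatch helper:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotallDemo {
    static final String DOT_REGEX = "<SCRIPT\\s*LANGUAGE=\\s*\"JavaScript\".*?>";
    static final String CLASS_REGEX = "<SCRIPT\\s*LANGUAGE=\\s*\"JavaScript\"[^>]*>";

    // Hypothetical tag whose attributes are split over two lines.
    static final String TAG =
        "<SCRIPT LANGUAGE=\"JavaScript\"\n        SRC=\"menu.js\">";

    // Returns the first match of regex (compiled with flags) in TAG, or null.
    static String firstMatch(String regex, int flags) {
        Matcher m = Pattern.compile(regex, flags).matcher(TAG);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        int ci = Pattern.CASE_INSENSITIVE;
        // "." stops at the line break, so there is no match: prints null
        System.out.println(firstMatch(DOT_REGEX, ci));
        // With DOTALL, "." crosses the line break: prints the whole tag
        System.out.println(firstMatch(DOT_REGEX, ci | Pattern.DOTALL));
        // [^>] matches line breaks without any extra flag: prints the whole tag
        System.out.println(firstMatch(CLASS_REGEX, ci));
    }
}
```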
Mick Barry, Java Developer

Commented:
Tested the last one I posted and it worked ok.
"<SCRIPT LANGUAGE=(.)*?(J|j)avaScript(.)*?\"(.)*?>"

Author

Commented:
Thanks to all for your help!  Bazarny, I substituted your code into my original program and it works really well!  Thanks a lot again to everyone!
Mick Barry, Java Developer

Commented:
What was wrong with the regexp I supplied?
It worked fine here?
Ugh, it's a second time this month another contributor is unhappy when points are awarded to me. Is there any way around?

Regards,

Igor Bazarny,
Brainbench MVP for Java 1
www.brainbench.com

Author

Commented:
Hi Objects,

There wasn't anything wrong with the solution you provided, and I really appreciate all of your help, but Bazarny's solution had a little more information that has proved to be really helpful in the development of my code.  For instance, the "Pattern.CASE_INSENSITIVE" flag that he included in his solution saves me from having to do things like (j|J)avascript and allows me to catch the word "javascript" in a much more flexible way.  Again, I thank you for all of your help, and I am grateful for your suggestions because you were the first person to reply to me.  However, I chose Bazarny's solution because I felt it was the most helpful.
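
To illustrate the point about the flag, here is a small sketch comparing the (J|j) alternation with Pattern.CASE_INSENSITIVE. The class name, the countMatches helper, and the list of spellings are made up for the example.

```java
import java.util.regex.Pattern;

class CaseFlagDemo {
    // Only the first letter may vary in case.
    static final Pattern ALTERNATION = Pattern.compile("(J|j)avaScript");
    // Every letter may vary in case.
    static final Pattern FLAGGED =
        Pattern.compile("javascript", Pattern.CASE_INSENSITIVE);

    // Counts how many of the inputs contain a match for the pattern.
    static int countMatches(Pattern p, String[] inputs) {
        int count = 0;
        for (int i = 0; i < inputs.length; i++) {
            if (p.matcher(inputs[i]).find()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String[] spellings = {"JavaScript", "javaScript", "javascript", "JAVASCRIPT"};
        System.out.println(countMatches(ALTERNATION, spellings)); // prints 2
        System.out.println(countMatches(FLAGGED, spellings));     // prints 4
    }
}
```

The alternation misses "javascript" and "JAVASCRIPT" because the capital S and the remaining lowercase letters are still matched exactly.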
Mick Barry, Java Developer

Commented:
> it's a second time this month another contributor is
> unhappy when points are awarded to me.

I'm not unhappy the points were awarded to you, was just curious what was wrong with what I provided.
