cneaton
asked on
Java 1.4 Regular Expression Problem
Hi,
I would like to look into the HTML source code of a web page and determine if certain patterns are present. If you look at the code below, I'm trying to find the tag:
<SCRIPT LANGUAGE=JavaScript>
Although the output below signals that the tag is present, when I try to print out the matching region (with System.out.println(match);), it gives me a segment of text that begins at, say, <SCRIPT LANGUAGE=... but doesn't end until the end of the file. I would like the match to end at the first ">". Can anyone help me with this? Thanks.
import java.util.regex.*;
import java.io.*;
import java.net.*;
public class BasicMatch {
    public static void main(String[] args) {
        CharSequence s = "";
        try {
            String webpage = "http://javascript.internet.com/";
            InputStreamReader rd = getReader(webpage);
            BufferedReader bfr = new BufferedReader(rd);
            StringBuffer sb = new StringBuffer("");
            while ((s = bfr.readLine()) != null) {
                sb.append(s);
            }
            s = (CharSequence) sb.toString();
        } catch (Exception e) {
            System.out.println(e);
        }
        // Compile regular expression
        String patternStr = "<SCRIPT LANGUAGE=(.)*(J|j)avaScript(.)*\"(.)*>";
        Pattern pattern = Pattern.compile(patternStr);
        // Determine if pattern exists in input
        CharSequence inputStr = "a b c b";
        Matcher matcher = pattern.matcher(s);
        boolean matchFound = matcher.find(); // true
        System.out.println(matchFound);
        if (matchFound) {
            // Get matching string
            String match = matcher.group(); // b
            System.out.println(match);
            // Get indices of matching string
            int start = matcher.start(); // 2
            int end = matcher.end(); // 3
            // the end is index of the last matching character + 1
            System.out.println(start);
            System.out.println(end);
            // Find the next occurrence
            matchFound = matcher.find(); // true
        }
    }
    static InputStreamReader getReader(String uri) throws IOException {
        // Retrieve from Internet.
        if (uri.startsWith("http:")) {
            URLConnection conn = new URL(uri).openConnection();
            //System.out.println(uri);
            return new InputStreamReader(conn.getInputStream());
        }
        // Retrieve from file.
        else {
            return new FileReader(uri);
        }
    }
}
ASKER
Hi Objects,
Thanks for your reply! I tried that and it didn't work. Any more ideas?
Thanks!
String patternStr = "<SCRIPT LANGUAGE=(.)*?(J|j)avaScript(.)*?\"(.)*?>";
No that won't work either.
one more try :)
String patternStr = "<SCRIPT LANGUAGE=(.)*?(J|j)avaScript(.)*?\"(.)*?>";
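For anyone comparing the two versions: the only change from the original pattern is the `?` after each `*`, which makes the quantifier reluctant (match as little as possible) instead of greedy. Here is a minimal sketch of the difference. The page string and the class and method names are made up for illustration; the input is a single line because the original program appends the readLine() results without line breaks, so `.` can run across the whole page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsReluctant {
    // Returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Hypothetical one-line page, standing in for the downloaded HTML.
        String page = "<HTML><SCRIPT LANGUAGE=\"JavaScript\">var x = 1;"
                + "</SCRIPT><P>text</P></HTML>";

        // Greedy: each (.)* grabs as much as it can, so the match runs
        // to the last '>' in the input.
        String greedy = "<SCRIPT LANGUAGE=(.)*(J|j)avaScript(.)*\"(.)*>";

        // Reluctant: each (.)*? grabs as little as it can, so the match
        // stops at the first '>' that lets the whole pattern succeed.
        String reluctant = "<SCRIPT LANGUAGE=(.)*?(J|j)avaScript(.)*?\"(.)*?>";

        System.out.println(firstMatch(greedy, page));    // ends with </HTML>
        System.out.println(firstMatch(reluctant, page)); // <SCRIPT LANGUAGE="JavaScript">
    }
}
```

Note that both patterns still require a double quote somewhere after "JavaScript", so an unquoted tag such as <SCRIPT LANGUAGE=JavaScript> can match further than expected.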
How about:
<SCRIPT [^>]+?>
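A quick sketch of how that pattern could be used with java.util.regex to collect every opening tag. The class name, method name, and sample HTML are assumptions for illustration only. Because `[^>]` can never match a `>`, the match cannot run past the end of the tag, so the reluctant `?` is harmless but not strictly needed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ScriptTagFinder {
    // "<SCRIPT " followed by one or more characters that are not '>', then '>'.
    static final Pattern SCRIPT_TAG = Pattern.compile("<SCRIPT [^>]+?>");

    // Collects every opening SCRIPT tag that carries at least one attribute.
    static List<String> findTags(String html) {
        List<String> tags = new ArrayList<String>();
        Matcher m = SCRIPT_TAG.matcher(html);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        // Hypothetical sample input.
        String html = "<HTML><SCRIPT LANGUAGE=JavaScript>a();</SCRIPT>"
                + "<SCRIPT SRC=\"x.js\"></SCRIPT></HTML>";
        List<String> tags = findTags(html);
        for (int i = 0; i < tags.size(); i++) {
            System.out.println(tags.get(i));
        }
    }
}
```

As written, the pattern is case-sensitive and needs a space after SCRIPT, so it skips a bare <SCRIPT> tag and lowercase <script ...> tags.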
Here's one that uses the org.apache.oro.text.perl package (as I'm still using JDK 1.3). The pattern should work, though.
import org.apache.oro.text.perl.*;
public class ParserUsingPerl {
    static String[] lines = {
        "<script language=\"JavaScript\">",
        "<!--",
        "<script>",
        "<script >",
        "<script language=\"JavaScript\" type=\"text/javascript\">",
        "<script type=\"text/javascript\">",
        "<scriptlanguage=\"JavaScript\">",
        "< script language=\"JavaScript\">",
        "a dummy line",
        "<script language=\"Beanscript\">",
        "<script language= \n" +
        "Beanscript>",
        "<script LANGUAGE = \"JavaScript\" type=\"text/javascript\">",
    };
    static String pat = "m#\\s*<SCRIPT\\s+?LANGUAGE\\s*=?\\s*[\\\"|']?([A-Z|a-z]+.*?)[\\\"|']?.*?>#i";
    static Perl5Util perl5 = new Perl5Util();
    public static void main(String[] args) {
        try {
            for (int i = 0; i < lines.length; i++) {
                String match = "";
                if (perl5.match(pat, lines[i])) {
                    System.out.print("MATCH: ");
                    match = perl5.group(1);
                } else {
                    System.out.print(" : ");
                }
                System.out.println(lines[i]);
                if (!match.equals("")) {
                    System.out.println(" found " + match);
                }
            }
        } catch (Exception e) {
            e.printStackTrace(System.err);
        }
    }
}
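For readers on JDK 1.4 or later, here is a rough java.util.regex sketch of the same idea. It is a simplified rewrite, not a tested one-to-one translation of the ORO pattern: it makes the `=` mandatory, captures letters only, and uses `[^>]*` to skip the rest of the tag. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ScriptLanguageParser {
    // <script, whitespace, language, optional whitespace around '=',
    // an optional quote, then the language name captured as group 1.
    static final Pattern LANGUAGE_ATTR = Pattern.compile(
            "<SCRIPT\\s+LANGUAGE\\s*=\\s*[\"']?([A-Za-z]+)[^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Returns the value of the language attribute, or null if the tag
    // does not start with "<script language=...".
    static String language(String tag) {
        Matcher m = LANGUAGE_ATTR.matcher(tag);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] lines = {
            "<script language=\"JavaScript\">",
            "<script>",
            "<scriptlanguage=\"JavaScript\">",
            "<script language= \nBeanscript>",
            "<script LANGUAGE = \"JavaScript\" type=\"text/javascript\">",
        };
        for (int i = 0; i < lines.length; i++) {
            String lang = language(lines[i]);
            System.out.println((lang != null ? "MATCH: " : "     : ") + lines[i]);
        }
    }
}
```

Like the ORO version, this only recognises the language attribute when it comes directly after the tag name.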
Tested the last one I posted and it worked ok.
"<SCRIPT LANGUAGE=(.)*?(J|j)avaScript(.)*?\"(.)*?>"
ASKER
Thanks to all for your help! Bazarny, I substituted your code into my original program and it works really well! Thanks a lot again to everyone!
What was wrong with the regexp I supplied?
It worked fine here?
Ugh, it's a second time this month another contributor is unhappy when points are awarded to me. Is there any way around?
Regards,
Igor Bazarny,
Brainbench MVP for Java 1
www.brainbench.com
ASKER
Hi Objects,
There wasn't anything wrong with the solution you provided, and I really appreciate all of your help, but Bazarny's solution had a little more information that has proved to be really helpful in the development of my code. For instance, the "Pattern.CASE_INSENSITIVE" flag that he included in his solution saves me from having to do things like (j|J)avascript and allows me to catch the word "javascript" in a much more flexible way. Again, I thank you for all of your help, and I am grateful for your suggestions because you were the first person to reply to me. However, I chose Bazarny's solution because I felt it was the most helpful.
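To illustrate the point about the flag, here is a small sketch contrasting the alternation with Pattern.CASE_INSENSITIVE. This is not Bazarny's actual pattern (it is not shown in this thread); the class name, method name, and sample inputs are made up for the comparison.

```java
import java.util.regex.Pattern;

public class CaseInsensitiveDemo {
    // Handles only the first letter: "JavaScript" or "javaScript".
    static final Pattern ALTERNATION = Pattern.compile("(J|j)avaScript");

    // Handles any mix of upper and lower case.
    static final Pattern FLAGGED =
            Pattern.compile("javascript", Pattern.CASE_INSENSITIVE);

    // True if the pattern occurs anywhere in the input.
    static boolean found(Pattern p, String input) {
        return p.matcher(input).find();
    }

    public static void main(String[] args) {
        // Hypothetical sample tags in different spellings.
        String[] inputs = {
            "<SCRIPT LANGUAGE=JavaScript>",
            "<script language=javascript>",
            "<SCRIPT LANGUAGE=JAVASCRIPT>",
        };
        for (int i = 0; i < inputs.length; i++) {
            System.out.println(inputs[i]
                    + " alternation=" + found(ALTERNATION, inputs[i])
                    + " flagged=" + found(FLAGGED, inputs[i]));
        }
    }
}
```

The alternation misses the all-lowercase and all-uppercase spellings because the capital S and the remaining letters are still matched exactly. The same flag can also be switched on inside the pattern itself with the embedded form (?i).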
> it's a second time this month another contributor is
> unhappy when points are awarded to me.
I'm not unhappy the points were awarded to you, was just curious what was wrong with what I provided.