Solved

Defining regular expression

Posted on 2003-12-04
Last Modified: 2010-03-31

How do I define a regex that matches text starting with <a href=" but not followed by http://?

Example:
I want to detect relative links such as <a href="HW/homework.txt">, but not <a href="http://....">

P.S.
The format is always going to be <a href="<hyperlink>">

Question by:dkim18
 
LVL 35

Expert Comment

by:girionis
ID: 9880822
 If you already have the string then simply do a string.indexOf("http"). If it is not found it will return a -1.
 
LVL 35

Expert Comment

by:girionis
ID: 9880824
 BTW, the absence of http does not guarantee a relative link, since the path might just as well be /home/dkim18/mpla/mpla/mpla...
 
LVL 24

Expert Comment

by:sciuriware
ID: 9881036
To find a relative link in String k, try:

int foundLink;
String m = k.toUpperCase(); // Upper-case a copy so that <a href is caught as well

    if((foundLink = m.indexOf("<A HREF")) >= 0)   // Found an anchor
    {
          if(m.indexOf("HTTP", foundLink) < 0) // No HTTP after it, that's good ...
          {
                  // ...... go on as you like it .......
          }
    }

Note: this non-regex approach does not cope with multiple spaces between A and HREF ...
;JOOP!

 
LVL 35

Expert Comment

by:girionis
ID: 9881078
 Better (if "s" is the string variable that holds the string)

s.toLowerCase().indexOf("http")
 
LVL 24

Expert Comment

by:sciuriware
ID: 9881178
Anyway dkim18, it's hard to use a regex to express that you do NOT want to find something.
A more precise piece of code could be:

int foundLink;
String m = k.toUpperCase(); // Upper-case a copy so that <a href is caught as well

    // String.matches() must match the whole string, hence the leading and trailing .*
    if(m.matches("(?s).*<A +HREF.*") && (foundLink = m.indexOf(" HREF")) >= 0)   // Found something; allows many spaces between A and HREF
    {
          if(m.indexOf("HTTP", foundLink) < 0) // No HTTP after it, that's good ...
          {
                  // ...... go on again as you like it .......
          }
    }

Can you live with all of the above?
;JOOP!
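
For comparison, java.util.regex does support negative lookahead, which is one way to say "not followed by http://" in the pattern itself. The following is only a minimal sketch: the class and method names are made up for illustration, and it assumes double-quoted href values as described in the question. Note that it treats anything without an http(s) scheme as relative, including paths that start with /.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original thread.
class RelativeLinkFinder {
    // (?!https?://) is a zero-width negative lookahead: it only checks that the
    // text after the opening quote does not start with http:// or https://.
    // \s+ allows any amount of whitespace between "a" and "href".
    private static final Pattern RELATIVE_HREF = Pattern.compile(
            "<a\\s+href=\"(?!https?://)([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Returns the href values that do not start with an http(s) scheme.
    static List<String> findRelativeLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = RELATIVE_HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }
}
```

Because the lookahead consumes nothing, group 1 holds the complete relative path.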
 
LVL 86

Expert Comment

by:CEHJ
ID: 9882363
Try

String re =".+href=\"*http:.+|.+href=\"*/.+";
boolean absoluteUrl = someLink.matches(re);
 

Author Comment

by:dkim18
ID: 9888463
This is what I intended.
So far, final String REPLACE_PATTERN = "<a href=\"[^(http)]"; detects all relative links, but when the match is replaced with "newURL", the first character of the relative path after the opening quote disappears.
Example:
if the relative link is: <a href="save/save.txt">
and the absolute prefix (newURL) is: <a href="http://www.abc/hw/

The result is: <a href="http://www.abc/hw/ave/save.txt">
but I want: <a href="http://www.abc/hw/save/save.txt">
so that I can follow the absolute link from my local computer.

I know the problem is here:
final String REPLACE_PATTERN = "<a href=\"[^(http://)]";
Somehow [^(http://)] makes the character 's' disappear from the example above.

Here is my code:
---------------------------
public static String patternReplaceURL(String htmlWebPage, String url, String htmlFile){
    int splitIndex = 0;
    String newURL = null;

    splitIndex = url.indexOf(htmlFile);
    newURL = url.substring(0, splitIndex);
 
  final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL ;
  final String REPLACE_PATTERN = "<a href=\"[^(http://)]";
   
  Pattern myPattern = Pattern.compile(REPLACE_PATTERN, FLAGS);
  Matcher myMatcher = myPattern.matcher(htmlWebPage);

  StringBuffer buffy = new StringBuffer();
  for(int i = 0; i < 3 ; i++){
    if (myMatcher.find()) {
      myMatcher.appendReplacement(buffy, replace_str);
    }
  }

  myMatcher.appendTail(buffy);
  System.out.println(buffy.toString());

  String newHtml=buffy.toString();
  return newHtml;
}
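
A likely explanation for the missing character: [^(http://)] is a character class, so it matches and consumes exactly one character that is not one of ( h t p : / ). That character ('s' in the example) becomes part of the match and is replaced along with the rest. The sketch below reproduces the symptom and shows one possible fix using a zero-width lookahead. The class and method names are hypothetical, and it assumes replace_str was "<a href=\"" + newURL, which the posted code does not show.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original thread.
class HrefPrefixer {
    // Reproduces the symptom: the negated character class consumes one character.
    static String withCharacterClass(String html, String baseUrl) {
        Pattern p = Pattern.compile("<a href=\"[^(http://)]", Pattern.CASE_INSENSITIVE);
        // quoteReplacement keeps any $ or \ in baseUrl from being read as group references.
        return p.matcher(html).replaceAll(
                Matcher.quoteReplacement("<a href=\"" + baseUrl));
    }

    // One possible fix: (?!http://) checks what follows without consuming it.
    static String withLookahead(String html, String baseUrl) {
        Pattern p = Pattern.compile("<a href=\"(?!http://)", Pattern.CASE_INSENSITIVE);
        return p.matcher(html).replaceAll(
                Matcher.quoteReplacement("<a href=\"" + baseUrl));
    }
}
```

The character class has a second side effect: a relative link whose first character is h, t or p (for example hw/a.txt) is not matched at all.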
 
LVL 86

Expert Comment

by:CEHJ
ID: 9888720
How will your replacement rules work in the following case?

<a href="../a/b/x.html">Some link</a>

How would you put in the parent directory?
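
One way to handle the parent-directory case without string surgery is to let java.net.URI resolve the link against the page's own URL. This is a sketch only: the class and method names are hypothetical, and it assumes the base URL has a path component, as in http://www.xyz.com/hw/index.html.

```java
import java.net.URI;

// Hypothetical helper, not part of the original thread.
class LinkResolver {
    // Resolves href against the URL of the page it appears on.
    // Absolute hrefs come back unchanged; "../" segments are normalised away.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }
}
```

With the page URL http://www.xyz.com/hw/index.html, the link ../a/b/x.html resolves to http://www.xyz.com/a/b/x.html, and save/save.txt resolves to http://www.xyz.com/hw/save/save.txt.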
 

Author Comment

by:dkim18
ID: 9888822
How will your replacement rules work in the following case?

<a href="../a/b/x.html">Some link</a>
>>If this is not an absolute path, it will be detected and replaced with an absolute one.
>>Is that what you wanted to ask?

I just need to solve the replacement problem. When the code above replaces a relative path with an absolute one, it cuts off the first character of the relative path.

Example:
if the relative link is: <a href="save/save.txt">
and the absolute link is: <a href="http://www.xyz.com/hw/index.html
(these lines of code build the new URL path without index.html)
    int splitIndex = 0;
    String newURL = null;

    splitIndex = url.indexOf(htmlFile);
    newURL = url.substring(0, splitIndex);

The result is: <a href="http://www.abc/hw/ave/save.txt">
but I want: <a href="http://www.abc/hw/save/save.txt">
so that I can follow the absolute link from my local computer.

Again, the problem is here:
final String REPLACE_PATTERN = "<a href=\"[^(http://)]";

Somehow it swallows the first character of the relative path after <a href="


I hope I have made my point clear.
 
LVL 86

Expert Comment

by:CEHJ
ID: 9888868
Yes, let's forget that .. business for now. It seems to me your expression is not quite right - for instance, the character class does not seem appropriate here. The following works for me:

  String s = "zzzz<a href=/c/d/file.html>A</a>zzzz<a href=\"c/d/file.html\">zzzz</a>zzzz" +
        "<a href=\"../c/d/file.html\">zzzz</a>zzzz<a href=\"./c/d/file.html\">zzzz</a>";
       
        patternReplaceURL(s, null, null);
        
        
.................
        

  public static String patternReplaceURL(String htmlWebPage, String url, String htmlFile) {
        final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL;
        final String REPLACE_PATTERN = "<a href=(\")*[/\\.]*";
        final String replace_str = "<a href=$1http://www.xxx.com/";

         Pattern myPattern = Pattern.compile(REPLACE_PATTERN, FLAGS);
          Matcher myMatcher = myPattern.matcher(htmlWebPage);
          StringBuffer buffy = new StringBuffer();
          while (myMatcher.find()) {
              myMatcher.appendReplacement(buffy, replace_str);
          }

          myMatcher.appendTail(buffy);
          System.out.println(buffy.toString());
          String newHtml = buffy.toString();
          return newHtml;
  }

 

Author Comment

by:dkim18
ID: 9888952
This line detects absolute links:
final String REPLACE_PATTERN = "<a href=(\")*[/\\.]*";

I am trying to detect relative links
Example:
 <a href="save/save.txt">
and to replace the start with something like "<a href="http://www.abc./com/hw1/"
so that I get an absolute link like "<a href="http://www.abc./com/hw1/save/save.txt">
 
LVL 86

Expert Comment

by:CEHJ
ID: 9889153
Sorry - didn't know you had mixed relative/absolute in source. Shall tweak it if I have time ;-)
 
LVL 86

Accepted Solution

by:
CEHJ earned 300 total points
ID: 9892530
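This should leave the absolute ones untouched. It works by trying to construct a URL from each link: a relative link throws a MalformedURLException (no protocol), and the replacement is done in the exception handler.

 // Needs: import java.net.URL; import java.net.MalformedURLException; import java.util.regex.*;
 public static String patternReplaceURL(String htmlWebPage, String url, String htmlFile) throws Exception  {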
    final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL;
    final String FIND_PATTERN = "(<a href=\"*)(([/\\.]*)([^>\"]+))";
    final String replace_str = "$1http://www.xxx.com/$4";
    Pattern myPattern = Pattern.compile(FIND_PATTERN, FLAGS);
    Matcher myMatcher = myPattern.matcher(htmlWebPage);
    StringBuffer buffy = new StringBuffer();
    while (myMatcher.find()) {
      // DEBUG
      /*
      System.out.println("$1 = " + myMatcher.group(1));
      System.out.println("$2 = " + myMatcher.group(2));
      System.out.println("$3 = " + myMatcher.group(3));
      System.out.println("$4 = " + myMatcher.group(4));
      */
      try {
        URL uri = new URL(myMatcher.group(4));
      }
      catch(MalformedURLException e) {
        myMatcher.appendReplacement(buffy, replace_str);
      }
    }

    myMatcher.appendTail(buffy);
    ///System.out.println(buffy.toString());
    return buffy.toString();
  }
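
The method above hard-codes http://www.xxx.com/ as the prefix. As a variation, the same find-and-replace loop can resolve each link against the page's real URL. This is a sketch under assumptions, not the accepted code: the class name, method name and pageUrl parameter are hypothetical, and only double-quoted href values are handled.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical variation on the accepted approach.
class HrefAbsolutizer {
    // Group 1: tag up to the opening quote, group 2: the link, group 3: closing quote.
    private static final Pattern HREF = Pattern.compile(
            "(<a\\s+href=\")([^\"]*)(\")", Pattern.CASE_INSENSITIVE);

    static String absolutize(String html, String pageUrl) {
        URI base = URI.create(pageUrl);
        Matcher m = HREF.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String resolved;
            try {
                // Relative links are resolved; absolute ones come back unchanged.
                resolved = base.resolve(m.group(2)).toString();
            } catch (IllegalArgumentException e) {
                resolved = m.group(2); // leave links that are not valid URIs as they are
            }
            // quoteReplacement stops $ and \ in the link from being read as group references.
            m.appendReplacement(out,
                    Matcher.quoteReplacement(m.group(1) + resolved + m.group(3)));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Resolving against the page URL also covers links such as ../up.html, which a fixed prefix cannot express.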