Solved

Defining regular expression

Posted on 2003-12-04
Last Modified: 2010-03-31

How do I define a regex that matches text starting with <a href=" but not followed by http://?

Example:
I want to detect relative links such as <a href="HW/homework.txt">, but not <a href="http://....">

P.S.
The format is always going to be <a href="<hyperlink>">

Question by:dkim18
13 Comments
 
LVL 35

Expert Comment

by:girionis
ID: 9880822
 If you already have the string then simply do a string.indexOf("http"). If it is not found it will return a -1.
 
LVL 35

Expert Comment

by:girionis
ID: 9880824
 BTW, the absence of http does not guarantee a relative link, since the path might just as well be /home/dkim18/mpla/mpla/mpla...
 
LVL 24

Expert Comment

by:sciuriware
ID: 9881036
To find a relative link in String k, try:

int foundLink;
String m = k.toUpperCase(); // Catches <a href as well ....

    if((foundLink = m.indexOf("<A HREF")) >= 0)   // Found something
    {
          if(m.indexOf("HTTP", foundLink) < 0) // Not found, that's good ...
          {
                  // ...... go on as you like it .......
          }
    }

Note: this non-regex approach doesn't allow for multiple spaces between A and H ...
;JOOP!

 
LVL 35

Expert Comment

by:girionis
ID: 9881078
 Better (if "s" is the string variable that holds the string)

s.toLowerCase().indexOf("http")
 
LVL 24

Expert Comment

by:sciuriware
ID: 9881178
Anyway dkim18, it's hard to use regex to define that you do NOT want to find something.
A more precise piece of code could be:

// needs java.util.regex.Matcher and java.util.regex.Pattern
String m = k.toUpperCase(); // Catches <a href as well ....
Matcher link = Pattern.compile("<A +HREF").matcher(m); // Allows many spaces between A and H ...

    if(link.find())   // Found something
    {
          if(m.indexOf("HTTP", link.start()) < 0) // Not found, that's good ...
          {
                  // ...... go on again as you like it .......
          }
    }

Can you live with all of the above?
;JOOP!
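
Editor's note: for readers who do want a pure regex, java.util.regex supports negative lookahead, which expresses "not followed by" directly. The following is only a sketch under the asker's stated assumption that links always look like <a href="...">; the class and method names are hypothetical, and it treats anything that does not start with http:// as relative.

```java
import java.util.regex.Pattern;

final class RelativeLinkFinder {
    // "(?!http://)" is a negative lookahead: it succeeds only when the text at
    // this position does NOT start with http://, and it consumes no characters.
    static final Pattern RELATIVE_LINK =
            Pattern.compile("<a\\s+href=\"(?!http://)", Pattern.CASE_INSENSITIVE);

    /** Returns true if html contains at least one href that does not start with http://. */
    static boolean hasRelativeLink(String html) {
        return RELATIVE_LINK.matcher(html).find();
    }

    public static void main(String[] args) {
        System.out.println(hasRelativeLink("<a href=\"HW/homework.txt\">"));      // true
        System.out.println(hasRelativeLink("<A   HREF=\"http://example.com/\">")); // false
    }
}
```

As girionis points out above, the absence of http:// does not by itself prove that a link is relative, so this check is only as good as that assumption.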
 
LVL 86

Expert Comment

by:CEHJ
ID: 9882363
Try

String re =".+href=\"*http:.+|.+href=\"*/.+";
boolean absoluteUrl = someLink.matches(re);
 

Author Comment

by:dkim18
ID: 9888463
This is what I intended.
So far I have final String REPLACE_PATTERN = "<a href=\"[^(http)]"; this detects all relative links, but when the match is replaced with "newURL", the first character of the relative path after the " disappears.
Example:
if the relative link is: <a href="save/save.txt">
and the absolute link is (newURL): <a href="http://www.abc/hw/

Result is: <a href="http://www.abc/hw/ave/save.txt">
but I want <a href="http://www.abc/hw/save/save.txt">
so that I can follow this absolute link from my local computer.

I know the problem is here:
final String REPLACE_PATTERN = "<a href=\"[^(http://)]";
Somehow, [^(http://)] makes the character 's' disappear in the example above.

Here is my code:
---------------------------
public static String patternReplaceURL(String htmlWebPage, String url, String htmlFile){
    int splitIndex = 0;
    String newURL = null;

    splitIndex = url.indexOf(htmlFile);
    newURL = url.substring(0, splitIndex);
 
  final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL ;
  final String REPLACE_PATTERN = "<a href=\"[^(http://)]";
   
  Pattern myPattern = Pattern.compile(REPLACE_PATTERN, FLAGS);
  Matcher myMatcher = myPattern.matcher(htmlWebPage);

  StringBuffer buffy = new StringBuffer();
  for(int i = 0; i < 3 ; i++){
    if (myMatcher.find()) {
      myMatcher.appendReplacement(buffy, replace_str);
    }
  }

  myMatcher.appendTail(buffy);
  System.out.println(buffy.toString());

  String newHtml=buffy.toString();
  return newHtml;
}
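
Editor's note: the missing 's' can be reproduced in isolation. Square brackets define a character class, so [^(http://)] matches exactly one character that is not (, h, t, p, :, / or ), and that character becomes part of the match that gets replaced. The sketch below contrasts it with a negative lookahead, which is an assumed fix and not something the asker wrote; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CharClassDemo {
    // Original pattern: the bracket expression matches ONE character that is
    // not '(', 'h', 't', 'p', ':', '/' or ')', and that character is consumed.
    static final String BROKEN = "<a href=\"[^(http://)]";

    // Assumed fix: a negative lookahead tests for "http://" without consuming anything.
    static final String FIXED = "<a href=\"(?!http://)";

    /** Replaces every match of regex in html with <a href=" followed by baseUrl. */
    static String prefixLinks(String regex, String html, String baseUrl) {
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(html);
        // quoteReplacement guards against '$' or '\' appearing in the base URL.
        return m.replaceAll(Matcher.quoteReplacement("<a href=\"" + baseUrl));
    }

    public static void main(String[] args) {
        String html = "<a href=\"save/save.txt\">";
        String base = "http://www.abc/hw/";
        System.out.println(prefixLinks(BROKEN, html, base)); // <a href="http://www.abc/hw/ave/save.txt">
        System.out.println(prefixLinks(FIXED, html, base));  // <a href="http://www.abc/hw/save/save.txt">
    }
}
```

The character class has a second side effect: a relative link whose first character is h, t or p (for example home/index.html) is not matched at all, so it would be left unconverted.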
 
LVL 86

Expert Comment

by:CEHJ
ID: 9888720
How will your replacement rules work in the following case?

<a href="../a/b/x.html">Some link</a>

How would you put in the parent directory?
 

Author Comment

by:dkim18
ID: 9888822
How will your replacement rules work in the following case?

<a href="../a/b/x.html">Some link</a>
>>If this is not an absolute path, it will be detected and replaced with an absolute path.
>>Is that what you wanted to ask?

I just need to solve the replacing problem. When the code above replaces a relative path with an absolute path, it cuts off the first character of the relative path.

Example:
if the relative link is: <a href="save/save.txt">
and the absolute link is: <a href="http://www.xyz.com/hw/index.html
(these lines of code build the new URL path without index.html)
    int splitIndex = 0;
    String newURL = null;

    splitIndex = url.indexOf(htmlFile);
    newURL = url.substring(0, splitIndex);

Result is: <a href="http://www.abc/hw/ave/save.txt">
but I want <a href="http://www.abc/hw/save/save.txt">
so that I can follow this absolute link from my local computer.

Again, the problem is here:
final String REPLACE_PATTERN = "<a href=\"[^(http://)]";

Somehow it swallows the first character of the relative path after <a href="


I hope I made my point clear.
 
LVL 86

Expert Comment

by:CEHJ
ID: 9888868
Yes, let's forget that .. business for now. It seems to me your expression is not quite right - for instance, the character class does not seem appropriate here. The following works for me:

  String s = "zzzz<a href=/c/d/file.html>A</a>zzzz<a href=\"c/d/file.html\">zzzz</a>zzzz" +
        "<a href=\"../c/d/file.html\">zzzz</a>zzzz<a href=\"./c/d/file.html\">zzzz</a>";
       
        patternReplaceURL(s, null, null);
        
        
.................
        

  public static String patternReplaceURL(String htmlWebPage, String url, String htmlFile) {
        final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL;
        final String REPLACE_PATTERN = "<a href=(\")*[/\\.]*";
        final String replace_str = "<a href=$1http://www.xxx.com/";

         Pattern myPattern = Pattern.compile(REPLACE_PATTERN, FLAGS);
          Matcher myMatcher = myPattern.matcher(htmlWebPage);
          StringBuffer buffy = new StringBuffer();
          while (myMatcher.find()) {
              myMatcher.appendReplacement(buffy, replace_str);
          }

          myMatcher.appendTail(buffy);
          System.out.println(buffy.toString());
          String newHtml = buffy.toString();
          return newHtml;
  }

 

Author Comment

by:dkim18
ID: 9888952
This line detected absolute links:
final String REPLACE_PATTERN = "<a href=(\")*[/\\.]*";

I am trying to detect relative links, for example
 <a href="save/save.txt">
and trying to replace the start with something like "<a href="http://www.abc./com/hw1/"
so that I get an absolute link like "<a href="http://www.abc./com/hw1/save/save.txt">
 
LVL 86

Expert Comment

by:CEHJ
ID: 9889153
Sorry - didn't know you had mixed relative/absolute in source. Shall tweak it if I have time ;-)
 
LVL 86

Accepted Solution

by:
CEHJ earned 300 total points
ID: 9892530
This should leave the absolute ones untouched. It works by trying to construct a URL from each link: the ones that throw an exception (due to no protocol, i.e. relative) are replaced in the handler.

 public static String patternReplaceURL(String htmlWebPage, String url, String htmlFile) throws Exception  {
    final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL;
    final String FIND_PATTERN = "(<a href=\"*)(([/\\.]*)([^>\"]+))";
    final String replace_str = "$1http://www.xxx.com/$4";
    Pattern myPattern = Pattern.compile(FIND_PATTERN, FLAGS);
    Matcher myMatcher = myPattern.matcher(htmlWebPage);
    StringBuffer buffy = new StringBuffer();
    while (myMatcher.find()) {
      // DEBUG
      /*
      System.out.println("$1 = " + myMatcher.group(1));
      System.out.println("$2 = " + myMatcher.group(2));
      System.out.println("$3 = " + myMatcher.group(3));
      System.out.println("$4 = " + myMatcher.group(4));
      */
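      try {
        // Succeeds only for absolute links; those are left untouched
        new URL(myMatcher.group(4));
      }
      catch(MalformedURLException e) {
        // No protocol, so the link is relative: prefix it with the base URL
        myMatcher.appendReplacement(buffy, replace_str);
      }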
    }

    myMatcher.appendTail(buffy);
    ///System.out.println(buffy.toString());
    return buffy.toString();
  }
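
Editor's note: the accepted code hard-codes http://www.xxx.com/ and strips any leading ./ or ../ from the link. As an alternative sketch, not part of the accepted answer, java.net.URI.resolve can resolve each href against the URL of the page, which also covers the ../a/b/x.html case raised earlier in the thread. The class name, method name and example URLs are assumptions, and only double-quoted href values are handled.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LinkAbsolutizer {
    // Group 1: the tag up to and including the opening quote; group 2: the href value.
    private static final Pattern HREF =
            Pattern.compile("(<a\\s+href=\")([^\">]+)", Pattern.CASE_INSENSITIVE);

    /** Rewrites every double-quoted href in html so that it is absolute relative to pageUrl. */
    static String absolutize(String html, String pageUrl) {
        URI base = URI.create(pageUrl);
        Matcher m = HREF.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String resolved;
            try {
                // resolve() returns absolute references unchanged and applies
                // the "./" and "../" segments of relative ones.
                resolved = base.resolve(m.group(2)).toString();
            } catch (IllegalArgumentException e) {
                resolved = m.group(2); // not a valid URI: leave it untouched
            }
            m.appendReplacement(out, Matcher.quoteReplacement(m.group(1) + resolved));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String page = "http://www.xxx.com/hw/index.html";
        System.out.println(absolutize("<a href=\"save/save.txt\">", page));
        System.out.println(absolutize("<a href=\"../a/b/x.html\">Some link</a>", page));
        System.out.println(absolutize("<a href=\"http://other.org/x\">", page));
    }
}
```

Because the base is the full page URL, there is no need to cut off index.html by hand first; resolution drops the last path segment of the base automatically.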