Solved

Defining regular expression

Posted on 2003-12-04
Last Modified: 2010-03-31

How do I define a regex that matches text starting with <a href=" but not followed by http://?

ex)
I want to detect relative links such as <a href="HW/homework.txt">, but not absolute ones such as <a href="http://....">

p.s.
The format is always going to be <a href="<hyperlink>">

Question by:dkim18
13 Comments
 
LVL 35

Expert Comment

by:girionis
ID: 9880822
 If you already have the string, simply call string.indexOf("http"). It returns -1 if "http" is not found.
 
LVL 35

Expert Comment

by:girionis
ID: 9880824
 BTW, the absence of http does not guarantee a relative link, since the path might just as well be /home/dkim18/mpla/mpla/mpla...
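To illustrate the point about paths such as /home/..., here is a minimal sketch that uses java.net.URI to tell the three cases apart instead of searching for "http". The class name HrefKind, the method classify and the sample file names are made up for illustration; they are not from the original posts.

```java
import java.net.URI;
import java.net.URISyntaxException;

class HrefKind {
    /** Returns "absolute", "root-relative", "relative" or "invalid" for an href value. */
    static String classify(String href) {
        try {
            URI uri = new URI(href);
            if (uri.isAbsolute()) {
                return "absolute";          // has a scheme such as http:
            }
            String path = uri.getPath();
            if (path != null && path.startsWith("/")) {
                return "root-relative";     // no scheme, but starts at the root
            }
            return "relative";
        } catch (URISyntaxException e) {
            return "invalid";               // e.g. contains unescaped spaces
        }
    }

    public static void main(String[] args) {
        System.out.println(classify("HW/homework.txt"));        // relative
        System.out.println(classify("/home/dkim18/file.txt"));  // root-relative
        System.out.println(classify("http://www.abc/hw/"));     // absolute
    }
}
```

Whether a root-relative path should be rewritten the same way as a plain relative one depends on what you do with the link afterwards.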
 
LVL 24

Expert Comment

by:sciuriware
ID: 9881036
To find a relative link in String k, try:

int foundLink;
String m = k.toUpperCase(); // Upper-case the text so that <a href is caught as well ....

    if ((foundLink = m.indexOf("<A HREF")) >= 0)   // Found something
    {
          if (m.indexOf("HTTP", foundLink) < 0) // Not found, that's good ...
          {
                  // ...... go on as you like it .......
          }
    }

Note: this non-regex approach does not allow for multiple spaces between A and HREF ...
;JOOP!
 
LVL 35

Expert Comment

by:girionis
ID: 9881078
 Better (if "s" is the string variable that holds the string)

s.toLowerCase().indexOf("http")
 
LVL 24

Expert Comment

by:sciuriware
ID: 9881178
Anyway dkim18, it's hard to use regex to define that you do NOT want to find something.
A more precise piece of code could be:

int foundLink;
String m = k.toUpperCase(); // Upper-case the text so that <a href is caught as well ....

    // matches() has to match the whole string, hence the .* on both sides;
    // " +" allows many spaces between A and HREF ...
    if (m.matches("(?s).*<A +HREF.*") && (foundLink = m.indexOf(" HREF")) >= 0)   // Found something
    {
          if (m.indexOf("HTTP", foundLink) < 0) // Search the upper-cased copy; not found, that's good ...
          {
                  // ...... go on again as you like it .......
          }
    }

Can you live with all of the above?
;JOOP!
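For what it's worth, java.util.regex does have a way to say "not followed by": the zero-width negative lookahead (?!...). Below is a minimal sketch of how it could be used to list relative links. The class name RelativeLinkFinder, the method findRelativeLinks and the sample HTML are assumptions for illustration, and also excluding https is my addition.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RelativeLinkFinder {
    // (?!https?://) rejects the match when the href value starts with http:// or
    // https://, without consuming any characters. \s+ allows several spaces
    // between A and HREF.
    private static final Pattern RELATIVE_HREF = Pattern.compile(
            "<a\\s+href=\"(?!https?://)([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    static List<String> findRelativeLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = RELATIVE_HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1)); // the href value without the quotes
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"HW/homework.txt\">hw</a> "
                    + "<A  HREF=\"http://example.com/\">x</A>";
        System.out.println(findRelativeLinks(html)); // prints [HW/homework.txt]
    }
}
```

Note that this still treats a path beginning with / as "relative", since it only checks for the missing protocol.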
 
LVL 86

Expert Comment

by:CEHJ
ID: 9882363
Try

String re =".+href=\"*http:.+|.+href=\"*/.+";
boolean absoluteUrl = someLink.matches(re);

 

Author Comment

by:dkim18
ID: 9888463
This is what I intended.
So far I have final String REPLACE_PATTERN = "<a href=\"[^(http)]"; this detects the relative links, but when the match is replaced with "newURL", the first character of the relative path after the " disappears.
ex)
if the relative link is: <a href="save/save.txt">
and the absolute link (newURL) is: <a href="http://www.abc/hw/

Result is: <a href="http://www.abc/hw/ave/save.txt">
but I want <a href="http://www.abc/hw/save/save.txt">
so that I can follow this absolute link from my local computer.

I know the problem is here:
final String REPLACE_PATTERN = "<a href=\"[^(http://)]";
Somehow, [^(http://)] makes the character 's' from the above example disappear.

Here is my code:
---------------------------
public static String patternReplaceURL(String htmlWebPage, String url, String htmlFile){
    int splitIndex = 0;
    String newURL = null;

    splitIndex = url.indexOf(htmlFile);
    newURL = url.substring(0, splitIndex);
 
  final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL ;
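  final String REPLACE_PATTERN = "<a href=\"[^(http://)]";
  // replace_str was not shown in the post; assumed here from the result described above
  final String replace_str = "<a href=\"" + newURL;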
   
  Pattern myPattern = Pattern.compile(REPLACE_PATTERN, FLAGS);
  Matcher myMatcher = myPattern.matcher(htmlWebPage);

  StringBuffer buffy = new StringBuffer();
  for(int i = 0; i < 3 ; i++){
    if (myMatcher.find()) {
      myMatcher.appendReplacement(buffy, replace_str);
    }
  }

  myMatcher.appendTail(buffy);
  System.out.println(buffy.toString());

  String newHtml=buffy.toString();
  return newHtml;
}
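A short sketch of why the first character goes missing, under the assumption that the replacement string is <a href=" followed by newURL. [^(http://)] is a character class, not a group: it matches exactly one character that is not '(', 'h', 't', 'p', ':', '/' or ')', and that character becomes part of the match and is replaced along with the rest. A zero-width negative lookahead, (?!http://), checks the same condition without consuming anything. The class and method names below are invented for the demonstration.

```java
import java.util.regex.Matcher;

class CharClassVsLookahead {
    // Pattern from the post: the character class consumes one character of the path.
    static final String CHAR_CLASS = "(?i)<a href=\"[^(http://)]";

    // Possible fix: a lookahead tests for "http://" but matches zero characters.
    static final String LOOKAHEAD = "(?i)<a href=\"(?!http://)";

    static String prefixLinks(String html, String pattern, String newURL) {
        // quoteReplacement keeps any '$' or '\' in newURL from being read as a group reference
        return html.replaceAll(pattern, Matcher.quoteReplacement("<a href=\"" + newURL));
    }

    public static void main(String[] args) {
        String base = "http://www.abc/hw/";
        String html = "<a href=\"save/save.txt\">";
        // prints <a href="http://www.abc/hw/ave/save.txt">  (the 's' was consumed)
        System.out.println(prefixLinks(html, CHAR_CLASS, base));
        // prints <a href="http://www.abc/hw/save/save.txt">
        System.out.println(prefixLinks(html, LOOKAHEAD, base));
    }
}
```

The character class has a second side effect: a relative link whose first character is h, t or p (for example test.txt) is not matched at all.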
 
LVL 86

Expert Comment

by:CEHJ
ID: 9888720
How will your replacement rules work in the following case?

<a href="../a/b/x.html">Some link</a>

How would you put in the parent directory?
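One possible answer to the parent-directory question, as a sketch rather than what the posted code does: java.net.URI can resolve a relative reference against the URL of the page, and it folds ../ and ./ segments while doing so. The class name ResolveDemo, the method toAbsolute and the sample page URL are assumptions for illustration.

```java
import java.net.URI;

class ResolveDemo {
    /** Resolves an href against the URL of the page that contains it. */
    static String toAbsolute(String pageUrl, String href) {
        // resolve() drops the last path segment of the base (the file name),
        // appends the reference and normalizes "." and ".." segments.
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "http://www.xyz.com/hw/index.html";
        System.out.println(toAbsolute(page, "../a/b/x.html")); // http://www.xyz.com/a/b/x.html
        System.out.println(toAbsolute(page, "save/save.txt")); // http://www.xyz.com/hw/save/save.txt
        System.out.println(toAbsolute(page, "/c/d/file.html")); // http://www.xyz.com/c/d/file.html
    }
}
```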
 

Author Comment

by:dkim18
ID: 9888822
How will your replacement rules work in the following case?

<a href="../a/b/x.html">Some link</a>
>>If this is not an absolute path, then it will be detected and replaced with an absolute path.
>>Is that what you wanted to ask?

I just need to solve the replacement problem. When the above code replaces a relative path with an absolute path, it cuts off the first character of the relative path.

ex)
if the relative link is: <a href="save/save.txt">
and the absolute link is: <a href="http://www.xyz.com/hw/index.html
(these lines of code build the new URL path without index.html)
    int splitIndex = 0;
    String newURL = null;

    splitIndex = url.indexOf(htmlFile);
    newURL = url.substring(0, splitIndex);

Result is: <a href="http://www.abc/hw/ave/save.txt">
but I want <a href="http://www.abc/hw/save/save.txt">
so that I can follow this absolute link from my local computer.

Again, the problem is here:
final String REPLACE_PATTERN = "<a href=\"[^(http://)]";

Somehow it consumes the first character of the relative path after <a href="


I hope I made my point clear.
 
LVL 86

Expert Comment

by:CEHJ
ID: 9888868
Yes, let's forget that .. business for now. It seems to me your expression is not quite right - for instance, the character class does not seem appropriate here. The following works for me:

  String s = "zzzz<a href=/c/d/file.html>A</a>zzzz<a href=\"c/d/file.html\">zzzz</a>zzzz" +
      "<a href=\"../c/d/file.html\">zzzz</a>zzzz<a href=\"./c/d/file.html\">zzzz</a>";

  patternReplaceURL(s, null, null);


.................


  public static String patternReplaceURL(String htmlWebPage, String url, String htmlFile) {
    final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL;
    final String REPLACE_PATTERN = "<a href=(\")*[/\\.]*";
    final String replace_str = "<a href=$1http://www.xxx.com/";

    Pattern myPattern = Pattern.compile(REPLACE_PATTERN, FLAGS);
    Matcher myMatcher = myPattern.matcher(htmlWebPage);
    StringBuffer buffy = new StringBuffer();
    while (myMatcher.find()) {
      myMatcher.appendReplacement(buffy, replace_str);
    }

    myMatcher.appendTail(buffy);
    System.out.println(buffy.toString());
    String newHtml = buffy.toString();
    return newHtml;
  }

 

Author Comment

by:dkim18
ID: 9888952
This line detected absolute links:
final String REPLACE_PATTERN = "<a href=(\")*[/\\.]*";

I am trying to detect relative links
ex)
 <a href="save/save.txt">
and replace the start with something like "<a href="http://www.abc./com/hw1/"
so that I get an absolute link like "<a href="http://www.abc./com/hw1/save/save.txt">
 
LVL 86

Expert Comment

by:CEHJ
ID: 9889153
Sorry - didn't know you had mixed relative/absolute in source. Shall tweak it if I have time ;-)
 
LVL 86

Accepted Solution

by:
CEHJ earned 300 total points
ID: 9892530
This should leave the absolute ones untouched. It works by trying to construct a URL from each link: the ones that throw an exception (due to no protocol, i.e. relative) are replaced in the exception handler.

 public static String patternReplaceURL(String htmlWebPage, String url, String htmlFile) throws Exception  {
    final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL;
    final String FIND_PATTERN = "(<a href=\"*)(([/\\.]*)([^>\"]+))";
    final String replace_str = "$1http://www.xxx.com/$4";
    Pattern myPattern = Pattern.compile(FIND_PATTERN, FLAGS);
    Matcher myMatcher = myPattern.matcher(htmlWebPage);
    StringBuffer buffy = new StringBuffer();
    while (myMatcher.find()) {
      // DEBUG
      /*
      System.out.println("$1 = " + myMatcher.group(1));
      System.out.println("$2 = " + myMatcher.group(2));
      System.out.println("$3 = " + myMatcher.group(3));
      System.out.println("$4 = " + myMatcher.group(4));
      */
      try {
        URL uri = new URL(myMatcher.group(4));
      }
      catch(MalformedURLException e) {
        myMatcher.appendReplacement(buffy, replace_str);
      }
    }

    myMatcher.appendTail(buffy);
    ///System.out.println(buffy.toString());
    return buffy.toString();
  }
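As a variation on the code above, and not the accepted method itself, here is a sketch that swaps the exception test and the hard-coded http://www.xxx.com/ prefix for java.net.URI.resolve, so the base comes from the page URL and leading ../ segments are folded rather than dropped. The class name LinkRewriter, the method absolutize and the sample URLs in main are assumptions for illustration.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkRewriter {
    // Group 1: the opening of the tag up to the quote; group 2: the href value.
    private static final Pattern HREF = Pattern.compile(
            "(<a\\s+href=\")([^\">]+)", Pattern.CASE_INSENSITIVE);

    static String absolutize(String html, String pageUrl) {
        URI base = URI.create(pageUrl);
        Matcher m = HREF.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String resolved;
            try {
                // Absolute hrefs come back unchanged; relative ones are resolved.
                resolved = base.resolve(m.group(2)).toString();
            } catch (IllegalArgumentException e) {
                resolved = m.group(2); // not a valid URI: leave it untouched
            }
            m.appendReplacement(out, Matcher.quoteReplacement(m.group(1) + resolved));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<a href=\"save/save.txt\">s</a> <a href=\"http://www.abc/hw/\">h</a>";
        System.out.println(absolutize(html, "http://www.xyz.com/hw/index.html"));
    }
}
```

Unlike the version above, this one only handles quoted href values; unquoted ones would need a looser pattern.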
Question has a verified solution.
