Solved

Regular expression ?

Posted on 2003-12-01
Last Modified: 2010-03-31
I'm having difficulty using a regular expression to validate a repeating pattern like [*.*.*], where each * is a wildcard.  For example, the string java.sun.com should validate, while the string java.sun..com should be rejected because of the double ".."
My thought was to use a RE like:
(.*\.[a-zA-Z0-9])
but repeating the expression doesn't work.  
Question by:Taurus
17 Comments
 
LVL 92

Expert Comment

by:objects
ID: 9855490
Maybe something like:

\w+(\.+\w+)+
 
LVL 92

Expert Comment

by:objects
ID: 9855546
Whoops, we only want one dot :) That should be:

\w+(\.\w+)+
 

Author Comment

by:Taurus
ID: 9856701
Does not seem to do it.  I tested "java.sun..com" with this RE on http://www.regexlib.com/RETester.aspx and http://home.tiscali.be/stevevh/tools/testRE.html.
 
LVL 92

Expert Comment

by:objects
ID: 9856717
I just tested it with Java's regex and "java.sun..com" failed to match, as it should. It might be a case of different regex implementations.
 
LVL 92

Expert Comment

by:objects
ID: 9856740
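import java.util.regex.*;

public class a
{
      public static void main(String[] args)
      {
            // One or more word characters, then one or more ".label" groups.
            Pattern p = Pattern.compile("\\w+(\\.\\w+)+");
            // The string to test is passed as the first command-line argument.
            Matcher m = p.matcher(args[0]);
            // matches() succeeds only if the whole input fits the pattern.
            if (m.matches()) System.out.println("match");
      }
}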
 
LVL 92

Expert Comment

by:objects
ID: 9856745
Perhaps the tester is not processing the whole input string and is reporting a match on "java.sun".
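
The partial-match theory is easy to check in Java. The sketch below (the class and method names are made up for illustration) runs the same pattern two ways: matches() must consume the whole input, while find() is satisfied by any matching substring, which is presumably how the online testers behave.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PartialMatchDemo {
    // Same pattern as in the comments above, with no anchors.
    static final Pattern HOST = Pattern.compile("\\w+(\\.\\w+)+");

    // True only if the entire input fits the pattern.
    static boolean fullMatch(String input) {
        return HOST.matcher(input).matches();
    }

    // Returns the first substring the pattern matches, or null if there is none.
    static String firstPartialMatch(String input) {
        Matcher m = HOST.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "java.sun..com";
        System.out.println(fullMatch(input));          // false
        System.out.println(firstPartialMatch(input));  // java.sun
    }
}
```

A tester that only searches for the pattern would therefore report "java.sun..com" as a hit even though the string as a whole is invalid.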
 
LVL 86

Expert Comment

by:CEHJ
ID: 9858529
This is a possible workaround:

    String re = "[a-zA-Z_\\.]+";
    String input = "java..sun.com";
    boolean valid = input.matches(re) && input.indexOf("..") < 0;
 
LVL 3

Expert Comment

by:savalou
ID: 9858786
What you are trying to do is understandable, but I don't think schema validation is the right way to go.  The W3 document on data types (http://www.w3.org/TR/xmlschema-2/#anyURI) more or less says schema pattern validation of URIs is useless because URIs can be so many things.  Even in your circumstances, where you want a URL (more or less), so many patterns are valid that the "obvious" inconsistencies are not so obvious.  In most cases you would have to take the URI and do something with it in code anyway, right? Like get the file or whatever.

Anyway, the following pattern

<xsd:pattern value="((\w+://\w+(\.\w+)+(:\d+)?)?(/\w?)*)|([a-zA-Z]:)?(\\[a-zA-Z0-9\-_ ]*)*"/>

approves the following
<url>http://java.sun.com</url>
<url>http://java.sun.com:80</url>
<url>http://java.sun.com:80/</url>
<url>http://java.sun.com:80/a/b/c</url>
<url>d:\dir \dir_</url>
<url>\dir \dir_</url>

but not:

<url>http:/java.sun.com</url>
<url>http://java.sun..com</url>
<url>d:\dir@\dir</url>
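
The two lists above can be spot-checked outside a schema validator. This is a rough sketch that transcribes the pattern into a Java string literal and uses matches(), on the assumption that whole-string matching is close enough to how the xsd:pattern facet behaves; Java's regex dialect is not identical to XML Schema's, and the class name is invented for the example.

```java
import java.util.regex.Pattern;

class UrlPatternCheck {
    // The schema pattern from above, with backslashes doubled for a Java literal.
    static final Pattern URL_OR_PATH = Pattern.compile(
        "((\\w+://\\w+(\\.\\w+)+(:\\d+)?)?(/\\w?)*)|([a-zA-Z]:)?(\\\\[a-zA-Z0-9\\-_ ]*)*");

    // Whole-string match, approximating schema pattern validation.
    static boolean accepts(String input) {
        return URL_OR_PATH.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://java.sun.com",
            "http://java.sun.com:80/a/b/c",
            "d:\\dir \\dir_",
            "http:/java.sun.com",
            "http://java.sun..com",
            "d:\\dir@\\dir",
        };
        for (String s : samples) {
            System.out.println(accepts(s) + "  " + s);
        }
    }
}
```

Two properties of the pattern are worth knowing before relying on it: every part is optional, so the empty string is accepted, and (/\w?)* allows at most one word character per path segment, so a URL such as http://www.msn.com/mydir.txt is rejected.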
 

Author Comment

by:Taurus
ID: 9860216
I tested the pattern you gave on http://www.regexlib.com/RETester.aspx and http://home.tiscali.be/stevevh/tools/testRE.html.  It does not work as suggested on either of these.  I also tested it in XMLSPY, where it works for a couple of the cases I tried but incorrectly invalidates things like http://www.msn.com/mydir.txt and, I think, basically any relative path part (I have not tested many cases).  I guess I will not rely on the online testers, as they seem unreliable (which makes experimenting with not-so-simple REs impossible for the inexperienced).

Per your comment about schema validation not being the right way to go: yes, I suppose URIs can be many things.  My URIs, however, will be fairly constrained. They follow two basic forms, and what I want to check for are simple typos.  Ideally, per my post in the XML topic area, I'd like to be able to define an enumeration of patterns similar to the ones shown on http://www.cafesoft.com/support/tips/permission-resource-pattern-matching.html.

But for starters, I want to match on the following patterns (not written in any particular RE language) to verify form and check for basic typos:
*://*:*/*  //URI
*:\*\*  //PC path

examples to further clarify:
http:///msn.com  //invalid because of the "///"
http://msn..com //invalid because of the ".."
http://www.msn.com//mytext.txt //invalid because of the second "//" should be a single "/".
c:\myrelativepath\\mytext.txt //invalid because the "\\mytext.txt" should be "\mytext.txt".
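
One way to express those two forms, purely as a sketch: the patterns below are hypothetical and untested beyond the four examples above, the class name is invented, and they only check overall shape. They are not a complete URI or Windows path grammar.

```java
import java.util.regex.Pattern;

class TypoCheck {
    // Hypothetical shape check: scheme://host[:port][/segment...][/]
    static final Pattern URI = Pattern.compile(
        "\\w+://\\w+(\\.\\w+)*(:\\d+)?(/[\\w.]+)*/?");

    // Hypothetical shape check: drive:\segment[\segment...]
    static final Pattern PC_PATH = Pattern.compile(
        "[a-zA-Z]:(\\\\[\\w. ]+)+");

    // Accepts the input if it fits either form from start to end.
    static boolean isWellFormed(String input) {
        return URI.matcher(input).matches() || PC_PATH.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http:///msn.com",
            "http://msn..com",
            "http://www.msn.com//mytext.txt",
            "c:\\myrelativepath\\\\mytext.txt",
            "http://www.msn.com/mytext.txt",
            "c:\\myrelativepath\\mytext.txt",
        };
        for (String s : samples) {
            System.out.println(isWellFormed(s) + "  " + s);
        }
    }
}
```

Each separator must be followed by at least one allowed character, which is what rules out the doubled "/", "\" and "." in the four invalid examples. The character classes are deliberately narrow (no hyphens, for instance) and would need widening for real input.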

Why not do some basic validation with the schema since it will likely eliminate 50% of user input errors?  Why have the anyURI field at all if no validation is carried out?  

I am confounded that I cannot find a set of well-tested, well-specified, robust patterns for URI and path validation anywhere, it seems.  All I've been able to find thus far are sites like regexlib.com that only offer RE patterns written by whoever, with no formal specification or test harness attached to them.


 

Author Comment

by:Taurus
ID: 9860221
Above comment was for Savalou.
 
LVL 92

Accepted Solution

by:
objects earned 100 total points
ID: 9861519
Looking at it again, that JavaScript regex is only doing partial matching.

This should do the trick for you:

^\w+(\.\w+)+$
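
The effect of the anchors can be reproduced in Java by deliberately using find(), which on its own accepts substrings. This is only a sketch with an invented class name; the pattern is the one given above.

```java
import java.util.regex.Pattern;

class AnchoredHostCheck {
    static final Pattern ANCHORED = Pattern.compile("^\\w+(\\.\\w+)+$");

    // find() searches for a substring, but ^ and $ pin the match to both
    // ends of the input, so a partial hit like "java.sun" no longer counts.
    static boolean isValid(String input) {
        return ANCHORED.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(isValid("java.sun.com"));   // true
        System.out.println(isValid("java.sun..com"));  // false
        System.out.println(isValid("my-host.com"));    // false
    }
}
```

The last case is a reminder that \w covers only letters, digits and underscore, so names containing a hyphen are rejected by this pattern as written.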
 
LVL 3

Expert Comment

by:savalou
ID: 9861665
Yes, I know it doesn't work on http://home.tiscali.be/stevevh/tools/testRE.html.  Not all RE matching algorithms are created equal (though it may well be that they should be).

Anyway, if you want it to work on the tiscali one, you need ^ and $:
^((\w+://\w+(\.\w+)+(:\d+)?)?(/\w?)*)$|^([a-zA-Z]:)?(\\[a-zA-Z0-9\-_ ]*)*$

To validate filenames with extensions, you need to add a "\.".  I thought you'd figure that out yourself.  The same goes for any other symbols that can be part of filenames on your system:
<xsd:pattern value="((\w+://\w+(\.\w+)+(:\d+)?)?(/\w?)*)|([a-zA-Z]:)?(\\[a-zA-Z0-9\-\._ ]*)*"/>


I don't know what parser you are using (XML Spy?), but the regex I posted generates complaints about each of your four examples when I use the Xerces SAX parser.  

I'm afraid I can't do much more for you.
 
LVL 92

Expert Comment

by:objects
ID: 9861783
I thought we were matching hostnames. When did the requirements change?
 

Author Comment

by:Taurus
ID: 9861981
>when did the requirements change?
Well, in this particular post I started with a very simple example. I was just experimenting after my other (more encompassing) post in the XML topic area didn't get me very far (http://www.experts-exchange.com/Web/Web_Languages/XML/Q_20811314.html).  I haven't had much opportunity to work with REs (I did so a tiny bit several years ago as part of some Java scripting work).  Coming back to it in the context of writing REs for a schema validator is frustrating, especially when I can't find resources that are complete or robust (as is the case with the online validators).  Or, as savalou said, not all matching algorithms are created equal.

savalou,

About my not figuring out your pattern and adding a "\.": as objects said, this post was originally just intended to let me figure out how to validate a simple repeating pattern.  I've now seen about two dozen different, lengthy URI patterns, and I haven't the time to go through and understand each one until I find one that seems to work as advertised.  Hence I didn't spend time studying your pattern; I just tested it.  So might I ask, is this your pattern or does it originate from another source?  How well has it been tested?  Not to sound lazy, but pattern matching and parsing of URIs and paths feels so much like reinventing the wheel (which I try to avoid).
 
LVL 92

Expert Comment

by:objects
ID: 9862018
Well it makes it a little hard to give you exactly what you need if you don't tell us the full story :)

Anyway the RE I posted above should meet the requirements in this question.
 

Author Comment

by:Taurus
ID: 9862424
Objects, yes, what you gave seems to work with the ^ and $.  I expanded it to ^(\w+:\/\/)\w+(\.\w+)+(\/\w+).?\w+$  I have not tested it much, but I'm getting the idea.
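
A quick probe of that expanded pattern in Java shows what it currently accepts. This is a sketch with an invented class name; the pattern is copied as posted, with backslashes doubled for the Java literal.

```java
import java.util.regex.Pattern;

class ExpandedPatternProbe {
    static final Pattern EXPANDED = Pattern.compile(
        "^(\\w+:\\/\\/)\\w+(\\.\\w+)+(\\/\\w+).?\\w+$");

    // The anchors are part of the pattern, so find() checks the whole input.
    static boolean accepts(String input) {
        return EXPANDED.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(accepts("http://www.msn.com/mydir.txt"));    // true
        System.out.println(accepts("http://msn..com/mydir.txt"));       // false
        System.out.println(accepts("http://www.msn.com//mytext.txt"));  // false
        System.out.println(accepts("http://www.msn.com"));              // false
        System.out.println(accepts("http://www.msn.com/mydir@txt"));    // true
    }
}
```

The last two results point at things to tighten: the path part is mandatory, so a bare host is rejected, and the unescaped ".?" matches any single character rather than only a literal dot.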
 
LVL 92

Expert Comment

by:objects
ID: 9885090
