Solved

# Having problems with patterns and anyURI

Posted on 2003-11-28
I am having trouble defining a pattern such as "*://*:*/*" for a simple type based on anyURI.  Entering this pattern (which may not be properly constructed, as I'm new to this) causes the XMLSpy application I'm using to close unexpectedly.

Also, is there a good list somewhere on the web of patterns or regular expressions for anyURI?

XMLfile:

<?xml version="1.0"?>
<!-- edited with XMLSPY v2004 rel. 3 U (http://www.xmlspy.com) by B H(1VWC) -->
<simQueueDir xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="C:\nuers\po.xsd">http://mysim.com</simQueueDir>

XSD file:

<!-- edited with XMLSPY v2004 rel. 3 U (http://www.xmlspy.com) by BH :confused: (1VWC) -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- Stock Keeping Unit, a code for identifying products -->
  <xsd:element name="simQueueDir">
    <xsd:simpleType>
      <xsd:restriction base="xsd:anyURI">
        <xsd:pattern value="*://*:*/*"/>
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>
</xsd:schema>
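
A note on the pattern itself: "*://*:*/*" is written like a shell wildcard, but the pattern facet takes a regular expression, where `*` is a quantifier and needs something in front of it to repeat. The sketch below uses Python's `re` module as a stand-in for a schema validator (an assumption; XSD regex syntax is not identical to Python's) to show that the glob form is rejected while a hand-translated form compiles. Whether this is what makes XMLSpy close is not established here. The URL with a port is a made-up test value.

```python
import re

GLOB_STYLE = "*://*:*/*"        # the value used in the schema above
REGEX_STYLE = ".*://.*:.*/.*"   # hand translation: each wildcard * becomes .*


def compiles(pattern):
    """Return True if Python's regex engine accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


def matches(pattern, text):
    """Whole-string match, mimicking the implicit anchoring of an XSD pattern."""
    return re.fullmatch(pattern, text) is not None


glob_ok = compiles(GLOB_STYLE)       # False: a leading * has nothing to repeat
regex_ok = compiles(REGEX_STYLE)     # True

# The instance document's value has no port and no path after the host,
# so it does not fit scheme://host:port/path even with the translated pattern.
instance_ok = matches(REGEX_STYLE, "http://mysim.com")
with_port_ok = matches(REGEX_STYLE, "http://mysim.com:8080/queue")  # hypothetical value

print(glob_ok, regex_ok, instance_ok, with_port_ok)
```

So even with a valid expression, the sample value http://mysim.com would need an explicit port and path to pass a pattern of this shape.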

Question by:Taurus

LVL 26

Expert Comment

ID: 9843505
Here's one for Relative URL:

<xsd:simpleType name="RelativeURL">
  <xsd:annotation>
    <xsd:documentation>
      RelativeURL is a uriReference with no colon character before the first /, ? or #, if any (RFC2396).
    </xsd:documentation>
  </xsd:annotation>
  <xsd:restriction base="xsd:anyURI">
    <xsd:pattern value="[^:#/\?]*(:{0,0}|[#/\?].*)" />
  </xsd:restriction>
</xsd:simpleType>

Also, RegExLib.com has lots of RegEx's; here are some for URIs:

http://www.regexlib.com/DisplayPatterns.aspx?cattabindex=1&categoryId=2

However, more to your particular issue, I found this post on XML-Dev, written by Alexander Falk (of Altova), regarding XML Spy's handling of anyURI:

----- Forwarded message from Alexander Falk -----

This is the Regular Expression (RE) we originally used for the anyURI
dataype within our XML Spy product up until 4.0b2:

(([a-zA-Z][0-9a-zA-Z+\\-\\.]*:)?/{0,2}[0-9a-zA-Z;/?:@&=+$\\.\\-_!~*'()%]+)?(#[0-9a-zA-Z;/?:@&=+$\\.\\-_!~*'()%]+)?

It was constructed according to the BNF grammar given in RFC 2396
(http://www.ietf.org/rfc/rfc2396.txt) and we used this RE to validate
elements and attributes whose datatype was anyURI.

However, we've found that (a) many customers actually use illegal URIs in
their documents happily, (b) XML Schema Part 2
(http://www.w3.org/TR/xmlschema-2/#anyURI) doesn't require any validation of
the contents of the anyURI dataype, and (c) most customers don't want us to
validate stronger than what other processors are doing.

Therefore, we are currently eliminating the anyURI checking [...]

----- End Forwarded message from Alexander Falk -----

Regards,
Mike Sharp

Author Comment

ID: 9851437
I've seen the RE that Alexander posted.  It yields the error: This file is not well-formed: Name((Letter|'_'|':')(Name-Character)*) expected!.  Any ideas how to fix it?  What sorts of URIs does it match?  It is not easy to read.

LVL 26

Expert Comment

ID: 9851569
This is a guess, but I'd say the ampersands are causing the issue.  Change it to &amp; and you might be fine.

If I read it right, this part:
([a-zA-Z][0-9a-zA-Z+\\-\\.]*:)?/{0,2}
matches the protocol part, like http:// or ftp:// or //.  It says the protocol is optional; if present, its first character must be alphabetic, followed by any number of alphanumeric characters (or +, - and .) and then a colon.  After that come between zero and two forward slashes.

Then this part:
[0-9a-zA-Z;/?:@&=+$\\.\\-_!~*'()%]+)?
matches any sequence of one or more of those characters, up to but not including the "#".  This covers the rest of the URI, except for the fragment identifier (the part that starts with the #).  So it matches:

foo.com
foo.com/snafu?myparam=tarfu
/snafu?myparam=tarfu
203.044.001.2

etc.  It says there must be one or more characters, if present, but the whole thing is optional.  So it would seem to me that this is a valid URI by the regex:

http://

The last part:
(#[0-9a-zA-Z;/?:@&=+$\\.\\-_!~*'()%]+)?
which is entirely optional, if present must start with a "#" and consist of at least one more of the characters in the set.

But if you're using this regex in an XML document, you'll have to escape the ampersands so that the parser can get the correct character for the regex.  After parsing, an "&amp;" becomes just the ampersand character, which is what you want.

Regards,
Mike Sharp
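
The ampersand guess fits the reported error, since after a bare `&` an XML parser expects an entity name. Here is a small sketch of the escaping round trip using Python's standard library as the parser (an assumption; XMLSpy's parser is not involved). The regex is a shortened fragment of the character class from this thread, and the `pattern` element name is only for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

# Shortened fragment of the character class discussed above, for illustration only.
REGEX = "[0-9a-zA-Z;/?:@&=+$]+"

# quoteattr escapes & (plus < and >) and adds the surrounding quotes.
escaped_xml = f"<pattern value={quoteattr(REGEX)}/>"
raw_xml = f'<pattern value="{REGEX}"/>'  # ampersand left unescaped


def attribute_value(xml_text):
    """Parse the element and return its value attribute, or None if not well-formed."""
    try:
        return ET.fromstring(xml_text).get("value")
    except ET.ParseError:
        return None


print(escaped_xml)                    # contains &amp; in place of &
print(attribute_value(escaped_xml))   # the parser hands back the plain & again
print(attribute_value(raw_xml))       # None: the bare & makes the XML not well-formed
```

The regex engine only ever sees the parsed attribute value, so the `&amp;` in the file does not change what the pattern matches.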

Author Comment

ID: 9851782
I went to the regexlib.com site and used the tester with the given RE.  I don't know what is going wrong, but the expression seems to validate/match all kinds of obviously bogus URIs.  For example:

http:///www.msn.com
http:///w3ww.msn.com
\$er
4848\\-asdk

Any ideas?
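
One possible explanation, offered as an assumption about how the online tester works: if it searches for the expression anywhere in the input instead of requiring the whole input to match, then an expression in which every part is optional will "match" any text at all, because the empty string satisfies it. A schema pattern, by contrast, must match the entire value. The sketch below uses Python's `re` with the expression exactly as posted, doubled backslashes included, to compare the two modes.

```python
import re

# The expression exactly as posted in the thread, doubled backslashes included.
POSTED = (
    r"(([a-zA-Z][0-9a-zA-Z+\\-\\.]*:)?/{0,2}[0-9a-zA-Z;/?:@&=+$\\.\\-_!~*'()%]+)?"
    r"(#[0-9a-zA-Z;/?:@&=+$\\.\\-_!~*'()%]+)?"
)
pattern = re.compile(POSTED)


def found_somewhere(text):
    """Unanchored search: succeeds if any substring, even an empty one, matches."""
    return pattern.search(text) is not None


def matches_whole(text):
    """Anchored match: the entire string has to fit the expression."""
    return pattern.fullmatch(text) is not None


for sample in ["http:///www.msn.com", "not a uri", ""]:
    print(repr(sample), found_somewhere(sample), matches_whole(sample))
```

Anchoring rejects "not a uri" (the space is not in the character class), but http:///www.msn.com still passes as a whole: after the optional scheme and up to two slashes, "/" is itself one of the allowed characters. That looseness is in the expression, not in the tester.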

LVL 26

Expert Comment

ID: 9852458
Well, as Alexander mentioned, they stopped using it because people were successfully using URIs that didn't validate anyway.  Any single regex that validates all combinations of URI is going to be a monster, and in fact you may want to limit your URI to certain groups.  For example, you might want to allow http:// and https:// but not mailto: or urn:

I think you might find it easier to do an enumeration of patterns.  That way you can specify simpler individual regexes, one for each type of URI that you're trying to validate.

Regards,
Mike Sharp
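
To make the "enumeration of patterns" idea concrete: in XML Schema, several pattern facets inside the same restriction act as alternatives, so a value is valid if it matches any one of them. The Python sketch below imitates that with a list of small expressions. Both expressions and all test values are hypothetical and deliberately simple; they are not taken from any schema in this thread.

```python
import re

# Hypothetical, deliberately simple patterns: one per accepted form.
ALLOWED_PATTERNS = [
    r"https?://[A-Za-z0-9.-]+(:\d+)?(/.*)?",  # http or https, optional port and path
    r"[a-zA-Z]:\\.*",                          # absolute PC file path such as c:\dir\file
]


def is_allowed(value):
    """A value passes if ANY one pattern matches it in full (like multiple pattern facets)."""
    return any(re.fullmatch(p, value) for p in ALLOWED_PATTERNS)


for candidate in [
    "http://mysim.com",
    "https://mysim.com:8443/queue",
    r"c:\sim\queue",
    "mailto:someone@example.org",
    "urn:isbn:0451450523",
]:
    print(candidate, is_allowed(candidate))
```

Each expression stays short enough to read, and adding another accepted form means adding one more entry instead of growing a single large regex.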

Author Comment

ID: 9852978
I thought Alexander was saying that people were using illegal URIs, rather than "people were successfully using URIs that didn't validate"?  I assumed that the term "illegal" meant something other than not following the pattern.  Why would I want to use this RE when it doesn't appear to work even for obvious cases?  Further, why have an anyURI type if it is not going to do any validation?

I'd like to be able to define an enumeration of patterns similar to the ones shown on http://www.cafesoft.com/support/tips/permission-resource-pattern-matching.html.  That seems a lot more straightforward than the RE nonsense.  (REs seem very obfuscating and not intuitive, two characteristics that make them, in reality, a poor grammar for textual pattern matching, akin to assembler.)

All I wanted was a schema type that could validate, for structure, a URI and/or PC file path of the form *://*:*/* (any host, any port, and any URI), */*/* (relative URI), [a-zA-Z]:\*\* (PC file path), or *\* (relative PC file path), catching obvious nonconformities.  My experience thus far is that schema does a lot that is just not useful and little that is.  What I want seems like something that should already exist, and I shouldn't have to reinvent the wheel.
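
For what it's worth, the four wildcard forms listed above can be turned into regular expressions mechanically. This sketch uses Python's `fnmatch.translate` purely as an illustration of that translation (it is not an XSD tool, and the category names and test values are made up). It also shows how loose such wildcards are: a single value can satisfy more than one form.

```python
import fnmatch
import re

# The four wildcard forms from the post above; the labels are illustrative.
GLOBS = {
    "absolute URL": "*://*:*/*",
    "relative URI": "*/*/*",
    "PC file path": r"[a-zA-Z]:\*\*",
    "relative PC path": r"*\*",
}

# fnmatch.translate turns each wildcard into an anchored regular expression.
COMPILED = {name: re.compile(fnmatch.translate(glob)) for name, glob in GLOBS.items()}


def classify(value):
    """Return the names of every wildcard form the value satisfies."""
    return [name for name, rx in COMPILED.items() if rx.match(value)]


print(classify("http://mysim.com:8080/queue"))  # fits two forms at once
print(classify(r"c:\sim\queue"))
print(classify("http://mysim.com"))             # no port, so not an "absolute URL" here
```

Since `*` matches anything, including slashes and colons, these forms overlap: a full URL with a port also counts as a "relative URI". That ambiguity is one reason the regex versions in this thread end up more detailed than the wildcards.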

Author Comment

ID: 9853622
When I said "akin to assembler" above, I was speaking metaphorically: reading and understanding a regular expression compares to reading higher-level source code the way reading assembly language does.

LVL 3

Expert Comment

ID: 9858789
What you are trying to do is understandable, but I don't think schema validation is the right way to go.  The W3C document on datatypes (http://www.w3.org/TR/xmlschema-2/#anyURI) more or less says schema pattern validation of URIs is useless because URIs can be so many things.  Even in your circumstances, where you want a URL (more or less), so many patterns are valid that the "obvious" inconsistencies are not so obvious.  In most cases you would have to take the URI and do something with it in code anyway, right, like get the file or whatever.
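
As a rough illustration of the "do it in code" route, here is a minimal Python sketch that checks a value for a scheme, a host and a numeric port using the standard `urllib.parse.urlsplit`. The function name and the rule it enforces are assumptions made for the example, not requirements stated anywhere in this thread.

```python
from urllib.parse import urlsplit


def looks_like_host_port_url(value):
    """True when value has a scheme, a host, and a numeric port."""
    try:
        parts = urlsplit(value)
        port = parts.port  # raises ValueError for a non-numeric or out-of-range port
    except ValueError:
        return False
    return bool(parts.scheme) and parts.hostname is not None and port is not None


for candidate in [
    "http://java.sun.com:80/a/b/c",
    "http://java.sun.com",       # no port
    "http:/java.sun.com",        # single slash, so no host part
    "http://java.sun.com:abc/",  # port is not a number
]:
    print(candidate, looks_like_host_port_url(candidate))
```

This only inspects structure; it makes no attempt to check that the host name itself is sensible or reachable.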

Anyway, the following pattern

<xsd:pattern value="((\w+://\w+(\.\w+)+(:\d+)?)?(/\w?)*)|([a-zA-Z]:)?(\\[a-zA-Z0-9\-_ ]*)*"/>

approves the following
<url>http://java.sun.com</url>
<url>http://java.sun.com:80</url>
<url>http://java.sun.com:80/</url>
<url>http://java.sun.com:80/a/b/c</url>
<url>d:\dir \dir_</url>
<url>\dir \dir_</url>

but not:

<url>http:/java.sun.com</url>
<url>http://java.sun..com</url>
<url>d:\dir@\dir</url>
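
The lists above can be re-checked outside a schema validator. The sketch below runs the same expression through Python's `re.fullmatch`, which approximates the whole-value matching of an XSD pattern (an approximation only: the two regex dialects can differ in details such as what `\w` covers). The last test value, with a longer path segment, is an added hypothetical.

```python
import re

# The pattern value from the post above, copied verbatim.
PATTERN = r"((\w+://\w+(\.\w+)+(:\d+)?)?(/\w?)*)|([a-zA-Z]:)?(\\[a-zA-Z0-9\-_ ]*)*"

SHOULD_PASS = [
    "http://java.sun.com",
    "http://java.sun.com:80",
    "http://java.sun.com:80/",
    "http://java.sun.com:80/a/b/c",
    r"d:\dir \dir_",
    r"\dir \dir_",
]
SHOULD_FAIL = [
    "http:/java.sun.com",
    "http://java.sun..com",
    r"d:\dir@\dir",
]


def is_valid(value):
    """Whole-string match against the posted pattern."""
    return re.fullmatch(PATTERN, value) is not None


passed = [is_valid(v) for v in SHOULD_PASS]
failed = [is_valid(v) for v in SHOULD_FAIL]
print(all(passed), any(failed))

# Hypothetical extra case: a path segment longer than one character.
long_segment = is_valid("http://java.sun.com/docs")
print(long_segment)
```

One thing this brings out: the `(/\w?)*` part allows at most one character per path segment, so /a/b/c passes but a path like /docs does not.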

LVL 3

Accepted Solution

savalou earned 100 total points
ID: 9862200
I think I'll just answer here because the Java group is not very fun.

I made up the regex and tested it on the tiscali site first, and then with the following XML file:
<?xml version="1.0"?>
<x xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xmlns="http://example.org"
   xsi:schemaLocation="http://example.org
                       d:\develop\src\xsd\pattern.xsd">
  <url>http://java.sun.com</url>
  <url>http://java.sun.com:80</url>
  <url>http://java.sun.com:80/</url>
  <url>http://java.sun.com:80/a/b/c</url>
  <url>d:\dir \dir_</url>
  <url>\dir \di-r_</url>
  <url>http:/java.sun.com</url>
  <url>http://java.sun..com</url>
  <url>d:\dir@\dir</url>

  <url>http:///msn.com</url>
  <url>http://msn..com</url>
  <url>http://www.msn.com//mytext.txt</url>
  <url>c:\myrelativepath\\mytext.txt</url>
</x>

and the following schema:

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="http://example.org" elementFormDefault="qualified">
  <xsd:element name="x">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="url" maxOccurs="unbounded">
          <xsd:simpleType>
            <xsd:restriction base="xsd:string">
              <xsd:pattern value="((\w+://\w+(\.\w+)+(:\d+)?)?(/([a-zA-Z0-9\-_ ]*(\.\w+)?)?)*)|([a-zA-Z]:)?(\\[a-zA-Z0-9\-_ ]*(\.\w+)?)*"/>
            </xsd:restriction>
          </xsd:simpleType>
        </xsd:element>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>

As to adding the \., actually that will make the following
c:\de.ss\..s
valid, which I know is not what you want, but as I said, I'm about done helping on this.  It's not a very interesting problem.
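
For anyone wanting to see which of the test values the schema pattern lets through, here is a sketch that runs them through Python's `re.fullmatch`. This is an approximation of schema validation, not a substitute for it, and the results are what Python computes. The "adding the \." variant assumes that means putting `\.` inside the `[a-zA-Z0-9\-_ ]` character class; that reading is a guess.

```python
import re

# Pattern value from the schema above, copied verbatim.
PATTERN = (
    r"((\w+://\w+(\.\w+)+(:\d+)?)?(/([a-zA-Z0-9\-_ ]*(\.\w+)?)?)*)"
    r"|([a-zA-Z]:)?(\\[a-zA-Z0-9\-_ ]*(\.\w+)?)*"
)
# Assumed reading of "adding the \.": allow a dot inside the character class.
WITH_DOT = PATTERN.replace(r"[a-zA-Z0-9\-_ ]", r"[a-zA-Z0-9\-_ \.]")

URLS = [
    "http://java.sun.com",
    "http://java.sun.com:80",
    "http://java.sun.com:80/",
    "http://java.sun.com:80/a/b/c",
    r"d:\dir \dir_",
    r"\dir \di-r_",
    "http:/java.sun.com",
    "http://java.sun..com",
    r"d:\dir@\dir",
    "http:///msn.com",
    "http://msn..com",
    "http://www.msn.com//mytext.txt",
    r"c:\myrelativepath\\mytext.txt",
]


def is_valid(value, pattern=PATTERN):
    """Whole-string match, as an XSD pattern facet would require."""
    return re.fullmatch(pattern, value) is not None


results = {url: is_valid(url) for url in URLS}
for url, ok in results.items():
    print("valid  " if ok else "invalid", url)

dotted = r"c:\de.ss\..s"
print(is_valid(dotted), is_valid(dotted, WITH_DOT))
```

In this check the doubled separators (//mytext.txt and \\mytext.txt) are accepted, because an empty segment is allowed, while http:///msn.com and the ".." host names are rejected. The value c:\de.ss\..s fails against the pattern as posted and passes only once the dot is added to the character class.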

Author Comment

ID: 9862681
Yeah, I agree it is not at all interesting, but having validation before runtime is desirable.
As for running it on the tiscali site: it seems to work only intermittently, which is really beginning to aggravate me.  It must be some crap bug in IE or something!  I'll test it out in XMLSpy in a while.

LVL 26

Expert Comment

ID: 9862863
Does that regex work for URIs like this?

http://www3.foo24hours.com

Regards,
Mike Sharp
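
A quick way to try that host name, sketched with Python's `re.fullmatch` standing in for a schema validator (an approximation, on the assumption that `\w` covers ASCII letters and digits in both dialects). The pattern is the one from the accepted solution's schema.

```python
import re

# Pattern value from the accepted solution's schema, copied verbatim.
PATTERN = (
    r"((\w+://\w+(\.\w+)+(:\d+)?)?(/([a-zA-Z0-9\-_ ]*(\.\w+)?)?)*)"
    r"|([a-zA-Z]:)?(\\[a-zA-Z0-9\-_ ]*(\.\w+)?)*"
)

# Digits inside the host labels are covered by \w, so no special handling is needed.
digits_in_host = re.fullmatch(PATTERN, "http://www3.foo24hours.com") is not None
print(digits_in_host)
```

Under that assumption the URI passes, since each dot-separated label is matched by `\w+`, which takes digits as readily as letters.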
