having problem with patterns and anyURI

I am having trouble defining a pattern such as "*://*:*/*" on a simple type that restricts anyURI. Entering this pattern (which may not be properly constructed, as I'm new to this) causes the XMLSpy application I'm using to close unexpectedly.

Also, is there a good list, on the web somewhere, of patterns or regular expressions for anyURI?

XMLfile:

<?xml version="1.0"?>

<simQueueDir xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="C:\nuers\po.xsd">http://mysim.com</simQueueDir>

XSD file:


<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      
      <xsd:element name="simQueueDir">
            <xsd:simpleType>
                  <xsd:restriction base="xsd:anyURI">
                        <xsd:pattern value="*://*:*/*"/>
                  </xsd:restriction>
            </xsd:simpleType>
      </xsd:element>
</xsd:schema>

rdcpro

Here's one for Relative URL:

<xsd:simpleType name="RelativeURL">
<xsd:annotation>
<xsd:documentation>
RelativeURL is a uriReference with no colon character before the first /, ? or #, if any (RFC 2396).
</xsd:documentation>
</xsd:annotation>
<xsd:restriction base="xsd:anyURI">
<xsd:pattern value="[^:#/\?]*(:{0,0}|[#/\?].*)" />
</xsd:restriction>
</xsd:simpleType>

Also, RegExLib.com has lots of regexes; here are some for URIs:

http://www.regexlib.com/DisplayPatterns.aspx?cattabindex=1&categoryId=2

However, more to your particular issue, I found this post on XML-Dev, written by Alexander Falk (of Altova), regarding XML Spy's handling of anyURI:

----- Forwarded message from Alexander Falk -----

This is the Regular Expression (RE) we originally used for the anyURI
datatype within our XML Spy product up until 4.0b2:

(([a-zA-Z][0-9a-zA-Z+\\-\\.]*:)?/{0,2}[0-9a-zA-Z;/?:@&=+$\\.\\-_!~*'()%]+)?(#[0-9a-zA-Z;/?:@&=+$\\.\\-_!~*'()%]+)?

It was constructed according to the BNF grammar given in RFC 2396
(http://www.ietf.org/rfc/rfc2396.txt) and we used this RE to validate
elements and attributes whose datatype was anyURI.

However, we've found that (a) many customers actually use illegal URIs in
their documents happily, (b) XML Schema Part 2
(http://www.w3.org/TR/xmlschema-2/#anyURI) doesn't require any validation of
the contents of the anyURI datatype, and (c) most customers don't want us to
validate stronger than what other processors are doing.

Therefore, we are currently eliminating the anyURI checking [...]

----- End Forwarded message from Alexander Falk -----

Regards,
Mike Sharp

Taurus

ASKER

I've seen the RE that Alexander posted. It yields the error: This file is not well-formed: Name((Letter|'_'|':')(Name-Character)*) expected!. Any ideas how to fix it? What sorts of URIs does it match? It is not easy to read.

rdcpro

This is a guess, but I'd say the ampersands are causing the issue. Change each one to &amp; and you might be fine.

If I read it right, this part:
([a-zA-Z][0-9a-zA-Z+\\-\\.]*:)?/{0,2}
matches the protocol (scheme) part, like http:// or ftp:// or //. It says the protocol is optional; if present, its first character must be alpha, followed by any number of alphanumeric characters (or +, - and .) and then a colon. After that come between zero and two forward slashes.

Then this part:
[0-9a-zA-Z;/?:@&=+$\\.\\-_!~*'()%]+)?
matches any sequence of one or more of those characters, up to but not including the "#"

This covers the rest of the URI, except for the fragment identifier (the part that starts with the #). So it matches:

foo.com
foo.com/snafu?myparam=tarfu
/snafu?myparam=tarfu
203.044.001.2
etc.

It says there must be one or more characters, if present, but the whole thing is optional. So it would seem to me that this is a valid URI by the regex:

http://

The last part:
(#[0-9a-zA-Z;/?:@&=+$\\.\\-_!~*'()%]+)?
which is entirely optional, if present must start with a "#" and consist of at least one more of the characters in the set.

But if you're using this regex in an XML document, you'll have to escape the ampersands so that the parser can get the correct character for the regex. After parsing, an "&amp;" becomes just the ampersand character, which is what you want.
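
To make the escaping concrete, here is a minimal sketch of how the forwarded expression might sit inside a schema once the ampersands are escaped. The type name LegacyAnyURI is made up for illustration. Apart from the two &amp; escapes, the expression is copied exactly as posted (doubled backslashes included), so treat it as an untested starting point rather than a known-good pattern.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- "LegacyAnyURI" is a hypothetical name, not something XML Spy defines -->
  <xsd:simpleType name="LegacyAnyURI">
    <xsd:restriction base="xsd:anyURI">
      <!-- Each literal ampersand in the regex is written as &amp; so the
           attribute value stays well-formed; the parser hands the schema
           processor a plain ampersand character. -->
      <xsd:pattern value="(([a-zA-Z][0-9a-zA-Z+\\-\\.]*:)?/{0,2}[0-9a-zA-Z;/?:@&amp;=+$\\.\\-_!~*'()%]+)?(#[0-9a-zA-Z;/?:@&amp;=+$\\.\\-_!~*'()%]+)?"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:schema>
```

The apostrophe inside the expression needs no escaping here because the attribute value is delimited by double quotes.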

Regards,
Mike Sharp

Taurus

ASKER

I went to the regexlib.com site and used the tester with the given RE. I don't know what is going wrong, but the expression seems to validate/match all kinds of obviously bogus URIs. For example:

http:///www.msn.com
http:///w3ww.msn.com
$er
4848\\-asdk

Any ideas?

rdcpro

Well, as Alexander mentioned, they stopped using it anyway, because people were successfully using URIs that didn't validate. Any single regex that validates all combinations of URI is going to be a monster, and in fact you may want to limit your URIs to certain groups. For example, you might want to allow http:// and https:// but not mailto: or urn:

I think you might find it easier to do an enumeration of patterns. That way you can specify simpler individual regexes, one for each type of URI that you're trying to validate.
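
As a rough sketch of that idea: when a single restriction carries several pattern facets, a value is accepted if it matches any one of them, so each facet can stay simple. The type name WebURL and both expressions below are illustrative assumptions, not anything taken from XML Spy.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- "WebURL" is a hypothetical type name used only for this sketch -->
  <xsd:simpleType name="WebURL">
    <xsd:restriction base="xsd:anyURI">
      <!-- Sibling pattern facets in one restriction are alternatives (OR). -->
      <!-- http://host, optional :port, optional /path -->
      <xsd:pattern value="http://[^/:]+(:[0-9]+)?(/.*)?"/>
      <!-- the same shape for https -->
      <xsd:pattern value="https://[^/:]+(:[0-9]+)?(/.*)?"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:schema>
```

With these two facets, a value such as http://mysim.com should pass, while a mailto: or urn: value matches neither facet and is rejected.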

Regards,
Mike Sharp

Taurus

ASKER

I thought Alexander was saying that people were using illegal URIs, rather than "people were successfully using URIs that didn't validate"? I assumed that the term "illegal" meant something other than not following the pattern. Why would I want to use this RE that doesn't appear to function even for obvious cases? Further, why have an anyURI type if one is not going to do any validation?

I'd like to be able to define an enumeration of patterns similar to the ones shown on http://www.cafesoft.com/support/tips/permission-resource-pattern-matching.html. That seems a lot more straightforward than the RE nonsense. (REs seem very obfuscating and not intuitive, two characteristics that make them, in reality, a poor grammar for textual pattern matching, akin to assembler.)

All I wanted was a schema type that could validate the structure of a URI and/or PC file path for obvious nonconformities, covering the forms *://*:*/* (any host, any port, and any URI), */*/* (relative URI), [a-zA-Z]:\*\* (PC file path) and *\* (relative PC file path). My experience thus far is that schema does a lot that is just not useful and little that is. What I'm wanting seems like something that should already exist, and I shouldn't have to re-invent the wheel.
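
For reference, the four wildcard shapes above can be approximated in schema regex syntax. There, * is a quantifier ("zero or more of the preceding item") rather than a wildcard, so a pattern cannot begin with it; "any run of characters" is usually written .* or as a negated class such as [^/]+. The sketch below is one possible translation. The type name ResourceLocation is invented, and xsd:string is used as the base purely as an assumption, to sidestep any processor-specific checking of backslashes in anyURI values.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- "ResourceLocation" is a hypothetical name; base xsd:string is an assumption -->
  <xsd:simpleType name="ResourceLocation">
    <xsd:restriction base="xsd:string">
      <!-- *://*:*/*      scheme, host, numeric port, then any path -->
      <xsd:pattern value="[^:/\\]+://[^:/\\]+:[0-9]+/.*"/>
      <!-- */*/*          relative URI with at least two slashes -->
      <xsd:pattern value="[^:/\\]+/[^:/\\]+/[^:\\]*"/>
      <!-- [a-zA-Z]:\*\*  absolute PC file path, drive letter first -->
      <xsd:pattern value="[a-zA-Z]:\\[^:/\\]+\\[^:/]*"/>
      <!-- *\*            relative PC file path -->
      <xsd:pattern value="[^:/\\]+\\[^:/]*"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:schema>
```

Note that a value with no port and no path, such as http://mysim.com, does not fit the first shape as literally written; the port and path parts would have to be made optional, for example with (:[0-9]+)?(/.*)?, for it to be accepted.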

Taurus

ASKER

When I stated above "akin to assembler" I was metaphorically speaking, such as for example, trying to compare reading and understanding assembly language code to reading and understanding the source code of a higher level language.

savalou

What you are trying to do is understandable, but I don't think using schema validation is the right way to go. The W3C document on datatypes (http://www.w3.org/TR/xmlschema-2/#anyURI) more or less says schema pattern validation of URIs is useless because URIs can be so many things. Even in your circumstances, where you want a URL (more or less), so many patterns are valid that the "obvious" inconsistencies are not so obvious. In most cases you would have to take the URI and do something with it in code anyway, right, like get the file or whatever.

Anyway, the following pattern

<xsd:pattern value="((\w+://\w+(\.\w+)+(:\d+)?)?(/\w*)*)|([a-zA-Z]:)?(\\[a-zA-Z0-9\-_ ]*)*"/>

approves the following
<url>http://java.sun.com</url>
<url>http://java.sun.com:80</url>
<url>http://java.sun.com:80/</url>
<url>http://java.sun.com:80/a/b/c</url>
<url>d:\dir \dir_</url>
<url>\dir \dir_</url>

but not:

<url>http:/java.sun.com</url>
<url>http://java.sun..com</url>
<url>d:\dir@\dir</url>
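
To try the examples above against a validator, the pattern can be dropped into a small stand-alone schema along these lines. This is only a sketch: the element name url comes from the examples, but the base type xsd:string is an assumption, and the path part is written (/\w*)* so that path segments longer than one character are accepted.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- Element name "url" matches the sample instances; base type is assumed -->
  <xsd:element name="url">
    <xsd:simpleType>
      <xsd:restriction base="xsd:string">
        <!-- First alternative: optional scheme://host(:port), then a path.
             Second alternative: optional drive letter, then backslash segments. -->
        <xsd:pattern value="((\w+://\w+(\.\w+)+(:\d+)?)?(/\w*)*)|([a-zA-Z]:)?(\\[a-zA-Z0-9\-_ ]*)*"/>
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>
</xsd:schema>
```

One thing to keep in mind when testing: in schema regexes \w covers letters and digits but excludes punctuation, so a hypothetical host such as my-site.com (hyphen) or a path segment containing an underscore would be rejected by this pattern.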

ASKER CERTIFIED SOLUTION

savalou

This solution is only available to members.

Taurus

ASKER

Yeah, I agree it is not at all interesting, but having validation before runtime is desirable.
As for running it on the tascali site: it seems to work only intermittently, which is really beginning to aggravate me. It must be some crap bug in IE or something! I'll test it out in a while in XMLSpy.

rdcpro

Does that regex work for URIs like:

http://www3.foo24hours.com

Regards,
Mike Sharp