Having a problem with patterns and anyURI

I am having trouble defining a pattern such as "*://*:*/*" per anyURI for a simple type.  Entering this pattern (although it may not be properly constructed as I'm new to this) is causing the XMLSpy application I'm using to close unexpectedly.    

Also, is there a good list somewhere on the web of patterns or regular expressions for anyURI?

XMLfile:

<?xml version="1.0"?>
<!-- edited with XMLSPY v2004 rel. 3 U (http://www.xmlspy.com) by B H(1VWC) -->
<simQueueDir xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="C:\nuers\po.xsd">http://mysim.com</simQueueDir>

XSD file:

<!-- edited with XMLSPY v2004 rel. 3 U (http://www.xmlspy.com) by BH :confused: (1VWC) -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      <!-- Stock Keeping Unit, a code for identifying products -->
      <xsd:element name="simQueueDir">
            <xsd:simpleType>
                  <xsd:restriction base="xsd:anyURI">
                        <xsd:pattern value="*://*:*/*"/>
                  </xsd:restriction>
            </xsd:simpleType>
      </xsd:element>
</xsd:schema>

Taurus Asked:

rdcpro Commented:
Here's one for Relative URL:


<xsd:simpleType name="RelativeURL">
  <xsd:annotation>
    <xsd:documentation>
      RelativeURL is a uriReference with no colon character before the first /, ? or #, if any(RFC2396).
    </xsd:documentation>
  </xsd:annotation>
  <xsd:restriction base="xsd:anyURI">
    <xsd:pattern value="[^:#/\?]*(:{0,0}|[#/\?].*)" />
  </xsd:restriction>
</xsd:simpleType>
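
As for the pattern in the question: in an XSD pattern, * is a quantifier rather than a wildcard, so "*://*:*/*" starts with a quantifier that has nothing to repeat and is not a valid regular expression. Below is a rough sketch of what it might look like as a regex, assuming each * was meant as "any run of characters" and that the port is numeric (both assumptions on my part):

```xml
<xsd:element name="simQueueDir">
  <xsd:simpleType>
    <xsd:restriction base="xsd:anyURI">
      <!-- scheme://host:port/path, where [^:/]+ stands in for the * wildcards (assumed intent) -->
      <xsd:pattern value="[^:/]+://[^:/]+:[0-9]+/.*"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:element>
```

Patterns are matched against the whole value, so no ^ or $ anchors are needed. Note that this literal translation requires a port and a slash after it, so the http://mysim.com value in the sample file would not validate; writing those parts as (:[0-9]+)? and (/.*)? would make them optional.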


Also, RegExLib.com has lots of RegEx's; here are some for URIs:

http://www.regexlib.com/DisplayPatterns.aspx?cattabindex=1&categoryId=2

However, more to your particular issue, I found this post on XML-Dev, written by Alexander Falk (of Altova) regarding XML Spy's handling of anyURI:

----- Forwarded message from Alexander Falk -----

This is the Regular Expression (RE) we originally used for the anyURI
datatype within our XML Spy product up until 4.0b2:

      
(([a-zA-Z][0-9a-zA-Z+\\-\\.]*:)?/{0,2}[0-9a-zA-Z;/?:@&=+$\\.\\-_!~*'()%]+)?(#[0-9a-zA-Z;/?:@&=+$\\.\\-_!~*'()%]+)?

It was constructed according to the BNF grammar given in RFC 2396
(http://www.ietf.org/rfc/rfc2396.txt) and we used this RE to validate
elements and attributes whose datatype was anyURI.

However, we've found that (a) many customers actually use illegal URIs in
their documents happily, (b) XML Schema Part 2
(http://www.w3.org/TR/xmlschema-2/#anyURI) doesn't require any validation of
the contents of the anyURI datatype, and (c) most customers don't want us to
validate stronger than what other processors are doing.

Therefore, we are currently eliminating the anyURI checking [...]

----- End Forwarded message from Alexander Falk -----



Regards,
Mike Sharp
Taurus (Author) Commented:
I've seen the RE that Alexander posted.  It yields the error: This file is not well-formed: Name((Letter|'_'|':')(Name-Character)*) expected!.  Any ideas how to fix it?  What sorts of URIs does it match?  It is not easy to read.
rdcpro Commented:
This is a guess, but I'd say the ampersands are causing the issue.  Change it to &amp; and you might be fine.  

If I read it right, this part:
([a-zA-Z][0-9a-zA-Z+\\-\\.]*:)?/{0,2}
matches the protocol (scheme) part, like http:// or ftp:// or //.  It says the protocol, which is optional, must start with an alphabetic character if present; that can be followed by any number of alphanumeric characters (or +, - and .) and then a colon.  After the optional protocol come between zero and two forward slashes.

Then this part:
[0-9a-zA-Z;/?:@&=+$\\.\\-_!~*'()%]+)?
matches any sequence of one or more of those characters, up to but not including the "#"

This covers the rest of the URI, except for the fragment identifier (the part that starts with the #).  So it matches:

foo.com
foo.com/snafu?myparam=tarfu
/snafu?myparam=tarfu
203.044.001.2
etc.

It says there must be one or more characters, if present, but the whole thing is optional.  So it would seem to me that this is a valid URI by the regex:

http://

The last part:
(#[0-9a-zA-Z;/?:@&=+$\\.\\-_!~*'()%]+)?
which is entirely optional, if present must start with a "#" and consist of at least one more of the characters in the set.


But if you're using this regex in an XML document, you'll have to escape the ampersands so that the parser can get the correct character for the regex.  After parsing, an "&amp;" becomes just the ampersand character, which is what you want.
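
To make that concrete, here is a sketch of how Alexander's expression might sit in a pattern facet. The type name LooseURI is made up for illustration, and I'm assuming the doubled backslashes in the post are string-literal escaping, so they are written singly here:

```xml
<xsd:simpleType name="LooseURI">
  <xsd:restriction base="xsd:anyURI">
    <!-- Each literal ampersand is written as &amp; so the attribute value stays well-formed -->
    <xsd:pattern value="(([a-zA-Z][0-9a-zA-Z+\-\.]*:)?/{0,2}[0-9a-zA-Z;/?:@&amp;=+$\.\-_!~*'()%]+)?(#[0-9a-zA-Z;/?:@&amp;=+$\.\-_!~*'()%]+)?"/>
  </xsd:restriction>
</xsd:simpleType>
```

With the ampersands escaped, the schema file should at least be well-formed; whether the pattern is strict enough for your purposes is a separate question.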

Regards,
Mike Sharp

Taurus (Author) Commented:
I went to the regexlib.com site and used the tester with the given RE.  I don't know what is going wrong but the expression seems to validate/match for all kinds of obviously bogus URIs.  For example:

http:///www.msn.com
http:///w3ww.msn.com
$er
4848\\-asdk

Any ideas?  
rdcpro Commented:
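Well, as Alexander mentioned, they stopped using it anyway because people were successfully using URIs that didn't validate.  Note also that every part of that expression is optional and its character sets are permissive ("/" and "$" are both allowed), so it accepts strings like http:///www.msn.com and $er outright, and a tester that looks for a match anywhere in the input, rather than against the whole string, will report a hit for almost anything.  Any single regex that validates all combinations of URI is going to be a monster, and in fact you may want to limit your URI to certain groups.  For example, you might want to allow http:// and https:// but not mailto: or urn: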

I think you might find it easier to do an enumeration of patterns.  That way you can specify simpler individual regexes, one for each type of URI that you're trying to validate.  
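
A sketch of what that could look like, with a made-up type name and deliberately simple patterns (the host and port rules here are my own simplification, not taken from any RFC). When several pattern facets appear in the same restriction, a value only has to match one of them:

```xml
<xsd:simpleType name="HttpOrHttpsURI">
  <xsd:restriction base="xsd:anyURI">
    <!-- Pattern facets in the same restriction step are alternatives (ORed together) -->
    <xsd:pattern value="http://[^/:]+(:[0-9]+)?(/.*)?"/>
    <xsd:pattern value="https://[^/:]+(:[0-9]+)?(/.*)?"/>
  </xsd:restriction>
</xsd:simpleType>
```

Each pattern has to match the entire value, so mailto: and urn: values are rejected simply because no pattern covers them. Further forms can be added as extra pattern lines without touching the existing ones.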

Regards,
Mike Sharp
Taurus (Author) Commented:
I thought Alexander was saying that people were using illegal URIs, rather than "people were successfully using URIs that didn't validate"?  I assumed that the term "illegal" meant something other than not following the pattern.  Why would I want to use this RE that doesn't appear to function even for obvious cases?  Further, why have an AnyURI type if one is not going to do any validation?  

I'd like to be able to define an enumeration of patterns similar to the ones shown on http://www.cafesoft.com/support/tips/permission-resource-pattern-matching.html.  That seems a lot more straightforward than the RE nonsense.  (REs seem obfuscated and unintuitive, two characteristics that make them, in reality, a poor grammar for textual pattern matching (akin to assembler).)

All I wanted was a schema type that could validate the structure of a URI and/or PC file path of the form *://*:*/* (any host, any port, and with any URI), */*/* (relative URI), [a-zA-Z]:\*\* (PC file path), *\* (relative PC file path) and catch obvious nonconformities.  My experience thus far is that schema does a lot that is just not useful and little that is.  What I want seems like something that should already exist, and I shouldn't have to re-invent the wheel.
Taurus (Author) Commented:
When I stated above "akin to assembler" I was metaphorically speaking, such as for example, trying to compare reading and understanding assembly language code to reading and understanding the source code of a higher level language.
savalou Commented:
What you are trying to do is understandable but I don't think using schema validation is the right way to go.  The W3 document on data types (http://www.w3.org/TR/xmlschema-2/#anyURI) more or less says schema pattern validation of URIs is useless because URIs can be so many things.  Even in your circumstances, where you want a URL (more or less), so many patterns are valid that the "obvious" inconsistencies are not so obvious.  In most cases you would have to take the URI and do something with it in code anyway, right, like get the file or whatever.

Anyway, the following pattern

<xsd:pattern value="((\w+://\w+(\.\w+)+(:\d+)?)?(/\w?)*)|([a-zA-Z]:)?(\\[a-zA-Z0-9\-_ ]*)*"/>

approves the following
<url>http://java.sun.com</url>
<url>http://java.sun.com:80</url>
<url>http://java.sun.com:80/</url>
<url>http://java.sun.com:80/a/b/c</url>
<url>d:\dir \dir_</url>
<url>\dir \dir_</url>

but not:

<url>http:/java.sun.com</url>
<url>http://java.sun..com</url>
<url>d:\dir@\dir</url>
savalou Commented:
I think I'll just answer here because the Java group is not very fun.

I made up the regex and tested it on the tiscali site first and then with the following xml file
<?xml version="1.0"?>
<x xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://example.org"
xsi:schemaLocation="http://example.org
d:\develop\src\xsd\pattern.xsd">
<url>http://java.sun.com</url>
<url>http://java.sun.com:80</url>
<url>http://java.sun.com:80/</url>
<url>http://java.sun.com:80/a/b/c</url>
<url>d:\dir \dir_</url>
<url>\dir \di-r_</url>
<url>http:/java.sun.com</url>
<url>http://java.sun..com</url>
<url>d:\dir@\dir</url>

<url>http:///msn.com</url>
<url>http://msn..com</url>
<url>http://www.msn.com//mytext.txt</url>
<url>c:\myrelativepath\\mytext.txt</url>
</x>

and the following schema:

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
targetNamespace="http://example.org" elementFormDefault="qualified">
<xsd:element name="x">
  <xsd:complexType>
  <xsd:sequence>
    <xsd:element name="url" maxOccurs="unbounded">
         <xsd:simpleType>
              <xsd:restriction base="xsd:string">
                   <xsd:pattern value="((\w+://\w+(\.\w+)+(:\d+)?)?(/([a-zA-Z0-9\-_ ]*(\.\w+)?)?)*)|([a-zA-Z]:)?(\\[a-zA-Z0-9\-_ ]*(\.\w+)?)*"/>
              </xsd:restriction>
         </xsd:simpleType>
    </xsd:element>
  </xsd:sequence>
  </xsd:complexType>
</xsd:element>
</xsd:schema>

As to adding the \., actually that will make the following
c:\de.ss\..s
valid, which I know is not what you want, but as I said, I'm about done helping on this.  It's not a very interesting problem.
Taurus (Author) Commented:
Yeah I agree it is not at all interesting but having validation before runtime is desirable.
As for running it on the tiscali site: it seems to work intermittently, which is really beginning to aggravate me.  It must be some crap bug in IE or something!  I'll test it out in a while in XMLSpy.
rdcpro Commented:
Does that regex work for URIs like:

http://www3.foo24hours.com

Regards,
Mike Sharp