Solved

having problem with patterns and anyURI

Posted on 2003-11-28
Last Modified: 2013-11-19
I am having trouble defining a pattern such as "*://*:*/*" on anyURI for a simple type. Entering this pattern (which may not be properly constructed, as I'm new to this) causes the XMLSpy application I'm using to close unexpectedly.

Also, is there a good list somewhere on the web of patterns or regular expressions for anyURI?

XMLfile:

<?xml version="1.0"?>
<!-- edited with XMLSPY v2004 rel. 3 U (http://www.xmlspy.com) by B H(1VWC) -->
<simQueueDir xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="C:\nuers\po.xsd">http://mysim.com</simQueueDir>

XSD file:

<!-- edited with XMLSPY v2004 rel. 3 U (http://www.xmlspy.com) by BH (1VWC) -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      <!-- Stock Keeping Unit, a code for identifying products -->
      <xsd:element name="simQueueDir">
            <xsd:simpleType>
                  <xsd:restriction base="xsd:anyURI">
                        <xsd:pattern value="*://*:*/*"/>
                  </xsd:restriction>
            </xsd:simpleType>
      </xsd:element>
</xsd:schema>

Question by:Taurus
11 Comments
 
LVL 26

Expert Comment

by:rdcpro
ID: 9843505
Here's one for Relative URL:


<xsd:simpleType name="RelativeURL">
  <xsd:annotation>
    <xsd:documentation>
      RelativeURL is a uriReference with no colon character before the first /, ? or #, if any(RFC2396).
    </xsd:documentation>
  </xsd:annotation>
  <xsd:restriction base="xsd:anyURI">
    <xsd:pattern value="[^:#/\?]*(:{0,0}|[#/\?].*)" />
  </xsd:restriction>
</xsd:simpleType>
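
As a rough illustration of how this type might be used, the sketch below declares a hypothetical `link` element (the element name is an assumption, not part of the original post) and notes which values the pattern is expected to accept. It assumes the schema has no target namespace, so the unprefixed type name resolves.

```xml
<!-- Hypothetical element; only the RelativeURL type comes from the post above. -->
<xsd:element name="link" type="RelativeURL"/>

<!-- Expected to validate: no colon before the first /, ? or # -->
<!--   <link>docs/page.html#top</link> -->

<!-- Expected to fail: a colon appears before the first / -->
<!--   <link>http://example.org/page.html</link> -->
```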


Also, RegExLib.com has lots of RegEx's; here are some for URIs:

http://www.regexlib.com/DisplayPatterns.aspx?cattabindex=1&categoryId=2

However, more to your particular issue, I found this post on XML-Dev, written by Alexander Falk (of Altova), regarding XML Spy's handling of anyURI:

----- Forwarded message from Alexander Falk -----

This is the Regular Expression (RE) we originally used for the anyURI
datatype within our XML Spy product up until 4.0b2:

      
(([a-zA-Z][0-9a-zA-Z+\\-\\.]*:)?/{0,2}[0-9a-zA-Z;/?:@&=+$\\.\\-_!~*'()%]+)?(#[0-9a-zA-Z;/?:@&=+$\\.\\-_!~*'()%]+)?

It was constructed according to the BNF grammar given in RFC 2396
(http://www.ietf.org/rfc/rfc2396.txt) and we used this RE to validate
elements and attributes whose datatype was anyURI.

However, we've found that (a) many customers actually use illegal URIs in
their documents happily, (b) XML Schema Part 2
(http://www.w3.org/TR/xmlschema-2/#anyURI) doesn't require any validation of
the contents of the anyURI datatype, and (c) most customers don't want us to
validate stronger than what other processors are doing.

Therefore, we are currently eliminating the anyURI checking [...]

----- End Forwarded message from Alexander Falk -----



Regards,
Mike Sharp
 

Author Comment

by:Taurus
ID: 9851437
I've seen the RE that Alexander posted.  It yields the error: This file is not well-formed: Name((Letter|'_'|':')(Name-Character)*) expected!.  Any ideas how to fix it?  What sorts of URI's does it match to?  It is not easy to read.
 
LVL 26

Expert Comment

by:rdcpro
ID: 9851569
This is a guess, but I'd say the ampersands are causing the issue.  Change it to &amp; and you might be fine.  

If I read it right, this part:
([a-zA-Z][0-9a-zA-Z+\\-\\.]*:)?/{0,2}
matches the protocol (scheme) part, like http:// or ftp:// or //.  It says that the protocol, which is optional, must start with an alphabetic character if present, may continue with any number of alphanumeric characters (plus "+", "-" and "."), and ends with a colon.  That is followed by between zero and two forward slashes.

Then this part:
[0-9a-zA-Z;/?:@&=+$\\.\\-_!~*'()%]+)?
matches any sequence of one or more of those characters, up to but not including the "#"

This covers the rest of the URI, except for the fragment identifier (the part starting with the #).  So it matches:

foo.com
foo.com/snafu?myparam=tarfu
/snafu?myparam=tarfu
203.044.001.2
etc.

It says there must be one or more characters, if present, but the whole thing is optional.  So it would seem to me that this is a valid URI by the regex:

http://

The last part:
(#[0-9a-zA-Z;/?:@&=+$\\.\\-_!~*'()%]+)?
which is entirely optional, if present must start with a "#" and consist of at least one more of the characters in the set.


But if you're using this regex in an XML document, you'll have to escape the ampersands so that the parser can get the correct character for the regex.  After parsing, an "&amp;" becomes just the ampersand character, which is what you want.
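
A minimal sketch of how the expression might look inside a schema, with the ampersands escaped. The type name `LooseURI` is hypothetical, and the sketch assumes the doubled backslashes in the forwarded message were string-literal escaping, so they are written here as single backslashes. Note that an XSD pattern must match the whole value, unlike a typical online tester that searches for a match anywhere in the input.

```xml
<!-- Hypothetical type name; the expression is the one quoted above,
     with & written as &amp; and \\ reduced to \ (an assumption). -->
<xsd:simpleType name="LooseURI">
  <xsd:restriction base="xsd:anyURI">
    <xsd:pattern value="(([a-zA-Z][0-9a-zA-Z+\-\.]*:)?/{0,2}[0-9a-zA-Z;/?:@&amp;=+$\.\-_!~*'()%]+)?(#[0-9a-zA-Z;/?:@&amp;=+$\.\-_!~*'()%]+)?"/>
  </xsd:restriction>
</xsd:simpleType>
```

Because both groups are optional and the main character class includes "/" and ":", this stays very permissive even when the whole value has to match.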

Regards,
Mike Sharp
 

Author Comment

by:Taurus
ID: 9851782
I went to the regexlib.com site and used the tester with the given RE.  I don't know what is going wrong but the expression seems to validate/match for all kinds of obviously bogus URI's.  For example:

http:///www.msn.com
http:///w3ww.msn.com
$er
4848\\-asdk

Any ideas?  
 
LVL 26

Expert Comment

by:rdcpro
ID: 9852458
Well, as Alexander mentioned, they stopped using it because people were happily using URIs that didn't validate.  Any single regex that validates all combinations of URI is going to be a monster, and in fact you may want to limit your URI to certain groups.  For example, you might want to allow http:// and https:// but not mailto: or urn:

I think you might find it easier to do an enumeration of patterns.  That way you can specify simpler individual regexes, one for each type of URI that you're trying to validate.  
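
One way this could look, as a sketch only: several pattern facets in the same restriction are treated as alternatives, so a value is accepted if it matches any one of them. The type name `WebURI` and the exact expressions are assumptions for illustration, not a complete URI grammar.

```xml
<!-- Hypothetical type: accepts http and https URIs only. -->
<xsd:simpleType name="WebURI">
  <xsd:restriction base="xsd:anyURI">
    <!-- http://host[:port][/path] -->
    <xsd:pattern value="http://[^/:?#]+(:[0-9]+)?(/.*)?"/>
    <!-- https://host[:port][/path] -->
    <xsd:pattern value="https://[^/:?#]+(:[0-9]+)?(/.*)?"/>
  </xsd:restriction>
</xsd:simpleType>
```

With this sketch, mailto: and urn: values fail because neither facet matches them. A query string directly after the host (with no slash) is also rejected, which may or may not be what you want.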

Regards,
Mike Sharp

 

Author Comment

by:Taurus
ID: 9852978
I thought Alexander was saying that people were using illegal URIs, rather than "people were successfully using URIs that didn't validate"?  I assumed that the term "illegal" meant something other than not following the pattern.  Why would I want to use this RE that doesn't appear to function even for obvious cases?  Further, why have an anyURI type if one is not going to do any validation?

I'd like to be able to define an enumeration of patterns similar to the ones shown on http://www.cafesoft.com/support/tips/permission-resource-pattern-matching.html.  That seems a lot more straightforward than the RE nonsense.  (REs seem very obfuscating and not intuitive, two characteristics that make them, in reality, a poor grammar for textual pattern matching, akin to assembler.)

All I wanted was a schema type that could check the structure of a URI and/or PC file path for obvious nonconformities, covering the forms *://*:*/* (any host, any port, and any URI), */*/* (relative URI), [a-zA-Z]:\*\* (PC file path) and *\* (relative PC file path).  My experience thus far is that schema does a lot that is just not useful and little that is.  What I'm wanting seems like something that should already exist, and I shouldn't have to re-invent the wheel.
 

Author Comment

by:Taurus
ID: 9853622
When I stated above "akin to assembler" I was metaphorically speaking, such as for example, trying to compare reading and understanding assembly language code to reading and understanding the source code of a higher level language.
 
LVL 3

Expert Comment

by:savalou
ID: 9858789
What you are trying to do is understandable, but I don't think schema validation is the right way to go.  The W3C document on datatypes (http://www.w3.org/TR/xmlschema-2/#anyURI) more or less says that schema pattern validation of URIs is useless because URIs can be so many things.  Even in your circumstances, where you want a URL (more or less), so many patterns are valid that the "obvious" inconsistencies are not so obvious.  In most cases you would have to take the URI and do something with it in code anyway, like fetch the file.

Anyway, the following pattern

<xsd:pattern value="((\w+://\w+(\.\w+)+(:\d+)?)?(/\w?)*)|([a-zA-Z]:)?(\\[a-zA-Z0-9\-_ ]*)*"/>

approves the following
<url>http://java.sun.com</url>
<url>http://java.sun.com:80</url>
<url>http://java.sun.com:80/</url>
<url>http://java.sun.com:80/a/b/c</url>
<url>d:\dir \dir_</url>
<url>\dir \dir_</url>

but not:

<url>http:/java.sun.com</url>
<url>http://java.sun..com</url>
<url>d:\dir@\dir</url>
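
For comparison, the four wildcard shapes from the question ("*://*:*/*", "*/*/*", "[a-zA-Z]:\*\*" and "*\*") could be written as one pattern facet each, since "*" on its own is a quantifier in a regular expression, not a wildcard. This is only a sketch: the type name `UriOrPath` is hypothetical, each "*" is assumed to mean one or more characters other than ":", "/" and "\", and xsd:string is used as the base because backslash file paths are not URIs in the RFC 2396 sense.

```xml
<!-- Hypothetical type; facets in one restriction act as alternatives. -->
<xsd:simpleType name="UriOrPath">
  <xsd:restriction base="xsd:string">
    <!-- *://*:*/*  scheme, host, numeric port, then anything after the slash -->
    <xsd:pattern value="[^:/\\]+://[^:/\\]+:[0-9]+/.*"/>
    <!-- */*/*  relative URI with at least one slash -->
    <xsd:pattern value="[^:/\\]+(/[^:/\\]+)+"/>
    <!-- [a-zA-Z]:\*\*  drive letter followed by backslash-separated names -->
    <xsd:pattern value="[a-zA-Z]:(\\[^:/\\]+)+"/>
    <!-- *\*  relative path with at least one backslash -->
    <xsd:pattern value="[^:/\\]+(\\[^:/\\]+)+"/>
  </xsd:restriction>
</xsd:simpleType>
```

Read literally, the first shape requires both a port and a trailing slash, so a value such as http://mysim.com would be rejected by this sketch.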
 
LVL 3

Accepted Solution

by:
savalou earned 100 total points
ID: 9862200
I think I'll just answer here because the Java group is not very fun.

I made up the regex and tested it on the tiscali site first and then with the following xml file
<?xml version="1.0"?>
<x xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://example.org"
xsi:schemaLocation="http://example.org
d:\develop\src\xsd\pattern.xsd">
<url>http://java.sun.com</url>
<url>http://java.sun.com:80</url>
<url>http://java.sun.com:80/</url>
<url>http://java.sun.com:80/a/b/c</url>
<url>d:\dir \dir_</url>
<url>\dir \di-r_</url>
<url>http:/java.sun.com</url>
<url>http://java.sun..com</url>
<url>d:\dir@\dir</url>

<url>http:///msn.com</url>
<url>http://msn..com</url>
<url>http://www.msn.com//mytext.txt</url>
<url>c:\myrelativepath\\mytext.txt</url>
</x>

and the following schema:

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
targetNamespace="http://example.org" elementFormDefault="qualified">
<xsd:element name="x">
  <xsd:complexType>
  <xsd:sequence>
    <xsd:element name="url" maxOccurs="unbounded">
         <xsd:simpleType>
              <xsd:restriction base="xsd:string">
                   <xsd:pattern value="((\w+://\w+(\.\w+)+(:\d+)?)?(/([a-zA-Z0-9\-_ ]*(\.\w+)?)?)*)|([a-zA-Z]:)?(\\[a-zA-Z0-9\-_ ]*(\.\w+)?)*"/>
              </xsd:restriction>
         </xsd:simpleType>
    </xsd:element>
  </xsd:sequence>
  </xsd:complexType>
</xsd:element>
</xsd:schema>

As to adding the \., actually that will make the following
c:\de.ss\..s
valid, which I know is not what you want, but as I said, I'm about done helping on this.  It's not a very interesting problem.
 

Author Comment

by:Taurus
ID: 9862681
Yeah, I agree it is not at all interesting, but having validation before runtime is desirable.
As for running it on the tiscali site: it seems to work intermittently, which is really beginning to aggravate me.  It must be some crap bug in IE or something!  I'll test it out in a while in XMLSpy.
 
LVL 26

Expert Comment

by:rdcpro
ID: 9862863
Does that regex work for URI's like:

http://www3.foo24hours.com

Regards,
Mike Sharp
