dlearman1 asked:

What is a valid regex that matches standard English text, including quotes (single and double, both straight and curly), when the regex itself must be delimited by outer quotes?

I'm working on client-side validation for an HTML form, and having trouble with regex. Specifically, using quotes inside a quoted string:


pattern="/^[\"?\w\s#\$%\^&\*’";:\?,\._-]+$/gim"


I thought that escaping the first double quote would work, but it still won't accept quotes inside the outer quotes. Also, it looks like I have a mix of straight and curly quote styles. The leading quote is optional in case the entire text is a quote.


How can I modify the regex string to match quotes (single and double, both straight and curly) when the regex must be delimited by outer quotes?

skullnobrains

if it is javascript, you do not need the quotes at all.

as a general rule, you would rely on \w and language aware regexp
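
To illustrate the point about JavaScript: a regex literal is delimited by slashes, so straight and curly quotes can sit in the character class without any escaping. This is only a minimal sketch; the name `quotedText` and the sample strings are made up for illustration.

```javascript
// Regex literal: slashes delimit the pattern, so no quote needs escaping.
// The class allows word characters, whitespace, and the six quote characters
// " ' “ ” ‘ ’ (straight and curly).
const quotedText = /^[\w\s"'“”‘’]+$/;

const straight = quotedText.test(`"It's fine"`);   // true
const curly = quotedText.test("“It’s fine”");       // true
const rejected = quotedText.test("<b>");            // false: < and > are not in the class
console.log(straight, curly, rejected);
```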
You need to move the surrounding quotes outside of the character class.  You also do not need the i flag as \w matches uppercase and lowercase letters.  However, since you have quotes in the character class, you don't need to specify the surrounding quotes explicitly.  This should work for the characters you want to match.  However, if you do have "curly" quotes, that is a different character and you must also include that in the character class.
/^[\w\s#\$%\^&\*’";:\?,\._-]+$/gsm



If you want to match all English text and punctuation/special characters plus spacing, you could do the below.  It matches carriage returns, newlines, whitespace, and all ASCII characters between exclamation and tilde (which includes all letters, numbers, and punctuation/special chars).  Again, if you need "curly" quotes, you need to add those.
/^[\x0a\x0d\s!-~]+$/gsm
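
As a quick check of what that range covers, here is a sketch in JavaScript (assumed here because the pattern ends up in a browser). `!-~` spans 0x21 through 0x7E, so it includes both straight quotes, but also characters such as `<`, `>`, `{` and `\`; curly quotes fall outside it. Since `\s` already matches newline and carriage return, the sketch leaves out `\x0a\x0d`. The name `printableAscii` is hypothetical.

```javascript
const printableAscii = /^[\s!-~]+$/;

// Straight quotes are inside the range (" is 0x22, ' is 0x27).
const straightOk = printableAscii.test(`He said "it's done".`);  // true
// Curly quotes (U+2018 to U+201D) are outside it.
const curlyOk = printableAscii.test("He said “done”.");            // false
// Angle brackets, braces, pipe and backslash are inside the range too.
const markupOk = printableAscii.test("<b>{x}|\\");                 // true
// \s covers line breaks without listing \x0a and \x0d.
const newlineOk = printableAscii.test("line one\r\nline two");     // true
```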


dlearman1 (ASKER)

skullnobrains... This usage is an HTML string. I believe quotes are required.

wilcoxon... "You need to move the surrounding quotes outside of the character class." If you mean the enclosing quotes in "/... gsm", they are required by HTML. That is why I need to escape, or otherwise handle, the double quotes inside the regex.

"You need to move the surrounding quotes outside of the character class." I don't follow what you are saying here.

I'm not really expecting any curly quotes, so this was mostly just curiosity on my part. Your regex, /^[\x0a\x0d\s!-~]+$/gsm, is obviously expertly minimized and superior to mine. But I think I will add carriage return and newline to my own definition instead, because a year from now I would not be able to interpret yours. Still, your regex does solve my competing-quotes problem because it doesn't contain explicit quote characters. If there really is no way to escape the quotes in the regex, then your solution might be best.
 
I haven't done JavaScript in a long time so I don't remember what it requires for regex (or coding generally beyond being procedural with "weird" object oriented aspects).

I thought you were saying that you needed the regex to check for a matching string with surrounding quotes.  I think what you meant was the regex itself needs quotes around it due to JavaScript syntax.

I think your issue was that you had a second double quote in your character class that was not escaped.  I removed it from this (no other changes from your code besides removing the unnecessary i flag).  Does this work for you?  I also notice that the only single quote in it is a "curly" quote and not a normal apostrophe (so you may need to add that).
pattern="/^[\"?\w\s#\$%\^&\*’;:\?,\._-]+$/gm"


I just refreshed my memory on a couple things and your character class should work without adding \x0a\x0d as \s should already include newline and carriage return.  Using the m flag though will make ^$ anchor each line so the regex would be true if one line of the input text matches.  I think you need to use one of these:
pattern="/\A[\"?\w\s#\$%\^&\*’;:\?,\._-]+\Z/gsm"


or possibly (less sure this will work properly):
pattern="/^[\"?\w\s#\$%\^&\*’;:\?,\._-]+$/gs"
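
The effect of the m flag can be checked directly in JavaScript. This sketch uses a deliberately tiny class (`[a-z ]`) so the result is easy to follow; the variable names are illustrative. Note that JavaScript regexes do not support `\A` and `\Z`; without the m flag, `^` and `$` already anchor to the whole string there.

```javascript
const input = "good line\nBAD <line>";

// With m, ^ and $ also match at line breaks, so one valid line is enough.
const perLine = /^[a-z ]+$/m.test(input);     // true
// Without m, ^ and $ anchor to the start and end of the whole input.
const wholeInput = /^[a-z ]+$/.test(input);   // false
// Adding \n to the class lets multi-line input pass as a whole.
const multiLineOk = /^[a-z \n]+$/.test("good line\nalso good");  // true
```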


I'm just starting to look at your last comment, but I thought I would clarify my use.  The regex string we are working on is used in HTML not Javascript.  HTML has several built-in form validation attributes. I am trying to implement the pattern attribute. This involves two basic HTML steps:

1. I specify a regular expression defining a pattern that the user-entered data needs to match by setting the pattern attribute to a regex string: pattern="some regular expression". In this application, I'm looking for a regex that will match normal conversational English.

2. At runtime the browser tests the user's input against the regex specified in step 1. If it matches, the field is valid. If it does not, validity.patternMismatch is set to true, the user receives the browser's standard error message, and the form cannot be submitted.

For my purpose, the regex pattern must be applied to the entirety of the user input (a maximum of 500 characters). The regex search cannot, for example, return true if only one line of the input text matches. To avoid false mismatches, the test regex pattern must be inclusive of all standard keyboard characters, including single and double quotes. Hopefully it will not match characters like <>{}[]|\ or unusual syntax like %%%//%7&.
The user-supplied text may still contain unwanted elements, namely phone numbers, email addresses and URLs. I'm using JavaScript to search for these specific elements in follow-up regex tests.

My current pattern is: pattern="[\"?\w\s#\$%\^&\*"';:\?,\._-]+/gm"

I believe that when this string is specified under HTML's Constraint Validation API, it is evaluated as if ^(?: were added at the start of the pattern and )$ at the end, so it is automatically applied to the entire user input text
(see https://developer.mozilla.org/en-US/docs/Web/HTML/Attributes/pattern for a discussion).

So I'm thinking my pattern should be: pattern="[\"?\w\s#\$%\^&\*\"';:\?,\._-]+"
and the actual pattern applied by the API is: ^(?:[\"?\w\s#\$%\^&\*\"';:\?,\._-]+)$
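
A rough JavaScript emulation of the wrapping described above, as a sketch. `matchesPattern` is a hypothetical helper, not a browser API; it assumes the anchoring described on the MDN page and uses the `u` flag (newer browsers may compile the attribute with the stricter `v` flag instead). Trailing flags such as `/gm` are not part of the attribute syntax, so they would be read as literal characters.

```javascript
// Hypothetical helper: approximates how the pattern attribute is applied.
function matchesPattern(patternSource, value) {
  // The whole value must match, as if wrapped in ^(?: ... )$.
  const re = new RegExp(`^(?:${patternSource})$`, "u");
  return re.test(value);
}

const whole = matchesPattern("[a-z ]+", "hello world");      // true
const partial = matchesPattern("[a-z ]+", "hello <world>");  // false
// "/gm" in the attribute is literal text, so the value itself would have to end in /gm.
const literalFlags = matchesPattern("[a-z]+/gm", "abc");     // false
const literalFlagsMatch = matchesPattern("[a-z]+/gm", "abc/gm");  // true
```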

If you agree, then we're left with how to handle quotes. Unless the quotes inside the character set are escaped, the HTML parser will confuse them with the enclosing quotes and report a syntax error. I tried to escape them using \" but this didn't resolve the syntax error. Right now, this is the main problem I need to solve. Other than that, it seems I need to handle both straight and curly double and single quotes, since both styles can be keyboard inputs. Note: I included the leading quote in the regex pattern to allow for the edge case where the user input starts with a quotation.
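
HTML has no backslash escapes, so `\"` still ends a double-quoted attribute value at the quote. The sketch below imitates that in JavaScript with a deliberately naive attribute reader (`readDoubleQuotedAttr` is made up for illustration and is not how a real parser is implemented), then shows two spellings that survive: the `&quot;` character reference, which HTML decodes before the regex is compiled, and the regex escape `\x22`.

```javascript
// Naive stand-in for an HTML parser: the value runs to the next " character.
function readDoubleQuotedAttr(html, name) {
  const m = html.match(new RegExp(`${name}="([^"]*)"`));
  if (m === null) return null;
  // Decode only the &quot; reference; a real parser handles all references.
  return m[1].replace(/&quot;/g, '"');
}

const broken = readDoubleQuotedAttr(String.raw`<input pattern="[\"\w\s]+">`, "pattern");
const viaEntity = readDoubleQuotedAttr(String.raw`<input pattern="[&quot;\w\s]+">`, "pattern");
const viaHex = readDoubleQuotedAttr(String.raw`<input pattern="[\x22\w\s]+">`, "pattern");

console.log(broken);     // [\   (cut off at the quote)
console.log(viaEntity);  // ["\w\s]+
console.log(viaHex);     // [\x22\w\s]+

// Both surviving spellings accept a double quote in the input.
const entityMatches = new RegExp(`^(?:${viaEntity})$`, "u").test('say "hi"');  // true
const hexMatches = new RegExp(`^(?:${viaHex})$`, "u").test('say "hi"');        // true
```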

Sorry, I'm throwing a lot of words at you. Let me know if I'm overdoing it.









The enclosing double quotes are required by HTML.



 
Last time I used HTML, it was HTML 4.2 or XHTML.  I wasn't aware of the pattern validation available in HTML 5.  However, based on some reading, it looks much more restrictive than JavaScript or other full regex implementations.

The quote problem is the same one I mentioned in my previous reply - you have both the first escaped quote and a later unescaped quote.  The question mark after the escaped quote inside the char class does nothing (it is a literal question mark as is the later escaped one).  I would try these regex.
pattern="(?s)[\"\w\s#\$%\^&\*\';:\?,\._-]+"
pattern="[\"\w\s#\$%\^&\*\';:\?,\._-]+"
pattern="(?s)[\x22\w\s#\$%\^&\*\';:\?,\._-]+"
pattern="[\x22\w\s#\$%\^&\*\';:\?,\._-]+"


If you want to clean up the regex some, VERY few characters need to be escaped inside a character class...  so $ ^ * ? . should all work without the preceding backslash.
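
As a sketch of that cleanup, the class below drops the unneeded backslashes and spells the straight quotes as `\x22` and `\x27`, so the source is safe inside either kind of HTML attribute quote. It is compiled with the `u` flag, where an unnecessary escape such as `\'` is a syntax error rather than being ignored. The variable names are illustrative, and the character set is simply the one from the patterns above.

```javascript
// Inside a class: ^ is literal when not first, - is literal when last,
// and $ * ? . need no backslash.
const source = String.raw`[\x22\x27\w\s#$%^&*;:?,._-]+`;
const re = new RegExp(`^(?:${source})$`, "u");

const accepted = re.test(`"What's this?", he asked.`);  // true
const refused = re.test("a|b");                          // false: | is not in the class

// In unicode mode an identity escape like \' is rejected outright.
let strictError = null;
try {
  new RegExp(String.raw`[\']`, "u");
} catch (e) {
  strictError = e;  // a SyntaxError
}
```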
you can use the regular html entities such as &quot; or possibly escape sequences such as &#xWHATEVER;
other ways include using old character classes such as [[:punct:]] for all punctuation or [[:quot:]]. you need to check what the regexp engine supports.
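If you want to match a literal entity string such as &quot; in the user's text, it must NOT go inside your character class (a class matches one character at a time); use alternation instead. An entity written in the pattern attribute itself is decoded by HTML before the regex is compiled, so there it acts as the plain character.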

I can't find anything definitive but I highly doubt HTML 5 pattern validation supports POSIX bracket expressions (JavaScript does not).
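
For what it is worth, this is how a POSIX-style bracket expression behaves in JavaScript's default (non-`u`) mode. It is a small sketch with an illustrative name; the pattern is accepted, but it does not mean "punctuation".

```javascript
// Without the u flag, JavaScript reads [[:punct:]] as the class [[:punct:]
// (the characters [ : p u n c t) followed by a literal ].
const posixLookalike = /^[[:punct:]]$/;

const bang = posixLookalike.test("!");   // false: "!" is not in the class
const odd = posixLookalike.test("p]");   // true: "p" from the class, then "]"
```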
ASKER CERTIFIED SOLUTION
skullnobrains

Thanks for your help. Here is what I'm currently using. It seems to be working fine, and it gets around the quotes-inside-quotes problem by using Unicode escapes.

 pattern="[\w\s#\$%\^&\*()\u201C\u201D\u2018\u2019\u0022\u0027\u0060\u00B4;:=+\?,\._-]+"  
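
A quick way to try this pattern outside the browser, as a sketch: the source is copied from the attribute above, wrapped in `^(?:` and `)$`, and compiled with the `u` flag. The names `finalSource` and `finalRe` are illustrative. Newer browsers may compile the attribute with the `v` flag, which has stricter rules for characters such as `(`, `)` and `-` inside a class; that is not tested here.

```javascript
const finalSource = String.raw`[\w\s#\$%\^&\*()\u201C\u201D\u2018\u2019\u0022\u0027\u0060\u00B4;:=+\?,\._-]+`;
const finalRe = new RegExp(`^(?:${finalSource})$`, "u");

const curlyQuotes = finalRe.test("She said “yes” and 'no'.");  // true
const straightQuotes = finalRe.test(`"quoted" text`);           // true
const markup = finalRe.test("<script>");                         // false: < and > are not in the class
const exclamation = finalRe.test("Hello!");                      // false: "!" is not in the class
```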
thanks for sharing. you probably only need to bother with double quotes, so you may only need to convert these and possibly <>&
I'm confused by the chosen solution.  What does skullnobrains suggest that I had not previously suggested?  Your current/final regex appears to be using \u codes which are effectively the same as the \x codes I suggested earlier.