William Peck
asked on
need help with REGEXP_SUBSTR or REGEXP_INSTR
I need some help understanding REGEXP_INSTR / SUBSTR. I got excellent assistance here, and now I want to learn how to do it myself . . .
The pattern I'm looking for is this: M6T6W6R6, where the numbers are interspersed among the letters
- M = Mon, T = Tues, R = Thurs, etc. The numbers are class periods (1 - 10).
So I want to flag the above as "bad".
but these are good, because the numbers (periods) are not mixed in among the letters (days)
- MR567
- MTRF123456
- MF123
if this can be done, give me a tiny hint and then see if I can get it.
The pattern I'm looking for is this: M6T6W6R6, where the numbers are interspersed among the letters
- M = Mon, T = Tues, R = Thurs, etc. The numbers are class periods (1 - 10).
So I want to flag the above as "bad".
but these are good, because the numbers (periods) are not mixed in among the letters (days)
- MR567
- MTRF123456
- MF123
if this can be done, give me a tiny hint and then see if I can get it.
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
note, my notes were for identification of "good" records via something like regexp_like(.......)
to find bad records you would do NOT regexp_like(.......)
to find bad records you would do NOT regexp_like(.......)
if you want to find the specific pattern of repeated letter/number pairs, you could do that too, but differently.
Either looking for letter,number,letter,numbe r or letter,number 2 or more times
Either looking for letter,number,letter,numbe
ASKER
well, look at that . . .
SELECT DISTINCT a, Regexp_instr(a,'(^[MWTRFS] +[0-9]+$)' ) AS Valid_Mtg_Time
FROM tab1
WHERE INSTR(a,',') = 0
AND a IN ('M5T34R56','WF567','MWF')
results
great !
SELECT DISTINCT a, Regexp_instr(a,'(^[MWTRFS]
FROM tab1
WHERE INSTR(a,',') = 0
AND a IN ('M5T34R56','WF567','MWF')
results
WF567 1
MWF 0
M5T34R56 0
great !
ASKER
sdstuber,
thanks for the other tips.
I need to study the book to see if I can see where they're telling me what you indicated. I just played around with previous pattern matches you you and slightwv posted here, until it worked. So I want to make sure I can turn to the reference and find the solution.
thanks for the other tips.
I need to study the book to see if I can see where they're telling me what you indicated. I just played around with previous pattern matches you you and slightwv posted here, until it worked. So I want to make sure I can turn to the reference and find the solution.
appendix D of the 11gR2 SQL reference covers the Oracle regexp syntax.
http://docs.oracle.com/cd/E11882_01/server.112/e41084/ap_posix.htm#SQLRF020
it's a shorter read than other books, and it's oracle specific.
http://docs.oracle.com/cd/E11882_01/server.112/e41084/ap_posix.htm#SQLRF020
it's a shorter read than other books, and it's oracle specific.
ASKER
sdstuber,
the Appendix D link I don't think has what I'm looking for . . . I didn't see anything actually describing how to use REGEXP.
I found this which is closer to what I was looking for, although I would never have been able to solve this problem by reading these pages.
In Jason Price's "Oracle Database 11g SQL", he says
" '^' matches the beginning of a string ... and $ matches the end position of a string" whereas you said
" '^' means the beginning of the line, '$' means the end of the line"
- is that the same ?
Either way, I would have never been able to figure this out on my own - it doesn't "click".
the Appendix D link I don't think has what I'm looking for . . . I didn't see anything actually describing how to use REGEXP.
I found this which is closer to what I was looking for, although I would never have been able to solve this problem by reading these pages.
In Jason Price's "Oracle Database 11g SQL", he says
" '^' matches the beginning of a string ... and $ matches the end position of a string" whereas you said
" '^' means the beginning of the line, '$' means the end of the line"
- is that the same ?
Either way, I would have never been able to figure this out on my own - it doesn't "click".
SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
thanks a lot. I will review this in more detail tomorrow, and sleep on it tonight . . .
ASKER
sdstuber,
thanks again, very helpful.
>>Here's another way... rather than searching for a string of exactly one group of letters followed by numbers you could have searched for the SECOND instance of a letter/number group. For a legal string you wouldn't find one, for the illegal strings you were looking for you would find them.
-- good idea, but I couldn't get that to work, see below.
>>The "trick" really, is in defining the pattern well. Not the regexp, but the actual in-words description of what you are looking for.
-- good point
-- so it works, but a couple of questions
1. The pattern I'm looking for is "The string starts with (^) one or more letters ( [MTRF] ) then those are followed by one or more numbers ( [0-9] ) and then the string ends ($).", so this is defined by ^[MWTRFS]+[0-9]+$
- What if I wanted to make this legal: M5T34R56 (numbers follow letters), but MWF is illegal ?
2. Why does this (Regexp_instr(a,'F5') return 2, where a = 'F5' ?
3. Is a string denoted by a comma ? So to look for a comma, you backslash it ?
And say I wanted to make MWF5T34R56 legal ? The pattern is "any letter (or multiples of) which is M,T,W,R,F,S followed by any number of 1,2,3,4,5,6,7,8,9,10" and this pattern can repeat ?
So only MWF above is illegal since it's not followed by numbers, but MWF1 is good.
thanks again, very helpful.
>>Here's another way... rather than searching for a string of exactly one group of letters followed by numbers you could have searched for the SECOND instance of a letter/number group. For a legal string you wouldn't find one, for the illegal strings you were looking for you would find them.
-- good idea, but I couldn't get that to work, see below.
>>The "trick" really, is in defining the pattern well. Not the regexp, but the actual in-words description of what you are looking for.
-- good point
-- so it works, but a couple of questions
1. The pattern I'm looking for is "The string starts with (^) one or more letters ( [MTRF] ) then those are followed by one or more numbers ( [0-9] ) and then the string ends ($).", so this is defined by ^[MWTRFS]+[0-9]+$
- What if I wanted to make this legal: M5T34R56 (numbers follow letters), but MWF is illegal ?
2. Why does this (Regexp_instr(a,'F5') return 2, where a = 'F5' ?
3. Is a string denoted by a comma ? So to look for a comma, you backslash it ?
SELECT DISTINCT a, Regexp_instr(a,'(^[MWTRFS]+[0-9]+$)') AS "Pattern_Exists_Once",
Regexp_instr(a,'F5') AS F5_found
FROM tab1
WHERE INSTR(a,',') = 0
AND a IN ('M5T34R56','WF567','MWF','F123')
Legal F5_Found
WF567 1 2
MWF 0 0
F123 1 0
M5T34R56 0 0
And say I wanted to make MWF5T34R56 legal ? The pattern is "any letter (or multiples of) which is M,T,W,R,F,S followed by any number of 1,2,3,4,5,6,7,8,9,10" and this pattern can repeat ?
So only MWF above is illegal since it's not followed by numbers, but MWF1 is good.
SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
sdstuber,
thanks again.
>>AND a IN ('M5T34R56','WF567','MWF', 'F123') - that condition alone means your data won't have any commas, the INSTR isn't really needed since it's redundant with the IN clause.
- right, I was first limiting myself to no commas with INSTR(seti.mtg_time,',') = 0 then just further limited to the IN clause
>>I get 1 when I run this... SELECT REGEXP_INSTR(a, 'F5') FROM (SELECT 'F5' a FROM DUAL)
-- but this returns 2: SELECT REGEXP_INSTR(a, 'F5') FROM (SELECT 'WF567' a FROM DUAL)
>>since you want a repeating pattern, wrap the entire repeatable part in parentheses and then use + to indicate it will be found 1 or more times.
-- great ! that's really helpful
-- but this also returned legal (1)
REGEXP_INSTR(a, '([MWTRFS]+[0-9]+)') AS "Pattern_Exists_Multiple"
whereas you have an extra "+"
REGEXP_INSTR(a, '([MTWRFS]+[0-9]+)+') AS is_legal
thanks again.
>>AND a IN ('M5T34R56','WF567','MWF',
- right, I was first limiting myself to no commas with INSTR(seti.mtg_time,',') = 0 then just further limited to the IN clause
>>I get 1 when I run this... SELECT REGEXP_INSTR(a, 'F5') FROM (SELECT 'F5' a FROM DUAL)
-- but this returns 2: SELECT REGEXP_INSTR(a, 'F5') FROM (SELECT 'WF567' a FROM DUAL)
>>since you want a repeating pattern, wrap the entire repeatable part in parentheses and then use + to indicate it will be found 1 or more times.
-- great ! that's really helpful
-- but this also returned legal (1)
REGEXP_INSTR(a, '([MWTRFS]+[0-9]+)') AS "Pattern_Exists_Multiple"
whereas you have an extra "+"
REGEXP_INSTR(a, '([MTWRFS]+[0-9]+)+') AS is_legal
>>> but this returns 2: SELECT REGEXP_INSTR(a, 'F5') FROM (SELECT 'WF567' a FROM DUAL)
of course, F5 is found at the 2nd character of WF567
INSTR isn't a boolean function of 0/1 it returns the position of the string if it finds it, 0 if not
"Pattern_Exists_Multiple" - that's a bad name, the INSTR looks for the first time that expression is found, which is at character 1, the fact that there are repeats is meaningless
REGEXP_INSTR(a, '([MTWRFS]+[0-9]+)+') -- this checks explicitly for repeats.
to be even more explicit, add the ^$ to indicate the pattern must be the entire string.
if not
then '############MTRWF123@@@@@ @@@@@@@@@' would be a legal string because a legal pattern can be found inside that string
your original question has been answered, please close this one and open new questions rather than adding on here.
of course, F5 is found at the 2nd character of WF567
INSTR isn't a boolean function of 0/1 it returns the position of the string if it finds it, 0 if not
"Pattern_Exists_Multiple" - that's a bad name, the INSTR looks for the first time that expression is found, which is at character 1, the fact that there are repeats is meaningless
REGEXP_INSTR(a, '([MTWRFS]+[0-9]+)+') -- this checks explicitly for repeats.
to be even more explicit, add the ^$ to indicate the pattern must be the entire string.
if not
then '############MTRWF123@@@@@
your original question has been answered, please close this one and open new questions rather than adding on here.
ASKER
>>of course, F5 is found at the 2nd character of WF567
- oops, I was focusing on the 1 for the entire pattern and had a brain freeze on that . . .
>>your original question has been answered, please close this one and open new questions rather than adding on here.
-- sure
- oops, I was focusing on the 1 for the entire pattern and had a brain freeze on that . . .
>>your original question has been answered, please close this one and open new questions rather than adding on here.
-- sure
ASKER