Link to home
Start Free TrialLog in
Avatar of William Peck
William PeckFlag for United States of America

asked on

need Sql trick to strip out non-standard text in Oracle

I need to strip out "BY ARRANGEMENT" from text, but it's spelled differently :-(

Here are the variations
F2,T56,BY ARRANGEM
M5,T56,BY ARRANGEM
M6,T56,BY ARRANGEM
MTWR1256,FBYARRANG
MWF3,BYARRANGEMENT
MW6,BY ARRANGMENT
T56,BYARRANGEMENT
T56,F2,BYARRANGEME
TWF7,BY ARRANGEMN
TWF7,BYARRANGEMENT
TR34,BY ARRANGEMEN
TR56,BY ARRANGEMEN
TR10,BYARRANGEMENT
TR8,BYARRANGEMENT
TR9,BYARRANGEMENT
WF5,BYARRANGEMENT
T56,BYARRANGEMENT
BY ARRANGEMENT,F1
T56,BYARRANGEMENT
W6,BYARRANGEMENT


BY ARRANGEMENT
F2,BY ARRANGEMENT
F3,BY ARRANGEMENT
F4,BY ARRANGEMENT
M3, BY ARRANGEMENT
M5,BY ARRANGEMENT
M6,BY ARRANGEMENT
MF3,BY ARRANGEMENT
MF7,BY ARRANGEMENT
MW2,BY ARRANGEMENT
MW6,BY ARRANGEMENT
R12,BY ARRANGEMENT
R34,BY ARRANGEMENT
R56,BY ARRANGEMENT
T2,BY ARRANGEMENT
T34,BY ARRANGEMENT
T5,BY ARRANGEMENT
TF7,BY ARRANGEMENT
W2,BY ARRANGEMENT
W5,BY ARRANGEMENT
W7,BY ARRANGEMENT
WF2,BY ARRANGEMENT
WF3,BY ARRANGEMENT
T34,BY ARRANGEMENT
F1,BY ARRANGEMENT
T34,BY ARRANGEMENT
Avatar of slightwv (䄆 Netminder)
slightwv (䄆 Netminder)

Interesting...

I think I'm close but need clarification.

What is the expected output from:
 MTWR1256,FBYARRANG
Is it always the last string in the comma delimited list?

if so, try this

regexp_replace(str,'[^,]+$')

and if you need to remove the trailing comma when there is one try this

regexp_replace(str,',?[^,]+$')
SOLUTION
Avatar of slightwv (䄆 Netminder)
slightwv (䄆 Netminder)

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
ASKER CERTIFIED SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
To add to my last post:  It might work until they misspell 'BY'...

It will fail on something like:
MI6,BI ARRANGEMENT
MI6,B Y ARRANGEMENT
I didn't see the "BY ARRANGEMT,F1"  - thanks

my first post would have handled the "BY ARRANGEMENT"  though.

My second post does them all (at least as far as I can tell)
Avatar of William Peck

ASKER

ok, I'm working on it, thanks very much !

this is an outlier: MTWR1256,FBYARRANG - that's "Friday By Arrangement" but breaks the rules . . . :-(
>>select regexp_replace(col1,'([ ,]+)?BY[^,]+([ ,]+)?') from tab1;
- Whoa ! looks really good, checking now . . .
Forget mine for now.

Look at what sdstuber posted in  http:#a40354505

I think that is more what you want and is much cleaner than mine.
sdstuber, thanks ! works perfectly !!!!

---
slightwv, you had all but the FBYArrangement, but that was slick too !
Since the part you want seems to have more rigid construction rules, it's a lot easier to search for that

here's a slight improvement on my last post, this explictly checks for the pattern at the beginning of the string and also correctly checks patterns like   M2W3 (bad)  vs M2,W3 (good)

REGEXP_SUBSTR(str, '^([MWTRF]+[0-9]+,)+([MWTRF]+[0-9]+)?')     -- keeping the trailing comma


REGEXP_SUBSTR(str, '^([MWTRF]+[0-9]+,)*[MWTRF]+[0-9]+')    -- removing the trailing comma
Thanks, my boss will be impressed !!!! :-)
you already identified one outlier   "MTWR1256,FBYARRANG"  - the expression in post 40354502 doesn't handle that correctly, at least I don't think so.  You didn't actually specify what the output should be for that one, so I guessed with mine.
The expressions produce different results - so they can't both be correct.
sdstuber,
>>also correctly checks patterns like   M2W3 (bad)  vs M2,W3 (good)
- actually M2W3 is valid, and your expression works --> F1R56 produces F1R56

I'm using regexp_substr(REPLACE(mtg_time,' ',''),'([MWTRFS]+[0-9]+,?)+')

which also produces correctly: MTWR1256,FBYARRANG = MTWR1256, which is correct.

I'm not worried about the comma.
Gadsden Consulting,

As much as I appreciate the points I don't feel I deserve them.  With your permission I would like to unaccept this so you can award the points where they are due.
Just to add a kink to the solutions, if a record looked like 'BY34, BY ARRANGEM', neither solution works, although slightwv's could be modifed to - select regexp_replace(text,'([ ,]+)?BY ?[A-Z]+,?') from table to accommodate that scenario.
>> if a record looked like 'BY34 ...

Not an issue.  Check the asker's previous question.  Allowed values are MTWRF

these are class periods, so M=Mon, T=Tues, R=Thurs, etc.

https://www.experts-exchange.com/questions/28528466/how-to-determine-the-number-of-instances-of-a-character-in-string-in-Oracle.html
Whew! I didn't see all of those posts. So the 'MTWR1256,FBYARRANG' should produce just MTRW1256? Have you confirmed that only the letters M, W, T, R, F, or S can be found prior to the "...ARRANG..." segment?
even without precedent - I declared the criteria as part of my suggestions.  

If that assumption was not true it was on the asker to clarify that point.
I also stated the goal of my approach was to search for rigorous "inclusive" rules, rather than fuzzy "exclusive" rules.

So, whatever the inclusive part was, if it could be rigorously defined, then that was the way to go.
slightwv,
>>As much as I appreciate the points I don't feel I deserve them.  I would like to unaccept this so you can award the points where they are due.
- sure, but I like to give points when someone is making a significant contribution.

awking00, thanks for weighing in
>>BY34, BY ARRANGEM
-- this wouldn't happen (but it could if the user was having a really bad day . . . )
- the provided solution returns null, which is correct.

>> Have you confirmed that only the letters M, W, T, R, F, or S can be found prior to the "...ARRANG..." segment?
- yes, but the user could start typing with their nose, I suppose . . . the value isn't validated.

--------------
note: I was about to post another question to strip out characters and leave only numbers. That way, I can compare the numbers in the original meeting time, e.g.,  F2,T56,BYARRANGEM --> F2,T56,

then I said to myself, hey, why not try out your new skill . . . and voila
regexp_substr(mtg_time,'([0-9])+') works like a champ :-)

so MTWR1256,FBYARRANG = 1256 as does the stripped period MTWR1256,

and I am good-to-go . . .
don't use the () unless you have to insider regular expressions they define sub-expressions which are expensive to unnest

regexp_substr(mtg_time,'[0-9]+')
>>>  sure, but I like to give points when someone is making a significant contribution.

you can leave as is, or resplit with weighted points for example 300/200, 400/100  instead of 250/250
Oh, hope you don't mind a quickie . . .

SELECT regexp_substr('M78910,','([710])+') FROM dual;
- this returns 7, how come it doesn't return 710 ?

SELECT regexp_substr('M78910,','([0-8])+') FROM dual;
- this returns 78, how come it doesn't return 7810 ?
([710])+ means 7 or 1 or 0 the "+" means one or more in series.

So, you are asking for 1 or more characters that are 7, 1 or 0 "together" in series.

In M78910, the first string that matches that pattern is the 7 (the 8 isn't in the list so it stops the match).

If you only want the 7's 1's and 0's 0 to remain:
SELECT regexp_replace('M78910,','[^710]+') FROM dual;

The '^ inside the square brackets means NOT.  So the replace says, replace everything that is NOT a 7, 1 or 0.
ok, thanks, slightwv. I'm still scratching my head on this a bit . . . will post other questions as needed.
Keep this in mind when dealing with regular expressions:
It is all about patterns.

If you can find the pattern in what you want to do, you can write a regex to get it.

They are INCREDIBLY POWERFUL and as such, can be INCREDIBLY CONFUSING.

The other frustrating thing about them is different implementations allow different 'features'.

Not everything in the book I'm about to recommend can be used in Oracle's implementation but I found it a good book to learn about regular expressions.  Well, back when I actually purchased books...

I'm also partial to this publisher when it comes to tech books:
http://shop.oreilly.com/product/9780596528126.do
small correction/expansion on slightwv's explanation

'([710])+'

can be read as...

start expression
"(" - start a sub-expression    (you don't want this, as noted above)
"[" - start a character set
"710" - these are the characters in the character set
"]" - end the character set
")" - end the sub expression
"+" - repeat the previous pattern 1 or more times, the previous pattern is ([710])
end expression

the sub-expression in this usage does nothing for you except make your expression slower

regular expression parsers are generally very literal, so even though the () would be removed as a dummy operation in many languages, not so in a regexp.  Those () are instructions and the parser follows them even if they aren't helpful
slighttwv,
>>They are INCREDIBLY POWERFUL and as such, can be INCREDIBLY CONFUSING.
- agreed on the confusing part :-)

>>I'm also partial to this publisher when it comes to tech books:
- excellent suggestion, thx.

----
sdstuber,
great, thanks for the expln.
ststuber,

thanks as well, a new trick for my toolbag . . .