Solved

need Sql trick to strip out non-standard text in Oracle

Posted on 2014-10-01
31
228 Views
Last Modified: 2014-10-01
I need to strip out "BY ARRANGEMENT" from text, but it's spelled differently :-(

Here are the variations
F2,T56,BY ARRANGEM
M5,T56,BY ARRANGEM
M6,T56,BY ARRANGEM
MTWR1256,FBYARRANG
MWF3,BYARRANGEMENT
MW6,BY ARRANGMENT
T56,BYARRANGEMENT
T56,F2,BYARRANGEME
TWF7,BY ARRANGEMN
TWF7,BYARRANGEMENT
TR34,BY ARRANGEMEN
TR56,BY ARRANGEMEN
TR10,BYARRANGEMENT
TR8,BYARRANGEMENT
TR9,BYARRANGEMENT
WF5,BYARRANGEMENT
T56,BYARRANGEMENT
BY ARRANGEMENT,F1
T56,BYARRANGEMENT
W6,BYARRANGEMENT


BY ARRANGEMENT
F2,BY ARRANGEMENT
F3,BY ARRANGEMENT
F4,BY ARRANGEMENT
M3, BY ARRANGEMENT
M5,BY ARRANGEMENT
M6,BY ARRANGEMENT
MF3,BY ARRANGEMENT
MF7,BY ARRANGEMENT
MW2,BY ARRANGEMENT
MW6,BY ARRANGEMENT
R12,BY ARRANGEMENT
R34,BY ARRANGEMENT
R56,BY ARRANGEMENT
T2,BY ARRANGEMENT
T34,BY ARRANGEMENT
T5,BY ARRANGEMENT
TF7,BY ARRANGEMENT
W2,BY ARRANGEMENT
W5,BY ARRANGEMENT
W7,BY ARRANGEMENT
WF2,BY ARRANGEMENT
WF3,BY ARRANGEMENT
T34,BY ARRANGEMENT
F1,BY ARRANGEMENT
T34,BY ARRANGEMENT
0
Comment
Question by:Gadsden Consulting
  • 11
  • 10
  • 8
  • +1
31 Comments
 
LVL 76

Expert Comment

by:slightwv (䄆 Netminder)
ID: 40354493
Interesting...

I think I'm close but need clarification.

What is the expected output from:
 MTWR1256,FBYARRANG
0
 
LVL 73

Expert Comment

by:sdstuber
ID: 40354495
Is it always the last string in the comma delimited list?

if so, try this

regexp_replace(str,'[^,]+$')

and if you need to remove the trailing comma when there is one try this

regexp_replace(str,',?[^,]+$')
0
 
LVL 76

Assisted Solution

by:slightwv (䄆 Netminder)
slightwv (䄆 Netminder) earned 250 total points
ID: 40354502
>>Is it always the last string in the comma delimited list?

Thought that myself until I found:
BY ARRANGEMENT,F1
BY ARRANGEMENT

Here is what I have:
select regexp_replace(col1,'([ ,]+)?BY[^,]+([ ,]+)?') from tab1;

It produces the following:

F2,T56
M5,T56
M6,T56
MTWR1256,F
MWF3
MW6
T56
T56,F2
TWF7
TWF7
TR34
TR56
TR10
TR8
TR9
WF5
T56
F1
T56
W6

F2
F3
F4
M3
M5
M6
MF3
MF7
MW2
MW6
R12
R34
R56
T2
T34
T5
TF7
W2
W5
W7
WF2
WF3
T34
F1
T34

Open in new window

0
 
LVL 73

Accepted Solution

by:
sdstuber earned 250 total points
ID: 40354505
alternately, if you're looking for strings of  M,T,W,R,F  followed by digits then try this


regexp_substr(str,'([MWTRF]+[0-9]+,?)+')

and again, if you want to remove the trailing comma

rtrim(regexp_substr(str,'([MWTRF]+[0-9]+,?)+'),',')
0
 
LVL 76

Expert Comment

by:slightwv (䄆 Netminder)
ID: 40354509
To add to my last post:  It might work until they misspell 'BY'...

It will fail on something like:
MI6,BI ARRANGEMENT
MI6,B Y ARRANGEMENT
0
 
LVL 73

Expert Comment

by:sdstuber
ID: 40354510
I didn't see the "BY ARRANGEMT,F1"  - thanks

my first post would have handled the "BY ARRANGEMENT"  though.

My second post does them all (at least as far as I can tell)
0
 

Author Comment

by:Gadsden Consulting
ID: 40354511
ok, I'm working on it, thanks very much !

this is an outlier: MTWR1256,FBYARRANG - that's "Friday By Arrangement" but breaks the rules . . . :-(
0
 

Author Comment

by:Gadsden Consulting
ID: 40354514
>>select regexp_replace(col1,'([ ,]+)?BY[^,]+([ ,]+)?') from tab1;
- Whoa ! looks really good, checking now . . .
0
 
LVL 76

Expert Comment

by:slightwv (䄆 Netminder)
ID: 40354521
Forget mine for now.

Look at what sdstuber posted in  http:#a40354505

I think that is more what you want and is much cleaner than mine.
0
 

Author Comment

by:Gadsden Consulting
ID: 40354531
sdstuber, thanks ! works perfectly !!!!

---
slightwv, you had all but the FBYArrangement, but that was slick too !
0
 
LVL 73

Expert Comment

by:sdstuber
ID: 40354540
Since the part you want seems to have more rigid construction rules, it's a lot easier to search for that

here's a slight improvement on my last post, this explictly checks for the pattern at the beginning of the string and also correctly checks patterns like   M2W3 (bad)  vs M2,W3 (good)

REGEXP_SUBSTR(str, '^([MWTRF]+[0-9]+,)+([MWTRF]+[0-9]+)?')     -- keeping the trailing comma


REGEXP_SUBSTR(str, '^([MWTRF]+[0-9]+,)*[MWTRF]+[0-9]+')    -- removing the trailing comma
0
 

Author Closing Comment

by:Gadsden Consulting
ID: 40354544
Thanks, my boss will be impressed !!!! :-)
0
 
LVL 73

Expert Comment

by:sdstuber
ID: 40354546
you already identified one outlier   "MTWR1256,FBYARRANG"  - the expression in post 40354502 doesn't handle that correctly, at least I don't think so.  You didn't actually specify what the output should be for that one, so I guessed with mine.
0
 
LVL 73

Expert Comment

by:sdstuber
ID: 40354553
The expressions produce different results - so they can't both be correct.
0
 

Author Comment

by:Gadsden Consulting
ID: 40354612
sdstuber,
>>also correctly checks patterns like   M2W3 (bad)  vs M2,W3 (good)
- actually M2W3 is valid, and your expression works --> F1R56 produces F1R56

I'm using regexp_substr(REPLACE(mtg_time,' ',''),'([MWTRFS]+[0-9]+,?)+')

which also produces correctly: MTWR1256,FBYARRANG = MTWR1256, which is correct.

I'm not worried about the comma.
0
Threat Intelligence Starter Resources

Integrating threat intelligence can be challenging, and not all companies are ready. These resources can help you build awareness and prepare for defense.

 
LVL 76

Expert Comment

by:slightwv (䄆 Netminder)
ID: 40354637
Gadsden Consulting,

As much as I appreciate the points I don't feel I deserve them.  With your permission I would like to unaccept this so you can award the points where they are due.
0
 
LVL 31

Expert Comment

by:awking00
ID: 40354767
Just to add a kink to the solutions, if a record looked like 'BY34, BY ARRANGEM', neither solution works, although slightwv's could be modifed to - select regexp_replace(text,'([ ,]+)?BY ?[A-Z]+,?') from table to accommodate that scenario.
0
 
LVL 76

Expert Comment

by:slightwv (䄆 Netminder)
ID: 40354778
>> if a record looked like 'BY34 ...

Not an issue.  Check the asker's previous question.  Allowed values are MTWRF

these are class periods, so M=Mon, T=Tues, R=Thurs, etc.

http://www.experts-exchange.com/Database/Oracle/Q_28528466.html
0
 
LVL 31

Expert Comment

by:awking00
ID: 40354806
Whew! I didn't see all of those posts. So the 'MTWR1256,FBYARRANG' should produce just MTRW1256? Have you confirmed that only the letters M, W, T, R, F, or S can be found prior to the "...ARRANG..." segment?
0
 
LVL 73

Expert Comment

by:sdstuber
ID: 40354809
even without precedent - I declared the criteria as part of my suggestions.  

If that assumption was not true it was on the asker to clarify that point.
I also stated the goal of my approach was to search for rigorous "inclusive" rules, rather than fuzzy "exclusive" rules.

So, whatever the inclusive part was, if it could be rigorously defined, then that was the way to go.
0
 

Author Comment

by:Gadsden Consulting
ID: 40354866
slightwv,
>>As much as I appreciate the points I don't feel I deserve them.  I would like to unaccept this so you can award the points where they are due.
- sure, but I like to give points when someone is making a significant contribution.

awking00, thanks for weighing in
>>BY34, BY ARRANGEM
-- this wouldn't happen (but it could if the user was having a really bad day . . . )
- the provided solution returns null, which is correct.

>> Have you confirmed that only the letters M, W, T, R, F, or S can be found prior to the "...ARRANG..." segment?
- yes, but the user could start typing with their nose, I suppose . . . the value isn't validated.

--------------
note: I was about to post another question to strip out characters and leave only numbers. That way, I can compare the numbers in the original meeting time, e.g.,  F2,T56,BYARRANGEM --> F2,T56,

then I said to myself, hey, why not try out your new skill . . . and voila
regexp_substr(mtg_time,'([0-9])+') works like a champ :-)

so MTWR1256,FBYARRANG = 1256 as does the stripped period MTWR1256,

and I am good-to-go . . .
0
 
LVL 73

Expert Comment

by:sdstuber
ID: 40355011
don't use the () unless you have to insider regular expressions they define sub-expressions which are expensive to unnest

regexp_substr(mtg_time,'[0-9]+')
0
 
LVL 73

Expert Comment

by:sdstuber
ID: 40355017
>>>  sure, but I like to give points when someone is making a significant contribution.

you can leave as is, or resplit with weighted points for example 300/200, 400/100  instead of 250/250
0
 

Author Comment

by:Gadsden Consulting
ID: 40355268
Oh, hope you don't mind a quickie . . .

SELECT regexp_substr('M78910,','([710])+') FROM dual;
- this returns 7, how come it doesn't return 710 ?

SELECT regexp_substr('M78910,','([0-8])+') FROM dual;
- this returns 78, how come it doesn't return 7810 ?
0
 
LVL 76

Expert Comment

by:slightwv (䄆 Netminder)
ID: 40355285
([710])+ means 7 or 1 or 0 the "+" means one or more in series.

So, you are asking for 1 or more characters that are 7, 1 or 0 "together" in series.

In M78910, the first string that matches that pattern is the 7 (the 8 isn't in the list so it stops the match).

If you only want the 7's 1's and 0's 0 to remain:
SELECT regexp_replace('M78910,','[^710]+') FROM dual;

The '^ inside the square brackets means NOT.  So the replace says, replace everything that is NOT a 7, 1 or 0.
0
 

Author Comment

by:Gadsden Consulting
ID: 40355319
ok, thanks, slightwv. I'm still scratching my head on this a bit . . . will post other questions as needed.
0
 
LVL 76

Expert Comment

by:slightwv (䄆 Netminder)
ID: 40355342
Keep this in mind when dealing with regular expressions:
It is all about patterns.

If you can find the pattern in what you want to do, you can write a regex to get it.

They are INCREDIBLY POWERFUL and as such, can be INCREDIBLY CONFUSING.

The other frustrating thing about them is different implementations allow different 'features'.

Not everything in the book I'm about to recommend can be used in Oracle's implementation but I found it a good book to learn about regular expressions.  Well, back when I actually purchased books...

I'm also partial to this publisher when it comes to tech books:
http://shop.oreilly.com/product/9780596528126.do
0
 
LVL 73

Expert Comment

by:sdstuber
ID: 40355358
small correction/expansion on slightwv's explanation

'([710])+'

can be read as...

start expression
"(" - start a sub-expression    (you don't want this, as noted above)
"[" - start a character set
"710" - these are the characters in the character set
"]" - end the character set
")" - end the sub expression
"+" - repeat the previous pattern 1 or more times, the previous pattern is ([710])
end expression

the sub-expression in this usage does nothing for you except make your expression slower

regular expression parsers are generally very literal, so even though the () would be removed as a dummy operation in many languages, not so in a regexp.  Those () are instructions and the parser follows them even if they aren't helpful
0
 

Author Comment

by:Gadsden Consulting
ID: 40355510
slighttwv,
>>They are INCREDIBLY POWERFUL and as such, can be INCREDIBLY CONFUSING.
- agreed on the confusing part :-)

>>I'm also partial to this publisher when it comes to tech books:
- excellent suggestion, thx.

----
sdstuber,
great, thanks for the expln.
0
 
LVL 73

Expert Comment

by:sdstuber
ID: 40355515
another great resource on regexps

http://www.regular-expressions.info/tutorial.html
0
 

Author Comment

by:Gadsden Consulting
ID: 40355533
ststuber,

thanks as well, a new trick for my toolbag . . .
0

Featured Post

How to improve team productivity

Quip adds documents, spreadsheets, and tasklists to your Slack experience
- Elevate ideas to Quip docs
- Share Quip docs in Slack
- Get notified of changes to your docs
- Available on iOS/Android/Desktop/Web
- Online/Offline

Join & Write a Comment

PL/SQL can be a very powerful tool for working directly with database tables. Being able to loop will allow you to perform more complex operations, but can be a little tricky to write correctly. This article will provide examples of basic loops alon…
This article describes how to use the timestamp of existing data in a database to allow Tableau to calculate the prior work day instead of relying on case statements or if statements to calculate the days of the week.
This video shows how to set up a shell script to accept a positional parameter when called, pass that to a SQL script, accept the output from the statement back and then manipulate it in the Shell.
Via a live example, show how to take different types of Oracle backups using RMAN.

707 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

16 Experts available now in Live!

Get 1:1 Help Now