Solved

using regexp_instr to compare values, need a little help in Oracle Sql

Posted on 2014-10-08
34
274 Views
Last Modified: 2014-10-13
I used Regexp_substr to good effect recently (with EE tutoring), now I need a little help with regexp_instr.

Here's the neat trick that EE helped me with:
regexp_substr(REPLACE(p_mtg_time_in,' ',''),'([MWTRFS]+[0-9]+,?)+')

this is a scheduling application at a school, with class periods 1 - 10, meeting on Mon - Sat. the regexp_substr took these values to convert:
MTWR1256,FBYARRANG produces MTWR1256 (stripping out BYARRANG and also the "F" because it's not followed by 123456789 (or 10).

Now I want to compare values and say "Ok" or "Bad !". Here are a bunch of examples:
A      B                     C
F12      F1           good
F12      F2           good

A                                              B                    C
MW6,BY ARRANGEMENT      M6        good
MW6,BY ARRANGEMENT      W6        good
MW6,BY ARRANGEMENT    R6         bad

A                      B                     C
MWF2,R34      M2            good
MWF2,R34      W2            good
MWF2,R34      R3             good
MWF2,R34      R4             good
MWF2,R34      M3            bad
MWF2,R34      W3            bad

A              B                          C
T10,R9      T10              good
T10,R9      T5                BAD
T10,R9      T6                BAD
T10,R9      R3                BAD
T10,R9      R4                BAD
T10,R9      R9                good
T10,R9      R10             BAD - 10 is not preceded by R !                
T10,R9      T9                BAD - 9 is not preceded by T since R breaks the pattern !

column B is always the day (M,T,W,R,F,S) followed by the period (1 - 10)

so here are some rules
if column B is found in Column A, it's good (F1 = F1)

you can find other letters before the number (MWF2,R34  | M2         good !)

you can find other numbers after the letter (F12 |       F2           good !)

but MWF2,R34 |      M3            bad M3 isn't found ! only R3 and R4 would be good

MWF23,R34 |      M3 ok
                                M2 ok
                                M4 bad ! because of the intervening R
                                R2 bad ! there's no 2 after R
                               
                                R3 ok  
                                R4 ok
           
                                F4 bad ! because of the intervening R
                                F3 ok
                                F2 ok

also, the comma is irrelevant.
0
Comment
Question by:Gadsden Consulting
  • 19
  • 8
  • 7
34 Comments
 
LVL 76

Assisted Solution

by:slightwv (䄆 Netminder)
slightwv (䄆 Netminder) earned 200 total points
ID: 40369295
Here is my first attempt.

I'm thinking it can be made much cleaner.  I'll continue playing with it.

/*
drop table tab1 purge;
create table tab1(a varchar2(10), b varchar2(10), c varchar2(10));

insert into tab1 values('MW6','M6','good');
insert into tab1 values('MW6','W6','good');
insert into tab1 values('MW6','R6','bad');
insert into tab1 values('MWF2,R34','M2','good');
insert into tab1 values('MWF2,R34','W2','good');
insert into tab1 values('MWF2,R34','R3','good');
insert into tab1 values('MWF2,R34','R4','good');
insert into tab1 values('MWF2,R34','M3','bad');
insert into tab1 values('MWF2,R34','W3','bad');
insert into tab1 values('T10,R9','T10','good');
insert into tab1 values('T10,R9','T5','BAD');
insert into tab1 values('T10,R9','T6','BAD');
insert into tab1 values('T10,R9','R3','BAD');
insert into tab1 values('T10,R9','R4','BAD');
insert into tab1 values('T10,R9','R9','good');
insert into tab1 values('T10,R9','R10','BAD');
insert into tab1 values('T10,R9','T9','BAD');
insert into tab1 values('MWF2,R34','M2','good');
insert into tab1 values('F12','F2','good');
insert into tab1 values('MWF2,R34','M3','bad');
insert into tab1 values('MWF23,R34','M3','good');
commit;
*/

select c,
	case when
		(instr(a_first,b_day) > 0 and instr(a_first,b_period) > 0)
		or
		(instr(a_second,b_day) > 0 and instr(a_second,b_period) > 0)
	then 'good'
	else 'bad'
	end
from (
select c,
	regexp_substr(a,'^[^,]+') a_first,
	regexp_substr(a,'[^,]+$') a_second,
	regexp_substr(b,'^[MTWRFS]') b_day,
	regexp_substr(b,'[0-9]+$') b_period
from tab1
)
/

Open in new window

0
 
LVL 73

Expert Comment

by:sdstuber
ID: 40369303
that's quite a bit trickier than the first, but still doable


  SELECT a,
         b,
         MAX(
             CASE
                 WHEN REGEXP_REPLACE(
                          REGEXP_SUBSTR(
                              a,
                              '[^,]+',
                              1,
                              COLUMN_VALUE
                          ),
                          '[^' || b || ']'
                      ) = b
                 THEN
                     'good'
                 ELSE
                     'bad'
             END
         )
             c
    FROM (SELECT RTRIM(REGEXP_SUBSTR(REPLACE(a, ' ', ''), '([MWTRFS]+[0-9]+,?)+'), ',') a, b, c
            FROM yourtable) x,
         TABLE(
                 SELECT COLLECT(LEVEL)
                   FROM DUAL
             CONNECT BY LEVEL <= LENGTH(x.a) - LENGTH(REPLACE(x.a, ',')) + 1
         )
GROUP BY a, b

Open in new window


or, if the max number of csv values in A is 2 then try this...

SELECT a,
       b,
       GREATEST(
           CASE
               WHEN REGEXP_REPLACE(
                        REGEXP_SUBSTR(
                            a,
                            '[^,]+',
                            1,
                            2
                        ),
                        '[^' || b || ']'
                    ) = b
               THEN
                   'good'
               ELSE
                   'bad'
           END,
           CASE
               WHEN REGEXP_REPLACE(
                        REGEXP_SUBSTR(
                            a,
                            '[^,]+',
                            1,
                            1
                        ),
                        '[^' || b || ']'
                    ) = b
               THEN
                   'good'
               ELSE
                   'bad'
           END
       )
           c
  FROM (SELECT RTRIM(REGEXP_SUBSTR(REPLACE(a, ' ', ''), '([MWTRFS]+[0-9]+,?)+'), ',') a, b, c
          FROM yourtable)

Open in new window

0
 

Author Comment

by:Gadsden Consulting
ID: 40369313
Thank you both ! I can't test it until tomorrow, but I'll do that first thing,
0
 
LVL 73

Expert Comment

by:sdstuber
ID: 40369395
just a note on usage -  slightwv's assumes the data has already been cleansed.

mine assumes the data is still dirty and will need to be cleansed as part of the check, so you can't use exactly the same input data for both
0
 
LVL 76

Expert Comment

by:slightwv (䄆 Netminder)
ID: 40369579
sdstuber's post is the way to go.

I also missed the possibility of:
M1,T6,WR89,F3,S10

I took just TWO values based on the data you provided.  I wasn't thinking about the previous questions and additional possibilities.

Personally, instead of writing a bunch of 'magic' SQL for a bad design, I would change the design and normalize the design.  It would probably save you a LOT of headaches.
0
 

Author Comment

by:Gadsden Consulting
ID: 40369722
sdstuber, the data has been cleaned (at least the comparison side - T1, M2, etc.). I can use last week's lesson to clean the left side - MW6,BY ARRANGEMENT.

slightwv - thanks as well.  sorry, I forgot to include a wild (but valid) example, like M1,T6,WR89,F3,S10. There are several "triple" combinations as well.

>>instead of writing a bunch of 'magic' SQL for a bad design, I would change the design and normalize the design.
- I don't think I can change anything. For the previous question where I asked about regexp_substr, I was taking the input of M1,T6,WR89,F3,S10 and producing
1 on Mondays
6 on Tuesdays
89 on Weds and Thurs
etc.

So what this query is doing is validating the output, there are like 10,000 results of courses, dates, etc.

If I can get this post working, it's an independent - and automated - validation of the program that produced the periods for the given days.

I probably won't have time to fiddle with it tonight, but I'll be at it first thing tomorrow.

Even if I can handle 80% of the combinations, that's going to make validation a lot easier in that I'd only have to review 20%.
0
 
LVL 76

Expert Comment

by:slightwv (䄆 Netminder)
ID: 40369773
>>- I don't think I can change anything.

Think about this:
Just because the inputs and outputs are 'fixed' doesn't mean there isn't a better way between those boundaries.

There are a few ways to 'mask' the underlying architecture.

Off the top of my head I don't have the sql to turn a table like:
id,day,period
1,M,1
1,T,1
1,R,3
2,M,1
2,T,1
2,W,1

into:
MT1,R3
MTW1

but it should be possible.  You can parse the inputs into the normalized table.

This should make things much easier in the reporting/validation/etc... fronts.
0
 

Author Comment

by:Gadsden Consulting
ID: 40369781
Ok, I'll mull that over too, thanks for following up.
0
 
LVL 73

Expert Comment

by:sdstuber
ID: 40369827
if your data is already clean then it's even easier, same idea, just remove the cleaningsubquery from my first post


  SELECT a,
         b,
         MAX(CASE
                 WHEN REGEXP_REPLACE(REGEXP_SUBSTR(a,
                                                   '[^,]+',
                                                   1,
                                                   COLUMN_VALUE),
                                     '[^' || b || ']') = b
                 THEN
                     'good'
                 ELSE
                     'bad'
             END)
             c
    FROM yourtable x,
         TABLE(    SELECT COLLECT(LEVEL)
                     FROM DUAL
               CONNECT BY LEVEL <= LENGTH(x.a) - LENGTH(REPLACE(x.a, ',')) + 1)
GROUP BY a, b
0
 

Author Comment

by:Gadsden Consulting
ID: 40369841
Ok, thanks ! I'll be on this tomorrow :-)
0
 

Author Comment

by:Gadsden Consulting
ID: 40370556
Holy cow, it's working !!!! great team effort :-)

Slightwv, I used your table creation and popped in the data

sdstuber, I loaded slightwv's data to [my_table], and ran your query as is - bingo ! it worked with test data.

I've loaded my (complete set of) test data to [my_table], and am checking it now - so far so good ! There are some bad combinations, but that looks accurate as far as the data is concerned - so far the query is looking good . . . have to spot check 1,110 results, but that won't take too awful long.
0
 
LVL 76

Expert Comment

by:slightwv (䄆 Netminder)
ID: 40370564
>>great team effort :-)

Thanks for partial inclusion here but I was WAY off on this one.  sdstuber's method is the way to go here!
0
 

Author Comment

by:Gadsden Consulting
ID: 40370585
Now if only I could understand what it's doing internally . . . LOL . . .
0
 

Author Comment

by:Gadsden Consulting
ID: 40370624
550 checked, 550 accurate, including correctly identifying some bad combinations . . . 550 to go.

I will also check some of the wild ones, but all the triple ones have come back accurate, though all are good. So I'll pop in something bad to confirm.

But so far these all look good
MW56,R123,F34	M5	good
MW56,R123,F34	M6	good
MW56,R123,F34	W5	good
MW56,R123,F34	W6	good
MW56,R123,F34	R1	good
MW56,R123,F34	R2	good
MW56,R123,F34	R3	good
MW56,R123,F34	F3	good
MW56,R123,F34	F4	good
.
MT567	W5	bad
MT567	W6	bad
MT567	W7	bad
.
MF1,MWF3	M1	good
MF1,MWF3	M3	good
MF1,MWF3	M6	bad
MF1,MWF3	T6	bad
MF1,MWF3	W3	good
MF1,MWF3	W6	bad
MF1,MWF3	R6	bad
MF1,MWF3	F1	good
MF1,MWF3	F3	good
MF1,MWF3	F6	bad

Open in new window

0
 
LVL 73

Assisted Solution

by:sdstuber
sdstuber earned 300 total points
ID: 40370635
>>> Now if only I could understand what it's doing internally . . . LOL . . .

 TABLE(    SELECT COLLECT(LEVEL)
                     FROM DUAL
               CONNECT BY LEVEL <= LENGTH(x.a) - LENGTH(REPLACE(x.a, ',')) + 1)

Open in new window



This builds a set of numbers for each row, 1-N where N is how many pieces are in your schedule.

F12 is a set of 1  {1}
MWF2,R34   is a set of 2 {1,2}
M1,T6,WR89,F3,S10 is a set of 5 {1,2,3,4,5}


REGEXP_SUBSTR(a,
                                                   '[^,]+',
                                                   1,
                                                   COLUMN_VALUE)

Open in new window


This pulls each piece of the schedule out as a distinct value

So MWF2,R34 would become two rows
MWF2
R34


 REGEXP_REPLACE(..., '[^' || b || ']') 

Open in new window


This strips everything out of each piece that is NOT part of B
So, if I'm looking for T3,  then throw away everything that is NOT Tuesday and not 3
and then we check if what is left is equal to B or not.
0
 

Author Comment

by:Gadsden Consulting
ID: 40370673
sdstuber, thanks ! I will definitely study this, and thanks for breaking it down.
0
 
LVL 73

Accepted Solution

by:
sdstuber earned 300 total points
ID: 40370802
don't celebrate just yet...

The inclusion of "10" as a valid period breaks all of the previous methods.

Because something like   F10  will yield a false positive for F1
both my slightwv's  INSTR method, and my REGEXP_REPLACE method

However by combining both of our approaches and extending the CASE to include a specific check for 1 vs 10 it's possible to handle all of the cases.

 SELECT a,
         b,
         c,
         MAX(
             CASE
                 WHEN INSTR(sch, b_day) = 0 THEN 'bad'
                 WHEN b_period = '1' AND NOT REGEXP_LIKE(sch, '1([^0]|$)') THEN 'bad'
                 WHEN INSTR(sch, b_period) > 0 THEN 'good'
                 ELSE 'bad'
             END
         )
             new_method
    FROM (SELECT a,
                 b,
                 c,
                 REGEXP_SUBSTR(
                     a,
                     '[^,]+',
                     1,
                     COLUMN_VALUE
                 )
                     sch,
                 REGEXP_SUBSTR(b, '^[MTWRFS]') b_day,
                 REGEXP_SUBSTR(b, '[0-9]+$') b_period
            FROM yourtable x,
                 TABLE(
                         SELECT COLLECT(LEVEL)
                           FROM DUAL
                     CONNECT BY LEVEL <= LENGTH(x.a) - LENGTH(REPLACE(x.a, ',')) + 1
                 ))
GROUP BY a, b, c
ORDER BY a, b, c;

Open in new window

0
Do You Know the 4 Main Threat Actor Types?

Do you know the main threat actor types? Most attackers fall into one of four categories, each with their own favored tactics, techniques, and procedures.

 

Author Comment

by:Gadsden Consulting
ID: 40370819
Ok, thanks for the followup !

I will finish testing the current logic and then fold in the "10" logic.

Bill
0
 
LVL 76

Expert Comment

by:slightwv (䄆 Netminder)
ID: 40370872
Had some time to elaborate on what I posted back in http:#a40369773

Normalize the data, mask the normalization with a view.

The only thing it doesn't have is parsing the inputs to insert into the base tables.

Read up on INSTEAD OF triggers:
http://docs.oracle.com/cd/E11882_01/appdev.112/e10766/tdddg_triggers.htm#TDDDG52800

This is pretty quick and dirty but shows the abilities (I think).

drop table event_days purge;
create table event_days(event_id number, series_id number, day char(1));

drop table event_periods purge;
create table event_periods(event_id number, series_id number, period number);

--MWF23,R34,T10,F5
insert into event_days values(1,1,'M');
insert into event_days values(1,1,'W');
insert into event_days values(1,1,'F');
insert into event_days values(1,2,'R');
insert into event_days values(1,3,'T');
insert into event_days values(1,4,'F');

insert into event_periods values(1,1,2);
insert into event_periods values(1,1,3);
insert into event_periods values(1,2,3);
insert into event_periods values(1,2,4);
insert into event_periods values(1,3,10);
insert into event_periods values(1,4,5);


--MTWR1256
insert into event_days values(2,1,'M');
insert into event_days values(2,1,'T');
insert into event_days values(2,1,'W');
insert into event_days values(2,1,'R');

insert into event_periods values(2,1,1);
insert into event_periods values(2,1,2);
insert into event_periods values(2,1,5);
insert into event_periods values(2,1,6);
commit;

create or replace view event_vw as
select ed.event_id, listagg(days_for_event || periods_for_event,',') within group(order by case substr(days_for_event,1,1) when 'M' then 1 when 'T' then 2 when 'W' then 3 when 'R' then 4 when 'F' then 5 when 'S' then 6 end) event_list
from
	(
	select event_id, series_id, 
		listagg(day) within group (order by case day when 'M' then 1 when 'T' then 2 when 'W' then 3 when 'R' then 4 when 'F' then 5 when 'S' then 6 end) days_for_event
	from event_days ed
	group by ed.event_id,ed.series_id
	) ed
join
	(
	select event_id, series_id, 
		listagg(period) within group (order by period) periods_for_event
	from event_periods ep
	group by ep.event_id,ep.series_id
	) ep
on ed.event_id=ep.event_id and ed.series_id=ep.series_id
group by ed.event_id
/

select * from event_vw;

Open in new window

0
 
LVL 76

Expert Comment

by:slightwv (䄆 Netminder)
ID: 40370877
It also doesn't have the primary keys but that is ALL columns on both tables.

This will keep 'accidents' from happening.

It would also make the answer to this question a LOT easier.

With the primary keys, this record would never enter the tables:
MMTW224
0
 

Author Comment

by:Gadsden Consulting
ID: 40370994
I think it's good . . .

Results prior to latest change:
M1,WF1	        M1	good
M1,WF1	        M10	 bad
M1,WF1	        W1	good
M1,WF1	        F10	bad

M10,WF10	M1	good   <-- this should not be good !
M10,WF10	M10	good
M10,WF10	W1	good  <-- this should not be good !
M10,WF10	W10	good

Open in new window


With the fix, it looks good . . .
M1,WF1	            F10	bad
M1,WF1	            M1	good
M1,WF1	            M10	bad
M1,WF1	           W1	good

M10,WF10	   M1	bad     <-- correct !
M10,WF10	   M10	good
M10,WF10	   W1	bad       <-- correct !
M10,WF10	  W10	good

Open in new window


sdstuber, with your latest update, it didn't run out of the box . . . it squawked on column "c". So I commented out "c" and it worked . . .
0
 
LVL 73

Expert Comment

by:sdstuber
ID: 40371007
I included the manually derived "good" and"bad" into my local copy of your sample data.

That's how I checked for correctness.

I would look for anything where "c" != myresult

You real table, obviously wouldn't have that, but it's not really needed, it was only for debugging.
0
 

Author Comment

by:Gadsden Consulting
ID: 40371027
that's what I thought. It's working now, so I'm moving to get it to work against my real table.

Thx.
0
 

Author Comment

by:Gadsden Consulting
ID: 40371230
I am working to implement this, but it probably won't happen until tomorrow.

Looks like this hit the mark, though, so thanks again :-)
0
 

Author Comment

by:Gadsden Consulting
ID: 40373104
I'm going to close this, as I tested it and it works great. I still have to implement everything, this might spill into next week.

But this was really fantastic, I haven't presented to the client yet, I'll let you know when I do :-)
0
 

Author Closing Comment

by:Gadsden Consulting
ID: 40373114
super-fantastic !
0
 

Author Comment

by:Gadsden Consulting
ID: 40373145
slightwv, I'm sorry, I missed a couple of your posts:
Normalize the data, mask the normalization with a view.
- good link on INSTEAD OF triggers, thx.

- in your query, it's squawking, saying that "from keyword is not found where expected", and then the cursor rests on "group" (even in the inside queries).

>>With the primary keys, this record would never enter the tables: MMTW224
- yes, but in our system that COULD be recorded ! since user enters the data, they could fat-finger it . . . :-(
0
 
LVL 76

Expert Comment

by:slightwv (䄆 Netminder)
ID: 40373176
Let me guess, database isn't 11g or above?

If 10g, I'll have to tweak it a little and am willing to do so if you think normalization is an option.  If you don't think this is a possibility, I would rather not take the time.

I'm thinking not since you stated:
but in our system that COULD be recorded

I wonder if the app is trapping errors in such a way that the database not accepting the data, would nicely tell the end user they are entering something 'bad'?
0
 

Author Comment

by:Gadsden Consulting
ID: 40373269
yes, d.b. is not 11g+, but we are moving there at the moment, but not next week . . .

Normalization is not an option, but I am interested in learning the tricks in your query . . . all of which I have never seen . . .

>>I wonder if the app is trapping errors in such a way that the database not accepting the data, would nicely tell the end user they are entering something 'bad'?
- definitely a good idea ! I have already commented on that in the course of my analysis and testing.
0
 
LVL 76

Expert Comment

by:slightwv (䄆 Netminder)
ID: 40373303
>>but I am interested in learning the tricks in your query

Feel free to self-study and ask anything that you don't understand about what I posted.

LISTAGG is a string aggregator and is new to 11g.

It concatenates values together with whatever character you give it.  In the example I posted above, I didn't provide a character so it defaulted to null.

It makes creating a CSV much easier than old methods.

>>Normalization is not an option,

Then some more experimenting for you to do.

Prior to 11g, there were several tricks:
http://www.oracle-base.com/articles/misc/string-aggregation-techniques.php

In that link above, forget WM_CONCAT is even mentioned...  Seriously, purge it from your mind!!!  Don't even think about EVER using it!!!

There is also an XML trick that the other Expert here, sdstuber, enlightened me to many years ago.  I actually prefer it over the User Defined Function (UDF) although the UDF is supposed to be faster.

http://www.experts-exchange.com/Database/Oracle/Q_24914739.html#a25864822
0
 
LVL 73

Expert Comment

by:sdstuber
ID: 40373435
if 10g then use the COLLECT  builtin along with a function to write the collection into a string.  Much simpler and much more efficient.

But, the XML method has the advantage that you don't need to create any other objects.  Which, if you don't have permissions might make it the only option.
0
 

Author Comment

by:Gadsden Consulting
ID: 40373456
excellent, thanks to both of you for the follow-up. I am fighting to get back to this but getting caught in other (related) priorities. But I will be back for sure.
0
 

Author Comment

by:Gadsden Consulting
ID: 40373866
I did get it working with my own data and everything works ! I have to sort through some details, but I'm good to go with this post . . . thanks again :-)
0
 

Author Comment

by:Gadsden Consulting
ID: 40374005
Gentlemen,

complete success :-) I implemented, ran my test, and no "bad" records. The logic I tested included the prev post where I got regexp_substr tips, plus this one.

I'm expecting some kudo's next week when I show the boss . . .

and a good end to the week !
0

Featured Post

How to run any project with ease

Manage projects of all sizes how you want. Great for personal to-do lists, project milestones, team priorities and launch plans.
- Combine task lists, docs, spreadsheets, and chat in one
- View and edit from mobile/offline
- Cut down on emails

Join & Write a Comment

Cursors in Oracle: A cursor is used to process individual rows returned by database system for a query. In oracle every SQL statement executed by the oracle server has a private area. This area contains information about the SQL statement and the…
Occasionally there is a need to clean table columns, especially if you have inherited legacy data. There are obviously many ways to accomplish that, including elaborate UPDATE queries with anywhere from one to numerous REPLACE functions (even within…
This video explains at a high level with the mandatory Oracle Memory processes are as well as touching on some of the more common optional ones.
Via a live example, show how to take different types of Oracle backups using RMAN.

747 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

12 Experts available now in Live!

Get 1:1 Help Now