Link to home
Start Free TrialLog in
Avatar of William Peck
William PeckFlag for United States of America

asked on

using regexp_instr to compare values, need a little help in Oracle Sql

I used Regexp_substr to good effect recently (with EE tutoring), now I need a little help with regexp_instr.

Here's the neat trick that EE helped me with:
regexp_substr(REPLACE(p_mtg_time_in,' ',''),'([MWTRFS]+[0-9]+,?)+')

this is a scheduling application at a school, with class periods 1 - 10, meeting on Mon - Sat. the regexp_substr took these values to convert:
MTWR1256,FBYARRANG produces MTWR1256 (stripping out BYARRANG and also the "F" because it's not followed by 123456789 (or 10).

Now I want to compare values and say "Ok" or "Bad !". Here are a bunch of examples:
A      B                     C
F12      F1           good
F12      F2           good

A                                              B                    C
MW6,BY ARRANGEMENT      M6        good
MW6,BY ARRANGEMENT      W6        good
MW6,BY ARRANGEMENT    R6         bad

A                      B                     C
MWF2,R34      M2            good
MWF2,R34      W2            good
MWF2,R34      R3             good
MWF2,R34      R4             good
MWF2,R34      M3            bad
MWF2,R34      W3            bad

A              B                          C
T10,R9      T10              good
T10,R9      T5                BAD
T10,R9      T6                BAD
T10,R9      R3                BAD
T10,R9      R4                BAD
T10,R9      R9                good
T10,R9      R10             BAD - 10 is not preceded by R !                
T10,R9      T9                BAD - 9 is not preceded by T since R breaks the pattern !

column B is always the day (M,T,W,R,F,S) followed by the period (1 - 10)

so here are some rules
if column B is found in Column A, it's good (F1 = F1)

you can find other letters before the number (MWF2,R34  | M2         good !)

you can find other numbers after the letter (F12 |       F2           good !)

but MWF2,R34 |      M3            bad M3 isn't found ! only R3 and R4 would be good

MWF23,R34 |      M3 ok
                                M2 ok
                                M4 bad ! because of the intervening R
                                R2 bad ! there's no 2 after R
                               
                                R3 ok  
                                R4 ok
           
                                F4 bad ! because of the intervening R
                                F3 ok
                                F2 ok

also, the comma is irrelevant.
SOLUTION
Avatar of slightwv (䄆 Netminder)
slightwv (䄆 Netminder)

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of Sean Stuber
Sean Stuber

that's quite a bit trickier than the first, but still doable


  SELECT a,
         b,
         MAX(
             CASE
                 WHEN REGEXP_REPLACE(
                          REGEXP_SUBSTR(
                              a,
                              '[^,]+',
                              1,
                              COLUMN_VALUE
                          ),
                          '[^' || b || ']'
                      ) = b
                 THEN
                     'good'
                 ELSE
                     'bad'
             END
         )
             c
    FROM (SELECT RTRIM(REGEXP_SUBSTR(REPLACE(a, ' ', ''), '([MWTRFS]+[0-9]+,?)+'), ',') a, b, c
            FROM yourtable) x,
         TABLE(
                 SELECT COLLECT(LEVEL)
                   FROM DUAL
             CONNECT BY LEVEL <= LENGTH(x.a) - LENGTH(REPLACE(x.a, ',')) + 1
         )
GROUP BY a, b

Open in new window


or, if the max number of csv values in A is 2 then try this...

SELECT a,
       b,
       GREATEST(
           CASE
               WHEN REGEXP_REPLACE(
                        REGEXP_SUBSTR(
                            a,
                            '[^,]+',
                            1,
                            2
                        ),
                        '[^' || b || ']'
                    ) = b
               THEN
                   'good'
               ELSE
                   'bad'
           END,
           CASE
               WHEN REGEXP_REPLACE(
                        REGEXP_SUBSTR(
                            a,
                            '[^,]+',
                            1,
                            1
                        ),
                        '[^' || b || ']'
                    ) = b
               THEN
                   'good'
               ELSE
                   'bad'
           END
       )
           c
  FROM (SELECT RTRIM(REGEXP_SUBSTR(REPLACE(a, ' ', ''), '([MWTRFS]+[0-9]+,?)+'), ',') a, b, c
          FROM yourtable)

Open in new window

Avatar of William Peck

ASKER

Thank you both ! I can't test it until tomorrow, but I'll do that first thing,
just a note on usage -  slightwv's assumes the data has already been cleansed.

mine assumes the data is still dirty and will need to be cleansed as part of the check, so you can't use exactly the same input data for both
sdstuber's post is the way to go.

I also missed the possibility of:
M1,T6,WR89,F3,S10

I took just TWO values based on the data you provided.  I wasn't thinking about the previous questions and additional possibilities.

Personally, instead of writing a bunch of 'magic' SQL for a bad design, I would change the design and normalize the design.  It would probably save you a LOT of headaches.
sdstuber, the data has been cleaned (at least the comparison side - T1, M2, etc.). I can use last week's lesson to clean the left side - MW6,BY ARRANGEMENT.

slightwv - thanks as well.  sorry, I forgot to include a wild (but valid) example, like M1,T6,WR89,F3,S10. There are several "triple" combinations as well.

>>instead of writing a bunch of 'magic' SQL for a bad design, I would change the design and normalize the design.
- I don't think I can change anything. For the previous question where I asked about regexp_substr, I was taking the input of M1,T6,WR89,F3,S10 and producing
1 on Mondays
6 on Tuesdays
89 on Weds and Thurs
etc.

So what this query is doing is validating the output, there are like 10,000 results of courses, dates, etc.

If I can get this post working, it's an independent - and automated - validation of the program that produced the periods for the given days.

I probably won't have time to fiddle with it tonight, but I'll be at it first thing tomorrow.

Even if I can handle 80% of the combinations, that's going to make validation a lot easier in that I'd only have to review 20%.
>>- I don't think I can change anything.

Think about this:
Just because the inputs and outputs are 'fixed' doesn't mean there isn't a better way between those boundaries.

There are a few ways to 'mask' the underlying architecture.

Off the top of my head I don't have the sql to turn a table like:
id,day,period
1,M,1
1,T,1
1,R,3
2,M,1
2,T,1
2,W,1

into:
MT1,R3
MTW1

but it should be possible.  You can parse the inputs into the normalized table.

This should make things much easier in the reporting/validation/etc... fronts.
Ok, I'll mull that over too, thanks for following up.
if your data is already clean then it's even easier, same idea, just remove the cleaningsubquery from my first post


  SELECT a,
         b,
         MAX(CASE
                 WHEN REGEXP_REPLACE(REGEXP_SUBSTR(a,
                                                   '[^,]+',
                                                   1,
                                                   COLUMN_VALUE),
                                     '[^' || b || ']') = b
                 THEN
                     'good'
                 ELSE
                     'bad'
             END)
             c
    FROM yourtable x,
         TABLE(    SELECT COLLECT(LEVEL)
                     FROM DUAL
               CONNECT BY LEVEL <= LENGTH(x.a) - LENGTH(REPLACE(x.a, ',')) + 1)
GROUP BY a, b
Ok, thanks ! I'll be on this tomorrow :-)
Holy cow, it's working !!!! great team effort :-)

Slightwv, I used your table creation and popped in the data

sdstuber, I loaded slightwv's data to [my_table], and ran your query as is - bingo ! it worked with test data.

I've loaded my (complete set of) test data to [my_table], and am checking it now - so far so good ! There are some bad combinations, but that looks accurate as far as the data is concerned - so far the query is looking good . . . have to spot check 1,110 results, but that won't take too awful long.
>>great team effort :-)

Thanks for partial inclusion here but I was WAY off on this one.  sdstuber's method is the way to go here!
Now if only I could understand what it's doing internally . . . LOL . . .
550 checked, 550 accurate, including correctly identifying some bad combinations . . . 550 to go.

I will also check some of the wild ones, but all the triple ones have come back accurate, though all are good. So I'll pop in something bad to confirm.

But so far these all look good
MW56,R123,F34	M5	good
MW56,R123,F34	M6	good
MW56,R123,F34	W5	good
MW56,R123,F34	W6	good
MW56,R123,F34	R1	good
MW56,R123,F34	R2	good
MW56,R123,F34	R3	good
MW56,R123,F34	F3	good
MW56,R123,F34	F4	good
.
MT567	W5	bad
MT567	W6	bad
MT567	W7	bad
.
MF1,MWF3	M1	good
MF1,MWF3	M3	good
MF1,MWF3	M6	bad
MF1,MWF3	T6	bad
MF1,MWF3	W3	good
MF1,MWF3	W6	bad
MF1,MWF3	R6	bad
MF1,MWF3	F1	good
MF1,MWF3	F3	good
MF1,MWF3	F6	bad

Open in new window

SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
sdstuber, thanks ! I will definitely study this, and thanks for breaking it down.
ASKER CERTIFIED SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Ok, thanks for the followup !

I will finish testing the current logic and then fold in the "10" logic.

Bill
Had some time to elaborate on what I posted back in http:#a40369773

Normalize the data, mask the normalization with a view.

The only thing it doesn't have is parsing the inputs to insert into the base tables.

Read up on INSTEAD OF triggers:
http://docs.oracle.com/cd/E11882_01/appdev.112/e10766/tdddg_triggers.htm#TDDDG52800

This is pretty quick and dirty but shows the abilities (I think).

drop table event_days purge;
create table event_days(event_id number, series_id number, day char(1));

drop table event_periods purge;
create table event_periods(event_id number, series_id number, period number);

--MWF23,R34,T10,F5
insert into event_days values(1,1,'M');
insert into event_days values(1,1,'W');
insert into event_days values(1,1,'F');
insert into event_days values(1,2,'R');
insert into event_days values(1,3,'T');
insert into event_days values(1,4,'F');

insert into event_periods values(1,1,2);
insert into event_periods values(1,1,3);
insert into event_periods values(1,2,3);
insert into event_periods values(1,2,4);
insert into event_periods values(1,3,10);
insert into event_periods values(1,4,5);


--MTWR1256
insert into event_days values(2,1,'M');
insert into event_days values(2,1,'T');
insert into event_days values(2,1,'W');
insert into event_days values(2,1,'R');

insert into event_periods values(2,1,1);
insert into event_periods values(2,1,2);
insert into event_periods values(2,1,5);
insert into event_periods values(2,1,6);
commit;

create or replace view event_vw as
select ed.event_id, listagg(days_for_event || periods_for_event,',') within group(order by case substr(days_for_event,1,1) when 'M' then 1 when 'T' then 2 when 'W' then 3 when 'R' then 4 when 'F' then 5 when 'S' then 6 end) event_list
from
	(
	select event_id, series_id, 
		listagg(day) within group (order by case day when 'M' then 1 when 'T' then 2 when 'W' then 3 when 'R' then 4 when 'F' then 5 when 'S' then 6 end) days_for_event
	from event_days ed
	group by ed.event_id,ed.series_id
	) ed
join
	(
	select event_id, series_id, 
		listagg(period) within group (order by period) periods_for_event
	from event_periods ep
	group by ep.event_id,ep.series_id
	) ep
on ed.event_id=ep.event_id and ed.series_id=ep.series_id
group by ed.event_id
/

select * from event_vw;

Open in new window

It also doesn't have the primary keys but that is ALL columns on both tables.

This will keep 'accidents' from happening.

It would also make the answer to this question a LOT easier.

With the primary keys, this record would never enter the tables:
MMTW224
I think it's good . . .

Results prior to latest change:
M1,WF1	        M1	good
M1,WF1	        M10	 bad
M1,WF1	        W1	good
M1,WF1	        F10	bad

M10,WF10	M1	good   <-- this should not be good !
M10,WF10	M10	good
M10,WF10	W1	good  <-- this should not be good !
M10,WF10	W10	good

Open in new window


With the fix, it looks good . . .
M1,WF1	            F10	bad
M1,WF1	            M1	good
M1,WF1	            M10	bad
M1,WF1	           W1	good

M10,WF10	   M1	bad     <-- correct !
M10,WF10	   M10	good
M10,WF10	   W1	bad       <-- correct !
M10,WF10	  W10	good

Open in new window


sdstuber, with your latest update, it didn't run out of the box . . . it squawked on column "c". So I commented out "c" and it worked . . .
I included the manually derived "good" and"bad" into my local copy of your sample data.

That's how I checked for correctness.

I would look for anything where "c" != myresult

You real table, obviously wouldn't have that, but it's not really needed, it was only for debugging.
that's what I thought. It's working now, so I'm moving to get it to work against my real table.

Thx.
I am working to implement this, but it probably won't happen until tomorrow.

Looks like this hit the mark, though, so thanks again :-)
I'm going to close this, as I tested it and it works great. I still have to implement everything, this might spill into next week.

But this was really fantastic, I haven't presented to the client yet, I'll let you know when I do :-)
super-fantastic !
slightwv, I'm sorry, I missed a couple of your posts:
Normalize the data, mask the normalization with a view.
- good link on INSTEAD OF triggers, thx.

- in your query, it's squawking, saying that "from keyword is not found where expected", and then the cursor rests on "group" (even in the inside queries).

>>With the primary keys, this record would never enter the tables: MMTW224
- yes, but in our system that COULD be recorded ! since user enters the data, they could fat-finger it . . . :-(
Let me guess, database isn't 11g or above?

If 10g, I'll have to tweak it a little and am willing to do so if you think normalization is an option.  If you don't think this is a possibility, I would rather not take the time.

I'm thinking not since you stated:
but in our system that COULD be recorded

I wonder if the app is trapping errors in such a way that the database not accepting the data, would nicely tell the end user they are entering something 'bad'?
yes, d.b. is not 11g+, but we are moving there at the moment, but not next week . . .

Normalization is not an option, but I am interested in learning the tricks in your query . . . all of which I have never seen . . .

>>I wonder if the app is trapping errors in such a way that the database not accepting the data, would nicely tell the end user they are entering something 'bad'?
- definitely a good idea ! I have already commented on that in the course of my analysis and testing.
>>but I am interested in learning the tricks in your query

Feel free to self-study and ask anything that you don't understand about what I posted.

LISTAGG is a string aggregator and is new to 11g.

It concatenates values together with whatever character you give it.  In the example I posted above, I didn't provide a character so it defaulted to null.

It makes creating a CSV much easier than old methods.

>>Normalization is not an option,

Then some more experimenting for you to do.

Prior to 11g, there were several tricks:
http://www.oracle-base.com/articles/misc/string-aggregation-techniques.php

In that link above, forget WM_CONCAT is even mentioned...  Seriously, purge it from your mind!!!  Don't even think about EVER using it!!!

There is also an XML trick that the other Expert here, sdstuber, enlightened me to many years ago.  I actually prefer it over the User Defined Function (UDF) although the UDF is supposed to be faster.

https://www.experts-exchange.com/questions/24914739/In-Oracle-how-can-I-concatenate-multiple-results-returned-from-select-statement-into-a-comma-separated-string.html?anchorAnswerId=25864822#a25864822
if 10g then use the COLLECT  builtin along with a function to write the collection into a string.  Much simpler and much more efficient.

But, the XML method has the advantage that you don't need to create any other objects.  Which, if you don't have permissions might make it the only option.
excellent, thanks to both of you for the follow-up. I am fighting to get back to this but getting caught in other (related) priorities. But I will be back for sure.
I did get it working with my own data and everything works ! I have to sort through some details, but I'm good to go with this post . . . thanks again :-)
Gentlemen,

complete success :-) I implemented, ran my test, and no "bad" records. The logic I tested included the prev post where I got regexp_substr tips, plus this one.

I'm expecting some kudo's next week when I show the boss . . .

and a good end to the week !