Link to home
Start Free TrialLog in
Avatar of toooki
toooki

asked on

Oracle string extraction via regular expression

I am parsing an Oracle (11gR2) table field (varchar2 datatype) to extract some values with specific pattern.

I use this:
Select TRIM(REGEXP_SUBSTR(upper(F1) ,'ik([^ :.><,_]+)',1,1,'i')) as F1_updated

These are the needed INPUT/OUTPUT:
IK234-565 / IK234-565
This is ik34 ff / ik34
These are ikval-3456 test / ikval-3456
These are IKva-345996_456 test/  IKva-345996
These are IK-345996:456 test/  IK-345996
These are IK-345996-456 test/ IK-345996

All above example works except the last one. It returns IK-345996-456 instead. But I need IK-345996. It needs to stop when it sees the second "-" char in the string. Is there any way to use the same function with some changes to achieve this?
Avatar of Sean Stuber
Sean Stuber

your expected results don't reflect the functionality of forcing UPPER case
Avatar of toooki

ASKER

Thank you.
Sorry, case is not that important but still wanted the upper case in the output. So the correct pairs are:

IK234-565 / IK234-565
This is ik34 ff / IK34
These are ikval-3456 test / IKVAL-3456
These are IKva-345996_456 test/  IKVA-345996
These are IK-345996:456 test/  IK-345996
These are IK-345996-456 test/ IK-345996
These are IK-345996-456AbcD test/ IK-345996
try this...

REGEXP_SUBSTR(UPPER(F1) ,'IK[^ :.><,_-]*-?([^ :.><,_-]+)')
Avatar of toooki

ASKER

Thanks a lot.
I tested and it work for:

select
TRIM(REGEXP_SUBSTR(UPPER('IKTEST23TEST-191-FRAMEWORK-RELATED') ,'IK[^ :.><,_-]*-?([^ :.><,_-]+)')) from dual;
INPUT/OUTPUT:
IKTEST23TEST-191-FRAMEWORK-RELATED / IKTEST23TEST-191

But does not work for:
select
TRIM(REGEXP_SUBSTR(UPPER('IKTEST23TEST-191FRAMEWORK-RELATED') ,'IK[^ :.><,_-]*-?([^ :.><,_-]+)')) from dual;

I actually need:
INPUT/OUTPUT:
IKTEST23TEST-191FRAMEWORK-RELATED / IKTEST23TEST-191
(It needs to stop right after getting a non-numeric value (here "F") after the first "-" character in the first string starting with "IK")....


See if this works.   If it doesn't please add records to the test and explain why it doesn't work.
drop table tab1 purge;
create table tab1 (col1 varchar(50));

insert into tab1 values('IK234-565');
insert into tab1 values('This is ik34 ff');
insert into tab1 values('These are ikval-3456 test');
insert into tab1 values('These are IKva-345996_456 test');
insert into tab1 values('These are IK-345996:456 test');
insert into tab1 values('These are IK-345996-456 test');
insert into tab1 values('IKTEST23TEST-191FRAMEWORK-RELATED');
commit;

select regexp_substr(upper(col1),
	'IK[-A-Z0-9]*[0-9]+'
)
from tab1;

Open in new window

I was informed I had a bug in my previous post.  See if this is better:


select regexp_substr(upper(col1),
      'IK[A-Z0-9_]*-?[A-Z0-9]*[0-9]+'
)
from tab1;


REGEXP_SUBSTR(UPPER(col1) ,'IK[^ :.><,_-]*-?[0-9]+')
oops, forgot the after dash part

REGEXP_SUBSTR(UPPER(col1) ,'IK[^ :.><,_-]*-?[^ :.><,_-]*[0-9]+')


if only letters and numbers are allowed then slightwv's   [A-Z0-9]  is simpler than [^ :.><,_-]
Yes, please clarify the actual requirements.

For example, what is the expected results for:  
'IKTEST23@TEST-191FRAMEWORK-RELATED'
Avatar of toooki

ASKER

Many thanks to you all.
This is INPUT/OUTPUT
IKTEST23@TEST-191FRAMEWORK-RELATED / IKTEST23@TEST-191

slightwv's query I tried and that fixed the issue for the string I was having problem with:
select regexp_substr(upper(col1),
      'IK[-A-Z0-9]*[0-9]+'
)

I am trying this on other strings in the table to see if the values I get are as needed..
I am also trying other queries you sent.
>>IKTEST23@TEST-191FRAMEWORK-RELATED / IKTEST23@TEST-191

Then sdstubers works and mine does not.

Mine does with a quick add of 'allowed' values:  'IK[-A-Z0-9@]*[0-9]+'

It all depends on which path you wish to go down:  include all 'allowed' or exclude all not-allowed.  That is basically the difference between the two methods.
Avatar of toooki

ASKER

Thank you all.

Yes that is right..

I tried:
TRIM(REGEXP_SUBSTR(UPPER(col1), 'IK[^ :.><,_-]*-?[^ :.><,_-]*[0-9]+'))
And that gives me the correct O/P for all inputs I had in the field in the table.

select regexp_substr(upper(col1),
      'IK[-A-Z0-9]*[0-9]+'
)

The above as giving a bit incorrect Output for Input strings like "IKTEST-12325-12345"
Was giving IKTEST-12325 but wanted "IKTEST-12325-12345"

For TRIM(REGEXP_SUBSTR(UPPER(col1), 'IK[^ :.><,_-]*-?[^ :.><,_-]*[0-9]+'))
I was trying to understand the meaning.
One question, which syntax in this query cause to exclude the pattern we put in it?
'IK[^ :.><,_-]*-?[^ :.><,_-]*[0-9]+'

breaking it into pieces...


IK                = IK

[^ :.><,_-]* =  0 or more characters that are NOT one of these ":.><,_-"

-?                = 0 or 1  "-"

[0-9]+         = 1 or more digits
you should ask the exclusion question as a new question
>>The above as giving a bit incorrect Output for Input strings like "IKTEST-12325-12345"
>>Was giving IKTEST-12325 but wanted "IKTEST-12325-12345"

Your requirements keep changing.  In the original post you said "These are IK-345996-456 test/ IK-345996"

Which is it?

If you want:  IKTEST-12325-12345

Just add the '-' to the second part of mine:        'IK[A-Z0-9_]*-?[A-Z0-9-]*[0-9]+'

Avatar of toooki

ASKER

Thank you.

Thank you sdstuber and slightwv for the explanations. I now understand.

My apology. My original post was correct:
Needed INPUT/OUTPUT:
"These are IK-345996-456 test/ IK-345996"

It was my mistake in the post 35497728.
I wanted to mean just the opposite:
/* The above as giving a bit incorrect Output for Input strings like "IKTEST-12325-12345"
Was getting "IKTEST-12325-12345" but wanted "IKTEST-12325". Very sorry. */

So far
TRIM(REGEXP_SUBSTR(UPPER(col1), 'IK[^ :.><,_-]*-[^ :.><,_-]*[0-9]+'))
this worked for me . The "-" is always there in input so I made that mandatory. But I need a small change and  for that I submitted a new question...
https://www.experts-exchange.com/questions/26987344/Oracle-string-extraction-via-regular-expression.html
Avatar of toooki

ASKER

Ok, thank you slightwv. Sorry I misunderstood. So I put the details here:

Needed this Input/Output:
This is the needed INPUT/OUTPOT:
These are IK-345996-456 test/ IK-345996
There are IK there areIK-345996-456 test/ IK-345996
sdcaIKsdIKtest-1234TTT / IKtest-1234

The query:
TRIM(REGEXP_SUBSTR(UPPER(col1), 'IK[^ :.><,_-]*-[^ :.><,_-]*[0-9]+'))
gives me correct O/P for all of the above except the last.
I needed to extract the string starting with IK before the first occurrence of "IK" before the "-" string and ending at the last occurrence of a number (0 to 9) after "-" character.

I tried:
TRIM(REGEXP_SUBSTR(UPPER(col1), 'IK[^ IK:.><,_-]*-[^ :.><,_-]*[0-9]+'))
that works for:
sdcaIKsdIKtest-1234TTT / IKtest-1234
But fails if the input is IKtesti-1234 or IKteskt-1234
Avatar of toooki

ASKER

In the last post I wanted to mean:
I needed to extract the string starting with IK with the first occurrence of "IK" before the "-" string and ending at the last occurrence of a number (0 to 9) after the "-" character.

if the requirements change they should be new questions.

I think this one has been answered hasn't it?
Please add to my test case.

Insert new rows to the table as needed and post expected results.  This way we can tweak the expression and ensure all previous ones still work.  Also post why or why the row doesn't work.

Check out below

Here is the output from that run:

IK234-565
IK34
IKVAL-3456
IKVA-345996
IK-345996
IK-345996
IKTEST23TEST-191
IKTEST23TEST-191
IKTEST23TEST-191
IKTEST23
IK-345996
IKTEST-1234
IKTESTI-1234
IKTESKT-1234

14 rows selected.

Open in new window

drop table tab1 purge;
create table tab1 (col1 varchar(50));

insert into tab1 values('IK234-565');
insert into tab1 values('This is ik34 ff');
insert into tab1 values('These are ikval-3456 test');
insert into tab1 values('These are IKva-345996_456 test');
insert into tab1 values('These are IK-345996:456 test');
insert into tab1 values('These are IK-345996-456 test');
insert into tab1 values('IKTEST23TEST-191FRAMEWORK-RELATED');
insert into tab1 values('IKTEST23TEST-191FRAMEWORK-RELATED');
insert into tab1 values('IKTEST23TEST-191FRAMEWORK-RELATED');
insert into tab1 values('IKTEST23@TEST-191FRAMEWORK-RELATED');
insert into tab1 values('These are IK-345996-456 test/ IK-345996');
insert into tab1 values('There are IK there areIK-345996-456 test/ IK-345996');
insert into tab1 values('sdcaIKsdIKtest-1234TTT / IKtest-1234');
insert into tab1 values('IKtesti-1234');
insert into tab1 values('IKteskt-1234');
commit;

select regexp_replace(upper(col1),
	'(.*)(IK[A-Z0-9_]*-?[A-Z0-9]*[0-9]+)(.*)','\2'
)
from tab1;

Open in new window

I disagree with sdstuber on asking a new question every time you change the requirements.  It could end up with 100 questions every time you find a new example that breaks.

What I do ask is to please make sure you provide us with as complete requirements as possible.

I feel that you are running what we provide, find a 'failure', the change the requirements.  Can you not go through all the examples and provide a complete list?
SOLUTION
Avatar of slightwv (䄆 Netminder)
slightwv (䄆 Netminder)

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
>>> with 100 questions every time you find a new example that breaks

that's fine

because I've already answered this question and the first exception and the followup.

It's time to close this one and if there are new questions (even if related) that's fine.
Avatar of toooki

ASKER

Thank you all.

select regexp_replace(upper(col1),
      '(.*)(IK[A-Z0-9_@]*-?[A-Z0-9]*[0-9]+)(.*)','\2'
)

The above looks to be working for most sample inputs I have.  And it does work on example:
INPUT/OUTPUT
sdcaIKsdIKtest-1234TTT / IKtest-1234


Previous solutions that you and sdstuber gave to me used REGEXP_SUBSTR but this is using REGEXP_REPLACE. Like in post 35498381 could you please write a line or two to explain this query, if possible?
Many thanks.

\2 is called a backreference

count parentheses from the left and the 2nd block is the substring that is referenced, in this case...   (IK[A-Z0-9_@]*-?[A-Z0-9]*[0-9]+)

the replace looks for the entire string as described above. and replaces it with the back referenced substring


the only significantly different from what was posted before is the bracketed lists

[A-Z0-9]  - this an INCLUSION list,  as opposed to [^ IK:.><,_-]  which is an EXCLUSION list  note the "^" as the first character.
Avatar of toooki

ASKER

Thanks a lot. That explanation really helps.

Sorry I still have a question.

All outputs are coming correctly.
Except..
If the INPUT does not match the pattern IK*number
Then it needs to return null...

Example (needed):
This is IK /
IK-1234 / IK-1234
IK1234 / IK1234
ASKER CERTIFIED SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of toooki

ASKER

Many thanks. It seems I am all ok with this solution.
Thanks!