Link to home
Start Free TrialLog in
Avatar of suvmitra
suvmitraFlag for India

asked on

oracle sql splitting a comma-separated string but ignoring commas in quotes

Hi,

Kindly help on the following problem. I am using Oracle SQL to split comma separated string values. However, need to ignore the values with comma.

Below is the sample data, which needs to be separated/split (comma seperated)
'franchise': 'Px Oncology (Solid Tumors, all XA Onc projects)', 'zlinical_xombination_study': 'No'

Open in new window


sample output would be,
franchise: Px Oncology (Solid Tumors, all XA Onc projects)
zlinical_xombination_study: No

Open in new window



problem is, the value 'franchise': 'Px Oncology (Solid Tumors, all XA Onc projects)' is also having a comma within its value, need to ignore this comma while splitting

I am using the below code,
select LTRIM (REGEXP_SUBSTR (ERROR_TEXT, '[^,]+', 1, 1), ',')    AS part_1 FROM ERROR_LOG_TEMP_5;

Open in new window



The output I am getting is ['franchise': 'Px Oncology (Solid Tumors]

It should provide,  ['franchise': 'Px Oncology (Solid Tumors, all XA Onc projects)']
Avatar of slightwv (䄆 Netminder)
slightwv (䄆 Netminder)

Will the string ALWAYS have a ', as the delimiter?

See if this works for you:
rtrim(regexp_substr(error_text,'''.*'','),',')


hmm, I'm not aware of a bulletproof version for plain PL/SQL, cause CSV is pretty complex with it simple rules - almost like Go.

So I would:

1) Convince the provider of that data to use XML or JSON.
2) see 1).
3) really see 1).
4) Implement the following logic, it should cover the simpler use-cases:

    a) Split at the comma.
    b) Count the single quotes per item.
    c) If an item has an odd number of single quotes, then merge it with the next item and restore the content comma.
    d) When a merge occurred goto b).
Avatar of suvmitra

ASKER

Hi slightwv (䄆 Netminder) ,
Your solution is not providing me the expected result which is 'franchise': 'Px Oncology (Solid Tumors, all XA Onc projects)'  instead it is just returning the actual data set.
>>Your solution is not providing me the expected result which is

It seems to provide that for me.

I pass in the full string and from the fiddle below I get:
'franchise': 'Px Oncology (Solid Tumors, all XA Onc projects)'


Fiddle here:
https://dbfiddle.uk/?rdbms=oracle_18&fiddle=77ee8e88e9af0e56613d4a210b90de90 

with error_log_temp_5 as (
   select q'['franchise': 'Px Oncology (Solid Tumors, all XA Onc projects)', 'zlinical_xombination_study': 'No', 'clinical_comparator_study': 'No]'
      error_text from dual
)
select
rtrim(regexp_substr(error_text,'''.*'','),',')
from error_log_temp_5
/

Open in new window


Thanks, I understand the problem here.

Please check the code above, I added one more value in the string and now it is returning the first two values ignoring comma. So, what I need is only the franchise value not others. 
Just one more pattern of the string, may be useful for coding.

The franchise value will be always at the beginning of the string. It may be upto 65 chars. So, incase the string is too long say 4,000 chars - we may need to look at first 100 chars at max to get the franchise value. All other values are separated by comma.

But the franchise value may also contain a comma. need to ignore the comma and get the full value from franchise.
Try this one.  

I added what I'm thinking may be the next step and split ALL of them into individual rows.

The key is to replace ', with a character that will NEVER be in the field.  I guessed chr(254).  Use whatever you want.

with error_log_temp_5 as (
   select replace(
         q'['franchise': 'Px Oncology (Solid Tumors, all XA Onc projects)', 'zlinical_xombination_study': 'No', 'clinical_comparator_study': 'No]'
         ,''',',chr(254)
      ) error_text
   from dual
)
select
trim(regexp_substr(error_text,'[^' || chr(254) || ']+', 1, column_value)) || ''''
from error_log_temp_5,
   table(
      select collect(level) from dual
      connect by level <= length(error_text) - length(replace(error_text, chr(254))) + 1
   )
/


Open in new window

some issue with the code. Brackets related.

Error:
ORA-12726: unmatched bracket in regular expression
12726. 00000 -  "unmatched bracket in regular expression"
*Cause:    The regular expression did not have balanced brackets.
*Action:   Ensure the brackets are correctly balanced.

Open in new window


Open in new window

Cool "bug".  I'm running 12.2 and what I posted runs fine on my database.  On dbfiddleuk 11g and 18c and livesql, it fails.

Change chr(254) for chr(1) in all three places and it works in dbfiddleuk and livesql.
Hi slightwv (䄆 Netminder) ,

Yes, this is working. Thank you.

I just need to know further, instead of the hardcoded value, what would be the updated code if we use the field/column directly from table.

Source field here is 
error_text

Open in new window

So, I guess the below code needs further update.

with error_log_temp_5 as (
   select replace(
         q'['franchise': 'Px Oncology (Solid Tumors, all XA Onc projects)', 'zlinical_xombination_study': 'No', 'clinical_comparator_study': 'No]'
         ,''',',chr(254)
      ) error_text
   from dual
)

Open in new window

>>what would be the updated code if we use the field/column directly from table.

Remove the CTE (WITH).

select
trim(regexp_substr(error_text,'[^' || chr(1) || ']+', 1, column_value)) || ''''
from error_log_temp_5,
   table(
      select collect(level) from dual
      connect by level <= length(error_text) - length(replace(error_text, chr(1))) + 1
   )
/

Open in new window


I used the CTE to have the same table_name, column name and data as your original SQL.
Hi slightwv (䄆 Netminder) ,

Thanks again. I think it is very close but failing at my end. I am attaching data after masking. Kindly just use this data to get the desired result. I am good otherwise.

Book1.xlsx
I asked a question you never answered back in my first post:
Will the string ALWAYS have a ', as the delimiter?

The answer is now:  No.

What is your Oracle version?

What is the final goal here?

What you appear to have is semi-JSON.  If you wrap the text in '{}', it is JSON.

Depending on your final goal, it might be best to forget hacking together parsing strings, and parse using JSON functions.
Well, the source is an XML file. But, then I have imported the XML as CSV into Oracle DB. Then after trimming this is the final output from Error_Text field which I posted in Book1.xlsx

So now, it is just a string coming from oracle table. As I mentioned earlier, what if we just consider first 100 characters of the field value and then perform the extraction for franchise value?

Having said that, I just want to acknowledge, I am really happy with your approach so far. And if you want me to close the thread, I will.
If you have access to the original XML, why are we working with "parsed" data?  Why not just work directly with the RAW XML to extract what you want?  It will be MUCH easier than trying to hack around an XML to CSV hack.

No need to close this out until we eventually get you what you want.
This is the requirement, making it more readable for the user. In XML file also it is same way coming for the field. The original application is written in python which creates this error log in XML format.
Splitting out the CSV you have as you say you want it really isn't that mush easier to read.

I'm betting you then want the final results in the key/pair format as rows and columns:
key                  value
------------------   -----------------------------------------------
franchise            Px Oncology (Solid Tumors, all MA Onc projects)
...
DBP-complexity       (null)
...
approved_countries   'JPN'
...

Open in new window


Do you have access to the Python script or the original log files?  I'm betting it has to be JSON in the original format.  Not sure why the CSV string is added to XML in that way, but I would look at changing that!
Well, you may be 100% correct.
But, right now I am just concerned about the franchise value not anything else.

I am not concerned about changing the source code, could be too much of work there.

Can we just have a try with first 100 chars of the string provided in excel and then try extracting the franchise value ignoring comma within? It will be really nice.
If all you want is the first value, the SQL I posted will do that.

Where it broke was when it got into number JSON fields.  Just remove the TABLE piece and grab the first one:
with error_log_temp_5 as (
   select replace(
         q'['franchise': 'Px Oncology (Solid Tumors, all XA Onc projects)', 'zlinical_xombination_study': 'No', 'clinical_comparator_study': 'No]'
         ,''',',chr(254)
      ) error_text
   from dual
)
select
trim(regexp_substr(error_text,'[^' || chr(254) || ']+', 1, 1)) || ''''
from error_log_temp_5
/

Open in new window


Continuing this path will only cause you problems when you need to extract one of the number or array columns...  It will ONLY wok on string values.
Hi,
This is still not working.

ASKER CERTIFIED SOLUTION
Avatar of slightwv (䄆 Netminder)
slightwv (䄆 Netminder)

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial