suvmitra
asked on
oracle sql splitting a comma-separated string but ignoring commas in quotes
Hi,
Kindly help on the following problem. I am using Oracle SQL to split comma separated string values. However, need to ignore the values with comma.
Below is the sample data, which needs to be separated/split (comma seperated)
sample output would be,
problem is, the value 'franchise': 'Px Oncology (Solid Tumors, all XA Onc projects)' is also having a comma within its value, need to ignore this comma while splitting
I am using the below code,
The output I am getting is ['franchise': 'Px Oncology (Solid Tumors]
It should provide, ['franchise': 'Px Oncology (Solid Tumors, all XA Onc projects)']
Kindly help on the following problem. I am using Oracle SQL to split comma separated string values. However, need to ignore the values with comma.
Below is the sample data, which needs to be separated/split (comma seperated)
'franchise': 'Px Oncology (Solid Tumors, all XA Onc projects)', 'zlinical_xombination_study': 'No'
sample output would be,
franchise: Px Oncology (Solid Tumors, all XA Onc projects)
zlinical_xombination_study: No
problem is, the value 'franchise': 'Px Oncology (Solid Tumors, all XA Onc projects)' is also having a comma within its value, need to ignore this comma while splitting
I am using the below code,
select LTRIM (REGEXP_SUBSTR (ERROR_TEXT, '[^,]+', 1, 1), ',') AS part_1 FROM ERROR_LOG_TEMP_5;
The output I am getting is ['franchise': 'Px Oncology (Solid Tumors]
It should provide, ['franchise': 'Px Oncology (Solid Tumors, all XA Onc projects)']
hmm, I'm not aware of a bulletproof version for plain PL/SQL, cause CSV is pretty complex with it simple rules - almost like Go.
So I would:
1) Convince the provider of that data to use XML or JSON.
2) see 1).
3) really see 1).
4) Implement the following logic, it should cover the simpler use-cases:
a) Split at the comma.
b) Count the single quotes per item.
c) If an item has an odd number of single quotes, then merge it with the next item and restore the content comma.
d) When a merge occurred goto b).
So I would:
1) Convince the provider of that data to use XML or JSON.
2) see 1).
3) really see 1).
4) Implement the following logic, it should cover the simpler use-cases:
a) Split at the comma.
b) Count the single quotes per item.
c) If an item has an odd number of single quotes, then merge it with the next item and restore the content comma.
d) When a merge occurred goto b).
ASKER
Hi slightwv (䄆 Netminder) ,
Your solution is not providing me the expected result which is 'franchise': 'Px Oncology (Solid Tumors, all XA Onc projects)' instead it is just returning the actual data set.
Your solution is not providing me the expected result which is 'franchise': 'Px Oncology (Solid Tumors, all XA Onc projects)' instead it is just returning the actual data set.
>>Your solution is not providing me the expected result which is
It seems to provide that for me.
I pass in the full string and from the fiddle below I get:
Fiddle here:
https://dbfiddle.uk/?rdbms=oracle_18&fiddle=77ee8e88e9af0e56613d4a210b90de90
It seems to provide that for me.
I pass in the full string and from the fiddle below I get:
Fiddle here:
https://dbfiddle.uk/?rdbms=oracle_18&fiddle=77ee8e88e9af0e56613d4a210b90de90
ASKER
with error_log_temp_5 as (
select q'['franchise': 'Px Oncology (Solid Tumors, all XA Onc projects)', 'zlinical_xombination_study': 'No', 'clinical_comparator_study': 'No]'
error_text from dual
)
select
rtrim(regexp_substr(error_text,'''.*'','),',')
from error_log_temp_5
/
Thanks, I understand the problem here.
Please check the code above, I added one more value in the string and now it is returning the first two values ignoring comma. So, what I need is only the franchise value not others.
ASKER
Just one more pattern of the string, may be useful for coding.
The franchise value will be always at the beginning of the string. It may be upto 65 chars. So, incase the string is too long say 4,000 chars - we may need to look at first 100 chars at max to get the franchise value. All other values are separated by comma.
But the franchise value may also contain a comma. need to ignore the comma and get the full value from franchise.
The franchise value will be always at the beginning of the string. It may be upto 65 chars. So, incase the string is too long say 4,000 chars - we may need to look at first 100 chars at max to get the franchise value. All other values are separated by comma.
But the franchise value may also contain a comma. need to ignore the comma and get the full value from franchise.
Try this one.
I added what I'm thinking may be the next step and split ALL of them into individual rows.
The key is to replace ', with a character that will NEVER be in the field. I guessed chr(254). Use whatever you want.
I added what I'm thinking may be the next step and split ALL of them into individual rows.
The key is to replace ', with a character that will NEVER be in the field. I guessed chr(254). Use whatever you want.
with error_log_temp_5 as (
select replace(
q'['franchise': 'Px Oncology (Solid Tumors, all XA Onc projects)', 'zlinical_xombination_study': 'No', 'clinical_comparator_study': 'No]'
,''',',chr(254)
) error_text
from dual
)
select
trim(regexp_substr(error_text,'[^' || chr(254) || ']+', 1, column_value)) || ''''
from error_log_temp_5,
table(
select collect(level) from dual
connect by level <= length(error_text) - length(replace(error_text, chr(254))) + 1
)
/
ASKER
some issue with the code. Brackets related.
Error:
Error:
ORA-12726: unmatched bracket in regular expression
12726. 00000 - "unmatched bracket in regular expression"
*Cause: The regular expression did not have balanced brackets.
*Action: Ensure the brackets are correctly balanced.
Cool "bug". I'm running 12.2 and what I posted runs fine on my database. On dbfiddleuk 11g and 18c and livesql, it fails.
Change chr(254) for chr(1) in all three places and it works in dbfiddleuk and livesql.
Change chr(254) for chr(1) in all three places and it works in dbfiddleuk and livesql.
ASKER
Hi slightwv (䄆 Netminder) ,
Yes, this is working. Thank you.
I just need to know further, instead of the hardcoded value, what would be the updated code if we use the field/column directly from table.
Source field here is
Yes, this is working. Thank you.
I just need to know further, instead of the hardcoded value, what would be the updated code if we use the field/column directly from table.
Source field here is
error_text
So, I guess the below code needs further update.with error_log_temp_5 as (
select replace(
q'['franchise': 'Px Oncology (Solid Tumors, all XA Onc projects)', 'zlinical_xombination_study': 'No', 'clinical_comparator_study': 'No]'
,''',',chr(254)
) error_text
from dual
)
>>what would be the updated code if we use the field/column directly from table.
Remove the CTE (WITH).
I used the CTE to have the same table_name, column name and data as your original SQL.
Remove the CTE (WITH).
select
trim(regexp_substr(error_text,'[^' || chr(1) || ']+', 1, column_value)) || ''''
from error_log_temp_5,
table(
select collect(level) from dual
connect by level <= length(error_text) - length(replace(error_text, chr(1))) + 1
)
/
I used the CTE to have the same table_name, column name and data as your original SQL.
ASKER
Hi slightwv (䄆 Netminder) ,
Thanks again. I think it is very close but failing at my end. I am attaching data after masking. Kindly just use this data to get the desired result. I am good otherwise.
Book1.xlsx
Thanks again. I think it is very close but failing at my end. I am attaching data after masking. Kindly just use this data to get the desired result. I am good otherwise.
Book1.xlsx
I asked a question you never answered back in my first post:
Will the string ALWAYS have a ', as the delimiter?
The answer is now: No.
What is your Oracle version?
What is the final goal here?
What you appear to have is semi-JSON. If you wrap the text in '{}', it is JSON.
Depending on your final goal, it might be best to forget hacking together parsing strings, and parse using JSON functions.
Will the string ALWAYS have a ', as the delimiter?
The answer is now: No.
What is your Oracle version?
What is the final goal here?
What you appear to have is semi-JSON. If you wrap the text in '{}', it is JSON.
Depending on your final goal, it might be best to forget hacking together parsing strings, and parse using JSON functions.
ASKER
Well, the source is an XML file. But, then I have imported the XML as CSV into Oracle DB. Then after trimming this is the final output from Error_Text field which I posted in Book1.xlsx
So now, it is just a string coming from oracle table. As I mentioned earlier, what if we just consider first 100 characters of the field value and then perform the extraction for franchise value?
Having said that, I just want to acknowledge, I am really happy with your approach so far. And if you want me to close the thread, I will.
So now, it is just a string coming from oracle table. As I mentioned earlier, what if we just consider first 100 characters of the field value and then perform the extraction for franchise value?
Having said that, I just want to acknowledge, I am really happy with your approach so far. And if you want me to close the thread, I will.
If you have access to the original XML, why are we working with "parsed" data? Why not just work directly with the RAW XML to extract what you want? It will be MUCH easier than trying to hack around an XML to CSV hack.
No need to close this out until we eventually get you what you want.
No need to close this out until we eventually get you what you want.
ASKER
This is the requirement, making it more readable for the user. In XML file also it is same way coming for the field. The original application is written in python which creates this error log in XML format.
Splitting out the CSV you have as you say you want it really isn't that mush easier to read.
I'm betting you then want the final results in the key/pair format as rows and columns:
Do you have access to the Python script or the original log files? I'm betting it has to be JSON in the original format. Not sure why the CSV string is added to XML in that way, but I would look at changing that!
I'm betting you then want the final results in the key/pair format as rows and columns:
key value
------------------ -----------------------------------------------
franchise Px Oncology (Solid Tumors, all MA Onc projects)
...
DBP-complexity (null)
...
approved_countries 'JPN'
...
Do you have access to the Python script or the original log files? I'm betting it has to be JSON in the original format. Not sure why the CSV string is added to XML in that way, but I would look at changing that!
ASKER
Well, you may be 100% correct.
But, right now I am just concerned about the franchise value not anything else.
I am not concerned about changing the source code, could be too much of work there.
Can we just have a try with first 100 chars of the string provided in excel and then try extracting the franchise value ignoring comma within? It will be really nice.
But, right now I am just concerned about the franchise value not anything else.
I am not concerned about changing the source code, could be too much of work there.
Can we just have a try with first 100 chars of the string provided in excel and then try extracting the franchise value ignoring comma within? It will be really nice.
If all you want is the first value, the SQL I posted will do that.
Where it broke was when it got into number JSON fields. Just remove the TABLE piece and grab the first one:
Continuing this path will only cause you problems when you need to extract one of the number or array columns... It will ONLY wok on string values.
Where it broke was when it got into number JSON fields. Just remove the TABLE piece and grab the first one:
with error_log_temp_5 as (
select replace(
q'['franchise': 'Px Oncology (Solid Tumors, all XA Onc projects)', 'zlinical_xombination_study': 'No', 'clinical_comparator_study': 'No]'
,''',',chr(254)
) error_text
from dual
)
select
trim(regexp_substr(error_text,'[^' || chr(254) || ']+', 1, 1)) || ''''
from error_log_temp_5
/
Continuing this path will only cause you problems when you need to extract one of the number or array columns... It will ONLY wok on string values.
ASKER
Hi,
This is still not working.
This is still not working.
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
See if this works for you:
rtrim(regexp_substr(error_text,'''.*'','),',')