Christoph Holzapfel
asked on
How to skip bad characters in ORACLE xml Parsing (avoiding ORA LPX-00216)?
Hello Experts,
I have to parse 3rd-party-xml-Data on a daily basis to an oracle 11 database. I connect the file as an external table and INSERT the wanted data into the database with the following code (as shown and improved in this thread):
External table:
Querry:
Nearly every xml-file which I receive contains characters that the database rejects as "bad" or wrong (eg. "ß" or a long "-", sample-data below). In January when I had to get the whole process at work, I build a workaround. Since then I parse the data into a mysql-database and export it as csv which I import to the oracle-database. There I don't have any parsing-problems ...
Now I would like to skip this workaround and get it working in oracle. How do I have to modify the above Insert-Statement to skip the bad characters? Is it possible to regex the no-ascii-characters? I don't need the "TITLE"-column necessarily but the database stops parsing even without this column.
I would be very glad if someone could help me ...
BR
Christoph
Sample-Data
I have to parse 3rd-party-xml-Data on a daily basis to an oracle 11 database. I connect the file as an external table and INSERT the wanted data into the database with the following code (as shown and improved in this thread):
External table:
CREATE TABLE "EXTVLB"
( "XML" CLOB
)
ORGANIZATION EXTERNAL
( TYPE ORACLE_LOADER
DEFAULT DIRECTORY "XML_IMP"
ACCESS PARAMETERS
( FIELDS TERMINATED BY ','
(clob_pointer CHAR(200))
COLUMN TRANSFORMS (xml FROM LOBFILE (clob_pointer) FROM (xml_imp) CLOB) )
LOCATION
( 'pointer.txt'
)
)
;
Querry:
INSERT INTO VLB (
ISBN,
EAN,
TITEL,
PRODUKTART,
AKTION,
L_AEND,
PREIS,
PREISTYP,
UPDDATE
)
SELECT DISTINCT x.ISBN, x.EAN, x.TITEL, x.PRODUKTART, x.AKTION, x.L_AEND, y.PREIS, y.PREISTYP, sysdate as UPDDATE
FROM extvlb ev,
XMLTABLE('/products/product' passing ev.xml
COLUMNS
ISBN VARCHAR2(20) PATH 'isbn',
EAN VARCHAR2 (20) PATH 'ean',
TITEL VARCHAR2(1000) PATH 'title',
PRODUKTART VARCHAR2(10) PATH 'productform',
AKTION VARCHAR2(10) PATH 'action',
L_AEND VARCHAR2 (10) PATH 'last_price_mod',
PRICE_OBJ XMLTYPE PATH 'price'
) x,
XMLTABLE('price' passing x.PRICE_OBJ
COLUMNS
MARKET VARCHAR(2) PATH 'market',
PREIS VARCHAR2(10) PATH 'amount',
PREISTYP VARCHAR2 (40) PATH 'pricetype'
) y
WHERE y.MARKET = 'DE'
Nearly every xml-file which I receive contains characters that the database rejects as "bad" or wrong (eg. "ß" or a long "-", sample-data below). In January when I had to get the whole process at work, I build a workaround. Since then I parse the data into a mysql-database and export it as csv which I import to the oracle-database. There I don't have any parsing-problems ...
Now I would like to skip this workaround and get it working in oracle. How do I have to modify the above Insert-Statement to skip the bad characters? Is it possible to regex the no-ascii-characters? I don't need the "TITLE"-column necessarily but the database stops parsing even without this column.
I would be very glad if someone could help me ...
BR
Christoph
Sample-Data
<?xml version="1.0" encoding="UTF-8"?>
<products>
<product>
<action>INSERT</action>
<isbn>9783959622929</isbn>
<vlb_record_id>0000c4ecfb6f432499858c08637f9a81</vlb_record_id>
<is_digital>FALSE</is_digital>
<productform>BC</productform>
<title>Rodinia – Die Rückkehr des Zauberers</title>
<fixed_retailprice>
<fixedprice_de>TRUE</fixedprice_de>
<fixedprice_at>TRUE</fixedprice_at>
</fixed_retailprice>
<last_price_mod>20161223</last_price_mod>
<price>
<market>DE</market>
<pricetype>10</pricetype>
<no_of_pieces>1</no_of_pieces>
<is_calculated>FALSE</is_calculated>
<is_provisional>FALSE</is_provisional>
<amount>14.95</amount>
<currency>EUR</currency>
<tax_component>
<component_id_gtin>9783959622929</component_id_gtin>
<component_productform>BC</component_productform>
<component_title>Rodinia – Die Rückkehr des Zauberers</component_title>
<percent>7.00</percent>
<type>1</type>
<share>14.95</share>
</tax_component>
</price>
<price>
<market>AT</market>
<pricetype>10</pricetype>
<no_of_pieces>1</no_of_pieces>
<is_calculated>TRUE</is_calculated>
<is_provisional>FALSE</is_provisional>
<amount>15.40</amount>
<currency>EUR</currency>
<tax_component>
<component_id_gtin>9783959622929</component_id_gtin>
<component_productform>BC</component_productform>
<component_title>Rodinia – Die Rückkehr des Zauberers</component_title>
<percent>10.00</percent>
<type>1</type>
<share>15.40</share>
</tax_component>
</price>
<price>
<market>CH</market>
<pricetype>20</pricetype>
<no_of_pieces>1</no_of_pieces>
<is_calculated>FALSE</is_calculated>
<is_provisional>FALSE</is_provisional>
<amount>17.00</amount>
<currency>CHF</currency>
<tax_component>
<component_id_gtin>9783959622929</component_id_gtin>
<component_productform>BC</component_productform>
<component_title>Rodinia – Die Rückkehr des Zauberers</component_title>
<percent>2.50</percent>
<type>1</type>
<share>17.00</share>
</tax_component>
</price>
</product>
<product>
<action>INSERT</action>
<isbn>9783834912565</isbn>
<vlb_record_id>0001346b5c074b44a5ef3fc1471f7db8</vlb_record_id>
<is_digital>FALSE</is_digital>
<productform>BC</productform>
<title>Projektorganisation und Finanzierung von Erneuerungsmaßnahmen im Wohnungsbau</title>
<fixed_retailprice>
<fixedprice_de>TRUE</fixedprice_de>
<fixedprice_at>TRUE</fixedprice_at>
</fixed_retailprice>
<last_price_mod>20161214</last_price_mod>
<price>
<market>DE</market>
<pricetype>10</pricetype>
<no_of_pieces>1</no_of_pieces>
<pricetext>Fixed Retail Price</pricetext>
<is_calculated>FALSE</is_calculated>
<is_provisional>FALSE</is_provisional>
<amount>59.95</amount>
<currency>EUR</currency>
<tax_component>
<component_id_gtin>9783834912565</component_id_gtin>
<component_productform>BC</component_productform>
<component_title>Projektorganisation und Finanzierung von Erneuerungsmaßnahmen im Wohnungsbau</component_title>
<percent>7.00</percent>
<type>1</type>
<share>59.95</share>
</tax_component>
</price>
<price>
<market>CH</market>
<pricetype>20</pricetype>
<no_of_pieces>1</no_of_pieces>
<pricetext>Recommended Retail Price</pricetext>
<is_calculated>FALSE</is_calculated>
<is_provisional>FALSE</is_provisional>
<amount>75.00</amount>
<currency>CHF</currency>
<tax_component>
<component_id_gtin>9783834912565</component_id_gtin>
<component_productform>BC</component_productform>
<component_title>Projektorganisation und Finanzierung von Erneuerungsmaßnahmen im Wohnungsbau</component_title>
<percent>2.50</percent>
<type>1</type>
<share>75.00</share>
</tax_component>
</price>
<price>
<market>AT</market>
<pricetype>10</pricetype>
<no_of_pieces>1</no_of_pieces>
<pricetext>Fixed Retail Price</pricetext>
<is_calculated>FALSE</is_calculated>
<is_provisional>FALSE</is_provisional>
<amount>61.63</amount>
<currency>EUR</currency>
<tax_component>
<component_id_gtin>9783834912565</component_id_gtin>
<component_productform>BC</component_productform>
<component_title>Projektorganisation und Finanzierung von Erneuerungsmaßnahmen im Wohnungsbau</component_title>
<percent>10.00</percent>
<type>1</type>
<share>61.63</share>
</tax_component>
</price>
</product>
ASKER
Hi Sean,
thank you. I am runnig 11.2. The two rows in the example contain lines which stopped the parsing. The "-" in line 9 and more and the "ß" in line 73 and more. For the first occurence oracle-db says a invalid character 128 (0x80). But if you can parse the data, it perhaps displays not the real problem ...
P.S.: I not only corrected ugly spelling ... I have to go home and will be back online later this evening (if you ask you why I don't react).
thank you. I am runnig 11.2. The two rows in the example contain lines which stopped the parsing. The "-" in line 9 and more and the "ß" in line 73 and more. For the first occurence oracle-db says a invalid character 128 (0x80). But if you can parse the data, it perhaps displays not the real problem ...
P.S.: I not only corrected ugly spelling ... I have to go home and will be back online later this evening (if you ask you why I don't react).
actually, the xml did not parse completely correctly, it's missing an end tag for </products>
but once I added that it parsed ok
Let's try a simple test and make sure it's not a problem reading the data from the external table
Try running this...
but once I added that it parsed ok
Let's try a simple test and make sure it's not a problem reading the data from the external table
Try running this...
create table testtab (testxml clob);
insert into testtab (testxml) values ('<?xml version="1.0" encoding="UTF-8"?>
<products>
<product>
<action>INSERT</action>
<isbn>9783959622929</isbn>
<vlb_record_id>0000c4ecfb6f432499858c08637f9a81</vlb_record_id>
<is_digital>FALSE</is_digital>
<productform>BC</productform>
<title>Rodinia – Die Rückkehr des Zauberers</title>
<fixed_retailprice>
<fixedprice_de>TRUE</fixedprice_de>
<fixedprice_at>TRUE</fixedprice_at>
</fixed_retailprice>
<last_price_mod>20161223</last_price_mod>
<price>
<market>DE</market>
<pricetype>10</pricetype>
<no_of_pieces>1</no_of_pieces>
<is_calculated>FALSE</is_calculated>
<is_provisional>FALSE</is_provisional>
<amount>14.95</amount>
<currency>EUR</currency>
<tax_component>
<component_id_gtin>9783959622929</component_id_gtin>
<component_productform>BC</component_productform>
<component_title>Rodinia – Die Rückkehr des Zauberers</component_title>
<percent>7.00</percent>
<type>1</type>
<share>14.95</share>
</tax_component>
</price>
<price>
<market>AT</market>
<pricetype>10</pricetype>
<no_of_pieces>1</no_of_pieces>
<is_calculated>TRUE</is_calculated>
<is_provisional>FALSE</is_provisional>
<amount>15.40</amount>
<currency>EUR</currency>
<tax_component>
<component_id_gtin>9783959622929</component_id_gtin>
<component_productform>BC</component_productform>
<component_title>Rodinia – Die Rückkehr des Zauberers</component_title>
<percent>10.00</percent>
<type>1</type>
<share>15.40</share>
</tax_component>
</price>
<price>
<market>CH</market>
<pricetype>20</pricetype>
<no_of_pieces>1</no_of_pieces>
<is_calculated>FALSE</is_calculated>
<is_provisional>FALSE</is_provisional>
<amount>17.00</amount>
<currency>CHF</currency>
<tax_component>
<component_id_gtin>9783959622929</component_id_gtin>
<component_productform>BC</component_productform>
<component_title>Rodinia – Die Rückkehr des Zauberers</component_title>
<percent>2.50</percent>
<type>1</type>
<share>17.00</share>
</tax_component>
</price>
</product>
');
update testtab set testxml = testxml ||
'<product>
<action>INSERT</action>
<isbn>9783834912565</isbn>
<vlb_record_id>0001346b5c074b44a5ef3fc1471f7db8</vlb_record_id>
<is_digital>FALSE</is_digital>
<productform>BC</productform>
<title>Projektorganisation und Finanzierung von Erneuerungsmaßnahmen im Wohnungsbau</title>
<fixed_retailprice>
<fixedprice_de>TRUE</fixedprice_de>
<fixedprice_at>TRUE</fixedprice_at>
</fixed_retailprice>
<last_price_mod>20161214</last_price_mod>
<price>
<market>DE</market>
<pricetype>10</pricetype>
<no_of_pieces>1</no_of_pieces>
<pricetext>Fixed Retail Price</pricetext>
<is_calculated>FALSE</is_calculated>
<is_provisional>FALSE</is_provisional>
<amount>59.95</amount>
<currency>EUR</currency>
<tax_component>
<component_id_gtin>9783834912565</component_id_gtin>
<component_productform>BC</component_productform>
<component_title>Projektorganisation und Finanzierung von Erneuerungsmaßnahmen im Wohnungsbau</component_title>
<percent>7.00</percent>
<type>1</type>
<share>59.95</share>
</tax_component>
</price>
<price>
<market>CH</market>
<pricetype>20</pricetype>
<no_of_pieces>1</no_of_pieces>
<pricetext>Recommended Retail Price</pricetext>
<is_calculated>FALSE</is_calculated>
<is_provisional>FALSE</is_provisional>
<amount>75.00</amount>
<currency>CHF</currency>
<tax_component>
<component_id_gtin>9783834912565</component_id_gtin>
<component_productform>BC</component_productform>
<component_title>Projektorganisation und Finanzierung von Erneuerungsmaßnahmen im Wohnungsbau</component_title>
<percent>2.50</percent>
<type>1</type>
<share>75.00</share>
</tax_component>
</price>
<price>
<market>AT</market>
<pricetype>10</pricetype>
<no_of_pieces>1</no_of_pieces>
<pricetext>Fixed Retail Price</pricetext>
<is_calculated>FALSE</is_calculated>
<is_provisional>FALSE</is_provisional>
<amount>61.63</amount>
<currency>EUR</currency>
<tax_component>
<component_id_gtin>9783834912565</component_id_gtin>
<component_productform>BC</component_productform>
<component_title>Projektorganisation und Finanzierung von Erneuerungsmaßnahmen im Wohnungsbau</component_title>
<percent>10.00</percent>
<type>1</type>
<share>61.63</share>
</tax_component>
</price>
</product>
</products>';
select xmltype(testxml) from testtab;
SELECT DISTINCT x.ISBN, x.EAN, x.TITEL, x.PRODUKTART, x.AKTION, x.L_AEND, y.PREIS, y.PREISTYP, sysdate as UPDDATE
FROM testtab ev,
XMLTABLE('/products/product' passing xmltype(ev.testxml)
COLUMNS
ISBN VARCHAR2(20) PATH 'isbn',
EAN VARCHAR2 (20) PATH 'ean',
TITEL VARCHAR2(1000) PATH 'title',
PRODUKTART VARCHAR2(10) PATH 'productform',
AKTION VARCHAR2(10) PATH 'action',
L_AEND VARCHAR2 (10) PATH 'last_price_mod',
PRICE_OBJ XMLTYPE PATH 'price'
) x,
XMLTABLE('price' passing x.PRICE_OBJ
COLUMNS
MARKET VARCHAR(2) PATH 'market',
PREIS VARCHAR2(10) PATH 'amount',
PREISTYP VARCHAR2 (40) PATH 'pricetype'
) y
WHERE y.MARKET = 'DE';
I created an 11.2.0.1 db and tested my script above. The xml parsing worked correctly.
SQL> SELECT DISTINCT x.ISBN, x.EAN, x.TITEL, x.PRODUKTART, x.AKTION, x.L_AEND, y.PREIS, y.PREIST
2 FROM testtab ev,
3 XMLTABLE('/products/product' passing xmltype(ev.testxml)
4 COLUMNS
5 ISBN VARCHAR2(20) PATH 'isbn',
6 EAN VARCHAR2 (20) PATH 'ean',
7 TITEL VARCHAR2(1000) PATH 'title',
8 PRODUKTART VARCHAR2(10) PATH 'productform',
9 AKTION VARCHAR2(10) PATH 'action',
10 L_AEND VARCHAR2 (10) PATH 'last_price_mod',
11 PRICE_OBJ XMLTYPE PATH 'price'
12 ) x,
13 XMLTABLE('price' passing x.PRICE_OBJ
14 COLUMNS
15 MARKET VARCHAR(2) PATH 'market',
16 PREIS VARCHAR2(10) PATH 'amount',
17 PREISTYP VARCHAR2 (40) PATH 'pricetype'
18 ) y
19 WHERE y.MARKET = 'DE';
ISBN EAN
-------------------- --------------------
TITEL
------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------
PRODUKTART AKTION L_AEND PREIS PREISTYP UPDDATE
---------- ---------- ---------- ---------- ---------------------------------------- ---------
9783959622929
Rodinia - Die Rückkehr des Zauberers
BC INSERT 20161223 14.95 10 23-MAY-17
9783834912565
Projektorganisation und Finanzierung von Erneuerungsmaßnahmen im Wohnungsbau
BC INSERT 20161214 59.95 10 23-MAY-17
ASKER
Hi Sean,
the INSERT/UPDATE-procedure worked. I also tested this with "fresh" data, which I hadn't converted from utf8 to ANSI and back to get rid of the bad characters (which did not work). Because you mentioned the directory I created a new one and now I could read the records form the external table. This irritates me, because the records I posted yesterday belong to a file with about 1 Million lines. After the first errormessage I split it up and took the portion before the first bad character ("Rodinia", one of the two records we worked with) and could read this part from the external table. The part with "Rodinia" doesn't work - until I inserted it via INSERT-statement this morning.
To be secure that the directory was the reason for the error, I now parse the file with todays records from the new directory, but this will take a while, because the file is about 80 MB.
the INSERT/UPDATE-procedure worked. I also tested this with "fresh" data, which I hadn't converted from utf8 to ANSI and back to get rid of the bad characters (which did not work). Because you mentioned the directory I created a new one and now I could read the records form the external table. This irritates me, because the records I posted yesterday belong to a file with about 1 Million lines. After the first errormessage I split it up and took the portion before the first bad character ("Rodinia", one of the two records we worked with) and could read this part from the external table. The part with "Rodinia" doesn't work - until I inserted it via INSERT-statement this morning.
To be secure that the directory was the reason for the error, I now parse the file with todays records from the new directory, but this will take a while, because the file is about 80 MB.
ASKER
Sean, I could'nt write for some time because of illness and business. It was'nt the directory. As we have seen I could import the data - and the next file again was rejected by the database. After some investigation I found out, that our it-provider runs a process every night that breaks the import process if it is running into the night ... And depending on the filesize the import runs for several hours.
Thanks for help ...
Thanks for help ...
This question needs an answer!
Become an EE member today
7 DAY FREE TRIALMembers can start a 7-Day Free trial then enjoy unlimited access to the platform.
View membership options
or
Learn why we charge membership fees
We get it - no one likes a content blocker. Take one extra minute and find out why we block content.
Do you have an example that fails?
Perhaps you're running into a bug. What db version are you using? I tested on 12.2.0.1