Link to home
Start Free TrialLog in
Avatar of Christoph Holzapfel
Christoph Holzapfel

asked on

How to skip bad characters in ORACLE xml Parsing (avoiding ORA LPX-00216)?

Hello Experts,

I have to parse 3rd-party-xml-Data on a daily basis to an oracle 11 database. I connect the file as an external table and INSERT the wanted data into the database with the following code (as shown and improved in this thread):

External table:
CREATE TABLE "EXTVLB" 
   (	"XML" CLOB
   ) 
   ORGANIZATION EXTERNAL 
    ( TYPE ORACLE_LOADER
      DEFAULT DIRECTORY "XML_IMP"
      ACCESS PARAMETERS
      ( FIELDS TERMINATED BY ','
(clob_pointer CHAR(200))
COLUMN TRANSFORMS (xml FROM LOBFILE (clob_pointer) FROM (xml_imp) CLOB)        )
      LOCATION
       ( 'pointer.txt'
       )
    )
  ;

Open in new window


Querry:
INSERT INTO VLB (
ISBN, 
EAN, 
TITEL, 
PRODUKTART, 
AKTION,
L_AEND,
PREIS, 
PREISTYP,
UPDDATE
)
SELECT DISTINCT x.ISBN, x.EAN, x.TITEL, x.PRODUKTART, x.AKTION, x.L_AEND, y.PREIS, y.PREISTYP, sysdate as UPDDATE
FROM extvlb ev, 
XMLTABLE('/products/product' passing ev.xml
      COLUMNS 
      ISBN VARCHAR2(20) PATH 'isbn',
      EAN VARCHAR2 (20) PATH 'ean',
      TITEL VARCHAR2(1000) PATH 'title',
      PRODUKTART VARCHAR2(10) PATH 'productform',
      AKTION VARCHAR2(10) PATH 'action',
      L_AEND VARCHAR2 (10) PATH 'last_price_mod',
      PRICE_OBJ XMLTYPE PATH 'price'
) x,
XMLTABLE('price' passing x.PRICE_OBJ
			COLUMNS
			MARKET VARCHAR(2) PATH 'market',
			PREIS VARCHAR2(10) PATH 'amount',
      PREISTYP VARCHAR2 (40) PATH 'pricetype'
) y
WHERE y.MARKET = 'DE'

Open in new window


Nearly every xml-file which I receive contains characters that the database rejects as "bad" or wrong (eg. "ß" or a long "-", sample-data below). In January when I had to get the whole process at work, I build a workaround. Since then I parse the data into a mysql-database and export it as csv which I import to the oracle-database. There I don't have any parsing-problems ...
Now I would like to skip this workaround and get it working in oracle. How do I have to modify the above Insert-Statement to skip the bad characters? Is it possible to regex the no-ascii-characters? I don't need the "TITLE"-column necessarily but the database stops parsing even without this column.
I would be very glad if someone could help me ...

BR
Christoph

Sample-Data
<?xml version="1.0" encoding="UTF-8"?>
<products>
<product>
    <action>INSERT</action>
    <isbn>9783959622929</isbn>
    <vlb_record_id>0000c4ecfb6f432499858c08637f9a81</vlb_record_id>
    <is_digital>FALSE</is_digital>
    <productform>BC</productform>
    <title>Rodinia – Die Rückkehr des Zauberers</title>
    <fixed_retailprice>
        <fixedprice_de>TRUE</fixedprice_de>
        <fixedprice_at>TRUE</fixedprice_at>
    </fixed_retailprice>
    <last_price_mod>20161223</last_price_mod>
    <price>
        <market>DE</market>
        <pricetype>10</pricetype>
        <no_of_pieces>1</no_of_pieces>
        <is_calculated>FALSE</is_calculated>
        <is_provisional>FALSE</is_provisional>
        <amount>14.95</amount>
        <currency>EUR</currency>
        <tax_component>
            <component_id_gtin>9783959622929</component_id_gtin>
            <component_productform>BC</component_productform>
            <component_title>Rodinia – Die Rückkehr des Zauberers</component_title>
            <percent>7.00</percent>
            <type>1</type>
            <share>14.95</share>
        </tax_component>
    </price>
    <price>
        <market>AT</market>
        <pricetype>10</pricetype>
        <no_of_pieces>1</no_of_pieces>
        <is_calculated>TRUE</is_calculated>
        <is_provisional>FALSE</is_provisional>
        <amount>15.40</amount>
        <currency>EUR</currency>
        <tax_component>
            <component_id_gtin>9783959622929</component_id_gtin>
            <component_productform>BC</component_productform>
            <component_title>Rodinia – Die Rückkehr des Zauberers</component_title>
            <percent>10.00</percent>
            <type>1</type>
            <share>15.40</share>
        </tax_component>
    </price>
    <price>
        <market>CH</market>
        <pricetype>20</pricetype>
        <no_of_pieces>1</no_of_pieces>
        <is_calculated>FALSE</is_calculated>
        <is_provisional>FALSE</is_provisional>
        <amount>17.00</amount>
        <currency>CHF</currency>
        <tax_component>
            <component_id_gtin>9783959622929</component_id_gtin>
            <component_productform>BC</component_productform>
            <component_title>Rodinia – Die Rückkehr des Zauberers</component_title>
            <percent>2.50</percent>
            <type>1</type>
            <share>17.00</share>
        </tax_component>
    </price>
</product>
<product>
    <action>INSERT</action>
    <isbn>9783834912565</isbn>
    <vlb_record_id>0001346b5c074b44a5ef3fc1471f7db8</vlb_record_id>
    <is_digital>FALSE</is_digital>
    <productform>BC</productform>
    <title>Projektorganisation und Finanzierung von Erneuerungsmaßnahmen im Wohnungsbau</title>
    <fixed_retailprice>
        <fixedprice_de>TRUE</fixedprice_de>
        <fixedprice_at>TRUE</fixedprice_at>
    </fixed_retailprice>
    <last_price_mod>20161214</last_price_mod>
    <price>
        <market>DE</market>
        <pricetype>10</pricetype>
        <no_of_pieces>1</no_of_pieces>
        <pricetext>Fixed Retail Price</pricetext>
        <is_calculated>FALSE</is_calculated>
        <is_provisional>FALSE</is_provisional>
        <amount>59.95</amount>
        <currency>EUR</currency>
        <tax_component>
            <component_id_gtin>9783834912565</component_id_gtin>
            <component_productform>BC</component_productform>
            <component_title>Projektorganisation und Finanzierung von Erneuerungsmaßnahmen im Wohnungsbau</component_title>
            <percent>7.00</percent>
            <type>1</type>
            <share>59.95</share>
        </tax_component>
    </price>
    <price>
        <market>CH</market>
        <pricetype>20</pricetype>
        <no_of_pieces>1</no_of_pieces>
        <pricetext>Recommended Retail Price</pricetext>
        <is_calculated>FALSE</is_calculated>
        <is_provisional>FALSE</is_provisional>
        <amount>75.00</amount>
        <currency>CHF</currency>
        <tax_component>
            <component_id_gtin>9783834912565</component_id_gtin>
            <component_productform>BC</component_productform>
            <component_title>Projektorganisation und Finanzierung von Erneuerungsmaßnahmen im Wohnungsbau</component_title>
            <percent>2.50</percent>
            <type>1</type>
            <share>75.00</share>
        </tax_component>
    </price>
    <price>
        <market>AT</market>
        <pricetype>10</pricetype>
        <no_of_pieces>1</no_of_pieces>
        <pricetext>Fixed Retail Price</pricetext>
        <is_calculated>FALSE</is_calculated>
        <is_provisional>FALSE</is_provisional>
        <amount>61.63</amount>
        <currency>EUR</currency>
        <tax_component>
            <component_id_gtin>9783834912565</component_id_gtin>
            <component_productform>BC</component_productform>
            <component_title>Projektorganisation und Finanzierung von Erneuerungsmaßnahmen im Wohnungsbau</component_title>
            <percent>10.00</percent>
            <type>1</type>
            <share>61.63</share>
        </tax_component>
    </price>
</product>

Open in new window

Avatar of Sean Stuber
Sean Stuber

The xml in your example parses fine for me.

Do you have an example that fails?

Perhaps you're running into a bug.  What db version are you using?  I tested on 12.2.0.1
Avatar of Christoph Holzapfel

ASKER

Hi Sean,

thank you. I am runnig 11.2. The two rows in the example contain lines which stopped the parsing. The "-" in line 9 and more and the "ß" in line 73 and more. For the first occurence oracle-db says a invalid character 128 (0x80). But if you can parse the data, it perhaps displays not the real problem ...

P.S.: I not only corrected ugly spelling ... I have to go home and will be back online later this evening (if you ask you why I don't react).
actually, the xml did not parse completely correctly,  it's missing an end tag for </products>
but once I added that it parsed ok

Let's try a simple test and make sure it's not a problem reading the data from the external table

Try running this...

create table testtab (testxml clob);

insert into testtab (testxml) values ('<?xml version="1.0" encoding="UTF-8"?>
<products>
<product>
    <action>INSERT</action>
    <isbn>9783959622929</isbn>
    <vlb_record_id>0000c4ecfb6f432499858c08637f9a81</vlb_record_id>
    <is_digital>FALSE</is_digital>
    <productform>BC</productform>
    <title>Rodinia – Die Rückkehr des Zauberers</title>
    <fixed_retailprice>
        <fixedprice_de>TRUE</fixedprice_de>
        <fixedprice_at>TRUE</fixedprice_at>
    </fixed_retailprice>
    <last_price_mod>20161223</last_price_mod>
    <price>
        <market>DE</market>
        <pricetype>10</pricetype>
        <no_of_pieces>1</no_of_pieces>
        <is_calculated>FALSE</is_calculated>
        <is_provisional>FALSE</is_provisional>
        <amount>14.95</amount>
        <currency>EUR</currency>
        <tax_component>
            <component_id_gtin>9783959622929</component_id_gtin>
            <component_productform>BC</component_productform>
            <component_title>Rodinia – Die Rückkehr des Zauberers</component_title>
            <percent>7.00</percent>
            <type>1</type>
            <share>14.95</share>
        </tax_component>
    </price>
    <price>
        <market>AT</market>
        <pricetype>10</pricetype>
        <no_of_pieces>1</no_of_pieces>
        <is_calculated>TRUE</is_calculated>
        <is_provisional>FALSE</is_provisional>
        <amount>15.40</amount>
        <currency>EUR</currency>
        <tax_component>
            <component_id_gtin>9783959622929</component_id_gtin>
            <component_productform>BC</component_productform>
            <component_title>Rodinia – Die Rückkehr des Zauberers</component_title>
            <percent>10.00</percent>
            <type>1</type>
            <share>15.40</share>
        </tax_component>
    </price>
    <price>
        <market>CH</market>
        <pricetype>20</pricetype>
        <no_of_pieces>1</no_of_pieces>
        <is_calculated>FALSE</is_calculated>
        <is_provisional>FALSE</is_provisional>
        <amount>17.00</amount>
        <currency>CHF</currency>
        <tax_component>
            <component_id_gtin>9783959622929</component_id_gtin>
            <component_productform>BC</component_productform>
            <component_title>Rodinia – Die Rückkehr des Zauberers</component_title>
            <percent>2.50</percent>
            <type>1</type>
            <share>17.00</share>
        </tax_component>
    </price>
</product>
');

update testtab set testxml = testxml ||
'<product>
    <action>INSERT</action>
    <isbn>9783834912565</isbn>
    <vlb_record_id>0001346b5c074b44a5ef3fc1471f7db8</vlb_record_id>
    <is_digital>FALSE</is_digital>
    <productform>BC</productform>
    <title>Projektorganisation und Finanzierung von Erneuerungsmaßnahmen im Wohnungsbau</title>
    <fixed_retailprice>
        <fixedprice_de>TRUE</fixedprice_de>
        <fixedprice_at>TRUE</fixedprice_at>
    </fixed_retailprice>
    <last_price_mod>20161214</last_price_mod>
    <price>
        <market>DE</market>
        <pricetype>10</pricetype>
        <no_of_pieces>1</no_of_pieces>
        <pricetext>Fixed Retail Price</pricetext>
        <is_calculated>FALSE</is_calculated>
        <is_provisional>FALSE</is_provisional>
        <amount>59.95</amount>
        <currency>EUR</currency>
        <tax_component>
            <component_id_gtin>9783834912565</component_id_gtin>
            <component_productform>BC</component_productform>
            <component_title>Projektorganisation und Finanzierung von Erneuerungsmaßnahmen im Wohnungsbau</component_title>
            <percent>7.00</percent>
            <type>1</type>
            <share>59.95</share>
        </tax_component>
    </price>
    <price>
        <market>CH</market>
        <pricetype>20</pricetype>
        <no_of_pieces>1</no_of_pieces>
        <pricetext>Recommended Retail Price</pricetext>
        <is_calculated>FALSE</is_calculated>
        <is_provisional>FALSE</is_provisional>
        <amount>75.00</amount>
        <currency>CHF</currency>
        <tax_component>
            <component_id_gtin>9783834912565</component_id_gtin>
            <component_productform>BC</component_productform>
            <component_title>Projektorganisation und Finanzierung von Erneuerungsmaßnahmen im Wohnungsbau</component_title>
            <percent>2.50</percent>
            <type>1</type>
            <share>75.00</share>
        </tax_component>
    </price>
    <price>
        <market>AT</market>
        <pricetype>10</pricetype>
        <no_of_pieces>1</no_of_pieces>
        <pricetext>Fixed Retail Price</pricetext>
        <is_calculated>FALSE</is_calculated>
        <is_provisional>FALSE</is_provisional>
        <amount>61.63</amount>
        <currency>EUR</currency>
        <tax_component>
            <component_id_gtin>9783834912565</component_id_gtin>
            <component_productform>BC</component_productform>
            <component_title>Projektorganisation und Finanzierung von Erneuerungsmaßnahmen im Wohnungsbau</component_title>
            <percent>10.00</percent>
            <type>1</type>
            <share>61.63</share>
        </tax_component>
    </price>
</product>
</products>';

select xmltype(testxml) from testtab;

SELECT DISTINCT x.ISBN, x.EAN, x.TITEL, x.PRODUKTART, x.AKTION, x.L_AEND, y.PREIS, y.PREISTYP, sysdate as UPDDATE
FROM testtab ev, 
XMLTABLE('/products/product' passing xmltype(ev.testxml)
      COLUMNS 
      ISBN VARCHAR2(20) PATH 'isbn',
      EAN VARCHAR2 (20) PATH 'ean',
      TITEL VARCHAR2(1000) PATH 'title',
      PRODUKTART VARCHAR2(10) PATH 'productform',
      AKTION VARCHAR2(10) PATH 'action',
      L_AEND VARCHAR2 (10) PATH 'last_price_mod',
      PRICE_OBJ XMLTYPE PATH 'price'
) x,
XMLTABLE('price' passing x.PRICE_OBJ
			COLUMNS
			MARKET VARCHAR(2) PATH 'market',
			PREIS VARCHAR2(10) PATH 'amount',
      PREISTYP VARCHAR2 (40) PATH 'pricetype'
) y
WHERE y.MARKET = 'DE';

Open in new window

I created an 11.2.0.1 db and tested my script above.  The xml parsing worked correctly.


SQL> SELECT DISTINCT x.ISBN, x.EAN, x.TITEL, x.PRODUKTART, x.AKTION, x.L_AEND, y.PREIS, y.PREIST
  2  FROM testtab ev,
  3  XMLTABLE('/products/product' passing xmltype(ev.testxml)
  4        COLUMNS
  5        ISBN VARCHAR2(20) PATH 'isbn',
  6        EAN VARCHAR2 (20) PATH 'ean',
  7        TITEL VARCHAR2(1000) PATH 'title',
  8        PRODUKTART VARCHAR2(10) PATH 'productform',
  9        AKTION VARCHAR2(10) PATH 'action',
 10        L_AEND VARCHAR2 (10) PATH 'last_price_mod',
 11        PRICE_OBJ XMLTYPE PATH 'price'
 12  ) x,
 13  XMLTABLE('price' passing x.PRICE_OBJ
 14                     COLUMNS
 15                     MARKET VARCHAR(2) PATH 'market',
 16                     PREIS VARCHAR2(10) PATH 'amount',
 17        PREISTYP VARCHAR2 (40) PATH 'pricetype'
 18  ) y
 19  WHERE y.MARKET = 'DE';

ISBN                 EAN
-------------------- --------------------
TITEL
------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------

PRODUKTART AKTION     L_AEND     PREIS      PREISTYP                                 UPDDATE
---------- ---------- ---------- ---------- ---------------------------------------- ---------
9783959622929
Rodinia - Die Rückkehr des Zauberers
BC         INSERT     20161223   14.95      10                                       23-MAY-17

9783834912565
Projektorganisation und Finanzierung von Erneuerungsmaßnahmen im Wohnungsbau
BC         INSERT     20161214   59.95      10                                       23-MAY-17

Open in new window

Hi Sean,

the INSERT/UPDATE-procedure worked. I also tested this with "fresh" data, which I hadn't converted from utf8 to ANSI and back to get rid of the bad characters (which did not work). Because you mentioned the directory I created a new one and now I could read the records form the external table. This irritates me, because the records I posted yesterday belong to a file with about 1 Million lines. After the first errormessage I split it up and took the portion before the first bad character ("Rodinia", one of the two records we worked with) and could read this part from the external table. The part with "Rodinia" doesn't work - until I inserted it via INSERT-statement this morning.
To be secure that the directory was the reason for the error, I now parse the file with todays records from the new directory, but this will take a while, because the file is about 80 MB.
Sean, I could'nt write for some time because of illness and business. It was'nt the directory. As we have seen I could import the data - and the next file again was rejected by the database. After some investigation I found out, that our it-provider runs a process every night that breaks the import process if it is running into the night ... And depending on the filesize the import runs for several hours.
Thanks for help ...
This question needs an answer!
Become an EE member today
7 DAY FREE TRIAL
Members can start a 7-Day Free trial then enjoy unlimited access to the platform.
View membership options
or
Learn why we charge membership fees
We get it - no one likes a content blocker. Take one extra minute and find out why we block content.