Link to home
Start Free TrialLog in
Avatar of aman0711
aman0711

asked on

XML to Oracle

Hi experts,
                   You must have seen a similar type question posted by my earlier and I am still stuck at the same stuff.

                    I have an XML file (around 160 mb), some of the tags are quite simple and straight forward where I can simple parse the XML file and get that tags data in a table.

                    The problem I am facing is with the below shown piece of XML data, where tags are nested. how do I extract this data along with maintaining the relationship? I have also attached the grid view of tag.

                    please put in your valuable suggestions.
<TXN_MEASUREMENT agent="42483" slot="560735" datetime="2009-MAR-11 00:01:46" target="911165" agent_inst="44624" profile="0">
	<TXN_SUMMARY delta_msec="34455" element_count="465" trans_level_comp_msec="0" estimated_cache_delta_msec="32213" content_errors="1" delta_user_msec="35946" resp_bytes="2767717"/>
		<TXN_PAGE page_seq="1">
			<TXN_PAGE_PERFORMANCE delta_msec="883" connect_delta="2" first_byte_msec="32" delta_user_msec="949" remain_packets_delta="20" first_packet_delta="107" system_delta="0" request_delta="0" dns_lookup_msec="1" start_msec="0" estimated_cache_delta_msec="442"/>
			<TXN_PAGE_OBJECT element_count="54" page_bytes="284417"/>
			<TXN_PAGE_STATUS content_errors="0"/>
			<TXN_PAGE_DETAILS page="1">
				<TXN_BASE_PAGE record_seq="1">
					<TXN_DETAIL_PERFORMANCE first_packet_delta="107" system_delta="0" connect_delta="2" request_delta="0" dns_delta="1" element_delta="130" remain_packets_delta="20"/>
					<TXN_DETAIL_OBJECT msmt_conn_id="1336" element_cached="0" request_bytes="332" conn_string_text="http://www.sample.com" content_type="51" header_bytes="1287" ip_address="121.0.0.1" object_text="/accounts/" content_bytes="4111"/>
					<TXN_DETAIL_STATUS status_code="200"/>
				</TXN_BASE_PAGE>
			</TXN_PAGE_DETAILS>
		</TXN_PAGE>

Open in new window

Grid.PNG
Avatar of Sean Stuber
Sean Stuber

what sort of sql output are you looking for?

given the above xml snippet, what would you like the resulting rows/columns of your sql to be?
Avatar of aman0711

ASKER

Hi Sean,
                 As usual, was waiting for your reply. I even posted some questions yesterday but I guess you weren't around.

                I am looking to build the attached table out of this. Actually this file contains some other tags as well, and the table I want contains data from all such different tags.
   
         I dont even know how to start with this problem, :(

sUMMARY.PNG
The XML doesn't look complete.
Is TXN_SUMMARY supposed to have the /> at the end, I assumed it wrapped the TXN_PAGE tag.
Hi mrjoltcola,
                   yes you are right, The XML isnt complete. i just posted a snippet of it.
Actually it has a /> at the end.
<TXN_SUMMARY delta_msec="34455" element_count="465" trans_level_comp_msec="0" estimated_cache_delta_msec="32213" content_errors="1" delta_user_msec="35946" resp_bytes="2767717" />
 
can you post a more complete example, preferably with more than one slot and a syntactically correct set of nodes?
Sean, its kind of proprietary. Can i instead email you and mrjoltcola the trimmed down XML file?
sure, you can email me.  I won't get it until I get home laer this afternoon, but I'll work on it when I get there.
Thanks Sean, Thats a great help.
--Email removed by Modularity, EE Moderator - this is the one, right?

I will send you the schema as well and what exact table I want out of this.
Its been close to two weeks since I am banging my head with this problem and I am still no where close to beginning :-(
yes, but  I had obfuscated my email address on purpose.

note,  all communication for a question is supposed to be kept in the question thread itself.  So, once I get whatever you send me, I will strip out a useful snippet and repost to this thread to continue working on it.

I'll mask the data when I do though.  

I will request the admins to remove your previous post
oh that will be cool :-)
Thank you so much Sean.... I even had a hard time trimming down that huge XML file to smaller size and figuring out the schema. but at least that part is done.

No one in my team has a single clue about this task. You suggestions and help is highly appreciated :-)
SOLUTION
Avatar of mrjoltcola
mrjoltcola
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
wow thanks mrjoltcola :)

    on EE points are important :-) Can I open another question for u, where you can help ,e out with your solution? please let me if it is possible, i will do it right now
I am sorry modularity,

                  Will take care of this in future.

Thanks
>>on EE points are important :-) Can I open another question for u

I help my favorite users regardless of points, so please don't worry about points!
Secondly, no, do not open a new question, we must keep it to this one, and there will, I'm sure, be an interesting and useful solution for future reference.
Thanks..
well do i come under your favorite users :-)
Though we havent interacted much before..

You are right , this question might helpful to some other newbies like me. its literally been 2-3 weeks, since we are running our heads over this XML data feed
>> well do i come under your favorite users :-)

Of course :)
Thank you so much :-)
so I emailed Sean with what all details I have, do you want to have a look at it as well?
yikes!!!!  that's a heck of an xml and the data is spread all around.

It's not so much an xml parsing problem as a where-is-the-data problem.

mrjoltcola is right about the approach I'm taking with a couple of differences.

first, I'm assuming this is 10g

I was planning to use xmltable instead of xmlsequence (same net effect though)
and I was planning to do a multi-table insert.


to save some keystrokes for others trying to help...  below are the ddl to create the target tables for the data load.  I've left off the primary keys since they shouldn't be pertinent to this project
We can always add them later if it turns out they are needed.


are slot and slot_id the same thing?  if not, how do slots and agents map to one another if not through txn_measurement attributes?

when I figure out how to get the data mapped together I'll post a trimmed and masked example for everyone to work from.
Of course, when I get that done, I should have the answer within a few minutes after that.

Like any interesting problem, it's not the syntax typing that consumes the time, it's the figuring out what the problem really is.

:)

I'm going to bed for now though.  I'll work more on it tomorrow after I get your answers.


CREATE TABLE page
(
    slot_id               NUMBER(15),
    agent_id              NUMBER(15),
    agent_instance_id     NUMBER(15),
    datetime              DATE,
    page_seq              NUMBER,
    delta_msec            NUMBER(7),
    delta_user_msec       NUMBER(7),
    ERROR_CODE            NUMBER(7),
    content_errs          NUMBER(7),
    element_count         NUMBER(3),
    page_bytes            NUMBER(8),
    dns_lookup_msec       NUMBER(7),
    first_byte_msec       NUMBER(7),
    start_msec            NUMBER(7),
    connect_delta         NUMBER(7),
    redir_count           NUMBER(2),
    redir_delta           NUMBER(7),
    system_delta          NUMBER(7),
    ssl_handshake_delta   NUMBER(7),
    request_delta         NUMBER(7),
    first_packet_delta    NUMBER(7),
    remain_packet_delta   NUMBER(7)
);
 
CREATE TABLE content_detail
(
    slot_id               NUMBER(15),
    agent_id              NUMBER(15),
    agent_instance_id     NUMBER(15),
    datetime              DATE,
    page_seq              NUMBER,
    record_seq            NUMBER,
    element_delta         NUMBER(7),
    ERROR_CODE            NUMBER(7),
    status_code           NUMBER(7),
    conn_string_text      VARCHAR2(300),
    object_text           VARCHAR2(2048),
    ip_address            VARCHAR2(16),
    content_bytes         NUMBER(9),
    content_type          VARCHAR2(64),
    request_delta         NUMBER(7),
    system_delta          NUMBER(7),
    request_bytes         NUMBER(9),
    first_packet_delta    NUMBER(7),
    remain_packet_delta   NUMBER(7),
    dns_delta             NUMBER(7),
    connect_delta         NUMBER(7),
    ssl_handshake_delta   NUMBER(7),
    start_msec            NUMBER(7),
    header_code           NUMBER(2),
    ERROR_TEXT            VARCHAR2(60)
);
 
CREATE TABLE measurement
(
    slot_id        NUMBER(15),
    target_id      NUMBER(15),
    agreement_id   NUMBER(15),
    slot_alias     VARCHAR2(60),
    pages          NUMBER(2)
);

Open in new window

 
Hi Sean,  
                    I really cant describe how much you are helping me J Thanks a tonnnnnnnn.  
                    So here are the few things you asked for:  
   
                   >> first, I'm assuming this is 10g  
                    Yes, it will go on a 10g  
   
                   >> are slot and slot_id the same thing?  
                   Yes, they are the same thing. Same goes for agent and agent_id  
   
             
                    In the mean time I will read on the two topics you and mrjoltcola discussed. i.e. xmltable, xmlsequence  
               
here's a small snippet comparing xmlsequence vs xmltable

the sample data is based on another question I helped with
with  bststruct as (
select xmltype(
'<country>
  <id>MAL</id>
  <sponsornode spacctid="100000" spacctcode="100000" level="0" legno="1">
    <sponsornode spacctid="NSH" spacctcode="nscode" level="1" legno="1"/>
    <sponsornode spacctid="WXII" spacctcode="wxcode" level="1" legno="2"/>
    <sponsornode spacctid="JOH" spacctcode="jhcode" level="1" legno="3"/>
    <sponsornode spacctid="SNK" spacctcode="sncode" level="1" legno="4"/>
    <sponsornode spacctid="KEN" spacctcode="kncode" level="1" legno="5"/>
  </sponsornode>
</country>') nmstruct from dual
)
SELECT   EXTRACTVALUE(VALUE(s), '/sponsornode/@spacctid') spacctid,
         EXTRACTVALUE(VALUE(s), '/sponsornode/@spacctcode') spacctcode,
         EXTRACTVALUE(VALUE(s), '/sponsornode/@level') lvl,
         EXTRACTVALUE(VALUE(s), '/sponsornode/@legno') legno
  FROM   bststruct, table(XMLSEQUENCE(EXTRACT(bststruct.nmstruct, '//sponsornode'))) s;
 
with  bststruct as (
select xmltype(
'<country>
  <id>MAL</id>
  <sponsornode spacctid="100000" spacctcode="100000" level="0" legno="1">
    <sponsornode spacctid="NSH" spacctcode="nscode" level="1" legno="1"/>
    <sponsornode spacctid="WXII" spacctcode="wxcode" level="1" legno="2"/>
    <sponsornode spacctid="JOH" spacctcode="jhcode" level="1" legno="3"/>
    <sponsornode spacctid="SNK" spacctcode="sncode" level="1" legno="4"/>
    <sponsornode spacctid="KEN" spacctcode="kncode" level="1" legno="5"/>
  </sponsornode>
</country>') nmstruct from dual
)
SELECT   EXTRACTVALUE(VALUE(s), '/sponsornode/@spacctid') spacctid,
         EXTRACTVALUE(VALUE(s), '/sponsornode/@spacctcode') spacctcode,
         EXTRACTVALUE(VALUE(s), '/sponsornode/@level') lvl,
         EXTRACTVALUE(VALUE(s), '/sponsornode/@legno') legno
  FROM   bststruct, xmltable('//sponsornode' passing nmstruct) s;

Open in new window

Oh k. Thanks. I will go through this code snippet and the theory I got on google together. Thanks Sean :)
Sean our task is workable right? I am scared on it :-(
@sdstuber

My knowledge is dated. This is why I am EE to learn too! Thanks for the comparison, it is simpler for sure and I will add this to my sample scripts. We don't use 9i anymore and the syntax was horrible, as you demonstrated.
Hi Sean, mrjoltcola,
               The file I sent to Sean is close to 200Kb. but the actual file would be around 150mb. Will that cause any kind of problem?
 
it will make it slower because of volume but it shouldn't change the results.

Note, there are bug reports in Metalink about processing very large files.
Most (maybe all) have patches associated with them.

When we get something working on a the small sample, try it on the big file and see.  If you encounter a problem there then you may have to contact oracle support.  I wouldn't worry about that possibility until we get that far though.

And as for your previous question "is it workable?" ---  Certainly!!!
I'm not going to be knock it out in 2-3 minutes but it's certainly solvable.
>> I'm not going to be knock it out in 2-3 minutes but it's certainly solvable.
I am sorry Sean, I didnt mean in the rushing sense. :-)
and Sean how do we load this XML file in our database? use SQL *loader?
 
Hi Sean,
              Went through the snippet you gave me.
              is this line acting as an overhead??
               FROM   bststruct, table(XMLSEQUENCE(EXTRACT(bststruct.nmstruct, '//sponsornode'))) s;
              Tried to google on xmlTable ansd xmlSequence. There are many articles on xmlTable but not so much on xmlSequence.
              or the difference between the two
what do you mean by "an overhead"

bstruct has the data, so it must be in the FROM clause

the next part uses the xmltype column in bstruct
first, EXTRACT finds all <sponsornode>   elements
since there are more than one XMLSEQUENCE builds a collection from that extraction and then the TABLE function returns the collection in a format that is queryable as if it were a table or view.

xmltable  function does the same thing as all three of those steps rolled into one.

it's not any more or less efficient, just nicer to read.
actually, it probably is a little more efficient since there is no passing of information across functions,  it can do it all in internally in one step.

I haven't been able to measure a difference but maybe with 150mb file you might
if you care to test both
Ohk. Thanks :-)
This is all very new to me, so just trying my best to google and get as much info as I can.
I also read that, xmlType column cannot support an xml document more than 4000 words?
so thats why we are going to use xmlTable?
what's a "word"?

like I said before,  there have been bugs with large docs, but I haven't hit any lately (I'm using 10.2.0.4 and 11.1.0.7)  if you do see problems wth large docs, contact oracle support for a patch.
Oops I am sory Sean, I meant Characters.
Read the following statement somewhere:
If a large XML document (typically greater than 4000 characters) is inserted into an XMLType column using a String object in JDBC, the run-time error "java.sql.SQLException: Data size bigger than max size for this type" is encountered. This problem can be solved by using a CLOB object to hold the large XML documents.
Hi Sean,
               Sorry i am bothering you again and again :-)
               Could you please tell me a little heads up, how to begin this problem? It will help me understanding your solution.
 
loading xml document into the databse is itself a big task :-)
Download Toad for Oracle trial edition and you'll see how easy it is.
Just downloaded it, gonna play with it for a while. if I get stuck then will ping u again :)
do you need any of the PAGE_META_DATA nodes?

all of the page information you want is in the TXN_PAGE nodes, correct?

I'm concerned with parse times.  Hopefully you won't be trying to load a whole bunch of these monster files.

My initial poc efforts are already verging on painful.

I think I can post a trimmed/masked version of the file tomorrow. So the work can properly continue within this thread.
Hi Sean,
       
       Thanks for helping me out :-)

       >> I'm concerned with parse times.
       I have to parse this xml file (the actual one of 150 mb) once everyday.

       I guess in the PAGE table, there is a PAGE_SEQ field as a primary key and PAGE_SEQ is present in PAGE_META_DATA.

      Same goes with CONTENT DETAIL table.
>> I have to parse this xml file (the actual one of 150 mb) once everyday.

Eep.

Storing 150mb of XML to parse everyday is like using a single table to store your whole database schema. It is not what XML was intended for. XML works better for data-interchange than data-permanent-storage. It is one of the most inefficient formats to parse memory/cpu wise. If this really gets busy, you probably should look to writing a C++ program to do it.

But since its just once a day, if you can tolerate the parse/load time and do it at 5am, maybe, then would it be a problem using the PL/SQL route?
I agree, 150Mb is a lot of data to have to parse.  Are you sure you need to do that every day?  If so, why?

For others in the thread.  I sent Aman a smaller sample with masked data.  I have asked him to confirm it's ok to post and also to confirm that my masking hasn't invalidated any relationships.
When it's validated I'll post it here.
Hi Sean, mrjoltcola,

            Hope you had a great weekend.

          >>  Are you sure you need to do that every day?  If so, why?

          This is a datafeed that we get from a third party vendor. Data will be used to publish some reports for higher management. So everyday a new report will be published (just an excel spreadsheet) for previous days data.


          >>  I sent Aman a smaller sample with masked data

          Sean thanks a lot. It was absolutely fine.


          >> you can tolerate the parse/load time and do it at 5am

          parse load time is not an issue. We can even start it at 3 in the morning :-D
here is the abbreviated xml snippet.  this should allows others to pursue the problem as well.
ee-short.txt
Thank you Sean :-)


I can probably guess most of the tag/attribute mappings to table/column  but to avoid errors, could you provide the mappings?

Attached is a start of how I'm tackling this.  Once you post all of the correct mappings I'll make the corresponding adjustments and additions.

I've already loaded both the original xml you posted and the short version into a table called "mydoc"

With just this start I can parse the small one in 0.1 - 0.2 seconds.  The original takes 30-35 seconds.
SELECT   *
  FROM   (SELECT   EXTRACTVALUE(VALUE(agreement), '/TXN_META_DATA/@agreement_id') agreement_id,
                   EXTRACTVALUE(VALUE(agent), 'AGENT_META_DATA/@agent_id') agent_id,
                   EXTRACTVALUE(VALUE(agent), 'AGENT_META_DATA/@instance_id') agent_instance_id,
                   EXTRACTVALUE(VALUE(slot), 'SLOT_META_DATA/@slot_id') slot_id,
                   EXTRACTVALUE(VALUE(slot), 'SLOT_META_DATA/@slot_alias') slot_alias,
                   EXTRACTVALUE(VALUE(slot), 'SLOT_META_DATA/@pages') pages,
                   EXTRACTVALUE(VALUE(meas), 'TXN_MEASUREMENT/@agent') meas_agent_id,
                   EXTRACTVALUE(VALUE(meas), 'TXN_MEASUREMENT/@slot') meas_slot_id,
                   EXTRACTVALUE(VALUE(meas), 'TXN_MEASUREMENT/@datetime') meas_date,
                   EXTRACTVALUE(VALUE(meas), 'TXN_MEASUREMENT/@target') target_id,
                   EXTRACTVALUE(VALUE(meas), 'TXN_MEASUREMENT/@agent_inst') meas_agent_inst_id,
                   EXTRACTVALUE(VALUE(tpage), 'TXN_PAGE/@page_seq') page_seq,
                   EXTRACTVALUE(VALUE(meas), 'TXN_MEASUREMENT/TXN_SUMMARY/@delta_msec')
                       summary_delta_msec,
                   EXTRACTVALUE(VALUE(tpage), 'TXN_PAGE/TXN_PAGE_PERFORMANCE/@delta_msec')
                       page_delta_msec,
                   EXTRACTVALUE(VALUE(tpage), 'TXN_PAGE/TXN_PAGE_PERFORMANCE/@ssl_handshake_delta')
                       page_ssl_handshake_delta,
                   EXTRACTVALUE(VALUE(basepage),
                                'TXN_BASE_PAGE/TXN_DETAIL_PERFORMANCE/@ssl_handshake_delta')
                       base_ssl_handshake_delta
            FROM   (SELECT   xmltype(xmlclob) xml
                      FROM   mydoc
                     WHERE   id = 2),
                   XMLTABLE ('//TXN_META_DATA' PASSING xml) agreement,
                   XMLTABLE ('//AGENT_META_DATA' PASSING VALUE(agreement)) agent,
                   XMLTABLE ('//SLOT_META_DATA' PASSING VALUE(agreement)) slot,
                   XMLTABLE ('//TXN_MEASUREMENT' PASSING xml) meas,
                   XMLTABLE ('//TXN_PAGE' PASSING VALUE(meas)) tpage,
                   XMLTABLE ('//TXN_BASE_PAGE' PASSING VALUE(tpage)) basepage)
 WHERE   agent_id = meas_agent_id
     AND agent_instance_id = meas_agent_inst_id
     AND slot_id = meas_slot_id

Open in new window

Hi Sean,
 
               I didnt exactly get "Mapping"? That is all the info I have for it :-(

               Actually there might be some attributes which are desired in the Table but not present in the XML file, We just need to omit those from the Table. Please let me know if you were asking for some other info.
what I meant by "mapping" is

there are 3 tables to populate indicate which node and attribute applies to each table and column.

something like the attachment below
note, multiple nodes may map to the same table/column, that's fine, those become the "join" conditions

What I have below is just guess on my part, but I think I did ok.  Please confirm or change and then add the rest.

TABLE measurement
     slot_id   
             TXN_DATA_FEED/TXN_META_DATA/SLOT_META_DATA@slot_id
             TXN_DATA_FEED/DP_TXN_MEASURMENTS/TXN_MEASURMENT@slot
    target_id
             TXN_DATA_FEED/DP_TXN_MEASURMENTS/TXN_MEASURMENT@target
    agreement_id
             TXN_DATA_FEED/TXN_META_DATA@agreement_id
    slot_alias
             TXN_DATA_FEED/TXN_META_DATA/SLOT_META_DATA@slot_alias
    pages
              TXN_DATA_FEED/TXN_META_DATA/SLOT_META_DATA@pages

Open in new window

ohk. got it :)

 You already figured it out perfectly :-)
only one of them,  please fill in the rest
ok, i will Fill in the Page table and other one too.:-)
Hello Sean,
                      Big apologies for the delay. here is the mapping for PAGE table. I hope I did it correctly. :-)

PAGE data table
	slot_id 
		  TXN_DATA_FEED/DP_TXN_MEASUREMENT/TXN_MEASUREMENT@slot
	agent_id
		  TXN_DATA_FEED/DP_TXN_MEASUREMENT/TXN_MEASUREMENT@agent
    agent_instance_id
		  TXN_DATA_FEED/DP_TXN_MEASUREMENT/TXN_MEASUREMENT@agent_inst
	datetime
		  TXN_DATA_FEED/DP_TXN_MEASUREMENT/TXN_MEASUREMENT@datetime
    page_seq
	      TXN_DATA_FEED/DP_TXN_MEASUREMENT/TXN_MEASUREMENT/TXN_PAGE@page_seq
    delta_msec
	      TXN_DATA_FEED/DP_TXN_MEASUREMENT/TXN_MEASUREMENT/TXN_SUMMARY@delta_msec
    delta_user_msec
		  TXN_DATA_FEED/DP_TXN_MEASUREMENT/TXN_MEASUREMENT/TXN_SUMMARY@delta_user_msec
	content_errs
		  TXN_DATA_FEED/DP_TXN_MEASUREMENT/TXN_MEASUREMENT/TXN_SUMMARY@content_errors
    element_count
		  TXN_DATA_FEED/DP_TXN_MEASUREMENT/TXN_MEASUREMENT/TXN_SUMMARY@element_count
    page_bytes
		  TXN_DATA_FEED/DP_TXN_MEASUREMENT/TXN_MEASUREMENT/TXN_PAGE/TXN_PAGE_OBJECT@page_bytes
	dns_lookup_msec
	      TXN_DATA_FEED/DP_TXN_MEASUREMENT/TXN_MEASUREMENT/TXN_PAGE/TXN_PAGE_PERFORMANCE@dns_lookup_msec
    first_byte_msec
	      TXN_DATA_FEED/DP_TXN_MEASUREMENT/TXN_MEASUREMENT/TXN_PAGE/TXN_PAGE_PERFORMANCE@first_byte_msec
	start_msec
		  TXN_DATA_FEED/DP_TXN_MEASUREMENT/TXN_MEASUREMENT/TXN_PAGE/TXN_PAGE_PERFORMANCE@start_msec
    connect_delta
	      TXN_DATA_FEED/DP_TXN_MEASUREMENT/TXN_MEASUREMENT/TXN_PAGE/TXN_PAGE_PERFORMANCE@connect_delta
    system_delta
	      TXN_DATA_FEED/DP_TXN_MEASUREMENT/TXN_MEASUREMENT/TXN_PAGE/TXN_PAGE_PERFORMANCE@system_delta
    request_delta
	      TXN_DATA_FEED/DP_TXN_MEASUREMENT/TXN_MEASUREMENT/TXN_PAGE/TXN_PAGE_PERFORMANCE@request_delta
	first_packet_delta
	      TXN_DATA_FEED/DP_TXN_MEASUREMENT/TXN_MEASUREMENT/TXN_PAGE/TXN_PAGE_PERFORMANCE@first_packet_delta
    remain_packets_delta
	      TXN_DATA_FEED/DP_TXN_MEASUREMENT/TXN_MEASUREMENT/TXN_PAGE/TXN_PAGE_PERFORMANCE@remain_packet_delta
    

Open in new window

hello, did you get the last table?
Hey Sean,
                       The last table? oops the content detail? I guess that wasnt that important as of now.. I thought If I will get a solution from you, then I will try work it out by referencing your solution
ASKER CERTIFIED SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
I am really very sorry Sean and mrjoltcola ... You guys are great