tonelm54

Splitting a string into elements

I've got an XML file which was written by a program, and I can't get anything to read it. As it's 12 GB in size, I don't want to go through it manually, line by line, trying to correct it. I've managed to figure out parsing the parts I need, but I'm stuck when it comes down to the individual elements.

I've got a line such as:
<field name="createdby" value="0fb646e4-5590-e611-80e8-1458d05b422c" lookupentity="systemuser" lookupentityname="User 27" />

And I want to be able to split the line up into an array such as:-
     field name="createdby" 
     value="0fb646e4-5590-e611-80e8-1458d05b422c" 
     lookupentity="systemuser" 
     lookupentityname="User 27"

But as the row sometimes has different elements, I wanted some general way of doing this, and I can't figure it out.

I thought I could load each line with simplexml_load_string() and extract the data, but because the row isn't closed it complains about it. If I manually fix the field to:
<field name="createdby" value="0fb646e4-5590-e611-80e8-1458d05b422c" lookupentity="systemuser" lookupentityname="User 27">NO Value</field>
it's happy, but I really don't want to go through a 12 GB file and fix each line manually.

Any ideas?
ASKER CERTIFIED SOLUTION
Ray Paseur
SOLUTION
gr8gonzo
I should also mention that if you have the same GUIDs being repeated millions of times, and if the XML storage / processing is entirely up to you, you might be able to significantly reduce the size of your XML by simply keeping a table of GUIDs in memory and replacing them with indexes to entries in that table, and then write the table later on in the XML. For example:

<yourxmlfile>
  <guids>
    <guid id="1">0fb646e4-5590-e611-80e8-1458d05b422c</guid>
    <guid id="2">dffa356-6611-f727-9f12-abcd1234ff00</guid>
    <!-- ...etc... -->
  </guids>
  <records>
    <record id="1">
      <fields>
        <field name="createdby" value="1" lookupentity="systemuser" lookupentityname="User 27" />
      </fields>
    </record>
    <record id="2">
      <fields>
        <field name="createdby" value="1" lookupentity="systemuser" lookupentityname="User 27" />
      </fields>
    </record>
    <record id="3">
      <fields>
        <field name="createdby" value="2" lookupentity="systemuser" lookupentityname="User 28" />
      </fields>
    </record>
    <!-- ...etc... -->
  </records>
</yourxmlfile>

You can potentially save over 30 bytes for every GUID in that field, so if your XML file is full of these things, it might help. Also, you could potentially swap out long tag names with shortened versions, like <f>...</f> instead of <field>...</field> to further compress the XML file while preserving its structure (you'd just have to update any mappings that referenced "field" and change it to "f", for example).
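As a rough illustration of the lookup-table idea, here is a minimal Python sketch (the thread itself is about PHP, so treat the language as a substitution). The function name build_guid_table and the sample list are hypothetical; the ids start at 1 only to match the example above.

```python
def build_guid_table(guids):
    """Map each distinct GUID to a small integer id, in first-seen order."""
    table = {}
    for guid in guids:
        if guid not in table:
            table[guid] = len(table) + 1  # ids start at 1, as in the example
    return table


# Sample values taken from the example XML; a real run would read them
# from the "value" attributes while the file is being written or rewritten.
values = [
    "0fb646e4-5590-e611-80e8-1458d05b422c",
    "0fb646e4-5590-e611-80e8-1458d05b422c",
    "dffa356-6611-f727-9f12-abcd1234ff00",
]

table = build_guid_table(values)

# Replace every GUID with its short id before writing the <field> elements.
compact = [str(table[guid]) for guid in values]
```

The table itself would then be written out once as the guids section, and each field would carry only the short id.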
@Jonathan: Still waiting to see if there is more test data, but for this sample, there is nothing that needs to be "fixed."  It's a perfectly valid XML document -- it just has all of its information in the attributes, not between opening and closing tags.  It doesn't even need to be turned into an array -- foreach() can iterate over the attributes.
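To make that point concrete, here is a small sketch in Python rather than PHP (an assumption on my part; the same idea applies to iterating attributes with foreach() on a SimpleXML element). It parses the sample line as-is, with no "fixing", and reads the attributes directly. The helper name field_attributes is hypothetical.

```python
import xml.etree.ElementTree as ET

# The sample line from the question, unchanged. The trailing "/>" is a
# valid self-closing tag, so no manual repair is needed.
line = (
    '<field name="createdby" '
    'value="0fb646e4-5590-e611-80e8-1458d05b422c" '
    'lookupentity="systemuser" lookupentityname="User 27" />'
)


def field_attributes(xml_line):
    """Parse one self-closed element and return its attributes as a dict."""
    element = ET.fromstring(xml_line)
    return dict(element.attrib)


attrs = field_attributes(line)

# Works for any set of attributes, so rows with different ones are fine.
for key, value in attrs.items():
    print(f"{key} = {value}")
```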

In terms of processing a 12 GB file in one go, while that may be possible in theory on a 64-bit machine, it seems unlikely to work in practice; the file will probably have to be taken in smaller bites.
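One way to take the file "in smaller bites" is a streaming parser that hands over one element at a time instead of loading the whole document. The following is only a sketch in Python (PHP has comparable streaming readers); the function name iter_field_attributes and the tiny in-memory sample are hypothetical, and the real file's structure is assumed to resemble the example shown earlier in the thread.

```python
import io
import xml.etree.ElementTree as ET


def iter_field_attributes(source, tag="field"):
    """Yield the attribute dict of every <tag> element, one at a time."""
    for _event, element in ET.iterparse(source, events=("end",)):
        if element.tag == tag:
            # Copy the attributes before the element is cleared below.
            yield dict(element.attrib)
        # Drop children, text and attributes of finished elements so they
        # can be freed. Emptied elements stay attached to their parent
        # until the parent closes, so memory still grows, only far slower.
        element.clear()


# Small stand-in for the real file; for the 12 GB file, pass its path
# (or an open binary file object) instead of this BytesIO buffer.
sample = io.BytesIO(
    b"<records>"
    b'<record id="1"><field name="createdby" value="1" /></record>'
    b'<record id="2"><field name="createdby" value="2" /></record>'
    b"</records>"
)

rows = list(iter_field_attributes(sample))
```

Collecting everything into a list is only for the demonstration; with a large file each row would be processed inside the loop and then discarded.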
Yeah, that was my thought, too - I mentioned that near the end of my first comment:
"Since the /> is technically a valid ending, my assumption is that your original issues are either related to file size or to XML that doesn't conform to a WSDL or some other XML rule."

I was just thrown off by the remark that changing it to: "<field...>NO Value</field>" would work for him. If that's the case, maybe he was using some kind of poorly-built custom XML parser that didn't understand self-closing tags.
tonelm54

ASKER

Sorry, I was unable to supply any test data; however, the project has been cancelled, so I don't need this anymore.

Thank you for your support anyway.