Question

Convert XML to CSV (Pipe Delimited) - Large Files 100mb to 1GB

Asked by: thyros

I am working with large xml files (ranging from 100mb to 1GB+), thus it is not really possible to load them into memory.. and the goal is to convert these files to a csv format, but pipe (|) delimited.


The XML file looks like this;


  <?xml version="1.0" encoding="UTF-8" ?>
- <merch_item_feed>
- <item_data>
- <item_basic_data>
  <item_unique_id>0115526102</item_unique_id>
  <item_ean>9780115526107</item_ean>
  <item_sku>0115526102</item_sku>
  <item_upc />
  <item_mpn />
  <item_brand>Stationery Office Books</item_brand>
  <item_name>The Official Learning to Drive Pack (Driving Skills)</item_name>
  <item_model />
  <item_category>Book</item_category>
  <item_short_desc>Paperback, Stationery Office Books</item_short_desc>
  <item_page_url>http://www.amazon.co.uk/exec/obidos/ASIN/0115526102/AssocID/ref=nosim</item_page_url>
  <amzn_page_url>http://www.amazon.co.uk/exec/obidos/ASIN/0115526102/AssocID/ref=nosim</amzn_page_url>
  <offer_page_url>http://www.amazon.co.uk/o/redirect?tag=AssocID&link_code=asm&path=tg/detail/offer-listing/-/0115526102/new/ASIN/0115526102&camp=1634&creative=6738</offer_page_url>
  <offer_used_url>http://www.amazon.co.uk/o/redirect?tag=AssocID&link_code=asm&path=tg/detail/offer-listing/-/0115526102/used/ASIN/0115526102&camp=1634&creative=6738</offer_used_url>
  <item_image_url>http://ec1.images-amazon.com/images/I/31DY01WGJ1L.jpg</item_image_url>
  <item_image_url_small>http://ec1.images-amazon.com/images/I/11FE11DK7YL.jpg</item_image_url_small>
  <item_salesrank>233499</item_salesrank>
  <item_price>21.23</item_price>
  <item_inventory>Usually dispatched within 1-2 business days</item_inventory>
  <item_shipping_charge>Check Site.</item_shipping_charge>
  <amzn_price>24.99</amzn_price>
  <amzn_inventory>Usually dispatched within 24 hours</amzn_inventory>
  <amzn_shipping_charge>Free!</amzn_shipping_charge>
  <fm_price>24.99</fm_price>
  <fm_inventory>Usually dispatched within 24 hours</fm_inventory>
  <fm_shipping_charge>Free!</fm_shipping_charge>
  <tp_new_price>21.23</tp_new_price>
  <tp_new_inventory>Usually dispatched within 1-2 business days</tp_new_inventory>
  <tp_new_shipping_charge>Check Site.</tp_new_shipping_charge>
  <tp_used_price>20.00</tp_used_price>
  <tp_used_inventory>In Stock</tp_used_inventory>
  <tp_used_shipping_charge>Check Site.</tp_used_shipping_charge>
  </item_basic_data>
- <prod_specific_data category="book">
  <known_attr_val_pair attr="book_author" val="Driving Standards Agency" />
  <known_attr_val_pair attr="book_isbn" val="0115526102" />
  <known_attr_val_pair attr="book_format" val="Paperback" />
  </prod_specific_data>
- <merch_cat_list>
- <merch_cat_item>
  <merch_cat_name>277082</merch_cat_name>
  <merch_cat_path>Books/Subjects/Reference/Transport/Automotive/Driving & the Highway Code</merch_cat_path>
  </merch_cat_item>
- <merch_cat_item>
  <merch_cat_name>278131</merch_cat_name>
  <merch_cat_path>Books/Subjects/Science & Nature/Engineering & Technology/Civil Engineering/Road & Transport</merch_cat_path>
  </merch_cat_item>
- <merch_cat_item>
  <merch_cat_name>10834521</merch_cat_name>
  <merch_cat_path>Books/Special Features/34% off Books over £10/Science & Nature</merch_cat_path>
  </merch_cat_item>
- <merch_cat_item>
  <merch_cat_name>10834491</merch_cat_name>
  <merch_cat_path>Books/Special Features/34% off Books over £10/Reference & Languages</merch_cat_path>
  </merch_cat_item>
  </merch_cat_list>
  </item_data>
- <item_data>
- <item_basic_data>
  <item_unique_id>0115528423</item_unique_id>
  <item_ean>9780115528422</item_ean>
  <item_sku>0115528423</item_sku>
  <item_upc />
  <item_mpn />
  <item_brand>The Stationary Office (TSO)</item_brand>
  <item_name>The Official DSA Theory Test for Motorcyclists CD-ROM</item_name>
  <item_model />
  <item_category>Software</item_category>
  <item_short_desc>, Platforms: Windows XP</item_short_desc>
  <item_page_url>http://www.amazon.co.uk/exec/obidos/ASIN/0115528423/AssocID/ref=nosim</item_page_url>
  <amzn_page_url>http://www.amazon.co.uk/exec/obidos/ASIN/0115528423/AssocID/ref=nosim</amzn_page_url>
  <offer_page_url>http://www.amazon.co.uk/o/redirect?tag=AssocID&link_code=asm&path=tg/detail/offer-listing/-/0115528423/new/ASIN/0115528423&camp=1634&creative=6738</offer_page_url>
  <offer_used_url>http://www.amazon.co.uk/o/redirect?tag=AssocID&link_code=asm&path=tg/detail/offer-listing/-/0115528423/used/ASIN/0115528423&camp=1634&creative=6738</offer_used_url>
  <item_image_url>http://ec1.images-amazon.com/images/I/31Au9yM7IZL.jpg</item_image_url>
  <item_image_url_small>http://ec1.images-amazon.com/images/I/11ldivFAIML.jpg</item_image_url_small>
  <item_salesrank>1068</item_salesrank>
  <item_price>16.99</item_price>
  <item_inventory>Not yet released</item_inventory>
  <item_shipping_charge>Free!</item_shipping_charge>
  <amzn_price>16.99</amzn_price>
  <amzn_inventory>Not yet released</amzn_inventory>
  <amzn_shipping_charge>Free!</amzn_shipping_charge>
  <fm_price>16.99</fm_price>
  <fm_inventory>Not yet released</fm_inventory>
  <fm_shipping_charge>Free!</fm_shipping_charge>
  </item_basic_data>
- <prod_specific_data category="software">
  <known_attr_val_pair attr="hardware_platform" val="PC" />
  <known_attr_val_pair attr="software_os" val="Windows XP" />
  <known_attr_val_pair attr="software_format" val="CD-ROM" />
  </prod_specific_data>
- <merch_cat_list>
- <merch_cat_item>
  <merch_cat_name>277082</merch_cat_name>
  <merch_cat_path>Books/Subjects/Reference/Transport/Automotive/Driving & the Highway Code</merch_cat_path>
  </merch_cat_item>
- <merch_cat_item>
  <merch_cat_name>278131</merch_cat_name>
  <merch_cat_path>Books/Subjects/Science & Nature/Engineering & Technology/Civil Engineering/Road & Transport</merch_cat_path>
  </merch_cat_item>
- <merch_cat_item>
  <merch_cat_name>912026</merch_cat_name>
  <merch_cat_path>Software/Categories/Hobbies & Pastimes/Driving Tests</merch_cat_path>
  </merch_cat_item>
- <merch_cat_item>
  <merch_cat_name>16305411</merch_cat_name>
  <merch_cat_path>Software/Categories/Hobbies & Pastimes/All Hobbies & Pastimes</merch_cat_path>
  </merch_cat_item>
- <merch_cat_item>
  <merch_cat_name>317243011</merch_cat_name>
  <merch_cat_path>Software/Categories/Digital Imaging/Creativity Software</merch_cat_path>
  </merch_cat_item>
- <merch_cat_item>
  <merch_cat_name>341610011</merch_cat_name>
  <merch_cat_path>uk-shops/Education Resources/Software/Driving Tests</merch_cat_path>
  </merch_cat_item>
  </merch_cat_list>
  </item_data>
- </merchitemfeed>


Objective is to extract the data from the 'item_basic_data' elements and separate them by pipe character.

Output should look something like (with the field headers);

item_unique_id|item_ean|iteam_upc
12345678901|12345678|12345678
12345678901|12345678|12345678
12345678901|12345678|12345678

------------------------------------------

Please note that only the information from 'item_basic_data' needs to be extracted - instructions on how to accomplish this is sufficient as an answer.  However, if you know your stuff, I would appreciate a solution that could extract the first instance of 'merch_cat_path'.  If you notice, each 'item_basic_data' has 4 or 5 duplicate elements of 'merch_cat_path', but we only want the first instance if possible.

I am assuming we will need some xslt file, but I don't know how to write it.  I am experimenting with a program that will do the processing of the input xml, transform xslt, and output csv files, but it does not supply the xslt file itself.

Also, if you have any suggestions for similar programs that can handle & process large xml files - preferably freeware, but commercial is ok too.


This Question has been solved and asker verified All Experts Exchange premium technology solutions are available to subscription members.

Subscribe now for full access to Experts Exchange and get

Instant Access to this Solution

  • Plus...
  • 30 Day FREE access, no risk, no obligation
  • Collaborate with the world's top tech experts
  • Unlimited access to our exclusive solution database
  • Never be left without tech help again

Subscribe Now

Asked On
2007-09-17 at 10:24:09ID22833778
Tags

xml

,

csv

,

convert

,

pipe

,

file

Topics

Extensible Stylesheet Language Transformation (XSLT)

,

Extensible Markup Language (XML)

Participating Experts
2
Points
500
Comments
17

Trusted by hundreds of thousands everyday for fast, accurate and reliable tech support.

  • "The time we save is the biggest benefit of Experts Exchange to Warner Bros. What could take multiple guys 2 hours or more each to find is accessed in around 15 minutes on Experts Exchange." Mike Kapnisakis, Warner Bros.
  • "Our team likes having a resource that is more secure than just using Google and most experts using this service really know their stuff. It's nice to look here first versus using Google." Dayna Sellner, Lockheed Martin
  • "Anytime that I've been stumped with a problem, 9 out of 10 times Experts Exchange has either the accepted solution or an open discussion of the potential solution to the problem." Kenny Red, eBay Inc.

See what Experts Exchange can do for you.

Got a question?

We've got the answer.

Experts Exchange has been collecting answers to technology questions since 1996…3 million and counting! If you have a question, chances are we already have your answer.

Screenshot of Experts Exchange Knowledgebase

Need individual assistance?

Our experts are ready to help.

If you can't find the exact answer you're looking for, ask our exclusive community of 50,000 experts. You’ll get a personalized answer from a trusted professional.

Screenshot of Experts Exchange Knowledgebase

Want to learn from the best?

Read articles from industry experts.

Thousands of free tech tips, tricks, how-to’s and tutorials are available in our peer reviewed articles section. See for yourself how smart our experts are, no login required.

Screenshot of an Article

Working on a long term project?

Store your work and research.

Save solutions to your questions, answers you’ve discovered through searching plus helpful articles in your personal knowledgebase for easy future access.

Screenshot of Experts Exchange Knowledgebase

Access the answers to your technology questions today.

Subscribe Now

30-day free trial. Register in 60 seconds.

What Makes Experts Exchange Unique?

Members of the expert community talk about why the experience at Experts Exchange is different than what you will find anywhere else.

Trusted by the world's most respected brands.

image of each brand's logo

Faithfully serving IT professionals since 1996.

Experts Exchange Logo

Try it out and discover for yourself.

Subscribe Now

30-day free trial. Register in 60 seconds.

Related Solutions

  1. xml to csv, csv to xml
    I want to be able to convert xml files into csv files, alter the csv file with code I have written and then convert the csv file back to xml. I know there is a bunch of classes already written ie pear. Problem is I'm not sure where to start. Do I need pears XML_Parser,...
  2. Insert line break in XML/CSV when using XSLT
    I am using ASP and Microsoft XML Parser to create XML and CSV files from ADO recordset. It works fine but everything is written to one line in the document. I do the same and create a HTML using XSLT and it contains line breaks. How do i get line breaks in XSLT? I have tried...
  3. simple xml to csv conversion with xslt
    I have a very simple xml document which i need to convert to csv using xslt. <products> <product id="32323232323"> <title>The title</title> <Description>the Description</Description> etc </product> </products> I nee...
  4. CSV to XML using XSLT
    I have a file which is in the following format 1,2,3,4,5,6,7^a^b^c^d^e^f^g^h^i^j I need to convert it to an XML using XSLT. The csv file could change its format sometime later so need to use XSLT other than changing the code, changes in XSL should be enough Could anyone hel...
  5. xml to csv using xslt
    I have no idea where to start but someone mentioned that I could convert an xml file to a csv file using an xslt. This is a sample of my xml file Transactions will have different names. The xml file needs to be exported in the format see below <transaction name="buil...

Free Tech Articles

  1. WARNING: 5 Reasons why you should NEVER fix a computer for free.
    It is in our nature to love the puzzle. We are obsessed. The lot of us. We love puzzles. We love the challenge. We thrive on finding the answer. We hate disarray. It bothers us deep in our soul. W...
  2. SCCM OSD Basic troubleshooting
    SCCM 2007 OSD is a fantastic way to deploy operating systems, however, like most things SCCM issues can sometimes be difficult to resolve due to the sheer volume of logs to sift through and the dispe...
  3. Migrate Small Business Server 2003 to Exchange 2010 and Windows 2008 R2
    This guide is intended to provide step by step instructions on how to migrate from Small Business Server 2003 to Windows 2008 R2 with Exchange 2010. For this migration to work you will need the fo...
  4. Create a Win7 Gadget
    This article shows you how to create a simple "Gadget" -- a sort of mini-application supported by Windows 7 and Vista. Gadgets can be dropped anywhere on the desktop to provide instant information, ...
  5. Outlook continually prompting for username and password
    There have been a lot of questions recently regarding Outlook prompting for a username and password whilst using Exchange 2007. There are a few reasons why this would happen and I will try to cover t...
  6. Backup Exchange 2010 Information Store using Windows Backup
    There seems to be quite a lot of confusion around the ability to backup Exchange 2010 using the built in Windows Backup feature. This stems from the omission of this feature prior to Exchange 2007 s...

Cloud Class Webinars

  1. Avoiding Bugs in Microsoft Access
    Alison Balter takes and in-depth look at avoiding bugs in Access. In this webinar you will learn about using the immediate window to debug your applications, invoking the debugger, using breakpoints to troubleshoot, stepping through code, setting the next statement to execute, ...
  2. Top 10 Best New Features in Visio 2010
    Scott Helmers gives live demonstrations of the top 10 new features in Visio 2010. This webinar will teach you how to create compelling diagrams by adding shapes to the page with a single click, linking the shapes in a diagram to data in Excel (or SQL Server, or SharePoint), ...
  3. IT Consultant Business Secrets Revealed
    Michael Munger, Experts Exchange tech pro and IT consultant, pulls back the curtain on his very successful businesses and answers question on every IT consultant and business owner should know about. He shares secrets on what he did to solve the 5 most common problems in IT, ...
  4. Disaster Recovery and Business Continuity
    Quest CTO, Mike Billon, gives an overview of the steps involved in building a dunamic disaster recovery plan. Through case studies and an examination of software/hardware tooles for monitoring and testing, you'll gain a better understandin of where you are, where you want ...
  5. Organize Your Visio Diagrams with Containers and Lists
    Scott Helmers uses cross functional flowcharts, wireframe diagrams, data graphic legends and seating charts to teach you: how to ustilize all three new structured diagram components in Visio 2010, the best practices for organizeing shapes in previous version of Visio, how to organize ...
  6. How to Us Objects, Properties, Events and Methods in Microsoft Access
    Alison Dalter gives an in-depbth look at objects, properties, events and methods in Microsoft Access. In this webinar you will learn about using the object browser, referring to objects, working with properties and methods, working with object variables, understanding the ...

Join the Community

Give a Little. Get a Lot.

Join the community of experts here and help other tech pros by answering question in your area of expertise. You can earn FREE access to all Experts Exchange's premium features and resources.

Join the Community

Answers

 

by: abelPosted on 2007-09-17 at 10:34:23ID: 19906666

Hi,

It  is fairly easy to do what you want in XSLT, but the size maybe a problem. The only product I know of that can deal with large sizes like your is Saxon-SA, and then, still only when you construct your stylesheet ready for streaming processing (SAX based processing).

You XSLT may look the following

<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

    <!-- tell the processor to output text -->
    <xsl:output method="text" />

    <!-- throw away unwanted nodes -->
    <xsl:template match="*" />

   <!-- here you can make one long list of all items you want in the CSV row -->
   <xsl:template match="item_basic_data">
      <xsl:value-of select="item_unique_id|item_ean|itam_upc" separator="|" />
      <xsl:text>&#xA;</xsl:text>
   </xsl:template>

</xsl:stylesheet>


Cheers,
-- Abel Braaksma

 

by: GertonePosted on 2007-09-17 at 10:44:11ID: 19906727

The XSLT would be really simple to get this done,
but I don't think that XSLT is the right answer with these filesizes.
You definitely need a streaming parser for this

You could look at STX
http://stx.sourceforge.net/
which is a forward only subset of XSLT, but that works with an extremely low footprint
Have a look first whether or not this would fit your architecture
I can then help you with the STX, whch would be as simple as the XSLT required for this
If you have some Perl knowledge XML::STX would work fine

There is a Streaming PullParser for Java, StAX,
you could use this for cutting your file in pieces (of 50MB eg.)
create the pipe-CSV for each piece using XSLT
and bring the result together after XSLT-ing
I am not a Java programmer, so I can't help you with that

Here is a nice introduction to StAX if you are a java programmer
http://www.ibm.com/developerworks/library/x-stax1.html
The technique is know as pull parsing, if you are a Python programmer you could use pulldom
http://www.ibm.com/developerworks/library/x-tipulldom.html
Uche has a more recent article where he has such a splitter in the middle of the article
http://www.ibm.com/developerworks/library/x-increfeed.html

cheers

Geert

 

by: abelPosted on 2007-09-17 at 10:44:55ID: 19906737

Oh, btw, the above *only* works on a 2.0 processor! The one I mentioned, Saxon-SA: www.saxonica.com, is a 2.0 processor, and arguably the best and fastest around. In terms of speed, the most recent MSXML and .NET processors are quicker for most scenarios, but they cannot stream and they do not do the much more versatile XSLT 2.0 language.

If you cannot do Saxon 8, you should consider running the whole thing on a machine that has at least 4 tmes the amount of memory that your data has (true for MSXML, Saxon, libxslt, Gestalt and Xalan). If you don't, the machine will start swapping and the processing time goes from speedy to exponentially slow.

If you use a Java based browser on a windows machine, you have another problem: the Java JVM for clients has a maximum of approximately 300-400MB. You can allocate more on the commandline, but it won't be used and there starts the swapping around again. This problem should not occur with the server versions of the JVM. Nor should they occur on non-windows systems (not checked though).

As to your solution: you may want to change the order of the elements. If they need to be in the same order as the input document, the above solution will suffice pretty well. If they don't, you should consider using the comma-operator (again: only XSLT 2.0). You xsl:value will then look like this:

<xsl:value-of select="item_unique_id , item_ean  , itam_upc" separator="|" />

Or, if there is quite a lot, you can use whitespace to make it clearer, and you may even add comments (again, only XSLT 2.0) to the XPath:

<xsl:value-of select="
     item_unique_id,    (: the unique identifier :)
     item_ean ,            (: the EAN number of the item :)
     itam_upc ,
     item_xxx" separator="|" />

Etc.

Since you are new to XSLT: do not confuse the "|" operator with the "|" character in your output. The pipeline as operator instructs the processor to take the distinct union of all nodes that match the combined criteria:

    item_upc only selects all elements that match item_upc
    Item_upc | item_ean  will select all elements that match either item_upc or item_ean and will return them in document order.


HTH,

Cheers,
-- Abel Braaksma

 

by: abelPosted on 2007-09-17 at 11:01:25ID: 19906855

On using streaming large documents, for Saxon, the article is here: http://www.saxonica.com/documentation/sourcedocs/serial.html

Note that, from these instructions, you must modify your XSLT document slightly:


<xsl:stylesheet
    xmlns:my="http://myfunctions"
    xmlns:saxon="http://sf.net/saxon/"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

    <!-- tell the processor to output text -->
    <xsl:output method="text" />

    <!-- use this param to select the file on the commandline -->
    <xsl:param name="document-url" />

    <!-- start push mode using the copy-of / document function -->
    <xsl:template name="main">
         <xsl:apply-templates select="my:read-document()" />
    </xsl:template>

    <!-- we need a function to prevent copying the tree into memory -->
    <xsl:function name="my:read-document">
        <xsl:copy-of select="document($document-url)//item_basic_data" saxon:read-once="true" />
    </xsl:function>

   <!-- here you can make one long list of all items you want in the CSV row -->
   <xsl:template match="item_basic_data">
     
<xsl:value-of select="
     item_unique_id,    (: the unique identifier :)
     item_ean ,              (: the EAN number of the item :)
     itam_upc ,
     item_xxx" separator="|" />

      <xsl:text>&#xA;</xsl:text>
   </xsl:template>

</xsl:stylesheet>


Now we are only processing what we want, so no need for the "throw away" template anymore  (the one with match="*"). In addition, we have added three elements that should take care of processing in push mode. This may seem like a lot of fuzz for getting push mode to work, but once you look past it, the simple processing instructions in the template matching declarations are still the same simple ones.

Cheers,
-- Abel Braaksma

 

by: thyrosPosted on 2007-09-17 at 11:21:49ID: 19906997

Thanks for your comments.

Unfortunately I have very limited experience with XML (only what I have experimented with in trial and error).  I am not a java programmer, don't know any perl or python, but I can somewhat appreciate the structure of xml through my experience with xhtml.

I am trying to convert around 10 or 12 such large xml files to csv format.  I have been using some programs such as xml2csv which is command line based, so I don't even open the original xml file, but the conversion  has the effect of reducing the filesize enough for me to open the csv file  up in latest version of Excel (300,000k rows or records).  However, it seems to be causing some glitches because it is ',' comma delimited and the content has comma's too.

The program I am looking at now is twxml2csv which is gui based, but it needs an xsl file.

http://babelfish.altavista.com/babelfish/trurl_pagecontent?lp=de_en&trurl=http%3a%2f%2fwww.wenzlaff.de%2fxmltocsv.html
original in german: http://www.wenzlaff.de/xmltocsv.html


As a minimum, I am trying to extract the following headers from the xml file in the original post;

item_unique_id|item_ean|item_sku|item_upc|item_mpn|item_brand|item_name|item_model|item_category|item_short_desc|item_page_url|item_image_url|item_image_url_small|item_salesrank|item_price|item_inventory|item_shipping_charge|amzn_price|amzn_inventory|amzn_shipping_charge|fm_price|fm_inventory|fm_shipping_charge|tp_new_price|tp_new_inventory|tp_new_shipping_charge|tp_used_price|tp_used_inventory|tp_used_shipping_charge


After that, if possible, we can try extracting  the merch_cat_path field (one instance of it per product).

>>
"Since you are new to XSLT: do not confuse the "|" operator with the "|" character in your output. The pipeline as operator instructs the processor to take the distinct union of all nodes that match the combined criteria:"
<<

Although I don't fully understand, but I wasn't aware there was a distinct usage of the pipe character as an operator in xslt.

If you could please tell me what I need to do in layman terms, to convert the xml to csv file, with pipe as the delimiter.  I don't need to open the original xml files, only convert them to csv - because once that is done, I should be ok to open them in excel.

 

by: abelPosted on 2007-09-17 at 12:00:50ID: 19907335

Ok, I see the problem here. Then, from an XSLT standpoint, I can offer you a working solution, but you'll have to use the non-free version of Saxon for it. But I think XSLT is far too far an overkill for what you want.

The tool you mention won't help you: it cannot handle the size you talk of. It uses XSLT 1.0 and cannot make use of the extensions I mentioned above.

But I can help you with a method that will transform your files into CSV in minutes. I used that technique several times with success to downsize large files from SAP. All you need is an installation of Perl (or Ruby or Python, but you say you are not a programmer, so let's stick to the easiest around) and the ability to call something on the commandline.

I'll construct a nice regular expression for you that processes the file. All you need to do is tell the program what fields to pick up. You do not need to check you XML from the 'inside' or learn XSLT or Perl for it.

If that will help your problem (and I think it does), I'll be happy to provide you with a little code for doing just that.

Cheers,
-- Abel Braaksma

 

by: abelPosted on 2007-09-17 at 12:03:23ID: 19907357

Btw: your XML is not well formed. It seems to me that you loaded it into IE and than copied / pasted it from there. Instead, it is usually better to do a "view source" or to open the document in a text editor or an XML aware editor.

IE puts '-' signs in front of collapsable nodes. And it interprets entities like &amp; and shows the & instead of the source &amp;. When you copy what you see, your file is not what it was anymore ;)

 

by: abelPosted on 2007-09-17 at 15:51:28ID: 19909055

Here's a very naive, but perhaps well-enough, attempt to do what you want. It captures all the tags you mention and outputs is contents. It also captures the first merch_cat_path and leaves any additionals.

A few buts:
 - This is clearly a one-off solution as it is not really easy to maintain it in the long run
 - It is quite naive: it only prints a field when it finds one, not when it does not find one

The last one is important. If you can say for sure that the tags you mention are always there (and are just empty, but still there, when they don't have content) and in the same order, there is no problem. If not, we have to reverse the way the logic is build in Perl.

Download perl (i.e., from activestate.com) and copy the following script (starting with the '#' sign) to a text file and call it xml_to_csv.pl. Most current installers add Perl to the command path. You can then call it as follows to process your files:

perl xml_to_csv.pl < myinput.xml > myoutput.csv

Then the file "myinput.xml" obviously contains your input and the file "myoutput.csv" will contain your output. If you want to append to the output you need to use >> output.csv instead.

The tags that you want to match go into a file with the name tags_to_match.txt. This is just a list of tags where each tag goes on one line on itself (no spaces appended or anything!)

This is the perl script:
-----------------------------
#!/usr/bin/perl -w
use strict;

# little script for thyros, Expert-Exchange.com Q_22833778.html
# see http://www.experts-exchange.com/Q_22833778.html

open TAGS, "tags_to_match.txt" or die("tags file 'tags_to_match.txt' not found");
my $all_tags = do { local $/; <TAGS> };
$all_tags = join("|", split(/\n/,$all_tags));

my $merge_cat_path_count = 0;
while (<>) {
      # new record
      if(m!</item_data>!) {
            print "$/" ;
            $merge_cat_path_count = 0;
      }
      print "$1|" if m!<(?:$all_tags)>(.*)</(?:$all_tags)>!;
      if (m!<(?:merch_cat_path)>(.*)</(?:merch_cat_path)>!) {
            print "$1" if $merge_cat_path_count == 0;
            $merge_cat_path_count++;
      }
}



This is the tags_to_match.txt i used
------------------------------------------------
item_unique_id
item_ean
item_sku
item_upc
item_mpn
item_brand
item_name
item_model
item_category
item_short_desc
item_page_url
item_image_url
item_image_url_small
item_salesrank
item_price
item_inventory
item_shipping_charge
amzn_price
amzn_inventory
amzn_shipping_charge
fm_price
fm_inventory
fm_shipping_charge
tp_new_price
tp_new_inventory
tp_new_shipping_charge
tp_used_price
tp_used_inventory
tp_used_shipping_charge


This is the output I received:
--------------------------------------
0115526102|9780115526107|0115526102|Stationery Office Books|The Official Learning to Drive Pack (Driving Skills)|Book|Paperback, Stationery Office Books|http://www.amazon.co.uk/exec/obidos/ASIN/0115526102/AssocID/ref=nosim|http://ec1.images-amazon.com/images/I/31DY01WGJ1L.jpg|http://ec1.images-amazon.com/images/I/11FE11DK7YL.jpg|233499|21.23|Usually dispatched within 1-2 business days|Check Site.|24.99|Usually dispatched within 24 hours|Free!|24.99|Usually dispatched within 24 hours|Free!|21.23|Usually dispatched within 1-2 business days|Check Site.|20.00|In Stock|Check Site.|Books/Subjects/Reference/Transport/Automotive/Driving &amp; the Highway Code
0115528423|9780115528422|0115528423|The Stationary Office (TSO)|The Official DSA Theory Test for Motorcyclists CD-ROM|Software|, Platforms: Windows XP|http://www.amazon.co.uk/exec/obidos/ASIN/0115528423/AssocID/ref=nosim|http://ec1.images-amazon.com/images/I/31Au9yM7IZL.jpg|http://ec1.images-amazon.com/images/I/11ldivFAIML.jpg|1068|16.99|Not yet released|Free!|16.99|Not yet released|Free!|16.99|Not yet released|Free!|Books/Subjects/Reference/Transport/Automotive/Driving &amp; the Highway Code


Of course, the web server will word-wrap the output. It is really only two lines because your input in your question only contained two records.

HTH, let me know what you find,

Cheers,
-- Abel Braaksma

 

by: thyrosPosted on 2007-09-18 at 05:13:03ID: 19912003

Hello Abel,

I followed your instructions, and managed to process the files.  I checked the results in excel, and some of the fields were shifting to the left, i.e. merch_cat_path was showing in the shipping or stock column.  I think this is due to the element missing in the original xml when you said "it only prints a field when it finds one, not when it does not find one".  

We are using the final csv files to be parsed into a database, which relies on a consistent column structure, so the final csv would need to have something like:

item_unique_id|item_ean|iteam_upc|etc|etc|etc
12345678901|12345678|12345678|||abc
12345678901|12345678|12345678|sd||abc
12345678901|12345678|12345678|||abc

So there would be pipe characters present for each element, even if the element content was empty.  Also, if it is possible, can you please tweak the script to output the field headers into the final csv file as above.

Just to clarify, if we wanted to append results (not overwrite?) to an existing file, we'd use a command like so; ?

perl xml_to_csv.pl < myinput.xml >> myoutput.csv


Thanks for your help and patience.

 

by: abelPosted on 2007-09-18 at 06:19:00ID: 19912477

I'm glad it works so far (but not yet well enough).

yes, that would be the correct command. But, if you want the headers, you will output the headers again... I'll try to customize the script for you. I am currently @work, so bear with me for a short while (probably this evening european time).

 

by: thyrosPosted on 2007-09-18 at 06:25:29ID: 19912533

If you could, please.  Thanks very much, I appreciate it.

 

by: thyrosPosted on 2007-09-19 at 04:26:35ID: 19919646

Hi again,

If you have the time to do it, I would like to request one change please.  In the earlier post I mentioned that we only need to extract one instance of 'merch_cat_path' (since there were sometimes upto 4 or 5 copies of it), however, if it is possible and not too much trouble, I'd like to be able to extract upto 5 instances of them provided they are all on the same row but separated as usual by pipe - so there would be some more field headers..

merch_cat_path|merch_cat_path2|merch_cat_path3|merch_cat_path4|merch_cat_path5

Only if it is not too much trouble.. I appreciate one instance of it just fine aswell, so its up to you.

Thanks again.

 

by: abelPosted on 2007-09-20 at 03:38:33ID: 19927222

No problem. Here it is:

#!/usr/bin/perl -w
use strict;

# little script for thyros, Expert-Exchange.com Q_22833778.html
# see http://www.experts-exchange.com/Q_22833778.html

open TAGS, "tags_to_match.txt" or die("ERROR: tags file 'tags_to_match.txt' not found");
my $all_tags = do { local $/; <TAGS> };
my @tags = split(/\n/,$all_tags);

my $merge_cat_path_count = 0;
$/ = "</item_data>";
while (<>) {
      die("ERROR: pipeline '|' character found in data!") if /\|/;

      foreach my $tag (@tags) {
            my $tmpTag = $tag;  # work on a copy
            my $count = 1;
            $count = $1 if $tmpTag =~ s/(\d+)$//;
            print "$1" if m!(?:<$tmpTag>([^<]*)</$tmpTag>.*?){$count}!ms;            # do not set /g modifier!
            print "|";      
      }
      print "\n";
}

 

by: abelPosted on 2007-09-20 at 03:45:04ID: 19927244

Hi Thyros

It will *always* output a | character for each line in the tags-file. If the tag is found in the current record (defined by </item_data>) it will print the contents of that element.

As it is now, it will output one empty row at the end of the loop. My guess was that that wouldn't be  too much of a problem

I added minor error checking: if the file you try to parse contains | characters, it will stop processing and output an error. That is because when you would have a pipeline in the text, it would make the output wrong (shifting columns).

Finally, I added a little feature, following your request. If one field can occur more than once, for each entry, add the field name in the tags-file *plus* a number indicating the occurrence you want to capture. I.e.:

merch_cat_path1 will catch the first occurrence of merch_cat_path
merch_cat_path5 will catch the fifth occurrence of merch_cat_path

if you do not add a number, the first occurrence and only the first occurrence will be captured. Using this style, you have all the freedom to place these merch_cat_path or any other repeatable item at any place in the column order. The contents of the tags-file I used while testing is similar to the previous one, but I added:

merch_cat_path1
merch_cat_path2
merch_cat_path3
merch_cat_path4
merch_cat_path5
merch_cat_path6


Hope this one works well for you!

Cheers,
-- Abel Braaksma

 

by: thyrosPosted on 2007-09-20 at 05:45:49ID: 19927767

Hi Abel,

Thanks very much for this, it does the job great and has just made my life easier.  Just one thing:  it's not a big deal, but would it be possible to output the contents of the tags_to_match fields into the first row of the output.csv file?  If not, it's no problem, the file does the job as it is.

The selective merch_cat_path function is very nifty! and as for the pipeline error checking - this was also an excellent and very much essential addition - because in my tests, one of the first xml files tested was converted over perfectly, however in some other files which I was having problems with earlier turns out to be due to having pipe character in the content.. so this extra feature picked up on that and alerted me to it.  Simply brilliant.

I am going to post another question about how to strip the pipe character from large xml files without having to open the document - if you have any ideas.. check my recent questions in the next few minutes to post your answer.

Once again, thanks for this solution, I have spent the past 5 days non-stop researching and testing different things to make it work.. but none of them have been this elegant and simple.. so it means alot to me. Thanks.

A+

Peace..

 

by: abelPosted on 2007-09-20 at 06:01:17ID: 19927873

> Just one thing:  it's not a big deal, but would it be possible to output the contents of the tags_to_match fields into the first row of the output.csv file?  

of course it is possible, but you wanted to concatenate the output from several runs. That would mean (since the perl program is unaware of previous runs) that you would end up having the headers several times in the file, i.e., for each concatenated output.

> this was also an excellent and very much essential addition
glad to hear that it worked out well :)


> without having to open the document
well, that's not really true: the program will open it for you, but you don't see it ;)

> but none of them have been this elegant and simple.. so it means alot to me.
you're welcome! Thanks for the points!

> I am going to post another question about how to strip the pipe character
add this line under the while(<>) {
s/\|//g;

not tested though.

Cheers,
-- Abel Braaksma

PS: be aware of the fact that this little tool is *absolutely not* capable of understanding XML. It is just a quick and easy hack to get this simple job done and it does so by looking at the XML as a string, not as an XML tree. It is not (or hardly) extendable in the long run. If you want a more stable or more capable solution, there're some options.... but they will look different from this one.

 

by: abelPosted on 2007-09-20 at 06:06:24ID: 19927908

Btw, if you don't want to strip the pipe, but want it replaced with some string or character, use this instead:

s/\|/*/g;

which will replace the | with a *. You can place multiple characters between the last two //. If you get an error, you've used a character that must be escaped. Add a \ (backslash) in front of it. I.e., if you want a dollar sign:

s/\|/\$/g;

The /g modifier makes sure that *all* pipelines are replaced.

When you add this line, you can safely remove the line that starts with "die("ERROR: pipeline......". It will never be hit anymore (you've replaced the pipeline).

20120131-EE-VQP-002

3 Ways to Join

30-Day Free Trial

The Experts

98% positive feedback on 31,087 answers since March 2000. angeliii is a Microsoft Most Valuable Professional for his work with MS SQL Server & Develoment.

He has also proven his knowledge of Visual Basic Programming, PHP Scripting and Oracle Databases.

The Experts

97% positive feedback on 10,752 answers since July 2000. lrmoore has more than 18 years experience in the networking industry.

The six-time Mircosoft MVPs specialties include firewalls, virtual private networking, and network management.

Testimonials

"...and excellent source for support... Kind of like having your very own IT dept." Electriciansnet

Testimonials

"I was apprehensive at signing up at first. However... it has already made my life as an IT administrator much easier." JaCrews

Testimonials

"WOW! You guys have great, active, and knowledgeable people on here." moore50

Business Clients

Business Clients

In the Press

"If you’ve got a question... Experts Exchange can supply an answer.”

In the Press

"...an invaluable aid for both IT professionals and those who require tech support."

In the Press

"where IT professionals provide quick answers on just about any topic"

Business Account Plans

Loading Advertisement...