I am working with large XML files (ranging from 100 MB to over 1 GB), so loading them into memory is not really an option. The goal is to convert these files to a CSV-style format that is pipe (|) delimited rather than comma delimited.
The XML file looks like this:
<?xml version="1.0" encoding="UTF-8" ?>
<merch_item_feed>
<item_data>
<item_basic_data>
<item_unique_id>0115526102
<item_ean>9780115526107</i
<item_sku>0115526102</item
<item_upc />
<item_mpn />
<item_brand>Stationery Office Books</item_brand>
<item_name>The Official Learning to Drive Pack (Driving Skills)</item_name>
<item_model />
<item_category>Book</item_
<item_short_desc>Paperback
<item_page_url>http://www.
<amzn_page_url>http://www.
<offer_page_url>http://www
<offer_used_url>http://www
<item_image_url>http://ec1
<item_image_url_small>http
<item_salesrank>233499</it
<item_price>21.23</item_pr
<item_inventory>Usually dispatched within 1-2 business days</item_inventory>
<item_shipping_charge>Chec
<amzn_price>24.99</amzn_pr
<amzn_inventory>Usually dispatched within 24 hours</amzn_inventory>
<amzn_shipping_charge>Free
<fm_price>24.99</fm_price>
<fm_inventory>Usually dispatched within 24 hours</fm_inventory>
<fm_shipping_charge>Free!<
<tp_new_price>21.23</tp_ne
<tp_new_inventory>Usually dispatched within 1-2 business days</tp_new_inventory>
<tp_new_shipping_charge>Ch
<tp_used_price>20.00</tp_u
<tp_used_inventory>In Stock</tp_used_inventory>
<tp_used_shipping_charge>C
</item_basic_data>
<prod_specific_data category="book">
<known_attr_val_pair attr="book_author" val="Driving Standards Agency" />
<known_attr_val_pair attr="book_isbn" val="0115526102" />
<known_attr_val_pair attr="book_format" val="Paperback" />
</prod_specific_data>
<merch_cat_list>
<merch_cat_item>
<merch_cat_name>277082</me
<merch_cat_path>Books/Subj
</merch_cat_item>
<merch_cat_item>
<merch_cat_name>278131</me
<merch_cat_path>Books/Subj
</merch_cat_item>
<merch_cat_item>
<merch_cat_name>10834521</
<merch_cat_path>Books/Spec
</merch_cat_item>
<merch_cat_item>
<merch_cat_name>10834491</
<merch_cat_path>Books/Spec
</merch_cat_item>
</merch_cat_list>
</item_data>
<item_data>
<item_basic_data>
<item_unique_id>0115528423
<item_ean>9780115528422</i
<item_sku>0115528423</item
<item_upc />
<item_mpn />
<item_brand>The Stationary Office (TSO)</item_brand>
<item_name>The Official DSA Theory Test for Motorcyclists CD-ROM</item_name>
<item_model />
<item_category>Software</i
<item_short_desc>, Platforms: Windows XP</item_short_desc>
<item_page_url>http://www.
<amzn_page_url>http://www.
<offer_page_url>http://www
<offer_used_url>http://www
<item_image_url>http://ec1
<item_image_url_small>http
<item_salesrank>1068</item
<item_price>16.99</item_pr
<item_inventory>Not yet released</item_inventory>
<item_shipping_charge>Free
<amzn_price>16.99</amzn_pr
<amzn_inventory>Not yet released</amzn_inventory>
<amzn_shipping_charge>Free
<fm_price>16.99</fm_price>
<fm_inventory>Not yet released</fm_inventory>
<fm_shipping_charge>Free!<
</item_basic_data>
<prod_specific_data category="software">
<known_attr_val_pair attr="hardware_platform" val="PC" />
<known_attr_val_pair attr="software_os" val="Windows XP" />
<known_attr_val_pair attr="software_format" val="CD-ROM" />
</prod_specific_data>
<merch_cat_list>
<merch_cat_item>
<merch_cat_name>277082</me
<merch_cat_path>Books/Subj
</merch_cat_item>
<merch_cat_item>
<merch_cat_name>278131</me
<merch_cat_path>Books/Subj
</merch_cat_item>
<merch_cat_item>
<merch_cat_name>912026</me
<merch_cat_path>Software/C
</merch_cat_item>
<merch_cat_item>
<merch_cat_name>16305411</
<merch_cat_path>Software/C
</merch_cat_item>
<merch_cat_item>
<merch_cat_name>317243011<
<merch_cat_path>Software/C
</merch_cat_item>
<merch_cat_item>
<merch_cat_name>341610011<
<merch_cat_path>uk-shops/E
</merch_cat_item>
</merch_cat_list>
</item_data>
</merch_item_feed>
The objective is to extract the data from the 'item_basic_data' elements and separate the fields with the pipe character.
The output should look something like this (with the field headers):
item_unique_id|item_ean|it
12345678901|12345678|12345
12345678901|12345678|12345
12345678901|12345678|12345
--------------------------
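One way to get output of this shape without loading the whole file is a streaming parser instead of a tree-building transform. The following is only a minimal sketch in Python, assuming the standard-library `xml.etree.ElementTree.iterparse` and `csv` modules are acceptable. The function name `basic_data_to_pipe` and the shortened `SAMPLE` feed are made up for illustration, and the header row is taken from the first record, which is an assumption: fields missing from later records come out empty, and fields that only appear in later records would be ignored.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Shortened, hypothetical feed with the same element names as the real one.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" ?>
<merch_item_feed>
  <item_data>
    <item_basic_data>
      <item_upc />
      <item_brand>Stationery Office Books</item_brand>
      <fm_price>24.99</fm_price>
      <tp_used_inventory>In Stock</tp_used_inventory>
    </item_basic_data>
  </item_data>
  <item_data>
    <item_basic_data>
      <item_upc />
      <item_brand>The Stationary Office (TSO)</item_brand>
      <fm_price>16.99</fm_price>
    </item_basic_data>
  </item_data>
</merch_item_feed>
"""


def basic_data_to_pipe(xml_source, out, fields=None):
    """Stream item_basic_data records into pipe-delimited rows; return the row count."""
    writer = csv.writer(out, delimiter="|", lineterminator="\n")
    context = ET.iterparse(xml_source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    count = 0
    for event, elem in context:
        if event != "end":
            continue
        if elem.tag == "item_basic_data":
            if fields is None:
                # Assumption: the first record defines the column order.
                fields = [child.tag for child in elem]
            if count == 0:
                writer.writerow(fields)
            writer.writerow(
                [elem.findtext(name, default="").strip() for name in fields]
            )
            count += 1
        elif elem.tag == "item_data":
            # Drop the finished record so memory use stays roughly constant.
            root.clear()
    return count


if __name__ == "__main__":
    buffer = io.StringIO()
    basic_data_to_pipe(io.BytesIO(SAMPLE), buffer)
    print(buffer.getvalue(), end="")
```

For a real feed, the same function could be given a file path as `xml_source` and an open text file as `out`. Passing an explicit `fields` list would pin the column order instead of relying on the first record.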
Please note that only the information from 'item_basic_data' needs to be extracted, so instructions on how to accomplish that are sufficient as an answer. However, if you know your stuff, I would appreciate a solution that could also extract the first instance of 'merch_cat_path'. As the sample shows, each 'item_data' contains several 'merch_cat_path' elements (inside 'merch_cat_list'), but we only want the first one if possible.
I am assuming we will need an XSLT file, but I don't know how to write one. I am experimenting with a program that takes the input XML, applies an XSLT transform, and outputs CSV files, but it does not supply the XSLT file itself.
Also, I would welcome suggestions for similar programs that can handle and process large XML files, preferably freeware, but commercial is OK too.
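For the optional 'merch_cat_path' requirement, XSLT 1.0 processors generally build the whole document in memory, so a streaming parser may be a better fit for files of this size. The sketch below is one possible approach in Python, not a tested solution for the real feed. The function name `iter_first_cat_paths` and all values in `SAMPLE` are hypothetical; only the element names come from the sample above.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical feed: element names match the real one, values are invented.
SAMPLE = b"""<merch_item_feed>
  <item_data>
    <item_basic_data>
      <item_unique_id>A1</item_unique_id>
    </item_basic_data>
    <merch_cat_list>
      <merch_cat_item>
        <merch_cat_path>Books/First</merch_cat_path>
      </merch_cat_item>
      <merch_cat_item>
        <merch_cat_path>Books/Second</merch_cat_path>
      </merch_cat_item>
    </merch_cat_list>
  </item_data>
  <item_data>
    <item_basic_data>
      <item_unique_id>B2</item_unique_id>
    </item_basic_data>
  </item_data>
</merch_item_feed>
"""


def iter_first_cat_paths(xml_source):
    """Yield (item_unique_id, first merch_cat_path) for each item_data record."""
    context = ET.iterparse(xml_source, events=("start", "end"))
    _, root = next(context)  # keep a handle on the root so it can be pruned
    for event, elem in context:
        if event == "end" and elem.tag == "item_data":
            item_id = elem.findtext("item_basic_data/item_unique_id", default="")
            # findtext returns the first match in document order, so later
            # merch_cat_path elements in the same record are ignored.
            first_path = elem.findtext(
                "merch_cat_list/merch_cat_item/merch_cat_path", default=""
            )
            yield item_id.strip(), first_path.strip()
            root.clear()  # release the finished record


if __name__ == "__main__":
    for item_id, path in iter_first_cat_paths(io.BytesIO(SAMPLE)):
        print(f"{item_id}|{path}")
```

A record with no 'merch_cat_list' yields an empty string for the path, so the column count stays the same on every row.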
This question has been solved and verified by the asker.