Solved

Regex to extract XML contents

Posted on 2011-02-13
22
773 Views
Last Modified: 2012-05-11
Hi,

I have the following XML contents to work on:

<dvd id="A">
  <title>Lord of the Rings: The Fellowship of the Ring</title>
  <length>178</length>
  <actor>Ian Holm</actor>
  <actor>Elijah Wood</actor>
  <actor>Ian McKellen</actor>
</dvd>
<dvd id="B">
  <title>The Matrix</title>
  <length>136</length>
  <actor>Keanu Reeves</actor>
  <actor>Laurence Fishburne</actor>
</dvd>
<dvd id="C">
  <title>Amadeus</title>
  <length>158</length>
  <actor>F. Murray Abraham</actor>
  <actor>Tom Hulce</actor>
  <actor>Elizabeth Berridge</actor>
</dvd>
<cd id="A">
  <title>Lord of the Rings: The Fellowship of the Ring</title>
  <length>178</length>
  <actor>Ian Holm</actor>
  <actor>Elijah Wood</actor>
  <actor>Ian McKellen</actor>
</cd>
<cd id="B">
  <title>Lord of the Rings: The Fellowship of the Ring</title>
  <length>178</length>
  <actor>Ian Holm</actor>
  <actor>Elijah Wood</actor>
  <actor>Ian McKellen</actor>
</cd>
<rd id="A">
  <title>Lord of the Rings: The Fellowship of the Ring</title>
  <length>178</length>
  <actor>Ian Holm</actor>
  <actor>Elijah Wood</actor>
  <actor>Ian McKellen</actor>
</rd>
<dvd id="D">
  <title>Chain Reaction</title>
  <length>106</length>
  <actor>Morgan Freeman</actor>
  <actor>Keanu Reeves</actor>
</dvd>

There are "dvd", "cd" and "rd" tags commingled. I want to see if a regex can extract all the "dvd" contents including the tags.

Thanks
0
Comment
Question by:wsyy
22 Comments
 
LVL 35

Expert Comment

by:Terry Woods
ID: 34884584
Is this what you mean?

Pattern:
  Pattern re = Pattern.compile("<dvd[^>]*>.*?</dvd>",Pattern.DOTALL);

Gives result:
    [0] => Array
        (
            [0] => <dvd id="A">
  <title>Lord of the Rings: The Fellowship of the Ring</title>
  <length>178</length>
  <actor>Ian Holm</actor>
  <actor>Elijah Wood</actor>
  <actor>Ian McKellen</actor>
</dvd>
            [1] => <dvd id="B">
  <title>The Matrix</title>
  <length>136</length>
  <actor>Keanu Reeves</actor>
  <actor>Laurence Fishburne</actor>
</dvd>
            [2] => <dvd id="C">
  <title>Amadeus</title>
  <length>158</length>
  <actor>F. Murray Abraham</actor>
  <actor>Tom Hulce</actor>
  <actor>Elizabeth Berridge</actor>
</dvd>
            [3] => <dvd id="D">
  <title>Chain Reaction</title>
  <length>106</length>
  <actor>Morgan Freeman</actor>
  <actor>Keanu Reeves</actor>
</dvd>
        )
0
 
LVL 47

Expert Comment

by:for_yan
ID: 34884588
Do you need to do this with a regex, or would parsing it with SAX
also be acceptable?
0
 
LVL 7

Expert Comment

by:dimaj
ID: 34884606
I'm sure you could use regular expressions to do the job, but I'm not familiar enough with regex to construct one for you.
I did, however, find a couple of articles on how to read and parse XML in Java. Here are the links:
http://www.java-tips.org/java-se-tips/javax.xml.parsers/how-to-read-xml-file-in-java.html
http://www.java-samples.com/showtutorial.php?tutorialid=152

Hope they help

dimaj
0
 

Author Comment

by:wsyy
ID: 34884662
for_yan, you are right, SAX can be an alternative solution. Still, I would like to avoid navigating the DOM tree to get all the nodes.

Also, I would think regex is faster and easier to implement (am I right?).

dimaj, I will take a look at the articles you found.

TerryAtOpus: that is exactly what I am looking for. I will test it asap. Thanks.
0
 

Author Comment

by:wsyy
ID: 34884670
TerryAtOpus, do you have the code example? I just want to know how to get group(1) etc.
0
 
LVL 35

Accepted Solution

by:
Terry Woods earned 125 total points
ID: 34884672
I generally work with PHP, so I generated the code from www.myregextester.com - the full code generated is this (hopefully it will work for you):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Module1 {
  public static void main(String[] asd) {
    String sourcestring = "source string to match with pattern";
    // DOTALL lets '.' match line breaks, so one match can span a multi-line <dvd> element
    Pattern re = Pattern.compile("<dvd[^>]*>.*?</dvd>", Pattern.DOTALL);
    Matcher m = re.matcher(sourcestring);
    int mIdx = 0;
    while (m.find()) {
      // group 0 is the whole match; this pattern defines no other groups
      for (int groupIdx = 0; groupIdx < m.groupCount() + 1; groupIdx++) {
        System.out.println("[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
      }
      mIdx++;
    }
  }
}
0
 
LVL 47

Expert Comment

by:for_yan
ID: 34884675
With SAX you don't necessarily need to get all the nodes,
but if regex is what you want, then you should probably be happy with TerryAtOpus's solution.
0
 
LVL 92

Expert Comment

by:objects
ID: 34884676
> Also, I would think regex can be faster and easier to implement (am I right?).

regex is actually quite slow
0
 

Author Comment

by:wsyy
ID: 34884677
Pattern re = Pattern.compile("<dvd[^>]*>.*?</dvd>", Pattern.DOTALL|Pattern.MULTILINE|Pattern.CASE_INSENSITIVE);
            Matcher m = re.matcher(file);
            while (m.find()){
                  System.out.println("Group 0: " + m.group(0));
                  System.out.println("Group 1: " + m.group(1));
                  System.out.println("Group 2: " + m.group(2));
            }

This throws an exception:

Exception in thread "main" java.lang.IndexOutOfBoundsException: No group 1
      at java.util.regex.Matcher.group(Unknown Source)
0
 

Author Comment

by:wsyy
ID: 34884681
objects:

do you think SAX is better than regex? what would be a piece of SAX code that can do the same job?

thanks
0
 

Author Comment

by:wsyy
ID: 34884686
for_yan or objects:

can you please kindly provide a code example? thanks
0
 
LVL 92

Expert Comment

by:objects
ID: 34884688
sax would be a lot faster
0
 
LVL 47

Expert Comment

by:for_yan
ID: 34884689
Do you want to return XML between the DVDs or just the data - title: XXX, Author: YYYY?
0
 
LVL 35

Expert Comment

by:Terry Woods
ID: 34884691
Try this - it might shed some light on the structure of the result:

Pattern re = Pattern.compile("<dvd[^>]*>.*?</dvd>", Pattern.DOTALL|Pattern.MULTILINE|Pattern.CASE_INSENSITIVE);
            Matcher m = re.matcher(file);
            while (m.find()){
                System.out.println("match [" + m.group() + "]");
            }
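
For what it's worth, the "No group 1" exception above comes from the pattern itself: it contains no parentheses, so there are no capturing groups and only group(0) (the whole match) exists. A minimal sketch of how groups could be added, assuming every dvd tag carries its id as the first attribute in double quotes (the class and method names here are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DvdGroups {
    // Group 1 captures the id attribute value, group 2 the text between the tags.
    // Assumption: id is the first attribute and is written with double quotes.
    private static final Pattern DVD = Pattern.compile(
            "<dvd\\s+id=\"([^\"]*)\"[^>]*>(.*?)</dvd>", Pattern.DOTALL);

    /** Returns one {whole match, id, inner XML} triple per dvd element. */
    public static List<String[]> extract(String xml) {
        List<String[]> result = new ArrayList<>();
        Matcher m = DVD.matcher(xml);
        while (m.find()) {
            // group(0) always exists; group(1) and group(2) exist because
            // the pattern has two pairs of capturing parentheses
            result.add(new String[] { m.group(0), m.group(1), m.group(2) });
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<dvd id=\"A\">\n  <title>The Matrix</title>\n</dvd>\n"
                   + "<cd id=\"B\">\n  <title>Amadeus</title>\n</cd>\n";
        for (String[] hit : extract(xml)) {
            System.out.println("id    = " + hit[1]);
            System.out.println("inner = " + hit[2].trim());
        }
    }
}
```

Calling m.groupCount() before asking for a group is a cheap way to check how many groups a pattern really defines.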
0
 

Author Comment

by:wsyy
ID: 34884714
for_yan, I would like the XML between the DVDs, not just the attributes or data.
0
 

Author Comment

by:wsyy
ID: 34884722
I just noticed the following in the JDOM API:

void      XMLOutputter.output(Element element, java.io.OutputStream out)
          Print out an Element, including its Attributes, and all contained (child) elements, etc.

I think this would help if it can output to a string, not to console or a file.

Just a thought.
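
The same idea (serialize a whole element, attributes and children included, into a String) can be sketched with the JDK alone rather than JDOM. This is only an illustration under assumptions: the input is a rootless fragment with no XML declaration, so it is wrapped in a made-up <data> root first, and the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DvdToString {
    /** Returns each dvd element, tags and attributes included, as its own String. */
    public static List<String> dvdElements(String fragment) throws Exception {
        // Assumption: the fragment has no single root, so wrap it in a
        // hypothetical <data> element to make it well-formed.
        String wrapped = "<data>" + fragment + "</data>";
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(wrapped)));

        // An identity transformer copies a DOM node to a character stream.
        Transformer copier = TransformerFactory.newInstance().newTransformer();
        copier.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        List<String> result = new ArrayList<>();
        NodeList dvds = doc.getElementsByTagName("dvd");
        for (int i = 0; i < dvds.getLength(); i++) {
            StringWriter out = new StringWriter();
            copier.transform(new DOMSource(dvds.item(i)), new StreamResult(out));
            result.add(out.toString());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String fragment = "<dvd id=\"A\"><title>The Matrix</title></dvd>"
                        + "<cd id=\"B\"><title>Amadeus</title></cd>";
        for (String dvd : dvdElements(fragment)) {
            System.out.println(dvd);
        }
    }
}
```

This does build the whole DOM tree in memory, so it trades the streaming behaviour of SAX for not having to rebuild the tags by hand.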

0
 
LVL 47

Expert Comment

by:for_yan
ID: 34884772
This will give you everything, but without attributes.
Let me add attributes.

Mind that I added a top element to your file.


import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;

import javax.xml.parsers.SAXParserFactory;
import javax.xml.parsers.SAXParser;

/*
<product>
 <name>laptop</name>
 <price>120</price>
 <maker>Dell</maker>
</product>

*/
import java.util.ArrayList;



public class SimpleXMLParsing {


	static String filename = "C:\\temp\\test\\test2.xml";

    public static void getElementFromXML() {
		try {
			SAXParserFactory factory = SAXParserFactory.newInstance();
			SAXParser saxParser = factory.newSAXParser();
			DefaultHandler handler = new DefaultHandler() {
                boolean title = false;
                boolean length1 = false;
                boolean actor = false;
                boolean dvd = false;


                public void startElement(String uri, String localName,
						String qName, Attributes attributes)
						throws SAXException {

                    if (qName.toUpperCase().equalsIgnoreCase("title")) {
						title = true;
					}

                    if (qName.toUpperCase().equalsIgnoreCase("length")) {
						length1 = true;
					}
					if (qName.toUpperCase().equalsIgnoreCase("actor")) {
						actor= true;
					}
				if (qName.toUpperCase().equalsIgnoreCase("dvd")) {
						dvd= true;
					}

                    }
				public void endElement(String uri, String localName,
						String qName) throws SAXException {
                    if (qName.toUpperCase().equalsIgnoreCase("title")) {
						title = false;
					}

                    if (qName.toUpperCase().equalsIgnoreCase("length")) {
						length1 = false;
					}
					if (qName.toUpperCase().equalsIgnoreCase("actor")) {
						actor = false;
					}
					if (qName.toUpperCase().equalsIgnoreCase("dvd")) {
						dvd = false;
					}

                }
				public void characters(char ch[], int start, int length)
						throws SAXException {

                    if (title && dvd) {
                           System.out.println("<title>" + new String(ch, start, length) + "</title>");
                    }
                   if (length1 && dvd) {
                           System.out.println("<price>" + new String(ch, start, length) + "</price>");
                    }
                   if (actor && dvd) {
                           System.out.println("<actor> " + new String(ch, start, length) + "</actor>");
                    }


                    }


            };
			saxParser.parse(filename, handler);

        }
		catch (Exception e) {
			e.printStackTrace();
		}




    }

	public static void main(String args[])
	{
		SimpleXMLParsing.getElementFromXML();
    }


}


output:
<title>Lord of the Rings: The Fellowship of the Ring</title>
<price>178</price>
<actor> Ian Holm</actor>
<actor> Elijah Wood</actor>
<actor> Ian McKellen</actor>
<title>The Matrix</title>
<price>136</price>
<actor> Keanu Reeves</actor>
<actor> Laurence Fishburne</actor>
<title>Amadeus</title>
<price>158</price>
<actor> F. Murray Abraham</actor>
<actor> Tom Hulce</actor>
<actor> Elizabeth Berridge</actor>
<title>Chain Reaction</title>
<price>106</price>
<actor> Morgan Freeman</actor>
<actor> Keanu Reeves</actor>



test2.xml
0
 

Author Comment

by:wsyy
ID: 34884817
for_yan, the top elements like <dvd></dvd> or <product></product>, as well as their attributes, are needed. Thanks.
0
 
LVL 47

Assisted Solution

by:for_yan
for_yan earned 125 total points
ID: 34884818

Now it writes it all:


import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;

import javax.xml.parsers.SAXParserFactory;
import javax.xml.parsers.SAXParser;

/*
<product>
 <name>laptop</name>
 <price>120</price>
 <maker>Dell</maker>
</product>

*/
import java.util.ArrayList;



public class SimpleXMLParsing {


	static String filename = "C:\\temp\\test\\test2.xml";

    public static void getElementFromXML() {
		try {
			SAXParserFactory factory = SAXParserFactory.newInstance();
			SAXParser saxParser = factory.newSAXParser();
			DefaultHandler handler = new DefaultHandler() {
                boolean title = false;
                boolean length1 = false;
                boolean actor = false;
                boolean dvd = false;
                String dvdAttValue;
                String dvdAtt;


                public void startElement(String uri, String localName,
						String qName, Attributes attributes)
						throws SAXException {

                    if (qName.toUpperCase().equalsIgnoreCase("title")) {
						title = true;
					}

                    if (qName.toUpperCase().equalsIgnoreCase("length")) {
						length1 = true;
					}
					if (qName.toUpperCase().equalsIgnoreCase("actor")) {
						actor= true;
					}
				if (qName.toUpperCase().equalsIgnoreCase("dvd")) {
						dvd= true;
                    dvdAtt = attributes.getQName(0);
                    dvdAttValue = attributes.getValue(0);
                    System.out.println("<dvd " + dvdAtt + "=" + dvdAttValue + ">");

					}

                    }
				public void endElement(String uri, String localName,
						String qName) throws SAXException {
                    if (qName.toUpperCase().equalsIgnoreCase("title")) {
						title = false;
					}

                    if (qName.toUpperCase().equalsIgnoreCase("length")) {
						length1 = false;
					}
					if (qName.toUpperCase().equalsIgnoreCase("actor")) {
						actor = false;
					}
					if (qName.toUpperCase().equalsIgnoreCase("dvd")) {
						dvd = false;
                        System.out.println("</dvd>");
					}

                }
				public void characters(char ch[], int start, int length)
						throws SAXException {


                    if (title && dvd) {
                           System.out.println("<title>" + new String(ch, start, length) + "</title>");
                    }
                   if (length1 && dvd) {
                           System.out.println("<price>" + new String(ch, start, length) + "</price>");
                    }
                   if (actor && dvd) {
                           System.out.println("<actor> " + new String(ch, start, length) + "</actor>");
                    }


                    }


            };
			saxParser.parse(filename, handler);

        }
		catch (Exception e) {
			e.printStackTrace();
		}




    }

	public static void main(String args[])
	{
		SimpleXMLParsing.getElementFromXML();
    }


}



Output:

<dvd id=A>
<title>Lord of the Rings: The Fellowship of the Ring</title>
<price>178</price>
<actor> Ian Holm</actor>
<actor> Elijah Wood</actor>
<actor> Ian McKellen</actor>
</dvd>
<dvd id=B>
<title>The Matrix</title>
<price>136</price>
<actor> Keanu Reeves</actor>
<actor> Laurence Fishburne</actor>
</dvd>
<dvd id=C>
<title>Amadeus</title>
<price>158</price>
<actor> F. Murray Abraham</actor>
<actor> Tom Hulce</actor>
<actor> Elizabeth Berridge</actor>
</dvd>
<dvd id=D>
<title>Chain Reaction</title>
<price>106</price>
<actor> Morgan Freeman</actor>
<actor> Keanu Reeves</actor>
</dvd>


0
 
LVL 47

Expert Comment

by:for_yan
ID: 34884823
There was no products or product in your example, but that is not a problem of course
0
 

Author Closing Comment

by:wsyy
ID: 34884829
Big thanks to all!
0
 
LVL 47

Expert Comment

by:for_yan
ID: 34884843

This version also handles <products> and </products>.
I still needed to add one top-level element, like <data> and </data>,
enclosing it all, because well-formed XML
requires a single root element.

import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;

import javax.xml.parsers.SAXParserFactory;
import javax.xml.parsers.SAXParser;

/*
<product>
 <name>laptop</name>
 <price>120</price>
 <maker>Dell</maker>
</product>

*/
import java.util.ArrayList;



public class SimpleXMLParsing {


	static String filename = "C:\\temp\\test\\test2.xml";

    public static void getElementFromXML() {
		try {
			SAXParserFactory factory = SAXParserFactory.newInstance();
			SAXParser saxParser = factory.newSAXParser();
			DefaultHandler handler = new DefaultHandler() {
                boolean title = false;
                boolean length1 = false;
                boolean actor = false;
                boolean dvd = false;
                boolean products = false;
                String dvdAttValue;
                String dvdAtt;


                public void startElement(String uri, String localName,
						String qName, Attributes attributes)
						throws SAXException {

                    if (qName.toUpperCase().equalsIgnoreCase("title")) {
						title = true;
					}

                    if (qName.toUpperCase().equalsIgnoreCase("length")) {
						length1 = true;
					}
					if (qName.toUpperCase().equalsIgnoreCase("actor")) {
						actor= true;
					}
				if (qName.toUpperCase().equalsIgnoreCase("dvd")) {
						dvd= true;
                    dvdAtt = attributes.getQName(0);
                    dvdAttValue = attributes.getValue(0);
                    System.out.println("<dvd " + dvdAtt + "=" + dvdAttValue + ">");

					}
				if (qName.toUpperCase().equalsIgnoreCase("products")) {
                           System.out.println("<products>");
                }


                    }
				public void endElement(String uri, String localName,
						String qName) throws SAXException {
                    if (qName.toUpperCase().equalsIgnoreCase("title")) {
						title = false;
					}

                    if (qName.toUpperCase().equalsIgnoreCase("length")) {
						length1 = false;
					}
					if (qName.toUpperCase().equalsIgnoreCase("actor")) {
						actor = false;
					}
					if (qName.toUpperCase().equalsIgnoreCase("dvd")) {
						dvd = false;
                        System.out.println("</dvd>");
					}
					if (qName.toUpperCase().equalsIgnoreCase("products")) {
						products = false;
                        System.out.println("</products>");
					}

                }
				public void characters(char ch[], int start, int length)
						throws SAXException {


                    if (title && dvd) {
                           System.out.println("<title>" + new String(ch, start, length) + "</title>");
                    }
                   if (length1 && dvd) {
                           System.out.println("<price>" + new String(ch, start, length) + "</price>");
                    }
                   if (actor && dvd) {
                           System.out.println("<actor> " + new String(ch, start, length) + "</actor>");
                    }


                    }


            };
			saxParser.parse(filename, handler);

        }
		catch (Exception e) {
			e.printStackTrace();
		}




    }

	public static void main(String args[])
	{
		SimpleXMLParsing.getElementFromXML();
    }


}
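
A more generic variant of the handler above, offered only as a sketch: instead of one boolean per child tag, it copies whatever appears inside a dvd element into a StringBuilder, writes attribute values back with quotes, and adds the enclosing <data> root in memory rather than in the file. The class name, the collect method and the <data> wrapper are made up for illustration, and it assumes dvd elements are never nested inside each other.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class DvdSaxCollector extends DefaultHandler {
    private final List<String> dvds = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a <dvd>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (current == null && qName.equals("dvd")) {
            current = new StringBuilder();
        }
        if (current != null) {
            // Rebuild the start tag, keeping every attribute with quoted values
            current.append('<').append(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                current.append(' ').append(atts.getQName(i))
                       .append("=\"").append(escape(atts.getValue(i))).append('"');
            }
            current.append('>');
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(escape(new String(ch, start, length)));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current != null) {
            current.append("</").append(qName).append('>');
            if (qName.equals("dvd")) { // assumes no dvd nested inside a dvd
                dvds.add(current.toString());
                current = null;
            }
        }
    }

    // The parser hands back decoded text, so special characters are re-escaped
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Wraps the rootless fragment in a hypothetical <data> root and parses it. */
    public static List<String> collect(String fragment) throws Exception {
        String wrapped = "<data>" + fragment + "</data>";
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        DvdSaxCollector handler = new DvdSaxCollector();
        parser.parse(new InputSource(new StringReader(wrapped)), handler);
        return handler.dvds;
    }

    public static void main(String[] args) throws Exception {
        String fragment = "<dvd id=\"A\"><title>The Matrix</title></dvd>"
                        + "<cd id=\"B\"><title>Amadeus</title></cd>";
        for (String dvd : collect(fragment)) {
            System.out.println(dvd);
        }
    }
}
```

Because the element names come from qName, a length element is written back as <length> without any per-tag code.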


0
