Regex to extract XML contents

Hi,

I have the following XML contents to work on:

<dvd id="A">
  <title>Lord of the Rings: The Fellowship of the Ring</title>
  <length>178</length>
  <actor>Ian Holm</actor>
  <actor>Elijah Wood</actor>
  <actor>Ian McKellen</actor>
</dvd>
<dvd id="B">
  <title>The Matrix</title>
  <length>136</length>
  <actor>Keanu Reeves</actor>
  <actor>Laurence Fishburne</actor>
</dvd>
<dvd id="C">
  <title>Amadeus</title>
  <length>158</length>
  <actor>F. Murray Abraham</actor>
  <actor>Tom Hulce</actor>
  <actor>Elizabeth Berridge</actor>
</dvd>
<cd id="A">
  <title>Lord of the Rings: The Fellowship of the Ring</title>
  <length>178</length>
  <actor>Ian Holm</actor>
  <actor>Elijah Wood</actor>
  <actor>Ian McKellen</actor>
</cd>
<cd id="B">
  <title>Lord of the Rings: The Fellowship of the Ring</title>
  <length>178</length>
  <actor>Ian Holm</actor>
  <actor>Elijah Wood</actor>
  <actor>Ian McKellen</actor>
</cd>
<rd id="A">
  <title>Lord of the Rings: The Fellowship of the Ring</title>
  <length>178</length>
  <actor>Ian Holm</actor>
  <actor>Elijah Wood</actor>
  <actor>Ian McKellen</actor>
</rd>
<dvd id="D">
  <title>Chain Reaction</title>
  <length>106</length>
  <actor>Morgan Freeman</actor>
  <actor>Keanu Reeves</actor>
</dvd>

There are "dvd", "cd" and "rd" tags commingled. I want to see if a regex can extract all the "dvd" contents including the tags.

Thanks
wsyyAsked:
Terry WoodsIT GuruCommented:
Is this what you mean?

Pattern:
  Pattern re = Pattern.compile("<dvd[^>]*>.*?</dvd>",Pattern.DOTALL);

Gives result:
    [0] => Array
        (
            [0] => <dvd id="A">
  <title>Lord of the Rings: The Fellowship of the Ring</title>
  <length>178</length>
  <actor>Ian Holm</actor>
  <actor>Elijah Wood</actor>
  <actor>Ian McKellen</actor>
</dvd>
            [1] => <dvd id="B">
  <title>The Matrix</title>
  <length>136</length>
  <actor>Keanu Reeves</actor>
  <actor>Laurence Fishburne</actor>
</dvd>
            [2] => <dvd id="C">
  <title>Amadeus</title>
  <length>158</length>
  <actor>F. Murray Abraham</actor>
  <actor>Tom Hulce</actor>
  <actor>Elizabeth Berridge</actor>
</dvd>
            [3] => <dvd id="D">
  <title>Chain Reaction</title>
  <length>106</length>
  <actor>Morgan Freeman</actor>
  <actor>Keanu Reeves</actor>
</dvd>
        )
0
for_yanCommented:
Do you need to do it with regex, or would parsing it with SAX
also be acceptable?
0
dimajCommented:
I'm sure that you could use regular expressions to do the job; I'm just not familiar enough with regex to construct one for you...
I did, however, find a couple of articles on how to read and parse XML using Java. Here are the links:
http://www.java-tips.org/java-se-tips/javax.xml.parsers/how-to-read-xml-file-in-java.html
http://www.java-samples.com/showtutorial.php?tutorialid=152

Hope they help

dimaj
0
wsyyAuthor Commented:
for_yan, you are right, SAX can be an alternative solution. However, I would like to avoid navigating the DOM tree to get all the nodes.

Also, I would think regex can be faster and easier to implement (am I right?).

dimaj, I will take a look at what you come up with.

TerryAtOpus: that is exactly what I am looking for. I will test it asap. Thanks.
0
wsyyAuthor Commented:
TerryAtOpus, do you have the code example? I just want to know how to get group(1) etc.
0
Terry WoodsIT GuruCommented:
I generally work with PHP, so I generated the code from www.myregextester.com - the full code generated is this (hopefully it will work for you):

import java.util.regex.Pattern;
import java.util.regex.Matcher;
class Module1{
  public static void main(String[] asd){
  String sourcestring = "source string to match with pattern";
  Pattern re = Pattern.compile("<dvd[^>]*>.*?</dvd>",Pattern.DOTALL);
  Matcher m = re.matcher(sourcestring);
  int mIdx = 0;
    while (m.find()){
      for( int groupIdx = 0; groupIdx < m.groupCount()+1; groupIdx++ ){
        System.out.println( "[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
      }
      mIdx++;
    }
  }
}
0

for_yanCommented:
With SAX you don't necessarily need to collect all the nodes,
but if regex is what you want, then you should probably be happy with TerryAtOpus's solution.
0
objectsCommented:
> Also, I would think regex can be faster and easier to implement (am I right?).

regex is actually quite slow
0
wsyyAuthor Commented:
Pattern re = Pattern.compile("<dvd[^>]*>.*?</dvd>", Pattern.DOTALL|Pattern.MULTILINE|Pattern.CASE_INSENSITIVE);
            Matcher m = re.matcher(file);
            while (m.find()){
                  System.out.println("Group 0: " + m.group(0));
                  System.out.println("Group 1: " + m.group(1));
                  System.out.println("Group 2: " + m.group(2));
            }

This throws an exception:

Exception in thread "main" java.lang.IndexOutOfBoundsException: No group 1
      at java.util.regex.Matcher.group(Unknown Source)
0
wsyyAuthor Commented:
objects:

do you think SAX is better than regex? what would be a piece of SAX code that can do the same job?

thanks
0
wsyyAuthor Commented:
for_yan or objects:

can you please kindly provide a code example? thanks
0
objectsCommented:
SAX would be a lot faster
0
for_yanCommented:
Do you want to return the XML between the DVDs, or just the data (title: XXX, author: YYYY)?
0
Terry WoodsIT GuruCommented:
Try this - it might shed some light on the structure of the result:

Pattern re = Pattern.compile("<dvd[^>]*>.*?</dvd>", Pattern.DOTALL|Pattern.MULTILINE|Pattern.CASE_INSENSITIVE);
            Matcher m = re.matcher(file);
            while (m.find()){
                System.out.println("match [" + m.group() + "]");
            }
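
The "No group 1" exception reported earlier follows from the pattern itself: it contains no parentheses, so the only group available is group 0, the whole match. Below is a minimal sketch, assuming the goal is to read the id attribute and the inner content separately. The class name DvdGroups, the helper extract, and the two capturing groups are illustrative additions, not part of the original pattern, and the pattern assumes id is the first attribute.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DvdGroups {
    // Group 1 captures the id attribute value, group 2 the text between the tags.
    private static final Pattern DVD = Pattern.compile(
            "<dvd\\s+id=\"([^\"]*)\"[^>]*>(.*?)</dvd>", Pattern.DOTALL);

    // Returns one {whole match, id, inner content} triple per dvd element.
    static List<String[]> extract(String xml) {
        List<String[]> result = new ArrayList<>();
        Matcher m = DVD.matcher(xml);
        while (m.find()) {
            // group(0) always exists; group(1) and group(2) exist only
            // because the pattern now contains two pairs of parentheses.
            result.add(new String[] { m.group(0), m.group(1), m.group(2) });
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<dvd id=\"A\">\n  <title>Amadeus</title>\n</dvd>\n"
                + "<cd id=\"B\">\n  <title>Skipped</title>\n</cd>\n"
                + "<dvd id=\"D\">\n  <title>Chain Reaction</title>\n</dvd>\n";
        for (String[] dvd : extract(xml)) {
            System.out.println("id=" + dvd[1] + " inner=[" + dvd[2].trim() + "]");
        }
    }
}
```

m.groupCount() reports how many capturing groups the pattern defines, so checking it before calling m.group(n) avoids the IndexOutOfBoundsException.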
0
wsyyAuthor Commented:
for_yan, i would like the XML between the DVDs, not just the attributes or data.
0
wsyyAuthor Commented:
I just noticed the following in the JDOM API:

void      XMLOutputter.output(Element element, java.io.OutputStream out)
          Print out an Element, including its Attributes, and all contained (child) elements, etc.

I think this would help if it can output to a string, not to console or a file.

Just a thought.
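
For output to a string, JDOM's XMLOutputter also has an outputString(Element) method. The same idea can be tried with the JDK alone. Here is a minimal sketch, assuming the input has a single enclosing root element (called data here, an arbitrary name); the class DvdToString and the method dvdElements are hypothetical names. It looks up the dvd elements by name rather than walking the tree, and serializes each one with an identity Transformer into a StringWriter.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DvdToString {
    // Serializes every <dvd> element, tags and attributes included, to its own String.
    static List<String> dvdElements(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        // A Transformer created without a stylesheet copies its input unchanged.
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        List<String> result = new ArrayList<>();
        NodeList dvds = doc.getElementsByTagName("dvd");
        for (int i = 0; i < dvds.getLength(); i++) {
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(dvds.item(i)), new StreamResult(out));
            result.add(out.toString());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // The <data> wrapper is an assumption; it makes the input well-formed.
        String xml = "<data>"
                + "<dvd id=\"B\"><title>The Matrix</title><length>136</length></dvd>"
                + "<cd id=\"A\"><title>Skipped</title></cd>"
                + "<dvd id=\"D\"><title>Chain Reaction</title></dvd>"
                + "</data>";
        for (String dvd : dvdElements(xml)) {
            System.out.println(dvd);
        }
    }
}
```

This loads the whole document into memory, so it suits small files better than very large ones.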

0
for_yanCommented:
This will give you everything, but without attributes.
Let me add the attributes.

Note that I added a top-level element to your file.


import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;

import javax.xml.parsers.SAXParserFactory;
import javax.xml.parsers.SAXParser;

/*
<product>
 <name>laptop</name>
 <price>120</price>
 <maker>Dell</maker>
</product>

*/
import java.util.ArrayList;



public class SimpleXMLParsing {


	static String filename = "C:\\temp\\test\\test2.xml";

    public static void getElementFromXML() {
		try {
			SAXParserFactory factory = SAXParserFactory.newInstance();
			SAXParser saxParser = factory.newSAXParser();
			DefaultHandler handler = new DefaultHandler() {
                boolean title = false;
                boolean length1 = false;
                boolean actor = false;
                boolean dvd = false;


                public void startElement(String uri, String localName,
						String qName, Attributes attributes)
						throws SAXException {

                    if (qName.toUpperCase().equalsIgnoreCase("title")) {
						title = true;
					}

                    if (qName.toUpperCase().equalsIgnoreCase("length")) {
						length1 = true;
					}
					if (qName.toUpperCase().equalsIgnoreCase("actor")) {
						actor= true;
					}
				if (qName.toUpperCase().equalsIgnoreCase("dvd")) {
						dvd= true;
					}

                    }
				public void endElement(String uri, String localName,
						String qName) throws SAXException {
                    if (qName.toUpperCase().equalsIgnoreCase("title")) {
						title = false;
					}

                    if (qName.toUpperCase().equalsIgnoreCase("length")) {
						length1 = false;
					}
					if (qName.toUpperCase().equalsIgnoreCase("actor")) {
						actor = false;
					}
					if (qName.toUpperCase().equalsIgnoreCase("dvd")) {
						dvd = false;
					}

                }
				public void characters(char ch[], int start, int length)
						throws SAXException {

                    if (title && dvd) {
                           System.out.println("<title>" + new String(ch, start, length) + "</title>");
                    }
                   if (length1 && dvd) {
                           // The source element is <length>, so print it under that name
                           System.out.println("<length>" + new String(ch, start, length) + "</length>");
                    }
                   if (actor && dvd) {
                           System.out.println("<actor>" + new String(ch, start, length) + "</actor>");
                    }

                    }

            };
			saxParser.parse(filename, handler);

        }
		catch (Exception e) {
			e.printStackTrace();
		}

    }

	public static void main(String args[])
	{
		SimpleXMLParsing.getElementFromXML();
    }

}

output:
<title>Lord of the Rings: The Fellowship of the Ring</title>
<length>178</length>
<actor>Ian Holm</actor>
<actor>Elijah Wood</actor>
<actor>Ian McKellen</actor>
<title>The Matrix</title>
<length>136</length>
<actor>Keanu Reeves</actor>
<actor>Laurence Fishburne</actor>
<title>Amadeus</title>
<length>158</length>
<actor>F. Murray Abraham</actor>
<actor>Tom Hulce</actor>
<actor>Elizabeth Berridge</actor>
<title>Chain Reaction</title>
<length>106</length>
<actor>Morgan Freeman</actor>
<actor>Keanu Reeves</actor>



test2.xml
0
wsyyAuthor Commented:
for_yan, the top-level elements like <dvd></dvd> or <product></product>, as well as their attributes, are needed. Thanks.
0
for_yanCommented:

Now it writes it all:


import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;

import javax.xml.parsers.SAXParserFactory;
import javax.xml.parsers.SAXParser;

/*
<product>
 <name>laptop</name>
 <price>120</price>
 <maker>Dell</maker>
</product>

*/
import java.util.ArrayList;



public class SimpleXMLParsing {


	static String filename = "C:\\temp\\test\\test2.xml";

    public static void getElementFromXML() {
		try {
			SAXParserFactory factory = SAXParserFactory.newInstance();
			SAXParser saxParser = factory.newSAXParser();
			DefaultHandler handler = new DefaultHandler() {
                boolean title = false;
                boolean length1 = false;
                boolean actor = false;
                boolean dvd = false;
                String dvdAttValue;
                String dvdAtt;


                public void startElement(String uri, String localName,
						String qName, Attributes attributes)
						throws SAXException {

                    if (qName.toUpperCase().equalsIgnoreCase("title")) {
						title = true;
					}

                    if (qName.toUpperCase().equalsIgnoreCase("length")) {
						length1 = true;
					}
					if (qName.toUpperCase().equalsIgnoreCase("actor")) {
						actor= true;
					}
				if (qName.toUpperCase().equalsIgnoreCase("dvd")) {
						dvd= true;
                    dvdAtt = attributes.getQName(0);
                    dvdAttValue = attributes.getValue(0);
                    // Quote the attribute value so the printed tag is well-formed XML
                    System.out.println("<dvd " + dvdAtt + "=\"" + dvdAttValue + "\">");

					}

                    }
				public void endElement(String uri, String localName,
						String qName) throws SAXException {
                    if (qName.toUpperCase().equalsIgnoreCase("title")) {
						title = false;
					}

                    if (qName.toUpperCase().equalsIgnoreCase("length")) {
						length1 = false;
					}
					if (qName.toUpperCase().equalsIgnoreCase("actor")) {
						actor = false;
					}
					if (qName.toUpperCase().equalsIgnoreCase("dvd")) {
						dvd = false;
                        System.out.println("</dvd>");
					}

                }
				public void characters(char ch[], int start, int length)
						throws SAXException {

                    if (title && dvd) {
                           System.out.println("<title>" + new String(ch, start, length) + "</title>");
                    }
                   if (length1 && dvd) {
                           // The source element is <length>, so print it under that name
                           System.out.println("<length>" + new String(ch, start, length) + "</length>");
                    }
                   if (actor && dvd) {
                           System.out.println("<actor>" + new String(ch, start, length) + "</actor>");
                    }

                    }

            };
			saxParser.parse(filename, handler);

        }
		catch (Exception e) {
			e.printStackTrace();
		}

    }

	public static void main(String args[])
	{
		SimpleXMLParsing.getElementFromXML();
    }

}

Output:

<dvd id="A">
<title>Lord of the Rings: The Fellowship of the Ring</title>
<length>178</length>
<actor>Ian Holm</actor>
<actor>Elijah Wood</actor>
<actor>Ian McKellen</actor>
</dvd>
<dvd id="B">
<title>The Matrix</title>
<length>136</length>
<actor>Keanu Reeves</actor>
<actor>Laurence Fishburne</actor>
</dvd>
<dvd id="C">
<title>Amadeus</title>
<length>158</length>
<actor>F. Murray Abraham</actor>
<actor>Tom Hulce</actor>
<actor>Elizabeth Berridge</actor>
</dvd>
<dvd id="D">
<title>Chain Reaction</title>
<length>106</length>
<actor>Morgan Freeman</actor>
<actor>Keanu Reeves</actor>
</dvd>
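
If each dvd element is needed as a String rather than on the console, the handler can append to a StringBuilder instead of printing. The following is a rough sketch under a few assumptions: dvd elements are never nested inside one another, the input has a single root element, and the names DvdCollector, collect and escape are made up for illustration. It copies whatever child elements and attributes it meets, so it does not need one boolean flag per tag.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class DvdCollector extends DefaultHandler {
    private final List<String> dvds = new ArrayList<>();
    private StringBuilder current;   // non-null only while inside a <dvd>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (current == null && qName.equals("dvd")) {
            current = new StringBuilder();
        }
        if (current != null) {
            // Rebuild the opening tag with all attributes, values quoted.
            current.append('<').append(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                current.append(' ').append(atts.getQName(i))
                       .append("=\"").append(escape(atts.getValue(i))).append('"');
            }
            current.append('>');
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(escape(new String(ch, start, length)));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current != null) {
            current.append("</").append(qName).append('>');
            if (qName.equals("dvd")) {      // assumes dvd elements are not nested
                dvds.add(current.toString());
                current = null;
            }
        }
    }

    // The parser hands over decoded text, so markup characters are re-escaped.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    static List<String> collect(String xml) throws Exception {
        DvdCollector handler = new DvdCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.dvds;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<data>"
                + "<dvd id=\"A\"><title>Amadeus</title><length>158</length></dvd>"
                + "<cd id=\"A\"><title>Skipped</title></cd>"
                + "</data>";
        for (String dvd : collect(xml)) {
            System.out.println(dvd);
        }
    }
}
```

Empty elements written as <x/> in the input come back as <x></x>, which is equivalent XML.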


0
for_yanCommented:
There were no <products> or <product> elements in your example, but that is not a problem, of course.
0
wsyyAuthor Commented:
Big thanks to all!
0
for_yanCommented:

This version handles <products> and </products>.
I still needed to add one top-level element, such as <data> and </data>,
enclosing it all, because a well-formed XML document
must have exactly one root element.

import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;

import javax.xml.parsers.SAXParserFactory;
import javax.xml.parsers.SAXParser;

/*
<product>
 <name>laptop</name>
 <price>120</price>
 <maker>Dell</maker>
</product>

*/
import java.util.ArrayList;



public class SimpleXMLParsing {


	static String filename = "C:\\temp\\test\\test2.xml";

    public static void getElementFromXML() {
		try {
			SAXParserFactory factory = SAXParserFactory.newInstance();
			SAXParser saxParser = factory.newSAXParser();
			DefaultHandler handler = new DefaultHandler() {
                boolean title = false;
                boolean length1 = false;
                boolean actor = false;
                boolean dvd = false;
                boolean products = false;
                String dvdAttValue;
                String dvdAtt;


                public void startElement(String uri, String localName,
						String qName, Attributes attributes)
						throws SAXException {

                    if (qName.toUpperCase().equalsIgnoreCase("title")) {
						title = true;
					}

                    if (qName.toUpperCase().equalsIgnoreCase("length")) {
						length1 = true;
					}
					if (qName.toUpperCase().equalsIgnoreCase("actor")) {
						actor= true;
					}
				if (qName.toUpperCase().equalsIgnoreCase("dvd")) {
						dvd= true;
                    dvdAtt = attributes.getQName(0);
                    dvdAttValue = attributes.getValue(0);
                    // Quote the attribute value so the printed tag is well-formed XML
                    System.out.println("<dvd " + dvdAtt + "=\"" + dvdAttValue + "\">");

					}
				if (qName.toUpperCase().equalsIgnoreCase("products")) {
                           System.out.println("<products>");
                }


                    }
				public void endElement(String uri, String localName,
						String qName) throws SAXException {
                    if (qName.toUpperCase().equalsIgnoreCase("title")) {
						title = false;
					}

                    if (qName.toUpperCase().equalsIgnoreCase("length")) {
						length1 = false;
					}
					if (qName.toUpperCase().equalsIgnoreCase("actor")) {
						actor = false;
					}
					if (qName.toUpperCase().equalsIgnoreCase("dvd")) {
						dvd = false;
                        System.out.println("</dvd>");
					}
					if (qName.toUpperCase().equalsIgnoreCase("products")) {
						products = false;
                        System.out.println("</products>");
					}

                }
				public void characters(char ch[], int start, int length)
						throws SAXException {


                    if (title && dvd) {
                           System.out.println("<title>" + new String(ch, start, length) + "</title>");
                    }
                   if (length1 && dvd) {
                           // The source element is <length>, so print it under that name
                           System.out.println("<length>" + new String(ch, start, length) + "</length>");
                    }
                   if (actor && dvd) {
                           System.out.println("<actor>" + new String(ch, start, length) + "</actor>");
                    }


                    }


            };
			saxParser.parse(filename, handler);

        }
		catch (Exception e) {
			e.printStackTrace();
		}




    }

	public static void main(String args[])
	{
		SimpleXMLParsing.getElementFromXML();
    }


}
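
The enclosing top-level element does not have to be written into the file itself. Here is a hedged sketch, assuming the file is UTF-8 (or plain ASCII) and carries no XML declaration or byte-order mark: the stream is wrapped with an opening and a closing tag on the fly before it reaches the parser. The class RootWrapper, the methods wrap and countDvds, and the element name data are all illustrative choices.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class RootWrapper {
    // Surrounds a stream of sibling elements with <data>...</data>.
    static InputStream wrap(InputStream fragment) {
        InputStream open = new ByteArrayInputStream("<data>".getBytes(StandardCharsets.UTF_8));
        InputStream close = new ByteArrayInputStream("</data>".getBytes(StandardCharsets.UTF_8));
        // SequenceInputStream reads the three streams one after another.
        return new SequenceInputStream(
                Collections.enumeration(Arrays.asList(open, fragment, close)));
    }

    // Parses the wrapped stream with SAX and counts the dvd start tags.
    static int countDvds(InputStream fragment) throws Exception {
        final int[] count = {0};
        SAXParserFactory.newInstance().newSAXParser().parse(wrap(fragment), new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (qName.equals("dvd")) {
                    count[0]++;
                }
            }
        });
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        String fragment = "<dvd id=\"A\"><title>Amadeus</title></dvd>\n"
                + "<cd id=\"A\"><title>Skipped</title></cd>\n"
                + "<dvd id=\"D\"><title>Chain Reaction</title></dvd>\n";
        InputStream in = new ByteArrayInputStream(fragment.getBytes(StandardCharsets.UTF_8));
        System.out.println("dvd elements: " + countDvds(in));
    }
}
```

For a real file, a FileInputStream would take the place of the ByteArrayInputStream in main.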


0