Solved

Regex to extract XML contents

Posted on 2011-02-13
Last Modified: 2012-05-11
Hi,

I have the following XML contents to work on:

<dvd id="A">
  <title>Lord of the Rings: The Fellowship of the Ring</title>
  <length>178</length>
  <actor>Ian Holm</actor>
  <actor>Elijah Wood</actor>
  <actor>Ian McKellen</actor>
</dvd>
<dvd id="B">
  <title>The Matrix</title>
  <length>136</length>
  <actor>Keanu Reeves</actor>
  <actor>Laurence Fishburne</actor>
</dvd>
<dvd id="C">
  <title>Amadeus</title>
  <length>158</length>
  <actor>F. Murray Abraham</actor>
  <actor>Tom Hulce</actor>
  <actor>Elizabeth Berridge</actor>
</dvd>
<cd id="A">
  <title>Lord of the Rings: The Fellowship of the Ring</title>
  <length>178</length>
  <actor>Ian Holm</actor>
  <actor>Elijah Wood</actor>
  <actor>Ian McKellen</actor>
</cd>
<cd id="B">
  <title>Lord of the Rings: The Fellowship of the Ring</title>
  <length>178</length>
  <actor>Ian Holm</actor>
  <actor>Elijah Wood</actor>
  <actor>Ian McKellen</actor>
</cd>
<rd id="A">
  <title>Lord of the Rings: The Fellowship of the Ring</title>
  <length>178</length>
  <actor>Ian Holm</actor>
  <actor>Elijah Wood</actor>
  <actor>Ian McKellen</actor>
</rd>
<dvd id="D">
  <title>Chain Reaction</title>
  <length>106</length>
  <actor>Morgan Freeman</actor>
  <actor>Keanu Reeves</actor>
</dvd>

There are "dvd", "cd" and "rd" tags commingled. I want to see if a regex can extract all the "dvd" contents including the tags.

Thanks
Question by:wsyy
22 Comments
 
LVL 35

Expert Comment

by:Terry Woods
ID: 34884584
Is this what you mean?

Pattern:
  Pattern re = Pattern.compile("<dvd[^>]*>.*?</dvd>",Pattern.DOTALL);

Gives result:
    [0] => Array
        (
            [0] => <dvd id="A">
  <title>Lord of the Rings: The Fellowship of the Ring</title>
  <length>178</length>
  <actor>Ian Holm</actor>
  <actor>Elijah Wood</actor>
  <actor>Ian McKellen</actor>
</dvd>
            [1] => <dvd id="B">
  <title>The Matrix</title>
  <length>136</length>
  <actor>Keanu Reeves</actor>
  <actor>Laurence Fishburne</actor>
</dvd>
            [2] => <dvd id="C">
  <title>Amadeus</title>
  <length>158</length>
  <actor>F. Murray Abraham</actor>
  <actor>Tom Hulce</actor>
  <actor>Elizabeth Berridge</actor>
</dvd>
            [3] => <dvd id="D">
  <title>Chain Reaction</title>
  <length>106</length>
  <actor>Morgan Freeman</actor>
  <actor>Keanu Reeves</actor>
</dvd>
        )
0
 
LVL 47

Expert Comment

by:for_yan
ID: 34884588
Do you necessarily need to do it through regex, or would parsing it
with SAX also be acceptable?
0
 
LVL 7

Expert Comment

by:dimaj
ID: 34884606
I'm sure you could use regular expressions to do the job; I'm just not familiar enough with regex to construct one for you.
I did, however, find a couple of articles on how to read and parse XML using Java. Here are the links:
http://www.java-tips.org/java-se-tips/javax.xml.parsers/how-to-read-xml-file-in-java.html
http://www.java-samples.com/showtutorial.php?tutorialid=152

Hope they help

dimaj
0
 

Author Comment

by:wsyy
ID: 34884662
for_yan, you are right, SAX can be an alternative solution. Yet I would like to avoid navigating the DOM tree to get all the nodes.

Also, I would think regex can be faster and easier to implement (am I right?).

dimaj, I will take a look at what you came up with.

TerryAtOpus: that is exactly what I am looking for. I will do a test asap. Thanks.
0
 

Author Comment

by:wsyy
ID: 34884670
TerryAtOpus, do you have the code example? I just want to know how to get group(1) etc.
0
 
LVL 35

Accepted Solution

by:
Terry Woods earned 125 total points
ID: 34884672
I generally work with PHP, so I generated the code from www.myregextester.com - the full code generated is this (hopefully it will work for you):

import java.util.regex.Pattern;
import java.util.regex.Matcher;
class Module1{
  public static void main(String[] asd){
  String sourcestring = "source string to match with pattern";
  Pattern re = Pattern.compile("<dvd[^>]*>.*?</dvd>",Pattern.DOTALL);
  Matcher m = re.matcher(sourcestring);
  int mIdx = 0;
    while (m.find()){
      for( int groupIdx = 0; groupIdx < m.groupCount()+1; groupIdx++ ){
        System.out.println( "[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
      }
      mIdx++;
    }
  }
}
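
For reuse, the same pattern can be wrapped in a small helper that returns the matched blocks as a list. This is only a sketch: the class and method names (`DvdBlockExtractor`, `extractDvdBlocks`) are made up for illustration, and it assumes `<dvd>` elements are never nested inside each other and that no other tag name starts with "dvd" (the pattern would also match something like `<dvdset>`).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DvdBlockExtractor {
    // Same pattern as above: non-greedy match from an opening <dvd ...> tag
    // to the nearest </dvd>. DOTALL lets '.' match line breaks as well.
    private static final Pattern DVD_BLOCK =
            Pattern.compile("<dvd[^>]*>.*?</dvd>", Pattern.DOTALL);

    // Hypothetical helper: returns every <dvd>...</dvd> block, tags included.
    static List<String> extractDvdBlocks(String xml) {
        List<String> blocks = new ArrayList<String>();
        Matcher m = DVD_BLOCK.matcher(xml);
        while (m.find()) {
            blocks.add(m.group()); // group() is group(0), the whole match
        }
        return blocks;
    }

    public static void main(String[] args) {
        // A shortened version of the sample data from the question.
        String xml =
              "<dvd id=\"B\">\n  <title>The Matrix</title>\n</dvd>\n"
            + "<cd id=\"A\">\n  <title>Lord of the Rings: The Fellowship of the Ring</title>\n</cd>\n"
            + "<dvd id=\"D\">\n  <title>Chain Reaction</title>\n</dvd>\n";
        for (String block : extractDvdBlocks(xml)) {
            System.out.println(block);
            System.out.println("----");
        }
    }
}
```

The `cd` block is skipped because the pattern only starts matching at a literal `<dvd`.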
0
 
LVL 47

Expert Comment

by:for_yan
ID: 34884675
With SAX you don't necessarily need to get all the nodes,
but since regex is what you want, you should probably be happy with TerryAtOpus's solution.
0
 
LVL 92

Expert Comment

by:objects
ID: 34884676
> Also, I would think regex can be faster and easier to implement (am I right?).

regex is actually quite slow
0
 

Author Comment

by:wsyy
ID: 34884677
Pattern re = Pattern.compile("<dvd[^>]*>.*?</dvd>", Pattern.DOTALL|Pattern.MULTILINE|Pattern.CASE_INSENSITIVE);
            Matcher m = re.matcher(file);
            while (m.find()){
                  System.out.println("Group 0: " + m.group(0));
                  System.out.println("Group 1: " + m.group(1));
                  System.out.println("Group 2: " + m.group(2));
            }

System throws out exception:

Exception in thread "main" java.lang.IndexOutOfBoundsException: No group 1
      at java.util.regex.Matcher.group(Unknown Source)
0
 

Author Comment

by:wsyy
ID: 34884681
objects:

do you think SAX is better than regex? what would be a piece of SAX code that can do the same job?

thanks
0
 

Author Comment

by:wsyy
ID: 34884686
for_yan or objects:

can you please kindly provide a code example? thanks
0

 
LVL 92

Expert Comment

by:objects
ID: 34884688
sax would be a lot faster
0
 
LVL 47

Expert Comment

by:for_yan
ID: 34884689
Do you want to return XML between the DVDs or just the data (title: XXX, Author: YYYY)?
0
 
LVL 35

Expert Comment

by:Terry Woods
ID: 34884691
Try this - it might shed some light on the structure of the result:

Pattern re = Pattern.compile("<dvd[^>]*>.*?</dvd>", Pattern.DOTALL|Pattern.MULTILINE|Pattern.CASE_INSENSITIVE);
            Matcher m = re.matcher(file);
            while (m.find()){
                System.out.println("match [" + m.group() + "]");
            }
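
The "No group 1" exception is raised because the pattern contains no parentheses, so the only group available is group 0 (the whole match). If separate pieces are wanted, capturing groups have to be added to the pattern. A minimal sketch, assuming the `id` attribute comes first and is double-quoted as in the sample data (the class name `DvdGroupsDemo` is made up for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DvdGroupsDemo {
    // Group 1: value of the id attribute; group 2: everything between the tags.
    static final Pattern DVD =
            Pattern.compile("<dvd\\s+id=\"([^\"]*)\"[^>]*>(.*?)</dvd>", Pattern.DOTALL);

    public static void main(String[] args) {
        // A shortened version of the sample data from the question.
        String xml =
              "<dvd id=\"B\">\n  <title>The Matrix</title>\n</dvd>\n"
            + "<cd id=\"A\">\n  <title>Lord of the Rings: The Fellowship of the Ring</title>\n</cd>\n";
        Matcher m = DVD.matcher(xml);
        while (m.find()) {
            // groupCount() is 2 here, so groups 0, 1 and 2 are all valid.
            System.out.println("Group 0 (whole match): " + m.group(0));
            System.out.println("Group 1 (id): " + m.group(1));
            System.out.println("Group 2 (inner XML): " + m.group(2).trim());
        }
    }
}
```

Checking `m.groupCount()` before asking for a group, as the generated code in the accepted solution does, avoids the exception regardless of the pattern.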
0
 

Author Comment

by:wsyy
ID: 34884714
for_yan, i would like the XML between the DVDs, not just the attributes or data.
0
 

Author Comment

by:wsyy
ID: 34884722
I just noticed the following in the JDOM API:

void      XMLOutputter.output(Element element, java.io.OutputStream out)
          Print out an Element, including its Attributes, and all contained (child) elements, etc.

I think this would help if it can output to a string, not to console or a file.

Just a thought.
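
For comparison, the same idea (serialize an element, including its attributes and children, into a string) can be sketched with the JDK's own DOM and `Transformer` classes instead of JDOM, writing into a `StringWriter` rather than to the console or a file. This is a swapped-in technique, not the JDOM call quoted above. The class and method names are made up for illustration, and the `<data>` wrapper element is an assumption, added so that the input has a single root.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DvdToString {
    // Hypothetical helper: returns each <dvd> element, with its attributes
    // and child elements, serialized as a String.
    static List<String> dvdElementsAsStrings(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            // An identity transformer copies a DOM node to the output unchanged.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            List<String> result = new ArrayList<String>();
            NodeList dvds = doc.getElementsByTagName("dvd");
            for (int i = 0; i < dvds.getLength(); i++) {
                StringWriter out = new StringWriter();
                serializer.transform(new DOMSource(dvds.item(i)), new StreamResult(out));
                result.add(out.toString());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse or serialize XML", e);
        }
    }

    public static void main(String[] args) {
        // <data> is an assumed wrapper so the document has one root element.
        String xml = "<data>"
            + "<dvd id=\"B\">\n  <title>The Matrix</title>\n</dvd>\n"
            + "<cd id=\"A\">\n  <title>Lord of the Rings: The Fellowship of the Ring</title>\n</cd>\n"
            + "</data>";
        for (String dvd : dvdElementsAsStrings(xml)) {
            System.out.println(dvd);
        }
    }
}
```

Unlike SAX, this loads the whole document into memory, so it is better suited to small files.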

0
 
LVL 47

Expert Comment

by:for_yan
ID: 34884772
This will give you everything, but without attributes.
Let me add attributes.

Note that I added a top element to your file.


import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;

import javax.xml.parsers.SAXParserFactory;
import javax.xml.parsers.SAXParser;

/*
<product>
 <name>laptop</name>
 <price>120</price>
 <maker>Dell</maker>
</product>

*/
import java.util.ArrayList;



public class SimpleXMLParsing {


	static String filename = "C:\\temp\\test\\test2.xml";

    public static void getElementFromXML() {
		try {
			SAXParserFactory factory = SAXParserFactory.newInstance();
			SAXParser saxParser = factory.newSAXParser();
			DefaultHandler handler = new DefaultHandler() {
                boolean title = false;
                boolean length1 = false;
                boolean actor = false;
                boolean dvd = false;


                public void startElement(String uri, String localName,
						String qName, Attributes attributes)
						throws SAXException {

                    if (qName.toUpperCase().equalsIgnoreCase("title")) {
						title = true;
					}

                    if (qName.toUpperCase().equalsIgnoreCase("length")) {
						length1 = true;
					}
					if (qName.toUpperCase().equalsIgnoreCase("actor")) {
						actor= true;
					}
				if (qName.toUpperCase().equalsIgnoreCase("dvd")) {
						dvd= true;
					}

                    }
				public void endElement(String uri, String localName,
						String qName) throws SAXException {
                    if (qName.toUpperCase().equalsIgnoreCase("title")) {
						title = false;
					}

                    if (qName.toUpperCase().equalsIgnoreCase("length")) {
						length1 = false;
					}
					if (qName.toUpperCase().equalsIgnoreCase("actor")) {
						actor = false;
					}
					if (qName.toUpperCase().equalsIgnoreCase("dvd")) {
						dvd = false;
					}

                }
				public void characters(char ch[], int start, int length)
						throws SAXException {

                    if (title && dvd) {
                           System.out.println("<title>" + new String(ch, start, length) + "</title>");
                    }
                   // Note: the text of the <length> element is printed under a <price> label here.
                   if (length1 && dvd) {
                           System.out.println("<price>" + new String(ch, start, length) + "</price>");
                    }
                   if (actor && dvd) {
                           System.out.println("<actor> " + new String(ch, start, length) + "</actor>");
                    }


                    }


            };
			saxParser.parse(filename, handler);

        }
		catch (Exception e) {
			e.printStackTrace();
		}




    }

	public static void main(String args[])
	{
		SimpleXMLParsing.getElementFromXML();
    }


}


output:
<title>Lord of the Rings: The Fellowship of the Ring</title>
<price>178</price>
<actor> Ian Holm</actor>
<actor> Elijah Wood</actor>
<actor> Ian McKellen</actor>
<title>The Matrix</title>
<price>136</price>
<actor> Keanu Reeves</actor>
<actor> Laurence Fishburne</actor>
<title>Amadeus</title>
<price>158</price>
<actor> F. Murray Abraham</actor>
<actor> Tom Hulce</actor>
<actor> Elizabeth Berridge</actor>
<title>Chain Reaction</title>
<price>106</price>
<actor> Morgan Freeman</actor>
<actor> Keanu Reeves</actor>



test2.xml
0
 

Author Comment

by:wsyy
ID: 34884817
for_yan, the top elements like <dvd></dvd> or <product></product>, as well as their attributes, are needed. Thanks.
0
 
LVL 47

Assisted Solution

by:for_yan
for_yan earned 125 total points
ID: 34884818

Now it writes it all:


import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;

import javax.xml.parsers.SAXParserFactory;
import javax.xml.parsers.SAXParser;

/*
<product>
 <name>laptop</name>
 <price>120</price>
 <maker>Dell</maker>
</product>

*/
import java.util.ArrayList;



public class SimpleXMLParsing {


	static String filename = "C:\\temp\\test\\test2.xml";

    public static void getElementFromXML() {
		try {
			SAXParserFactory factory = SAXParserFactory.newInstance();
			SAXParser saxParser = factory.newSAXParser();
			DefaultHandler handler = new DefaultHandler() {
                boolean title = false;
                boolean length1 = false;
                boolean actor = false;
                boolean dvd = false;
                String dvdAttValue;
                String dvdAtt;


                public void startElement(String uri, String localName,
						String qName, Attributes attributes)
						throws SAXException {

                    if (qName.toUpperCase().equalsIgnoreCase("title")) {
						title = true;
					}

                    if (qName.toUpperCase().equalsIgnoreCase("length")) {
						length1 = true;
					}
					if (qName.toUpperCase().equalsIgnoreCase("actor")) {
						actor= true;
					}
				if (qName.toUpperCase().equalsIgnoreCase("dvd")) {
						dvd= true;
                    dvdAtt = attributes.getQName(0);
                    dvdAttValue = attributes.getValue(0);
                    System.out.println("<dvd " + dvdAtt + "=" + dvdAttValue + ">");

					}

                    }
				public void endElement(String uri, String localName,
						String qName) throws SAXException {
                    if (qName.toUpperCase().equalsIgnoreCase("title")) {
						title = false;
					}

                    if (qName.toUpperCase().equalsIgnoreCase("length")) {
						length1 = false;
					}
					if (qName.toUpperCase().equalsIgnoreCase("actor")) {
						actor = false;
					}
					if (qName.toUpperCase().equalsIgnoreCase("dvd")) {
						dvd = false;
                        System.out.println("</dvd>");
					}

                }
				public void characters(char ch[], int start, int length)
						throws SAXException {


                    if (title && dvd) {
                           System.out.println("<title>" + new String(ch, start, length) + "</title>");
                    }
                   // Note: the text of the <length> element is printed under a <price> label here.
                   if (length1 && dvd) {
                           System.out.println("<price>" + new String(ch, start, length) + "</price>");
                    }
                   if (actor && dvd) {
                           System.out.println("<actor> " + new String(ch, start, length) + "</actor>");
                    }


                    }


            };
			saxParser.parse(filename, handler);

        }
		catch (Exception e) {
			e.printStackTrace();
		}




    }

	public static void main(String args[])
	{
		SimpleXMLParsing.getElementFromXML();
    }


}



Output:

<dvd id=A>
<title>Lord of the Rings: The Fellowship of the Ring</title>
<price>178</price>
<actor> Ian Holm</actor>
<actor> Elijah Wood</actor>
<actor> Ian McKellen</actor>
</dvd>
<dvd id=B>
<title>The Matrix</title>
<price>136</price>
<actor> Keanu Reeves</actor>
<actor> Laurence Fishburne</actor>
</dvd>
<dvd id=C>
<title>Amadeus</title>
<price>158</price>
<actor> F. Murray Abraham</actor>
<actor> Tom Hulce</actor>
<actor> Elizabeth Berridge</actor>
</dvd>
<dvd id=D>
<title>Chain Reaction</title>
<price>106</price>
<actor> Morgan Freeman</actor>
<actor> Keanu Reeves</actor>
</dvd>
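
If the goal is to get each `dvd` element back as a string rather than printed line by line, a more generic handler may help. The sketch below is not the code above: it is an assumed variant that copies whatever appears inside a `dvd` element (any child tag, with quoted attribute values and original whitespace) into a buffer. The class name `DvdSaxCollector` and its `collect` method are made up for illustration, and it assumes `dvd` elements are not nested and that the input has a single root element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class DvdSaxCollector extends DefaultHandler {
    private final List<String> dvds = new ArrayList<String>();
    private StringBuilder current; // non-null only while inside a <dvd> element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (current == null && qName.equals("dvd")) {
            current = new StringBuilder();
        }
        if (current != null) {
            current.append('<').append(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                current.append(' ').append(atts.getQName(i))
                       .append("=\"").append(escape(atts.getValue(i))).append('"');
            }
            current.append('>');
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(escape(new String(ch, start, length)));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current != null) {
            current.append("</").append(qName).append('>');
            if (qName.equals("dvd")) { // end of the block: store it and reset
                dvds.add(current.toString());
                current = null;
            }
        }
    }

    // The parser hands over decoded text, so special characters are re-escaped.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Hypothetical entry point: parses the XML text and returns the dvd blocks.
    static List<String> collect(String xml) {
        try {
            DvdSaxCollector handler = new DvdSaxCollector();
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler.dvds;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<data><dvd id=\"B\">\n  <title>The Matrix</title>\n</dvd>\n"
                   + "<cd id=\"A\">\n  <title>Lord of the Rings: The Fellowship of the Ring</title>\n</cd></data>";
        for (String dvd : collect(xml)) {
            System.out.println(dvd);
        }
    }
}
```

Because nothing is keyed on `title`, `length` or `actor`, new child elements would be carried through without changing the handler.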


0
 
LVL 47

Expert Comment

by:for_yan
ID: 34884823
There were no <products> or <product> elements in your example, but that is not a problem, of course.
0
 

Author Closing Comment

by:wsyy
ID: 34884829
Big thanks to all!
0
 
LVL 47

Expert Comment

by:for_yan
ID: 34884843

This is with <products> and </products>.
I still needed to add one top element, like <data> and </data>,
enclosing it all, because a well-formed
XML document must have exactly one
root element.

import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;

import javax.xml.parsers.SAXParserFactory;
import javax.xml.parsers.SAXParser;

/*
<product>
 <name>laptop</name>
 <price>120</price>
 <maker>Dell</maker>
</product>

*/
import java.util.ArrayList;



public class SimpleXMLParsing {


	static String filename = "C:\\temp\\test\\test2.xml";

    public static void getElementFromXML() {
		try {
			SAXParserFactory factory = SAXParserFactory.newInstance();
			SAXParser saxParser = factory.newSAXParser();
			DefaultHandler handler = new DefaultHandler() {
                boolean title = false;
                boolean length1 = false;
                boolean actor = false;
                boolean dvd = false;
                boolean products = false;
                String dvdAttValue;
                String dvdAtt;


                public void startElement(String uri, String localName,
						String qName, Attributes attributes)
						throws SAXException {

                    if (qName.toUpperCase().equalsIgnoreCase("title")) {
						title = true;
					}

                    if (qName.toUpperCase().equalsIgnoreCase("length")) {
						length1 = true;
					}
					if (qName.toUpperCase().equalsIgnoreCase("actor")) {
						actor= true;
					}
				if (qName.toUpperCase().equalsIgnoreCase("dvd")) {
						dvd= true;
                    dvdAtt = attributes.getQName(0);
                    dvdAttValue = attributes.getValue(0);
                    System.out.println("<dvd " + dvdAtt + "=" + dvdAttValue + ">");

					}
				if (qName.toUpperCase().equalsIgnoreCase("products")) {
                           System.out.println("<products>");
                }


                    }
				public void endElement(String uri, String localName,
						String qName) throws SAXException {
                    if (qName.toUpperCase().equalsIgnoreCase("title")) {
						title = false;
					}

                    if (qName.toUpperCase().equalsIgnoreCase("length")) {
						length1 = false;
					}
					if (qName.toUpperCase().equalsIgnoreCase("actor")) {
						actor = false;
					}
					if (qName.toUpperCase().equalsIgnoreCase("dvd")) {
						dvd = false;
                        System.out.println("</dvd>");
					}
					if (qName.toUpperCase().equalsIgnoreCase("products")) {
						products = false;
                        System.out.println("</products>");
					}

                }
				public void characters(char ch[], int start, int length)
						throws SAXException {


                    if (title && dvd) {
                           System.out.println("<title>" + new String(ch, start, length) + "</title>");
                    }
                   if (length1 && dvd) {
                           // Print the length under its own tag name rather than <price>.
                           System.out.println("<length>" + new String(ch, start, length) + "</length>");
                    }
                   if (actor && dvd) {
                           System.out.println("<actor> " + new String(ch, start, length) + "</actor>");
                    }


                    }


            };
			saxParser.parse(filename, handler);

        }
		catch (Exception e) {
			e.printStackTrace();
		}




    }

	public static void main(String args[])
	{
		SimpleXMLParsing.getElementFromXML();
    }


}
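
The original data has several top-level elements, which a SAX parser rejects because a well-formed XML document needs exactly one root. Instead of editing the file, the wrapper can be added while reading. The following is a sketch under assumptions: the names `RootWrapper`, `wrapInRoot` and `countDvds` are made up, the `<data>` root is arbitrary, and the input must not start with an `<?xml ...?>` declaration or a byte order mark (either would end up after the synthetic opening tag and break parsing).

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class RootWrapper {
    // Surrounds the fragment with <data> ... </data> without touching the source.
    static InputStream wrapInRoot(InputStream fragment) {
        InputStream open =
                new ByteArrayInputStream("<data>".getBytes(StandardCharsets.UTF_8));
        InputStream close =
                new ByteArrayInputStream("</data>".getBytes(StandardCharsets.UTF_8));
        return new SequenceInputStream(
                Collections.enumeration(Arrays.asList(open, fragment, close)));
    }

    // Hypothetical check: parses the wrapped stream and counts <dvd> elements.
    static int countDvds(InputStream fragment) {
        final int[] count = {0};
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    wrapInRoot(fragment),
                    new DefaultHandler() {
                        @Override
                        public void startElement(String uri, String localName,
                                                 String qName, Attributes atts) {
                            if (qName.equals("dvd")) {
                                count[0]++;
                            }
                        }
                    });
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        // Two top-level elements and no common root, as in the question.
        String fragment = "<dvd id=\"C\"><title>Amadeus</title></dvd>\n"
                        + "<cd id=\"A\"><title>Lord of the Rings: The Fellowship of the Ring</title></cd>\n";
        InputStream in = new ByteArrayInputStream(fragment.getBytes(StandardCharsets.UTF_8));
        System.out.println("dvd elements found: " + countDvds(in));
    }
}
```

For a file on disk, a `java.io.FileInputStream` could be passed in place of the in-memory stream used in `main`.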


0


Question has a verified solution.
