wsyy

Regex to extract XML contents

Hi,

I have the following XML contents to work on:

<dvd id="A">
  <title>Lord of the Rings: The Fellowship of the Ring</title>
  <length>178</length>
  <actor>Ian Holm</actor>
  <actor>Elijah Wood</actor>
  <actor>Ian McKellen</actor>
</dvd>
<dvd id="B">
  <title>The Matrix</title>
  <length>136</length>
  <actor>Keanu Reeves</actor>
  <actor>Laurence Fishburne</actor>
</dvd>
<dvd id="C">
  <title>Amadeus</title>
  <length>158</length>
  <actor>F. Murray Abraham</actor>
  <actor>Tom Hulce</actor>
  <actor>Elizabeth Berridge</actor>
</dvd>
<cd id="A">
  <title>Lord of the Rings: The Fellowship of the Ring</title>
  <length>178</length>
  <actor>Ian Holm</actor>
  <actor>Elijah Wood</actor>
  <actor>Ian McKellen</actor>
</cd>
<cd id="B">
  <title>Lord of the Rings: The Fellowship of the Ring</title>
  <length>178</length>
  <actor>Ian Holm</actor>
  <actor>Elijah Wood</actor>
  <actor>Ian McKellen</actor>
</cd>
<rd id="A">
  <title>Lord of the Rings: The Fellowship of the Ring</title>
  <length>178</length>
  <actor>Ian Holm</actor>
  <actor>Elijah Wood</actor>
  <actor>Ian McKellen</actor>
</rd>
<dvd id="D">
  <title>Chain Reaction</title>
  <length>106</length>
  <actor>Morgan Freeman</actor>
  <actor>Keanu Reeves</actor>
</dvd>

There are "dvd", "cd" and "rd" tags commingled. I want to see if a regex can extract all the "dvd" contents including the tags.

Thanks
Terry Woods

Is this what you mean?

Pattern:
  Pattern re = Pattern.compile("<dvd[^>]*>.*?</dvd>",Pattern.DOTALL);

Gives result:
    [0] => Array
        (
            [0] => <dvd id="A">
  <title>Lord of the Rings: The Fellowship of the Ring</title>
  <length>178</length>
  <actor>Ian Holm</actor>
  <actor>Elijah Wood</actor>
  <actor>Ian McKellen</actor>
</dvd>
            [1] => <dvd id="B">
  <title>The Matrix</title>
  <length>136</length>
  <actor>Keanu Reeves</actor>
  <actor>Laurence Fishburne</actor>
</dvd>
            [2] => <dvd id="C">
  <title>Amadeus</title>
  <length>158</length>
  <actor>F. Murray Abraham</actor>
  <actor>Tom Hulce</actor>
  <actor>Elizabeth Berridge</actor>
</dvd>
            [3] => <dvd id="D">
  <title>Chain Reaction</title>
  <length>106</length>
  <actor>Morgan Freeman</actor>
  <actor>Keanu Reeves</actor>
</dvd>
        )
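
For reference, a minimal Java sketch of how the pattern above might be applied. The class name `DvdRegexDemo`, the method `extractDvds`, and the sample input are made up for illustration; the pattern itself is the one posted above. `Pattern.DOTALL` lets `.` cross line breaks, and the reluctant `.*?` stops each match at the first `</dvd>`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DvdRegexDemo {
    // Same pattern as in the post: DOTALL makes '.' match newlines, and
    // '.*?' is reluctant so a match ends at the nearest closing </dvd>.
    private static final Pattern DVD =
            Pattern.compile("<dvd[^>]*>.*?</dvd>", Pattern.DOTALL);

    // Returns every <dvd ...>...</dvd> block, tags included, in document order.
    static List<String> extractDvds(String xml) {
        List<String> blocks = new ArrayList<>();
        Matcher m = DVD.matcher(xml);
        while (m.find()) {
            blocks.add(m.group()); // group() is group(0): the whole match
        }
        return blocks;
    }

    public static void main(String[] args) {
        // Hypothetical sample input, shortened from the question.
        String xml = "<dvd id=\"A\">\n  <title>Amadeus</title>\n</dvd>\n"
                + "<cd id=\"A\">\n  <title>Some CD</title>\n</cd>\n"
                + "<dvd id=\"B\">\n  <title>The Matrix</title>\n</dvd>\n";
        for (String block : extractDvds(xml)) {
            System.out.println(block);
        }
    }
}
```

This only works as long as `dvd` elements are never nested inside each other and no comment or CDATA section contains the text `</dvd>`; a regex cannot handle those cases reliably.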

Do you necessarily need to do it with a regex, or would parsing it with SAX
also be acceptable?

I'm sure that you could use regular expressions to do the job; I'm just not familiar enough with regex to construct one for you...
I did, however, find a couple of articles on how to read and parse XML using Java. Here are the links:
http://www.java-tips.org/java-se-tips/javax.xml.parsers/how-to-read-xml-file-in-java.html
http://www.java-samples.com/showtutorial.php?tutorialid=152

Hope they help

dimaj
wsyy

ASKER

for_yan, you are right, SAX can be an alternative solution. Yet I would like to avoid navigating the DOM tree to get all the nodes.

Also, I would think regex can be faster and easier to implement (am I right?).

dimaj, I will take a look at what you come up with.

TerryAtOpus: that is exactly what I am looking for. I will do a test asap. Thanks.

ASKER

TerryAtOpus, do you have the code example? I just want to know how to get group(1) etc.
ASKER CERTIFIED SOLUTION
Terry Woods
This solution is only available to members.

With SAX you don't necessarily need to visit all the nodes,
but if regex is what you want, then you should probably be happy with TerryAtOpus's solution.
> Also, I would think regex can be faster and easier to implement (am I right?).

regex is actually quite slow

ASKER

Pattern re = Pattern.compile("<dvd[^>]*>.*?</dvd>", Pattern.DOTALL|Pattern.MULTILINE|Pattern.CASE_INSENSITIVE);
Matcher m = re.matcher(file);
while (m.find()){
    System.out.println("Group 0: " + m.group(0));
    System.out.println("Group 1: " + m.group(1));
    System.out.println("Group 2: " + m.group(2));
}

The system throws an exception:

Exception in thread "main" java.lang.IndexOutOfBoundsException: No group 1
      at java.util.regex.Matcher.group(Unknown Source)
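
The exception is expected with that pattern: it contains no parentheses, so it defines no capturing groups, and only `group(0)` (the whole match) exists. A small sketch of one way to add groups, assuming each `dvd` tag carries a double-quoted `id` attribute directly after the tag name as in the sample data. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DvdGroupsDemo {
    // Each pair of parentheses defines one capturing group:
    //   group 1 = the value of the id attribute
    //   group 2 = everything between the opening and closing dvd tags
    private static final Pattern DVD = Pattern.compile(
            "<dvd\\s+id=\"([^\"]*)\"[^>]*>(.*?)</dvd>", Pattern.DOTALL);

    // Builds one summary line per matched dvd element.
    static String describe(String xml) {
        StringBuilder sb = new StringBuilder();
        Matcher m = DVD.matcher(xml);
        while (m.find()) {
            sb.append("groupCount=").append(m.groupCount())
              .append(" id=").append(m.group(1))
              .append(" body=").append(m.group(2).trim())
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample input.
        String xml = "<dvd id=\"A\"><title>Amadeus</title></dvd>"
                + "<cd id=\"B\"><title>Some CD</title></cd>";
        System.out.print(describe(xml));
    }
}
```

`m.groupCount()` reports how many groups the pattern defines, so checking it before calling `group(n)` avoids the `IndexOutOfBoundsException`.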

ASKER

objects:

do you think SAX is better than regex? what would be a piece of SAX code that can do the same job?

thanks

ASKER

for_yan or objects:

can you please kindly provide a code example? thanks
sax would be a lot faster
Do you want to return the XML between the DVD tags, or just the data (title: XXX, author: YYYY)?
Try this - it might shed some light on the structure of the result:

Pattern re = Pattern.compile("<dvd[^>]*>.*?</dvd>", Pattern.DOTALL|Pattern.MULTILINE|Pattern.CASE_INSENSITIVE);
Matcher m = re.matcher(file);
while (m.find()){
    System.out.println("match [" + m.group() + "]");
}

ASKER

for_yan, I would like the XML between the DVDs, not just the attributes or data.

ASKER

I just noticed the following in the JDOM API:

void      XMLOutputter.output(Element element, java.io.OutputStream out)
          Print out an Element, including its Attributes, and all contained (child) elements, etc.

I think this would help if it can output to a string, not to console or a file.

Just a thought.
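
If memory serves, JDOM's `XMLOutputter` also offers `outputString(Element)`, which returns a `String` directly. To stay free of third-party libraries, here is a rough equivalent using only the JDK: parse into a DOM, then serialize each `dvd` element into a `StringWriter`. It assumes the fragments have first been wrapped in a single root element (called `data` here, an arbitrary name); `DvdDomDemo` and `dvdElementsAsStrings` are names made up for this sketch.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DvdDomDemo {
    // Serializes every <dvd> element (tags, attributes, children) to a String.
    static List<String> dvdElementsAsStrings(String wrappedXml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(wrappedXml)));

            // An identity transformer copies a DOM node to the output unchanged.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            List<String> result = new ArrayList<>();
            NodeList dvds = doc.getElementsByTagName("dvd");
            for (int i = 0; i < dvds.getLength(); i++) {
                StringWriter out = new StringWriter(); // a String target, not console or file
                serializer.transform(new DOMSource(dvds.item(i)), new StreamResult(out));
                result.add(out.toString());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse or serialize the XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample input, wrapped in a <data> root element.
        String xml = "<data><dvd id=\"A\"><title>Amadeus</title></dvd>"
                + "<cd id=\"A\"><title>Some CD</title></cd></data>";
        for (String dvd : dvdElementsAsStrings(xml)) {
            System.out.println(dvd);
        }
    }
}
```

Unlike SAX, this loads the whole document into memory, so it trades memory for convenience.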

This will give you everything, but without attributes.
Let me add the attributes.

Mind that I added a top-level element to your file.


import java.io.File;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SimpleXMLParsing {

    static String filename = "C:\\temp\\test\\test2.xml";

    public static void getElementFromXML() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            SAXParser saxParser = factory.newSAXParser();
            DefaultHandler handler = new DefaultHandler() {
                // Flags recording which elements the parser is currently inside.
                boolean title = false;
                boolean length1 = false;
                boolean actor = false;
                boolean dvd = false;

                @Override
                public void startElement(String uri, String localName,
                        String qName, Attributes attributes)
                        throws SAXException {
                    if (qName.equalsIgnoreCase("title")) {
                        title = true;
                    }
                    if (qName.equalsIgnoreCase("length")) {
                        length1 = true;
                    }
                    if (qName.equalsIgnoreCase("actor")) {
                        actor = true;
                    }
                    if (qName.equalsIgnoreCase("dvd")) {
                        dvd = true;
                    }
                }

                @Override
                public void endElement(String uri, String localName,
                        String qName) throws SAXException {
                    if (qName.equalsIgnoreCase("title")) {
                        title = false;
                    }
                    if (qName.equalsIgnoreCase("length")) {
                        length1 = false;
                    }
                    if (qName.equalsIgnoreCase("actor")) {
                        actor = false;
                    }
                    if (qName.equalsIgnoreCase("dvd")) {
                        dvd = false;
                    }
                }

                @Override
                public void characters(char ch[], int start, int length)
                        throws SAXException {
                    // Print text only for elements nested inside a <dvd>;
                    // <cd> and <rd> content is skipped.
                    String text = new String(ch, start, length);
                    if (title && dvd) {
                        System.out.println("<title>" + text + "</title>");
                    }
                    if (length1 && dvd) {
                        System.out.println("<length>" + text + "</length>");
                    }
                    if (actor && dvd) {
                        System.out.println("<actor>" + text + "</actor>");
                    }
                }
            };
            saxParser.parse(new File(filename), handler);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        SimpleXMLParsing.getElementFromXML();
    }
}

output:
<title>Lord of the Rings: The Fellowship of the Ring</title>
<length>178</length>
<actor>Ian Holm</actor>
<actor>Elijah Wood</actor>
<actor>Ian McKellen</actor>
<title>The Matrix</title>
<length>136</length>
<actor>Keanu Reeves</actor>
<actor>Laurence Fishburne</actor>
<title>Amadeus</title>
<length>158</length>
<actor>F. Murray Abraham</actor>
<actor>Tom Hulce</actor>
<actor>Elizabeth Berridge</actor>
<title>Chain Reaction</title>
<length>106</length>
<actor>Morgan Freeman</actor>
<actor>Keanu Reeves</actor>



test2.xml

ASKER

for_yan, the top elements like <dvd></dvd> or <product></product>, as well as their attributes, are needed. Thanks.
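
One possible way to keep the enclosing tags and all attributes with SAX is to rebuild the markup generically instead of tracking one flag per child element. The sketch below is an illustration under assumptions, not the solution posted in this thread: the class name `DvdSaxCollector` and the helper `collect` are invented, the input is assumed to be wrapped in a single root element, and comments, CDATA markers, and the original quoting style are not preserved.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class DvdSaxCollector extends DefaultHandler {
    private final List<String> dvds = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();
    private int depth = 0; // greater than 0 while inside a <dvd> element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (depth > 0 || qName.equals("dvd")) {
            depth++;
            // Re-emit the opening tag with every attribute, values quoted.
            current.append('<').append(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                current.append(' ').append(atts.getQName(i))
                       .append("=\"").append(escape(atts.getValue(i))).append('"');
            }
            current.append('>');
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            current.append(escape(new String(ch, start, length)));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (depth > 0) {
            current.append("</").append(qName).append('>');
            if (--depth == 0) { // the outermost </dvd> closes one complete block
                dvds.add(current.toString());
                current.setLength(0);
            }
        }
    }

    // The parser hands over decoded text, so markup characters are re-escaped.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Parses the wrapped XML and returns each dvd element as a String.
    static List<String> collect(String wrappedXml) {
        try {
            DvdSaxCollector handler = new DvdSaxCollector();
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(wrappedXml)), handler);
            return handler.dvds;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample input, wrapped in a <data> root element.
        String xml = "<data><dvd id=\"A\"><title>Amadeus</title></dvd>"
                + "<cd id=\"A\"><title>Some CD</title></cd></data>";
        for (String dvd : collect(xml)) {
            System.out.println(dvd);
        }
    }
}
```

The depth counter means child elements do not have to be listed by name, so new children of `dvd` are copied without code changes.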
SOLUTION
This solution is only available to members.
There were no <products> or <product> elements in your example, but that is not a problem, of course.

ASKER

Big thanks to all!

This version handles <products> and </products>.
I still needed to add one top-level element, such as <data> and </data>,
enclosing everything, since an XML parser expects a single root element.

import java.io.File;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SimpleXMLParsing {

    static String filename = "C:\\temp\\test\\test2.xml";

    public static void getElementFromXML() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            SAXParser saxParser = factory.newSAXParser();
            DefaultHandler handler = new DefaultHandler() {
                // Flags recording which elements the parser is currently inside.
                boolean title = false;
                boolean length1 = false;
                boolean actor = false;
                boolean dvd = false;

                @Override
                public void startElement(String uri, String localName,
                        String qName, Attributes attributes)
                        throws SAXException {
                    if (qName.equalsIgnoreCase("title")) {
                        title = true;
                    }
                    if (qName.equalsIgnoreCase("length")) {
                        length1 = true;
                    }
                    if (qName.equalsIgnoreCase("actor")) {
                        actor = true;
                    }
                    if (qName.equalsIgnoreCase("dvd")) {
                        dvd = true;
                        // Rebuild the opening tag with all attributes, values quoted.
                        StringBuilder tag = new StringBuilder("<dvd");
                        for (int i = 0; i < attributes.getLength(); i++) {
                            tag.append(' ').append(attributes.getQName(i))
                               .append("=\"").append(attributes.getValue(i)).append('"');
                        }
                        tag.append('>');
                        System.out.println(tag.toString());
                    }
                    if (qName.equalsIgnoreCase("products")) {
                        System.out.println("<products>");
                    }
                }

                @Override
                public void endElement(String uri, String localName,
                        String qName) throws SAXException {
                    if (qName.equalsIgnoreCase("title")) {
                        title = false;
                    }
                    if (qName.equalsIgnoreCase("length")) {
                        length1 = false;
                    }
                    if (qName.equalsIgnoreCase("actor")) {
                        actor = false;
                    }
                    if (qName.equalsIgnoreCase("dvd")) {
                        dvd = false;
                        System.out.println("</dvd>");
                    }
                    if (qName.equalsIgnoreCase("products")) {
                        System.out.println("</products>");
                    }
                }

                @Override
                public void characters(char ch[], int start, int length)
                        throws SAXException {
                    // Print text only for elements nested inside a <dvd>.
                    String text = new String(ch, start, length);
                    if (title && dvd) {
                        System.out.println("<title>" + text + "</title>");
                    }
                    if (length1 && dvd) {
                        System.out.println("<length>" + text + "</length>");
                    }
                    if (actor && dvd) {
                        System.out.println("<actor>" + text + "</actor>");
                    }
                }
            };
            saxParser.parse(new File(filename), handler);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        SimpleXMLParsing.getElementFromXML();
    }
}
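
On the need for a single enclosing element: rather than editing the file, the root tags can be supplied at read time. A rough sketch with `java.io.SequenceInputStream`, assuming the fragment file has no XML declaration or byte-order mark and is UTF-8 (or plain ASCII). The names `WrappedFragmentDemo`, `wrapInRoot`, and `countDvds` are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class WrappedFragmentDemo {
    // Presents "<root>" + fragment + "</root>" as one stream without copying the fragment.
    static InputStream wrapInRoot(InputStream fragment, String rootName) {
        InputStream open = new ByteArrayInputStream(
                ("<" + rootName + ">").getBytes(StandardCharsets.UTF_8));
        InputStream close = new ByteArrayInputStream(
                ("</" + rootName + ">").getBytes(StandardCharsets.UTF_8));
        return new SequenceInputStream(
                Collections.enumeration(Arrays.asList(open, fragment, close)));
    }

    // Counts <dvd> start tags in a rootless fragment, as a stand-in for real handling.
    static int countDvds(InputStream fragment) {
        final int[] count = {0};
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    wrapInRoot(fragment, "data"),
                    new DefaultHandler() {
                        @Override
                        public void startElement(String uri, String localName,
                                String qName, Attributes atts) {
                            if (qName.equals("dvd")) {
                                count[0]++;
                            }
                        }
                    });
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the XML fragment", e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        // Hypothetical rootless input, standing in for the contents of test2.xml.
        String fragment = "<dvd id=\"A\"></dvd><cd id=\"A\"></cd><dvd id=\"B\"></dvd>";
        InputStream in = new ByteArrayInputStream(fragment.getBytes(StandardCharsets.UTF_8));
        System.out.println(countDvds(in));
    }
}
```

For a file on disk, a `FileInputStream` could be passed as the fragment in place of the in-memory stream.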
