?
Solved

Regex to extract XML contents

Posted on 2011-02-13
22
Medium Priority
?
785 Views
Last Modified: 2012-05-11
Hi,

I have the following XML contents to work on:

<dvd id="A">
  <title>Lord of the Rings: The Fellowship of the Ring</title>
  <length>178</length>
  <actor>Ian Holm</actor>
  <actor>Elijah Wood</actor>
  <actor>Ian McKellen</actor>
</dvd>
<dvd id="B">
  <title>The Matrix</title>
  <length>136</length>
  <actor>Keanu Reeves</actor>
  <actor>Laurence Fishburne</actor>
</dvd>
<dvd id="C">
  <title>Amadeus</title>
  <length>158</length>
  <actor>F. Murray Abraham</actor>
  <actor>Tom Hulce</actor>
  <actor>Elizabeth Berridge</actor>
</dvd>
<cd id="A">
  <title>Lord of the Rings: The Fellowship of the Ring</title>
  <length>178</length>
  <actor>Ian Holm</actor>
  <actor>Elijah Wood</actor>
  <actor>Ian McKellen</actor>
</cd>
<cd id="B">
  <title>Lord of the Rings: The Fellowship of the Ring</title>
  <length>178</length>
  <actor>Ian Holm</actor>
  <actor>Elijah Wood</actor>
  <actor>Ian McKellen</actor>
</cd>
<rd id="A">
  <title>Lord of the Rings: The Fellowship of the Ring</title>
  <length>178</length>
  <actor>Ian Holm</actor>
  <actor>Elijah Wood</actor>
  <actor>Ian McKellen</actor>
</rd>
<dvd id="D">
  <title>Chain Reaction</title>
  <length>106</length>
  <actor>Morgan Freeman</actor>
  <actor>Keanu Reeves</actor>
</dvd>

There are "dvd", "cd" and "rd" tags commingled. I want to see if a regex can extract all the "dvd" contents including the tags.

Thanks
0
Comment
Question by:wsyy
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 9
  • 7
  • 3
  • +2
22 Comments
 
LVL 35

Expert Comment

by:Terry Woods
ID: 34884584
Is this what you mean?

Pattern:
  Pattern re = Pattern.compile("<dvd[^>]*>.*?</dvd>",Pattern.DOTALL);

Gives result:
    [0] => Array
        (
            [0] => <dvd id="A">
  <title>Lord of the Rings: The Fellowship of the Ring</title>
  <length>178</length>
  <actor>Ian Holm</actor>
  <actor>Elijah Wood</actor>
  <actor>Ian McKellen</actor>
</dvd>
            [1] => <dvd id="B">
  <title>The Matrix</title>
  <length>136</length>
  <actor>Keanu Reeves</actor>
  <actor>Laurence Fishburne</actor>
</dvd>
            [2] => <dvd id="C">
  <title>Amadeus</title>
  <length>158</length>
  <actor>F. Murray Abraham</actor>
  <actor>Tom Hulce</actor>
  <actor>Elizabeth Berridge</actor>
</dvd>
            [3] => <dvd id="D">
  <title>Chain Reaction</title>
  <length>106</length>
  <actor>Morgan Freeman</actor>
  <actor>Keanu Reeves</actor>
</dvd>
        )
0
 
LVL 47

Expert Comment

by:for_yan
ID: 34884588
Do you necessarily need to do it through regex, or parsing it with SAX
will also be acceptable?
0
 
LVL 7

Expert Comment

by:dimaj
ID: 34884606
I'm sure that you could use regular expressions to do the job, I'm just not very familiar with regex to construct one for you...
I did however find couple of articles on how to read and parse XML using Java and here are the links:
http://www.java-tips.org/java-se-tips/javax.xml.parsers/how-to-read-xml-file-in-java.html
http://www.java-samples.com/showtutorial.php?tutorialid=152

Hope they help

dimaj
0
Get real performance insights from real users

Key features:
- Total Pages Views and Load times
- Top Pages Viewed and Load Times
- Real Time Site Page Build Performance
- Users’ Browser and Platform Performance
- Geographic User Breakdown
- And more

 

Author Comment

by:wsyy
ID: 34884662
for_yan, you are right SAX can be alternative solution. Yet I would like to avoid navigate the DOM tree to get all the nodes.

Also, I would think regex can be faster and easier to implement (am I right?).

dimaj, I will take a look at what you come up with.

TerryAtOpus: that is exact what I am looking for. I will do a test asap. Thanks.
0
 

Author Comment

by:wsyy
ID: 34884670
TerryAtOpus, do you have the code example? I just want to know how to get group(1) etc.
0
 
LVL 35

Accepted Solution

by:
Terry Woods earned 500 total points
ID: 34884672
I generally work with PHP, so I generated the code from www.myregextester.com - the full code generated is this (hopefully it will work for you):

import java.util.regex.Pattern;
import java.util.regex.Matcher;
class Module1{
  public static void main(String[] asd){
  String sourcestring = "source string to match with pattern";
  Pattern re = Pattern.compile("<dvd[^>]*>.*?</dvd>",Pattern.DOTALL);
  Matcher m = re.matcher(sourcestring);
  int mIdx = 0;
    while (m.find()){
      for( int groupIdx = 0; groupIdx < m.groupCount()+1; groupIdx++ ){
        System.out.println( "[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
      }
      mIdx++;
    }
  }
}
0
 
LVL 47

Expert Comment

by:for_yan
ID: 34884675
With SAX you don't need  to get all nodes necessarily,
but once regex is what you want, then probably you should be happy with solution of TerryAtOpus
0
 
LVL 92

Expert Comment

by:objects
ID: 34884676
> Also, I would think regex can be faster and easier to implement (am I right?).

regex is actually quite slow
0
 

Author Comment

by:wsyy
ID: 34884677
Pattern re = Pattern.compile("<dvd[^>]*>.*?</dvd>", Pattern.DOTALL|Pattern.MULTILINE|Pattern.CASE_INSENSITIVE);
            Matcher m = re.matcher(file);
            while (m.find()){
                  System.out.println("Group 0: " + m.group(0));
                  System.out.println("Group 1: " + m.group(1));
                  System.out.println("Group 2: " + m.group(2));
            }

System throws out exception:

Exception in thread "main" java.lang.IndexOutOfBoundsException: No group 1
      at java.util.regex.Matcher.group(Unknown Source)
0
 

Author Comment

by:wsyy
ID: 34884681
objects:

do you think SAX is better than regex? what would be a piece of SAX code that can do the same job?

thanks
0
 

Author Comment

by:wsyy
ID: 34884686
for_yan or objects:

can you please kindly provide a code example? thanks
0
 
LVL 92

Expert Comment

by:objects
ID: 34884688
sax would be a lot faster
0
 
LVL 47

Expert Comment

by:for_yan
ID: 34884689
Do you want to return XML between the DVDs or just the data - title: XXX
                                                                                                      Author:YYYY
0
 
LVL 35

Expert Comment

by:Terry Woods
ID: 34884691
Try this - it might shed some light on the structure of the result:

Pattern re = Pattern.compile("<dvd[^>]*>.*?</dvd>", Pattern.DOTALL|Pattern.MULTILINE|Pattern.CASE_INSENSITIVE);
            Matcher m = re.matcher(file);
            while (m.find()){
                System.out.println("match [" + m.group() + "]");
            }
0
 

Author Comment

by:wsyy
ID: 34884714
for_yan, i would like the XML between the DVDs, not just the attributes or data.
0
 

Author Comment

by:wsyy
ID: 34884722
I just noticed the following from Jdom API

void      XMLOutputter.output(Element element, java.io.OutputStream out)
          Print out an Element, including its Attributes, and all contained (child) elements, etc.

I think this would help if it can output to a string, not to console or a file.

Just a thought.

0
 
LVL 47

Expert Comment

by:for_yan
ID: 34884772
This will guive you everything but without attributes.
Let me add attributes

Mind, that I added top elment to your file


import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;

import javax.xml.parsers.SAXParserFactory;
import javax.xml.parsers.SAXParser;

/*
<product>
 <name>laptop</name>
 <price>120</price>
 <maker>Dell</maker>
</product>

*/
import java.util.ArrayList;



public class SimpleXMLParsing {


	static String filename = "C:\\temp\\test\\test2.xml";

    public static void getElementFromXML() {
		try {
			SAXParserFactory factory = SAXParserFactory.newInstance();
			SAXParser saxParser = factory.newSAXParser();
			DefaultHandler handler = new DefaultHandler() {
                boolean title = false;
                boolean length1 = false;
                boolean actor = false;
                boolean dvd = false;


                public void startElement(String uri, String localName,
						String qName, Attributes attributes)
						throws SAXException {

                    if (qName.toUpperCase().equalsIgnoreCase("title")) {
						title = true;
					}

                    if (qName.toUpperCase().equalsIgnoreCase("length")) {
						length1 = true;
					}
					if (qName.toUpperCase().equalsIgnoreCase("actor")) {
						actor= true;
					}
				if (qName.toUpperCase().equalsIgnoreCase("dvd")) {
						dvd= true;
					}

                    }
				public void endElement(String uri, String localName,
						String qName) throws SAXException {
                    if (qName.toUpperCase().equalsIgnoreCase("title")) {
						title = false;
					}

                    if (qName.toUpperCase().equalsIgnoreCase("length")) {
						length1 = false;
					}
					if (qName.toUpperCase().equalsIgnoreCase("actor")) {
						actor = false;
					}
					if (qName.toUpperCase().equalsIgnoreCase("dvd")) {
						dvd = false;
					}

                }
				public void characters(char ch[], int start, int length)
						throws SAXException {

                    if (title && dvd) {
                           System.out.println("<title>" + new String(ch, start, length) + "</title>");
                    }
                   if (length1 && dvd) {
                           System.out.println("<price>" + new String(ch, start, length) + "</price>");
                    }
                   if (actor && dvd) {
                           System.out.println("<actor> " + new String(ch, start, length) + "</actor>");
                    }


                    }


            };
			saxParser.parse(filename, handler);

        }
		catch (Exception e) {
			e.printStackTrace();
		}




    }

	public static void main(String args[])
	{
		SimpleXMLParsing.getElementFromXML();
    }


}

Open in new window

output:
<title>Lord of the Rings: The Fellowship of the Ring</title>
<price>178</price>
<actor> Ian Holm</actor>
<actor> Elijah Wood</actor>
<actor> Ian McKellen</actor>
<title>The Matrix</title>
<price>136</price>
<actor> Keanu Reeves</actor>
<actor> Laurence Fishburne</actor>
<title>Amadeus</title>
<price>158</price>
<actor> F. Murray Abraham</actor>
<actor> Tom Hulce</actor>
<actor> Elizabeth Berridge</actor>
<title>Chain Reaction</title>
<price>106</price>
<actor> Morgan Freeman</actor>
<actor> Keanu Reeves</actor>

Open in new window


test2.xml
0
 

Author Comment

by:wsyy
ID: 34884817
for_yan,. the top elements like <dvd></dvd> or <product></product>, as well as their attributes, are needed thanks.
0
 
LVL 47

Assisted Solution

by:for_yan
for_yan earned 500 total points
ID: 34884818

Now it writes it al:


import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;

import javax.xml.parsers.SAXParserFactory;
import javax.xml.parsers.SAXParser;

/*
<product>
 <name>laptop</name>
 <price>120</price>
 <maker>Dell</maker>
</product>

*/
import java.util.ArrayList;



public class SimpleXMLParsing {


	static String filename = "C:\\temp\\test\\test2.xml";

    public static void getElementFromXML() {
		try {
			SAXParserFactory factory = SAXParserFactory.newInstance();
			SAXParser saxParser = factory.newSAXParser();
			DefaultHandler handler = new DefaultHandler() {
                boolean title = false;
                boolean length1 = false;
                boolean actor = false;
                boolean dvd = false;
                String dvdAttValue;
                String dvdAtt;


                public void startElement(String uri, String localName,
						String qName, Attributes attributes)
						throws SAXException {

                    if (qName.toUpperCase().equalsIgnoreCase("title")) {
						title = true;
					}

                    if (qName.toUpperCase().equalsIgnoreCase("length")) {
						length1 = true;
					}
					if (qName.toUpperCase().equalsIgnoreCase("actor")) {
						actor= true;
					}
				if (qName.toUpperCase().equalsIgnoreCase("dvd")) {
						dvd= true;
                    dvdAtt = attributes.getQName(0);
                    dvdAttValue = attributes.getValue(0);
                    System.out.println("<dvd " + dvdAtt + "=" + dvdAttValue + ">");

					}

                    }
				public void endElement(String uri, String localName,
						String qName) throws SAXException {
                    if (qName.toUpperCase().equalsIgnoreCase("title")) {
						title = false;
					}

                    if (qName.toUpperCase().equalsIgnoreCase("length")) {
						length1 = false;
					}
					if (qName.toUpperCase().equalsIgnoreCase("actor")) {
						actor = false;
					}
					if (qName.toUpperCase().equalsIgnoreCase("dvd")) {
						dvd = false;
                        System.out.println("</dvd>");
					}

                }
				public void characters(char ch[], int start, int length)
						throws SAXException {


                    if (title && dvd) {
                           System.out.println("<title>" + new String(ch, start, length) + "</title>");
                    }
                   if (length1 && dvd) {
                           System.out.println("<price>" + new String(ch, start, length) + "</price>");
                    }
                   if (actor && dvd) {
                           System.out.println("<actor> " + new String(ch, start, length) + "</actor>");
                    }


                    }


            };
			saxParser.parse(filename, handler);

        }
		catch (Exception e) {
			e.printStackTrace();
		}




    }

	public static void main(String args[])
	{
		SimpleXMLParsing.getElementFromXML();
    }


}

Open in new window


Output:

<dvd id=A>
<title>Lord of the Rings: The Fellowship of the Ring</title>
<price>178</price>
<actor> Ian Holm</actor>
<actor> Elijah Wood</actor>
<actor> Ian McKellen</actor>
</dvd>
<dvd id=B>
<title>The Matrix</title>
<price>136</price>
<actor> Keanu Reeves</actor>
<actor> Laurence Fishburne</actor>
</dvd>
<dvd id=C>
<title>Amadeus</title>
<price>158</price>
<actor> F. Murray Abraham</actor>
<actor> Tom Hulce</actor>
<actor> Elizabeth Berridge</actor>
</dvd>
<dvd id=D>
<title>Chain Reaction</title>
<price>106</price>
<actor> Morgan Freeman</actor>
<actor> Keanu Reeves</actor>
</dvd>

Open in new window

0
 
LVL 47

Expert Comment

by:for_yan
ID: 34884823
There was no products or product in your example, but that is not a problem of course
0
 

Author Closing Comment

by:wsyy
ID: 34884829
Big thanks to all!
0
 
LVL 47

Expert Comment

by:for_yan
ID: 34884843

This is with <products> and </products>
Still I needed to add one top element, like <data> and </data.
enclosing it all. I'm sure with proper
XML header there would be no need to
do it

import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;

import javax.xml.parsers.SAXParserFactory;
import javax.xml.parsers.SAXParser;

/*
<product>
 <name>laptop</name>
 <price>120</price>
 <maker>Dell</maker>
</product>

*/
import java.util.ArrayList;



public class SimpleXMLParsing {


	static String filename = "C:\\temp\\test\\test2.xml";

    public static void getElementFromXML() {
		try {
			SAXParserFactory factory = SAXParserFactory.newInstance();
			SAXParser saxParser = factory.newSAXParser();
			DefaultHandler handler = new DefaultHandler() {
                boolean title = false;
                boolean length1 = false;
                boolean actor = false;
                boolean dvd = false;
                boolean products = false;
                String dvdAttValue;
                String dvdAtt;


                public void startElement(String uri, String localName,
						String qName, Attributes attributes)
						throws SAXException {

                    if (qName.toUpperCase().equalsIgnoreCase("title")) {
						title = true;
					}

                    if (qName.toUpperCase().equalsIgnoreCase("length")) {
						length1 = true;
					}
					if (qName.toUpperCase().equalsIgnoreCase("actor")) {
						actor= true;
					}
				if (qName.toUpperCase().equalsIgnoreCase("dvd")) {
						dvd= true;
                    dvdAtt = attributes.getQName(0);
                    dvdAttValue = attributes.getValue(0);
                    System.out.println("<dvd " + dvdAtt + "=" + dvdAttValue + ">");

					}
				if (qName.toUpperCase().equalsIgnoreCase("products")) {
                           System.out.println("<products>");
                }


                    }
				public void endElement(String uri, String localName,
						String qName) throws SAXException {
                    if (qName.toUpperCase().equalsIgnoreCase("title")) {
						title = false;
					}

                    if (qName.toUpperCase().equalsIgnoreCase("length")) {
						length1 = false;
					}
					if (qName.toUpperCase().equalsIgnoreCase("actor")) {
						actor = false;
					}
					if (qName.toUpperCase().equalsIgnoreCase("dvd")) {
						dvd = false;
                        System.out.println("</dvd>");
					}
					if (qName.toUpperCase().equalsIgnoreCase("products")) {
						products = false;
                        System.out.println("</products>");
					}

                }
				public void characters(char ch[], int start, int length)
						throws SAXException {


                    if (title && dvd) {
                           System.out.println("<title>" + new String(ch, start, length) + "</title>");
                    }
                   if (length1 && dvd) {
                           System.out.println("<price>" + new String(ch, start, length) + "</price>");
                    }
                   if (actor && dvd) {
                           System.out.println("<actor> " + new String(ch, start, length) + "</actor>");
                    }


                    }


            };
			saxParser.parse(filename, handler);

        }
		catch (Exception e) {
			e.printStackTrace();
		}




    }

	public static void main(String args[])
	{
		SimpleXMLParsing.getElementFromXML();
    }


}

Open in new window

0

Featured Post

Optimize your web performance

What's in the eBook?
- Full list of reasons for poor performance
- Ultimate measures to speed things up
- Primary web monitoring types
- KPIs you should be monitoring in order to increase your ROI

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Introduction In my previous article (http://www.experts-exchange.com/Microsoft/Development/MS-SQL-Server/SSIS/A_9150-Loading-XML-Using-SSIS.html) I showed you how the XML Source component can be used to load XML files into a SQL Server database, us…
In this post we will learn how to make Android Gesture Tutorial and give different functionality whenever a user Touch or Scroll android screen.
Viewers will learn about if statements in Java and their use The if statement: The condition required to create an if statement: Variations of if statements: An example using if statements:
Viewers will learn about the regular for loop in Java and how to use it. Definition: Break the for loop down into 3 parts: Syntax when using for loops: Example using a for loop:
Suggested Courses
Course of the Month11 days, 1 hour left to enroll

770 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question