Regex help from legacy funky structure

Posted on 2011-10-31
Last Modified: 2012-05-12
Hello,

I have the following structure from a legacy system. It is NOT XML. Each structure has a start tag and an end tag, and the tag name varies: Group1, Group2, Group3, etc.

Inside each structure there is a finite set of properties, and none of them has a closing tag. For instance, <DESC> can span many lines but has NO matching </DESC>. The contents of DESC, like those of the other fields, end where the next field starts.

Hence my question: what is the best way to parse this with regex, given that each structure starts with a different name?

A sample file is attached.

Thanks.
<group1>
<title>This is the first of the documents related to the site
<desc>This documents describes the land and surroundings
of the location
<date>2009-10-21
<rfn>1212-YUI
<location>Minas Gerais
</group1>

<group2>
<title>This is the second of the documents related to the beach site
<desc>The beach house as it sits and its architectural characteristics
according to the author.
<date>2010-01-09
<rfn>1214-YAT
<location>Bahia
</group2>

Question by:CarlosScheidecker
15 Comments
 
LVL 35

Expert Comment

by:Terry Woods
ID: 37060495
I'd do it in 2 steps to increase the maintainability:

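1. Use this pattern to extract the groups (the reluctant .*? stops at the first matching closing tag, so repeated groups with the same number are not merged into one match):
Pattern re = Pattern.compile("<group(\\d+)>.*?</group\\1>",Pattern.DOTALL);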

2. Extract the fields from each group. I haven't provided a pattern yet, as I don't know if you'll want to use it or not.
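
A minimal sketch of the two-step idea, assuming group names always have the form groupN and that field values never contain a `<` character. The class name `TwoStepParser`, the method `parse`, and the field pattern used in step 2 are illustrative assumptions, not patterns given in this thread.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoStepParser {

    // Step 1: one whole <groupN>...</groupN> block; \1 repeats the captured number.
    private static final Pattern GROUP =
            Pattern.compile("<group(\\d+)>(.*?)</group\\1>", Pattern.DOTALL);

    // Step 2 (assumed pattern): a field tag followed by everything up to the next '<'.
    private static final Pattern FIELD = Pattern.compile("<(\\w+)>([^<]*)");

    /** Returns one field map per group block, in document order. */
    static List<Map<String, String>> parse(String text) {
        List<Map<String, String>> groups = new ArrayList<Map<String, String>>();
        Matcher g = GROUP.matcher(text);
        while (g.find()) {
            Map<String, String> fields = new LinkedHashMap<String, String>();
            // Hypothetical key "group" holds the N of groupN; it would clash with a field named "group".
            fields.put("group", g.group(1));
            Matcher f = FIELD.matcher(g.group(2)); // body between the group tags
            while (f.find()) {
                fields.put(f.group(1), f.group(2).trim());
            }
            groups.add(fields);
        }
        return groups;
    }

    public static void main(String[] args) {
        String sample = "<group1>\n<title>First\n<desc>Line one\nline two\n<date>2009-10-21\n</group1>\n"
                + "<group2>\n<title>Second\n<date>2010-01-09\n</group2>\n";
        for (Map<String, String> group : parse(sample)) {
            System.out.println(group);
        }
    }
}
```

Keeping the two patterns separate means the field pattern can change (for example, to allow `<` inside values) without touching the group extraction.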
 
LVL 35

Expert Comment

by:Terry Woods
ID: 37060496
Note that the number of the group is captured in a sub-pattern, and is back-referenced with \\1 to ensure we get the correct closing tag.
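
To illustrate the back-reference, here is a small sketch (the class name `BackReferenceDemo`, the method `hasBalancedGroup`, and the sample strings are made up for the example). It shows that `\\1` must repeat exactly the digits captured by `(\\d+)`, so a `<group1>` opener is never paired with a `</group2>` closer.

```java
import java.util.regex.Pattern;

class BackReferenceDemo {

    // Group pattern with a back-reference to the captured number and a reluctant body.
    static final Pattern GROUP =
            Pattern.compile("<group(\\d+)>.*?</group\\1>", Pattern.DOTALL);

    /** True when the text contains a group block whose closing number equals its opening number. */
    static boolean hasBalancedGroup(String text) {
        return GROUP.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(hasBalancedGroup("<group1><title>ok</group1>"));  // prints true
        System.out.println(hasBalancedGroup("<group1><title>bad</group2>")); // prints false
    }
}
```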
 
LVL 47

Expert Comment

by:for_yan
ID: 37060519
This almost works (there is an error at the very bottom, but the tags correspond to
the values up to that point):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LegacyText {

    public static void main(String[] args) {

             String s = "<group1>" +
                     "<title>This is the first of the documents related to the site" +
                     "<desc>This documents describes the land and surroundings" +
                     "of the location" +
                     "<date>2009-10-21" +
                     "<rfn>1212-YUI" +
                     "<location>Minas Gerais" +
                     "</group1>" +
                     "" +
                     "<group2>" +
                     "<title>This is the second of the documents related to the beach site" +
                     "<desc>The beach house as it sits and its architectural characteristics" +
                     "according to the author." +
                     "<date>2010-01-09" +
                     "<rfn>1214-YAT" +
                     "<location>Bahia" +
                     "</group2>";

        String [] ss = s.split("<[^>]*>");

      //  for(String s1: ss){
        //    System.out.println(s1);
    //    }

        Pattern p = Pattern.compile("<[^>]*>");
        Matcher m = p.matcher(s);
        int j=0;
        while(m.find()){
            System.out.println(s.substring(m.start(), m.end()));
           // if(j== ss.length)break;
          System.out.println(ss[j+1]);
            j++;



        }




    }


}




Output:
<group1>

<title>
This is the first of the documents related to the site
<desc>
This documents describes the land and surroundingsof the location
<date>
2009-10-21
<rfn>
1212-YUI
<location>
Minas Gerais
</group1>

<group2>

<title>
This is the second of the documents related to the beach site
<desc>
The beach house as it sits and its architectural characteristicsaccording to the author.
<date>
2010-01-09
<rfn>
1214-YAT
<location>
Bahia
</group2>
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 14
	at LegacyText.main(LegacyText.java:38)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:110)
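
The exception at the end is raised because the last tag, `</group2>`, has no text after it: `split` drops the trailing empty string, so `ss[j+1]` runs one past the end of the array. One way to avoid keeping two sequences in step is to read each value straight from the matcher positions. This is only a sketch of that alternative, not the code used later in this thread; the class name `TagWalker` and the method `walk` are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagWalker {

    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    /** Returns "tag=value" pairs; a value is the text between a tag and the next tag (or end of input). */
    static List<String> walk(String s) {
        List<String> pairs = new ArrayList<String>();
        Matcher m = TAG.matcher(s);
        String tag = null;   // tag whose value is still open
        int valueStart = 0;  // index just past that tag
        while (m.find()) {
            if (tag != null) {
                pairs.add(tag + "=" + s.substring(valueStart, m.start()));
            }
            tag = m.group();
            valueStart = m.end();
        }
        if (tag != null) {
            // The last tag's value runs to the end of the input and may be empty.
            pairs.add(tag + "=" + s.substring(valueStart));
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String pair : walk("<group1><title>First<date>2009-10-21</group1>")) {
            System.out.println(pair);
        }
    }
}
```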

LVL 47

Expert Comment

by:for_yan
ID: 37060548
This will pack them into a HashMap:

import java.util.HashMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LegacyText {

    public static void main(String[] args) {

        HashMap<String,String> map = new HashMap<String, String>();

             String s = "<group1>" +
                     "<title>This is the first of the documents related to the site" +
                     "<desc>This documents describes the land and surroundings" +
                     "of the location" +
                     "<date>2009-10-21" +
                     "<rfn>1212-YUI" +
                     "<location>Minas Gerais" +
                     "</group1>" +
                     "" +
                     "<group2>" +
                     "<title>This is the second of the documents related to the beach site" +
                     "<desc>The beach house as it sits and its architectural characteristics" +
                     "according to the author." +
                     "<date>2010-01-09" +
                     "<rfn>1214-YAT" +
                     "<location>Bahia" +
                     "</group2>";

        // Text pieces between tags; ss[j + 1] is the text that follows the j-th tag.
        String[] ss = s.split("<[^>]*>");

        Pattern p = Pattern.compile("<[^>]*>");
        Matcher m = p.matcher(s);
        int j = 0;
        String group = null;
        while (m.find()) {
            String s1 = s.substring(m.start(), m.end());
            // An opening group tag only sets the current group name.
            if (s1.startsWith("<group")) {
                group = s1;
                j++;
                continue;
            }
            // The last tag has no text after it, so stop before reading past the array.
            if (j == ss.length - 1) break;
            map.put(group + " " + s1, ss[j + 1]);
            j++;
        }

        System.out.println(map);
    }
}




{<group2> <date>=2010-01-09, <group2> <rfn>=1214-YAT, <group1> <date>=2009-10-21, <group2> <location>=Bahia, <group1> </group1>=, <group2> <desc>=The beach house as it sits and its architectural characteristicsaccording to the author., <group1> <desc>=This documents describes the land and surroundingsof the location, <group2> <title>=This is the second of the documents related to the beach site, <group1> <rfn>=1212-YUI, <group1> <title>=This is the first of the documents related to the site, <group1> <location>=Minas Gerais}

 
LVL 47

Expert Comment

by:for_yan
ID: 37060570
This would be even better,
skipping the closing group tags:

import java.util.HashMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LegacyText {

    public static void main(String[] args) {

        HashMap<String,String> map = new HashMap<String, String>();

             String s = "<group1>" +
                     "<title>This is the first of the documents related to the site" +
                     "<desc>This documents describes the land and surroundings" +
                     "of the location" +
                     "<date>2009-10-21" +
                     "<rfn>1212-YUI" +
                     "<location>Minas Gerais" +
                     "</group1>" +
                     "" +
                     "<group2>" +
                     "<title>This is the second of the documents related to the beach site" +
                     "<desc>The beach house as it sits and its architectural characteristics" +
                     "according to the author." +
                     "<date>2010-01-09" +
                     "<rfn>1214-YAT" +
                     "<location>Bahia" +
                     "</group2>";

        // Text pieces between tags; ss[j + 1] is the text that follows the j-th tag.
        String[] ss = s.split("<[^>]*>");

        Pattern p = Pattern.compile("<[^>]*>");
        Matcher m = p.matcher(s);
        int j = 0;
        String group = null;
        while (m.find()) {
            String s1 = s.substring(m.start(), m.end());
            // An opening group tag only sets the current group name.
            if (s1.startsWith("<group")) {
                group = s1;
                j++;
                continue;
            }
            // Closing group tags carry no value, so skip them.
            if (s1.startsWith("</group")) {
                j++;
                continue;
            }
            if (j == ss.length - 1) break;
            map.put(group + "+" + s1, ss[j + 1]);
            j++;
        }

        System.out.println(map);
    }
}



Output:
{<group1>+<rfn>=1212-YUI, <group1>+<title>=This is the first of the documents related to the site, <group2>+<title>=This is the second of the documents related to the beach site, <group2>+<date>=2010-01-09, <group2>+<location>=Bahia, <group1>+<location>=Minas Gerais, <group2>+<desc>=The beach house as it sits and its architectural characteristicsaccording to the author., <group2>+<rfn>=1214-YAT, <group1>+<date>=2009-10-21, <group1>+<desc>=This documents describes the land and surroundingsof the location}

 
LVL 47

Expert Comment

by:for_yan
ID: 37060588
This has a better printout:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LegacyText {

    public static void main(String[] args) {

        HashMap<String,String> map = new HashMap<String, String>();
        ArrayList<String> ar = new ArrayList<String>();

             String s = "<group1>" +
                     "<title>This is the first of the documents related to the site" +
                     "<desc>This documents describes the land and surroundings\n" +
                     "of the location" +
                     "<date>2009-10-21" +
                     "<rfn>1212-YUI" +
                     "<location>Minas Gerais" +
                     "</group1>" +
                     "" +
                     "<group2>" +
                     "<title>This is the second of the documents related to the beach site" +
                     "<desc>The beach house as it sits and its architectural characteristics\n" +
                     "according to the author." +
                     "<date>2010-01-09" +
                     "<rfn>1214-YAT" +
                     "<location>Bahia" +
                     "</group2>";

        // Text pieces between tags; ss[j + 1] is the text that follows the j-th tag.
        String[] ss = s.split("<[^>]*>");

        Pattern p = Pattern.compile("<[^>]*>");
        Matcher m = p.matcher(s);
        int j = 0;
        String group = null;
        while (m.find()) {
            String s1 = s.substring(m.start(), m.end());
            // An opening group tag only sets the current group name.
            if (s1.startsWith("<group")) {
                group = s1;
                j++;
                continue;
            }
            // Closing group tags carry no value, so skip them.
            if (s1.startsWith("</group")) {
                j++;
                continue;
            }
            if (j == ss.length - 1) break;
            map.put(group + "+" + s1, ss[j + 1]);
            ar.add(group + "+" + s1); // remember insertion order for printing
            j++;
        }

        // Print in document order rather than HashMap order.
        for (String key : ar) {
            System.out.println(key + " : " + map.get(key));
        }
    }
}



Output:

<group1>+<title> : This is the first of the documents related to the site
<group1>+<desc> : This documents describes the land and surroundings
of the location
<group1>+<date> : 2009-10-21
<group1>+<rfn> : 1212-YUI
<group1>+<location> : Minas Gerais
<group2>+<title> : This is the second of the documents related to the beach site
<group2>+<desc> : The beach house as it sits and its architectural characteristics
according to the author.
<group2>+<date> : 2010-01-09
<group2>+<rfn> : 1214-YAT
<group2>+<location> : Bahia

 
LVL 1

Author Comment

by:CarlosScheidecker
ID: 37073456
Almost there, for_yan. I now need to put them into a record structure. The HashMap will overwrite them once I have more than one Group1 structure.

 
LVL 47

Expert Comment

by:for_yan
ID: 37073464
No, it will not. Look at my latest code: it uses groupN + key as the ultimate key.
 
LVL 1

Author Comment

by:CarlosScheidecker
ID: 37073475
Basically I need to get all Group1s, Group2s, and Group3s and represent them as objects (POJOs) with the fields.

The code above builds the map, but if there is more than one Group1, the fields get mixed up because you cannot tell which Group1 structure each one belongs to.
 
LVL 47

Expert Comment

by:for_yan
ID: 37073481
You mean there could be more than one "Group1" in one file?
Because at this point Group1 is fully separated from Group2, from Group3, etc.
Yes, if there is more than one Group1 in the same file, the earlier values will be overwritten.
 
LVL 1

Author Comment

by:CarlosScheidecker
ID: 37073484
Exactly, for_yan. I need to capture a list of the different structures without them being overwritten. That is, all Group1s and their respective fields, Group2s, Group3s, etc.

Say you have a Group object that contains a map. Each Group object will have its map populated with the fields and their values. But a single general Map would overwrite the many different Group1s.
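
One possible shape for this, sketched under assumptions: a hypothetical `GroupRecord` class holds the structure name and its own field map, and a parser returns a `List` of them, so a second `<group1>` block becomes a second object rather than overwriting the first. The class, field, and method names are illustrative, and the sketch assumes field values never contain `<`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupRecord {

    // A tag (opening or closing) followed by the text up to the next '<'.
    private static final Pattern TAG = Pattern.compile("<(/?)(\\w+)>([^<]*)");

    final String type; // e.g. "group1"
    final Map<String, String> fields = new LinkedHashMap<String, String>();

    GroupRecord(String type) {
        this.type = type;
    }

    /** Builds one GroupRecord per <groupN>...</groupN> block, in document order. */
    static List<GroupRecord> parseAll(String text) {
        List<GroupRecord> records = new ArrayList<GroupRecord>();
        GroupRecord current = null;
        Matcher m = TAG.matcher(text);
        while (m.find()) {
            boolean closing = !m.group(1).isEmpty();
            String name = m.group(2);
            if (closing) {
                current = null;                  // block finished
            } else if (name.startsWith("group")) {
                current = new GroupRecord(name); // new block gets its own map
                records.add(current);
            } else if (current != null) {
                current.fields.put(name, m.group(3).trim());
            }
        }
        return records;
    }

    public static void main(String[] args) {
        String sample = "<group1>\n<title>First\n</group1>\n<group1>\n<title>Second\n</group1>\n";
        for (GroupRecord r : parseAll(sample)) {
            System.out.println(r.type + " " + r.fields);
        }
    }
}
```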
 
LVL 47

Expert Comment

by:for_yan
ID: 37073493
Actually, it is doable to separate several group1's as well.
You just need to maintain an ArrayList of the group names,

and in this place

if(s1.startsWith("<group")){group = s1;  j++; continue;}

you check whether there was already a group1; if so, you can add group1-2 to the key,
or if there was already a group1-2 you can make it group1-3, etc.

 
LVL 47

Accepted Solution

by:
for_yan earned 2000 total points
ID: 37073535
Like this:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LegacyText {

    public static void main(String[] args) {

        HashMap<String,String> map = new HashMap<String, String>();
        ArrayList<String> ar = new ArrayList<String>();
        HashMap<String,Integer> mapGroups  = new  HashMap<String,Integer>();


             String s = "<group1>" +
                     "<title>This is the first of the documents related to the site" +
                     "<desc>This documents describes the land and surroundings\n" +
                     "of the location" +
                     "<date>2009-10-21" +
                     "<rfn>1212-YUI" +
                     "<location>Minas Gerais" +
                     "</group1>" +
                     "" +
                     "<group2>" +
                     "<title>This is the second of the documents related to the beach site" +
                     "<desc>The beach house as it sits and its architectural characteristics\n" +
                     "according to the author." +
                     "<date>2010-01-09" +
                     "<rfn>1214-YAT" +
                     "<location>Bahia" +
                     "</group2>" +
                     "<group1>" +
                     "<title>This is the first of the documents related to the site" +
                     "<desc>This documents describes the land and surroundings\n" +
                     "of the location" +
                     "<date>2009-10-21" +
                     "<rfn>1212-YUI" +
                     "<location>Minas Gerais" +
                     "</group1>"
                       ;

        // Text pieces between tags; ss[j + 1] is the text that follows the j-th tag.
        String[] ss = s.split("<[^>]*>");

        Pattern p = Pattern.compile("<[^>]*>");
        Matcher m = p.matcher(s);
        int j = 0;
        String group = null;
        while (m.find()) {
            String s1 = s.substring(m.start(), m.end());
            if (s1.startsWith("<group")) {
                // Count how often this group name has been seen so far and
                // append that count, so repeated groups get distinct keys.
                if (mapGroups.get(s1) == null) {
                    group = s1 + "0";
                    mapGroups.put(s1, 0);
                } else {
                    int numm = mapGroups.get(s1);
                    group = s1 + (numm + 1);
                    mapGroups.put(s1, numm + 1);
                }
                j++;
                continue;
            }
            // Closing group tags carry no value, so skip them.
            if (s1.startsWith("</group")) {
                j++;
                continue;
            }
            if (j == ss.length - 1) break;
            map.put(group + "+" + s1, ss[j + 1]);
            ar.add(group + "+" + s1); // remember insertion order for printing
            j++;
        }

        // Print in document order rather than HashMap order.
        for (String key : ar) {
            System.out.println(key + " : " + map.get(key));
        }
    }
}



<group1>0+<title> : This is the first of the documents related to the site
<group1>0+<desc> : This documents describes the land and surroundings
of the location
<group1>0+<date> : 2009-10-21
<group1>0+<rfn> : 1212-YUI
<group1>0+<location> : Minas Gerais
<group2>0+<title> : This is the second of the documents related to the beach site
<group2>0+<desc> : The beach house as it sits and its architectural characteristics
according to the author.
<group2>0+<date> : 2010-01-09
<group2>0+<rfn> : 1214-YAT
<group2>0+<location> : Bahia
<group1>1+<title> : This is the first of the documents related to the site
<group1>1+<desc> : This documents describes the land and surroundings
of the location
<group1>1+<date> : 2009-10-21
<group1>1+<rfn> : 1212-YUI
<group1>1+<location> : Minas Gerais

 
LVL 1

Author Comment

by:CarlosScheidecker
ID: 37073585
For_yan, aside from GroupN I also have 3 other structures which do not start with the word Group.

Hence, I think the regex should first capture everything inside <GroupN>*</GroupN> and create an object with those fields. Then it should capture the other structures, such as <Arquivo>*</Arquivo> and <Update>*</update>.

That is, I would store the name of the structure in a field called Type: Group1, Group2, Group3 and so on, as well as Arquivo, Update, and Notific, which are the names of the other structures.

Basically the goal is to capture all the structures and then parse their fields.
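
A sketch of that plan, assuming every structure is closed by a tag with the same name as its opener (compared case-insensitively, since the examples above mix `<Update>` and `</update>`) and that fields never have closing tags. The class `StructureScanner`, the method `scan`, and the map key `type` are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StructureScanner {

    // <Name> ... </Name>; \1 must repeat the opening name, ignoring case.
    private static final Pattern STRUCTURE = Pattern.compile(
            "<(\\w+)>(.*?)</\\1>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // A field tag followed by everything up to the next '<'.
    private static final Pattern FIELD = Pattern.compile("<(\\w+)>([^<]*)");

    /** Returns one map per structure; the key "type" holds the structure name as written in the opener. */
    static List<Map<String, String>> scan(String text) {
        List<Map<String, String>> result = new ArrayList<Map<String, String>>();
        Matcher s = STRUCTURE.matcher(text);
        while (s.find()) {
            Map<String, String> entry = new LinkedHashMap<String, String>();
            entry.put("type", s.group(1)); // would clash with a real field named "type"
            Matcher f = FIELD.matcher(s.group(2));
            while (f.find()) {
                entry.put(f.group(1), f.group(2).trim());
            }
            result.add(entry);
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "<group1>\n<title>First\n</group1>\n"
                + "<Arquivo>\n<rfn>1212-YUI\n</Arquivo>\n"
                + "<Update>\n<date>2011-10-31\n</update>\n";
        for (Map<String, String> entry : scan(sample)) {
            System.out.println(entry);
        }
    }
}
```

Because fields have no closing tag, only real structures can satisfy the `</\\1>` part, so no list of structure names is needed here.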
 
LVL 47

Expert Comment

by:for_yan
ID: 37073613
Or here,

instead of
if(s1.startsWith("<group") )

you can try

 if(s1.startsWith("<group") || s1.startsWith("<Arquivo") || s1.startsWith("<Update")  )

and maybe it would be possible to pack everything into one map this way, provided
you have a finite number of these structures.
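
If the list of structure names is finite, the growing chain of `startsWith` tests could be replaced by a lookup in a set. A minimal sketch, assuming the names mentioned in this thread (Arquivo, Update, Notific) plus any groupN; the class `StructureNames` and its methods are made up for illustration. It also covers the closing tags, which need the same treatment as `</group`.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class StructureNames {

    // Non-group structure names, lower-cased for case-insensitive comparison.
    private static final Set<String> NAMES =
            new HashSet<String>(Arrays.asList("arquivo", "update", "notific"));

    /** Strips '<', '/', '>' from a tag such as "<Update>" or "</update>" and lower-cases it. */
    private static String bareName(String tag) {
        return tag.replaceAll("[</>]", "").toLowerCase(Locale.ROOT);
    }

    /** True for opening or closing tags of groupN or one of the known structures. */
    static boolean isStructureTag(String tag) {
        String name = bareName(tag);
        return name.matches("group\\d+") || NAMES.contains(name);
    }

    static boolean isOpening(String tag) {
        return isStructureTag(tag) && !tag.startsWith("</");
    }

    static boolean isClosing(String tag) {
        return isStructureTag(tag) && tag.startsWith("</");
    }

    public static void main(String[] args) {
        System.out.println(isOpening("<Arquivo>"));  // prints true
        System.out.println(isClosing("</update>"));  // prints true
        System.out.println(isOpening("<title>"));    // prints false
    }
}
```

In the loop above, `isOpening(s1)` would stand in for the `startsWith("<group")` test and `isClosing(s1)` for the `startsWith("</group")` test.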
