Regex help from legacy funky structure

Hello,

I have the following structure from a legacy system. It is NOT XML. Each structure has a start tag and an end tag, and the tag name varies: Group1, Group2, Group3, etc.

Inside each structure there is a finite set of properties, none of which has a closing tag. For instance, <DESC> can span many lines but has NO matching </DESC>. The contents of DESC, like those of the other fields, end where the next field starts.

Hence my question: what is the best way to parse this using regex, given that each structure starts with a different name?

A sample file is attached.

Thanks.
<group1>
<title>This is the first of the documents related to the site
<desc>This documents describes the land and surroundings
of the location
<date>2009-10-21
<rfn>1212-YUI
<location>Minas Gerais
</group1>

<group2>
<title>This is the second of the documents related to the beach site
<desc>The beach house as it sits and its architectural characteristics
according to the author.
<date>2010-01-09
<rfn>1214-YAT
<location>Bahia
</group2>

Asked by CarlosScheidecker.

for_yan commented:
Like this (repeated groups are kept apart by numbering each occurrence):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LegacyText {

    public static void main(String[] args) {

        // Tag-qualified key -> field value.
        HashMap<String, String> map = new HashMap<String, String>();
        // Keys in the order they were found, so the printout follows the input.
        ArrayList<String> ar = new ArrayList<String>();
        // Opening group tag -> index of its latest occurrence (0 for the first one).
        HashMap<String, Integer> mapGroups = new HashMap<String, Integer>();

        String s = "<group1>" +
                "<title>This is the first of the documents related to the site" +
                "<desc>This documents describes the land and surroundings\n" +
                "of the location" +
                "<date>2009-10-21" +
                "<rfn>1212-YUI" +
                "<location>Minas Gerais" +
                "</group1>" +
                "" +
                "<group2>" +
                "<title>This is the second of the documents related to the beach site" +
                "<desc>The beach house as it sits and its architectural characteristics\n" +
                "according to the author." +
                "<date>2010-01-09" +
                "<rfn>1214-YAT" +
                "<location>Bahia" +
                "</group2>" +
                "<group1>" +
                "<title>This is the first of the documents related to the site" +
                "<desc>This documents describes the land and surroundings\n" +
                "of the location" +
                "<date>2009-10-21" +
                "<rfn>1212-YUI" +
                "<location>Minas Gerais" +
                "</group1>";

        // Text between the tags. ss[0] is the empty string before the first tag,
        // so the text that follows the j-th tag is ss[j + 1].
        String[] ss = s.split("<[^>]*>");

        Pattern p = Pattern.compile("<[^>]*>");
        Matcher m = p.matcher(s);
        int j = 0;
        String group = null;
        while (m.find()) {
            String s1 = m.group();
            if (s1.startsWith("<group")) {
                // Number repeated groups so that a second <group1> gets its own keys.
                if (mapGroups.get(s1) == null) {
                    group = s1 + "0";
                    mapGroups.put(s1, 0);
                } else {
                    int numm = mapGroups.get(s1);
                    group = s1 + (numm + 1);
                    mapGroups.put(s1, numm + 1);
                }
                j++;
                continue;
            }
            if (s1.startsWith("</group")) {
                // Closing group tags carry no value.
                j++;
                continue;
            }
            if (j == ss.length - 1) break;
            map.put(group + "+" + s1, ss[j + 1]);
            ar.add(group + "+" + s1);
            j++;
        }

        for (String key : ar) {
            System.out.println(key + " : " + map.get(key));
        }
    }
}

Output:

<group1>0+<title> : This is the first of the documents related to the site
<group1>0+<desc> : This documents describes the land and surroundings
of the location
<group1>0+<date> : 2009-10-21
<group1>0+<rfn> : 1212-YUI
<group1>0+<location> : Minas Gerais
<group2>0+<title> : This is the second of the documents related to the beach site
<group2>0+<desc> : The beach house as it sits and its architectural characteristics
according to the author.
<group2>0+<date> : 2010-01-09
<group2>0+<rfn> : 1214-YAT
<group2>0+<location> : Bahia
<group1>1+<title> : This is the first of the documents related to the site
<group1>1+<desc> : This documents describes the land and surroundings
of the location
<group1>1+<date> : 2009-10-21
<group1>1+<rfn> : 1212-YUI
<group1>1+<location> : Minas Gerais

Terry Woods (IT Guru) commented:
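I'd do it in two steps to improve maintainability:

1. Use this pattern to extract the groups (the reluctant .*? stops each match at the first matching closing tag, so two group1 blocks in the same file are not merged into one match):
Pattern re = Pattern.compile("<group(\\d+)>.*?</group\\1>",Pattern.DOTALL);

2. Extract the fields from each group. I haven't provided a pattern yet, as I don't know whether you'll want to use this approach.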

Terry Woods (IT Guru) commented:
Note that the number of the group is captured in a sub-pattern, and is back-referenced with \\1 to ensure we get the correct closing tag.
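
A minimal sketch of how the two steps could fit together. This is not Terry Woods' own step 2, which the thread never shows. It assumes field values never contain a '<' character; the class name TwoStepParse, the FIELD pattern, and the "group" map key are illustrative choices, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoStepParse {

    // Step 1: one whole group block. \1 makes the closing tag repeat the captured
    // number, and the reluctant (.*?) stops at the first such closing tag.
    private static final Pattern GROUP =
            Pattern.compile("<group(\\d+)>(.*?)</group\\1>", Pattern.DOTALL);

    // Step 2 (assumed): a field tag, then its value up to the next '<'.
    private static final Pattern FIELD = Pattern.compile("<(\\w+)>([^<]*)");

    static List<Map<String, String>> parse(String text) {
        List<Map<String, String>> records = new ArrayList<>();
        Matcher g = GROUP.matcher(text);
        while (g.find()) {
            // One map per block, so a second group1 never touches the first.
            Map<String, String> fields = new LinkedHashMap<>();
            fields.put("group", g.group(1)); // hypothetical key holding the group number
            Matcher f = FIELD.matcher(g.group(2));
            while (f.find()) {
                fields.put(f.group(1), f.group(2).trim());
            }
            records.add(fields);
        }
        return records;
    }

    public static void main(String[] args) {
        String sample =
                "<group1>\n<title>First document\n<desc>Line one\nline two\n</group1>\n" +
                "<group1>\n<title>Another group1\n</group1>\n";
        for (Map<String, String> record : parse(sample)) {
            System.out.println(record);
        }
    }
}
```

Because every block gets its own map, the order of the blocks in the file is preserved and repeated group numbers need no special handling.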

 
for_yan commented:
This almost works: it throws an exception at the very end, but up to that point
each tag is paired with its value:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LegacyText {

    public static void main(String[] args) {

             String s = "<group1>" +
                     "<title>This is the first of the documents related to the site" +
                     "<desc>This documents describes the land and surroundings" +
                     "of the location" +
                     "<date>2009-10-21" +
                     "<rfn>1212-YUI" +
                     "<location>Minas Gerais" +
                     "</group1>" +
                     "" +
                     "<group2>" +
                     "<title>This is the second of the documents related to the beach site" +
                     "<desc>The beach house as it sits and its architectural characteristics" +
                     "according to the author." +
                     "<date>2010-01-09" +
                     "<rfn>1214-YAT" +
                     "<location>Bahia" +
                     "</group2>";

        String [] ss = s.split("<[^>]*>");

      //  for(String s1: ss){
        //    System.out.println(s1);
    //    }

        Pattern p = Pattern.compile("<[^>]*>");
        Matcher m = p.matcher(s);
        int j=0;
        while(m.find()){
            System.out.println(s.substring(m.start(), m.end()));
           // if(j== ss.length)break;
          System.out.println(ss[j+1]);
            j++;



        }




    }


}


Output:
<group1>

<title>
This is the first of the documents related to the site
<desc>
This documents describes the land and surroundingsof the location
<date>
2009-10-21
<rfn>
1212-YUI
<location>
Minas Gerais
</group1>

<group2>

<title>
This is the second of the documents related to the beach site
<desc>
The beach house as it sits and its architectural characteristicsaccording to the author.
<date>
2010-01-09
<rfn>
1214-YAT
<location>
Bahia
</group2>
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 14
	at LegacyText.main(LegacyText.java:38)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:110)

for_yan commented:
This packs them into a HashMap:

import java.util.HashMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LegacyText {

    public static void main(String[] args) {

        HashMap<String,String> map = new HashMap<String, String>();

             String s = "<group1>" +
                     "<title>This is the first of the documents related to the site" +
                     "<desc>This documents describes the land and surroundings" +
                     "of the location" +
                     "<date>2009-10-21" +
                     "<rfn>1212-YUI" +
                     "<location>Minas Gerais" +
                     "</group1>" +
                     "" +
                     "<group2>" +
                     "<title>This is the second of the documents related to the beach site" +
                     "<desc>The beach house as it sits and its architectural characteristics" +
                     "according to the author." +
                     "<date>2010-01-09" +
                     "<rfn>1214-YAT" +
                     "<location>Bahia" +
                     "</group2>";

        String [] ss = s.split("<[^>]*>");

      //  for(String s1: ss){
        //    System.out.println(s1);
    //    }

        Pattern p = Pattern.compile("<[^>]*>");
        Matcher m = p.matcher(s);
        int j=0;
        String group = null;
        while(m.find()){
            String s1= s.substring(m.start(), m.end());
            if(s1.startsWith("<group")){group = s1;  j++; continue;}
           if(j== ss.length-1)break;
       //   System.out.println(ss[j+1]);
                map.put(group + " " + s1,ss[j+1]);
            j++;



        }

         System.out.println(map);


    }


}

Output:

{<group2> <date>=2010-01-09, <group2> <rfn>=1214-YAT, <group1> <date>=2009-10-21, <group2> <location>=Bahia, <group1> </group1>=, <group2> <desc>=The beach house as it sits and its architectural characteristicsaccording to the author., <group1> <desc>=This documents describes the land and surroundingsof the location, <group2> <title>=This is the second of the documents related to the beach site, <group1> <rfn>=1212-YUI, <group1> <title>=This is the first of the documents related to the site, <group1> <location>=Minas Gerais}

for_yan commented:
This is better still:
it skips the closing group tags:

import java.util.HashMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LegacyText {

    public static void main(String[] args) {

        HashMap<String,String> map = new HashMap<String, String>();

             String s = "<group1>" +
                     "<title>This is the first of the documents related to the site" +
                     "<desc>This documents describes the land and surroundings" +
                     "of the location" +
                     "<date>2009-10-21" +
                     "<rfn>1212-YUI" +
                     "<location>Minas Gerais" +
                     "</group1>" +
                     "" +
                     "<group2>" +
                     "<title>This is the second of the documents related to the beach site" +
                     "<desc>The beach house as it sits and its architectural characteristics" +
                     "according to the author." +
                     "<date>2010-01-09" +
                     "<rfn>1214-YAT" +
                     "<location>Bahia" +
                     "</group2>";

        String [] ss = s.split("<[^>]*>");

      //  for(String s1: ss){
        //    System.out.println(s1);
    //    }

        Pattern p = Pattern.compile("<[^>]*>");
        Matcher m = p.matcher(s);
        int j=0;
        String group = null;
        while(m.find()){
            String s1= s.substring(m.start(), m.end());
            if(s1.startsWith("<group")){group = s1;  j++; continue;}
             if(s1.startsWith("</group")){  j++; continue;}
           if(j== ss.length-1)break;
       //   System.out.println(ss[j+1]);
                map.put(group + "+" + s1,ss[j+1]);
            j++;



        }

         System.out.println(map);


    }


}


Output:
{<group1>+<rfn>=1212-YUI, <group1>+<title>=This is the first of the documents related to the site, <group2>+<title>=This is the second of the documents related to the beach site, <group2>+<date>=2010-01-09, <group2>+<location>=Bahia, <group1>+<location>=Minas Gerais, <group2>+<desc>=The beach house as it sits and its architectural characteristicsaccording to the author., <group2>+<rfn>=1214-YAT, <group1>+<date>=2009-10-21, <group1>+<desc>=This documents describes the land and surroundingsof the location}

for_yan commented:
This one has a better printout, with the fields listed in the order they were found:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LegacyText {

    public static void main(String[] args) {

        HashMap<String,String> map = new HashMap<String, String>();
        ArrayList<String> ar = new ArrayList<String>();

             String s = "<group1>" +
                     "<title>This is the first of the documents related to the site" +
                     "<desc>This documents describes the land and surroundings\n" +
                     "of the location" +
                     "<date>2009-10-21" +
                     "<rfn>1212-YUI" +
                     "<location>Minas Gerais" +
                     "</group1>" +
                     "" +
                     "<group2>" +
                     "<title>This is the second of the documents related to the beach site" +
                     "<desc>The beach house as it sits and its architectural characteristics\n" +
                     "according to the author." +
                     "<date>2010-01-09" +
                     "<rfn>1214-YAT" +
                     "<location>Bahia" +
                     "</group2>";

        String [] ss = s.split("<[^>]*>");

      //  for(String s1: ss){
        //    System.out.println(s1);
    //    }

        Pattern p = Pattern.compile("<[^>]*>");
        Matcher m = p.matcher(s);
        int j=0;
        String group = null;
        while(m.find()){
            String s1= s.substring(m.start(), m.end());
            if(s1.startsWith("<group")){group = s1;  j++; continue;}
             if(s1.startsWith("</group")){  j++; continue;}
           if(j== ss.length-1)break;
       //   System.out.println(ss[j+1]);
                map.put(group + "+" + s1,ss[j+1]);
                ar.add(group + "+" + s1);
            j++;



        }

        for (String key: ar){

            System.out.println(key + " : " + map.get(key));
        }

         


    }


}


Output:

<group1>+<title> : This is the first of the documents related to the site
<group1>+<desc> : This documents describes the land and surroundings
of the location
<group1>+<date> : 2009-10-21
<group1>+<rfn> : 1212-YUI
<group1>+<location> : Minas Gerais
<group2>+<title> : This is the second of the documents related to the beach site
<group2>+<desc> : The beach house as it sits and its architectural characteristics
according to the author.
<group2>+<date> : 2010-01-09
<group2>+<rfn> : 1214-YAT
<group2>+<location> : Bahia

CarlosScheidecker (author) commented:
Almost there, for_yan. I now need to put them into a record structure: the HashMap will overwrite entries once I have more than one Group1 structure.

for_yan commented:
No, it will not. Look at my latest code: it uses groupN + key as the ultimate key.

CarlosScheidecker (author) commented:
Basically I need to get all Group1s, Group2s, and Group3s and represent them as objects (POJOs) with the fields.

The code above builds the map, but if there is more than one Group1, the fields get mixed up because you cannot tell which Group1 structure each one belongs to.

for_yan commented:
You mean there could be more than one "Group1" in one file?
At this point Group1 is fully separated from Group2, from Group3, etc.
But yes, if there is more than one Group1 in the same file, the earlier ones will be overwritten.

CarlosScheidecker (author) commented:
Exactly, for_yan. I need to capture a list of the different structures without them being overwritten. That is, all Group1s and their respective fields, Group2s, Group3s, etc.

Say you have a Group object that contains a map. Each Group object will have its map populated with the fields and their values. But a single general map would overwrite the many different Group1s.
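
A rough sketch of the "Group object that contains a map" idea, built on the same tag-by-tag scan that for_yan's code uses. The class names LegacyRecords and LegacyRecord are hypothetical, and the sketch assumes that values never contain '<' and that structures are not nested.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LegacyRecords {

    /** One parsed structure: its type (e.g. "group1") plus its own field map. */
    static class LegacyRecord {
        final String type;
        final Map<String, String> fields = new LinkedHashMap<>();

        LegacyRecord(String type) {
            this.type = type;
        }
    }

    // Any tag; group 1 is "/" for a closing tag, group 2 is the tag name.
    private static final Pattern TAG = Pattern.compile("<(/?)([^>]+)>");

    static List<LegacyRecord> parse(String text) {
        List<LegacyRecord> records = new ArrayList<>();
        LegacyRecord current = null; // record being filled, null between records
        String field = null;         // field whose value is still open
        int valueStart = 0;          // index where that value begins
        Matcher m = TAG.matcher(text);
        while (m.find()) {
            // Any tag ends the value of the field that was open before it.
            if (current != null && field != null) {
                current.fields.put(field, text.substring(valueStart, m.start()).trim());
                field = null;
            }
            boolean closing = !m.group(1).isEmpty();
            String name = m.group(2);
            if (closing) {
                if (current != null && name.equalsIgnoreCase(current.type)) {
                    records.add(current); // every occurrence becomes its own object
                    current = null;
                }
            } else if (current == null) {
                if (name.toLowerCase().startsWith("group")) {
                    current = new LegacyRecord(name);
                }
            } else {
                field = name;
                valueStart = m.end();
            }
        }
        return records;
    }

    public static void main(String[] args) {
        String sample = "<group1><title>First<date>2009-10-21</group1>"
                + "<group1><title>Second<date>2010-01-09</group1>";
        for (LegacyRecord r : parse(sample)) {
            System.out.println(r.type + " " + r.fields);
        }
    }
}
```

Since each LegacyRecord owns its map, two Group1 structures end up as two separate list entries instead of sharing keys in one general map.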

for_yan commented:
Actually, it is doable to separate several group1's as well.
You just need to maintain an ArrayList of the group names seen so far,

and in this place

if(s1.startsWith("<group")){group = s1;  j++; continue;}

you check whether there was already a group1; if so, you can add group1-2 to the key,
or if there was already a group1-2 you can make it group1-3, etc.

CarlosScheidecker (author) commented:
for_yan, aside from GroupN I also have 3 other structures which do not start with the word Group.

Hence, I think the regex should first capture everything inside <GroupN>*</GroupN> and create an object with those fields. Then it should capture the other structures, such as <Arquivo>*</Arquivo> and <Update>*</update>.

That is, I would store the name of the structure in a field called Type: Group1, Group2, Group3 and so on, as well as Arquivo, Update, and Notific, which are the names of the other structures.

Basically the goal is to capture all the structures and then parse their fields.
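
for_yan commented:
Or, here,

instead of
if(s1.startsWith("<group") )

you can try

 if(s1.startsWith("<group") || s1.startsWith("<Arquivo") || s1.startsWith("<Update")  )

(s1 includes the opening angle bracket, and startsWith is case-sensitive, so the prefixes must match the tags exactly as they appear in the file; the closing-tag check needs the same extension).
Maybe it would be possible to pack everything into one map this way, provided
you have a finite number of these structure names.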
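
One way the "finite number of structure names" idea could look as a single regex, sketched under assumptions: the names in NAMES are the ones mentioned in this thread, but their exact spelling in the real file is a guess, field values are assumed never to contain '<', and the class name AnyStructureParse and the "Type" map key are illustrative.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnyStructureParse {

    // Structure names taken from the thread; adjust to the real file (assumption).
    private static final String NAMES = "group\\d+|arquivo|update|notific";

    // Opening tag, reluctant body, and a closing tag that repeats the captured name (\1).
    // CASE_INSENSITIVE lets <Update> pair with </update>.
    private static final Pattern STRUCTURE = Pattern.compile(
            "<(" + NAMES + ")>(.*?)</\\1>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // A field tag and its value, which runs up to the next '<'.
    private static final Pattern FIELD = Pattern.compile("<(\\w+)>([^<]*)");

    static List<Map<String, String>> parse(String text) {
        List<Map<String, String>> records = new ArrayList<>();
        Matcher s = STRUCTURE.matcher(text);
        while (s.find()) {
            Map<String, String> record = new LinkedHashMap<>();
            // Structure name as written in the opening tag; a real field
            // named "Type" would overwrite it.
            record.put("Type", s.group(1));
            Matcher f = FIELD.matcher(s.group(2));
            while (f.find()) {
                record.put(f.group(1), f.group(2).trim());
            }
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        String sample =
                "<group1>\n<title>First document\n</group1>\n" +
                "<Arquivo>\n<title>Some file\n<date>2010-01-09\n</Arquivo>\n" +
                "<Update>\n<rfn>1214-YAT\n</update>\n";
        for (Map<String, String> record : parse(sample)) {
            System.out.println(record);
        }
    }
}
```

Keeping the names in one alternation means a new structure type only requires extending NAMES, rather than adding another startsWith check for both the opening and the closing tag.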

Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.