Regex help from legacy funky structure

Hello,

I have the following structure from a legacy system. It is NOT XML. Each structure has a start tag and an end tag, and the tag name varies: Group1, Group2, Group3, etc.

Inside each of those structures there is a finite set of properties, none of which has a closing tag. For instance, <DESC> can span many lines but has NO matching </DESC>. The contents of DESC, like those of the other fields, end where the next field starts.

Hence, my question is: What is the best way to parse it using Regex since each structure starts with a different name?

A sample file is attached.

Thanks.

<group1>
<title>This is the first of the documents related to the site
<desc>This documents describes the land and surroundings
of the location
<date>2009-10-21
<rfn>1212-YUI
<location>Minas Gerais
</group1>

<group2>
<title>This is the second of the documents related to the beach site
<desc>The beach house as it sits and its architectural characteristics
according to the author.
<date>2010-01-09
<rfn>1214-YAT
<location>Bahia
</group2>

Terry Woods

I'd do it in 2 steps to increase the maintainability:

1. Use this pattern to extract the groups:
Pattern re = Pattern.compile("<group(\\d+)>.*?</group\\1>", Pattern.DOTALL);

2. Extract the fields from each group. I haven't provided a pattern yet, as I don't know if you'll want to use it or not.

Terry Woods

Note that the number of the group is captured in a sub-pattern and back-referenced with \\1, so each opening tag is paired with the closing tag that carries the same number.
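
A minimal sketch of the two-step idea, assuming the whole file has been read into one string. The field pattern in step 2 and the names TwoStepParse and parse are my own assumptions, not something posted in this thread. The sketch uses a reluctant .*? so that a match stops at the first closing tag with the same number, and it assumes field values never contain a '<' character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TwoStepParse {

    // Step 1: one match per <groupN>...</groupN> block; \1 ties the closing tag to the opening number.
    static final Pattern GROUP = Pattern.compile("<group(\\d+)>(.*?)</group\\1>", Pattern.DOTALL);

    // Step 2 (assumed pattern): a field is <name> followed by everything up to the next '<'.
    static final Pattern FIELD = Pattern.compile("<(\\w+)>([^<]*)");

    // Returns entries such as "group1.title=First" in document order.
    static List<String> parse(String text) {
        List<String> out = new ArrayList<String>();
        Matcher g = GROUP.matcher(text);
        while (g.find()) {
            // g.group(2) is the block body without the enclosing group tags.
            Matcher f = FIELD.matcher(g.group(2));
            while (f.find()) {
                out.add("group" + g.group(1) + "." + f.group(1) + "=" + f.group(2).trim());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String sample = "<group1>\n<title>First\n<desc>Line one\nline two\n<date>2009-10-21\n</group1>\n"
                + "<group2>\n<title>Second\n</group2>\n";
        for (String line : parse(sample)) {
            System.out.println(line);
        }
    }
}
```

Keeping the two patterns separate means the field rule can change without touching the block rule, which is the maintainability point made above.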

for_yan

This almost works: the tags line up with their values until the very end,
where it throws an exception (see the bottom of the output):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LegacyText {

    public static void main(String[] args) {

             String s = "<group1>" +
                     "<title>This is the first of the documents related to the site" +
                     "<desc>This documents describes the land and surroundings" +
                     "of the location" +
                     "<date>2009-10-21" +
                     "<rfn>1212-YUI" +
                     "<location>Minas Gerais" +
                     "</group1>" +
                     "" +
                     "<group2>" +
                     "<title>This is the second of the documents related to the beach site" +
                     "<desc>The beach house as it sits and its architectural characteristics" +
                     "according to the author." +
                     "<date>2010-01-09" +
                     "<rfn>1214-YAT" +
                     "<location>Bahia" +
                     "</group2>";

        String [] ss = s.split("<[^>]*>");

      //  for(String s1: ss){
        //    System.out.println(s1);
    //    }

        Pattern p = Pattern.compile("<[^>]*>");
        Matcher m = p.matcher(s);
        int j=0;
        while(m.find()){
            System.out.println(s.substring(m.start(), m.end()));
           // if(j== ss.length)break;
          System.out.println(ss[j+1]);
            j++;



        }




    }


}

Output:

<group1>

<title>
This is the first of the documents related to the site
<desc>
This documents describes the land and surroundingsof the location
<date>
2009-10-21
<rfn>
1212-YUI
<location>
Minas Gerais
</group1>

<group2>

<title>
This is the second of the documents related to the beach site
<desc>
The beach house as it sits and its architectural characteristicsaccording to the author.
<date>
2010-01-09
<rfn>
1214-YAT
<location>
Bahia
</group2>
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 14
	at LegacyText.main(LegacyText.java:38)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:110)

for_yan

This will pack them into HashMap:

import java.util.HashMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LegacyText {

    public static void main(String[] args) {

        HashMap<String,String> map = new HashMap<String, String>();

             String s = "<group1>" +
                     "<title>This is the first of the documents related to the site" +
                     "<desc>This documents describes the land and surroundings" +
                     "of the location" +
                     "<date>2009-10-21" +
                     "<rfn>1212-YUI" +
                     "<location>Minas Gerais" +
                     "</group1>" +
                     "" +
                     "<group2>" +
                     "<title>This is the second of the documents related to the beach site" +
                     "<desc>The beach house as it sits and its architectural characteristics" +
                     "according to the author." +
                     "<date>2010-01-09" +
                     "<rfn>1214-YAT" +
                     "<location>Bahia" +
                     "</group2>";

        String [] ss = s.split("<[^>]*>");

      //  for(String s1: ss){
        //    System.out.println(s1);
    //    }

        Pattern p = Pattern.compile("<[^>]*>");
        Matcher m = p.matcher(s);
        int j=0;
        String group = null;
        while(m.find()){
            String s1= s.substring(m.start(), m.end());
            if(s1.startsWith("<group")){group = s1;  j++; continue;}
           if(j== ss.length-1)break;
       //   System.out.println(ss[j+1]);
                map.put(group + " " + s1,ss[j+1]);
            j++;



        }

         System.out.println(map);


    }


}

Output:

{<group2> <date>=2010-01-09, <group2> <rfn>=1214-YAT, <group1> <date>=2009-10-21, <group2> <location>=Bahia, <group1> </group1>=, <group2> <desc>=The beach house as it sits and its architectural characteristicsaccording to the author., <group1> <desc>=This documents describes the land and surroundingsof the location, <group2> <title>=This is the second of the documents related to the beach site, <group1> <rfn>=1212-YUI, <group1> <title>=This is the first of the documents related to the site, <group1> <location>=Minas Gerais}

for_yan

This would be even better,
skipping the closing group tags:

import java.util.HashMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LegacyText {

    public static void main(String[] args) {

        HashMap<String,String> map = new HashMap<String, String>();

             String s = "<group1>" +
                     "<title>This is the first of the documents related to the site" +
                     "<desc>This documents describes the land and surroundings" +
                     "of the location" +
                     "<date>2009-10-21" +
                     "<rfn>1212-YUI" +
                     "<location>Minas Gerais" +
                     "</group1>" +
                     "" +
                     "<group2>" +
                     "<title>This is the second of the documents related to the beach site" +
                     "<desc>The beach house as it sits and its architectural characteristics" +
                     "according to the author." +
                     "<date>2010-01-09" +
                     "<rfn>1214-YAT" +
                     "<location>Bahia" +
                     "</group2>";

        String [] ss = s.split("<[^>]*>");

      //  for(String s1: ss){
        //    System.out.println(s1);
    //    }

        Pattern p = Pattern.compile("<[^>]*>");
        Matcher m = p.matcher(s);
        int j=0;
        String group = null;
        while(m.find()){
            String s1= s.substring(m.start(), m.end());
            if(s1.startsWith("<group")){group = s1;  j++; continue;}
             if(s1.startsWith("</group")){  j++; continue;}
           if(j== ss.length-1)break;
       //   System.out.println(ss[j+1]);
                map.put(group + "+" + s1,ss[j+1]);
            j++;



        }

         System.out.println(map);


    }


}

Output:

{<group1>+<rfn>=1212-YUI, <group1>+<title>=This is the first of the documents related to the site, <group2>+<title>=This is the second of the documents related to the beach site, <group2>+<date>=2010-01-09, <group2>+<location>=Bahia, <group1>+<location>=Minas Gerais, <group2>+<desc>=The beach house as it sits and its architectural characteristicsaccording to the author., <group2>+<rfn>=1214-YAT, <group1>+<date>=2009-10-21, <group1>+<desc>=This documents describes the land and surroundingsof the location}

for_yan

This has better printout:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LegacyText {

    public static void main(String[] args) {

        HashMap<String,String> map = new HashMap<String, String>();
        ArrayList<String> ar = new ArrayList<String>();

             String s = "<group1>" +
                     "<title>This is the first of the documents related to the site" +
                     "<desc>This documents describes the land and surroundings\n" +
                     "of the location" +
                     "<date>2009-10-21" +
                     "<rfn>1212-YUI" +
                     "<location>Minas Gerais" +
                     "</group1>" +
                     "" +
                     "<group2>" +
                     "<title>This is the second of the documents related to the beach site" +
                     "<desc>The beach house as it sits and its architectural characteristics\n" +
                     "according to the author." +
                     "<date>2010-01-09" +
                     "<rfn>1214-YAT" +
                     "<location>Bahia" +
                     "</group2>";

        // Text between tags; ss[0] is the empty string before the first tag.
        String[] ss = s.split("<[^>]*>");

        Pattern p = Pattern.compile("<[^>]*>");
        Matcher m = p.matcher(s);
        int j = 0;
        String group = null;
        while (m.find()) {
            String s1 = m.group();
            // Opening group tag: remember it and skip its (empty) value.
            if (s1.startsWith("<group")) { group = s1; j++; continue; }
            // Closing group tag: nothing to store.
            if (s1.startsWith("</group")) { j++; continue; }
            if (j == ss.length - 1) break;
            // The value for the j-th tag is the (j+1)-th split piece.
            map.put(group + "+" + s1, ss[j + 1]);
            ar.add(group + "+" + s1);   // remember insertion order for printing
            j++;
        }

        // Print in document order rather than HashMap order.
        for (String key : ar) {
            System.out.println(key + " : " + map.get(key));
        }

         


    }


}

Output:

<group1>+<title> : This is the first of the documents related to the site
<group1>+<desc> : This documents describes the land and surroundings
of the location
<group1>+<date> : 2009-10-21
<group1>+<rfn> : 1212-YUI
<group1>+<location> : Minas Gerais
<group2>+<title> : This is the second of the documents related to the beach site
<group2>+<desc> : The beach house as it sits and its architectural characteristics
according to the author.
<group2>+<date> : 2010-01-09
<group2>+<rfn> : 1214-YAT
<group2>+<location> : Bahia

CarlosScheidecker

ASKER

Almost there, for_yan. I now need to put them into a record structure. The HashMap will overwrite entries once I have more than one Group1 structure.

for_yan

No, it will not - look at my latest code - it uses groupN + key as the ultimate key

CarlosScheidecker

ASKER

Basically I need to get all Group1s, Group2s, and Group3s and represent them as objects (pojos) with the fields.

The code above builds the map, but if there is more than one Group1 the fields get mixed up, because you cannot tell which Group1 structure a given field belongs to.

for_yan

You mean there could be more than one "Group1" in one file?
Because at this point Group1 is fully separated from Group2, from Group3, etc.
Yes, if there is more than one Group1 in the same file, they will be overwritten.

CarlosScheidecker

ASKER

Exactly, for_yan. I need to capture a list of the different structures without them being overwritten. That is, all Group1s and their respective fields, Group2s, Group3s, etc.

Say you have a Group object that contains a map. Each Group object will have its map populated with the fields and their values. But a general Map would overwrite the many different Group1s.
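
One way this could look, as a rough sketch only: every opening group tag starts a fresh object with its own map, so two Group1 blocks never share keys. The names GroupRecords, GroupRecord and parse are hypothetical, and the sketch assumes lowercase tag names as in the sample and field values that contain no '<'.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupRecords {

    // Hypothetical record type: one instance per <groupN>...</groupN> block.
    static class GroupRecord {
        final String type; // e.g. "group1"
        final Map<String, String> fields = new LinkedHashMap<String, String>();

        GroupRecord(String type) {
            this.type = type;
        }
    }

    // A tag (opening or closing) followed by the text that runs up to the next '<'.
    static final Pattern TAG = Pattern.compile("<(/?)(\\w+)>([^<]*)");

    static List<GroupRecord> parse(String text) {
        List<GroupRecord> records = new ArrayList<GroupRecord>();
        GroupRecord current = null;
        Matcher m = TAG.matcher(text);
        while (m.find()) {
            boolean closing = !m.group(1).isEmpty();
            String name = m.group(2);
            if (closing) {
                current = null; // block finished
            } else if (name.startsWith("group")) {
                current = new GroupRecord(name); // each opening tag gets its own record
                records.add(current);
            } else if (current != null) {
                current.fields.put(name, m.group(3).trim());
            }
        }
        return records;
    }

    public static void main(String[] args) {
        String sample = "<group1>\n<title>A\n<rfn>1212-YUI\n</group1>\n"
                + "<group1>\n<title>B\n<rfn>1214-YAT\n</group1>\n";
        for (GroupRecord r : parse(sample)) {
            System.out.println(r.type + " " + r.fields);
        }
        // prints:
        // group1 {title=A, rfn=1212-YUI}
        // group1 {title=B, rfn=1214-YAT}
    }
}
```

The list keeps the blocks in file order, and LinkedHashMap keeps the fields in the order they appear inside each block.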

for_yan

Actually, it is doable to separate several group1's as well.
You just need to maintain an ArrayList of group names

and in this place

if(s1.startsWith("<group")){group = s1; j++; continue;}

you check whether there was already a group1; if so, you can use group1-2 in the key,
or if there was already a group1-2 you can make it group1-3, etc.
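
The renaming could be sketched roughly as below. Note that this uses a map of counters instead of an ArrayList of names, which is my substitution; GroupNamer and uniqueName are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

public class GroupNamer {

    // How many times each group name has been seen so far.
    private final Map<String, Integer> seen = new HashMap<String, Integer>();

    // First occurrence keeps its name; later ones become name-2, name-3, ...
    String uniqueName(String group) {
        Integer n = seen.get(group);
        int count = (n == null) ? 1 : n + 1;
        seen.put(group, count);
        return (count == 1) ? group : group + "-" + count;
    }

    public static void main(String[] args) {
        GroupNamer namer = new GroupNamer();
        System.out.println(namer.uniqueName("group1")); // group1
        System.out.println(namer.uniqueName("group2")); // group2
        System.out.println(namer.uniqueName("group1")); // group1-2
        System.out.println(namer.uniqueName("group1")); // group1-3
    }
}
```

In the loop, group = namer.uniqueName(s1) would then take the place of group = s1. Since s1 still carries the angle brackets there, the keys would come out as <group1>-2 and so on unless the brackets are stripped first.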

ASKER CERTIFIED SOLUTION

for_yan

CarlosScheidecker

ASKER

for_yan, aside from GroupN I also have 3 other structures which do not start with the word Group.

Hence, I think the regex should first capture everything inside <GroupN>*</GroupN> and create an object with those fields. Then it should capture the other structures, such as <Arquivo>*</Arquivo> and <Update>*</update>.

Meaning, I would store the name of the structure in a field called Type. So Group1, Group2, Group3 and so on, as well as Arquivo, Update and Notific, which are the names of the other structures.

Basically the goal would be, capture all the structures and then parse their fields.
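
A possible sketch of the "capture all the structures first" step, assuming each structure is closed by a tag with the same name and that case may differ between the opening and closing tag (as in <Update>*</update>). StructureScan and structures are hypothetical names, and the fields in the sample string are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StructureScan {

    // Any <Name>...</Name> pair; CASE_INSENSITIVE lets <Update> pair with </update>.
    static final Pattern STRUCTURE = Pattern.compile(
            "<(\\w+)>(.*?)</\\1>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns {type, body} pairs in document order; the body still needs field parsing.
    static List<String[]> structures(String text) {
        List<String[]> out = new ArrayList<String[]>();
        Matcher m = STRUCTURE.matcher(text);
        while (m.find()) {
            out.add(new String[] { m.group(1), m.group(2) });
        }
        return out;
    }

    public static void main(String[] args) {
        String sample = "<Group1>\n<title>A\n</Group1>\n"
                + "<Arquivo>\n<title>B\n</Arquivo>\n"
                + "<Update>\n<date>2010-01-09\n</update>\n";
        for (String[] s : structures(sample)) {
            System.out.println("Type=" + s[0] + " body=" + s[1].trim());
        }
    }
}
```

Because the structure name is captured rather than hard-coded, the same pattern covers GroupN, Arquivo, Update and Notific, and the captured name can go straight into the Type field.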

for_yan

Or in here

instead of
if(s1.startsWith("<group") )

you can try

if(s1.startsWith("<group") || s1.startsWith("<Arquivo") || s1.startsWith("<Update") )

and maybe it would be possible to pack it all into one map this way, in case
you have a finite number of these