CarlosScheidecker
asked on
Regex help from legacy funky structure
Hello,
I have the following structure from a legacy system. It is NOT XML. Basically, each structure has a start tag and an end tag, and the tag name varies: Group1, Group2, Group3, etc.
Inside each of those structures there is a finite set of properties, none of which has a closing tag. For instance, <DESC> can span many lines but has NO matching closing </DESC>. The contents of DESC, like those of the other fields, end where the next field starts.
Hence my question: what is the best way to parse it using regex, given that each structure starts with a different name?
Attached is a sample file.
Thanks.
<group1>
<title>This is the first of the documents related to the site
<desc>This documents describes the land and surroundings
of the location
<date>2009-10-21
<rfn>1212-YUI
<location>Minas Gerais
</group1>
<group2>
<title>This is the second of the documents related to the beach site
<desc>The beach house as it sits and its architectural characteristics
according to the author.
<date>2010-01-09
<rfn>1214-YAT
<location>Bahia
</group2>
Note that the number of the group is captured in a sub-pattern, and is back-referenced with \\1 to ensure we get the correct closing tag.
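For illustration, here is a minimal sketch of that back-reference idea. It is not the pattern from the original answer, which is not shown in full in this thread; the class name GroupBlocks, the method extract, and the exact pattern are assumptions. The opening tag captures the digits, and \\1 in the closing tag requires the same digits, so <group1> can only be closed by </group1>.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupBlocks {
    // Group 1 is the number after "group"; \1 in the closing tag must repeat it.
    // DOTALL lets '.' cross line breaks; the reluctant .*? stops at the first matching close.
    static final Pattern BLOCK =
            Pattern.compile("<group(\\d+)>(.*?)</group\\1>", Pattern.DOTALL);

    // Returns one {number, body} pair per <groupN>...</groupN> block, in file order.
    static List<String[]> extract(String text) {
        List<String[]> blocks = new ArrayList<>();
        Matcher m = BLOCK.matcher(text);
        while (m.find()) {
            blocks.add(new String[] { m.group(1), m.group(2) });
        }
        return blocks;
    }

    public static void main(String[] args) {
        // Made-up input in the same shape as the sample file.
        String text = "<group1>\n<title>First\n<date>2009-10-21\n</group1>\n"
                    + "<group2>\n<title>Second\n</group2>\n"
                    + "<group1>\n<title>Third\n</group1>\n";
        for (String[] b : extract(text)) {
            System.out.println("group" + b[0] + " -> " + b[1].trim().replace("\n", " | "));
        }
    }
}
```

Because each match is a separate block, two <group1> blocks in the same file come back as two list entries rather than colliding.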
This almost works (there is an error at the very bottom, but the tags correspond to the values up to that point):
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class LegacyText {
    public static void main(String[] args) {
        String s = "<group1>" +
                "<title>This is the first of the documents related to the site" +
                "<desc>This documents describes the land and surroundings" +
                "of the location" +
                "<date>2009-10-21" +
                "<rfn>1212-YUI" +
                "<location>Minas Gerais" +
                "</group1>" +
                "" +
                "<group2>" +
                "<title>This is the second of the documents related to the beach site" +
                "<desc>The beach house as it sits and its architectural characteristics" +
                "according to the author." +
                "<date>2010-01-09" +
                "<rfn>1214-YAT" +
                "<location>Bahia" +
                "</group2>";
        String[] ss = s.split("<[^>]*>");
        // for(String s1: ss){
        //     System.out.println(s1);
        // }
        Pattern p = Pattern.compile("<[^>]*>");
        Matcher m = p.matcher(s);
        int j = 0;
        while (m.find()) {
            System.out.println(s.substring(m.start(), m.end()));
            // if(j== ss.length)break;
            System.out.println(ss[j+1]);
            j++;
        }
    }
}
Output:
<group1>
<title>
This is the first of the documents related to the site
<desc>
This documents describes the land and surroundingsof the location
<date>
2009-10-21
<rfn>
1212-YUI
<location>
Minas Gerais
</group1>
<group2>
<title>
This is the second of the documents related to the beach site
<desc>
The beach house as it sits and its architectural characteristicsaccording to the author.
<date>
2010-01-09
<rfn>
1214-YAT
<location>
Bahia
</group2>
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 14
at LegacyText.main(LegacyText.java:38)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:110)
This will pack them into a HashMap:
import java.util.HashMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class LegacyText {
    public static void main(String[] args) {
        HashMap<String, String> map = new HashMap<String, String>();
        String s = "<group1>" +
                "<title>This is the first of the documents related to the site" +
                "<desc>This documents describes the land and surroundings" +
                "of the location" +
                "<date>2009-10-21" +
                "<rfn>1212-YUI" +
                "<location>Minas Gerais" +
                "</group1>" +
                "" +
                "<group2>" +
                "<title>This is the second of the documents related to the beach site" +
                "<desc>The beach house as it sits and its architectural characteristics" +
                "according to the author." +
                "<date>2010-01-09" +
                "<rfn>1214-YAT" +
                "<location>Bahia" +
                "</group2>";
        String[] ss = s.split("<[^>]*>");
        // for(String s1: ss){
        //     System.out.println(s1);
        // }
        Pattern p = Pattern.compile("<[^>]*>");
        Matcher m = p.matcher(s);
        int j = 0;
        String group = null;
        while (m.find()) {
            String s1 = s.substring(m.start(), m.end());
            if (s1.startsWith("<group")) { group = s1; j++; continue; }
            if (j == ss.length-1) break;
            // System.out.println(ss[j+1]);
            map.put(group + " " + s1, ss[j+1]);
            j++;
        }
        System.out.println(map);
    }
}
Output:
{<group2> <date>=2010-01-09, <group2> <rfn>=1214-YAT, <group1> <date>=2009-10-21, <group2> <location>=Bahia, <group1> </group1>=, <group2> <desc>=The beach house as it sits and its architectural characteristicsaccording to the author., <group1> <desc>=This documents describes the land and surroundingsof the location, <group2> <title>=This is the second of the documents related to the beach site, <group1> <rfn>=1212-YUI, <group1> <title>=This is the first of the documents related to the site, <group1> <location>=Minas Gerais}
This would be even better,
skipping the closing group tags:
import java.util.HashMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class LegacyText {
    public static void main(String[] args) {
        HashMap<String, String> map = new HashMap<String, String>();
        String s = "<group1>" +
                "<title>This is the first of the documents related to the site" +
                "<desc>This documents describes the land and surroundings" +
                "of the location" +
                "<date>2009-10-21" +
                "<rfn>1212-YUI" +
                "<location>Minas Gerais" +
                "</group1>" +
                "" +
                "<group2>" +
                "<title>This is the second of the documents related to the beach site" +
                "<desc>The beach house as it sits and its architectural characteristics" +
                "according to the author." +
                "<date>2010-01-09" +
                "<rfn>1214-YAT" +
                "<location>Bahia" +
                "</group2>";
        String[] ss = s.split("<[^>]*>");
        // for(String s1: ss){
        //     System.out.println(s1);
        // }
        Pattern p = Pattern.compile("<[^>]*>");
        Matcher m = p.matcher(s);
        int j = 0;
        String group = null;
        while (m.find()) {
            String s1 = s.substring(m.start(), m.end());
            if (s1.startsWith("<group")) { group = s1; j++; continue; }
            if (s1.startsWith("</group")) { j++; continue; }
            if (j == ss.length-1) break;
            // System.out.println(ss[j+1]);
            map.put(group + "+" + s1, ss[j+1]);
            j++;
        }
        System.out.println(map);
    }
}
Output:
{<group1>+<rfn>=1212-YUI, <group1>+<title>=This is the first of the documents related to the site, <group2>+<title>=This is the second of the documents related to the beach site, <group2>+<date>=2010-01-09, <group2>+<location>=Bahia, <group1>+<location>=Minas Gerais, <group2>+<desc>=The beach house as it sits and its architectural characteristicsaccording to the author., <group2>+<rfn>=1214-YAT, <group1>+<date>=2009-10-21, <group1>+<desc>=This documents describes the land and surroundingsof the location}
This has a better printout:
import java.util.ArrayList;
import java.util.HashMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class LegacyText {
    public static void main(String[] args) {
        HashMap<String, String> map = new HashMap<String, String>();
        ArrayList<String> ar = new ArrayList<String>();
        String s = "<group1>" +
                "<title>This is the first of the documents related to the site" +
                "<desc>This documents describes the land and surroundings\n" +
                "of the location" +
                "<date>2009-10-21" +
                "<rfn>1212-YUI" +
                "<location>Minas Gerais" +
                "</group1>" +
                "" +
                "<group2>" +
                "<title>This is the second of the documents related to the beach site" +
                "<desc>The beach house as it sits and its architectural characteristics\n" +
                "according to the author." +
                "<date>2010-01-09" +
                "<rfn>1214-YAT" +
                "<location>Bahia" +
                "</group2>";
        String[] ss = s.split("<[^>]*>");
        // for(String s1: ss){
        //     System.out.println(s1);
        // }
        Pattern p = Pattern.compile("<[^>]*>");
        Matcher m = p.matcher(s);
        int j = 0;
        String group = null;
        while (m.find()) {
            String s1 = s.substring(m.start(), m.end());
            if (s1.startsWith("<group")) { group = s1; j++; continue; }
            if (s1.startsWith("</group")) { j++; continue; }
            if (j == ss.length-1) break;
            // System.out.println(ss[j+1]);
            map.put(group + "+" + s1, ss[j+1]);
            ar.add(group + "+" + s1);
            j++;
        }
        for (String key : ar) {
            System.out.println(key + " : " + map.get(key));
        }
    }
}
Output:
<group1>+<title> : This is the first of the documents related to the site
<group1>+<desc> : This documents describes the land and surroundings
of the location
<group1>+<date> : 2009-10-21
<group1>+<rfn> : 1212-YUI
<group1>+<location> : Minas Gerais
<group2>+<title> : This is the second of the documents related to the beach site
<group2>+<desc> : The beach house as it sits and its architectural characteristics
according to the author.
<group2>+<date> : 2010-01-09
<group2>+<rfn> : 1214-YAT
<group2>+<location> : Bahia
ASKER
Almost there, for_yan. I now need to put them into a record structure. The HashMap will overwrite them once I have more than one Group1 structure.
No, it will not - look at my latest code - it uses groupN + key as the ultimate key
ASKER
Basically I need to get all Group1s, Group2s, and Group3s and represent them as objects (POJOs) with the fields.
The code above builds the map, but if there is more than one Group1, the fields get mixed up because you cannot tell which Group1 structure each one belongs to.
You mean there could be more than one "Group1" in one file?
Because at this point Group1 is fully separated from Group2, from Group3, etc.
Yes, if there is more than one Group1 in the same file - they will be overwritten
ASKER
Exactly, for_yan. I need to capture a list of the different structures without them being overwritten. That is, all Group1s and their respective fields, Group2s, Group3s, etc.
Say you have a Group object that contains a map. Each Group object will have its map populated with the fields and their values. But a single general Map would overwrite the many different Group1s.
Actually it is doable to separate several group1's also.
You just need to maintain an ArrayList of group names
and in this place
if(s1.startsWith("<group") ){group = s1; j++; continue;}
you check whether group1 has already occurred; if so, you can use group1-2 in the key,
or if group1-2 already exists you can make it group1-3, etc.
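A rough sketch of that renaming idea, under a few assumptions: it counts occurrences in a HashMap rather than scanning an ArrayList, it takes each value as the text between one tag and the next instead of indexing into the split array, and the names UniqueGroupKeys and parse are made up for the example.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UniqueGroupKeys {
    static Map<String, String> parse(String s) {
        Map<String, String> map = new LinkedHashMap<>();  // keeps insertion order for printing
        Map<String, Integer> seen = new HashMap<>();      // how often each opening group tag occurred
        Matcher m = Pattern.compile("<[^>]*>").matcher(s);
        String group = null;   // key prefix of the group currently open
        String field = null;   // field tag whose value is still being read
        int valueStart = 0;
        while (m.find()) {
            // The text between the previous field tag and this tag is that field's value.
            if (field != null) {
                map.put(group + "+" + field, s.substring(valueStart, m.start()).trim());
                field = null;
            }
            String tag = m.group();
            if (tag.startsWith("</group")) {
                group = null;
            } else if (tag.startsWith("<group")) {
                int n = seen.merge(tag, 1, Integer::sum);
                // The first occurrence keeps the plain tag; later ones become <group1-2>, <group1-3>, ...
                group = (n == 1) ? tag : tag.substring(0, tag.length() - 1) + "-" + n + ">";
            } else if (group != null) {
                field = tag;
                valueStart = m.end();
            }
        }
        return map;
    }

    public static void main(String[] args) {
        // Made-up input with group1 appearing twice.
        String s = "<group1>\n<title>First\n</group1>\n"
                 + "<group2>\n<title>Other\n</group2>\n"
                 + "<group1>\n<title>Second\n</group1>\n";
        for (Map.Entry<String, String> e : parse(s).entrySet()) {
            System.out.println(e.getKey() + " : " + e.getValue());
        }
    }
}
```

This keeps everything in one flat map. If separate objects per structure are wanted, the same loop could start a new map each time an opening group tag is seen.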
ASKER
For_yan, aside from GroupN I also have 3 other structures which do not start with the word Group.
Hence, I think the regex should first capture everything inside <GroupN>*</GroupN> and create an object with those fields. Then it should capture the other structures, such as <Arquivo>*</Arquivo> and <Update>*</Update>.
Meaning, the name of the structure would be stored in a field called Type: Group1, Group2, Group3 and so on, as well as Arquivo, Update, and Notific, which are the names of the other structures.
Basically, the goal is to capture all the structures and then parse their fields.
Or in here,
instead of
if(s1.startsWith("<group") )
you can try
if(s1.startsWith("<group") || s1.startsWith("<Arquivo") || s1.startsWith("<Update") )
and maybe it would be possible to pack it all into one map this way, in case
you have a finite number of these
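One possible way to handle a fixed set of structure names, sketched under assumptions: the names group<N>, Arquivo, Update and Notific are taken from the thread, tag matching is assumed to be case-insensitive (the sample file uses lowercase tags), and the Structure class with its type field is a hypothetical POJO, not anything from the legacy system. Each occurrence becomes its own object, so repeated structures do not overwrite one another.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LegacyRecords {
    // Hypothetical POJO: one instance per structure occurrence.
    static final class Structure {
        final String type;                                        // e.g. "group1", "arquivo"
        final Map<String, String> fields = new LinkedHashMap<>(); // field name -> value
        Structure(String type) { this.type = type; }
    }

    // Assumed finite list of structure names; \1 ties the closing tag to the opening one.
    static final Pattern STRUCTURE = Pattern.compile(
            "<(group\\d+|arquivo|update|notific)>(.*?)</\\1>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // A field is <name> followed by everything up to the next '<'.
    static final Pattern FIELD = Pattern.compile("<(\\w+)>([^<]*)");

    static List<Structure> parse(String text) {
        List<Structure> result = new ArrayList<>();
        Matcher s = STRUCTURE.matcher(text);
        while (s.find()) {
            Structure st = new Structure(s.group(1).toLowerCase(Locale.ROOT));
            Matcher f = FIELD.matcher(s.group(2));
            while (f.find()) {
                st.fields.put(f.group(1).toLowerCase(Locale.ROOT), f.group(2).trim());
            }
            result.add(st);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up input: group1 twice plus one of the non-group structures.
        String text = "<group1>\n<title>First\n<desc>two\nlines\n</group1>\n"
                    + "<Update>\n<date>2010-01-09\n</update>\n"
                    + "<group1>\n<title>Second\n</group1>\n";
        for (Structure st : parse(text)) {
            System.out.println(st.type + " " + st.fields);
        }
    }
}
```

A field value that itself contains a '<' character would be cut short by the FIELD pattern, so this only fits data where '<' appears in tags alone.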
1. Use this pattern to extract the groups:
Pattern re = Pattern.compile("<group(\\
2. Extract the fields from each group. I haven't provided a pattern yet, as I don't know if you'll want to use it or not.